In spite of continuous improvement of computational power of multi/many-core processors, the memory access performance of the processors has not been improved sufficiently, and thus the overall performance of recent processors is often restricted by the delay of off-chip memory accesses. Low-delay data compression for last level cache (LLC) would be effective to improve the processor performance because the compression increases the effective size of LLC, and thus reduces the number of off-chip memory accesses. This paper proposes a novel data compression method suitable for high-speed parallel decoding in the LLC. Since cache line data often have periodicity of certain lengths, such as 32- or 64-bit instructions, 32-bit integers, and 64-bit floating point numbers, an information word is encoded as a base pattern and a differential pattern between the original word and the base pattern. Evaluation using a GPU simulator shows that the compression ratio of the proposed coding is comparable to LZSS coding and X-Match Pro and superior to other conventional compression algorithms for cache memories. Also this paper presents an experimental decoder designed for ASIC, and the synthesized result shows that the decoder can decompress cache line data of length 32bytes in four clock cycles. Evaluation of the IPC on the GPU simulator shows that, for several benchmark programs, the IPC achieved by the proposed coding is higher than that by the conventional BΔI coding, where the maximum improvement of the IPC is 20%.
Lei GUO Yuhua TANG Yong DOU Yuanwu LEI Meng MA Jie ZHOU
The effective bandwidth of the dynamic random-access memory (DRAM) for the alternate row-wise/column-wise matrix access (AR/CMA) mode, which is a basic characteristic in scientific and engineering applications, is very low. Therefore, we propose the window memory layout scheme (WMLS), which is a matrix layout scheme that does not require transposition, for AR/CMA applications. This scheme maps one row of a logical matrix into a rectangular memory window of the DRAM to balance the bandwidth of the row- and column-wise matrix access and to increase the DRAM IO bandwidth. The optimal window configuration is theoretically analyzed to minimize the total number of no-data-visit operations of the DRAM. Different WMLS implementationsare presented according to the memory structure of field-programmable gata array (FPGA), CPU, and GPU platforms. Experimental results show that the proposed WMLS can significantly improve DRAM bandwidth for AR/CMA applications. achieved speedup factors of 1.6× and 2.0× are achieved for the general-purpose CPU and GPU platforms, respectively. For the FPGA platform, the WMLS DRAM controller is custom. The maximum bandwidth for the AR/CMA mode reaches 5.94 GB/s, which is a 73.6% improvement compared with that of the traditional row-wise access mode. Finally, we apply WMLS scheme for Chirp Scaling SAR application, comparing with the traditional access approach, the maximum speedup factors of 4.73X, 1.33X and 1.56X can be achieved for FPGA, CPU and GPU platform, respectively.
The main contribution of this paper is to show optimal parallel algorithms to compute the sum, the prefix-sums, and the summed area table on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. These models have three parameters, the number p of threads, and the width w of the memory, and the memory access latency l. We first show that the sum of n numbers can be computed in $O({nover w}+{nlover p}+llog n)$ time units on the DMM and the UMM. We then go on to show that $Omega({nover w}+{nlover p}+llog n)$ time units are necessary to compute the sum. We also present a parallel algorithm that computes the prefix-sums of n numbers in $O({nover w}+{nlover p}+llog n)$ time units on the DMM and the UMM. Finally, we show that the summed area table of size $sqrt{n} imessqrt{n}$ can be computed in $O({nover w}+{nlover p}+llog n)$ time units on the DMM and the UMM. Since the computation of the prefix-sums and the summed area table is at least as hard as the sum computation, these parallel algorithms are also optimal.
Michael Joseph TAN Yuichi KAJI
Novel flash codes with small average write deficiency are proposed. A flash code is a coding scheme for avoiding the wearing of cells in flash memory. One approach to develop flash codes with large parameters is to make use of slices which are small groups of cells. Preliminary study shows that using small slices brings several favorable characteristics, but naive use of small slices induces a certain overhead. In this study, a new structure which is called a cluster is devised to develop a good slice-based flash code. Two different slice encoding schemes are used in a cluster, which decreases the overhead of using small slices while retaining its advantage. The proposed flash codes show much smaller write deficiency compared to another slice-based flash code.
The objective of this research is to design a high-performance NAND flash memory system with a data buffer. The proposed buffer system in the NAND flash memory consists of two parts, i.e., a fully associative temporal buffer for temporal locality and a fully associative spatial buffer for spatial locality. We propose a new operating mechanism for reducing overhead of flash memory, that is, erase and write operations. According to our simulation results, the proposed buffer system can reduce the write and erase operations by about 73% and 79% for spec application respectively, compared with a fully associative buffer with two times more space. Futhermore, the average memory access time can improve by about 60% compared with other large buffer systems.
Takahiro OTA Hiroyoshi MORITA Adriaan J. de Lind van WIJNGAARDEN
This paper presents a real-time and memory-efficient arrhythmia detection system with binary classification that uses antidictionary coding for the analysis and classification of electrocardiograms (ECGs). The measured ECG signals are encoded using a lossless antidictionary encoder, and the system subsequently uses the compression rate to distinguish between normal beats and arrhythmia. An automated training data procedure is used to construct the automatons, which are probabilistic models used to compress the ECG signals, and to determine the threshold value for detecting the arrhythmia. Real-time computer simulations with samples from the MIT-BIH arrhythmia database show that the averages of sensitivity and specificity of the proposed system are 97.8% and 96.4% for premature ventricular contraction detection, respectively. The automatons are constructed using training data and comprise only 11 kilobytes on average. The low complexity and low memory requirements make the system particularly suitable for implementation in portable ECG monitors.
Akihiko KASAGI Koji NAKANO Yasuaki ITO
The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of the shared memory access of GPUs. Bank conflicts should be avoided for maximizing the bandwidth of the shared memory access. Offline permutation of an array is a task to copy all elements in array a into array b along a permutation given in advance. The main contribution of this paper is to implement a conflict-free permutation algorithm on the DMM in a GPU. We have also implemented straightforward permutation algorithms on the GPU. The experimental results for 1024 double (64-bit) numbers on NVIDIA GeForce GTX-680 show that the straightforward permutation algorithm takes 247.8 ns for the random permutation and 1684ns for the worst permutation that involves the maximum bank conflicts. Our conflict-free permutation algorithm runs in 167ns for any permutation including the random permutation and the worst permutation, although it performs more memory accesses. It follows that our conflict-free permutation is 1.48 times faster for the random permutation and 10.0 times faster for the worst permutation.
Yansheng WANG Leibo LIU Shouyi YIN Min ZHU Peng CAO Jun YANG Shaojun WEI
RCP (Reconfigurable Computing Processor) is intended to fill the gap between ASIC and GPP (General Purpose processor), which achieves much higher energy efficiency than GPP, while is much more flexible than ASIC. In this paper, one organization of on-chip data memory called LIBODM (LIfetime Based On-chip Data Memory) is proposed to reduce the reference delay for data and on-chip data memory size in RCP. In the LIBODM, the allocation of data is based on the data dependency. The data with low data dependency are stored off-chip to save the storage costs, while the data with high data dependency are stored on-chip to reduce the reference delay. Besides, in the LIBODM, the on-chip data are classified into two types, and the classification is based on the lifetime of data. For short lifetime data, they are preferred to be stored into FIFO to increase the reuse ratio of memory space naturally. For long lifetime data, they are preferred to be stored into RAM for several time references. The LIBODM has been testified in one CGRA (Coarse Grained Reconfigurable Architecture) called RPU (Reconfigurable Processing Unit), and two RPUs has been integrated in a RCP-REMUS_HP (High Performance version of Reconfigurable MUlti-media System) focused on video decoding. Thanks to the LIBODM, although the size of on-chip data memory in REMUS_HP is small, a high performance can still be achieved. Compared with XPP and ADRES, in REMUS_HP, the on-chip data memory size at same performance level is only 23.9% and 14.8%. REMUS_HP is implemented on a 48.9mm2 silicon with TSMC 65nm technology. Simulation shows that 1920*1088 @30fps can be achieved for H.264 high-profile decoding when exploiting a 200MHz working frequency. Compared with the high performance version of XPP, the performance is 150% boosted, while the energy efficiency is 17.59x boosted.
NAND Flash memories are widely used as data storages today. The memories are not intrinsically error free because they are affected by several physical disturbances. Technology scaling and introduction of multi-level cell (MLC) has improved data density, but it has made error effect more significant. Error control codes (ECC) are essential to improve reliability of NAND Flash memories. Efficiency of codes depends on error characteristic of systems, and codes are required to be designed to reflect this characteristic. In MLC Flash memories, errors tend to direct values to neighborhood. These errors are a class of M-ary asymmetric symbol error. Some codes which reflect the asymmetric property were proposed. They are designed to correct only 1 level shift errors because almost all of the errors in the memories are in such errors. But technology scaling, increase of program/erase (P/E) cycles, and MLC storing the large number of bits can cause multiple-level shift. This paper proposes single error control codes which can correct an error of more than 1 levels shift. Because the number of levels to be corrected is selectable, we can fit it into noise magnitude. Furthermore, it is possible to add error detecting function for error of the larger shift. Proposed codes are equivalent to a conventional integer codes, which can correct 1 level shift, on a certain parameter. Therefore, the codes are said to be generalization of conventional integer codes. Evaluation results show information lengths to respective check symbol lengths are larger than nonbinary Hamming codes and other M-ary asymmetric symbol error correcting codes.
Duc-Hung LE Katsumi INOUE Cong-Kha PHAM
A CAM-based matching system for fast exact pattern matching is implemented on a hardware system with FPGA and ASIC. The system has a simple structure, and does not employ any Central Processor Unit (CPU) as well as complicated computations. We take advantage of Content Addressable Memory (CAM) which has an ability of parallel multi-match mode for designing the system. The system is applied to fast pattern matching with various required search patterns without using search principles. In this paper, the authors present a CAM-based system for fast exact pattern matching on 2-D data.
Kun-Lin TSAI I-Jui TUNG Feipei LAI
Content addressable memory is widely used for fast lookup table data searching, but it often consumes considerable power. Moreover, designing the suitable content addressable memory architecture for a specific application also consumes lots of time, since the behavioral simulation is often done in the transistor level. SystemC is a system-level modeling language and simulation platform, providing high simulation efficiency for hardware software co-design. Unfortunately, SystemC does not provide the function for estimating power dissipation of a structure design. In this paper, a SystemC-based fast content addressable memory power estimation method is presented for estimating the power dissipation of the match-line circuit, the search-line circuit, and the storage cell array of content addressable memory in the early design stage. The mathematical equations and behavioral patterns are used as the inputs of power estimation model. The simulation results based on 10 Mibench benchmarks show that the simulation time of the proposed method is in average 1233 times faster than that of HSPICE simulator with only 3.51% error rate.
We investigate the utilization of vector registers (VRs) on reducing memory references for single instruction multiple data fast Fourier transform calculation. We propose to group the butterfly computations in several consecutive stages to maximize utilization of the available VRs and take the advantage of the symmetries in twiddle factors. All the butterflies sharing identical twiddle factors are clustered and computed together to further improve performance. The relationship between the number of fused stages and the number of available VRs is then examined. Experimental results on different platforms show that the proposed method is effective.
NAND multi-level cell (MLC) flash memories are widely used due to low cost and high capacity. However, the increased number of levels in MLC results in larger interference and errors. The errors in MLC flash memories tend to be directional and limited-magnitude. Many related works focus on asymmetric errors, but bidirectional errors also occur because of the bidirectional interference and the adjustment of the hard-decision reference voltages. To take advantage of the characteristics, we propose t bidirectional (lu,ld) limited-magnitude error correction codes, which can reduce errors more effectively. The proposed code is systematic, and can correct t bidirectional errors with upward and downward magnitude of lu and ld, respectively. The proposed method is advantageous in that the parity size is reduced, and it has lower bit error rate than conventional error correction codes with the same code rate.
Hisashi IWAMOTO Yuji YANO Yasuto KURODA Koji YAMAMOTO Shingo ATA Kazunari INOUE
Network traffic keeps increasing due to the increasing popularity of video streaming services. Routers and switches in wire-line networks require guaranteed line rates as high as 20 Gbp/s as well as advanced quality of service (QoS). Hybrid SRAM and DRAM architecture previously presented with the benefit of high-speed and high-density, but it requires complex memory management. As a result, it has hardly supported large numbers of queue, which is an effective approach to satisfying the QoS requirements. This paper proposes an intelligent memory management unit (MMU) which is based on the hybrid architecture, where over 16k multi queues are integrated. The performance examined by the system board is zero-packet loss under the seamless traffic with 60–1.5 kByte packet-length (deterministic manner). Noticeable feature in this paper's architecture is eliminating the need for any premium memories but only low-cost commodity SRAMs and DRAMs are used. The intelligent MMU employs the head buffer architecture, which is suitable for supporting a large numbers of FIFO queues. An experimental board based on this architecture is embedded into a Router system to evaluate the performance. Using 16k queues at 20 Gbps, zero-packet loss is examined with 64-Byte to 1,500-Byte packet-length.
Masashi TAWADA Masao YANAGISAWA Nozomu TOGAWA
Recently, multi-core processors are used in embedded systems very often. Since application programs is much limited running on embedded systems, there must exists an optimal cache memory configuration in terms of power and area. Simulating application programs on various cache configurations is one of the best options to determine the optimal one. Multi-core cache configuration simulation, however, is much more complicated and takes much more time than single-core cache configuration simulation. In this paper, we propose a very fast dual-core L1 cache configuration simulation algorithm. We first propose a new data structure where just a single data structure represents two or more multi-core cache configurations with different cache associativities. After that, we propose a new multi-core cache configuration simulation algorithm using our new data structure associated with new theorems. Experimental results demonstrate that our algorithm obtains exact simulation results but runs 20 times faster than a conventional approach.
The performance of a mobile database management system (DBMS) in which most queries are made up of random data accesses if the NAND flash memory is used as storage media of the DBMS is degraded. The reason for this is that the performance of NAND flash memory is good for writing sequentially but poor when writing randomly. Thus, a new storage structure and querying policies are needed in mobile DBMS when flash memory is used as the storage media. In this letter, we propose a new policy of database page management to enhance the frequent random update performance, and then evaluate the performance experimentally.
HyunMin SEUNG Jong-Dae LEE Chang-Hwan KIM Jea-Gun PARK
In summary, we successfully fabricated the nonvolatile hybrid polymer 4F2 memory-cell. It was based on bistable state, which was observed in PS layer that is containing a Ni nanocrystals capped with NiO tunneling barrier sandwiched by Al electrodes. The current conduction mechanism for polymer memory-cell was demonstrated by fitting the I-V curves. The electrons were charged and discharged on Ni nanocrystals by tunneling through the NiO tunneling barrier. In addition, the memory-cell showed a good and reproducible nonvolatile memory-cell characteristic. Its memory margin is about 1.410. The retention-time is more than 105 seconds and the endurance cycles of program-and-erase is more than 250 cycles. Furthermore, Thefore, polymer memory-cell would be good candidates for nonvolatile 4F2 cross-bar memory-cell.
Akio OHTA Katsunori MAKIHARA Mitsuhisa IKEDA Hideki MURAKAMI Seiichiro HIGASHI Seiichi MIYAZAKI
We have investigated the impact of O2 annealing after SiOx deposition on the switching behavior to gain a better understanding of the resistance switching mechanism, especially the role of oxygen deficiency in the SiOx network. Although resistive random access memories (ReRAMs) with SiOx after 300 annealing sandwiched with Pt electrodes showed uni-polar type resistance switching characteristics, the switching behaviors were barely detectable for the samples after annealing at temperatures over 500. Taking into account of the average oxygen content in the SiOx films evaluated by XPS measurements, oxygen vacancies in SiOx play an important role in resistance switching. Also, the results of conductive AFM measurements suggest that the formation and disruption of a conducting filament path are mainly responsible for the resistance switching behavior of SiOx.
Woo Young CHOI Min Su HAN Boram HAN Dongsun SEO Il Hwan CHO
A modified modeling of residue effect on nano-electro-mechanical nonvolatile memory (NEMory) is presented for considering wet etching process. The effect of a residue under the cantilever is investigated for the optimization. The feasibility of the proposed model is investigated by finite element analysis simulations.
Motoki FUKUSIMA Akio OHTA Katsunori MAKIHARA Seiichi MIYAZAKI
We have fabricated Pt/Si-rich oxide (SiOx)/TiN stacked MIM diodes and studied an impact of the structural asymmetry on their resistive switching characteristics. XPS analyses show that a TiON interfacial layer was formed during the SiOx deposition on TiN by RF-sputtering in an Ar + O2 gas mixture. After the fabrication of Pt top electrodes on the SiOx layer, and followed by an electro-forming process, distinct bi-polar type resistive switching was confirmed. For the resistive switching from high to low resistance states so called SET process, there is no need to set the current compliance. Considering higher dielectric constant of TiON than SiOx, the interfacial TiON layer can contribute to regulate the current flow through the diode. The clockwise resistive switching, in which the reduction and oxidation (Red-Ox) reactions can occur near the TiN bottom electrode, shows lower RESET voltages and better switching endurance than the counter-clockwise switching where the Red-Ox reaction can take place near the top Pt electrode. The result implies a good repeatable nature of Red-Ox reactions at the interface between SiOx and TiON/TiN in consideration of relatively high diffusibility of oxygen atoms through Pt.