Ing-Chao LIN Yen-Han LEE Sheng-Wei WANG
Ternary content addressable memory (TCAM), which can store 0, 1, or X in its cells, is widely used to store routing tables in network routers. Negative bias temperature instability (NBTI) and positive bias temperature instability (PBTI), which increase Vth and degrade transistor switching speed, have become major reliability challenges. This study analyzes the signal probability of routing tables. The results show that many cells remain under static stress and therefore suffer significant degradation from NBTI and PBTI effects. An improved bit flipping technique and a proactive power-gating recovery technique are proposed to mitigate NBTI and PBTI effects. To maintain the functionality of TCAM after bit flipping, a novel TCAM cell design is proposed. Simulation results show that, compared to the original architecture, the bit flipping technique improves the read static noise margin (SNM) of data and mask cells by 16.84% and 29.94%, respectively, and reduces search time degradation by 12.95%. The power gating technique improves the read SNM of data and mask cells by 12.31% and 20.92%, respectively, and reduces search time degradation by 17.57%. When both techniques are used, the read SNM of data and mask cells is improved by 17.74% and 30.53%, respectively, and search time degradation is reduced by 21.01%.
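As a rough illustration of the signal-probability analysis described above (not the paper's implementation), the following Python sketch estimates how often each bit position of a routing table stores the same value and flags positions under near-static stress; the table contents and the 0.95 threshold are illustrative assumptions.

```python
# Toy sketch: per-bit signal probability of a routing table and detection of
# bits under near-static stress, which periodic bit flipping would balance.
# Entry values and threshold are assumptions, not data from the paper.

def signal_probabilities(entries, width):
    """Fraction of entries in which each bit position stores '1'."""
    counts = [0] * width
    for e in entries:
        for i in range(width):
            counts[i] += (e >> i) & 1
    return [c / len(entries) for c in counts]

def static_stress_bits(probs, threshold=0.95):
    """Bit positions that are almost always 0 or almost always 1."""
    return [i for i, p in enumerate(probs) if p >= threshold or p <= 1 - threshold]

if __name__ == "__main__":
    table = [0b10110000, 0b10110001, 0b10110010, 0b10110100]  # illustrative entries
    probs = signal_probabilities(table, width=8)
    print("static-stress bit positions:", static_stress_bits(probs))
    # Flipping the stored value (and remembering the flip) lets such bits spend
    # roughly half the time in each state, relaxing NBTI/PBTI stress.
```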
Seon Hwan KIM Ju Hee CHOI Jong Wook KWAK
In this letter, we propose a novel garbage collection technique for index structures on flash memory systems, called Proxy Block-based Garbage Collection (PBGC). Many index structures have been proposed for flash memory systems. They exploit buffers and logs to resolve the update propagation problem, one of the main causes of performance degradation in index structures. However, these studies overlooked the fact that not only record operations but also garbage collection induces the update propagation problem. The proposed technique, PBGC, exploits a proxy block and a block mapping table to solve the update propagation problem that arises from the page and block changes induced by garbage collection. Experiments show that PBGC reduces the execution time of garbage collection by up to 39% compared with previous garbage collection techniques.
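The sketch below is a loose, assumption-laden illustration of the general idea of indirection through a block mapping table, not PBGC's actual algorithm: index nodes keep referencing a stable logical block number, so when garbage collection copies valid pages into a new (proxy) block, only the mapping entry changes and no parent index entries need to be rewritten.

```python
# Toy sketch (assumptions only): a block mapping table absorbs the block change
# caused by garbage collection so that index-node updates do not propagate.

class BlockMappingTable:
    def __init__(self):
        self.logical_to_physical = {}

    def lookup(self, logical_block):
        return self.logical_to_physical[logical_block]

    def redirect(self, logical_block, new_physical_block):
        self.logical_to_physical[logical_block] = new_physical_block

def garbage_collect(table, logical_block, old_block, proxy_block):
    """Copy valid pages into the proxy block, then redirect the mapping."""
    for page in old_block["valid_pages"]:
        proxy_block["valid_pages"].append(page)
    table.redirect(logical_block, proxy_block["id"])
    old_block["valid_pages"].clear()          # old block can now be erased

if __name__ == "__main__":
    bmt = BlockMappingTable()
    old = {"id": 7, "valid_pages": ["k1", "k5"]}
    proxy = {"id": 12, "valid_pages": []}
    bmt.redirect(logical_block=3, new_physical_block=old["id"])
    garbage_collect(bmt, logical_block=3, old_block=old, proxy_block=proxy)
    print(bmt.lookup(3))   # 12: index nodes still reference logical block 3
```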
Dongchul PARK Biplob DEBNATH David H.C. DU
The Flash Translation Layer (FTL) is a firmware layer inside NAND flash memory that allows existing disk-based applications to use it without any significant modifications. Since the FTL has a critical impact on the performance and reliability of flash-based storage, a variety of FTLs have been proposed. The existing FTLs, however, are designed to perform well for either a read intensive workload or a write intensive workload, not for both due to their internal address mapping schemes. To overcome this limitation, we propose a novel hybrid FTL scheme named Convertible Flash Translation Layer (CFTL). CFTL is adaptive to data access patterns with the help of our unique hot data identification design that adopts multiple bloom filters. Thus, CFTL can dynamically switch its mapping scheme to either page-level mapping or block-level mapping to fully exploit the benefits of both schemes. In addition, we design a spatial locality-aware caching mechanism and adaptive cache partitioning to further improve CFTL performance. Consequently, both the adaptive switching scheme and the judicious caching mechanism empower CFTL to achieve good read and write performance. Our extensive evaluations demonstrate that CFTL outperforms existing FTLs. In particular, our specially designed caching mechanism remarkably improves the cache hit ratio, by an average of 2.4×, and achieves much higher hit ratios (up to 8.4×) especially for random read intensive workloads.
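The following Python sketch illustrates the general multi-Bloom-filter hot-data identification technique named above, not CFTL's actual design; the filter count, sizes, hash choices, and hotness threshold are assumptions.

```python
# Minimal sketch of hot-data identification with multiple Bloom filters: recent
# writes are recorded in a rotating set of filters, and a logical page is "hot"
# if enough filters have seen it.  All parameters here are assumptions.
import hashlib

class MultiBloomHotness:
    def __init__(self, num_filters=4, bits=1024, hashes=2):
        self.filters = [0] * num_filters      # each filter is a bit vector (as int)
        self.num_filters, self.bits, self.hashes = num_filters, bits, hashes
        self.current = 0

    def _positions(self, lpn):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{lpn}-{i}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def record_write(self, lpn):
        for pos in self._positions(lpn):
            self.filters[self.current] |= 1 << pos

    def hotness(self, lpn):
        """Number of filters in which all hash positions are set."""
        return sum(all((f >> p) & 1 for p in self._positions(lpn))
                   for f in self.filters)

    def advance_window(self):
        """Periodically reuse (clear) the next filter to age out stale history."""
        self.current = (self.current + 1) % self.num_filters
        self.filters[self.current] = 0

if __name__ == "__main__":
    h = MultiBloomHotness()
    for _ in range(3):
        h.record_write(lpn=42)
        h.advance_window()
    print("hot" if h.hotness(42) >= 2 else "cold")
```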
Nobutaro SHIBATA Yoshinori GOTOH Takako ISHIHARA
Two-port SRAMs are frequently installed in gate-array VLSIs to implement smart functions. This paper presents a new high-density 10T CMOS base cell for gate-array-based two-port SRAM applications. Using the single base cell alone, we can implement a two-port memory cell whose bitline contacts are shared with the memory cell adjacent to one of two dedicated sides, resulting in greatly reduced parasitic capacitance on the bitlines. To shed light on the total performance obtainable from the base cell, a plain two-port SRAM macro was designed and fabricated in a 0.35-µm low-cost logic process. Each of the two 10-bit power-saving address decoders was formed with 36% fewer base cells by employing complex gates and a subdecoder. The new sense amplifier with a complementary sensing scheme had a fine sensitivity of 35 mVpp, so the required read bitline signal was successfully reduced from 250 to 70 mVpp. For the macro with 1024 memory cells per bitline, the address access time under typical conditions of a 2.5-V power supply and 25°C was 4.0 ns (equal to that obtained with a full-custom design), and the power consumption during 200-MHz simultaneous operation of the two ports was 6.7 mW for an I/O data width of 1 bit.
Sungjun KIM Min-Hwi KIM Seongjae CHO Byung-Gook PARK
In this work, the bias-polarity-dependent resistive switching behaviors of a Cu/Si3N4/p+ Si RRAM cell have been closely studied. The switching characteristics in both unipolar and bipolar modes after positive forming are investigated. Bipolar switching did not need a forming process and showed better characteristics, including endurance cycling, uniformity of switching parameters, and on/off resistance ratio. The resistive switching characteristics after positive and negative forming are also compared. It is confirmed that both unipolar and bipolar modes after negative forming exhibit inferior resistive switching performance due to high forming voltage and current.
Hirofumi TAKISHITA Shuhei TANAKAMARU Sheyang NING Ken TAKEUCHI
A hybrid Solid-State Drive (SSD) combining Storage-Class Memory (SCM) and NAND flash offers higher performance and lower power consumption than a NAND-flash-only SSD. In this paper, first, three SSD configurations are investigated. Three different SCMs with 0.1 µs, 1 µs, and 10 µs read/write latencies are used, and the SCM/NAND flash capacity ratios required to maintain the same SSD performance are analyzed. Next, using the three SSD configurations, the variation of SSD reliability, performance, and cost is analyzed by changing the error correction strength. The acceptable SCM and NAND flash Bit Error Rates (BERs) are limited by the need to achieve the specified SSD performance with error correction, and/or by the SCM and NAND flash parity size and SSD cost. Lastly, the SSD replacement cost is analyzed by considering the limited number of NAND flash write/erase cycles. The purpose of this paper is to provide a design guideline for obtaining a high-performance, highly reliable, and cost-effective SCM/NAND hybrid SSD with ECC.
Seon Hwan KIM Ju Hee CHOI Jong Wook KWAK
In this letter, we propose a novel wear leveling technique called Hidden cold block-aware Wear Leveling (HaWL), which uses a bit-set threshold. HaWL prolongs the lifetime of flash memory devices by using a bit array table for wear leveling. The bit array table records the history of block erasures over a period and distinguishes cold blocks from the others. In addition, HaWL can reduce the size of the bit array table by using a one-to-many mode, in which one bit is associated with many blocks. Moreover, to prevent degradation of wear leveling in the one-to-many mode, HaWL uses a bit-set threshold (BST) to increase the accuracy of the cold-block information. Our experiments show that HaWL prolongs the lifetime of flash memory by up to 48% compared with previous wear leveling techniques.
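A minimal Python sketch of the general idea, under stated assumptions rather than HaWL's exact algorithm: one bit tracks erasures for a group of blocks, and the cold-block information is trusted only while the number of set bits stays below a bit-set threshold.

```python
# Toy sketch (assumptions, not HaWL itself): a compact bit array records which
# block groups were erased during a period; the BST guards against stale
# cold-block information in the one-to-many mode.

class BitArrayColdTracker:
    def __init__(self, num_blocks, blocks_per_bit=8, bst=16):
        self.blocks_per_bit = blocks_per_bit
        self.bits = [0] * ((num_blocks + blocks_per_bit - 1) // blocks_per_bit)
        self.bst = bst                       # bit-set threshold

    def note_erase(self, block):
        self.bits[block // self.blocks_per_bit] = 1

    def reliable(self):
        """Cold-block info is trusted only while few bits are set."""
        return sum(self.bits) < self.bst

    def is_cold_candidate(self, block):
        return self.reliable() and self.bits[block // self.blocks_per_bit] == 0

if __name__ == "__main__":
    t = BitArrayColdTracker(num_blocks=1024)
    for b in (3, 70, 500):
        t.note_erase(b)
    print(t.is_cold_candidate(900))   # True: its group saw no erasures this period
```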
Nobutaro SHIBATA Takako ISHIHARA
Cache memories are the major application of high-speed SRAMs, and they are frequently installed in high-performance logic VLSIs, including microprocessors. This paper presents a 4-way set-associative SOI cache-tag memory. To obtain higher operating speed with less power dissipation, we devised an I/O-separated memory cell with a dual-rail wordline, which is used to transmit complementary selection signals. The address decoding delay was shortened using CMOS dual-rail logic. To enhance the maximum operating frequency, bitline recovery operations after writing data were eliminated using a memory array configuration without half-selected cells. Moreover, conventional, sensitive but slow differential amplifiers were successfully removed from the data I/O circuitry with a hierarchical bitline scheme. As regards stored-data management, we devised a new hardware-oriented LRU data replacement algorithm based on a 6-bit directed graph. Experimental results obtained with a test chip fabricated in a 0.25-µm CMOS/SIMOX process show that the core of the cache-tag memory with a 1024-set configuration can achieve a 1.5-ns address access time under typical conditions of a 2-V power supply and 25°C. The power dissipation during standby was less than 14 µW, and that at 500-MHz operation was 13-83 mW, depending on the bit-stream data pattern.
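For orientation, the pairwise-comparison form of 4-way LRU, which also uses 6 bits (one per pair of ways), can be sketched as follows; this illustrates the general directed-graph idea only, and the paper's exact encoding may differ.

```python
# Illustration of pairwise-comparison LRU for a 4-way set using 6 bits.
# bit[(i, j)] == 1 means way i was used more recently than way j.
from itertools import combinations

class PairwiseLRU4:
    def __init__(self):
        self.bit = {pair: 0 for pair in combinations(range(4), 2)}

    def touch(self, way):
        """Mark 'way' as the most recently used way."""
        for i, j in self.bit:
            if way == i:
                self.bit[(i, j)] = 1
            elif way == j:
                self.bit[(i, j)] = 0

    def victim(self):
        """The LRU way is older than every other way in all its pair bits."""
        for w in range(4):
            if all((self.bit[(i, j)] == 0 if w == i else self.bit[(i, j)] == 1)
                   for i, j in self.bit if w in (i, j)):
                return w

if __name__ == "__main__":
    lru = PairwiseLRU4()
    for way in (0, 2, 1, 3, 2):
        lru.touch(way)
    print(lru.victim())   # 0: the least recently touched way
```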
Youngjoo LEE Jaehwan JUNG In-Cheol PARK
This paper presents a novel low-power decoder architecture for the (36420, 32778) binary LDPC code, targeting energy-efficient NAND-flash-based mobile devices. The proposed energy-scalable decoding algorithm reduces the operating bit-width of the decoding function units at the early-use stage, where the channel condition is good enough to lower the precision of computation. Based on a flexible adder structure, the proposed LDPC decoder reduces decoding energy by freezing the unnecessary parts of its hardware resources. A prototype 4KB LDPC decoder is designed in a 65nm CMOS technology; it achieves an average decoding throughput of 8.13Gb/s with 1.2M equivalent gates. The power consumption of the decoder ranges from 397mW to 563mW depending on operating conditions.
Dynamic instruction window resizing (DIWR) is a scheme that effectively exploits both memory-level parallelism and instruction-level parallelism by configuring the instruction window size appropriately for the parallelism being exploited. Although a previous study has shown that the DIWR processor achieves a significant speedup, its power consumption has not been explored. Power consumption increases in DIWR because the instruction window resources are enlarged in memory-intensive phases. If the power consumption exceeds the power budget determined by certain requirements, the DIWR processor must save power, and the previously reported performance cannot be achieved. In this paper, we explore to what extent the DIWR processor can improve performance under a given power budget, assuming that dynamic voltage and frequency scaling (DVFS) is introduced as the power saving technique. Evaluation results using the SPEC2006 benchmark programs show that the DIWR processor, even with a constrained power budget, achieves a speedup over the conventional processor across a wide range of power budgets. At the most important power budget point, i.e., when DIWR is supplied with the power that a conventional processor consumes without any power constraint, it achieves a 16% speedup.
Tatsuro KOJO Masashi TAWADA Masao YANAGISAWA Nozomu TOGAWA
Data stored in non-volatile memories may be corrupted by crosstalk and radiation, but it can be restored by using error-correcting codes. However, non-volatile memories consume a large amount of energy in writing. Reducing the maximum number of written bits while still using error-correcting codes is one of the challenges in non-volatile memory design. In this paper, we first propose the Doughnut code, which is based on state encoding that limits the maximum and minimum Hamming distances. We then propose a code expansion method that improves the maximum and minimum Hamming distances. When we apply our code expansion method to the Doughnut code, we obtain a code that reduces the maximum number of flipped bits and has error-correcting ability equal to that of a Hamming code. Experimental results show that the proposed code efficiently reduces the maximum number of written bits.
Masashi TAWADA Shinji KIMURA Masao YANAGISAWA Nozomu TOGAWA
Non-volatile memory has many advantages, such as high density and low leakage power, but it consumes more write energy than SRAM. Reducing write energy is therefore essential in non-volatile memory design. In this paper, we propose write-reduction codes based on error-correcting codes that reduce write energy in non-volatile memory by decreasing the number of written bits. When data is written into memory cells, it is not written directly but encoded into a codeword. In our write-reduction codes, every data value corresponds to an information vector of an error-correcting code, and an information vector corresponds not to a single codeword but to a set of write-reduction codewords. Given the data to be written and the current memory bits, we can deterministically select the particular write-reduction codeword for that data such that the maximum number of flipped bits is theoretically minimized. The number of bits written into memory cells is then also minimized. Experimental results demonstrate that we achieved an average reduction of 51% in written bits and 33% in write energy compared to non-encoded memory.
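The mechanism stated above, mapping each data value to a set of candidate codewords and picking the one closest to the current cell contents, can be illustrated with the generic Python sketch below; the tiny codebook is an illustrative assumption and is not the paper's code construction.

```python
# Generic sketch of write-reduction coding: each data value maps to a *set* of
# candidate codewords, and the encoder deterministically picks the candidate
# with the smallest Hamming distance to the bits already stored, minimizing
# flipped cells.  The codebook below is an assumption for illustration only.

def hamming(a, b):
    return bin(a ^ b).count("1")

# Assumed mapping: each 2-bit value has two 4-bit candidates (a word and its
# complement), and the candidate sets are disjoint so decoding stays unique.
CODEBOOK = {
    0b00: (0b0000, 0b1111),
    0b01: (0b0101, 0b1010),
    0b10: (0b0011, 0b1100),
    0b11: (0b0110, 0b1001),
}

def encode(data, current_bits):
    """Pick the candidate codeword that flips the fewest memory cells."""
    return min(CODEBOOK[data], key=lambda cw: hamming(cw, current_bits))

if __name__ == "__main__":
    stored = 0b0111
    new_word = encode(0b00, stored)
    print(f"{new_word:04b}", hamming(new_word, stored), "bit flipped")
```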
In 1973, Arimoto proved the strong converse theorem for discrete memoryless channels, which states that when the transmission rate R is above the channel capacity C, the error probability of decoding goes to one as the block length n of the code word tends to infinity. He proved the theorem by deriving an exponent function for the probability of correct decoding that is positive if and only if R>C. Subsequently, in 1979, Dueck and Körner determined the optimal exponent of correct decoding. Arimoto's bound has been said to be equal to the bound of Dueck and Körner, but a rigorous proof of this equivalence has not been presented so far. In this paper we give a rigorous proof of the equivalence of Arimoto's bound to that of Dueck and Körner.
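For orientation, the Dueck and Körner correct-decoding exponent is commonly written in the following form (W the channel, P an input distribution, V a test channel); this is quoted from the standard literature as background rather than taken from the paper itself.

```latex
G(R|W) \;=\; \min_{P,\,V}\Bigl\{\, D(V \,\|\, W \mid P) \;+\; \bigl[\,R - I(P,V)\,\bigr]^{+} \Bigr\},
\qquad [x]^{+} := \max\{x,\,0\},
```

which satisfies G(R|W) > 0 if and only if R > C, so the probability of correct decoding decays exponentially in n whenever the rate exceeds capacity.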
Shuping ZHANG Jinjia ZHOU Dajiang ZHOU Shinji KIMURA Satoshi GOTO
Motion estimation (ME) is a key encoding component of almost all modern video coding standards. ME contributes significantly to video coding efficiency, but it also consumes the most power of any component in a video encoder. In this paper, an ME processor with a 3D stacked memory architecture is proposed to reduce memory and core power consumption. First, a memory die is designed and stacked with the ME die. By adding face-to-face (F2F) pads and through-silicon-via (TSV) definitions, 2D electronic design automation (EDA) tools can be extended to support the proposed 3D stacking architecture. Moreover, a dedicated memory controller is applied to control data transmission and timing between the memory die and the ME processor die. Finally, a 3D physical design is completed for the entire system, including TSV/F2F placement, floor plan optimization, and power network generation. Compared to 2D technology, the number of input/output (IO) pins is reduced by 77%. After optimizing the floor plans of the processor die and memory die, the routing wire lengths are reduced by 13.4% and 50%, respectively. The stacked static random access memory contributes the largest power reduction in this work. The simulation results show that the design can support real-time 720p @ 60fps encoding at 8MHz using less than 65mW of power, which compares favorably with state-of-the-art ME processors.
In recent years, applications of complex-valued neural networks have become widespread. Quaternions are an extension of complex numbers, and neural networks with quaternions have been proposed. Because quaternion algebra is non-commutative, two orders of multiplication can be considered when calculating the weighted input. However, both orders provide almost the same performance. We propose hybrid quaternionic Hopfield neural networks, which use both orders of multiplication. Using computer simulations, we show that these networks outperform conventional quaternionic Hopfield neural networks in noise tolerance. We discuss why hybrid quaternionic Hopfield neural networks improve noise tolerance from the standpoint of rotational invariance.
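The non-commutativity underlying the two multiplication orders is standard quaternion algebra and can be checked with a few lines of Python; the snippet below only demonstrates that w*x and x*w generally differ, it is not part of the proposed network.

```python
# Quaternion multiplication is non-commutative, which is why a quaternionic
# neuron can form its weighted input either as w * x or as x * w.

def qmul(p, q):
    """Hamilton product of quaternions given as (w, x, y, z) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

if __name__ == "__main__":
    w = (0, 1, 0, 0)   # the unit quaternion i
    x = (0, 0, 1, 0)   # the unit quaternion j
    print(qmul(w, x))  # (0, 0, 0, 1)  = k
    print(qmul(x, w))  # (0, 0, 0, -1) = -k, so w*x != x*w
```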
For systems with a delay in the input, the predictor method has often been used in state feedback controllers for system stabilization or regulation. In this letter, we show that for a chain of integrators, even with an unknown input delay, a much simpler and memoryless controller is a good candidate for system regulation. With an adaptive gain-scaling factor, the proposed state feedback controller can deal with an unknown time-varying delay in the input. An example is given for illustration.
Akio OHTA Chong LIU Takashi ARAI Daichi TAKEUCHI Hai ZHANG Katsunori MAKIHARA Seiichi MIYAZAKI
Ni nanodots (NDs) used as nano-scale top electrodes were formed on a 10-nm-thick Si-rich oxide (SiOx)/Ni bottom electrode by exposing a 2-nm-thick Ni layer to remote H2 plasma (H2-RP) without external heating, and the resistance-switching behaviors of the SiOx were investigated from current-voltage (I-V) curves. Atomic force microscope (AFM) analyses confirmed the formation of electrically isolated Ni NDs as a result of surface migration and agglomeration of Ni atoms promoted by the surface recombination of H radicals. From local I-V measurements performed by contacting a single Ni ND as a top electrode with a Rh-coated Si cantilever, a distinct unipolar-type resistance switching behavior was observed repeatedly despite an average contact area between the Ni ND and the SiOx as small as ~1.9 × 10^-12 cm^2. This local I-V measurement technique is quite a simple method to evaluate the size scalability of switching properties.
Qiao YU Shujuan JIANG Yingqi LIU
A memory leak occurs when useless objects cannot be released for a long time during program execution. Leaked objects may cause memory overflow and system performance degradation, and may even crash the system in serious cases. This paper presents a dynamic approach for detecting and measuring memory-leaked objects in Java programs. First, our approach tracks the program via the Java Debug Interface (JDI) and records heap information to find potentially leaked objects. Second, we present a memory-leaking confidence metric to measure the influence of these objects on the program. Finally, we select three open-source programs to evaluate the efficiency of our approach. Furthermore, we choose ten programs from the DaCapo 9.12 benchmark suite to measure the time overhead of our approach. The experimental results show that our approach is able to detect and measure memory-leaked objects efficiently.
Atsushi OOKA Shingo ATA Kazunari INOUE Masayuki MURATA
Content-centric networking (CCN) is an innovative network architecture that is being considered as a successor to the Internet. In recent years, CCN has received increasing attention from all over the world because its novel technologies (e.g., caching, multicast, aggregating requests) and communication based on names that act as addresses for content have the potential to resolve various problems facing the Internet. Implementing these technologies, however, requires routers with performance far superior to that offered by today's Internet routers. Although many researchers have proposed various router components, such as caching and name lookup mechanisms, there are few router-level designs incorporating all the necessary components. The design and evaluation of a complete router is the primary contribution of this paper. We provide a concrete hardware design for a router model that uses three basic tables, the forwarding information base (FIB), pending interest table (PIT), and content store (CS), and incorporates two entities that we propose. One of these entities is the name lookup entity, which looks up a name address within a few cycles from content-addressable memory by using a Bloom filter; the other is the interest count entity, which counts interest packets that request certain content and selects content worth caching. Our contributions are (1) presenting a proper algorithm for looking up and matching name addresses in CCN communication, (2) proposing a method to process CCN packets in a way that achieves high throughput and very low latency, and (3) demonstrating feasible performance and cost on the basis of a concrete hardware design using distributed content-addressable memory.
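The general Bloom-filter prefiltering mechanism mentioned for the name lookup entity can be sketched in software as follows; the filter size, hash count, and name strings are assumptions, and the hardware entity in the paper of course differs from this toy.

```python
# Sketch of Bloom-filter-assisted name lookup: the filter answers
# "possibly stored / definitely not stored" in a few hash probes before the
# expensive content-addressable memory search.  Parameters are assumptions.
import hashlib

class NameBloomFilter:
    def __init__(self, bits=4096, hashes=3):
        self.bits, self.hashes, self.vector = bits, hashes, 0

    def _positions(self, name):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}/{name}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, name):
        for pos in self._positions(name):
            self.vector |= 1 << pos

    def maybe_contains(self, name):
        return all((self.vector >> pos) & 1 for pos in self._positions(name))

if __name__ == "__main__":
    fib_filter = NameBloomFilter()
    fib_filter.add("/video/movies/title")
    print(fib_filter.maybe_contains("/video/movies/title"))   # True
    print(fib_filter.maybe_contains("/news/today"))           # almost surely False
```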
Masafumi KINOSHITA Osamu TAKADA Izumi MIZUTANI Takafumi KOIKE Kenji LEIBNITZ Masayuki MURATA
In the big data era, messaging systems are required to process large volumes of message traffic with high scalability and availability. However, conventional systems have two issues regarding availability. The first is that failover processing itself carries a risk of failure. The second is the trade-off between consistency and availability. We propose a resilient messaging system based on a distributed in-memory key-value store (KVS). Its servers are interconnected with each other, and messages are distributed to multiple servers in the normal processing state. This architecture can continue messaging services without failover processing, wherever in the messaging system server or process failures occur. Furthermore, we propose two methods for improved resilience: a round-robin method with slowdown-KVS exclusion, and two logical counter-rotating KVS rings, to provide short-term availability in the messaging system. Evaluation results demonstrate that the proposed system can continue service without failover processing. Compared with the conventional method, our proposed distribution method reduced error responses to clients caused by server failures by 92%.
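As a rough illustration of the first of these methods (round-robin distribution that skips excluded servers), the following Python toy shows how messages keep flowing to healthy KVS servers without a failover step; the server names and exclusion policy are assumptions, not the paper's implementation.

```python
# Toy sketch: round-robin message distribution that skips KVS servers currently
# marked as slowed down or failed, so clients keep getting served.

class RoundRobinDistributor:
    def __init__(self, servers):
        self.servers = servers
        self.excluded = set()
        self.next_index = 0

    def mark_slowdown(self, server):
        self.excluded.add(server)

    def recover(self, server):
        self.excluded.discard(server)

    def pick(self):
        """Return the next healthy server in round-robin order."""
        for _ in range(len(self.servers)):
            server = self.servers[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.servers)
            if server not in self.excluded:
                return server
        raise RuntimeError("no healthy KVS server available")

if __name__ == "__main__":
    rr = RoundRobinDistributor(["kvs-a", "kvs-b", "kvs-c"])
    rr.mark_slowdown("kvs-b")
    print([rr.pick() for _ in range(4)])   # ['kvs-a', 'kvs-c', 'kvs-a', 'kvs-c']
```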