
Keyword Search Result

[Keyword] cache memory (15 hits)

1-15 of 15 hits
  • An Efficient Reference Image Sharing Method for the Image-Division Parallel Video Encoding Architecture

    Ken NAKAMURA  Yuya OMORI  Daisuke KOBAYASHI  Koyo NITTA  Kimikazu SANO  Masayuki SATO  Hiroe IWASAKI  Hiroaki KOBAYASHI  

     
    PAPER

      Publicized:
    2022/11/29
      Vol:
    E106-C No:6
      Page(s):
    312-320

    This paper proposes an efficient reference image sharing method for the image-division parallel video encoding architecture. This method reduces the amount of data transfer by using pre-transfer with area prediction and on-demand transfer with a transfer management table. Experimental results show that the data transfer can be reduced to 19.8-35.3% of that of the conventional method on average, without major degradation of coding performance. This reduction makes it possible to lower the required bandwidth of the inter-chip transfer interface.
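
    As an illustration of the transfer-management idea above, the following Python sketch tracks which reference-image tiles are already on the remote chip, pre-transfers a predicted area, and falls back to on-demand transfer on a miss. The tile granularity, class name, and interfaces are assumptions for illustration, not the authors' implementation.

      # Hypothetical sketch: pre-transfer with area prediction plus on-demand
      # transfer guarded by a transfer management table.
      class TransferManagementTable:
          def __init__(self):
              self.transferred = set()  # (tile_x, tile_y) regions already sent

          def pre_transfer(self, predicted_tiles):
              """Send tiles predicted to be referenced, skipping duplicates."""
              new = [t for t in predicted_tiles if t not in self.transferred]
              self.transferred.update(new)
              return new  # tiles actually sent over the inter-chip interface

          def on_demand(self, tile):
              """Fetch a tile only if pre-transfer missed it."""
              if tile in self.transferred:
                  return False  # already on-chip: no extra transfer
              self.transferred.add(tile)
              return True       # miss: one additional transfer

      table = TransferManagementTable()
      sent = table.pre_transfer({(0, 0), (0, 1), (1, 0)})  # predicted search area
      extra = table.on_demand((2, 0))                      # motion search strayed
      print(len(sent), extra)                              # 3 True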

  • A Conflict-Aware Capacity Control Mechanism for Deep Cache Hierarchy

    Jiaheng LIU  Ryusuke EGAWA  Hiroyuki TAKIZAWA  

     
    PAPER-Computer System

      Publicized:
    2022/03/09
      Vol:
    E105-D No:6
      Page(s):
    1150-1163

    As the number of cores on a processor increases, cache hierarchies contain more cache levels and a larger last level cache (LLC). Thus, the power and energy consumption of the cache hierarchy becomes non-negligible. Meanwhile, because the cache usage behaviors of individual applications can differ, it is possible to achieve higher energy efficiency by determining the appropriate cache configuration for each application. This paper proposes a cache control mechanism that improves energy efficiency by adjusting the cache hierarchy to each application. Our mechanism first bypasses and disables a less-significant cache level, then partially disables the LLC, and finally adjusts the associativity if the application suffers from a large number of conflict misses. The mechanism achieves significant energy savings at the cost of only a small performance degradation. The evaluation results show that our mechanism improves energy efficiency by 23.9% and 7.0% on average over the baseline and the cache-level bypassing mechanisms, respectively. In addition, even when LLC resource contention occurs, the proposed mechanism remains effective for improving energy efficiency.
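
    A minimal sketch of the three-stage control loop described above, assuming per-interval hardware counters are available; every threshold, counter name, and configuration field here is an invented placeholder, not the paper's mechanism.

      # Stage 1: bypass a less-significant level; stage 2: partially disable
      # the LLC; stage 3: restore associativity when conflict misses dominate.
      def control_step(stats, config):
          if stats["l2_hit_rate"] < 0.05 and not config["l2_bypassed"]:
              config["l2_bypassed"] = True            # L2 rarely helps: bypass it
          elif stats["llc_way_utility"] < 0.01 and config["llc_active_ways"] > 2:
              config["llc_active_ways"] -= 1          # shrink LLC capacity
          elif stats["conflict_miss_ratio"] > 0.30 and config["llc_active_ways"] < 16:
              config["llc_active_ways"] += 1          # conflicts: re-enable a way
          return config

      cfg = {"l2_bypassed": False, "llc_active_ways": 16}
      cfg = control_step({"l2_hit_rate": 0.02, "llc_way_utility": 0.2,
                          "conflict_miss_ratio": 0.1}, cfg)
      print(cfg)  # {'l2_bypassed': True, 'llc_active_ways': 16}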

  • Packet Processing Architecture with Off-Chip Last Level Cache Using Interleaved 3D-Stacked DRAM Open Access

    Tomohiro KORIKAWA  Akio KAWABATA  Fujun HE  Eiji OKI  

     
    PAPER-Network System

      Publicized:
    2020/08/06
      Vol:
    E104-B No:2
      Page(s):
    149-157

    The performance of packet processing applications depends on the memory access speed of network systems. Table lookup, one of the most common processes in packet processing applications, requires fast memory access and can be a dominant performance bottleneck. Therefore, in Network Function Virtualization (NFV)-aware environments, the on-chip cache memories of a general-purpose CPU become critical for achieving packet processing speeds over tens of Gbps. Moreover, in carrier network systems, multiple types of applications, including complex ones, are executed simultaneously on the same system, which also requires adequate cache capacity. In this paper, we propose a packet processing architecture that utilizes interleaved 3-Dimensional (3D)-stacked Dynamic Random Access Memory (DRAM) devices as an off-chip Last Level Cache (LLC), in addition to the several levels of dedicated cache memories of each CPU core. Entries of a lookup table are distributed across all banks and vaults to exploit both bank interleaving and vault-level memory parallelism. Frequently accessed entries in the 3D-stacked DRAM are also cached in the on-chip dedicated cache memories of each CPU core. The evaluation results show that the proposed architecture reduces the memory access latency by 57% and increases the throughput by 100%, while reducing the blocking probability by about 10%, compared with the architecture with a shared on-chip LLC. These results indicate that 3D-stacked DRAM can be practical as an off-chip LLC in parallel packet processing systems.
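
    To make the entry-distribution idea concrete, here is a small sketch of hashing lookup-table keys across vaults and banks so that consecutive lookups can proceed in parallel; the vault/bank counts and the hash function are illustrative assumptions.

      import hashlib

      NUM_VAULTS, BANKS_PER_VAULT = 16, 8  # assumed 3D-stacked DRAM geometry

      def placement(key: bytes):
          h = int.from_bytes(hashlib.blake2b(key, digest_size=4).digest(), "little")
          vault = h % NUM_VAULTS                       # vaults operate in parallel
          bank = (h // NUM_VAULTS) % BANKS_PER_VAULT   # banks interleave within a vault
          return vault, bank

      # Nearby destination addresses scatter across vaults and banks:
      for dst in [b"10.0.0.1", b"10.0.0.2", b"10.0.0.3"]:
          print(dst, placement(dst))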

  • Analysis on Hybrid SSD Configuration with Emerging Non-Volatile Memories Including Quadruple-Level Cell (QLC) NAND Flash Memory and Various Types of Storage Class Memories (SCMs)

    Yoshiki TAKAI  Mamoru FUKUCHI  Chihiro MATSUI  Reika KINOSHITA  Ken TAKEUCHI  

     
    PAPER-Integrated Electronics

      Vol:
    E103-C No:4
      Page(s):
    171-180

    This paper analyzes the optimal SSD configuration including emerging non-volatile memories such as quadruple-level cell (QLC) NAND flash memory [1] and storage class memories (SCMs). First, the performance and endurance lifetime of hybrid SSDs are evaluated in four configurations: 1) single-level cell (SLC)/QLC NAND flash, 2) SCM/QLC NAND flash, 3) SCM/triple-level cell (TLC)/QLC NAND flash, and 4) SCM/TLC NAND flash. These four configurations are then compared under a limited cost. For cold workloads or under a high total-SSD-cost assumption, the SCM/TLC NAND flash hybrid configuration is recommended for both SSD performance and endurance lifetime. For hot workloads with a low total-SSD-cost assumption, however, the SLC/QLC NAND flash hybrid configuration is recommended, with emphasis on SSD endurance lifetime. Under the same conditions, the SCM/TLC/QLC NAND flash tri-hybrid is the best configuration in terms of SSD performance per cost. In particular, for prxy_0 (a write-hot workload), the SCM/TLC/QLC NAND flash tri-hybrid achieves 67% higher IOPS/cost than the SCM/TLC NAND flash hybrid. Moreover, the configurations with the highest IOPS/cost for each workload and cost limit are analyzed with various types of SCMs. In all cases except prxy_1 with a high total-SSD-cost assumption, a middle-end SCM (write latency: 1 µs, read latency: 1 µs) is recommended for performance per cost. For prxy_1 (a read-hot workload) with a high total-SSD-cost assumption, however, a high-end SCM (write latency: 100 ns, read latency: 100 ns) achieves the best performance.
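
    The selection logic behind "highest IOPS/cost under a cost limit" can be sketched as below; all IOPS and cost numbers, and the budget, are fabricated placeholders rather than the paper's measurements.

      configs = {  # fabricated (IOPS, cost) points for the four hybrids
          "SLC/QLC":     {"iops": 40_000, "cost": 90},
          "SCM/QLC":     {"iops": 55_000, "cost": 110},
          "SCM/TLC/QLC": {"iops": 70_000, "cost": 120},
          "SCM/TLC":     {"iops": 65_000, "cost": 140},
      }

      budget = 125  # assumed total SSD cost limit
      feasible = {k: v for k, v in configs.items() if v["cost"] <= budget}
      best = max(feasible, key=lambda k: feasible[k]["iops"] / feasible[k]["cost"])
      print(best)  # SCM/TLC/QLC wins under this fabricated budget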

  • On-Chip Cache Architecture Exploiting Hybrid Memory Structures for Near-Threshold Computing

    Hongjie XU  Jun SHIOMI  Tohru ISHIHARA  Hidetoshi ONODERA  

     
    PAPER

      Vol:
    E102-A No:12
      Page(s):
    1741-1750

    This paper focuses on the power-area trade-off axis of memory systems. In contrast to the power-performance-area trade-off explored for traditional high-performance caches, this paper targets the edge-processing environment, which is becoming increasingly important in the Internet of Things (IoT) era. A new power-oriented trade-off is proposed for on-chip cache architecture. As a case study, this paper exploits the good energy efficiency of Standard-Cell Memory (SCM) operating in the near-threshold voltage region and the good area efficiency of Static Random Access Memory (SRAM). A hybrid two-level on-chip cache structure is first introduced as a replacement for the 6T-SRAM L0 cache to save energy. This paper then proposes a method for finding the capacity combination of SCM and SRAM that minimizes the energy consumption of the hybrid cache under a given cache area constraint. Simulation results using a 65-nm process technology show that up to 80% of energy consumption is reduced, without increasing the die area, by replacing a conventional SRAM instruction cache with the hybrid two-level cache. The results show that energy consumption can be reduced whenever the area constraint for the proposed hybrid cache is less than the area equivalent to an 8 kB SRAM. If the target operating frequency is below 100 MHz, energy reduction can be achieved, which implies that the proposed cache system is suitable for low-power systems where only moderate processing speed is required.
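
    The capacity-combination search can be pictured as a small exhaustive sweep: try every SCM/SRAM split that fits the area budget and keep the one with the lowest modeled energy. The locality curve, per-access energies, and densities below are fabricated assumptions, not the paper's 65-nm models.

      def energy(scm_kb, sram_kb):
          # Fabricated model: the SCM level filters accesses cheaply; misses
          # fall through to the SRAM level, then to the next memory level.
          p0 = 1 - 0.5 ** scm_kb if scm_kb else 0.0   # SCM hit rate
          p1 = 1 - 0.5 ** (sram_kb / 4)               # SRAM hit rate after an SCM miss
          e_scm, e_sram, e_next = 0.3, 1.0, 5.0       # assumed relative energies
          return p0 * e_scm + (1 - p0) * (
              e_scm + p1 * e_sram + (1 - p1) * (e_sram + e_next))

      A_SCM, A_SRAM = 0.050, 0.012  # assumed mm^2 per kB (SCM is less dense)

      def best_split(area_budget):
          cands = [(energy(s, r), s, r)
                   for s in range(0, 9) for r in range(0, 33, 4)
                   if s * A_SCM + r * A_SRAM <= area_budget]
          return min(cands)  # (energy, scm_kb, sram_kb)

      print(best_split(0.4))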

  • Towards Ultra-High-Speed Cryogenic Single-Flux-Quantum Computing Open Access

    Koki ISHIDA  Masamitsu TANAKA  Takatsugu ONO  Koji INOUE  

     
    INVITED PAPER

      Vol:
    E101-C No:5
      Page(s):
    359-369

    CMOS microprocessors can no longer keep improving their clock speeds because of growing power consumption, i.e., they face the power-wall problem. Single-flux-quantum (SFQ) circuits offer a solution with their ultra-high-speed and ultra-low-power nature. This paper introduces our contributions towards ultra-high-speed cryogenic SFQ computing. The first step is to design SFQ microprocessors. From qualitative and quantitative evaluations of previously designed SFQ microprocessors, we found that revisiting the architecture of SFQ microprocessors and on-chip caches is the first critical challenge. On the basis of cross-layer discussions and analysis, we concluded that a bit-parallel gate-level pipeline architecture is the best solution for SFQ designs. This paper summarizes our current research results targeting SFQ microprocessors and on-chip cache architectures.

  • Improvement of Data Utilization Efficiency for Cache Memory by Compressing Frequent Bit Sequences

    Ryotaro KOBAYASHI  Ikumi KANEKO  Hajime SHIMADA  

     
    PAPER

      Vol:
    E99-C No:8
      Page(s):
    936-946

    In recent processor designs, memory access latency is shortened by adopting a memory hierarchy in which the main memory comprises dynamic random-access memory (DRAM) and the cache memory consists of static random-access memory (SRAM). Cache memory, which is now deployed in increasingly large volumes, accounts for a large proportion of the overall processor's energy consumption. There are two ways to reduce the energy consumption of the cache memory: decreasing the number of accesses, and minimizing the energy consumed per access. In this study, we reduce the size of the L1 cache by compressing frequent bit sequences, thus cutting the energy consumed per access. A “frequent bit sequence” is a specific bit pattern that often appears in the high-order bits of data retained in the cache memory. Based on measurements using a software simulator, our proposed mechanism cuts energy consumption by 41.0% on average compared with conventional mechanisms.
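
    The encoding idea can be sketched as follows: if a word's high-order bits match a pattern from a small table of frequent bit sequences, store only a pattern index plus the low-order bits. The table contents and field widths are assumptions for illustration.

      FREQ_PATTERNS = [0x00000, 0xFFFFF, 0x08048]  # assumed frequent upper-20-bit patterns

      def encode(word: int):
          hi, lo = word >> 12, word & 0xFFF
          if hi in FREQ_PATTERNS:
              # flag + 2-bit pattern index + 12 low bits instead of 32 bits
              return ("compressed", FREQ_PATTERNS.index(hi), lo)
          return ("raw", word)

      def decode(enc):
          if enc[0] == "compressed":
              _, idx, lo = enc
              return (FREQ_PATTERNS[idx] << 12) | lo
          return enc[1]

      w = 0xFFFFF123          # small negative-like value: upper bits all ones
      assert decode(encode(w)) == w
      print(encode(w))        # ('compressed', 1, 291)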

  • MVP-Cache: A Multi-Banked Cache Memory for Energy-Efficient Vector Processing of Multimedia Applications

    Ye GAO  Masayuki SATO  Ryusuke EGAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER-Computer System

      Publicized:
    2014/08/22
      Vol:
    E97-D No:11
      Page(s):
    2835-2843

    Vector processors have significant advantages for next-generation multimedia applications (MMAs). One advantage is that vector processors can achieve high data transfer performance by using a high-bandwidth memory subsystem, resulting in high sustained computing performance. However, a high-bandwidth memory subsystem usually incurs enormous costs in terms of chip area, power, and energy consumption. These costs are too high for commodity computer systems, which are the main execution platform of MMAs. This paper proposes a new multi-banked cache memory for commodity computer systems, called MVP-cache, to expand the potential of vector architectures on MMAs. Unlike conventional multi-banked cache memories, which employ one tag array and one data array per sub-cache, MVP-cache associates one tag array with multiple independent data arrays of small-sized cache lines. In this way, MVP-cache incurs less static power consumption in its tag arrays. MVP-cache also achieves high efficiency on short vector data transfers because the data transfers of each data array are controlled independently, improving transfer flexibility.
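
    Structurally, the abstract's key point is that one tag entry governs several small, independently controlled data arrays. The sketch below models a single set; the array count, line sizes, and fill interface are assumptions.

      class MVPCacheSet:
          NUM_ARRAYS = 4  # independent data arrays sharing one tag (assumed)

          def __init__(self):
              self.tag = None
              self.data = [None] * self.NUM_ARRAYS  # one slot per data array

          def fill(self, tag, array_idx, words):
              """Fill only the data array a (short) vector transfer needs."""
              if self.tag != tag:
                  self.tag, self.data = tag, [None] * self.NUM_ARRAYS
              self.data[array_idx] = words

          def read(self, tag, array_idx):
              if self.tag == tag and self.data[array_idx] is not None:
                  return self.data[array_idx]  # hit in one small array
              return None                      # miss; other arrays untouched

      s = MVPCacheSet()
      s.fill(tag=0x1A, array_idx=2, words=[1, 2, 3, 4])  # short vector: one array
      print(s.read(0x1A, 2), s.read(0x1A, 0))            # [1, 2, 3, 4] None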

  • Periodic Pattern Coding for Last Level Cache Data Compression

    Haruhiko KANEKO  

     
    PAPER-Data Compression

      Vol:
    E96-A No:12
      Page(s):
    2351-2359

    Despite the continuous improvement in the computational power of multi/many-core processors, their memory access performance has not improved sufficiently, so the overall performance of recent processors is often restricted by the delay of off-chip memory accesses. Low-delay data compression for the last level cache (LLC) would be effective for improving processor performance, because compression increases the effective size of the LLC and thus reduces the number of off-chip memory accesses. This paper proposes a novel data compression method suitable for high-speed parallel decoding in the LLC. Since cache line data often have periodicity of certain lengths, such as 32- or 64-bit instructions, 32-bit integers, and 64-bit floating-point numbers, an information word is encoded as a base pattern and a differential pattern between the original word and the base pattern. Evaluation using a GPU simulator shows that the compression ratio of the proposed coding is comparable to LZSS coding and X-Match Pro, and superior to other conventional compression algorithms for cache memories. This paper also presents an experimental decoder designed for ASIC; the synthesized result shows that the decoder can decompress cache line data of 32 bytes in four clock cycles. Evaluation of the IPC on the GPU simulator shows that, for several benchmark programs, the IPC achieved by the proposed coding is higher than that of the conventional BΔI coding, with a maximum IPC improvement of 20%.
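
    The base-plus-differential encoding can be illustrated with a simple sketch: take the first word of a (periodic) line as the base and store narrow deltas for the rest, falling back to raw storage when the deltas do not fit. Word size and delta width are assumptions, and this simplification omits the paper's periodicity detection.

      def encode_line(words, delta_bits=8):
          base = words[0]
          deltas = [w - base for w in words]
          limit = 1 << (delta_bits - 1)
          if all(-limit <= d < limit for d in deltas):
              return ("base+delta", base, deltas)  # 32 + 8n bits instead of 32n
          return ("raw", words)

      def decode_line(enc):
          if enc[0] == "base+delta":
              _, base, deltas = enc
              return [base + d for d in deltas]    # each word decodes independently,
          return enc[1]                            # which enables parallel decoding

      line = [0x1000_0000 + i for i in range(8)]   # periodic 32-bit integers
      assert decode_line(encode_line(line)) == line
      print(encode_line(line)[0])                  # base+delta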

  • A High-Speed Trace-Driven Cache Configuration Simulator for Dual-Core Processor L1 Caches

    Masashi TAWADA  Masao YANAGISAWA  Nozomu TOGAWA  

     
    PAPER

      Vol:
    E96-A No:6
      Page(s):
    1283-1292

    Multi-core processors are now often used in embedded systems. Since the application programs running on an embedded system are quite limited, there must exist an optimal cache memory configuration in terms of power and area, and simulating the application programs on various cache configurations is one of the best ways to determine it. Multi-core cache configuration simulation, however, is much more complicated and takes much more time than single-core cache configuration simulation. In this paper, we propose a very fast dual-core L1 cache configuration simulation algorithm. We first propose a new data structure in which a single instance represents two or more multi-core cache configurations with different cache associativities. We then propose a new multi-core cache configuration simulation algorithm that uses this data structure together with newly established theorems. Experimental results demonstrate that our algorithm obtains exact simulation results while running 20 times faster than a conventional approach.
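
    One classic way a single data structure can stand in for many configurations is the LRU stack-distance property: with true LRU, a reference hits in an A-way set iff its stack distance in that set is less than A, so one pass over the trace yields hit counts for every associativity at once. The sketch below shows that generic single-core idea; it is not the authors' dual-core structure or theorems.

      from collections import defaultdict

      def all_assoc_hits(trace, num_sets, max_ways, line_bytes=32):
          stacks = defaultdict(list)          # per-set LRU stacks
          hits = [0] * (max_ways + 1)         # hits[A] = hit count with A ways
          for addr in trace:
              line = addr // line_bytes
              s = stacks[line % num_sets]
              if line in s:
                  depth = s.index(line)       # stack distance
                  for ways in range(depth + 1, max_ways + 1):
                      hits[ways] += 1         # hit for every associativity > depth
                  s.remove(line)
              s.insert(0, line)               # move to MRU position
          return hits

      trace = [0, 32, 64, 0, 32, 4096, 0]
      print(all_assoc_hits(trace, num_sets=4, max_ways=4))  # [0, 2, 3, 3, 3]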

  • Short Term Cell-Flipping Technique for Mitigating SNM Degradation Due to NBTI

    Yuji KUNITAKE  Toshinori SATO  Hiroto YASUURA  

     
    PAPER

      Vol:
    E94-C No:4
      Page(s):
    520-529

    Negative Bias Temperature Instability (NBTI) is one of the major reliability problems in advanced technologies. NBTI causes a threshold voltage shift in a PMOS transistor: when the transistor is negatively biased, the threshold voltage shifts in the negative direction, while it recovers when the transistor is positively biased. In an SRAM cell, NBTI degrades the threshold voltage of the load PMOS transistors. This degradation affects the Static Noise Margin (SNM), a measure of the read stability of a 6-T SRAM cell. In this paper, we discuss the relationship between NBTI degradation in an SRAM cell and the dynamic stress and recovery conditions. Two characteristics are important: the stress probability, defined as the fraction of time the PMOS transistor is negatively biased, and the stress-and-recovery cycle, defined as the switching interval of the SRAM value. Our observations show that, to mitigate NBTI degradation, the stress probability should be small and the stress-and-recovery cycle should be shorter than 10 msec. Based on these observations, we propose a novel cell-flipping technique that brings the stress probability close to 50%. In addition, we present case studies applying the cell-flipping technique to register files and cache memories.
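
    Behaviourally, the cell-flipping idea is to invert the stored value periodically so each load PMOS is stressed roughly half the time. The toy model below keeps a flip flag so the logical value is unchanged; the interval and the flag encoding are assumptions, not the paper's circuit.

      FLIP_INTERVAL_MS = 5  # assumed; shorter than the 10-msec cycle noted above

      class FlippedCell:
          def __init__(self, value: int):
              self.stored = value   # physical bit actually held in the cell
              self.flipped = False  # records whether the cell is inverted

          def tick(self, now_ms: int):
              if now_ms % FLIP_INTERVAL_MS == 0:
                  self.stored ^= 1            # invert the physical storage
                  self.flipped = not self.flipped

          def read(self) -> int:
              return self.stored ^ self.flipped  # logical value is preserved

      c = FlippedCell(1)
      for t in range(20):
          c.tick(t)
          assert c.read() == 1  # stress alternates; the logical value does not
      print("stress probability driven toward 50%")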

  • Temperature-Aware Configurable Cache to Reduce Energy in Embedded Systems

    Hamid NOORI  Maziar GOUDARZI  Koji INOUE  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    418-431

    Energy consumption is a major concern in embedded computing systems. Several studies have shown that cache memories account for 40% or more of the total energy consumed in these systems. Active power used to be the primary contributor to the total power dissipation of CMOS designs, but with technology scaling, the share of leakage in the total power consumption of digital systems continues to grow. Moreover, temperature is another factor that exponentially increases the leakage current. In this paper, we show the effect of temperature on the optimal (minimum-energy-consuming) cache configuration for low-energy embedded systems. Our results show that, for a given application and technology, the optimal cache size moves toward smaller caches at higher temperatures due to the larger leakage. Consequently, a Temperature-Aware Configurable Cache (TACC) is an effective way to save energy in finer technologies when the embedded system is used at different temperatures. Our results show that using a TACC, up to 61% of energy can be saved for the instruction cache and 77% for the data cache, compared with a configurable cache configured only for the corner-case temperature (100°C). Furthermore, the TACC also enhances performance by up to 28% for the instruction cache and up to 17% for the data cache.
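
    Operationally, a TACC amounts to a small table that maps the measured temperature to the minimum-energy cache configuration for that range. The table entries below are fabricated; the paper derives the real ones per application and technology.

      TACC_TABLE = [     # (max_temp_C, cache_kb, ways) - fabricated entries
          (40, 32, 4),   # cool: leakage is small, a larger cache pays off
          (70, 16, 4),
          (100, 8, 2),   # hot: leakage dominates, shrink the cache
      ]

      def select_config(temp_c: float):
          for max_t, size_kb, ways in TACC_TABLE:
              if temp_c <= max_t:
                  return size_kb, ways
          return TACC_TABLE[-1][1], TACC_TABLE[-1][2]  # clamp above 100°C

      print(select_config(25), select_config(85))      # (32, 4) (8, 2)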

  • Address Addition and Decoding without Carry Propagation

    Yung-Hei LEE  Seung Ho HWANG  

     
    LETTER-Algorithm and Computational Complexity

      Vol:
    E80-D No:1
      Page(s):
    98-100

    Memory latency is one of the most performance-limiting factors, and the response time of adders is mainly determined by the carry propagation delay. This letter presents a new decoder logic named fused add-decoder (FADEC), which performs address addition and decoding in a single process. Although addition is involved in the process, we show that it can be computed without carry propagation. FADEC can therefore reduce memory latency by eliminating the separate address-addition cycle.
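
    The carry-free property can be demonstrated with the standard sum-addressed decoding argument: row K matches A + B exactly when, at every bit, the carry required to produce k_i equals the carry generated by the bit below it, and each of those checks is purely local. The sketch below brute-force verifies this; the bit width is an arbitrary choice, and the code illustrates the general technique rather than the FADEC circuit itself.

      def row_matches(a: int, b: int, k: int, n: int = 8) -> bool:
          """True iff (a + b) mod 2**n == k, using only per-bit local checks."""
          for i in range(n):
              ai, bi, ki = (a >> i) & 1, (b >> i) & 1, (k >> i) & 1
              need = ai ^ bi ^ ki               # carry-in required at bit i
              if i == 0:
                  ok = (need == 0)              # no external carry-in
              else:
                  aj, bj, kj = (a >> (i-1)) & 1, (b >> (i-1)) & 1, (k >> (i-1)) & 1
                  prev = aj ^ bj ^ kj           # carry required at bit i-1
                  produced = (aj & bj) | (prev & (aj | bj))  # carry out of bit i-1
                  ok = (need == produced)
              if not ok:
                  return False                  # in hardware: all checks in parallel
          return True

      n, mask = 6, (1 << 6) - 1
      for a in range(1 << n):
          for b in range(1 << n):
              k = (a + b) & mask
              assert row_matches(a, b, k, n)                   # the correct row fires
              assert not row_matches(a, b, (k + 1) & mask, n)  # wrong rows do not
      print("carry-free decode verified")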

  • An 8-mW, 8-kB Cache Memory Using an Automatic-Power-Save Architecture for Low Power RISC Microprocessors

    Yasuhisa SHIMAZAKI  Katsuhiro NORISUE  Koichiro ISHIBASHI  Hideo MAEJIMA  

     
    PAPER

      Vol:
    E79-C No:12
      Page(s):
    1693-1698

    An embedded cache memory for low-power RISC microprocessors is described. An automatic-power-save architecture (APSA) enables the cache memory to operate at high speed at high frequencies and with low power dissipation at low frequencies. A pulsed word technique (PWT) and an isolated bit line technique (IBLT) effectively reduce the power dissipation of the cache memory. Using these three techniques, the power dissipation of the cache memory is reduced to almost 60% of that of a conventional cache memory at 60 MHz, and to 20% at a clock frequency of 10 MHz. An 8-KByte test chip was fabricated using 0.5-µm CMOS technology; it achieves 80-MHz operation at a supply voltage of 3.1 V, and 8-mW operation at 10 MHz with a supply voltage of 2.5 V.

  • Hiding Data Cache Latency with Load Address Prediction

    Toshinori SATO  Hiroshige FUJII  Seigo SUZUKI  

     
    PAPER-Computer Systems

      Vol:
    E79-D No:11
      Page(s):
    1523-1532

    A new method for predicting effective addresses is presented. This method uses a buffer named the address prediction buffer and allows the data cache to be accessed speculatively. As a consequence of the trend toward increasing clock frequency, the internal cache is no longer able to fill the speed gap between the processor and the external memory, and data cache latency degrades processor performance. To hide this latency, the proposed method predicts the load address so that the data can be fetched earlier than the memory access stage. When the prediction is correct, the latency is hidden; even when the prediction is incorrect, performance is not degraded by any miss penalty. We have found that the prediction accuracy is 81.9% on average, improving performance by 6.6% on average and by a maximum of 12.1% for integer programs.
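
    A stride-based sketch of such an address prediction buffer is shown below: each load PC records its last address and stride, and the predicted next address lets the cache be probed speculatively. The table organization and stride policy are assumptions rather than the paper's exact design.

      class AddressPredictionBuffer:
          def __init__(self):
              self.table = {}  # pc -> (last_addr, stride)

          def predict(self, pc):
              if pc in self.table:
                  last, stride = self.table[pc]
                  return last + stride  # address for the speculative cache access
              return None               # no prediction: access normally

          def update(self, pc, actual_addr):
              if pc in self.table:
                  last, _ = self.table[pc]
                  self.table[pc] = (actual_addr, actual_addr - last)
              else:
                  self.table[pc] = (actual_addr, 0)

      apb = AddressPredictionBuffer()
      for addr in [0x1000, 0x1008, 0x1010, 0x1018]:  # strided array walk
          pred = apb.predict(pc=0x400)
          apb.update(pc=0x400, actual_addr=addr)
          print(hex(addr), "predicted:", hex(pred) if pred is not None else None)
      # A wrong or missing prediction falls back to a normal access, so no penalty.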