Keyword Search Result

[Keyword] cache (201 hits)

81-100 of 201 hits

  • An Adaptive Various-Width Data Cache for Low Power Design

    Jiongyao YE  Yu WAN  Takahiro WATANABE  

    PAPER-Computer System
    Vol: E94-D No:8  Page(s): 1539-1546

    Modern microprocessors employ caches to bridge the great speed gap between main memory and the central processing unit, but these caches consume a growing proportion of the total power. In fact, many values in a processor rarely need the full bit width supported by a cache: narrow-width values account for a large portion of cache accesses and storage. In view of these observations, this paper proposes an Adaptive Various-width Data Cache (AVDC), which exploits the prevalence of narrow-width values stored in the cache to reduce its power consumption. In AVDC, the data storage unit consists of three sub-arrays that store data of different widths. When the high-order sub-arrays are not used, they are shut down through a modified high-bit SRAM cell to save both their dynamic and static power. The main advantages of AVDC are: 1) both dynamic and static power consumption are reduced; 2) low power consumption is achieved by modifying only the data storage unit, with little additional hardware; 3) AVDC exploits the redundancy of narrow-width values rather than compressing them, so cache access latency does not increase. Experimental results using SPEC 2000 benchmarks show that the proposed AVDC reduces power consumption by 34.83% for dynamic power and 42.87% for static power on average, compared with a cache without AVDC.
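
    To make the narrow-width observation concrete, here is a minimal Python sketch (hypothetical, not the paper's circuit): a value whose upper bits are pure sign extension fits in a narrow sub-array, so the wider sub-arrays can stay powered down for it. The sub-array widths below are assumptions for illustration.

      # Hypothetical sketch: classify a 32-bit value by the narrowest
      # sub-array width that can hold it sign-extended, so the wider
      # sub-arrays can be power-gated for it. Widths are illustrative.
      SUB_ARRAY_WIDTHS = [8, 16, 32]   # assumed low/mid/high sub-arrays

      def required_width(value: int) -> int:
          """Narrowest sub-array width holding `value` as a signed,
          sign-extended 32-bit quantity."""
          v = value & 0xFFFFFFFF
          signed = v - (1 << 32) if v & 0x80000000 else v
          for w in SUB_ARRAY_WIDTHS:
              if -(1 << (w - 1)) <= signed <= (1 << (w - 1)) - 1:
                  return w
          return SUB_ARRAY_WIDTHS[-1]

      # Loop counters, flags, and small constants fit in 8 bits, so
      # only the low sub-array needs to be active for them.
      assert required_width(5) == 8
      assert required_width(-1) == 8
      assert required_width(40000) == 32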

  • Analysis before Starting an Access: A New Power-Efficient Instruction Fetch Mechanism

    Jiongyao YE  Yingtao HU  Hongfeng DING  Takahiro WATANABE  

    PAPER-Computer System
    Vol: E94-D No:7  Page(s): 1398-1408

    Power consumption has become an increasing concern in high-performance microprocessor design. The Instruction Cache (I-Cache), in particular, contributes a large portion of a microprocessor's total power consumption, since it is a complex unit that is accessed very frequently. Several low-power techniques have been proposed for power-efficient cache design. However, these techniques usually suffer from a restriction of traditional Instruction Fetch Unit (IFU) architectures: the fetch address is sent to the I-Cache as soon as it is available, so there is little opportunity to reduce power between address generation and the start of the access. In this paper, we present a new power-aware IFU architecture, named Analysis Before Starting an Access (ABSA), which aims to maximize the power efficiency of low-power designs by removing this restriction of the traditional IFU. To achieve this goal, ABSA reorganizes the IFU pipeline and carefully assigns tasks to each stage so that sufficient time and information are available to the low-power techniques before an access starts. The proposed design is fully scalable and its cost is low. Simulation results show that, compared to a conventional IFU design, ABSA saves about 30.3% of fetch power consumption on average. The I-Cache employed by ABSA reduces static and dynamic power consumption by about 85.63% and 66.92%, respectively, while the performance degradation is only about 0.97%.

  • A Novel Cache Replacement Policy via Dynamic Adaptive Insertion and Re-Reference Prediction

    Xi ZHANG  Chongmin LI  Zhenyu LIU  Haixia WANG  Dongsheng WANG  Takeshi IKENAGA  

    PAPER
    Vol: E94-C No:4  Page(s): 468-476

    Previous research illustrates that the LRU replacement policy is inefficient when applications exhibit a distant re-reference interval. The recently proposed RRIP policy improves performance for such workloads. However, RRIP lacks access recency information, which prevents the replacement policy from making accurate predictions. To enhance the robustness of RRIP for recency-friendly workloads, we propose a Dynamic Adaptive Insertion and Re-reference Prediction (DAI-RRP) policy, which evicts data based on both the re-reference prediction value and access recency information. DAI-RRP adaptively adjusts the insertion position and prediction value for different access patterns, which makes the policy robust across different workloads and different phases. Simulation results show that DAI-RRP outperforms LRU and RRIP. For a single-core processor with a 1 MB 16-way set-associative last-level cache (LLC), DAI-RRP reduces CPI over LRU and Dynamic RRIP (DRRIP) by an average of 8.1% and 2.7%, respectively. Evaluations on a quad-core CMP with a 4 MB shared LLC show that DAI-RRP outperforms LRU and DRRIP on the weighted speedup metric by an average of 8.1% and 15.7%, respectively. Furthermore, compared to LRU, DAI-RRP requires similar hardware for a 16-way cache, and even less hardware for higher-associativity caches. In summary, the proposed policy is practical and can be easily integrated into existing hardware approximations of LRU.
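
    For background, here is a minimal sketch of the baseline RRIP mechanism that DAI-RRP builds on (standard SRRIP with 2-bit re-reference prediction values; DAI-RRP's adaptive insertion and recency extensions are not reproduced here).

      # Minimal SRRIP sketch for one cache set: each block carries a
      # re-reference prediction value (RRPV); a hit promotes the block
      # to RRPV 0, and a miss evicts a block whose RRPV has reached the
      # maximum, aging all blocks until such a victim exists.
      MAX_RRPV = 3       # 2-bit RRPVs
      INSERT_RRPV = 2    # insert with a "long" re-reference prediction

      class RRIPSet:
          def __init__(self, ways: int):
              self.ways = ways
              self.rrpv = {}                     # tag -> RRPV

          def access(self, tag) -> bool:
              if tag in self.rrpv:               # hit
                  self.rrpv[tag] = 0
                  return True
              if len(self.rrpv) >= self.ways:    # miss in a full set
                  while MAX_RRPV not in self.rrpv.values():
                      for t in self.rrpv:        # age until a victim
                          self.rrpv[t] += 1      # reaches MAX_RRPV
                  victim = next(t for t, r in self.rrpv.items()
                                if r == MAX_RRPV)
                  del self.rrpv[victim]
              self.rrpv[tag] = INSERT_RRPV
              return False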

  • Cache Based Motion Compensation Architecture for Quad-HD H.264/AVC Video Decoder

    Jinjia ZHOU  Dajiang ZHOU  Gang HE  Satoshi GOTO  

    PAPER
    Vol: E94-C No:4  Page(s): 439-447

    In this paper, we present a cache-based motion compensation (MC) architecture for a Quad-HD H.264/AVC video decoder. With the significantly increased throughput requirement, VLSI design for MC is greatly challenged by huge area cost and power consumption. Moreover, the long memory-system latency degrades the performance of the MC pipeline. To solve these problems, three optimization schemes are proposed in this work. First, a high-performance interpolator based on Horizontal-Vertical Expansion and Luma-Chroma Parallelism (HVE-LCP) increases the processing throughput to more than 4 times that of previous designs. Second, an efficient cache memory organization scheme (4S×4) improves on-chip memory utilization, contributing a memory area saving of 25% and a memory power saving of 39-49%. Finally, by employing a Split Task Queue (STQ) architecture, the cache system can tolerate a much longer memory-system latency. Consequently, cache idle time is reduced by 90%, which cuts the overall processing time by 24-40%. When implemented in a SMIC 90 nm process, this design costs 108.8 k logic gates and 3.1 kB of on-chip memory. The proposed MC architecture supports real-time processing of 3840×2160@60 fps video at less than 166 MHz.

  • Short Term Cell-Flipping Technique for Mitigating SNM Degradation Due to NBTI

    Yuji KUNITAKE  Toshinori SATO  Hiroto YASUURA  

    PAPER
    Vol: E94-C No:4  Page(s): 520-529

    Negative Bias Temperature Instability (NBTI) is one of the major reliability problems in advanced technologies. NBTI causes a threshold voltage shift in a PMOS transistor: while the transistor is negatively biased, its threshold voltage shifts negatively, and it recovers while the transistor is positively biased. In an SRAM cell, NBTI degrades the threshold voltage of the load PMOS transistors. This degradation impacts the Static Noise Margin (SNM), a measure of the read stability of a 6-T SRAM cell. In this paper, we discuss the relationship between NBTI degradation in an SRAM cell and the dynamic stress and recovery conditions. Two characteristics are important. One is the stress probability, defined as the fraction of time the PMOS transistor is negatively biased. The other is the stress and recovery cycle, defined as the switching interval of the SRAM value. According to our observations, to mitigate NBTI degradation, the stress probability should be small and the stress and recovery cycle should be shorter than 10 msec. Based on these observations, we propose a novel cell-flipping technique that brings the stress probability close to 50%. In addition, we present case studies that apply the cell-flipping technique to register files and cache memories.
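
    The flipping idea can be illustrated abstractly (a hypothetical sketch, not the paper's circuit): if the stored bits are inverted at intervals shorter than the 10 msec cycle noted above, each load PMOS holds 0 and 1 for roughly equal time, pushing the stress probability toward 50%. The word width and flip-flag bookkeeping below are assumptions.

      # Hypothetical sketch: a stored word plus a flip flag. Periodically
      # inverting the stored bits (and toggling the flag) lets each cell
      # hold 0 and 1 for roughly equal time; readers XOR with the flag
      # to recover the logical value unchanged.
      class FlippedCell:
          MASK = 0xFFFFFFFF            # 32-bit word, illustrative

          def __init__(self, value: int):
              self.stored = value & self.MASK
              self.flipped = False

          def periodic_flip(self):
              # assumed to be driven by a timer shorter than 10 ms
              self.stored ^= self.MASK
              self.flipped = not self.flipped

          def read(self) -> int:
              return self.stored ^ (self.MASK if self.flipped else 0)

      cell = FlippedCell(0x0000BEEF)
      cell.periodic_flip()
      assert cell.read() == 0x0000BEEF   # logical value is unchanged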

  • Dynamic Program Behavior Identification for High Performance CMPs with Private LLCs

    Xiaomin JIA  Pingjing LU  Caixia SUN  Minxuan ZHANG  

    PAPER
    Vol: E93-D No:12  Page(s): 3211-3222

    Chip Multi-Processors (CMPs) have emerged as a mainstream architectural design for high-performance parallel and distributed computing. Last Level Cache (LLC) management is critical to CMPs because off-chip accesses incur a long latency. With its short access latency, good performance isolation, and easy scalability, a private cache is an attractive design alternative for the LLC of a CMP. This paper proposes program Behavior Identification-based Cache Sharing (BICS) for LLC management. BICS is based on a private cache organization for its shorter access latency, while approximating a shared cache organization by allowing blocks evicted from one private LLC to be saved in peer LLCs, a technique called spilling. BICS identifies the cache behavior type of each application at runtime. When a cache block is evicted from a private LLC, the cache behavior characteristics of the local application are evaluated to determine whether the block should be spilled. Spilled blocks are allowed to replace valid blocks of the peer LLCs as long as the interference remains within a reasonable level. Experimental results using a full-system CMP simulator show that BICS improves the overall throughput by as much as 14.5%, 12.6%, 11.0% and 11.7% (on average 8.8%, 4.8%, 4.0% and 6.8%) over a private cache, a shared cache, the Utility-based Cache Partitioning (UCP) scheme and the baseline spilling-based organization, Cooperative Caching (CC), respectively, on a 4-core CMP for the SPEC CPU2006 benchmarks.
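
    A hedged sketch of the spill decision (function name and thresholds are hypothetical; the paper's runtime behavior identification is richer):

      # Hypothetical sketch: a block evicted from one private LLC is
      # spilled to a peer LLC only if past spilled blocks were actually
      # re-referenced often enough, and the estimated interference
      # imposed on the peer stays within a tolerable level.
      def should_spill(local_reuse_rate: float, peer_miss_increase: float,
                       reuse_threshold: float = 0.2,
                       interference_limit: float = 0.05) -> bool:
          """local_reuse_rate: fraction of spilled blocks re-referenced.
          peer_miss_increase: estimated extra miss rate at the peer.
          Thresholds are illustrative, not from the paper."""
          return (local_reuse_rate >= reuse_threshold
                  and peer_miss_increase <= interference_limit)

      print(should_spill(0.35, 0.01))   # True: reuse pays, low interference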

  • PAW: A Pattern-Aware Write Policy for a Flash Non-volatile Cache

    Young-Jin KIM  Jihong KIM  Jeong-Bae LEE  Kee-Wook RIM  

    PAPER-Software System
    Vol: E93-D No:11  Page(s): 3017-3026

    In disk-based storage systems, non-volatile write caches have been widely used to reduce write latency and to ensure data consistency at the storage-controller level. A write cache policy should consider which data is worth caching and evicting, and should also take into account the real I/O characteristics of the non-volatile device. However, existing work has mainly focused on improving basic cache operations and has not properly considered the I/O cost of the non-volatile device. In this paper, we propose PAW, a pattern-aware write cache policy for a NAND flash memory in disk-based mobile storage systems. PAW is designed for the mix of many sequential accesses and fewer non-sequential ones found in mobile storage systems, redirecting the non-sequential accesses to the NAND flash memory and the sequential ones to the disk. In addition, PAW combines the pattern-aware write cache policy with an I/O clustering-based queuing method to strengthen sequentiality, with the aim of reducing overall system I/O latency. For evaluation, we have built a practical hard disk simulator with a non-volatile NAND flash cache. Experimental results show that our policy significantly improves overall I/O performance by considerably reducing the overhead of the non-volatile cache compared with a traditional policy, while achieving high energy efficiency.
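
    The routing idea can be sketched as follows (hypothetical; the paper's pattern detector and I/O clustering are more elaborate): a request that continues the previous one counts as sequential and goes to the disk, while scattered requests go to the flash cache.

      # Hypothetical sketch: route writes by access pattern. A request
      # starting at the sector right after the previous request counts
      # as sequential and goes to the disk; others go to NAND flash.
      class PatternRouter:
          def __init__(self):
              self.next_expected = None

          def route(self, start_sector: int, length: int) -> str:
              sequential = (start_sector == self.next_expected)
              self.next_expected = start_sector + length
              return "disk" if sequential else "flash"

      r = PatternRouter()
      print(r.route(100, 8))   # flash (no history yet)
      print(r.route(108, 8))   # disk  (continues the previous request)
      print(r.route(500, 4))   # flash (non-sequential)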

  • An Empirical Study of FTL Performance in Conjunction with File System Pursuing Data Integrity

    In Hwan DOH  Myoung Sub SHIM  Eunsam KIM  Jongmoo CHOI  Donghee LEE  Sam H. NOH  

    LETTER-Software System
    Vol: E93-D No:8  Page(s): 2302-2305

    Because Flash storage, the dominant form of portable storage, is detachable, the integrity of data stored on it is an important issue. This study considers the performance of Flash Translation Layer (FTL) schemes embedded in Flash storage in conjunction with file systems that pursue high data integrity. To assure strict data integrity, such file systems synchronously write all file data to storage, which produces hot write references. In this study, we concentrate on the effect of these hot write references on Flash storage, and we consider how absorbing them in a nonvolatile write cache affects the performance of the FTL schemes. Specifically, we quantify the performance of typical FTL schemes for a realistic digital-camera workload that contains hot write references, through experiments on a real system. The results show that, for a workload with hot write references, FTL performance does not conform to previously reported studies. We also conclude that the impact of the underlying FTL scheme on Flash storage performance is dramatically reduced by absorbing the hot write references in a nonvolatile write cache.

  • Access-Driven Cache Attack on the Stream Cipher DICING Using the Chosen IV

    Yukiyasu TSUNOO  Takeshi KAWABATA  Tomoyasu SUZAKI  Hiroyasu KUBO  Teruo SAITO  

    PAPER-Cryptography and Information Security
    Vol: E93-A No:4  Page(s): 799-807

    A cache attack against DICING is presented. Cache attacks use CPU cache miss and hit information as side-channel information. DICING is a stream cipher that was proposed to eSTREAM, and no effective attack on it had been reported before. Because DICING uses a key-dependent S-box and there is no key addition before the first S-box layer, a conventional cache attack is considered difficult. We therefore investigated an access-driven cache attack that exploits the special features of the transformation L to give chosen IVs. We also investigated how to reduce the computational complexity required to obtain the secret key from the information gained in the cache attack. We were able to obtain a 40-bit key differential from a total of 2^18 chosen IVs on a Pentium III processor. From the obtained key differential, the 128-bit secret key could be recovered with a computational complexity of 2^49 to 2^63. This result shows that the new cache attack, which is based on a different attack model, is also applicable in an actual environment.

  • A High Performance and Low Bandwidth Multi-Standard Motion Compensation Design for HD Video Decoder

    Xianmin CHEN  Peilin LIU  Dajiang ZHOU  Jiayi ZHU  Xingguang PAN  Satoshi GOTO  

    PAPER
    Vol: E93-C No:3  Page(s): 253-260

    Motion compensation is widely used in many video coding standards. Due to its bandwidth requirement and complexity, motion compensation is one of the most challenging parts in the design of a high-definition video decoder. In this paper, we propose a high-performance, low-bandwidth motion compensation design that supports the H.264/AVC, MPEG-1/2 and Chinese AVS standards. We introduce a 2-dimensional cache that greatly reduces the external bandwidth requirement, and we exploit similarities among the three standards to reduce hardware cost. We also propose a block-pipelining strategy to hide the long latency of external memory accesses. Experimental results show that our motion compensation design reduces bandwidth by 74% on average and can decode a 1920x1088@30 fps video stream in real time at 80 MHz.

  • Architectures and Technologies for the Future Mobile Internet Open Access

    Dipankar RAYCHAUDHURI  

    INVITED LETTER
    Vol: E93-B No:3  Page(s): 436-441

    This position paper outlines the author's view on architectural directions and key technology enablers for the future mobile Internet. It is pointed out that mobile and wireless services will dominate Internet usage in the near future, and it is therefore important to design next-generation network protocols with features suitable for efficiently serving emerging wireless scenarios and applications. Several key requirements for mobile/wireless scenarios are identified - these include new capabilities such as dynamic spectrum coordination, cross-layer support, disconnection tolerant routing, content addressing, and location awareness. Specific examples of enabling technologies which address some of these requirements are given from ongoing research projects at WINLAB. Topics covered briefly include wireless network virtualization, the cache-and-forward (CNF) protocol, geographic (GEO) protocol stack, cognitive radio protocols, and open networking testbeds.

  • A Two-Level Cache Design Space Exploration System for Embedded Applications

    Nobuaki TOJO  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

    PAPER-Embedded, Real-Time and Reconfigurable Systems
    Vol: E92-A No:12  Page(s): 3238-3247

    Recently, a two-level cache hierarchy, consisting of an L1 cache and an L2 cache, has become common in processors. Particularly in an embedded system, where a single application or a class of applications is repeatedly executed on the processor, the cache configuration can be customized until an optimal one is achieved. An optimal two-level cache configuration, minimizing overall memory access time or memory energy consumption, can be obtained by varying three cache parameters for each of the L1 and L2 caches: the number of sets, the line size, and the associativity. In this paper, we first extend an L1 cache simulation algorithm so that it can explore two-level cache configurations. Second, we propose two two-level cache design space exploration algorithms, CRCB-T1 and CRCB-T2, each of which applies the Cache Inclusion Property to two-level cache configurations. Each of the proposed algorithms realizes exact cache simulation while decreasing the number of cache hit/miss judgments by a factor of several thousand. Experimental results show that, using our approach, the number of cache hit/miss judgments required to optimize a cache configuration is reduced to 1/50-1/5500 of the exhaustive approach. As a result, our approach runs an average of 1398.25 times faster than the exhaustive approach, achieving, to our knowledge, the world's fastest two-level cache design space exploration.
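
    To give a feel for the design space being pruned, a small sketch (parameter ranges are assumptions, not the paper's): every combination of set count, line size, and associativity for L1 and L2 is a candidate, so the space grows multiplicatively and exhaustive simulation quickly becomes expensive.

      # Illustrative enumeration of a two-level cache design space.
      from itertools import product

      sets   = [64, 128, 256, 512]     # assumed ranges
      lines  = [16, 32, 64]            # bytes
      assocs = [1, 2, 4, 8]

      l1_configs = list(product(sets, lines, assocs))
      l2_configs = list(product(sets, lines, assocs))
      pairs = list(product(l1_configs, l2_configs))

      # 48 x 48 = 2304 two-level candidates even for these small ranges;
      # without pruning, each needs a full trace simulation, which is
      # why reusing the Cache Inclusion Property across configurations
      # pays off.
      print(len(pairs))                # 2304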

  • An L1 Cache Design Space Exploration System for Embedded Applications

    Nobuaki TOJO  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

    PAPER-VLSI Design Technology and CAD
    Vol: E92-A No:6  Page(s): 1442-1453

    In an embedded system where a single application or a class of applications is repeatedly executed on a processor, the cache configuration can be customized until an optimal one is achieved. An optimal cache configuration that minimizes overall memory access time can be obtained by varying three cache parameters: the number of sets, the line size, and the associativity. In this paper, we first propose two cache simulation algorithms, CRCB1 and CRCB2, based on the Cache Inclusion Property. They realize exact cache simulation while dramatically decreasing the number of cache hit/miss judgments. We further propose three cache design space exploration algorithms, CRMF1, CRMF2, and CRMF3, based on our experimental observations. They can find an almost optimal cache configuration from the viewpoint of access time. Using our approach, the number of cache hit/miss judgments required to optimize a cache configuration is reduced to 1/10-1/50 of that of conventional approaches. As a result, our approach runs an average of 3.2 times and a maximum of 5.3 times faster than the fastest approach proposed so far, achieving, to our knowledge, the world's fastest cache design space exploration for optimizing total memory access time.
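
    The basic hit/miss judgment whose repetition these algorithms avoid is the classic set-associative lookup; a minimal LRU cache simulator for a single configuration might look like this (a generic sketch, not the CRCB/CRMF algorithms themselves).

      # Generic set-associative LRU cache simulator: one hit/miss
      # judgment per address. CRCB/CRMF reduce how many judgments are
      # needed across configurations; the per-access check itself is
      # the standard one below.
      from collections import OrderedDict

      class Cache:
          def __init__(self, n_sets: int, line_size: int, assoc: int):
              self.n_sets, self.line_size, self.assoc = n_sets, line_size, assoc
              self.sets = [OrderedDict() for _ in range(n_sets)]

          def access(self, addr: int) -> bool:
              block = addr // self.line_size
              s = self.sets[block % self.n_sets]
              if block in s:               # hit: refresh LRU position
                  s.move_to_end(block)
                  return True
              if len(s) >= self.assoc:     # miss: evict the LRU block
                  s.popitem(last=False)
              s[block] = True
              return False

      c = Cache(n_sets=256, line_size=32, assoc=4)
      hits = sum(c.access(a) for a in [0, 4, 64, 0, 4096, 0])
      print(hits)   # 3: addresses 0 and 4 share a line; repeats hit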

  • An Effective Self-Adaptive Admission Control Algorithm for Large Web Caches

    Chul-Woong YANG  Ki Yong LEE  Yon Dohn CHUNG  Myoung Ho KIM  Yoon-Joon LEE  

    LETTER-Contents Technology and Web Information Systems
    Vol: E92-D No:4  Page(s): 732-735

    In this paper, we propose an effective Web cache admission control algorithm. By selectively admitting objects into the cache, the proposed algorithm significantly reduces the amount of disk I/O on a Web cache while maintaining a high hit ratio. The algorithm adaptively adjusts its own admission control parameter and requires no user-supplied parameters. Through extensive experiments, we show the effectiveness of the proposed algorithm.
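
    One plausible shape for such a self-adaptive admission control, purely as a hypothetical sketch (the paper's actual parameter and update rule may differ): admit an object only once it has proven some worth, and nudge the threshold according to observed outcomes.

      # Hypothetical sketch: admit an object once it has been requested
      # enough times; raise the threshold when admissions rarely lead to
      # hits (wasting disk I/O), lower it when they often do. Constants
      # are illustrative.
      class AdaptiveAdmission:
          def __init__(self):
              self.threshold = 2.0    # requests seen before admission
              self.step = 0.1

          def admit(self, request_count: int) -> bool:
              return request_count >= self.threshold

          def feedback(self, admitted_object_was_hit: bool):
              # self-tuning: no user-supplied parameter at runtime
              if admitted_object_was_hit:
                  self.threshold = max(1.0, self.threshold - self.step)
              else:
                  self.threshold += self.step

      ac = AdaptiveAdmission()
      print(ac.admit(3))    # True under the initial threshold of 2.0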

  • XIR: Efficient Cache Invalidation Strategies for XML Data in Wireless Environments

    Jae-Ho CHOI  Sang-Hyun PARK  Myong-Soo LEE  SangKeun LEE  

    PAPER-Broadcast Systems
    Vol: E92-B No:4  Page(s): 1337-1345

    With the growth of wireless computing and the popularity of the eXtensible Markup Language (XML), wireless XML data management is emerging as an important research area. This paper addresses cache invalidation for XML updates in wireless computing environments. A family of XML cache invalidation strategies, called S-XIR, D-XIR and E-XIR, is proposed. With S-XIR and D-XIR, when only the structure of the XML data changes, the unchanged parts can be effectively reused in client caching; E-XIR, which uses prefetching, further improves access time. Simulations carried out to evaluate the proposed methodology show that the strategies improve both tuning time and access time significantly. In particular, the proposed strategies are on average about 4 to 12 times better than the previous approach in terms of tuning time.

  • A Way Enabling Mechanism Based on the Branch Prediction Information for Low Power Instruction Cache

    Gi-Ho PARK  Jung-Wook PARK  Hoi-Jin LEE  Gunok JUNG  Sung-Bae PARK  Shin-Dug KIM  

    LETTER
    Vol: E92-C No:4  Page(s): 517-521

    This paper presents a cache way enabling mechanism that uses branch target addresses. The mechanism uses branch prediction information to avoid the power consumed by unnecessary cache way accesses, enabling only the cache way(s) that actually need to be accessed. The proposed mechanism reduces the power consumption of the instruction cache by 63% without any performance degradation of the processor. An ARM1136 processor simulator and Synopsys PrimeTime are used for the performance/power simulation and the static timing analysis of the proposed mechanism, respectively.
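
    A hedged sketch of the mechanism as described (the way-prediction table and sizes are assumptions): when the branch predictor supplies a target address early, a small table can indicate which way holds the target line, so only that way is enabled on the access.

      # Hypothetical sketch: use the predicted branch target to enable
      # only the I-cache way expected to hold the fetch line, instead of
      # reading all ways in parallel. A tiny way table is assumed.
      N_WAYS, LINE = 4, 32

      way_table = {}   # fetch-line address -> way it was filled into

      def ways_to_enable(fetch_addr: int, predicted: bool):
          line_addr = fetch_addr // LINE
          if predicted and line_addr in way_table:
              return [way_table[line_addr]]   # enable a single way
          return list(range(N_WAYS))          # fall back: enable all

      def record_fill(fetch_addr: int, way: int):
          way_table[fetch_addr // LINE] = way

      record_fill(0x4000, 2)
      print(ways_to_enable(0x4000, predicted=True))    # [2]
      print(ways_to_enable(0x8000, predicted=True))    # [0, 1, 2, 3]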

  • A Remedy for Network Operators against Increasing P2P Traffic: Enabling Packet Cache for P2P Applications Open Access

    Akihiro NAKAO  Kengo SASAKI  Shu YAMAMOTO  

    INVITED PAPER
    Vol: E91-B No:12  Page(s): 3810-3820

    We observe that P2P traffic has peculiar characteristics compared with other types of traffic such as web browsing and file transfer. Because P2P applications exploit the swarm effect -- a multitude of end points downloading the same content piece by piece at nearly the same time, which increases the effectiveness of caching -- the same pieces of data traverse the network over and over again, mostly within a short time window. In light of this observation, we propose network-layer packet-level caching to reduce the volume of emerging P2P traffic transparently to the P2P applications -- without affecting their operation at all -- rather than banning it, restricting it, or modifying the P2P systems themselves. Unlike other caching techniques, we aim to provide as generic a caching mechanism as possible at the network layer -- without detailed knowledge of P2P application protocols -- so that it applies to arbitrary P2P protocols. Our preliminary evaluation shows that our approach can be expected to reduce a significant amount of P2P traffic transparently to P2P applications.
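
    The protocol-agnostic part of the idea can be sketched simply (hypothetical; the paper's mechanism operates inside the network, not at end hosts): key the cache on a digest of the payload bytes themselves, so identical P2P pieces are recognized without parsing any application protocol.

      # Hypothetical sketch of protocol-agnostic packet-level caching:
      # the cache is keyed by a digest of the payload bytes, so repeated
      # P2P pieces are detected without understanding the P2P protocol.
      import hashlib
      from collections import OrderedDict

      class PacketCache:
          def __init__(self, capacity: int = 1024):
              self.store = OrderedDict()
              self.capacity = capacity

          def lookup_or_insert(self, payload: bytes) -> bool:
              key = hashlib.sha256(payload).digest()
              if key in self.store:
                  self.store.move_to_end(key)
                  return True                  # hit: payload seen before
              if len(self.store) >= self.capacity:
                  self.store.popitem(last=False)
              self.store[key] = payload
              return False

      pc = PacketCache()
      piece = b"chunk-of-shared-content"
      print(pc.lookup_or_insert(piece))   # False: first sighting
      print(pc.lookup_or_insert(piece))   # True: the swarm effect pays off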

  • Cache Optimization for H.264/AVC Motion Compensation

    Sangyong YOON  Soo-Ik CHAE  

    LETTER-Image Processing and Video Processing
    Vol: E91-D No:12  Page(s): 2902-2905

    In this letter, we propose a cache organization that substantially reduces the memory bandwidth of motion compensation (MC) in H.264/AVC decoders. To reduce duplicated memory accesses to P and B pictures, we employ a four-way set-associative cache whose index bits are composed of horizontal and vertical address bits of the frame buffer and whose lines each store 8×2 pixels of the reference frames. Moreover, we alleviate the data fragmentation problem by choosing a line size equal to the minimum access size of the DDR SDRAM. Averaged over five QCIF IBBP image sequences, the optimized cache requires only 129% of the essential bandwidth of H.264/AVC MC.
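
    The index construction can be illustrated with a sketch (the 8×2 line shape is from the abstract; the bit widths are assumptions): composing the set index from low-order horizontal and vertical line coordinates keeps spatially neighboring regions in different sets, which suits the 2-D access pattern of MC.

      # Hypothetical sketch: build a set index from horizontal and
      # vertical frame-buffer coordinates so neighboring 8x2-pixel
      # regions land in different cache sets. Bit widths are assumed.
      LINE_W, LINE_H = 8, 2        # 8x2 pixels per line (from abstract)
      X_BITS, Y_BITS = 3, 3        # index bits per axis (assumed)

      def set_index(x: int, y: int) -> int:
          bx, by = x // LINE_W, y // LINE_H      # line coordinates
          return (((by & ((1 << Y_BITS) - 1)) << X_BITS)
                  | (bx & ((1 << X_BITS) - 1)))

      # Horizontally and vertically adjacent lines get distinct sets.
      print(set_index(0, 0), set_index(8, 0), set_index(0, 2))   # 0 1 8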

  • DAC: A Device-Aware Cache Management Algorithm for Heterogeneous Mobile Storage Systems

    Young-Jin KIM  Jihong KIM  

    PAPER-System Programs
    Vol: E91-D No:12  Page(s): 2818-2833

    In recent years, heterogeneous devices have frequently been employed in mobile storage systems, because a combination of such devices can provide a synergistic storage solution that takes advantage of each device. One important design constraint in heterogeneous storage systems is mitigating the I/O performance degradation that stems from the different access times of the different devices. To this end, little work has been done on proper buffer cache management algorithms. This paper presents a novel buffer cache management algorithm that considers both the I/O cost per device and the workload patterns in mobile computing systems with a heterogeneous storage pair of a hard disk and a NAND flash memory. To minimize the total I/O cost under varying workload patterns, the proposed algorithm employs dynamic cache partitioning over the different devices and manages each partition according to request patterns and I/O types along with temporal locality. Trace-based simulations show that the proposed algorithm significantly reduces the total I/O cost and the flash write count over existing buffer cache algorithms on typical mobile traces.
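
    A hedged sketch of the cost-awareness idea (device costs and the scoring rule are illustrative, not the paper's algorithm): when choosing a victim, weight each candidate's coldness by the I/O cost of re-fetching it from its backing device.

      # Hypothetical sketch of device-aware eviction: score each buffer
      # cache candidate by its age weighted against the re-fetch cost of
      # its backing device, so cheap-to-refetch blocks go first. The
      # cost numbers are illustrative, not measurements.
      IO_COST = {"disk": 10.0, "flash_read": 1.0, "flash_write": 5.0}

      def eviction_victim(candidates):
          """candidates: (block_id, age, device) tuples; larger age
          means colder. Evict the coldest, cheapest-to-refetch block."""
          def score(c):
              _, age, device = c
              return age / IO_COST[device]   # high age, low cost first
          return max(candidates, key=score)[0]

      print(eviction_victim([("a", 100, "disk"),
                             ("b", 80, "flash_read"),
                             ("c", 90, "flash_write")]))   # "b"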

  • Way-Scaling to Reduce Power of Cache with Delay Variation

    Maziar GOUDARZI  Tadayuki MATSUMURA  Tohru ISHIHARA  

    PAPER-High-Level Synthesis and System-Level Design
    Vol: E91-A No:12  Page(s): 3576-3584

    The share of leakage in cache power consumption increases with technology scaling. Choosing a higher threshold voltage (Vth) and/or gate-oxide thickness (Tox) for cache transistors reduces leakage but impacts cell delay. We show that, due to uncorrelated random within-die delay variation, only some (not all) of the cells actually violate the cache delay after this change. We propose adding a spare cache way to replace the delay-violating cache lines separately in each cache set. SPICE and gate-level simulations in a commercial 90 nm process show that choosing higher Vth and Tox and adding one spare way to a 4-way 16 KB cache reduces leakage power by 42%, which, depending on the share of leakage in total cache power, gives up to 22.59% and 41.37% reduction of total energy in the L1 instruction cache and L2 unified cache respectively, with a negligible delay penalty and without sacrificing cache capacity or timing yield.
