Keyword Search Result

[Keyword] cache (201 hits)

81-100 of 201 hits

  • An Adaptive Various-Width Data Cache for Low Power Design

    Jiongyao YE  Yu WAN  Takahiro WATANABE  

    PAPER-Computer System
    Vol: E94-D No:8  Page(s): 1539-1546

    Modern microprocessors employ caches to bridge the great speed gap between main memory and the central processing unit, but these caches consume a growing proportion of the total power. In fact, many values in a processor rarely need the full bit width supported by a cache: narrow-width values account for a large portion of cache accesses and storage. In view of these observations, this paper proposes an Adaptive Various-width Data Cache (AVDC), which exploits the prevalence of narrow-width values stored in the cache to reduce its power consumption. In AVDC, the data storage unit consists of three sub-arrays that store data of different widths. When the high-order sub-arrays are not used, they are shut down through a modified high-bit SRAM cell to save both their dynamic and static power. The main advantages of AVDC are: 1) both dynamic and static power consumption are reduced; 2) low power consumption is achieved by modifying only the data storage unit, with little additional hardware; 3) AVDC exploits the redundancy of narrow-width values rather than compressing them, so cache access latency does not increase. Experimental results using SPEC 2000 benchmarks show that the proposed AVDC reduces power consumption by 34.83% for dynamic power and 42.87% for static power on average, compared with a cache without AVDC.
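
    To make the narrow-width observation concrete, here is a minimal Python sketch (hypothetical, not the paper's circuit): a value whose upper bits are pure sign extension fits in a narrow sub-array, so the wider sub-arrays can stay powered down for it. The sub-array widths below are assumptions for illustration.

      # Hypothetical sketch: classify a 32-bit value by the narrowest
      # sub-array width that can hold it sign-extended, so the wider
      # sub-arrays can be power-gated for it. Widths are illustrative.
      SUB_ARRAY_WIDTHS = [8, 16, 32]   # assumed low/mid/high sub-arrays

      def required_width(value: int) -> int:
          """Narrowest sub-array width holding `value` as a signed,
          sign-extended 32-bit quantity."""
          v = value & 0xFFFFFFFF
          signed = v - (1 << 32) if v & 0x80000000 else v
          for w in SUB_ARRAY_WIDTHS:
              if -(1 << (w - 1)) <= signed <= (1 << (w - 1)) - 1:
                  return w
          return SUB_ARRAY_WIDTHS[-1]

      # Loop counters, flags, and small constants fit in 8 bits, so
      # only the low sub-array needs to be active for them.
      assert required_width(5) == 8
      assert required_width(-1) == 8
      assert required_width(40000) == 32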

  • Analysis before Starting an Access: A New Power-Efficient Instruction Fetch Mechanism

    Jiongyao YE  Yingtao HU  Hongfeng DING  Takahiro WATANABE  

    PAPER-Computer System
    Vol: E94-D No:7  Page(s): 1398-1408

    Power consumption has become an increasing concern in high-performance microprocessor design. The Instruction Cache (I-Cache), in particular, contributes a large portion of a microprocessor's total power consumption, since it is a complex unit that is accessed very frequently. Several low-power techniques have been proposed for power-efficient cache design. However, these techniques usually suffer from a restriction of traditional Instruction Fetch Unit (IFU) architectures: the fetch address is sent to the I-Cache as soon as it is available, so there is little opportunity to reduce power between address generation and the start of the access. In this paper, we present a new power-aware IFU architecture, named Analysis Before Starting an Access (ABSA), which aims to maximize the power efficiency of low-power designs by removing this restriction of the traditional IFU. To achieve this goal, ABSA reorganizes the IFU pipeline and carefully assigns tasks to each stage so that sufficient time and information are available to the low-power techniques before an access starts. The proposed design is fully scalable and its cost is low. Simulation results show that, compared to a conventional IFU design, ABSA saves about 30.3% of fetch power consumption on average. The I-Cache employed by ABSA reduces static and dynamic power consumption by about 85.63% and 66.92%, respectively, while the performance degradation is only about 0.97%.

  • A Novel Cache Replacement Policy via Dynamic Adaptive Insertion and Re-Reference Prediction

    Xi ZHANG  Chongmin LI  Zhenyu LIU  Haixia WANG  Dongsheng WANG  Takeshi IKENAGA  

    PAPER
    Vol: E94-C No:4  Page(s): 468-476

    Previous research illustrates that the LRU replacement policy is inefficient when applications exhibit a distant re-reference interval. The recently proposed RRIP policy improves performance for such workloads. However, RRIP lacks access recency information, which prevents the replacement policy from making accurate predictions. To enhance the robustness of RRIP for recency-friendly workloads, we propose a Dynamic Adaptive Insertion and Re-reference Prediction (DAI-RRP) policy, which evicts data based on both the re-reference prediction value and access recency information. DAI-RRP adaptively adjusts the insertion position and prediction value for different access patterns, which makes the policy robust across different workloads and different phases. Simulation results show that DAI-RRP outperforms LRU and RRIP. For a single-core processor with a 1 MB 16-way set-associative last-level cache (LLC), DAI-RRP reduces CPI over LRU and Dynamic RRIP (DRRIP) by an average of 8.1% and 2.7%, respectively. Evaluations on a quad-core CMP with a 4 MB shared LLC show that DAI-RRP outperforms LRU and DRRIP on the weighted speedup metric by an average of 8.1% and 15.7%, respectively. Furthermore, compared to LRU, DAI-RRP requires similar hardware for a 16-way cache, and even less hardware for higher-associativity caches. In summary, the proposed policy is practical and can be easily integrated into existing hardware approximations of LRU.
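
    For background, here is a minimal sketch of the baseline RRIP mechanism that DAI-RRP builds on (standard SRRIP with 2-bit re-reference prediction values; DAI-RRP's adaptive insertion and recency extensions are not reproduced here).

      # Minimal SRRIP sketch for one cache set: each block carries a
      # re-reference prediction value (RRPV); a hit promotes the block
      # to RRPV 0, and a miss evicts a block whose RRPV has reached the
      # maximum, aging all blocks until such a victim exists.
      MAX_RRPV = 3       # 2-bit RRPVs
      INSERT_RRPV = 2    # insert with a "long" re-reference prediction

      class RRIPSet:
          def __init__(self, ways: int):
              self.ways = ways
              self.rrpv = {}                     # tag -> RRPV

          def access(self, tag) -> bool:
              if tag in self.rrpv:               # hit
                  self.rrpv[tag] = 0
                  return True
              if len(self.rrpv) >= self.ways:    # miss in a full set
                  while MAX_RRPV not in self.rrpv.values():
                      for t in self.rrpv:        # age until a victim
                          self.rrpv[t] += 1      # reaches MAX_RRPV
                  victim = next(t for t, r in self.rrpv.items()
                                if r == MAX_RRPV)
                  del self.rrpv[victim]
              self.rrpv[tag] = INSERT_RRPV
              return False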

  • Cache Based Motion Compensation Architecture for Quad-HD H.264/AVC Video Decoder

    Jinjia ZHOU  Dajiang ZHOU  Gang HE  Satoshi GOTO  

    PAPER
    Vol: E94-C No:4  Page(s): 439-447

    In this paper, we present a cache-based motion compensation (MC) architecture for a Quad-HD H.264/AVC video decoder. With the significantly increased throughput requirement, VLSI design for MC is greatly challenged by huge area cost and power consumption. Moreover, the long memory-system latency degrades the performance of the MC pipeline. To solve these problems, three optimization schemes are proposed in this work. First, a high-performance interpolator based on Horizontal-Vertical Expansion and Luma-Chroma Parallelism (HVE-LCP) increases the processing throughput to more than 4 times that of previous designs. Second, an efficient cache memory organization scheme (4S×4) improves on-chip memory utilization, contributing a memory area saving of 25% and a memory power saving of 39-49%. Finally, by employing a Split Task Queue (STQ) architecture, the cache system can tolerate a much longer memory-system latency. Consequently, cache idle time is reduced by 90%, which cuts the overall processing time by 24-40%. When implemented in a SMIC 90 nm process, this design costs 108.8 k logic gates and 3.1 kB of on-chip memory. The proposed MC architecture supports real-time processing of 3840×2160@60 fps video at less than 166 MHz.

  • Short Term Cell-Flipping Technique for Mitigating SNM Degradation Due to NBTI

    Yuji KUNITAKE  Toshinori SATO  Hiroto YASUURA  

    PAPER
    Vol: E94-C No:4  Page(s): 520-529

    Negative Bias Temperature Instability (NBTI) is one of the major reliability problems in advanced technologies. NBTI causes a threshold voltage shift in a PMOS transistor: while the transistor is negatively biased, its threshold voltage shifts negatively, and it recovers while the transistor is positively biased. In an SRAM cell, NBTI degrades the threshold voltage of the load PMOS transistors. This degradation impacts the Static Noise Margin (SNM), a measure of the read stability of a 6-T SRAM cell. In this paper, we discuss the relationship between NBTI degradation in an SRAM cell and the dynamic stress and recovery conditions. Two characteristics are important. One is the stress probability, defined as the fraction of time the PMOS transistor is negatively biased. The other is the stress and recovery cycle, defined as the switching interval of the SRAM value. According to our observations, to mitigate NBTI degradation, the stress probability should be small and the stress and recovery cycle should be shorter than 10 msec. Based on these observations, we propose a novel cell-flipping technique that brings the stress probability close to 50%. In addition, we present case studies that apply the cell-flipping technique to register files and cache memories.
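
    The flipping idea can be illustrated abstractly (a hypothetical sketch, not the paper's circuit): if the stored bits are inverted at intervals shorter than the 10 msec cycle noted above, each load PMOS holds 0 and 1 for roughly equal time, pushing the stress probability toward 50%. The word width and flip-flag bookkeeping below are assumptions.

      # Hypothetical sketch: a stored word plus a flip flag. Periodically
      # inverting the stored bits (and toggling the flag) lets each cell
      # hold 0 and 1 for roughly equal time; readers XOR with the flag
      # to recover the logical value unchanged.
      class FlippedCell:
          MASK = 0xFFFFFFFF            # 32-bit word, illustrative

          def __init__(self, value: int):
              self.stored = value & self.MASK
              self.flipped = False

          def periodic_flip(self):
              # assumed to be driven by a timer shorter than 10 ms
              self.stored ^= self.MASK
              self.flipped = not self.flipped

          def read(self) -> int:
              return self.stored ^ (self.MASK if self.flipped else 0)

      cell = FlippedCell(0x0000BEEF)
      cell.periodic_flip()
      assert cell.read() == 0x0000BEEF   # logical value is unchanged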

  • Dynamic Program Behavior Identification for High Performance CMPs with Private LLCs

    Xiaomin JIA  Pingjing LU  Caixia SUN  Minxuan ZHANG  

    PAPER
    Vol: E93-D No:12  Page(s): 3211-3222

    Chip Multi-Processors (CMPs) have emerged as a mainstream architectural design for high-performance parallel and distributed computing. Last Level Cache (LLC) management is critical to CMPs because off-chip accesses incur a long latency. With its short access latency, good performance isolation, and easy scalability, a private cache is an attractive design alternative for the LLC of a CMP. This paper proposes program Behavior Identification-based Cache Sharing (BICS) for LLC management. BICS is based on a private cache organization for its shorter access latency, while approximating a shared cache organization by allowing blocks evicted from one private LLC to be saved in peer LLCs, a technique called spilling. BICS identifies the cache behavior type of each application at runtime. When a cache block is evicted from a private LLC, the cache behavior characteristics of the local application are evaluated to determine whether the block should be spilled. Spilled blocks are allowed to replace valid blocks of the peer LLCs as long as the interference remains within a reasonable level. Experimental results using a full-system CMP simulator show that BICS improves the overall throughput by as much as 14.5%, 12.6%, 11.0% and 11.7% (on average 8.8%, 4.8%, 4.0% and 6.8%) over a private cache, a shared cache, the Utility-based Cache Partitioning (UCP) scheme and the baseline spilling-based organization, Cooperative Caching (CC), respectively, on a 4-core CMP for the SPEC CPU2006 benchmarks.
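
    A hedged sketch of the spill decision (function name and thresholds are hypothetical; the paper's runtime behavior identification is richer):

      # Hypothetical sketch: a block evicted from one private LLC is
      # spilled to a peer LLC only if past spilled blocks were actually
      # re-referenced often enough, and the estimated interference
      # imposed on the peer stays within a tolerable level.
      def should_spill(local_reuse_rate: float, peer_miss_increase: float,
                       reuse_threshold: float = 0.2,
                       interference_limit: float = 0.05) -> bool:
          """local_reuse_rate: fraction of spilled blocks re-referenced.
          peer_miss_increase: estimated extra miss rate at the peer.
          Thresholds are illustrative, not from the paper."""
          return (local_reuse_rate >= reuse_threshold
                  and peer_miss_increase <= interference_limit)

      print(should_spill(0.35, 0.01))   # True: reuse pays, low interference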

  • PAW: A Pattern-Aware Write Policy for a Flash Non-volatile Cache

    Young-Jin KIM  Jihong KIM  Jeong-Bae LEE  Kee-Wook RIM  

    PAPER-Software System
    Vol: E93-D No:11  Page(s): 3017-3026

    In disk-based storage systems, non-volatile write caches have been widely used to reduce write latency and to ensure data consistency at the storage-controller level. A write cache policy should consider which data is worth caching and evicting, and should also take into account the real I/O characteristics of the non-volatile device. However, existing work has mainly focused on improving basic cache operations and has not properly considered the I/O cost of the non-volatile device. In this paper, we propose PAW, a pattern-aware write cache policy for a NAND flash memory in disk-based mobile storage systems. PAW is designed for the mix of many sequential accesses and fewer non-sequential ones found in mobile storage systems, redirecting the non-sequential accesses to the NAND flash memory and the sequential ones to the disk. In addition, PAW combines the pattern-aware write cache policy with an I/O clustering-based queuing method to strengthen sequentiality, with the aim of reducing overall system I/O latency. For evaluation, we have built a practical hard disk simulator with a non-volatile NAND flash cache. Experimental results show that our policy significantly improves overall I/O performance by considerably reducing the overhead of the non-volatile cache compared with a traditional policy, while achieving high energy efficiency.
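
    The routing idea can be sketched as follows (hypothetical; the paper's pattern detector and I/O clustering are more elaborate): a request that continues the previous one counts as sequential and goes to the disk, while scattered requests go to the flash cache.

      # Hypothetical sketch: route writes by access pattern. A request
      # starting at the sector right after the previous request counts
      # as sequential and goes to the disk; others go to NAND flash.
      class PatternRouter:
          def __init__(self):
              self.next_expected = None

          def route(self, start_sector: int, length: int) -> str:
              sequential = (start_sector == self.next_expected)
              self.next_expected = start_sector + length
              return "disk" if sequential else "flash"

      r = PatternRouter()
      print(r.route(100, 8))   # flash (no history yet)
      print(r.route(108, 8))   # disk  (continues the previous request)
      print(r.route(500, 4))   # flash (non-sequential)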

  • An Empirical Study of FTL Performance in Conjunction with File System Pursuing Data Integrity

    In Hwan DOH  Myoung Sub SHIM  Eunsam KIM  Jongmoo CHOI  Donghee LEE  Sam H. NOH  

    LETTER-Software System
    Vol: E93-D No:8  Page(s): 2302-2305

    Because Flash storage, the dominant form of portable storage, is detachable, the integrity of data stored on it is an important issue. This study considers the performance of Flash Translation Layer (FTL) schemes embedded in Flash storage in conjunction with file systems that pursue high data integrity. To assure strict data integrity, such file systems synchronously write all file data to storage, which produces hot write references. In this study, we concentrate on the effect of these hot write references on Flash storage, and we consider how absorbing them in a nonvolatile write cache affects the performance of the FTL schemes. Specifically, we quantify the performance of typical FTL schemes for a realistic digital-camera workload that contains hot write references, through experiments on a real system. The results show that, for a workload with hot write references, FTL performance does not conform to previously reported studies. We also conclude that the impact of the underlying FTL scheme on Flash storage performance is dramatically reduced by absorbing the hot write references in a nonvolatile write cache.

  • Access-Driven Cache Attack on the Stream Cipher DICING Using the Chosen IV

    Yukiyasu TSUNOO  Takeshi KAWABATA  Tomoyasu SUZAKI  Hiroyasu KUBO  Teruo SAITO  

    PAPER-Cryptography and Information Security
    Vol: E93-A No:4  Page(s): 799-807

    A cache attack against DICING is presented. Cache attacks use CPU cache miss and hit information as side-channel information. DICING is a stream cipher that was proposed to eSTREAM, and no effective attack on it had been reported before. Because DICING uses a key-dependent S-box and there is no key addition before the first S-box layer, a conventional cache attack is considered difficult. We therefore investigated an access-driven cache attack that exploits the special features of the transformation L to give chosen IVs. We also investigated how to reduce the computational complexity required to obtain the secret key from the information gained in the cache attack. We were able to obtain a 40-bit key differential from a total of 2^18 chosen IVs on a Pentium III processor. From the obtained key differential, the 128-bit secret key could be recovered with a computational complexity of 2^49 to 2^63. This result shows that the new cache attack, which is based on a different attack model, is also applicable in an actual environment.

  • A High Performance and Low Bandwidth Multi-Standard Motion Compensation Design for HD Video Decoder

    Xianmin CHEN  Peilin LIU  Dajiang ZHOU  Jiayi ZHU  Xingguang PAN  Satoshi GOTO  

    PAPER
    Vol: E93-C No:3  Page(s): 253-260

    Motion compensation is widely used in many video coding standards. Due to its bandwidth requirement and complexity, motion compensation is one of the most challenging parts in the design of a high-definition video decoder. In this paper, we propose a high-performance, low-bandwidth motion compensation design that supports the H.264/AVC, MPEG-1/2 and Chinese AVS standards. We introduce a 2-dimensional cache that greatly reduces the external bandwidth requirement, and we exploit similarities among the three standards to reduce hardware cost. We also propose a block-pipelining strategy to hide the long latency of external memory accesses. Experimental results show that our motion compensation design reduces bandwidth by 74% on average and can decode a 1920x1088@30 fps video stream in real time at 80 MHz.

  • Architectures and Technologies for the Future Mobile Internet Open Access

    Dipankar RAYCHAUDHURI  

    INVITED LETTER
    Vol: E93-B No:3  Page(s): 436-441

    This position paper outlines the author's view on architectural directions and key technology enablers for the future mobile Internet. It is pointed out that mobile and wireless services will dominate Internet usage in the near future, and it is therefore important to design next-generation network protocols with features suitable for efficiently serving emerging wireless scenarios and applications. Several key requirements for mobile/wireless scenarios are identified - these include new capabilities such as dynamic spectrum coordination, cross-layer support, disconnection tolerant routing, content addressing, and location awareness. Specific examples of enabling technologies which address some of these requirements are given from ongoing research projects at WINLAB. Topics covered briefly include wireless network virtualization, the cache-and-forward (CNF) protocol, geographic (GEO) protocol stack, cognitive radio protocols, and open networking testbeds.

  • A Two-Level Cache Design Space Exploration System for Embedded Applications

    Nobuaki TOJO  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

    PAPER-Embedded, Real-Time and Reconfigurable Systems
    Vol: E92-A No:12  Page(s): 3238-3247

    Recently, a two-level cache hierarchy, consisting of an L1 cache and an L2 cache, has become common in processors. Particularly in an embedded system, where a single application or a class of applications is repeatedly executed on the processor, the cache configuration can be customized until an optimal one is achieved. An optimal two-level cache configuration, minimizing overall memory access time or memory energy consumption, can be obtained by varying three cache parameters for each of the L1 and L2 caches: the number of sets, the line size, and the associativity. In this paper, we first extend an L1 cache simulation algorithm so that it can explore two-level cache configurations. Second, we propose two two-level cache design space exploration algorithms, CRCB-T1 and CRCB-T2, each of which applies the Cache Inclusion Property to two-level cache configurations. Each of the proposed algorithms realizes exact cache simulation while decreasing the number of cache hit/miss judgments by a factor of several thousand. Experimental results show that, using our approach, the number of cache hit/miss judgments required to optimize a cache configuration is reduced to 1/50-1/5500 of the exhaustive approach. As a result, our approach runs an average of 1398.25 times faster than the exhaustive approach, achieving, to our knowledge, the world's fastest two-level cache design space exploration.
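
    To give a feel for the design space being pruned, a small sketch (parameter ranges are assumptions, not the paper's): every combination of set count, line size, and associativity for L1 and L2 is a candidate, so the space grows multiplicatively and exhaustive simulation quickly becomes expensive.

      # Illustrative enumeration of a two-level cache design space.
      from itertools import product

      sets   = [64, 128, 256, 512]     # assumed ranges
      lines  = [16, 32, 64]            # bytes
      assocs = [1, 2, 4, 8]

      l1_configs = list(product(sets, lines, assocs))
      l2_configs = list(product(sets, lines, assocs))
      pairs = list(product(l1_configs, l2_configs))

      # 48 x 48 = 2304 two-level candidates even for these small ranges;
      # without pruning, each needs a full trace simulation, which is
      # why reusing the Cache Inclusion Property across configurations
      # pays off.
      print(len(pairs))                # 2304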

  • An L1 Cache Design Space Exploration System for Embedded Applications

    Nobuaki TOJO  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

    PAPER-VLSI Design Technology and CAD
    Vol: E92-A No:6  Page(s): 1442-1453

    In an embedded system where a single application or a class of applications is repeatedly executed on a processor, the cache configuration can be customized until an optimal one is achieved. An optimal cache configuration that minimizes overall memory access time can be obtained by varying three cache parameters: the number of sets, the line size, and the associativity. In this paper, we first propose two cache simulation algorithms, CRCB1 and CRCB2, based on the Cache Inclusion Property. They realize exact cache simulation while dramatically decreasing the number of cache hit/miss judgments. We further propose three cache design space exploration algorithms, CRMF1, CRMF2, and CRMF3, based on our experimental observations. They can find an almost optimal cache configuration from the viewpoint of access time. Using our approach, the number of cache hit/miss judgments required to optimize a cache configuration is reduced to 1/10-1/50 of that of conventional approaches. As a result, our approach runs an average of 3.2 times and a maximum of 5.3 times faster than the fastest approach proposed so far, achieving, to our knowledge, the world's fastest cache design space exploration for optimizing total memory access time.
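
    The basic hit/miss judgment whose repetition these algorithms avoid is the classic set-associative lookup; a minimal LRU cache simulator for a single configuration might look like this (a generic sketch, not the CRCB/CRMF algorithms themselves).

      # Generic set-associative LRU cache simulator: one hit/miss
      # judgment per address. CRCB/CRMF reduce how many judgments are
      # needed across configurations; the per-access check itself is
      # the standard one below.
      from collections import OrderedDict

      class Cache:
          def __init__(self, n_sets: int, line_size: int, assoc: int):
              self.n_sets, self.line_size, self.assoc = n_sets, line_size, assoc
              self.sets = [OrderedDict() for _ in range(n_sets)]

          def access(self, addr: int) -> bool:
              block = addr // self.line_size
              s = self.sets[block % self.n_sets]
              if block in s:               # hit: refresh LRU position
                  s.move_to_end(block)
                  return True
              if len(s) >= self.assoc:     # miss: evict the LRU block
                  s.popitem(last=False)
              s[block] = True
              return False

      c = Cache(n_sets=256, line_size=32, assoc=4)
      hits = sum(c.access(a) for a in [0, 4, 64, 0, 4096, 0])
      print(hits)   # 3: addresses 0 and 4 share a line; repeats hit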

  • An Effective Self-Adaptive Admission Control Algorithm for Large Web Caches

    Chul-Woong YANG  Ki Yong LEE  Yon Dohn CHUNG  Myoung Ho KIM  Yoon-Joon LEE  

    LETTER-Contents Technology and Web Information Systems
    Vol: E92-D No:4  Page(s): 732-735

    In this paper, we propose an effective Web cache admission control algorithm. By selectively admitting objects into the cache, the proposed algorithm significantly reduces the amount of disk I/O on a Web cache while maintaining a high hit ratio. The algorithm adaptively adjusts its own admission control parameter and requires no user-supplied parameters. Through extensive experiments, we show the effectiveness of the proposed algorithm.
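
    One plausible shape for such a self-adaptive admission control, purely as a hypothetical sketch (the paper's actual parameter and update rule may differ): admit an object only once it has proven some worth, and nudge the threshold according to observed outcomes.

      # Hypothetical sketch: admit an object once it has been requested
      # enough times; raise the threshold when admissions rarely lead to
      # hits (wasting disk I/O), lower it when they often do. Constants
      # are illustrative.
      class AdaptiveAdmission:
          def __init__(self):
              self.threshold = 2.0    # requests seen before admission
              self.step = 0.1

          def admit(self, request_count: int) -> bool:
              return request_count >= self.threshold

          def feedback(self, admitted_object_was_hit: bool):
              # self-tuning: no user-supplied parameter at runtime
              if admitted_object_was_hit:
                  self.threshold = max(1.0, self.threshold - self.step)
              else:
                  self.threshold += self.step

      ac = AdaptiveAdmission()
      print(ac.admit(3))    # True under the initial threshold of 2.0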

  • XIR: Efficient Cache Invalidation Strategies for XML Data in Wireless Environments

    Jae-Ho CHOI  Sang-Hyun PARK  Myong-Soo LEE  SangKeun LEE  

    PAPER-Broadcast Systems
    Vol: E92-B No:4  Page(s): 1337-1345

    With the growth of wireless computing and the popularity of the eXtensible Markup Language (XML), wireless XML data management is emerging as an important research area. This paper addresses cache invalidation for XML updates in wireless computing environments. A family of XML cache invalidation strategies, called S-XIR, D-XIR and E-XIR, is proposed. With S-XIR and D-XIR, when only the structure of the XML data changes, the unchanged parts can be effectively reused in client caching; E-XIR, which uses prefetching, further improves access time. Simulations carried out to evaluate the proposed methodology show that the strategies improve both tuning time and access time significantly. In particular, the proposed strategies are on average about 4 to 12 times better than the previous approach in terms of tuning time.

  • A Way Enabling Mechanism Based on the Branch Prediction Information for Low Power Instruction Cache

    Gi-Ho PARK  Jung-Wook PARK  Hoi-Jin LEE  Gunok JUNG  Sung-Bae PARK  Shin-Dug KIM  

    LETTER
    Vol: E92-C No:4  Page(s): 517-521

    This paper presents a cache way enabling mechanism that uses branch target addresses. The mechanism uses branch prediction information to avoid the power consumed by unnecessary cache way accesses, enabling only the cache way(s) that actually need to be accessed. The proposed mechanism reduces the power consumption of the instruction cache by 63% without any performance degradation of the processor. An ARM1136 processor simulator and Synopsys PrimeTime are used for the performance/power simulation and the static timing analysis of the proposed mechanism, respectively.
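
    A hedged sketch of the mechanism as described (the way-prediction table and sizes are assumptions): when the branch predictor supplies a target address early, a small table can indicate which way holds the target line, so only that way is enabled on the access.

      # Hypothetical sketch: use the predicted branch target to enable
      # only the I-cache way expected to hold the fetch line, instead of
      # reading all ways in parallel. A tiny way table is assumed.
      N_WAYS, LINE = 4, 32

      way_table = {}   # fetch-line address -> way it was filled into

      def ways_to_enable(fetch_addr: int, predicted: bool):
          line_addr = fetch_addr // LINE
          if predicted and line_addr in way_table:
              return [way_table[line_addr]]   # enable a single way
          return list(range(N_WAYS))          # fall back: enable all

      def record_fill(fetch_addr: int, way: int):
          way_table[fetch_addr // LINE] = way

      record_fill(0x4000, 2)
      print(ways_to_enable(0x4000, predicted=True))    # [2]
      print(ways_to_enable(0x8000, predicted=True))    # [0, 1, 2, 3]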

  • A Remedy for Network Operators against Increasing P2P Traffic: Enabling Packet Cache for P2P Applications Open Access

    Akihiro NAKAO  Kengo SASAKI  Shu YAMAMOTO  

    INVITED PAPER
    Vol: E91-B No:12  Page(s): 3810-3820

    We observe that P2P traffic has peculiar characteristics compared with other types of traffic such as web browsing and file transfer. Because P2P applications exploit the swarm effect -- a multitude of end points downloading the same content piece by piece at nearly the same time, which increases the effectiveness of caching -- the same pieces of data traverse the network over and over again, mostly within a short time window. In light of this observation, we propose network-layer packet-level caching to reduce the volume of emerging P2P traffic transparently to the P2P applications -- without affecting their operation at all -- rather than banning it, restricting it, or modifying the P2P systems themselves. Unlike other caching techniques, we aim to provide as generic a caching mechanism as possible at the network layer -- without detailed knowledge of P2P application protocols -- so that it applies to arbitrary P2P protocols. Our preliminary evaluation shows that our approach can be expected to reduce a significant amount of P2P traffic transparently to P2P applications.
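
    The protocol-agnostic part of the idea can be sketched simply (hypothetical; the paper's mechanism operates inside the network, not at end hosts): key the cache on a digest of the payload bytes themselves, so identical P2P pieces are recognized without parsing any application protocol.

      # Hypothetical sketch of protocol-agnostic packet-level caching:
      # the cache is keyed by a digest of the payload bytes, so repeated
      # P2P pieces are detected without understanding the P2P protocol.
      import hashlib
      from collections import OrderedDict

      class PacketCache:
          def __init__(self, capacity: int = 1024):
              self.store = OrderedDict()
              self.capacity = capacity

          def lookup_or_insert(self, payload: bytes) -> bool:
              key = hashlib.sha256(payload).digest()
              if key in self.store:
                  self.store.move_to_end(key)
                  return True                  # hit: payload seen before
              if len(self.store) >= self.capacity:
                  self.store.popitem(last=False)
              self.store[key] = payload
              return False

      pc = PacketCache()
      piece = b"chunk-of-shared-content"
      print(pc.lookup_or_insert(piece))   # False: first sighting
      print(pc.lookup_or_insert(piece))   # True: the swarm effect pays off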

  • Cache Optimization for H.264/AVC Motion Compensation

    Sangyong YOON  Soo-Ik CHAE  

    LETTER-Image Processing and Video Processing
    Vol: E91-D No:12  Page(s): 2902-2905

    In this letter, we propose a cache organization that substantially reduces the memory bandwidth of motion compensation (MC) in H.264/AVC decoders. To reduce duplicated memory accesses to P and B pictures, we employ a four-way set-associative cache whose index bits are composed of horizontal and vertical address bits of the frame buffer and whose lines each store 8×2 pixels of the reference frames. Moreover, we alleviate the data fragmentation problem by choosing a line size equal to the minimum access size of the DDR SDRAM. Averaged over five QCIF IBBP image sequences, the optimized cache requires only 129% of the essential bandwidth of H.264/AVC MC.
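
    The index construction can be illustrated with a sketch (the 8×2 line shape is from the abstract; the bit widths are assumptions): composing the set index from low-order horizontal and vertical line coordinates keeps spatially neighboring regions in different sets, which suits the 2-D access pattern of MC.

      # Hypothetical sketch: build a set index from horizontal and
      # vertical frame-buffer coordinates so neighboring 8x2-pixel
      # regions land in different cache sets. Bit widths are assumed.
      LINE_W, LINE_H = 8, 2        # 8x2 pixels per line (from abstract)
      X_BITS, Y_BITS = 3, 3        # index bits per axis (assumed)

      def set_index(x: int, y: int) -> int:
          bx, by = x // LINE_W, y // LINE_H      # line coordinates
          return (((by & ((1 << Y_BITS) - 1)) << X_BITS)
                  | (bx & ((1 << X_BITS) - 1)))

      # Horizontally and vertically adjacent lines get distinct sets.
      print(set_index(0, 0), set_index(8, 0), set_index(0, 2))   # 0 1 8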

  • DAC: A Device-Aware Cache Management Algorithm for Heterogeneous Mobile Storage Systems

    Young-Jin KIM  Jihong KIM  

    PAPER-System Programs
    Vol: E91-D No:12  Page(s): 2818-2833

    In recent years, heterogeneous devices have frequently been employed in mobile storage systems, because a combination of such devices can provide a synergistic storage solution that takes advantage of each device. One important design constraint in heterogeneous storage systems is mitigating the I/O performance degradation that stems from the different access times of the different devices. To this end, little work has been done on proper buffer cache management algorithms. This paper presents a novel buffer cache management algorithm that considers both the I/O cost per device and the workload patterns in mobile computing systems with a heterogeneous storage pair of a hard disk and a NAND flash memory. To minimize the total I/O cost under varying workload patterns, the proposed algorithm employs dynamic cache partitioning over the different devices and manages each partition according to request patterns and I/O types along with temporal locality. Trace-based simulations show that the proposed algorithm significantly reduces the total I/O cost and the flash write count over existing buffer cache algorithms on typical mobile traces.
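
    A hedged sketch of the cost-awareness idea (device costs and the scoring rule are illustrative, not the paper's algorithm): when choosing a victim, weight each candidate's coldness by the I/O cost of re-fetching it from its backing device.

      # Hypothetical sketch of device-aware eviction: score each buffer
      # cache candidate by its age weighted against the re-fetch cost of
      # its backing device, so cheap-to-refetch blocks go first. The
      # cost numbers are illustrative, not measurements.
      IO_COST = {"disk": 10.0, "flash_read": 1.0, "flash_write": 5.0}

      def eviction_victim(candidates):
          """candidates: (block_id, age, device) tuples; larger age
          means colder. Evict the coldest, cheapest-to-refetch block."""
          def score(c):
              _, age, device = c
              return age / IO_COST[device]   # high age, low cost first
          return max(candidates, key=score)[0]

      print(eviction_victim([("a", 100, "disk"),
                             ("b", 80, "flash_read"),
                             ("c", 90, "flash_write")]))   # "b"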

  • Way-Scaling to Reduce Power of Cache with Delay Variation

    Maziar GOUDARZI  Tadayuki MATSUMURA  Tohru ISHIHARA  

    PAPER-High-Level Synthesis and System-Level Design
    Vol: E91-A No:12  Page(s): 3576-3584

    The share of leakage in cache power consumption increases with technology scaling. Choosing a higher threshold voltage (Vth) and/or gate-oxide thickness (Tox) for cache transistors reduces leakage but impacts cell delay. We show that, due to uncorrelated random within-die delay variation, only some (not all) of the cells actually violate the cache delay after this change. We propose adding a spare cache way to replace the delay-violating cache lines separately in each cache set. SPICE and gate-level simulations in a commercial 90 nm process show that choosing higher Vth and Tox and adding one spare way to a 4-way 16 KB cache reduces leakage power by 42%, which, depending on the share of leakage in total cache power, gives up to 22.59% and 41.37% reduction of total energy in the L1 instruction cache and L2 unified cache respectively, with a negligible delay penalty and without sacrificing cache capacity or timing yield.
