Keyword Search Results

[Keyword] cache (201 hits)

Results 41-60 of 201

  • Improvement of Renamed Trace Cache through the Reduction of Dependent Path Length for High Energy Efficiency

    Ryota SHIOYA  Hideki ANDO  

     
    PAPER-Computer System

    Publicized: 2015/12/04
    Vol: E99-D No:3
    Page(s): 630-640

    Out-of-order superscalar processors rename register numbers to remove false dependencies between instructions. The renaming logic is a high-cost module in a superscalar processor and consumes considerable energy. A renamed trace cache (RTC) was previously proposed to reduce the energy consumption of the renaming logic: an RTC caches and reuses renamed operands, so register renaming can be omitted on RTC hits. However, conventional RTCs suffer from several problems in performance, energy consumption, and hardware overhead. We propose a semi-global renamed trace cache (SGRTC) that caches only renamed operands that are a short distance from their producers outside traces, and that solves the problems of conventional RTCs. Evaluation results show that SGRTC achieves 64% lower energy consumption for renaming with a 0.2% performance overhead compared to a conventional processor.
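
    As a toy illustration of the register renaming whose results an RTC caches, the following Python sketch (hypothetical, not from the paper) maps each architectural destination register to a fresh physical register, which removes false (WAW/WAR) dependencies:

    free_list = list(range(32, 64))          # free physical registers
    rename_map = {r: r for r in range(32)}   # architectural -> physical mapping

    def rename(dest, src1, src2):
        # Sources read the current mapping; the destination gets a fresh
        # physical register, so later writers never clash with earlier ones.
        p_src1, p_src2 = rename_map[src1], rename_map[src2]
        p_dest = free_list.pop(0)
        rename_map[dest] = p_dest
        return p_dest, p_src1, p_src2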

  • A SOI Cache-Tag Memory with Dual-Rail Wordline Scheme

    Nobutaro SHIBATA  Takako ISHIHARA  

     
    PAPER-Integrated Electronics

    Vol: E99-C No:2
    Page(s): 316-330

    Cache memories are the major application of high-speed SRAMs, and they are frequently installed in high-performance logic VLSIs, including microprocessors. This paper presents a 4-way set-associative SOI cache-tag memory. To obtain higher operating speed with less power dissipation, we devised an I/O-separated memory cell with a dual-rail wordline, which transmits complementary selection signals. The address decoding delay was shortened using CMOS dual-rail logic. To enhance the maximum operating frequency, bitline recovery operations after writing data were eliminated using a memory array configuration without half-selected cells. Moreover, the conventional sensitive but slow differential amplifiers were removed from the data I/O circuitry by means of a hierarchical bitline scheme. As regards stored data management, we devised a new hardware-oriented LRU data-replacement algorithm based on a 6-bit directed graph. Experimental results obtained with a test chip fabricated in a 0.25-µm CMOS/SIMOX process show that the core of the cache-tag memory with a 1024-set configuration achieves a 1.5-ns address access time under typical conditions of a 2-V power supply and 25°C. The power dissipation during standby was less than 14 µW, and that at 500-MHz operation was 13-83 mW, depending on the bit-stream data pattern.
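
    The abstract does not detail the 6-bit directed-graph algorithm; a common scheme of this kind keeps one bit per pair of ways (6 bits for 4 ways) recording which way was used more recently, as in this illustrative sketch:

    from itertools import combinations

    pairs = list(combinations(range(4), 2))  # 6 pairs of ways
    newer = {p: False for p in pairs}        # newer[(i, j)]: way i newer than j

    def touch(way):
        # On an access, record that `way` is now newer than every other way.
        for i, j in pairs:
            if i == way:
                newer[(i, j)] = True
            elif j == way:
                newer[(i, j)] = False

    def lru_victim():
        # The LRU way is the one that is older than all three others.
        for w in range(4):
            if all(not newer[(i, j)] if i == w else newer[(i, j)]
                   for i, j in pairs if w in (i, j)):
                return w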

  • Performance Evaluation of Partial Deployment of an In-Network Cache Location Guide Scheme, Breadcrumbs

    Hideyuki NAKAJIMA  Tatsuhiro TSUTSUI  Hiroyuki URABAYASHI  Miki YAMAMOTO  Elisha ROSENSWEIG  James F. KUROSE  

     
    PAPER-Network

    Vol: E99-B No:1
    Page(s): 157-166

    In recent years, much work has been devoted to developing protocols and architectures for supporting the growing trend of data-oriented services. One drawback of many of these proposals is the need to upgrade or replace all the routers in order for the new systems to work. Among the few systems that allow for gradual deployment is the recently proposed Breadcrumbs technique for distributed coordination among caches in a cache network. Breadcrumbs uses information collected locally at each cache during past downloads to support in-network guiding of current requests to desired content. Specifically, during content download a series of short-term pointers, called breadcrumbs, is set up along the download path. Future requests for this content are initially routed towards the server which holds (a copy of) this content. However, if this route leads the request to a Breadcrumbs-supporting router, this router re-directs the request in the direction of the latest download, using the aforementioned pointers. Thus, content requests are initially forwarded by a location ID (e.g., IP address), but encountering a breadcrumb entry can cause a shift over to content-based routing. This property enables the Breadcrumbs system to be deployed gradually, since it only enhances the existing location-based routing mechanism (i.e., IP-based routing). In this paper we evaluate the performance of a network where Breadcrumbs is only partially deployed. Our simulation results show that Breadcrumbs performs poorly when sparsely deployed; however, if an overlay of Breadcrumbs-supporting routers is set up, system performance is greatly improved. We believe that the reduced load on servers achieved with even a limited deployment of Breadcrumbs-supporting routers, combined with the flexibility of being able to deploy the system gradually, should motivate further investigation and eventual deployment of Breadcrumbs. In the paper, we also evaluate deployment at a coarser level than the router level, namely ISP-level Breadcrumbs deployment. Our evaluation results show that a higher-layer-first deployment approach obtains a large improvement from Breadcrumbs redirections because of traffic aggregation in higher-layer ISPs.
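
    A minimal sketch of the breadcrumb mechanic described above (names and data structures are illustrative, not the paper's): each router on a download path stores a short-term pointer, and a later request that encounters one is re-directed along it instead of continuing toward the server.

    breadcrumbs = {}    # router -> {content name: next hop toward the download}

    def record_download(path, content):
        # path = [server, r1, ..., client]; every hop points downstream.
        for here, nxt in zip(path, path[1:]):
            breadcrumbs.setdefault(here, {})[content] = nxt

    def next_hop(router, content, hop_toward_server):
        # A breadcrumb, when present, overrides location-based (IP) forwarding.
        return breadcrumbs.get(router, {}).get(content, hop_toward_server)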

  • A Light-Weight Rollback Mechanism for Testing Kernel Variants in Auto-Tuning

    Shoichi HIRASAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER-Software

    Publicized: 2015/09/15
    Vol: E98-D No:12
    Page(s): 2178-2186

    Automatic performance tuning of a practical application can be time-consuming and sometimes infeasible, because it often needs to evaluate the performance of a large number of code variants to find the best one. Hence, in this paper, a light-weight rollback mechanism is proposed to evaluate each code variant at a low cost. In the proposed mechanism, once one code variant of a target code block has been executed, the execution state is rolled back to the state before the block was executed, so that only the block is repeatedly executed to find the best code variant. The mechanism can also terminate a code variant whose execution time exceeds the shortest execution time observed so far. As a result, it avoids executing the whole application many times and thus reduces the timing overhead of the auto-tuning process required to find the best code variant.
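
    A high-level sketch of the idea, under the assumption of an in-memory state that can be snapshotted (the paper's mechanism operates at a lower level):

    import copy, time

    def tune_block(variants, state):
        # Evaluate every variant of the target block on the same pre-block
        # state, rolling back after each trial instead of re-running the
        # whole application. A variant could also be aborted early once it
        # exceeds best_time (the paper's early-termination feature).
        best_time, best = float("inf"), None
        for variant in variants:
            snapshot = copy.deepcopy(state)      # checkpoint before the block
            start = time.perf_counter()
            variant(state)                       # execute only the target block
            elapsed = time.perf_counter() - start
            state = snapshot                     # rollback to the pre-block state
            if elapsed < best_time:
                best_time, best = elapsed, variant
        return best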

  • FLEXII: A Flexible Insertion Policy for Dynamic Cache Resizing Mechanisms

    Masayuki SATO  Ryusuke EGAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER

    Vol: E98-C No:7
    Page(s): 550-558

    As the energy consumption of cache memories increases, an energy-efficient cache management mechanism is required. While a dynamic cache resizing mechanism is one promising approach to reducing the energy consumption of microprocessors, its effect is limited by the existence of dead-on-fill blocks, which are never used between their insertion and their eviction from the cache memory. To solve this problem, this paper proposes a cache management policy named FLEXII, which reduces the number of dead-on-fill blocks and helps dynamic cache resizing mechanisms further reduce the energy consumption of cache memories.
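
    The abstract leaves the policy details out; the following sketch shows the general insertion-policy idea of placing a new block near the LRU end so that a dead-on-fill block is evicted quickly, while a re-referenced block is promoted (FLEXII itself may differ):

    recency = []    # recency[0] is the LRU end, recency[-1] the MRU end

    def access(block, insert_pos=1, capacity=8):
        if block in recency:
            recency.remove(block)        # hit: promote to the MRU position
            recency.append(block)
        else:
            if len(recency) >= capacity:
                recency.pop(0)           # evict from the LRU end
            # Miss: insert near the LRU end; a block that is never reused
            # (dead-on-fill) leaves the cache after only a few misses.
            recency.insert(min(insert_pos, len(recency)), block)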

  • Cache-Conscious Data Access for DBMS in Multicore Environments

    Fang XI  Takeshi MISHIMA  Haruo YOKOTA  

     
    PAPER

    Publicized: 2015/01/21
    Vol: E98-D No:5
    Page(s): 1001-1012

    In recent years, dramatic improvements have been made to computer hardware. In particular, the number of cores on a chip has been growing exponentially, enabling an ever-increasing number of processes to be executed in parallel. Having been originally developed for single-core processors, database (DB) management systems (DBMSs) running on multicore processors suffer from cache conflicts as the number of concurrently executing DB processes (DBPs) increases. Therefore, a cache-efficient solution for arranging the execution of concurrent DBPs on multicore platforms would be highly attractive for DBMSs. In this paper, we propose CARIC-DA, middleware that achieves higher performance in DBMSs on multicore processors by reducing cache misses with a new cache-conscious dispatcher for concurrent queries. CARIC-DA logically range-partitions the dataset into multiple subsets. This enables different processor cores to access different subsets by ensuring that different DBPs are pinned to different cores and by dispatching queries to DBPs according to the data-partitioning information. In this way, CARIC-DA is expected to achieve better performance via a higher hit rate in each core's private cache. It can also balance the loads between cores by changing the range of each subset. Note that CARIC-DA is pure middleware, meaning that it avoids any modification to existing operating systems (OSs) and DBMSs, thereby making it more practical. This is important because the source code for existing DBMSs is large and complex, making it very expensive to modify. We implemented a prototype that uses unmodified existing Linux and PostgreSQL environments, and evaluated the effectiveness of our proposal on three different multicore platforms. The performance evaluation against benchmarks revealed that CARIC-DA achieved improved cache hit rates and higher performance.
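
    Schematically, the dispatcher could work as below (the partition bounds and core count are made-up values; the real middleware also rebalances load by moving the bounds):

    # Each range of the key space is served by one DB process pinned to one
    # core, so each core's private cache keeps seeing the same data subset.
    partitions = [(0, 2500), (2500, 5000), (5000, 7500), (7500, 10000)]

    def dispatch(key):
        for core, (lo, hi) in enumerate(partitions):
            if lo <= key < hi:
                return core              # route to the DBP pinned to this core
        raise ValueError("key outside the partitioned key space")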

  • Multi-ISP Cooperative Cache Sharing for Saving Inter-ISP Transit Cost in Content Centric Networking

    Kazuhito MATSUDA  Go HASEGAWA  Masayuki MURATA  

     
    PAPER-Internet

    Vol: E98-B No:4
    Page(s): 621-629

    Content-Centric Networking (CCN) has an in-network caching mechanism, which can reduce the traffic volume along the route to the destination host. This traffic volume reduction on the transit link can decrease inter-ISP transit cost. However, the memory space for caching in CCN routers is small relative to the total volume of content. In addition, any initial access to content requested by a user must use the transit link, even when a nearby CCN router off the route holds a cached copy of the content. In this paper, we propose a method of cooperative cache sharing among CCN routers in multiple ISPs. It aims to attain a further reduction in inter-ISP transit cost by improving the cache hit ratio. In the proposed method, the CCN routers share their memory space for caching and store non-overlapping cache content. We evaluate the proposed method by simulation experiments using the IP-level network topology of an actual ISP, and show that inter-ISP transit traffic can be reduced by up to 28% compared with the normal caching behavior of CCN.
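
    One simple way to realize non-overlapping cache contents, shown here for illustration (the paper's assignment rule may differ), is to map each content name to exactly one router of the cooperating group:

    import hashlib

    routers = ["r0", "r1", "r2", "r3"]       # cooperating CCN routers

    def responsible_router(content_name):
        # Hash-based assignment: every name has exactly one caching router,
        # so the group's memory behaves like one large cache without duplicates.
        digest = hashlib.sha1(content_name.encode()).hexdigest()
        return routers[int(digest, 16) % len(routers)]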

  • Adaptive TTL Control to Minimize Resource Cost in Hierarchical Caching Networks

    Satoshi IMAI  Kenji LEIBNITZ  Masayuki MURATA  

     
    PAPER-Internet Architecture and Protocols

    Publicized: 2014/12/11
    Vol: E98-D No:3
    Page(s): 565-577

    Content caching networks such as Information-Centric Networking (ICN) reduce network traffic by storing content data on routers near users. In ICN, managing system resources such as storage and network bandwidth, which are influenced by the cache characteristics of each cache node, becomes an important issue. Meanwhile, cache aging techniques based on the Time-To-Live (TTL) of content facilitate analyzing cache characteristics and can realize appropriate resource management by setting efficient TTLs. However, it is difficult to find efficient TTL values in a distributed cache system composed of multiple connected cache nodes. Therefore, we propose an adaptive mechanism that controls the TTL value of content in distributed cache systems by using predictive models that estimate the impact of TTL values on network resources and cache performance. Furthermore, we show the effectiveness of the proposed mechanism.
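
    For reference, a minimal TTL cache looks as follows; the proposed mechanism's contribution is the control loop (omitted here) that adapts the ttl argument per node using the predictive models:

    import time

    cache = {}                               # name -> (value, expiry timestamp)

    def get(name, ttl, fetch):
        value, expiry = cache.get(name, (None, 0.0))
        if time.monotonic() < expiry:
            return value                     # hit: TTL has not expired yet
        value = fetch(name)                  # miss or expired: fetch again
        cache[name] = (value, time.monotonic() + ttl)
        return value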

  • Evaluation Method for Access-Driven Cache Attacks Using Correlation Coefficient

    Junko TAKAHASHI  Toshinori FUKUNAGA  Kazumaro AOKI  Hitoshi FUJI  

     
    PAPER-Foundation

    Vol: E98-A No:1
    Page(s): 192-202

    This paper proposes a new, accurate evaluation method for examining the resistance of cryptographic implementations against access-driven cache attacks (CAs). We show that the mathematical correlation between the measured access times and the ideal data, which depend on the guessed key, can be used to quantitatively identify the correct key in access-driven CAs. We show the effectiveness of the proposed method using access times measured in noisy environments. We also estimate the number of key candidates based on a mathematical proof that takes memory allocation into account. Furthermore, based on the proposed method, we analyze quantitatively how the correlation values change with the number of plaintexts required for a successful attack.
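
    In outline, the evaluation amounts to a correlation test of the following shape (a generic sketch; ideal_model stands in for the paper's guessed-key-dependent ideal data and is an assumed callable):

    from statistics import correlation      # Python 3.10+

    def best_key_guess(measured_times, plaintexts, ideal_model):
        # Correlate measured access times with the ideal data predicted under
        # each key guess; the correct key should score highest.
        scores = {
            guess: correlation(measured_times,
                               [ideal_model(p, guess) for p in plaintexts])
            for guess in range(256)
        }
        return max(scores, key=scores.get)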

  • Optimal Cooperative Routing Protocol for Efficient In-Network Cache Management in Content-Centric Networks

    Saran TARNOI  Wuttipong KUMWILAISAK  Yusheng JI  

     
    PAPER

    Vol: E97-B No:12
    Page(s): 2627-2640

    This paper presents an optimal cooperative routing protocol (OCRP) that aims to improve the in-network cache utilization of Content-Centric Networking (CCN). The objective of OCRP is to selectively aggregate multiple flows of interest messages onto the same path so as to improve cache utilization while mitigating cache contention in the Content Stores (CSs) of CCN routers on the routing path. The proposed routing protocol consists of three processes: (1) Prefix Popularity Observation, which observes popularly cited prefixes to activate the prefix group (un)subscription function; (2) Prefix Group (Un)Subscription, which lets the Designated Router (DR) know which requester router wants to join or leave which prefix group; and (3) Forwarding Information Base (FIB) Reconstruction, which rebuilds the FIB entries of the CCN routers involved in the newly computed optimal cooperative path of all prefix groups. The optimal routing path is obtained by binary linear optimization under a flow conservation constraint, a cache-contention-mitigating constraint, and a path length constraint. Two metrics, server load and round-trip hop distance, are used to measure the performance of the proposed routing protocol. Simulation results for various network scenarios and settings show advantages over shortest-path routing and our previously proposed cooperative routing schemes.

  • In-Network Cache Management Based on Differentiated Service for Information-Centric Networking

    Qian HU  Muqing WU  Hailong HAN  Ning WANG  Chaoyi ZHANG  

     
    PAPER

    Vol: E97-B No:12
    Page(s): 2616-2626

    As a promising future network architecture, Information-Centric Networking (ICN) has attracted much attention; its ubiquitous in-network caching is one of the key technologies for optimizing the dissemination of information. However, considering the diversity of contents and the limitation of cache resources in the Internet, it is usually difficult to find a one-size-fits-all caching strategy. How to manage the ubiquitous in-network cache in ICN has therefore become an important problem. In this paper, we explore ways to improve cache performance from the three perspectives of spatiality, temporality, and availability, and on this basis we propose an in-network cache management strategy that supports differentiated service. We divide the contents requested in the network into different levels, and the selection of a caching strategy depends on the content level. Furthermore, the corresponding models of utilizing cache resources in spatiality, temporality, and availability are derived for comparison and analysis. Simulations verify that our differentiated-service-based cache management strategy can optimize the utilization of cache resources and achieve higher overall cache performance.

  • MVP-Cache: A Multi-Banked Cache Memory for Energy-Efficient Vector Processing of Multimedia Applications

    Ye GAO  Masayuki SATO  Ryusuke EGAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER-Computer System

    Publicized: 2014/08/22
    Vol: E97-D No:11
    Page(s): 2835-2843

    Vector processors have significant advantages for next-generation multimedia applications (MMAs). One advantage is that vector processors can achieve high data-transfer performance by using a high-bandwidth memory subsystem, resulting in high sustained computing performance. However, a high-bandwidth memory subsystem usually incurs enormous costs in terms of chip area, power, and energy consumption. These costs are prohibitive for commodity computer systems, which are the main execution platform of MMAs. This paper proposes MVP-cache, a new multi-banked cache memory for commodity computer systems, in order to expand the potential of vector architectures on MMAs. Unlike conventional multi-banked cache memories, which employ one tag array and one data array per sub-cache, MVP-cache associates one tag array with multiple independent data arrays of small-sized cache lines. In this way, MVP-cache incurs less static power consumption in its tag arrays. MVP-cache also achieves high efficiency on short vector data transfers because the flexibility of data transfers is improved by independently controlling the data transfers of each data array.

  • Workload-Aware Caching Policy for Information-Centric Networking

    Qian HU  Muqing WU  Song GUO  Hailong HAN  Chaoyi ZHANG  

     
    PAPER-Network

    Vol: E97-B No:10
    Page(s): 2157-2166

    Information-centric networking (ICN) is a promising architecture and has attracted much attention in the area of future Internet architectures. As one of the key technologies in ICN, in-network caching can enhance content retrieval at a global scale without requiring any special infrastructure. In this paper, we propose a workload-aware caching policy, LRU-GT, which protects newly cached contents from replacement for a period of time (the guard time). LRU-GT can exploit temporal locality and distinguish contents of different popularity, both of which are characteristics of the workload. Cache replacement is modeled as a semi-Markov process under the Independent Reference Model (IRM) assumption, and a theoretical analysis proves that in LRU-GT popular contents have a longer sojourn time in the cache than unpopular ones, and that the value of the guard time affects the cache hit ratio. We also propose a dynamic guard-time adjustment algorithm to optimize performance. Simulation results show that LRU-GT can reduce the average number of hops to get contents and improve the cache hit ratio.
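
    A sketch of the guard-time idea (parameter values are illustrative, and LRU promotion on hits is omitted): the victim search simply skips any block cached less than guard seconds ago.

    import time

    cache = {}      # name -> insertion timestamp; dict order = insertion order

    def insert(name, guard=10.0, capacity=8):
        if name in cache:
            return
        if len(cache) >= capacity:
            now = time.monotonic()
            for victim, inserted in cache.items():   # scan from the oldest
                if now - inserted >= guard:          # guarded blocks are skipped
                    del cache[victim]
                    break
            else:
                return       # every block still guarded: bypass the insertion
        cache[name] = time.monotonic()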

  • Write Avoidance Cache Coherence Protocol for Non-volatile Memory as Last-Level Cache in Chip-Multiprocessor

    Ju Hee CHOI  Jong Wook KWAK  Chu Shik JHON  

     
    LETTER-Computer System

    Vol: E97-D No:8
    Page(s): 2166-2169

    Non-Volatile Memories (NVMs) are considered promising memory technologies for the Last-Level Cache (LLC) due to their low leakage and high density. However, NVMs have drawbacks such as high dynamic energy when modifying NVM cells, long write latency, and limited write endurance. A number of approaches have been proposed to overcome these drawbacks, but very little attention has been paid to the cache coherence issue. In this letter, we suggest a new cache coherence protocol to reduce the write operations of the LLC. In our protocol, the block data of the LLC is updated only when the cache block is written back from a private cache, which avoids useless write operations in the LLC. The simulation results show that our protocol provides up to 27.1% energy savings and up to 26.3% lifetime improvement in STT-RAM.
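
    In essence (a sketch, not the protocol's full state machine): stores are absorbed by the private cache, and the NVM LLC block is written only once, on write-back.

    llc = {}        # NVM last-level cache: writes are slow and wear-limited
    private = {}    # per-core private cache: address -> (data, dirty bit)

    def store(addr, data):
        private[addr] = (data, True)     # the LLC copy is left stale on purpose

    def evict_from_private(addr):
        data, dirty = private.pop(addr)
        if dirty:
            llc[addr] = data             # the only point where the NVM is written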

  • CRRP: Cost-Based Replacement with Random Placement for En-Route Caching

    Sen WANG  Jun BI  Jianping WU  

     
    LETTER-Information Network

    Vol: E97-D No:7
    Page(s): 1914-1917

    Caching is widely considered an efficient way to reduce access latency and network bandwidth consumption. En-route caching, where caches are associated with routing nodes in the network, was proposed in the context of Web caching to fully exploit the potential of caching. To make sensible replacement and placement decisions for en-route caching, traditional caching schemes either employ computation-intensive algorithms such as dynamic programming or suffer from inferior performance in terms of average access latency. In this article, we propose a new caching scheme with cost-based replacement and random placement, named CRRP. The cost-based replacement of CRRP introduces probing requests to perceive cost changes in a timely manner, and the random placement is independent of the current caching state, with O(1) computational complexity for each placement decision. Through extensive simulations, we show that CRRP outperforms a wide range of caching schemes and comes very close to the traditional dynamic-programming-based algorithm in terms of average access delay.
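
    A sketch of the two halves of CRRP (the placement probability and the cost bookkeeping are simplified): placement is a state-independent coin flip, and replacement evicts the entry whose estimated cost is lowest.

    import random

    cache = {}                               # content name -> estimated miss cost

    def maybe_cache(name, cost, p=0.3, capacity=8):
        if random.random() >= p:             # random placement: O(1), stateless
            return
        if name not in cache and len(cache) >= capacity:
            victim = min(cache, key=cache.get)   # cost-based replacement
            del cache[victim]
        cache[name] = cost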

  • A 40-nm Resilient Cache Memory for Dynamic Variation Tolerance Delivering ×91 Failure Rate Improvement under 35% Supply Voltage Fluctuation

    Yohei NAKATA  Yuta KIMI  Shunsuke OKUMURA  Jinwook JUNG  Takuya SAWADA  Taku TOSHIKAWA  Makoto NAGATA  Hirofumi NAKANO  Makoto YABUUCHI  Hidehiro FUJIWARA  Koji NII  Hiroyuki KAWAI  Hiroshi KAWAGUCHI  Masahiko YOSHIMOTO  

     
    PAPER

    Vol: E97-C No:4
    Page(s): 332-341

    This paper presents a resilient cache memory for dynamic variation tolerance in a 40-nm CMOS process. The cache can sustain operation under a large-amplitude voltage droop. To realize sustained operation, the resilient cache exploits 7T/14T bit-enhancing SRAM and an on-chip voltage/temperature monitoring circuit. The 7T/14T bit-enhancing SRAM can dynamically reconfigure itself to a reliable bit-enhancing mode, and the on-chip voltage/temperature monitoring circuit senses the precise supply voltage level of the cache's power rail. The proposed cache dynamically changes its operation mode using the voltage/temperature monitoring result and can operate reliably under a large-amplitude voltage droop. Experimental results show that it does not fail under 25% and 30% droops of Vdd, and that it provides a 91-times-better failure rate under a 35% droop of Vdd compared with the conventional design.

  • Data Filter Cache with Partial Tag Matching for Low Power Embedded Processor

    Ju Hee CHOI  Jong Wook KWAK  Seong Tae JHANG  Chu Shik JHON  

     
    LETTER-Computer System

    Vol: E97-D No:4
    Page(s): 972-975

    Filter caches have been studied as an energy-efficient solution. They achieve energy savings via selective access to the L1 cache, but severely decrease system performance. A filter cache system should therefore adopt components that balance execution delay against energy savings. In this letter, we analyze the legacy filter cache system and propose the Data Filter Cache with Partial Tag Cache (DFPC) as a new solution. The proposed DFPC scheme reduces the energy consumption of the L1 data cache without impairing system performance at all. Simulation results show that DFPC provides 46.36% energy savings without any performance loss.
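
    The core of partial tag matching can be pictured as follows (the field width is illustrative): most misses are detected from a few low-order tag bits, so the full tag is read only on a partial match.

    PARTIAL_BITS = 4                         # illustrative width
    MASK = (1 << PARTIAL_BITS) - 1

    def tag_match(request_tag, stored_tag):
        if (request_tag & MASK) != (stored_tag & MASK):
            return False                     # cheap early miss detection
        return request_tag == stored_tag     # full compare only on partial match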

  • Improving Cache Partitioning Algorithms for Pseudo-LRU Policies

    Xi ZHANG  Chuanyi LIU  Zhenyu LIU  Dongsheng WANG  

     
    PAPER

    Vol: E96-D No:12
    Page(s): 2514-2523

    As the number of concurrently running applications on chip multiprocessors (CMPs) increases, efficient management of the shared last-level cache (LLC) is crucial for overall performance. Recent studies have shown that cache partitioning can provide benefits in throughput, fairness, and quality of service. Most prior art applies true Least Recently Used (LRU) as the underlying cache replacement policy and relies on its stack property to work properly. However, commodity processors commonly use pseudo-LRU policies without the stack property instead of true LRU, for their simplicity and low storage overhead. This study therefore sets out to understand whether LRU-based cache partitioning techniques can be applied to commodity processors. We propose a cache partitioning mechanism for two popular pseudo-LRU policies: Not Recently Used (NRU) and Binary Tree (BT). Without the help of true LRU's stack property, we propose profiling logic that applies curve-approximation methods to derive the hit curve (hit counts under varied way allocations) for an application. We then propose a hybrid partitioning mechanism, which mitigates the gap between the predicted hit curve and the actual statistics. Simulation results demonstrate that our proposal improves throughput by 15.3% on average and outperforms the stack-estimate proposal by 12.6% on average; similar results are achieved in weighted speedup. For the cache configurations under study, it requires less than 0.5% storage overhead relative to the last-level cache. In addition, we show that a profiling mechanism with only one true-LRU ATD achieves comparable performance and can further reduce the hardware cost by nearly two thirds compared with the hybrid mechanism.
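
    Once the hit curves are available, way partitioning itself is typically a greedy marginal-gain allocation, as in this generic sketch (the paper's contribution is deriving the curves without the stack property):

    def partition(hit_curves, total_ways):
        # hit_curves[a][w] = profiled hit count of application a with w ways.
        alloc = [0] * len(hit_curves)
        for _ in range(total_ways):
            gains = [curve[alloc[a] + 1] - curve[alloc[a]]
                     for a, curve in enumerate(hit_curves)]
            alloc[gains.index(max(gains))] += 1   # grant the most useful way
        return alloc

    # e.g. partition([[0, 50, 60, 65, 67], [0, 30, 55, 70, 80]], 4) -> [1, 3]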

  • Periodic Pattern Coding for Last Level Cache Data Compression

    Haruhiko KANEKO  

     
    PAPER-Data Compression

    Vol: E96-A No:12
    Page(s): 2351-2359

    In spite of the continuous improvement in the computational power of multi/many-core processors, memory access performance has not improved sufficiently, and thus the overall performance of recent processors is often restricted by the delay of off-chip memory accesses. Low-delay data compression for the last-level cache (LLC) would be effective in improving processor performance, because compression increases the effective size of the LLC and thus reduces the number of off-chip memory accesses. This paper proposes a novel data compression method suitable for high-speed parallel decoding in the LLC. Since cache line data often have periodicity of certain lengths, such as 32- or 64-bit instructions, 32-bit integers, and 64-bit floating-point numbers, an information word is encoded as a base pattern and a differential pattern between the original word and the base pattern. Evaluation using a GPU simulator shows that the compression ratio of the proposed coding is comparable to LZSS coding and X-Match Pro, and superior to other conventional compression algorithms for cache memories. This paper also presents an experimental decoder designed for ASIC; the synthesized result shows that the decoder can decompress cache line data of length 32 bytes in four clock cycles. Evaluation of the IPC on the GPU simulator shows that, for several benchmark programs, the IPC achieved by the proposed coding is higher than that of the conventional BΔI coding, with a maximum IPC improvement of 20%.
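
    The base-pattern/differential-pattern encoding can be sketched generically as follows (the paper's actual code words and period detection are more elaborate):

    def encode(words):
        # One word of the line is the base pattern; every word is stored as
        # its XOR difference from it. Periodic data yields many small deltas.
        base = words[0]
        return base, [w ^ base for w in words]

    def decode(base, deltas):
        return [d ^ base for d in deltas]

    # Four words sharing their upper bits produce deltas of only a few bits:
    # encode([0x10000010, 0x10000014, 0x10000018, 0x1000001C])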

  • Region-Based Way-Partitioning on L1 Data Cache for Low Power

    Zhong ZHENG  Zhiying WANG  Li SHEN  

     
    LETTER-Computer System

    Vol: E96-D No:11
    Page(s): 2466-2469

    Power consumption has become a critical factor for embedded systems, especially battery-powered ones. Caches in these systems consume a large portion of the whole chip's power. Embedded systems usually adopt set-associative caches for better performance; however, accessing the cache ways in parallel incurs extra energy dissipation. This paper proposes a region-based way-partitioning scheme that reduces cache way accesses, and thereby cache power consumption, without sacrificing performance. Stack accesses and non-stack accesses are isolated and redirected to different ways of the L1 data cache. Under way-partitioning, cache way accesses are reduced, as is memory reference interference. Experimental results show that the proposed approach saves around 27.5% of L1 data cache energy on average, without significant performance degradation.
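
    Conceptually, the access path adds one cheap region test before way selection, as in this simplified sketch (the stack test and way assignment are illustrative):

    STACK_WAYS, OTHER_WAYS = (0, 1), (2, 3)  # illustrative 4-way split

    def ways_to_activate(addr, sp, stack_limit):
        # Only the ways of the matching region are probed for the lookup,
        # instead of all four ways in parallel.
        in_stack = sp <= addr < stack_limit  # crude stack-region check
        return STACK_WAYS if in_stack else OTHER_WAYS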
