Keyword Search Result

[Keyword] cache (201 hits)

Showing 1-20 of 201 hits

  • An Efficient Reference Image Sharing Method for the Image-Division Parallel Video Encoding Architecture

    Ken NAKAMURA  Yuya OMORI  Daisuke KOBAYASHI  Koyo NITTA  Kimikazu SANO  Masayuki SATO  Hiroe IWASAKI  Hiroaki KOBAYASHI  

     
    PAPER | Publicized: 2022/11/29 | Vol: E106-C No:6 | Page(s): 312-320

    This paper proposes an efficient reference image sharing method for the image-division parallel video encoding architecture. The method reduces the amount of data transfer by combining pre-transfer based on area prediction with on-demand transfer managed through a transfer management table. Experimental results show that data transfer can be reduced to 19.8-35.3% of that of the conventional method on average without major degradation of coding performance, making it possible to reduce the required bandwidth of the inter-chip transfer interface.

  • Data Covert Channels between the Secure World and the Normal World in the ARM TrustZone Architecture

    Haehyun CHO  

     
    LETTER | Publicized: 2022/07/28 | Vol: E105-D No:11 | Page(s): 1925-1927

    The ARM TrustZone architecture, which provides hardware-assisted isolation, is widely adopted in mobile and IoT devices. The security of ARM TrustZone relies on the idea of splitting system-on-chip hardware and software into two worlds, namely the normal world and the secure world. There are legitimate channels at the hardware level that the normal world and the secure world can use to communicate with each other. To protect these channels from being abused, research efforts have been invested in restricting access to these channels from normal world components, so that only predefined and legitimate normal world components can use cross-world communication channels. In this work, we present a study on data covert channels that can bypass such protection mechanisms and smuggle sensitive information. We first analyze the causes of noise in the covert channel between the two worlds. Then, we evaluate the accuracy and bandwidth of covert channels built with our PRIME+COUNT method against one built with the PRIME+PROBE method. Our results demonstrate that PRIME+COUNT is an effective technique for enabling cross-world covert channels in the ARM TrustZone.
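
    As a reader's aid, the following toy Python model sketches the counting idea behind a PRIME+COUNT-style channel under heavy abstraction: a dictionary stands in for cache-set eviction, and the set count and names are ours, not the paper's implementation.

      S = 64        # number of monitored cache sets (illustrative)
      owner = {}    # set index -> which world touched the set last

      def prime():
          # Receiver (e.g., the normal world) fills every monitored set.
          for idx in range(S):
              owner[idx] = "receiver"

      def send(value):
          # Sender encodes a value in [0, S] by touching that many sets,
          # evicting the receiver's primed lines from them.
          for idx in range(value):
              owner[idx] = "sender"

      def count():
          # Receiver recovers the value as one aggregate count of evicted
          # sets; counting, unlike per-line timing, tolerates noise better.
          return sum(1 for idx in range(S) if owner[idx] != "receiver")

      prime(); send(37); assert count() == 37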

  • A Two-Level Cache Aware Adaptive Data Replication Mechanism for Shared LLC

    Qianqian WU  Zhenzhou JI  

     
    LETTER-Computer System | Publicized: 2022/03/25 | Vol: E105-D No:7 | Page(s): 1320-1324

    The shared last level cache (SLLC) in tiled chip multiprocessors (TCMP) provides a low off-chip miss rate, but it incurs a long on-chip access latency. In the two-level cache hierarchy, data replication stores replicas of L1 victims in the local LLC (L2 cache) to obtain a short local LLC access latency on subsequent accesses. Many data replication mechanisms have been proposed, but they do not consider L1 victim reuse behaviors and LLC replica reception capability together: they either produce many useless replicas or increase LLC pressure, which limits the improvement in system performance. In this paper, we propose a two-level cache aware adaptive data replication mechanism (TCDR), which controls replication based on both L1 victim reuse behavior prediction and LLC replica reception capability monitoring. TCDR not only increases the accuracy of L1 replica selection, but also avoids the pressure of replication on the LLC. The results show that TCDR improves system performance with reasonable hardware overhead.
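
    To make the two conditions concrete, here is a minimal sketch of such a replication filter, assuming a saturating-counter reuse predictor and a free-space monitor; all thresholds and names are illustrative, not the paper's.

      from collections import defaultdict

      reuse_score = defaultdict(int)     # block address -> saturating reuse score
      MAX_SCORE, REUSE_THRESHOLD = 3, 2
      MIN_FREE_FRACTION = 0.25           # LLC must have this much replica headroom

      def on_l1_hit(addr):
          # Train the predictor: blocks re-referenced in L1 earn reuse credit.
          reuse_score[addr] = min(reuse_score[addr] + 1, MAX_SCORE)

      def should_replicate(victim_addr, llc_free_fraction):
          # Replicate an L1 victim into the local LLC only when both
          # conditions hold, avoiding useless replicas and LLC pressure.
          predicted_reuse = reuse_score[victim_addr] >= REUSE_THRESHOLD
          can_receive = llc_free_fraction >= MIN_FREE_FRACTION
          return predicted_reuse and can_receive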

  • A Conflict-Aware Capacity Control Mechanism for Deep Cache Hierarchy

    Jiaheng LIU  Ryusuke EGAWA  Hiroyuki TAKIZAWA  

     
    PAPER-Computer System | Publicized: 2022/03/09 | Vol: E105-D No:6 | Page(s): 1150-1163

    As the number of cores on a processor increases, cache hierarchies contain more cache levels and a larger last level cache (LLC). Thus, the power and energy consumption of the cache hierarchy become non-negligible. Meanwhile, because the cache usage behaviors of individual applications can differ, it is possible to achieve higher energy efficiency of the computing system by determining the appropriate cache configuration for each application. This paper proposes a cache control mechanism that improves energy efficiency by adjusting the cache hierarchy to each application. Our mechanism first bypasses and disables a less-significant cache level, then partially disables the LLC, and finally adjusts the associativity if the application suffers from a large number of conflict misses. The mechanism achieves significant energy savings at the cost of a small performance degradation. The evaluation results show that our mechanism improves energy efficiency by 23.9% and 7.0% on average over the baseline and the cache-level bypassing mechanisms, respectively. In addition, even when LLC resource contention occurs, the proposed mechanism remains effective for improving energy efficiency.
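
    The stepwise policy could be summarized by a decision function of the following shape; this is a sketch with illustrative thresholds, and the paper's actual conditions and ordering details may differ.

      def next_action(l2_utility, llc_occupancy, conflict_miss_ratio):
          """Return the next reconfiguration step, or None to keep the setup."""
          if l2_utility < 0.1:
              return "bypass_and_disable_L2"   # step 1: drop a low-benefit level
          if llc_occupancy < 0.5:
              return "disable_half_LLC_ways"   # step 2: partially disable the LLC
          if conflict_miss_ratio > 0.3:
              return "increase_associativity"  # step 3: curb conflict misses
          return None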

  • NFD.P4: NDN In-Networking Cache Implementation Scheme with P4

    Saifeng HOU  Yuxiang HU  Le TIAN  Zhiguang DANG  

     
    LETTER-Information Network | Publicized: 2021/12/27 | Vol: E105-D No:4 | Page(s): 820-823

    This work proposes NFD.P4, a cache implementation scheme for Named Data Networking (NDN), to solve the problem of insufficient cache space on programmable switches and enable the practical application of NDN. We transplant the cache function of NDN.P4 to an NDN Forwarding Daemon (NFD) cache server, which replaces the limited memory space of the programmable switch.

  • Fogcached: A DRAM/NVMM Hybrid KVS Server for Edge Computing

    Kouki OZAWA  Takahiro HIROFUCHI  Ryousei TAKANO  Midori SUGAYA  

     
    PAPER | Publicized: 2021/08/18 | Vol: E104-D No:12 | Page(s): 2089-2096

    With the development of IoT devices and sensors, edge computing is leading towards new services like autonomous cars and smart cities. Low-latency data access is an essential requirement for such services, and a large-capacity cache server is needed on the edge side. However, it is not realistic to build a large-capacity cache server using only DRAM, because DRAM is expensive and consumes substantial power. A hybrid main memory system, in which main memory consists of DRAM and non-volatile memory, is promising for addressing this issue: it achieves a large main memory capacity within the power supply capabilities of current servers. In this paper, we propose Fogcached, an extension of a widely-used KVS (Key-Value Store) server program (i.e., Memcached) that exploits both DRAM and non-volatile main memory (NVMM). We used Intel Optane DCPM as the NVMM for its prototype. Fogcached implements a Dual-LRU (Least Recently Used) mechanism that seamlessly extends the memory management of Memcached to hybrid main memory. Fogcached reuses the segmented LRU of Memcached to manage cached objects in DRAM, adds another segmented LRU for those in DCPM, and bridges the two LRUs with a mechanism that automatically replaces cached objects between DRAM and DCPM. Cached objects are autonomously moved between the two memory devices according to their access frequencies. Through experiments, we confirmed that Fogcached improved the peak value of the latency distribution by about 40% compared to Memcached.
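
    A minimal Python sketch of a Dual-LRU is given below, assuming two OrderedDict-based tiers with demotion on DRAM eviction and promotion on NVMM hit; capacities and names are illustrative, and Memcached's segmented LRUs are simplified to plain LRUs here.

      from collections import OrderedDict

      class DualLRU:
          # Toy model: recently used objects live in a small DRAM tier;
          # objects evicted from DRAM are demoted to a larger NVMM tier
          # and promoted back to DRAM when accessed again.
          def __init__(self, dram_cap, nvmm_cap):
              self.dram, self.nvmm = OrderedDict(), OrderedDict()
              self.dram_cap, self.nvmm_cap = dram_cap, nvmm_cap

          def get(self, key):
              if key in self.dram:
                  self.dram.move_to_end(key)     # refresh DRAM LRU position
                  return self.dram[key]
              if key in self.nvmm:
                  return self.put(key, self.nvmm.pop(key))  # promote to DRAM
              return None

          def put(self, key, value):
              self.nvmm.pop(key, None)           # avoid duplicates across tiers
              self.dram[key] = value
              self.dram.move_to_end(key)
              if len(self.dram) > self.dram_cap:
                  old_key, old_val = self.dram.popitem(last=False)
                  self.nvmm[old_key] = old_val   # demote DRAM's LRU object
                  if len(self.nvmm) > self.nvmm_cap:
                      self.nvmm.popitem(last=False)  # evict NVMM's LRU object
              return value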

  • Fogcached-Ros: DRAM/NVMM Hybrid KVS Server with ROS Based Extension for ROS Application and SLAM Evaluation

    Koki HIGASHI  Yoichi ISHIWATA  Takeshi OHKAWA  Midori SUGAYA  

     
    PAPER | Publicized: 2021/08/18 | Vol: E104-D No:12 | Page(s): 2097-2108

    Recently, edge servers, which are located closer to users than the cloud, have come to be expected to process the large amounts of sensor data generated by IoT devices such as robots. Applying KVS (Key-Value Store) technology to the edge has been proposed as a way to obtain high responsiveness from a cache server. Above all, a hybrid-KVS server that uses both DRAM and NVMM (Non-Volatile Main Memory) devices is expected to achieve both responsiveness and reliability. However, its effectiveness has not been verified in actual applications, nor is it clear in terms of its relationship with the cloud. The purpose of this study is to evaluate the effectiveness of hybrid-KVS servers using SLAM (Simultaneous Localization and Mapping), a widely used application in robotics and autonomous driving that is well suited to edge servers and requires both responsiveness and reliability. SLAM is generally implemented on ROS (Robot Operating System) middleware and communicates with the server through ROS middleware. However, if a hybrid-KVS is used at the edge with SLAM and ROS, communication cannot be achieved, since the message objects differ from the format expected by the KVS. Therefore, in this research, we propose a mechanism that applies ROS memory objects to the hybrid-KVS by designing and implementing a data serialization function that extends ROS. Our evaluation of the proposed Fogcached-ROS confirms its effectiveness: low API overhead, support for the data used by SLAM, and a small latency difference between the edge and the cloud.

  • Mitigating Congestion with Explicit Cache Placement Notification for Adaptive Video Streaming over ICN

    Rei NAKAGAWA  Satoshi OHZAHATA  Ryo YAMAMOTO  Toshihiko KATO  

     
    PAPER-Information Network | Publicized: 2021/06/18 | Vol: E104-D No:9 | Page(s): 1406-1419

    Recently, information-centric networking (ICN) has attracted attention because delivering cached content from routers' cache storage improves quality of service (QoS) by reducing redundant traffic. Adaptive video streaming has accordingly been applied to ICN to improve the client's quality of experience (QoE). However, in previous approaches to cache control, a router implicitly caches the content requested by a user for other users who may request the same content subsequently. As a result, these approaches cannot use the cache effectively to improve the client's QoE, because the cached contents are not always requested by other users. In addition, since previous cache control does not consider the network congestion state, the adaptive bitrate (ABR) algorithm works incorrectly and causes congestion, and QoE then degrades unnecessarily. In this paper, we propose explicit cache placement notification for congestion-aware adaptive video streaming over ICN (CASwECPN) to mitigate congestion. CASwECPN provides explicit feedback according to congestion detection in the routers on the communication path. While congestion is detected, the router caches the requested content in its cache storage and explicitly notifies the client that the requested content is cached (explicit cache placement and notification) to mitigate congestion quickly. The client then retrieves the explicitly cached content from the congestion-detecting router following the general procedures of ICN. Simulation experiments show that CASwECPN improves both QoS and the client's QoE in adaptive video streaming that adjusts the bitrate for every video segment download. As a result, CASwECPN uses routers' cache storage more effectively than conventional cache control policies.
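
    The router-side behavior can be sketched as follows; this is an illustrative Python model, and the class, threshold, and notification flag are ours, standing in for the paper's ICN forwarding machinery.

      CONGESTION_THRESHOLD = 0.8   # queue occupancy ratio treated as congestion

      class Router:
          def __init__(self, queue_capacity=100):
              self.cache = {}
              self.queue_len, self.queue_cap = 0, queue_capacity

          def congested(self):
              return self.queue_len / self.queue_cap > CONGESTION_THRESHOLD

          def on_data(self, name, data):
              # Returning Data packet: under congestion, cache it and set a
              # flag so the client learns the content is now cached nearby.
              notify = self.congested()
              if notify:
                  self.cache[name] = data   # explicit cache placement
              return data, notify           # notify == explicit notification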

  • Instruction Prefetch for Improving GPGPU Performance

    Jianli CAO  Zhikui CHEN  Yuxin WANG  He GUO  Pengcheng WANG  

     
    PAPER-VLSI Design Technology and CAD | Publicized: 2020/11/16 | Vol: E104-A No:5 | Page(s): 773-785

    Like many processors, GPGPUs suffer from the memory wall. The traditional solutions to this issue are to use efficient schedulers to hide long memory access latency, or to use data prefetch mechanisms to reduce the latency caused by data transfer. In this paper, we study the instruction fetch stage of the GPU pipeline and analyze the relationship between the capacity of a GPU kernel and the instruction miss rate. We improve the next-line prefetch mechanism to fit the SIMT model of the GPU and determine the optimal parameters of the prefetch mechanism on the GPU through experiments. The experimental results show that the prefetch mechanism can achieve a 12.17% performance improvement on average. Compared with the solution of enlarging the I-cache, the prefetch mechanism has the advantages of more beneficiaries and lower cost.
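
    For intuition, a toy model of next-line instruction prefetching is shown below; capacity and eviction are abstracted away, and the prefetch distance stands in for the parameters the paper tunes experimentally.

      class ICache:
          # Toy next-line prefetcher: capacity and eviction are abstracted away.
          def __init__(self, depth=1):
              self.lines, self.misses = set(), 0
              self.depth = depth               # prefetch distance (tuned in the paper)

          def fetch(self, line):
              if line not in self.lines:
                  self.misses += 1
                  self.lines.add(line)
              for d in range(1, self.depth + 1):
                  self.lines.add(line + d)     # prefetch the next lines

      cache = ICache(depth=2)
      for line in range(100):                  # a sequential fetch stream
          cache.fetch(line)
      assert cache.misses == 1                 # only the very first fetch misses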

  • Packet Processing Architecture with Off-Chip Last Level Cache Using Interleaved 3D-Stacked DRAM Open Access

    Tomohiro KORIKAWA  Akio KAWABATA  Fujun HE  Eiji OKI  

     
    PAPER-Network System | Publicized: 2020/08/06 | Vol: E104-B No:2 | Page(s): 149-157

    The performance of packet processing applications depends on the memory access speed of network systems. Table lookup, one of the most common processes in various packet processing applications, requires fast memory access and can be a dominant performance bottleneck. Therefore, in Network Function Virtualization (NFV)-aware environments, the fast on-chip cache memories of a general-purpose CPU become critical to achieving high packet processing speeds of over tens of Gbps. Moreover, in carrier network systems, multiple types of applications and complex applications are executed simultaneously, which requires adequate cache memory capacity as well. In this paper, we propose a packet processing architecture that utilizes interleaved 3-Dimensional (3D)-stacked Dynamic Random Access Memory (DRAM) devices as an off-chip Last Level Cache (LLC), in addition to the several levels of dedicated cache memories of each CPU core. Entries of a lookup table are distributed across every bank and vault to exploit both bank interleaving and vault-level memory parallelism. Frequently accessed entries in the 3D-stacked DRAM are also cached in the on-chip dedicated cache memories of each CPU core. The evaluation results show that the proposed architecture reduces memory access latency by 57% and increases throughput by 100%, while reducing the blocking probability by about 10%, compared to an architecture with a shared on-chip LLC. These results indicate that 3D-stacked DRAM can be practical as an off-chip LLC in parallel packet processing systems.
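
    The entry-distribution idea can be sketched as a simple placement hash; the geometry and hash below are illustrative, and the paper's actual mapping may differ.

      NUM_VAULTS, NUM_BANKS = 16, 8            # illustrative device geometry

      def place(key: int):
          h = (key * 2654435761) & 0xFFFFFFFF  # simple multiplicative hash
          vault = h % NUM_VAULTS               # spread across vaults (parallelism)
          bank = (h // NUM_VAULTS) % NUM_BANKS # interleave across banks in a vault
          return vault, bank

    Spreading consecutive lookups over different vaults and banks lets independent table accesses proceed in parallel instead of serializing on one bank.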

  • RPC: An Approach for Reducing Compulsory Misses in Packet Processing Cache

    Hayato YAMAKI  Hiroaki NISHI  Shinobu MIWA  Hiroki HONDA  

     
    PAPER-Information Network | Publicized: 2020/09/07 | Vol: E103-D No:12 | Page(s): 2590-2599

    We propose a technique to reduce compulsory misses in the packet processing cache (PPC), which largely affect both the throughput and energy consumption of core routers. Rather than prefetching data, our technique, called response prediction cache (RPC), speculatively stores predicted data in the PPC without additional accesses to the low-throughput and power-consuming memory (i.e., TCAM). RPC predicts the data related to a response flow at the arrival of the corresponding request flow, based on the request-response model of internet communications. Our experimental results with 11 real-network traces show that RPC can reduce the PPC miss rate by 13.4% upstream and 47.6% downstream on average, assuming a three-layer PPC. Moreover, we extend RPC to adaptive RPC (A-RPC), which selects whether to use RPC in each direction within a core router for further reduction in PPC misses. Finally, we show that A-RPC achieves 1.38x table-lookup throughput with 74% of the energy consumption per packet compared to conventional PPC.
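
    The prediction step can be sketched as follows: on a PPC miss for a request flow, install both the looked-up entry and a predicted entry for the reverse flow. predict_result() is a hypothetical placeholder for deriving the response flow's processing data; it is not the paper's API.

      def reverse(flow):
          src, dst, sport, dport, proto = flow          # 5-tuple
          return (dst, src, dport, sport, proto)

      def predict_result(request_result):
          # Hypothetical placeholder: derive the response flow's processing
          # data from the request's, per the request-response model.
          return request_result

      def on_request_miss(ppc, tcam_lookup, flow):
          result = tcam_lookup(flow)                    # slow, power-hungry TCAM access
          ppc[flow] = result
          ppc[reverse(flow)] = predict_result(result)   # speculative entry, no TCAM access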

  • Clustering for Interference Alignment with Cache-Enabled Base Stations under Limited Backhaul Links

    Junyao RAN  Youhua FU  Hairong WANG  Chen LIU  

     
    PAPER-Wireless Communication Technologies | Publicized: 2019/12/25 | Vol: E103-B No:7 | Page(s): 796-803

    We propose clustered interference alignment for MIMO interference channels in which the backhaul link capacity is limited and base stations are cache-enabled, for the case where the number of Tx-Rx pairs exceeds the feasibility constraint of interference alignment. We optimize clustering with the soft cluster size constraint algorithm by adding a cluster size balancing process. In addition, the CSI overhead is quantified as a system performance indicator along with the average throughput. Simulation results show that the cluster size balancing algorithm generates clusters that are more balanced, and attains higher long-term throughput, than the soft cluster size constraint algorithm. The long-term throughput is further improved under high SNR by reallocating the capacity of the backhaul links based on the clustering results.

  • Analysis on Hybrid SSD Configuration with Emerging Non-Volatile Memories Including Quadruple-Level Cell (QLC) NAND Flash Memory and Various Types of Storage Class Memories (SCMs)

    Yoshiki TAKAI  Mamoru FUKUCHI  Chihiro MATSUI  Reika KINOSHITA  Ken TAKEUCHI  

     
    PAPER-Integrated Electronics | Vol: E103-C No:4 | Page(s): 171-180

    This paper analyzes optimal SSD configurations including emerging non-volatile memories such as quadruple-level cell (QLC) NAND flash memory [1] and storage class memories (SCMs). First, SSD performance and SSD endurance lifetime are evaluated for four hybrid configurations: 1) single-level cell (SLC)/QLC NAND flash, 2) SCM/QLC NAND flash, 3) SCM/triple-level cell (TLC)/QLC NAND flash, and 4) SCM/TLC NAND flash. Furthermore, these four configurations are compared under a limited cost. For cold workloads or a high total SSD cost assumption, the SCM/TLC NAND flash hybrid configuration is recommended for both SSD performance and endurance lifetime. For hot workloads with a low total SSD cost assumption, however, the SLC/QLC NAND flash hybrid configuration is recommended, with emphasis on SSD endurance lifetime. Under the same conditions, the SCM/TLC/QLC NAND flash tri-hybrid is the best configuration for SSD performance considering cost. In particular, for prxy_0 (a write-hot workload), the SCM/TLC/QLC NAND flash tri-hybrid achieves 67% higher IOPS/cost than the SCM/TLC NAND flash hybrid. Moreover, the configurations with the highest IOPS/cost for each workload and cost limit are picked up and analyzed with various types of SCMs. For all cases except prxy_1 with a high total SSD cost assumption, a middle-end SCM (write latency: 1 µs, read latency: 1 µs) is recommended for performance considering cost. However, for prxy_1 (a read-hot workload) with a high total SSD cost assumption, a high-end SCM (write latency: 100 ns, read latency: 100 ns) achieves the best performance.

  • Compiler Software Coherent Control for Embedded High Performance Multicore

    Boma A. ADHI  Tomoya KASHIMATA  Ken TAKAHASHI  Keiji KIMURA  Hironori KASAHARA  

     
    PAPER | Vol: E103-C No:3 | Page(s): 85-97

    The advancement of multicore technology has made processors with hundreds or even thousands of cores on a single chip possible. However, at such scales, a hardware-based cache coherency mechanism becomes overwhelmingly complicated, hot, and expensive. Therefore, we propose a software coherence scheme managed by a parallelizing compiler for shared-memory multicore systems without a hardware cache coherence mechanism. Our proposed method is simple and efficient, and is built into the OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes coarse-grain tasks, analyzes stale data and line sharing in the program, and then solves those problems by simple program restructuring and data synchronization. Using our proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, the NAS Parallel Benchmarks (NPB), and MediaBench II. The compiled binaries are then run on the Renesas RP2, an 8-core SH-4A processor, and on a custom 8-core Altera Nios II system on an Altera Arria 10 FPGA. The cache coherence hardware on the RP2 processor is only available for up to 4 cores, and it can be turned off for a non-coherent cache mode. The Nios II multicore system has no hardware cache coherence mechanism, so running a parallel program is difficult without compiler support. The proposed method performed as well as or better than the hardware cache coherence scheme while still providing the same correct results as the hardware coherence mechanism. This method allows a massive array of shared-memory CPU cores in an HPC setting, or a simple non-coherent multicore embedded CPU, to be easily programmed. For example, on the RP2 processor, the proposed software-controlled non-coherent-cache (NCC) method gave us a 2.6-times speedup for SPEC 2000 "equake" with 4 cores against sequential execution, whereas 4-core MESI hardware coherence control achieved only a 2.5-times speedup. The software coherence control also gave us a 4.4-times speedup on 8 cores, where no hardware coherence mechanism is available.
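
    Conceptually, the compiler-inserted control at a synchronization point behaves like the following model (ours, not OSCAR's generated code): write back locally modified lines, then self-invalidate lines that the stale-data analysis marks as possibly updated by other cores.

      class Core:
          # Model of one core's private cache under software-managed coherence.
          def __init__(self, shared_mem):
              self.cache, self.dirty, self.shared = {}, set(), shared_mem

          def write(self, line, value):
              self.cache[line] = value
              self.dirty.add(line)

          def read(self, line):
              if line not in self.cache:
                  self.cache[line] = self.shared.get(line)  # refill from shared memory
              return self.cache[line]

          def sync(self, possibly_stale):
              # Compiler-inserted synchronization point:
              for line in self.dirty:
                  self.shared[line] = self.cache[line]      # write back local updates
              self.dirty.clear()
              for line in possibly_stale:
                  self.cache.pop(line, None)                # self-invalidate; refetch later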

  • Distributed Key-Value Storage for Edge Computing and Its Explicit Data Distribution Method

    Takehiro NAGATO  Takumi TSUTANO  Tomio KAMADA  Yumi TAKAKI  Chikara OHTA  

     
    PAPER-Network | Publicized: 2019/08/05 | Vol: E103-B No:1 | Page(s): 20-31

    In this article, we propose a data framework for edge computing that allows developers to easily attain efficient data transfer between mobile devices or users. We propose a distributed key-value storage platform for edge computing and an explicit data distribution management method that follows the publish/subscribe relationships specific to applications. In this platform, edge servers organize the distributed key-value storage in a uniform namespace. To enable fast data access to a record in edge computing, the allocation strategy of the record and its caches on the edge servers is important. Our platform offers distributed objects that can dynamically change their home server and allocate cache objects proactively following user-defined rules. A rule is defined in a declarative manner and specifies where to place cache objects depending on the status of the target record and its associated records. The system reflects record modifications in the cached records immediately. We also integrate a push notification system using WebSocket to notify clients of events on a specified table. We introduce a messaging service application for mobile appliances and several other applications to show how cache rules apply to them, and we evaluate the performance of our system using these sample applications.
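
    A declarative rule of the kind described might look as follows; the schema and field names are hypothetical, purely to illustrate placing a replica near a user and refreshing it on updates.

      # All field names below are hypothetical, purely for illustration.
      inbox_rule = {
          "table": "inbox",
          "replicate_when": lambda rec, ctx: rec["user"] in ctx["attached_users"],
          "place_at":       lambda rec, ctx: ctx["edge_server_of"][rec["user"]],
          "on_update":      "refresh_replicas",   # reflect modifications immediately
      }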

  • On-Chip Cache Architecture Exploiting Hybrid Memory Structures for Near-Threshold Computing

    Hongjie XU  Jun SHIOMI  Tohru ISHIHARA  Hidetoshi ONODERA  

     
    PAPER | Vol: E102-A No:12 | Page(s): 1741-1750

    This paper focuses on the power-area trade-off axis of memory systems. In contrast to the power-performance-area trade-offs applied to traditional high-performance caches, this paper focuses on the edge processing environment, which is becoming increasingly important in the Internet of Things (IoT) era. A new power-oriented trade-off is proposed for on-chip cache architecture. As a case study, this paper exploits the good energy efficiency of Standard-Cell Memory (SCM) operating in the near-threshold voltage region and the good area efficiency of Static Random Access Memory (SRAM). A hybrid 2-level on-chip cache structure is first introduced as a replacement for the 6T-SRAM L0 cache to save energy. This paper proposes a method for finding the capacity combination of SCM and SRAM that minimizes the energy consumption of the hybrid cache under a given cache area constraint. Simulation results using a 65-nm process technology show that up to 80% of energy consumption is reduced, without increasing the die area, by replacing the conventional SRAM instruction cache with the hybrid 2-level cache. The results show that energy consumption can be reduced if the area constraint for the proposed hybrid cache system is smaller than the area equivalent to an 8 kB SRAM, and that energy reduction can be achieved if the target operating frequency is less than 100 MHz, which implies that the proposed cache system is suitable for low-power systems requiring a moderate processing speed.
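
    The capacity search reduces to a small constrained minimization; below is a sketch with placeholder area and energy models (callables supplied by the user, not the paper's 65-nm characterization).

      def best_split(splits, area_of, energy_of, area_budget):
          # splits: candidate (SCM kB, SRAM kB) pairs; area_of/energy_of are
          # placeholder models standing in for technology characterization.
          feasible = [(energy_of(s), s) for s in splits if area_of(s) <= area_budget]
          return min(feasible, default=None)   # (energy, split) or None if infeasible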

  • Cross-VM Cache Timing Attacks on Virtualized Network Functions

    Youngjoo SHIN  

     
    LETTER-Information Network | Publicized: 2019/05/27 | Vol: E102-D No:9 | Page(s): 1874-1877

    Network function virtualization (NFV) achieves flexible network service provisioning by using virtualization technology. However, NFV is exposed to a serious security threat known as cross-VM cache timing attacks. In this letter, we look into the real security impact of such attacks on network virtualization. Specifically, we present two kinds of practical cache timing attacks on virtualized firewalls and routers. We also propose countermeasures to mitigate such attacks on virtualized network functions.

  • On Scaling Property of Information-Centric Networking

    Ryo NAKAMURA  Hiroyuki OHSAKI  

     
    PAPER | Publicized: 2019/03/22 | Vol: E102-B No:9 | Page(s): 1804-1812

    In this paper, we focus on a large-scale ICN (Information-Centric Networking) and reveal its scaling property. Because of in-network content caching, an ICN is a kind of cache network and is expected to be a promising architecture for the future Internet. To realize a global-scale (e.g., Internet-scale) ICN, it is crucial to understand the fundamental properties of such large-scale cache networks. However, the scaling property of ICN has not been well understood, due to the lack of theoretical foundations and analysis methodologies. To answer research questions regarding the scaling property of ICN, we derive the cache hit probability at each router, the average content delivery delay of each entity, and the average content delivery delay of all entities over a content distribution tree comprised of a single repository (i.e., content provider), multiple routers, and multiple entities (i.e., content consumers). Through several numerical examples, we investigate the effects of the topology and size of the content distribution tree and of the cache size at routers on the average content delivery delay of all entities. Our findings include that the average content delivery delay of an ICN converges to a constant value if the cache size of routers is not small, which implies high scalability, and that even if the network grows indefinitely, the average content delivery delay is upper-bounded by a constant value if the routers in the network are provided with a fair amount of content cache.
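
    As a reader's aid, the following is a minimal sketch, with our own symbols (not the paper's), of the kind of recursion such a derivation typically rests on. Let $p_k$ be the cache hit probability at the $k$-th router on the path from an entity to the repository, $d_k$ the delivery delay when the content is first found at router $k$, and $d_{\mathrm{rep}}$ the delay when only the repository serves it. Assuming independent hit events across levels,

        E[D] = \sum_{l=1}^{L} d_l \, p_l \prod_{k=1}^{l-1} (1 - p_k) + d_{\mathrm{rep}} \prod_{k=1}^{L} (1 - p_k).

    If the $p_k$ stay bounded away from zero, the tail products decay geometrically, so $E[D]$ approaches a constant as the path length $L$ grows, which is consistent with the scalability finding stated above.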

  • Cefore: Software Platform Enabling Content-Centric Networking and Beyond Open Access

    Hitoshi ASAEDA  Atsushi OOKA  Kazuhisa MATSUZONO  Ruidong LI  

     
    INVITED PAPER | Publicized: 2019/03/22 | Vol: E102-B No:9 | Page(s): 1792-1803

    Information-Centric or Content-Centric Networking (ICN/CCN) is a promising novel network architecture that naturally integrates in-network caching, multicast, and multipath capabilities, without relying on centralized application-specific servers. Software platforms are vital for researching ICN/CCN; however, existing platforms lack a focus on extensibility and lightweight implementation. In this paper, we introduce a newly developed software platform enabling CCN, named Cefore. In brief, Cefore is lightweight, with the ability to run even on top of a resource-constrained device, but is also easily extensible with arbitrary plugin libraries or external software implementations. For large-scale experiments, a network emulator (Cefore-Emu) and network simulator (Cefore-Sim) have also been developed for this platform. Both Cefore-Emu and Cefore-Sim support hybrid experimental environments that incorporate physical networks into the emulated/simulated networks. In this paper, we describe the design, specification, and usage of Cefore as well as Cefore-Emu and Cefore-Sim. We show performance evaluations of in-network caching and streaming on Cefore-Emu and content fetching on Cefore-Sim, verifying the salient features of the Cefore software platform.

  • PMOP: Efficient Per-Page Most-Offset Prefetcher

    Kanghee KIM  Wooseok LEE  Sangbang CHOI  

     
    PAPER-Computer System | Publicized: 2019/04/12 | Vol: E102-D No:7 | Page(s): 1271-1279

    Hardware prefetching involves a sophisticated balance between accuracy, coverage, and timeliness while minimizing hardware cost. Recent prefetchers have achieved these goals, but they still require complex hardware and a significant amount of storage. In this paper, we propose an efficient Per-page Most-Offset Prefetcher (PMOP) that minimizes hardware cost while improving accuracy and maintaining coverage and timeliness. We achieve these objectives with an enhanced offset prefetcher that performs well at a reasonable hardware cost. Our approach first addresses coverage and timeliness by allowing multiple most-offset predictions. To minimize offset interference between pages, PMOP leverages a fine-grain per-page offset filter. This filter records the access history with page IDs, which enables efficient mapping and tracking of multiple offset streams from diverse pages. Analysis results show that PMOP outperforms the state-of-the-art Signature Path Prefetcher while reducing storage overhead by a factor of 3.4.
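
    A toy model of the per-page most-offset idea is sketched below; the scoring scheme and names are illustrative, and real hardware would use small tagged tables with saturating counters rather than unbounded dictionaries.

      from collections import defaultdict

      class PMOPModel:
          # Toy model of per-page most-offset prefetching.
          def __init__(self, top_k=2):
              self.last = {}                                      # page -> last offset seen
              self.score = defaultdict(lambda: defaultdict(int))  # page -> delta -> score
              self.top_k = top_k                                  # multiple most-offset predictions

          def access(self, page, offset):
              if page in self.last:
                  delta = offset - self.last[page]
                  if delta != 0:
                      self.score[page][delta] += 1                # learn this page's strides
              self.last[page] = offset
              best = sorted(self.score[page].items(), key=lambda kv: -kv[1])[:self.top_k]
              return [offset + delta for delta, _ in best]        # offsets to prefetch

    Keying the history by page ID is what keeps offset streams from different pages from interfering with one another.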

Showing 1-20 of 201 hits