Keyword Search Result

[Keyword] prefetch (18 hits)

Results 1-18 of 18
  • A Metadata Prefetching Mechanism for Hybrid Memory Architectures Open Access

    Shunsuke TSUKADA  Hikaru TAKAYASHIKI  Masayuki SATO  Kazuhiko KOMATSU  Hiroaki KOBAYASHI  

    PAPER
    Publicized: 2021/12/03
    Vol: E105-C No:6
    Page(s): 232-243

    A hybrid memory architecture (HMA) that consists of multiple distinct memory devices is expected to achieve a good balance between high performance and large capacity. Unlike conventional memory architectures, an HMA needs metadata for data management, since data are migrated between the memory devices during the execution of an application. The memory controller caches the metadata to avoid accessing the memory devices for metadata references. However, as the amount of metadata increases in proportion to the size of the HMA, the memory controller must handle a large amount of metadata. As a result, it cannot cache all the metadata, and the number of metadata references grows. This increases the access latency to reach the target data and degrades performance. To solve this problem, this paper proposes a metadata prefetching mechanism for HMAs. The proposed mechanism loads the metadata needed in the near future by prefetching. Moreover, to increase the effectiveness of metadata prefetching, the mechanism predicts the metadata used in the near future based on the address difference between two consecutive access addresses. The evaluation results show that the proposed metadata prefetching mechanism can improve instructions per cycle by up to 44%, and by 9% on average.
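
    As a rough sketch of the address-difference idea (not the paper's exact design), the following toy model predicts the next access as the current address plus the last observed delta and prefetches the corresponding metadata entry; the page size and access stream are made up for illustration:

      #include <cstdint>
      #include <iostream>
      #include <unordered_set>

      int main() {
          std::unordered_set<uint64_t> metadata_cache;  // cached metadata entries
          const uint64_t kPageSize = 4096;              // assumed migration granularity
          uint64_t last = 0;
          bool have_last = false;
          int hits = 0, accesses = 0;
          for (uint64_t addr : {0x1000, 0x3000, 0x5000, 0x7000, 0x9000}) {
              uint64_t meta = addr / kPageSize;         // metadata entry covering this page
              ++accesses;
              if (metadata_cache.count(meta)) ++hits;
              metadata_cache.insert(meta);
              // Predict the next access from the difference between the two most
              // recent addresses and prefetch its metadata entry ahead of time.
              if (have_last) {
                  uint64_t predicted = addr + (addr - last);
                  metadata_cache.insert(predicted / kPageSize);
              }
              last = addr;
              have_last = true;
          }
          std::cout << hits << "/" << accesses << " metadata cache hits\n";  // 3/5
      }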

  • DCUIP Poisoning Attack in Intel x86 Processors

    Youngjoo SHIN  

    LETTER-Dependable Computing
    Publicized: 2021/05/13
    Vol: E104-D No:8
    Page(s): 1386-1390

    Cache prefetching brings large performance benefits, but it comes at the cost of microarchitectural security in processors. In this letter, we take a deep dive into the internal workings of the DCUIP prefetcher, one of the prefetchers in Intel processors. We discover that the DCUIP table is shared among different execution contexts in hyperthreading-enabled processors, which leads to another microarchitectural vulnerability. By exploiting the vulnerability, we propose a DCUIP poisoning attack. We demonstrate that an AES encryption key can be extracted from an AES-NI implementation by mounting the proposed attack.
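
    The letter's key observation, that a prediction table shared between hyperthreads lets one context steer the other's prefetches, can be illustrated with a deliberately simplified toy model; the table layout, indexing, and stride logic below are assumptions for illustration, not the actual DCUIP internals:

      #include <cstdint>
      #include <iostream>
      #include <unordered_map>

      struct Entry { uint64_t last_addr; int64_t stride; };
      std::unordered_map<uint64_t, Entry> shared_table;   // one table for both hyperthreads

      uint64_t access(uint64_t ip, uint64_t addr) {       // returns the prefetched address (0 = none)
          auto it = shared_table.find(ip);
          uint64_t prefetch = 0;
          if (it != shared_table.end()) {
              it->second.stride = int64_t(addr) - int64_t(it->second.last_addr);
              prefetch = addr + it->second.stride;        // stride-based prediction
              it->second.last_addr = addr;
          } else {
              shared_table[ip] = {addr, 0};
          }
          return prefetch;
      }

      int main() {
          // The attacker thread trains the entry for IP 0x42 with its own accesses...
          access(0x42, 0x10000);
          access(0x42, 0x11000);
          // ...so the victim's access at the same (aliased) IP triggers a prefetch
          // to an address influenced by attacker-controlled table state.
          std::cout << std::hex << access(0x42, 0x20000) << "\n";  // 0x2f000
      }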

  • Instruction Prefetch for Improving GPGPU Performance

    Jianli CAO  Zhikui CHEN  Yuxin WANG  He GUO  Pengcheng WANG  

    PAPER-VLSI Design Technology and CAD
    Publicized: 2020/11/16
    Vol: E104-A No:5
    Page(s): 773-785

    Like many processors, GPGPUs suffer from the memory wall. The traditional solutions to this issue are to use efficient schedulers to hide long memory access latency, or to use a data prefetch mechanism to reduce the latency caused by data transfer. In this paper, we study the instruction fetch stage of the GPU pipeline and analyze the relationship between the size of a GPU kernel and the instruction miss rate. We improve the next-line prefetch mechanism to fit the SIMT model of the GPU and determine the optimal parameters of the prefetch mechanism on the GPU through experiments. The experimental results show that the prefetch mechanism can achieve a 12.17% performance improvement on average. Compared with the solution of enlarging the I-cache, the prefetch mechanism benefits more kernels at a lower cost.
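
    A minimal sketch of next-line instruction prefetching, the baseline mechanism the paper adapts to the SIMT model; the cache structure, line size, and prefetch degree here are illustrative stand-ins, not the tuned parameters from the paper:

      #include <iostream>
      #include <unordered_set>

      int main() {
          std::unordered_set<int> icache;      // set of cached instruction lines
          const int kDegree = 2;               // lines prefetched past each fetch (assumed)
          int misses = 0;
          for (int pc = 0; pc < 64; ++pc) {    // straight-line fetch stream
              int line = pc / 8;               // 8 instructions per cache line (assumed)
              if (!icache.count(line)) { ++misses; icache.insert(line); }
              for (int d = 1; d <= kDegree; ++d)
                  icache.insert(line + d);     // next-line prefetch
          }
          std::cout << misses << " I-cache misses\n";  // 1 with prefetch, 8 without
      }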

  • Experimental Verification of SDN/NFV in Integrated mmWave Access and Mesh Backhaul Networks Open Access

    Makoto NAKAMURA  Hiroaki NISHIUCHI  Jin NAKAZATO  Konstantin KOSLOWSKI  Julian DAUBE  Ricardo SANTOS  Gia Khanh TRAN  Kei SAKAGUCHI  

    PAPER-Network
    Publicized: 2020/09/29
    Vol: E104-B No:3
    Page(s): 217-228

    In this paper, a Proof-of-Concept (PoC) architecture is constructed, and the effectiveness of a mmWave overlay heterogeneous network (HetNet) with a mesh backhaul utilizing route multiplexing, combined with Multi-access Edge Computing (MEC) utilizing a prefetching algorithm, is verified by measuring the throughput and the download time of real contents. The architecture can cope with intensive mobile data traffic because data delivery exploits multiple backhaul routes based on the mesh topology, i.e., the route-multiplexing mechanism. In addition, MEC deploys, at the network edge, contents requested in advance by nearby User Equipment (UE) based on pre-registered context information such as location, destination, and requested application; this is called the prefetching algorithm. Therefore, mmWave access can be fully exploited even with capacity-limited backhaul networks. These technologies solve the problems of conventional mmWave HetNets by reducing mobile data traffic on the backhaul networks toward cloud networks. In addition, the proposed architecture is realized by introducing wireless Software Defined Networking (SDN) and Network Function Virtualization (NFV). In our architecture, the network is dynamically controlled via wide-coverage microwave-band links, over which UE context information is collected to optimize network resources and control the network infrastructure to establish backhaul routes and MEC servers. In this paper, we develop the hardware equipment and middleware systems, and introduce these algorithms, implemented on top of an IEEE 802.11ad driver and open source software. For 5G and beyond, an architecture integrating mmWave backhaul, MEC, and SDN/NFV will support a range of scenarios and use cases.
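
    The prefetching algorithm can be sketched as a simple mapping from pre-registered UE context to contents staged at the edge server before the request arrives; the data structures, field names, and content keys below are hypothetical, chosen only to make the idea concrete:

      #include <iostream>
      #include <set>
      #include <string>
      #include <vector>

      struct UeContext { std::string location, destination, app; };  // pre-registered info

      int main() {
          std::set<std::string> edge_cache;                // contents staged at the MEC server
          std::vector<UeContext> nearby = {
              {"stationA", "stadium", "video"},
              {"stationA", "museum", "map"},
          };
          for (const auto& ue : nearby)                    // predict from UE context
              edge_cache.insert(ue.app + ":" + ue.destination);
          // A later request for "video:stadium" is now served from the edge,
          // keeping the capacity-limited backhaul free for other traffic.
          std::cout << (edge_cache.count("video:stadium") ? "edge hit\n" : "edge miss\n");
      }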

  • PMOP: Efficient Per-Page Most-Offset Prefetcher

    Kanghee KIM  Wooseok LEE  Sangbang CHOI  

    PAPER-Computer System
    Publicized: 2019/04/12
    Vol: E102-D No:7
    Page(s): 1271-1279

    Hardware prefetching involves a sophisticated balance between accuracy, coverage, and timeliness while minimizing hardware cost. Recent prefetchers have achieved these goals, but they still require complex hardware and a significant amount of storage. In this paper, we propose an efficient Per-page Most-Offset Prefetcher (PMOP) that minimizes hardware cost and simultaneously improves accuracy while maintaining coverage and timeliness. We achieve these objectives using an enhanced offset prefetcher that performs well with a reasonable hardware cost. Our approach first addresses coverage and timeliness by allowing multiple Most-Offset predictions. To minimize offset interference between pages, the PMOP leverages a fine-grain per-page offset filter. This filter records the access history with page-IDs, which enables efficient mapping and tracking of multiple offset streams from diverse pages. Analysis results show that PMOP outperforms the state-of-the-art Signature Path Prefetcher while reducing storage overhead by a factor of 3.4.
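
    The following toy captures two ingredients of the approach, offset scoring and a per-page filter keyed by page-ID, in heavily simplified form; the candidate offsets, scoring rule, and access stream are illustrative assumptions rather than the PMOP design:

      #include <cstdint>
      #include <iostream>
      #include <map>
      #include <set>

      int main() {
          const int kOffsets[] = {1, 2, 4};               // candidate offsets (illustrative)
          std::map<uint64_t, std::set<int>> page_history; // page-ID -> cache lines touched
          std::map<int, int> score;                       // offset -> score
          // Training: an access to line L scores offset d if line L-d was seen
          // earlier in the SAME page, so streams in different pages cannot
          // interfere with each other's offset selection.
          const uint64_t kPageId = 7;                     // all accesses below hit page 7
          for (int line : {0, 2, 4, 6, 8}) {              // a stride-2 stream in one page
              for (int d : kOffsets)
                  if (page_history[kPageId].count(line - d)) ++score[d];
              page_history[kPageId].insert(line);
          }
          // The highest-scoring offset becomes the prefetch offset for this page.
          int best = 0, best_score = -1;
          for (auto [d, s] : score)
              if (s > best_score) { best = d; best_score = s; }
          std::cout << "best offset = " << best << "\n";  // 2 for this stream
      }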

  • Application Prefetcher Design Using both I/O Reordering and I/O Interleaving

    Yongsoo JOO  Sangsoo PARK  Hyokyung BAHN  

    LETTER-Computer System
    Publicized: 2015/08/20
    Vol: E98-D No:12
    Page(s): 2317-2321

    Application prefetchers improve application launch performance on HDDs through either I/O reordering or I/O interleaving, but there has been no proposal to combine the two techniques. We present a new algorithm to combine both approaches, and demonstrate that it reduces cold start launch time by 50%.

  • A Data Prefetch and Reuse Strategy for Coarse-Grained Reconfigurable Architectures

    Wei GE  Zhi QI  Yue DU  Lu MA  Longxing SHI  

    PAPER-Computer System
    Vol: E96-D No:3
    Page(s): 616-623

    Coarse-Grained Reconfigurable Architectures (CGRAs) have been proposed as a new option for enhancing parallel processing capability. Data transfer throughput between the Reconfigurable Cell Array (RCA) and on-chip local memory is usually the main performance bottleneck of CGRAs. To relieve this bottleneck, we propose a novel data transfer strategy, called Heuristic Data Prefetch and Reuse (HDPR), for the first time in the context of explicit CGRAs. The HDPR strategy provides not only a flexible data access schedule but also the high data throughput needed to realize fast pipelined implementations of various loop kernels. To improve data utilization efficiency, a dual-bank, cache-like data reuse structure is proposed. Furthermore, heuristic data prefetching is introduced to decrease the data access latency. Experimental results demonstrate that, compared with conventional explicit data transfer strategies, our work achieves a significant speedup of 1.73 times on average at the expense of only a 5.86% increase in area.
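
    The dual-bank reuse structure is essentially a double-buffering scheme: while the compute array consumes one bank, the next block of data is prefetched into the other. A minimal sketch under that assumption (the heuristic prefetch itself is simplified to a fixed next-tile fetch):

      #include <iostream>
      #include <vector>

      int main() {
          std::vector<int> banks[2] = {{}, {}};            // two local-memory banks
          auto fetch = [](int tile) { return std::vector<int>(4, tile); };  // stand-in DMA
          banks[0] = fetch(0);                             // preload the first tile
          int sum = 0;
          for (int tile = 0; tile < 4; ++tile) {
              int cur = tile % 2;
              banks[1 - cur] = fetch(tile + 1);            // prefetch next tile into idle bank
              for (int v : banks[cur]) sum += v;           // the "RCA" consumes current bank
          }
          std::cout << sum << "\n";                        // (0+1+2+3) tiles of 4 ints -> 24
      }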

  • An Asynchronous Striping-Aware Readahead Framework for Disk Arrays in Linux

    Sung Hoon BAEK  

    PAPER-Software System
    Vol: E96-D No:1
    Page(s): 19-27

    Disk arrays and prefetching schemes are used to mitigate the performance gap between main memory and disks. This paper presents a new problem that arises when the prefetching schemes widely used in operating systems are applied to disk arrays. The key point of the problem is that the block address space is contiguous from the viewpoint of the host but discontiguous from that of the disk array, so more disk accesses than expected are required. This paper presents two ways to resolve this problem as it arises in the Linux readahead framework. The proposed scheme prevents a readahead window from being split into multiple requests from the viewpoint of the disk array (though it may still be split from the viewpoint of the host), thereby reducing disk head movements. In addition, it outperforms prior work by adopting an asynchronous solution, improving performance for fragmented files, eliminating the readahead size restriction, and improving disk parallelism. We implemented the proposed scheme and integrated it with Linux. Our experiments show that the solution significantly improves the original Linux readahead framework when a storage server processes multiple concurrent requests.
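
    The core issue can be pictured as aligning the readahead window to stripe-unit boundaries so a single window does not straddle units and silently become several disk requests; the stripe size and window below are assumed values for illustration, not the Linux implementation:

      #include <iostream>

      int main() {
          const int kStripeUnit = 128;            // blocks per stripe unit (assumed)
          int start = 100, size = 200;            // requested readahead window
          // The naive window [100, 300) straddles stripe units 0, 1 and 2,
          // producing partial requests; rounding it to unit boundaries makes
          // every per-disk request a whole stripe unit.
          int aligned_start = (start / kStripeUnit) * kStripeUnit;
          int end = start + size;
          int aligned_end = ((end + kStripeUnit - 1) / kStripeUnit) * kStripeUnit;
          std::cout << "window [" << aligned_start << ", " << aligned_end
                    << ") covers " << (aligned_end - aligned_start) / kStripeUnit
                    << " whole stripe units\n";
      }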

  • Using Cacheline Reuse Characteristics for Prefetcher Throttling

    Hidetsugu IRIE  Takefumi MIYOSHI  Goki HONJO  Kei HIRAKI  Tsutomu YOSHINAGA  

    PAPER-Computer Architecture
    Vol: E95-D No:12
    Page(s): 2928-2938

    One of the significant issues in processor architecture is overcoming memory latency. Prefetching can greatly improve cache performance, but it has the drawback of cache pollution unless its aggressiveness is properly set. Several techniques proposed for prefetcher throttling use accuracy as a metric, but their robustness is not sufficient because of variations in programs' working set sizes and cache capacities. In this study, we revisit prefetcher throttling from the viewpoint of data lifetime. Exploiting the characteristics of cache line reuse, we propose Cache-Convection-Control-based Prefetch Optimization Plus (CCCPO+), which enhances the feedback algorithm of our previous CCCPO. Evaluation results show that this novel approach achieved a 30% improvement over no prefetching in the geometric mean of the SPEC CPU 2006 benchmark suite with a 256 KB LLC, 1.8% over the latest prefetcher throttling, and 0.5% over our previous CCCPO. Moreover, it shows superior stability compared to related work, while lowering the hardware cost.
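
    Throttling by reuse fits the general feedback-control shape sketched below: measure how often prefetched lines are actually reused in an interval, then raise or lower the aggressiveness. The thresholds and degree bounds are invented for illustration and are not CCCPO+'s actual algorithm:

      #include <algorithm>
      #include <iostream>

      int main() {
          int degree = 2;                              // current prefetch aggressiveness
          double reuse_rates[] = {0.8, 0.7, 0.2, 0.1}; // measured per interval (made up)
          for (double reuse : reuse_rates) {
              if (reuse > 0.5)      degree = std::min(degree + 1, 8);  // useful: ramp up
              else if (reuse < 0.3) degree = std::max(degree - 1, 0);  // polluting: back off
              std::cout << "reuse " << reuse << " -> degree " << degree << "\n";
          }
      }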

  • RPP: Reference Pattern Based Kernel Prefetching Controller

    Hyo J. LEE  In Hwan DOH  Eunsam KIM  Sam H. NOH  

    LETTER-System Programs
    Vol: E92-D No:12
    Page(s): 2512-2515

    Conventional kernel prefetching schemes have focused on taking advantage of sequential access patterns that are easy to detect. However, it is observed that, on random and even on sequential references, they may cause performance degradation due to inaccurate pattern prediction and overshooting. To address these problems, we propose a novel approach that works with existing kernel prefetching schemes, called Reference Pattern based kernel Prefetching (RPP). RPP reduces the negative effects of existing schemes by identifying one more reference pattern, looping, in addition to the random and sequential patterns, and by delaying the start of prefetching until a pattern is confirmed to be sequential or looping.
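
    A toy version of the pattern-confirmation idea: classify a small window of recent block numbers as sequential, looping, or random, and only a confirmed sequential or looping label would trigger prefetching. The window size and the looping test are simplistic placeholders, not RPP's actual detector:

      #include <deque>
      #include <iostream>
      #include <string>

      std::string classify(const std::deque<int>& w) {
          if (w.size() < 4) return "unknown";           // not enough history yet
          bool seq = true;
          for (int i = 1; i < (int)w.size(); ++i)
              seq = seq && (w[i] == w[i - 1] + 1);
          if (seq) return "sequential";
          if (w.front() == w.back()) return "looping";  // crude revisit check
          return "random";
      }

      int main() {
          std::deque<int> window;
          for (int blk : {10, 11, 12, 13, 14}) {        // a sequential stream
              window.push_back(blk);
              if (window.size() > 4) window.pop_front();
              // Prefetching would start only once the label is non-random.
              std::cout << blk << ": " << classify(window) << "\n";
          }
      }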

  • A Cache Optimized Multidimensional Index in Disk-Based Environments

    Myungsun PARK  Sukho LEE  

    PAPER-Database
    Vol: E88-D No:8
    Page(s): 1932-1939

    R-trees have traditionally been optimized for I/O performance with disk pages as tree nodes. Recently, researchers have proposed cache-conscious variations of R-trees optimized for CPU cache performance in main memory environments, where the node size is several cache lines wide and more entries are packed into a node by compressing MBR keys. However, because there is a big difference between the node sizes of the two types of R-trees, disk-optimized R-trees show poor cache performance, while cache-optimized R-trees exhibit poor disk performance. In this paper, we propose a cache- and disk-optimized R-tree called the PR-tree (Prefetching R-tree). For cache performance, the node size of the PR-tree is wider than a cache line, and the prefetch instruction is used to reduce the number of cache misses. For I/O performance, the nodes of the PR-tree are fitted into one disk page. We present a detailed analysis of cache misses for range queries, and enumerate all the reasonable in-page leaf and nonleaf node sizes and heights of in-page trees to determine the tree parameters for the best cache and I/O performance. The PR-tree achieves better cache performance than the disk-optimized R-tree: a factor of 3.5-15.1 improvement for one-by-one insertions, 6.5-15.1 for deletions, 1.3-1.9 for range queries, and 2.7-9.7 for k-nearest-neighbor queries, with no notable decline in I/O performance.
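
    The cache-side technique, issuing a prefetch for the next cache line of a wide node while scanning the current one, can be sketched with the GCC/Clang __builtin_prefetch intrinsic; the node layout and line size are assumptions, not the PR-tree's actual node format:

      #include <iostream>

      struct Node {            // assumed layout: one node spans several 64-byte lines
          int keys[64];
          int count = 64;
      };

      int search(const Node& n, int key) {
          for (int i = 0; i < n.count; ++i) {
              // Prefetch the next cache line of the node (16 ints = 64 bytes)
              // while comparing keys in this one, hiding part of the miss latency.
              if (i % 16 == 0 && i + 16 < n.count)
                  __builtin_prefetch(&n.keys[i + 16]);
              if (n.keys[i] == key) return i;
          }
          return -1;
      }

      int main() {
          Node n;
          for (int i = 0; i < 64; ++i) n.keys[i] = i * 2;
          std::cout << search(n, 40) << "\n";   // index 20
      }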

  • Improving the Performance of Linux Operating System via Buffer Cache Partitioning and Prefetching

    Heung Seok JEON  Sam H. NOH  

    PAPER-Software Systems
    Vol: E86-D No:3
    Page(s): 616-622

    Buffer caching is an integral part of the operating system. In this paper, we propose a scheme that integrates buffer cache management and prefetching via cache partitioning. The scheme, which we call SA-W2R, is simple to implement, making it a feasible solution for real systems. In its basic form, it uses the LRU policy for buffer replacement; however, its modular design allows any replacement policy to be incorporated. For prefetching, it uses the LRU-One Block Lookahead (LRU-OBL) approach, eliminating the extra burden that is generally necessary in other prefetching approaches. Implementation studies based on the GNU/Linux kernel version 2.2.14 show that SA-W2R performs better than the scheme currently used, with a maximum increase of 23% for the workloads considered.
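
    LRU with One-Block Lookahead is easy to sketch: every access also pulls in the next block. The minimal model below omits SA-W2R's cache partitioning and shows only the LRU-OBL behavior:

      #include <cstddef>
      #include <iostream>
      #include <list>
      #include <unordered_map>

      struct LruCache {
          std::size_t capacity;
          std::list<int> order;                                   // MRU at the front
          std::unordered_map<int, std::list<int>::iterator> pos;
          void touch(int b) {
              if (pos.count(b)) order.erase(pos[b]);
              else if (order.size() >= capacity) {
                  pos.erase(order.back());                        // evict the LRU block
                  order.pop_back();
              }
              order.push_front(b);
              pos[b] = order.begin();
          }
          bool contains(int b) const { return pos.count(b) > 0; }
      };

      int main() {
          LruCache cache{4};
          int hits = 0;
          for (int b : {1, 2, 3, 4, 5}) {        // a sequential read stream
              if (cache.contains(b)) ++hits;
              cache.touch(b);
              cache.touch(b + 1);                // OBL: also bring in the next block
          }
          std::cout << hits << "/5 hits\n";      // 4/5 thanks to the lookahead
      }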

  • Dynamic File Prefetching Scheme Based on File Access Patterns in VIA-Based Parallel File System

    Yoon-Young LEE  Chei-Yol KIM  Dae-Wha SEO  

    PAPER-Computer Systems
    Vol: E85-D No:4
    Page(s): 714-721

    A parallel file system is normally used to support heavy file requests from parallel applications in a cluster system, and prefetching is useful for improving file system performance. This paper proposes a dynamic file prefetching scheme based on file access patterns, named the table-comparison prefetching policy, that is particularly suitable for parallel scientific applications and multimedia web services in a VIA-based parallel file system. VIA relieves the communication overhead of traditional communication protocols such as TCP/IP. The proposed policy introduces a table-comparison method to predict data for prefetching. In addition, it includes an algorithm that determines whether and when prefetching is performed, based on the currently available I/O bandwidth. Experimental results confirm that the proposed prefetching policy in a VIA-based parallel file system produces higher file system performance for various file access patterns.

  • Adaptive Transmission Scheme for Web Prefetching in Wireless Environment

    Ryoichi SHINKUMA  Minoru OKADA  Shozo KOMAKI  

    PAPER-Signal Processing
    Vol: E85-C No:3
    Page(s): 485-491

    This paper proposes an adaptive transmission scheme for web prefetching in wireless communication systems. The proposed scheme controls the modulation format and the error control scheme according to the access probability of the web document being transmitted. In the proposed system, the documents actually requested and those with high access probability are transmitted with a reliable transmission format, while pages whose access probabilities are lower than a certain threshold are transmitted with a bandwidth-efficient transmission format. Computer simulation results show that the proposed scheme drastically improves the latency performance.
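
    The threshold rule from the abstract reduces to a small selection function; the probability threshold and the format descriptions below are placeholders, not the paper's actual modulation and FEC settings:

      #include <iostream>

      const char* format_for(double access_prob, bool actually_requested) {
          const double kThreshold = 0.5;                   // assumed threshold
          // Requested documents and likely-next documents get the reliable
          // format; speculative prefetches get the bandwidth-efficient one.
          if (actually_requested || access_prob >= kThreshold)
              return "reliable (robust modulation, strong error control)";
          return "bandwidth-efficient (dense modulation, light error control)";
      }

      int main() {
          std::cout << format_for(0.9, false) << "\n";   // likely next page
          std::cout << format_for(0.1, false) << "\n";   // speculative prefetch
      }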

  • Adaptive Stride Prefetching for the Secondary Data Cache of UMA and NUMA

    Ando KI  

    PAPER-Computer Systems
    Vol: E83-D No:2
    Page(s): 168-176

    Prefetching is a promising approach to tackling the memory latency problem. Two basic variants of hardware data prefetching are sequential prefetching and stride prefetching. The latter, based on the stride calculation of future references, has the potential to outperform the former, which is based on data locality. In this paper, a typical stride prefetching scheme and an improved version, adaptive stride prefetching, are compared quantitatively using simulation of several parallel benchmark programs, in the context of uniform memory access and non-uniform memory access architectures. The simulation results show that stride adaptability is essential, since the proposed adaptive scheme can reduce the pending stall time, which is large in the typical scheme.
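
    For reference, a classic per-PC stride prefetcher, the baseline that the adaptive scheme improves on, looks roughly like the sketch below; the confidence threshold is an illustrative choice, and the adaptivity studied in the paper is omitted:

      #include <cstdint>
      #include <iostream>
      #include <unordered_map>

      struct StrideEntry { uint64_t last; int64_t stride; int confidence; };

      int main() {
          std::unordered_map<uint64_t, StrideEntry> table;   // keyed by load PC
          uint64_t pc = 0x400100;                            // one load instruction
          for (uint64_t addr : {1000, 1016, 1032, 1048}) {   // stride-16 stream
              auto& e = table[pc];
              int64_t d = int64_t(addr) - int64_t(e.last);
              if (d == e.stride) ++e.confidence;             // same stride seen again
              else { e.stride = d; e.confidence = 0; }       // retrain on a new stride
              e.last = addr;
              if (e.confidence >= 2)                         // stride confirmed
                  std::cout << "prefetch " << addr + e.stride << "\n";  // 1064
          }
      }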

  • A Simple Hardware Prefetching Scheme Using Sequentiality for Shared-Memory Multiprocessors

    Myoung Kwon TCHEUN  Seung Ryoul MAENG  Jung Wan CHO  

    PAPER-Computer Hardware and Design
    Vol: E80-D No:11
    Page(s): 1055-1063

    To reduce memory access latency on shared-memory multiprocessors, several prefetching schemes have been proposed. The sequential prefetching scheme is a simple hardware-controlled scheme that exploits the sequentiality of memory accesses to predict which blocks will be read in the near future. Aggressive sequential prefetching fetches many blocks on each miss to reduce the miss rate, and performs well for application programs with high sequentiality. Conservative sequential prefetching, on the other hand, fetches only a few blocks on each miss to avoid prefetching useless blocks, and shows better performance than aggressive sequential prefetching for application programs with low sequentiality. We analyze the relationship between the sequentiality of application programs and the effectiveness of sequential prefetching under various memory and network latencies, and propose a new adaptive sequential prefetching scheme. By simply adding a small table to the sequential prefetching scheme, the proposed scheme prefetches a large number of blocks for application programs with high sequentiality, reducing the miss rate significantly, and prefetches a small number of blocks for application programs with low sequentiality, avoiding the loading of useless blocks.
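
    The adaptive idea, fetch many blocks when recent accesses look sequential and few when they do not, can be sketched with a saturating counter; the counter bounds and the degree mapping are invented for illustration, not the paper's table-based design:

      #include <iostream>

      int main() {
          int seq_counter = 0;                 // saturating measure of recent sequentiality
          int last_block = -2;
          for (int block : {10, 11, 12, 40, 41, 90}) {
              bool sequential = (block == last_block + 1);
              seq_counter += sequential ? 1 : -1;
              if (seq_counter < 0) seq_counter = 0;        // saturate at the bottom...
              if (seq_counter > 3) seq_counter = 3;        // ...and at the top
              int degree = 1 << seq_counter;   // prefetch 1 to 8 blocks on a miss
              std::cout << "block " << block << " -> prefetch degree " << degree << "\n";
              last_block = block;
          }
      }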

  • High Speed DRAMs with Innovative Architectures

    Shigeo OHSHIMA  Tohru FURUYAMA  

    INVITED PAPER-DRAM
    Vol: E77-C No:8
    Page(s): 1303-1315

    This paper introduces newly developed high-speed DRAMs and describes their innovative circuit techniques for achieving high data bandwidth: the synchronous DRAM, the cache DRAM, and the Rambus DRAM. They are all designed to fill the performance gap between MPUs and the main memory of computer systems, which will widen in the '90s. Although these high-speed DRAMs share the goal of increasing data bandwidth, their approaches to accomplishing it differ, which in turn leads to distinct advantages, disadvantages, and fields of application. The paper is intended not only to give a technical overview but also to serve as a guide for DRAM users in choosing the best-fitting device for their systems.

  • Performance Evaluation of a Processing Element for an On-Chip Multiprocessor

    Masafumi TAKAHASHI  Hiroshige FUJII  Emi KANEKO  Takeshi YOSHIDA  Toshinori SATO  Hiroyuki TAKANO  Haruyuki TAGO  Seigo SUZUKI  Nobuyuki GOTO  

    PAPER
    Vol: E77-C No:7
    Page(s): 1092-1100

    A 250-MIPS, 125-MFLOPS peak-performance processing element (PE), which is being developed for an on-chip multiprocessor, has been modeled and evaluated. The PE includes the following new architectural components: an FPU shared by several IUs to increase the efficiency of the FPU pipelines, an on-chip data cache with a prefetch mechanism to reduce clock cycles spent waiting for memory, and an interface to high-speed DRAM such as Rambus DRAM and synchronous DRAM. As a result, a PE model with an FPU shared by four or eight IUs incurs only a 10% performance reduction compared to a model with an unshared FPU, while saving the cost of three FPUs. Furthermore, a PE model with prefetching operates 1.2 to 1.8 times faster than a model without it at a 250-MHz clock rate when Rambus DRAM is connected. These results show that this PE architecture can deliver high effective performance at over 250 MHz and is cost-effective for the on-chip multiprocessor.