Shunsuke TSUKADA Hikaru TAKAYASHIKI Masayuki SATO Kazuhiko KOMATSU Hiroaki KOBAYASHI
A hybrid memory architecture (HMA) that consists of several distinct memory devices is expected to achieve a good balance between high performance and large capacity. Unlike conventional memory architectures, an HMA needs metadata for data management because data are migrated between the memory devices during the execution of an application. The memory controller caches the metadata so that it does not have to access the memory devices on every metadata reference. However, as the amount of metadata increases in proportion to the size of the HMA, the memory controller has to handle a large amount of metadata. As a result, the memory controller cannot cache all the metadata, and the number of metadata references increases. This lengthens the access latency to reach the target data and degrades performance. To solve this problem, this paper proposes a metadata prefetching mechanism for HMAs. The proposed mechanism loads the metadata needed in the near future by prefetching. Moreover, to increase the effectiveness of metadata prefetching, the proposed mechanism predicts the metadata used in the near future based on an address difference, i.e., the difference between two consecutive access addresses. The evaluation results show that the proposed metadata prefetching mechanism improves instructions per cycle by up to 44%, and by 9% on average.
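As a rough illustration of address-difference-based metadata prediction, the following Python sketch tracks the delta between two consecutive access addresses and prefetches the metadata entry of the segment the next access is expected to hit. The segment size, the MetadataPrefetcher class, and the set-based metadata cache are our own assumptions for illustration, not the paper's design.

```python
# Minimal sketch of address-difference-based metadata prefetching for an HMA.
# Assumptions (not from the paper): metadata is kept per fixed-size segment,
# and the metadata cache is modeled as a simple set of segment IDs.

SEGMENT_SIZE = 4096          # bytes of data covered by one metadata entry (assumed)

class MetadataPrefetcher:
    def __init__(self):
        self.prev_addr = None    # last access address seen
        self.md_cache = set()    # segment IDs whose metadata is currently cached

    def segment_of(self, addr):
        return addr // SEGMENT_SIZE

    def access(self, addr):
        seg = self.segment_of(addr)
        hit = seg in self.md_cache
        if not hit:
            self.md_cache.add(seg)          # demand fetch of the metadata
        if self.prev_addr is not None:
            delta = addr - self.prev_addr   # address difference of two consecutive accesses
            predicted = self.segment_of(addr + delta)
            self.md_cache.add(predicted)    # prefetch metadata for the predicted segment
        self.prev_addr = addr
        return hit

if __name__ == "__main__":
    pf = MetadataPrefetcher()
    stream = [0x1000 * i for i in range(8)]          # strided access stream
    hits = sum(pf.access(a) for a in stream)
    print(f"metadata cache hits: {hits}/{len(stream)}")
```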
Cache prefetching brings large performance benefits, but it comes at the cost of microarchitectural security in processors. In this letter, we dive deep into the internal workings of the DCUIP prefetcher, one of the prefetchers equipped in Intel processors. We discover that the DCUIP table is shared among different execution contexts in hyperthreading-enabled processors, which leads to another microarchitectural vulnerability. By exploiting this vulnerability, we propose a DCUIP poisoning attack. We demonstrate that an AES encryption key can be extracted from an AES-NI implementation by mounting the proposed attack.
Jianli CAO Zhikui CHEN Yuxin WANG He GUO Pengcheng WANG
Like many processors, GPGPUs suffer from the memory wall. The traditional solutions to this issue are to use efficient schedulers to hide long memory access latency or to use a data prefetch mechanism to reduce the latency caused by data transfer. In this paper, we study the instruction fetch stage of the GPU pipeline and analyze the relationship between the capacity of a GPU kernel and the instruction miss rate. We improve the next-line prefetch mechanism to fit the SIMT model of the GPU and determine the optimal parameters of the prefetch mechanism through experiments. The experimental results show that the prefetch mechanism achieves a 12.17% performance improvement on average. Compared with enlarging the I-cache, the prefetch mechanism has the advantages of more beneficiaries and lower cost.
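The next-line idea can be sketched on a toy instruction-cache model: on every demand fetch, the line following the current one is also installed. The cache geometry, the ICache class, and the straight-line instruction stream below are our own assumptions, not the paper's simulator.

```python
# Minimal sketch of next-line instruction prefetching on a simple I-cache model.
# Assumptions (ours, not the paper's): a fully associative LRU cache of whole lines,
# with one next line prefetched on every demand fetch.

from collections import OrderedDict

LINE_SIZE = 32          # bytes per instruction cache line (assumed)
CACHE_LINES = 64        # capacity in lines (assumed)

class ICache:
    def __init__(self, prefetch_next=True):
        self.lines = OrderedDict()      # LRU order: oldest first
        self.prefetch_next = prefetch_next
        self.misses = 0

    def _install(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)
            return
        if len(self.lines) >= CACHE_LINES:
            self.lines.popitem(last=False)      # evict the LRU line
        self.lines[line] = True

    def fetch(self, pc):
        line = pc // LINE_SIZE
        if line in self.lines:
            self.lines.move_to_end(line)
        else:
            self.misses += 1
            self._install(line)
        if self.prefetch_next:
            self._install(line + 1)             # prefetch the next sequential line

if __name__ == "__main__":
    for pf in (False, True):
        cache = ICache(prefetch_next=pf)
        for pc in range(0, 8192, 4):            # straight-line kernel larger than the cache
            cache.fetch(pc)
        print(f"prefetch={pf}: misses={cache.misses}")
```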
Makoto NAKAMURA Hiroaki NISHIUCHI Jin NAKAZATO Konstantin KOSLOWSKI Julian DAUBE Ricardo SANTOS Gia Khanh TRAN Kei SAKAGUCHI
In this paper, a Proof-of-Concept (PoC) architecture is constructed, and the effectiveness of a mmWave overlay heterogeneous network (HetNet) with a mesh backhaul utilizing route multiplexing and Multi-access Edge Computing (MEC) utilizing a prefetching algorithm is verified by measuring the throughput and the download time of real contents. The architecture can cope with intensive mobile data traffic because data delivery exploits multiple backhaul routes based on the mesh topology, i.e., the route-multiplexing mechanism. On the other hand, MEC deploys to the network edge, in advance, the contents expected to be requested by nearby User Equipment (UE), based on pre-registered context information such as location, destination, and demanded application; this is called the prefetching algorithm. Therefore, mmWave access can be fully exploited even with capacity-limited backhaul networks by introducing the proposed algorithm. These technologies solve the problems of conventional mmWave HetNets by reducing mobile data traffic on the backhaul networks toward cloud networks. In addition, the proposed architecture is realized by introducing a wireless Software Defined Network (SDN) and Network Function Virtualization (NFV). In our architecture, the network is dynamically controlled via wide-coverage microwave band links, through which UE context information is collected to optimize network resources and to control the network infrastructure that establishes backhaul routes and MEC servers. In this paper, we develop the hardware equipment and middleware systems and implement these algorithms using an IEEE 802.11ad driver and open-source software. For 5G and beyond, the architecture integrating mmWave backhaul, MEC, and SDN/NFV will support a variety of scenarios and use cases.
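A purely illustrative sketch of the context-based prefetching idea is given below: content whose application and location match the pre-registered context of nearby UEs is pushed to the edge cache first, within a capacity budget. The field names, the scoring rule, and the catalog format are hypothetical; the paper's actual algorithm and PoC implementation are not reproduced here.

```python
# Hypothetical sketch of context-based content prefetching at an MEC server.
# All names and the scoring rule are our own illustration of the idea of placing
# content near UEs based on pre-registered context information.

def prefetch_plan(ue_contexts, catalog, capacity):
    """Pick which contents to push to the edge cache.

    ue_contexts: list of dicts with 'destination' and 'demand_app' fields.
    catalog: dict mapping content_id -> (app, location, size).
    capacity: edge cache capacity in the same size units as the catalog.
    """
    scores = {}
    for cid, (app, location, size) in catalog.items():
        # Count UEs whose registered context matches this content.
        matches = sum(1 for ue in ue_contexts
                      if ue["demand_app"] == app and ue["destination"] == location)
        if matches:
            scores[cid] = matches
    plan, used = [], 0
    for cid in sorted(scores, key=scores.get, reverse=True):
        size = catalog[cid][2]
        if used + size <= capacity:
            plan.append(cid)
            used += size
    return plan

if __name__ == "__main__":
    ues = [{"destination": "station_A", "demand_app": "video"},
           {"destination": "station_A", "demand_app": "video"},
           {"destination": "station_B", "demand_app": "map"}]
    catalog = {"clip1": ("video", "station_A", 300),
               "map_B": ("map", "station_B", 100),
               "clip2": ("video", "station_C", 300)}
    print(prefetch_plan(ues, catalog, capacity=350))
```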
Kanghee KIM Wooseok LEE Sangbang CHOI
Hardware prefetching involves a sophisticated balance between accuracy, coverage, and timeliness while minimizing hardware cost. Recent prefetchers have achieved these goals, but they still require complex hardware and a significant amount of storage. In this paper, we propose an efficient Per-page Most-Offset Prefetcher (PMOP) that minimizes hardware cost and simultaneously improves accuracy while maintaining coverage and timeliness. We achieve these objectives using an enhanced offset prefetcher that performs well with a reasonable hardware cost. Our approach first addresses coverage and timeliness by allowing multiple Most-Offset predictions. To minimize offset interference between pages, the PMOP leverages a fine-grain per-page offset filter. This filter records the access history with page-IDs, which enables efficient mapping and tracking of multiple offset streams from diverse pages. Analysis results show that PMOP outperforms the state-of-the-art Signature Path Prefetcher while reducing storage overhead by a factor of 3.4.
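The following sketch is loosely inspired by the idea of tracking offset streams per page-ID: access history is recorded per page, candidate offsets are scored against that history, and the best-scoring ("most") offset is used for prefetching. The table layout, candidate list, and scoring are our own simplification, not the PMOP design.

```python
# Simplified sketch of per-page offset prefetching (not the actual PMOP hardware).

PAGE_SIZE = 64                      # cache lines per page (assumed)
CANDIDATE_OFFSETS = [1, 2, 3, 4, 8]

class PerPageOffsetPrefetcher:
    def __init__(self):
        self.history = {}           # page_id -> recent line offsets within that page
        self.scores = {off: 0 for off in CANDIDATE_OFFSETS}

    def access(self, line_addr):
        page, off = divmod(line_addr, PAGE_SIZE)
        recent = self.history.setdefault(page, [])
        # Score each candidate: did (current offset - candidate) appear in this page's history?
        for cand in CANDIDATE_OFFSETS:
            if off - cand in recent:
                self.scores[cand] += 1
        recent.append(off)
        if len(recent) > 16:
            recent.pop(0)
        best = max(self.scores, key=self.scores.get)   # the current "most offset"
        return line_addr + best                        # line address to prefetch

if __name__ == "__main__":
    pf = PerPageOffsetPrefetcher()
    for a in [p * PAGE_SIZE + i * 2 for p in range(2) for i in range(10)]:   # stride-2 in two pages
        pf.access(a)
    print("learned best offset:", max(pf.scores, key=pf.scores.get))
```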
Yongsoo JOO Sangsoo PARK Hyokyung BAHN
Application prefetchers improve application launch performance on HDDs through either I/O reordering or I/O interleaving, but there has been no proposal to combine the two techniques. We present a new algorithm to combine both approaches, and demonstrate that it reduces cold start launch time by 50%.
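A purely illustrative sketch of combining the two ideas is shown below: the blocks of a launch trace are reordered by disk address to cut seek time, then split into batches so that prefetch I/O for the next batch can be interleaved with CPU work on the previous one. The trace format, batch sizing, and function name are assumptions, not the paper's algorithm.

```python
# Illustrative sketch: combine I/O reordering and I/O interleaving for app launch.

def build_prefetch_schedule(launch_trace):
    """launch_trace: list of (request_time, block_address) from a cold-start launch."""
    # I/O reordering: issue blocks in ascending disk address order to reduce seeks.
    ordered_blocks = sorted(addr for _, addr in launch_trace)
    # I/O interleaving: split the ordered list into batches; the next batch is
    # prefetched while the CPU processes data from the previous one.
    batch_size = max(1, len(ordered_blocks) // 4)
    return [ordered_blocks[i:i + batch_size]
            for i in range(0, len(ordered_blocks), batch_size)]

if __name__ == "__main__":
    trace = [(0.01, 900), (0.02, 100), (0.05, 500), (0.06, 120), (0.09, 480)]
    for i, batch in enumerate(build_prefetch_schedule(trace)):
        print(f"batch {i}: {batch}")
```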
Wei GE Zhi QI Yue DU Lu MA Longxing SHI
Coarse-Grained Reconfigurable Architectures (CGRAs) have been proposed as a new choice for enhancing parallel processing capability. Data transfer throughput between the Reconfigurable Cell Array (RCA) and on-chip local memory is usually the main performance bottleneck of CGRAs. To relieve this bottleneck, we propose a novel data transfer strategy called Heuristic Data Prefetch and Reuse (HDPR), for the first time in the case of explicit CGRAs. The HDPR strategy provides not only a flexible data access schedule but also the high data throughput needed to realize fast pipelined implementations of various loop kernels. To improve data utilization efficiency, a dual-bank, cache-like data reuse structure is proposed. Furthermore, heuristic data prefetching is introduced to decrease the data access latency. Experimental results demonstrate that, compared with conventional explicit data transfer strategies, our work achieves a significant average speedup of 1.73x at the expense of only a 5.86% increase in area.
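The dual-bank idea can be pictured as double buffering: while the compute stage consumes data from one bank, the next tile is prefetched into the other bank. The tile size, the placeholder "compute" step, and the function below are assumptions for illustration, not the HDPR implementation.

```python
# Minimal sketch of dual-bank (double-buffered) data prefetch and reuse for a loop kernel.

TILE = 4    # elements per tile (assumed)

def process_with_double_buffering(data):
    banks = [None, None]
    results = []
    tiles = [data[i:i + TILE] for i in range(0, len(data), TILE)]
    if not tiles:
        return results
    banks[0] = tiles[0]                          # initial prefetch into bank 0
    for t in range(len(tiles)):
        cur = t % 2
        if t + 1 < len(tiles):
            banks[1 - cur] = tiles[t + 1]        # prefetch the next tile into the other bank
        results.extend(x * x for x in banks[cur])   # "compute" on the current bank
    return results

if __name__ == "__main__":
    print(process_with_double_buffering(list(range(10))))
```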
Disk arrays and prefetching schemes are used to mitigate the performance gap between main memory and disks. This paper presents a new problem that arises when prefetching schemes that are widely used in operating systems are applied to disk arrays. The key point is that the block address space is contiguous from the viewpoint of the host but discontiguous from that of the disk array, so more disk accesses than expected are required. This paper presents two ways to resolve the problem, which arises from the Linux readahead framework. The proposed scheme prevents a readahead window from being split into multiple requests from the viewpoint of the disk array, though not from the viewpoint of the host, thereby reducing disk head movements. In addition, it outperforms the prior work by adopting an asynchronous solution, improving performance for fragmented files, eliminating the readahead size restriction, and improving disk parallelism. We implemented the proposed scheme and integrated it with Linux. Our experiments show that the solution significantly improves the original Linux readahead framework when a storage server processes multiple concurrent requests.
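The contiguity mismatch can be illustrated with a toy striped-array model: a readahead window that is contiguous for the host may straddle a chunk boundary and therefore touch more disks than expected, while a window trimmed to the chunk boundary stays on one disk. The array geometry and the alignment rule below are our own simplified illustration of the problem, not the paper's actual readahead modification.

```python
# Toy model of why host-contiguous readahead maps to multiple disk requests on a
# striped array, and of trimming the window to a chunk boundary (assumed parameters).

CHUNK_BLOCKS = 64        # blocks per stripe chunk (assumed)
NUM_DISKS = 4            # disks in the array (assumed)

def disks_touched(start_block, length):
    """Which physical disks serve a host-contiguous request [start, start+length)?"""
    chunks = range(start_block // CHUNK_BLOCKS,
                   (start_block + length - 1) // CHUNK_BLOCKS + 1)
    return sorted({c % NUM_DISKS for c in chunks})

def align_window(start_block, length):
    """Trim a readahead window so it does not cross a chunk boundary."""
    end_of_chunk = (start_block // CHUNK_BLOCKS + 1) * CHUNK_BLOCKS
    return start_block, min(length, end_of_chunk - start_block)

if __name__ == "__main__":
    start, length = 60, 32                      # window straddling a chunk boundary
    print("unaligned window touches disks:", disks_touched(start, length))
    a_start, a_len = align_window(start, length)
    print("aligned window", (a_start, a_len), "touches disks:", disks_touched(a_start, a_len))
```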
Hidetsugu IRIE Takefumi MIYOSHI Goki HONJO Kei HIRAKI Tsutomu YOSHINAGA
One of the significant issues in processor architecture is overcoming memory latency. Prefetching can greatly improve cache performance, but it has the drawback of cache pollution unless its aggressiveness is properly set. Several techniques proposed for prefetcher throttling use accuracy as a metric, but their robustness was not sufficient because of variations in programs' working set sizes and cache capacities. In this study, we revisit prefetcher throttling from the viewpoint of data lifetime. Exploiting the characteristics of cache line reuse, we propose Cache-Convection-Control-based Prefetch Optimization Plus (CCCPO+), which enhances the feedback algorithm of our previous CCCPO. Evaluation results showed that this novel approach achieved a 30% improvement over no prefetching in the geometric mean of the SPEC CPU 2006 benchmark suite with a 256 KB LLC, 1.8% over the latest prefetcher throttling, and 0.5% over our previous CCCPO. Moreover, it showed superior stability compared to related works while lowering the hardware cost.
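The general shape of feedback-based throttling can be sketched as follows: the prefetch degree is raised when prefetched lines are reused before eviction and lowered when they are evicted unused. The thresholds, interval length, and controller class are assumptions for illustration; this is not the CCCPO+ algorithm itself.

```python
# Minimal sketch of feedback-based prefetcher throttling (not CCCPO+ itself).

class ThrottleController:
    def __init__(self, min_degree=0, max_degree=4):
        self.degree = 2                 # current prefetch aggressiveness
        self.min_degree = min_degree
        self.max_degree = max_degree
        self.useful = 0                 # prefetched lines reused before eviction
        self.useless = 0                # prefetched lines evicted without reuse

    def record_eviction(self, was_prefetched, was_reused):
        if not was_prefetched:
            return
        if was_reused:
            self.useful += 1
        else:
            self.useless += 1
        if self.useful + self.useless >= 64:        # end of a feedback interval
            accuracy = self.useful / (self.useful + self.useless)
            if accuracy > 0.75:
                self.degree = min(self.max_degree, self.degree + 1)
            elif accuracy < 0.40:
                self.degree = max(self.min_degree, self.degree - 1)
            self.useful = self.useless = 0

if __name__ == "__main__":
    ctl = ThrottleController()
    for i in range(128):                            # mostly useless prefetches
        ctl.record_eviction(was_prefetched=True, was_reused=(i % 8 == 0))
    print("degree after a low-accuracy phase:", ctl.degree)
```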
Hyo J. LEE In Hwan DOH Eunsam KIM Sam H. NOH
Conventional kernel prefetching schemes have focused on taking advantage of sequential access patterns that are easy to detect. However, on random and even sequential references, they may cause performance degradation due to inaccurate pattern prediction and overshooting. To address these problems, we propose a novel approach that works with existing kernel prefetching schemes, called Reference Pattern based kernel Prefetching (RPP). RPP reduces the negative effects of existing schemes by identifying one more reference pattern, namely looping, in addition to the random and sequential patterns, and by delaying the start of prefetching until a pattern is confirmed to be sequential or looping.
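The classification-then-prefetch idea can be sketched as follows: the recent reference stream of a file is classified as sequential, looping, or random, and a prefetch target is produced only for the first two. The detection rules, window size, and function names are our own assumptions, not the RPP design.

```python
# Minimal sketch of classifying a per-file reference stream and prefetching only
# once a sequential or looping pattern is confirmed (detection rules assumed).

def classify(blocks, window=8):
    """Return (pattern, period) for the most recent references to one file."""
    if len(blocks) < window:
        return "unknown", None                    # not enough history: do not prefetch yet
    recent = blocks[-window:]
    if all(b - a == 1 for a, b in zip(recent, recent[1:])):
        return "sequential", None
    for period in range(1, window // 2 + 1):      # looping: window repeats with a short period
        if all(recent[i] == recent[i + period] for i in range(window - period)):
            return "looping", period
    return "random", None

def next_prefetch(blocks, window=8):
    pattern, period = classify(blocks, window)
    if pattern == "sequential":
        return blocks[-1] + 1                     # next sequential block
    if pattern == "looping":
        return blocks[-window:][window - period]  # next block of the repeating cycle
    return None                                   # random or unconfirmed: no prefetch

if __name__ == "__main__":
    print(next_prefetch(list(range(10))))             # sequential stream -> 10
    print(next_prefetch([5, 6, 7] * 4))               # looping stream    -> 5
    print(next_prefetch([9, 1, 30, 4, 17, 2, 44, 8])) # random stream     -> None
```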
R-trees have traditionally been optimized for I/O performance with disk pages as tree nodes. Recently, researchers have proposed cache-conscious variations of R-trees optimized for CPU cache performance in main memory environments, where the node size is several cache lines wide and more entries are packed into a node by compressing MBR keys. However, because there is a big difference between the node sizes of the two types of R-trees, disk-optimized R-trees show poor cache performance while cache-optimized R-trees exhibit poor disk performance. In this paper, we propose a cache- and disk-optimized R-tree, called the PR-tree (Prefetching R-tree). For cache performance, the node size of the PR-tree is wider than a cache line, and the prefetch instruction is used to reduce the number of cache misses. For I/O performance, the nodes of the PR-tree are fitted into one disk page. We present a detailed analysis of cache misses for range queries, and enumerate all the reasonable in-page leaf and nonleaf node sizes and heights of in-page trees to determine the tree parameters that yield the best cache and I/O performance. The PR-tree achieves better cache performance than the disk-optimized R-tree: a factor of 3.5-15.1 improvement for one-by-one insertions, 6.5-15.1 for deletions, 1.3-1.9 for range queries, and 2.7-9.7 for k-nearest neighbor queries. None of the experimental results shows a notable decline in I/O performance.
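The parameter enumeration mentioned above can be sketched as a small search: candidate node sizes (multiples of a cache line) and in-page tree heights are enumerated, and only combinations whose complete in-page tree fits in one disk page are kept. The cache-line, page, and entry sizes below are assumptions, not the paper's values or its cache-miss model.

```python
# Minimal sketch of enumerating in-page node sizes and tree heights that fit in
# one disk page (all constants assumed for illustration).

CACHE_LINE = 64          # bytes (assumed)
DISK_PAGE = 8192         # bytes (assumed)
ENTRY_SIZE = 20          # bytes per compressed MBR entry (assumed)
HEADER = 16              # bytes of per-node header (assumed)

def fanout(node_bytes):
    return (node_bytes - HEADER) // ENTRY_SIZE

def feasible_configs(max_height=4):
    configs = []
    for node_lines in range(1, DISK_PAGE // CACHE_LINE + 1):
        node_bytes = node_lines * CACHE_LINE
        f = fanout(node_bytes)
        if f < 2:
            continue
        for height in range(1, max_height + 1):
            # Total nodes of a complete in-page tree with this height and fanout.
            nodes = sum(f ** level for level in range(height))
            if nodes * node_bytes <= DISK_PAGE:
                configs.append((node_bytes, height, f, nodes * node_bytes))
    return configs

if __name__ == "__main__":
    for node_bytes, height, f, used in feasible_configs():
        print(f"node={node_bytes}B height={height} fanout={f} page_used={used}B")
```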
Buffer caching is an integral part of the operating system. In this paper, we propose a scheme that integrates buffer cache management and prefetching via cache partitioning. The scheme, which we call SA-W2R, is simple to implement, making it a feasible solution in real systems. In its basic form, it uses the LRU policy for buffer replacement. However, its modular design allows any replacement policy to be incorporated into the scheme. For prefetching, it uses the LRU-One Block Lookahead (LRU-OBL) approach, eliminating the extra burden that is generally necessary in other prefetching approaches. Implementation studies based on the GNU/Linux kernel version 2.2.14 show that SA-W2R performs better than the scheme currently used, with a maximum increase of 23% for the workloads considered.
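The LRU-OBL part can be sketched as LRU replacement plus one-block lookahead on every reference. The partition sizing of SA-W2R is not modeled; the capacity, class name, and two-sweep workload are assumptions for illustration.

```python
# Minimal sketch of LRU replacement with One-Block-Lookahead (OBL) prefetching
# in a fixed-size buffer cache partition (SA-W2R's partitioning is not modeled).

from collections import OrderedDict

class LruOblCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()      # LRU order: oldest first
        self.hits = self.misses = 0

    def _insert(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)     # evict the LRU block
        self.blocks[block] = True

    def reference(self, block):
        if block in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block)
        else:
            self.misses += 1
            self._insert(block)
        self._insert(block + 1)                 # OBL: prefetch the next block

if __name__ == "__main__":
    cache = LruOblCache(capacity=16)
    for b in list(range(32)) + list(range(32)):   # two sequential sweeps over a file
        cache.reference(b)
    print(f"hits={cache.hits} misses={cache.misses}")
```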
Yoon-Young LEE Chei-Yol KIM Dae-Wha SEO
A parallel file system is normally used to support excessive file requests from parallel applications in a cluster system, whereas prefetching is useful for improving file system performance. This paper proposes a dynamic file prefetching scheme based on file access patterns, named the table-comparison prefetching policy, that is particularly suitable for parallel scientific applications and multimedia web services in a VIA-based parallel file system. VIA relieves the communication overhead of traditional communication protocols such as TCP/IP. The proposed policy introduces a table-comparison method to predict the data to prefetch. In addition, it includes an algorithm that determines whether and when prefetching is performed based on the currently available I/O bandwidth. Experimental results confirm that the proposed prefetching policy in a VIA-based parallel file system yields higher file system performance for various file access patterns.
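The two ingredients named above can be sketched as: compare the current access history against stored pattern tables from past runs to predict the next blocks, and issue the prefetch only if enough I/O bandwidth is currently free. The table format, matching rule, and bandwidth model are our own assumptions, not the paper's policy.

```python
# Illustrative sketch of table-comparison prediction gated by available I/O bandwidth.

def predict_next(history, pattern_tables, lookahead=2):
    """history: recent block numbers; pattern_tables: lists of past access sequences."""
    n = len(history)
    for table in pattern_tables:
        for i in range(len(table) - n - lookahead + 1):
            if table[i:i + n] == history:                    # table matches current history
                return table[i + n:i + n + lookahead]        # predicted next blocks
    return []

def maybe_prefetch(history, pattern_tables, free_bandwidth, cost_per_block=1.0):
    blocks = predict_next(history, pattern_tables)
    if blocks and free_bandwidth >= cost_per_block * len(blocks):
        return blocks          # issue the prefetch now
    return []                  # no match, or I/O bandwidth is saturated: skip or defer

if __name__ == "__main__":
    tables = [[0, 4, 8, 12, 16, 20, 24], [1, 3, 5, 7, 9]]
    print(maybe_prefetch([8, 12, 16], tables, free_bandwidth=5.0))   # -> [20, 24]
    print(maybe_prefetch([8, 12, 16], tables, free_bandwidth=0.5))   # -> []
```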
Ryoichi SHINKUMA Minoru OKADA Shozo KOMAKI
This paper proposes an adaptive transmission scheme for web prefetching in wireless communication systems. The proposed scheme controls the modulation format and the error control scheme according to the access probability of the web document being transmitted. In the proposed system, the actually requested documents and the documents with high access probability are transmitted with a reliable transmission format, while the pages whose access probabilities are lower than a certain threshold are transmitted with a bandwidth-efficient transmission format. Computer simulation results show that the proposed scheme drastically improves latency performance.
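The selection rule can be sketched directly: requested documents and high-probability prefetch candidates use a robust format, and everything below the threshold uses a bandwidth-efficient format. The concrete modulation/coding choices and the threshold value are assumptions for illustration, not the parameters evaluated in the paper.

```python
# Minimal sketch of access-probability-based transmission format selection.

RELIABLE = {"modulation": "QPSK", "coding_rate": 1 / 2}       # robust, lower rate (assumed)
EFFICIENT = {"modulation": "16QAM", "coding_rate": 3 / 4}     # higher rate, less robust (assumed)

def choose_format(access_probability, actually_requested, threshold=0.5):
    if actually_requested or access_probability >= threshold:
        return RELIABLE
    return EFFICIENT

if __name__ == "__main__":
    documents = [("index.html", 1.00, True),       # the requested page itself
                 ("next_page.html", 0.70, False),  # likely to be requested next
                 ("rarely_used.html", 0.05, False)]
    for name, prob, requested in documents:
        fmt = choose_format(prob, requested)
        print(f"{name}: {fmt['modulation']} rate {fmt['coding_rate']}")
```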
Prefetching is a promising approach to tackling the memory latency problem. The two basic variants of hardware data prefetching are sequential prefetching and stride prefetching. The latter, which is based on the stride calculation of future references, has the potential to outperform the former, which is based on data locality. In this paper, a typical stride prefetching scheme and its improved version, adaptive stride prefetching, are compared quantitatively using simulation on several parallel benchmark programs in the context of uniform memory access and non-uniform memory access architectures. The simulation results show that stride adaptability is essential, since the proposed adaptive scheme can reduce the pending stall time, which is large in the typical scheme.
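A per-instruction stride prefetcher can be sketched with a small reference prediction table, and a simple adaptive twist can be added by letting the prefetch degree grow while the stride keeps repeating and shrink when it breaks. The table layout and confidence rule are our own simplification, not the schemes evaluated in the paper.

```python
# Minimal sketch of per-PC stride prefetching with a simple adaptive degree.

class StridePrefetcher:
    def __init__(self, max_degree=4):
        self.table = {}            # pc -> [last_addr, last_stride, confidence]
        self.max_degree = max_degree

    def access(self, pc, addr):
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [addr, 0, 0]
            return []
        last_addr, last_stride, conf = entry
        stride = addr - last_addr
        if stride != 0 and stride == last_stride:
            conf = min(conf + 1, self.max_degree)      # stride confirmed: be more aggressive
        else:
            conf = 0                                   # stride broken: back off
        self.table[pc] = [addr, stride, conf]
        degree = conf                                  # adaptive degree = current confidence
        return [addr + stride * i for i in range(1, degree + 1)]

if __name__ == "__main__":
    pf = StridePrefetcher()
    for i in range(6):                                 # a load at pc=0x40 striding by 64 bytes
        print(pf.access(0x40, 0x1000 + 64 * i))
```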
Myoung Kwon TCHEUN Seung Ryoul MAENG Jung Wan CHO
To reduce memory access latency on shared-memory multiprocessors, several prefetching schemes have been proposed. The sequential prefetching scheme is a simple hardware-controlled scheme that exploits the sequentiality of memory accesses to predict which blocks will be read in the near future. Aggressive sequential prefetching fetches many blocks on each miss to reduce miss rates and performs well for application programs with high sequentiality. However, conservative sequential prefetching fetches only a few blocks on each miss to avoid prefetching useless blocks, and it outperforms aggressive sequential prefetching for application programs with low sequentiality. We analyze the relationship between the sequentiality of application programs and the effectiveness of sequential prefetching under various memory and network latencies, and propose a new adaptive sequential prefetching scheme. By simply adding a small table to the sequential prefetching scheme, the proposed scheme prefetches a large number of blocks for application programs with high sequentiality, significantly reducing miss rates, and prefetches a small number of blocks for application programs with low sequentiality, avoiding the loading of useless blocks.
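The adaptive idea can be sketched as follows: on each miss, K consecutive blocks are prefetched, and K is adjusted by small counters that track how many previously prefetched blocks were actually used. The counter thresholds and the class below are assumptions for illustration, not the paper's exact table mechanism.

```python
# Minimal sketch of adaptive sequential prefetching with a usefulness-based degree.

class AdaptiveSequentialPrefetcher:
    def __init__(self, max_k=8):
        self.k = 1                   # current prefetch degree
        self.max_k = max_k
        self.outstanding = set()     # blocks prefetched but not yet referenced
        self.used = 0                # prefetched blocks that were referenced
        self.wasted = 0              # prefetched blocks that were never referenced

    def miss(self, block):
        # Blocks still outstanding from the previous miss were never referenced.
        self.wasted += len(self.outstanding)
        self._adapt()
        self.outstanding = {block + i for i in range(1, self.k + 1)}
        return sorted(self.outstanding)             # blocks to prefetch now

    def hit_prefetched(self, block):
        if block in self.outstanding:
            self.outstanding.discard(block)
            self.used += 1

    def _adapt(self):
        if self.used > self.wasted:
            self.k = min(self.max_k, self.k * 2)    # high sequentiality: prefetch more
        elif self.wasted > self.used:
            self.k = max(1, self.k // 2)            # low sequentiality: prefetch less
        self.used = self.wasted = 0

if __name__ == "__main__":
    pf = AdaptiveSequentialPrefetcher()
    for b in pf.miss(0):                 # first miss: prefetch one block
        pf.hit_prefetched(b)             # a sequential stream uses everything prefetched
    print("blocks prefetched on the next miss:", pf.miss(2))   # degree has grown
```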
Newly developed high-speed DRAMs are introduced and their innovative circuit techniques for achieving high data bandwidth are described: the synchronous DRAM, the cache DRAM, and the Rambus DRAM. They are all designed to fill the performance gap between MPUs and the main memory of computer systems, which will widen in the '90s. Although these high-speed DRAMs share the same purpose of increasing data bandwidth, their approaches to accomplishing it differ, which may in turn lead to advantages or disadvantages as well as different fields of application. The paper is intended not only to discuss them from a technical overview, but also to serve as a guide for DRAM users when choosing the one that best fits their systems.
Masafumi TAKAHASHI Hiroshige FUJII Emi KANEKO Takeshi YOSHIDA Toshinori SATO Hiroyuki TAKANO Haruyuki TAGO Seigo SUZUKI Nobuyuki GOTO
A 250-MIPS, 125-MFLOPS peak-performance processing element (PE), which is being developed for an on-chip multiprocessor, has been modeled and evaluated. The PE includes the following new architectural components: an FPU shared by several IUs to increase the efficiency of the FPU pipelines, an on-chip data cache with a prefetch mechanism to reduce clock cycles spent waiting for memory, and an interface to high-speed DRAM, such as Rambus DRAM and Synchronous DRAM. As a result, a PE model with an FPU shared by four or eight IUs incurs only a 10% performance reduction compared to a model with an unshared FPU, while saving the cost of three FPUs. Furthermore, a PE model with prefetching operates 1.2 to 1.8 times faster than a model without prefetching at a 250-MHz clock rate when the Rambus DRAM is connected. It becomes clear that this PE architecture can deliver high effective performance at over 250 MHz and is cost-effective for the on-chip multiprocessor.