IEICE global.ieice.org Site

Keyword Search Result

[Keyword] cache optimization(4hit)

1-4hit

Cache-Aware, In-Place Rotation Method for Texture-Based Volume Rendering
Yuji MISAKI Fumihiko INO Kenichi HAGIHARA

PAPER-Fundamentals of Information Systems

Pubricized:
2016/12/12
Vol:
E100-D No:3
Page(s):
452-461
We propose a cache-aware method to accelerate texture-based volume rendering on a graphics processing unit (GPU) that is compatible with the compute unified device architecture. The proposed method extends a previous method such that it can maximize the average rendering performance while rotating the viewing direction around a volume. To realize this, the proposed method performs in-place rotation of volume data, which rearranges the order of voxels to allow consecutive threads (warps) to refer to voxels with the minimum access strides. Experiments indicate that the proposed method replaces the worst texture cache (TC) hit rate of 42% with the best TC hit rate of 93% for a 10243-voxel volume. Thus, the average frame rate increases by a factor of 1.6 in the proposed method compared with that in the previous method. Although the overhead of in-place rotation slightly decreases the frame rate from 2.0 frames per second (fps) to 1.9 fps, this slowdown occurs only with a few viewing directions.
Cache-Aware GPU Optimization for Out-of-Core Cone Beam CT Reconstruction of High-Resolution Volumes
Yuechao LU Fumihiko INO Kenichi HAGIHARA

PAPER-Computer System

Pubricized:
2016/09/05
Vol:
E99-D No:12
Page(s):
3060-3071
This paper proposes a cache-aware optimization method to accelerate out-of-core cone beam computed tomography reconstruction on a graphics processing unit (GPU) device. Our proposed method extends a previous method by increasing the cache hit rate so as to speed up the reconstruction of high-resolution volumes that exceed the capacity of device memory. More specifically, our approach accelerates the well-known Feldkamp-Davis-Kress algorithm by utilizing the following three strategies: (1) a loop organization strategy that identifies the best tradeoff point between the cache hit rate and the number of off-chip memory accesses; (2) a data structure that exploits high locality within a layered texture; and (3) a fully pipelined strategy for hiding file input/output (I/O) time with GPU execution and data transfer times. We implement our proposed method on NVIDIA's latest Maxwell architecture and provide tuning guidelines for adjusting the execution parameters, which include the granularity and shape of thread blocks as well as the granularity of I/O data to be streamed through the pipeline, which maximizes reconstruction performance. Our experimental results show that it took less than three minutes to reconstruct a 20483-voxel volume from 1200 20482-pixel projection images on a single GPU; this translates to a speedup of approximately 1.47 as compared to the previous method.
A Two-Level Cache Design Space Exploration System for Embedded Applications
Nobuaki TOJO Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI

PAPER-Embedded, Real-Time and Reconfigurable Systems

Vol:
E92-A No:12
Page(s):
3238-3247
Recently, two-level cache, L1 cache and L2 cache, is commonly used in a processor. Particularly in an embedded system whereby a single application or a class of applications is repeatedly executed on a processor, its cache configuration can be customized such that an optimal one is achieved. An optimal two-level cache configuration can be obtained which minimizes overall memory access time or memory energy consumption by varying the three cache parameters: the number of sets, a line size, and an associativity, for L1 cache and L2 cache. In this paper, we first extend the L1 cache simulation algorithm so that we can explore two-level cache configuration. Second, we propose two-level cache design space exploration algorithms: CRCB-T1 and CRCB-T2, each of which is based on applying Cache Inclusion Property to two-level cache configuration. Each of the proposed algorithms realizes exact cache simulation but decreases the number of cache hit/miss judgments by a factor of several thousands. Experimental results show that, by using our approach, the number of cache hit/miss judgments required to optimize a cache configurations is reduced to 1/50-1/5500 compared to the exhaustive approach. As a result, our proposed approach totally runs an average of 1398.25 times faster compared to the exhaustive approach. Our proposed cache simulation approach achieves the world fastest two-level cache design space exploration.
An L1 Cache Design Space Exploration System for Embedded Applications
Nobuaki TOJO Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI

PAPER-VLSI Design Technology and CAD

Vol:
E92-A No:6
Page(s):
1442-1453
In an embedded system where a single application or a class of applications is repeatedly executed on a processor, its cache configuration can be customized such that an optimal one is achieved. We can have an optimal cache configuration which minimizes overall memory access time by varying the three cache parameters: the number of sets, a line size, and an associativity. In this paper, we first propose two cache simulation algorithms: CRCB1 and CRCB2, based on Cache Inclusion Property. They realize exact cache simulation but decrease the number of cache hit/miss judgments dramatically. We further propose three more cache design space exploration algorithms: CRMF1, CRMF2, and CRMF3, based on our experimental observations. They can find an almost optimal cache configuration from the viewpoint of access time. By using our approach, the number of cache hit/miss judgments required for optimizing cache configurations is reduced to 1/10-1/50 compared to conventional approaches. As a result, our proposed approach totally runs an average of 3.2 times faster and a maximum of 5.3 times faster compared to the fastest approach proposed so far. Our proposed cache simulation approach achieves the world fastest cache design space exploration when optimizing total memory access time.