This paper proposes a cache-aware optimization method to accelerate out-of-core cone beam computed tomography reconstruction on a graphics processing unit (GPU) device. Our proposed method extends a previous method by increasing the cache hit rate so as to speed up the reconstruction of high-resolution volumes that exceed the capacity of device memory. More specifically, our approach accelerates the well-known Feldkamp-Davis-Kress algorithm by utilizing the following three strategies: (1) a loop organization strategy that identifies the best tradeoff point between the cache hit rate and the number of off-chip memory accesses; (2) a data structure that exploits high locality within a layered texture; and (3) a fully pipelined strategy for hiding file input/output (I/O) time with GPU execution and data transfer times. We implement our proposed method on NVIDIA's latest Maxwell architecture and provide tuning guidelines for adjusting the execution parameters, which include the granularity and shape of thread blocks as well as the granularity of I/O data to be streamed through the pipeline, which maximizes reconstruction performance. Our experimental results show that it took less than three minutes to reconstruct a 20483-voxel volume from 1200 20482-pixel projection images on a single GPU; this translates to a speedup of approximately 1.47 as compared to the previous method.
Yuechao LU
Osaka University
Fumihiko INO
Osaka University
Kenichi HAGIHARA
Osaka University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Yuechao LU, Fumihiko INO, Kenichi HAGIHARA, "Cache-Aware GPU Optimization for Out-of-Core Cone Beam CT Reconstruction of High-Resolution Volumes" in IEICE TRANSACTIONS on Information,
vol. E99-D, no. 12, pp. 3060-3071, December 2016, doi: 10.1587/transinf.2016EDP7174.
Abstract: This paper proposes a cache-aware optimization method to accelerate out-of-core cone beam computed tomography reconstruction on a graphics processing unit (GPU) device. Our proposed method extends a previous method by increasing the cache hit rate so as to speed up the reconstruction of high-resolution volumes that exceed the capacity of device memory. More specifically, our approach accelerates the well-known Feldkamp-Davis-Kress algorithm by utilizing the following three strategies: (1) a loop organization strategy that identifies the best tradeoff point between the cache hit rate and the number of off-chip memory accesses; (2) a data structure that exploits high locality within a layered texture; and (3) a fully pipelined strategy for hiding file input/output (I/O) time with GPU execution and data transfer times. We implement our proposed method on NVIDIA's latest Maxwell architecture and provide tuning guidelines for adjusting the execution parameters, which include the granularity and shape of thread blocks as well as the granularity of I/O data to be streamed through the pipeline, which maximizes reconstruction performance. Our experimental results show that it took less than three minutes to reconstruct a 20483-voxel volume from 1200 20482-pixel projection images on a single GPU; this translates to a speedup of approximately 1.47 as compared to the previous method.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016EDP7174/_p
Copy
@ARTICLE{e99-d_12_3060,
author={Yuechao LU, Fumihiko INO, Kenichi HAGIHARA, },
journal={IEICE TRANSACTIONS on Information},
title={Cache-Aware GPU Optimization for Out-of-Core Cone Beam CT Reconstruction of High-Resolution Volumes},
year={2016},
volume={E99-D},
number={12},
pages={3060-3071},
abstract={This paper proposes a cache-aware optimization method to accelerate out-of-core cone beam computed tomography reconstruction on a graphics processing unit (GPU) device. Our proposed method extends a previous method by increasing the cache hit rate so as to speed up the reconstruction of high-resolution volumes that exceed the capacity of device memory. More specifically, our approach accelerates the well-known Feldkamp-Davis-Kress algorithm by utilizing the following three strategies: (1) a loop organization strategy that identifies the best tradeoff point between the cache hit rate and the number of off-chip memory accesses; (2) a data structure that exploits high locality within a layered texture; and (3) a fully pipelined strategy for hiding file input/output (I/O) time with GPU execution and data transfer times. We implement our proposed method on NVIDIA's latest Maxwell architecture and provide tuning guidelines for adjusting the execution parameters, which include the granularity and shape of thread blocks as well as the granularity of I/O data to be streamed through the pipeline, which maximizes reconstruction performance. Our experimental results show that it took less than three minutes to reconstruct a 20483-voxel volume from 1200 20482-pixel projection images on a single GPU; this translates to a speedup of approximately 1.47 as compared to the previous method.},
keywords={},
doi={10.1587/transinf.2016EDP7174},
ISSN={1745-1361},
month={December},}
Copy
TY - JOUR
TI - Cache-Aware GPU Optimization for Out-of-Core Cone Beam CT Reconstruction of High-Resolution Volumes
T2 - IEICE TRANSACTIONS on Information
SP - 3060
EP - 3071
AU - Yuechao LU
AU - Fumihiko INO
AU - Kenichi HAGIHARA
PY - 2016
DO - 10.1587/transinf.2016EDP7174
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E99-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2016
AB - This paper proposes a cache-aware optimization method to accelerate out-of-core cone beam computed tomography reconstruction on a graphics processing unit (GPU) device. Our proposed method extends a previous method by increasing the cache hit rate so as to speed up the reconstruction of high-resolution volumes that exceed the capacity of device memory. More specifically, our approach accelerates the well-known Feldkamp-Davis-Kress algorithm by utilizing the following three strategies: (1) a loop organization strategy that identifies the best tradeoff point between the cache hit rate and the number of off-chip memory accesses; (2) a data structure that exploits high locality within a layered texture; and (3) a fully pipelined strategy for hiding file input/output (I/O) time with GPU execution and data transfer times. We implement our proposed method on NVIDIA's latest Maxwell architecture and provide tuning guidelines for adjusting the execution parameters, which include the granularity and shape of thread blocks as well as the granularity of I/O data to be streamed through the pipeline, which maximizes reconstruction performance. Our experimental results show that it took less than three minutes to reconstruct a 20483-voxel volume from 1200 20482-pixel projection images on a single GPU; this translates to a speedup of approximately 1.47 as compared to the previous method.
ER -