The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of the shared memory access of GPUs. Bank conflicts should be avoided for maximizing the bandwidth of the shared memory access. Offline permutation of an array is a task to copy all elements in array a into array b along a permutation given in advance. The main contribution of this paper is to implement a conflict-free permutation algorithm on the DMM in a GPU. We have also implemented straightforward permutation algorithms on the GPU. The experimental results for 1024 double (64-bit) numbers on NVIDIA GeForce GTX-680 show that the straightforward permutation algorithm takes 247.8 ns for the random permutation and 1684ns for the worst permutation that involves the maximum bank conflicts. Our conflict-free permutation algorithm runs in 167ns for any permutation including the random permutation and the worst permutation, although it performs more memory accesses. It follows that our conflict-free permutation is 1.48 times faster for the random permutation and 10.0 times faster for the worst permutation.
Akihiko KASAGI
Hiroshima University
Koji NAKANO
Hiroshima University
Yasuaki ITO
Hiroshima University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Akihiko KASAGI, Koji NAKANO, Yasuaki ITO, "Offline Permutation Algorithms on the Discrete Memory Machine with Performance Evaluation on the GPU" in IEICE TRANSACTIONS on Information,
vol. E96-D, no. 12, pp. 2617-2625, December 2013, doi: 10.1587/transinf.E96.D.2617.
Abstract: The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of the shared memory access of GPUs. Bank conflicts should be avoided for maximizing the bandwidth of the shared memory access. Offline permutation of an array is a task to copy all elements in array a into array b along a permutation given in advance. The main contribution of this paper is to implement a conflict-free permutation algorithm on the DMM in a GPU. We have also implemented straightforward permutation algorithms on the GPU. The experimental results for 1024 double (64-bit) numbers on NVIDIA GeForce GTX-680 show that the straightforward permutation algorithm takes 247.8 ns for the random permutation and 1684ns for the worst permutation that involves the maximum bank conflicts. Our conflict-free permutation algorithm runs in 167ns for any permutation including the random permutation and the worst permutation, although it performs more memory accesses. It follows that our conflict-free permutation is 1.48 times faster for the random permutation and 10.0 times faster for the worst permutation.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E96.D.2617/_p
Copy
@ARTICLE{e96-d_12_2617,
author={Akihiko KASAGI, Koji NAKANO, Yasuaki ITO, },
journal={IEICE TRANSACTIONS on Information},
title={Offline Permutation Algorithms on the Discrete Memory Machine with Performance Evaluation on the GPU},
year={2013},
volume={E96-D},
number={12},
pages={2617-2625},
abstract={The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of the shared memory access of GPUs. Bank conflicts should be avoided for maximizing the bandwidth of the shared memory access. Offline permutation of an array is a task to copy all elements in array a into array b along a permutation given in advance. The main contribution of this paper is to implement a conflict-free permutation algorithm on the DMM in a GPU. We have also implemented straightforward permutation algorithms on the GPU. The experimental results for 1024 double (64-bit) numbers on NVIDIA GeForce GTX-680 show that the straightforward permutation algorithm takes 247.8 ns for the random permutation and 1684ns for the worst permutation that involves the maximum bank conflicts. Our conflict-free permutation algorithm runs in 167ns for any permutation including the random permutation and the worst permutation, although it performs more memory accesses. It follows that our conflict-free permutation is 1.48 times faster for the random permutation and 10.0 times faster for the worst permutation.},
keywords={},
doi={10.1587/transinf.E96.D.2617},
ISSN={1745-1361},
month={December},}
Copy
TY - JOUR
TI - Offline Permutation Algorithms on the Discrete Memory Machine with Performance Evaluation on the GPU
T2 - IEICE TRANSACTIONS on Information
SP - 2617
EP - 2625
AU - Akihiko KASAGI
AU - Koji NAKANO
AU - Yasuaki ITO
PY - 2013
DO - 10.1587/transinf.E96.D.2617
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E96-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2013
AB - The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of the shared memory access of GPUs. Bank conflicts should be avoided for maximizing the bandwidth of the shared memory access. Offline permutation of an array is a task to copy all elements in array a into array b along a permutation given in advance. The main contribution of this paper is to implement a conflict-free permutation algorithm on the DMM in a GPU. We have also implemented straightforward permutation algorithms on the GPU. The experimental results for 1024 double (64-bit) numbers on NVIDIA GeForce GTX-680 show that the straightforward permutation algorithm takes 247.8 ns for the random permutation and 1684ns for the worst permutation that involves the maximum bank conflicts. Our conflict-free permutation algorithm runs in 167ns for any permutation including the random permutation and the worst permutation, although it performs more memory accesses. It follows that our conflict-free permutation is 1.48 times faster for the random permutation and 10.0 times faster for the worst permutation.
ER -