Offline Permutation on the CUDA-enabled GPU

Akihiko KASAGI; Koji NAKANO; Yasuaki ITO

doi:10.1587/transinf.2014PAP0010

IEICE TRANSACTIONS on Information

Summary:

The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computation on CUDA-enabled GPUs. Offline permutation is the task of copying the numbers stored in an array a of size n to an array b of the same size according to a permutation P given in advance. A conventional algorithm completes the offline permutation by executing b[p[i]] ← a[i] for all i in parallel, where an array p stores the permutation P. We first show that the conventional algorithm runs in $D_w(P)+2\frac{n}{w}+3L-3$ time units using n threads on the HMM with width w and latency L, where $D_w(P)$ is the distribution of P. We next show that important regular permutations, including the transpose, shuffle, and bit-reversal permutations, can be performed in $2\frac{n}{w}+2\frac{n}{kw}+2L-2$ time units on the HMM with k Discrete Memory Machines (DMMs). We have implemented permutation algorithms for these regular permutations on a GeForce GTX 680 GPU. The experimental results show that these algorithms run much faster than the conventional algorithm. We also present an offline permutation algorithm for any permutation that runs in $16\frac{n}{w}+16\frac{n}{kw}+16L-16$ time units on the HMM with k DMMs. Quite surprisingly, our offline permutation algorithm on the GPU outperforms the conventional algorithm for random permutations, although its running time has a large constant factor. The experimental results thus provide a good example of GPU computation in which a complicated but ingenious implementation with a larger constant factor in computing time outperforms a much simpler conventional algorithm.
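To make the conventional algorithm and the three regular permutations concrete, the following minimal Python sketch may help. It is not the authors' implementation: the loop in `permute` models the per-thread assignments b[p[i]] ← a[i] sequentially rather than on a GPU, and the function names and the index maps for transpose, shuffle, and bit-reversal are assumptions based on the usual textbook definitions, which the summary itself does not spell out.

```python
def permute(a, p):
    """Conventional offline permutation: b[p[i]] = a[i] for every i.

    On a GPU each index i would be handled by its own thread; this loop
    only models those independent assignments one after another.
    """
    b = [None] * len(a)
    for i, x in enumerate(a):
        b[p[i]] = x
    return b


def transpose_perm(rows, cols):
    """p for transposing a rows x cols matrix stored in row-major order.

    Element (r, c) at index r*cols + c moves to index c*rows + r.
    """
    return [(i % cols) * rows + i // cols for i in range(rows * cols)]


def shuffle_perm(m):
    """p for the perfect shuffle on n = 2**m elements, m >= 1.

    Assumed definition: rotate the m-bit index one position to the left.
    """
    n = 1 << m
    return [((i << 1) | (i >> (m - 1))) & (n - 1) for i in range(n)]


def bit_reversal_perm(m):
    """p that reverses the m-bit binary representation of each index, m >= 1."""
    return [int(format(i, f"0{m}b")[::-1], 2) for i in range(1 << m)]


if __name__ == "__main__":
    data = list(range(8))
    print(permute([1, 2, 3, 4, 5, 6], transpose_perm(2, 3)))
    print(permute(data, shuffle_perm(3)))
    print(permute(data, bit_reversal_perm(3)))
```

All three index maps feed the same `permute` routine, which is what makes the conventional algorithm simple. The faster algorithms described in the summary change how the data is moved, not what permutation is computed, so a sketch like this could serve as a reference for checking their output.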

Publication: IEICE TRANSACTIONS on Information Vol.E97-D No.12 pp.3052-3062

Publication Date: 2014/12/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2014PAP0010

Type of Manuscript: Special Section PAPER (Special Section on Parallel and Distributed Computing and Networking)

Category: GPU

Authors

Akihiko KASAGI
  Hiroshima University
Koji NAKANO
  Hiroshima University
Yasuaki ITO
  Hiroshima University

Keywords

memory machine models, offline permutation, GPU, CUDA

Cite this

Akihiko KASAGI, Koji NAKANO, Yasuaki ITO, "Offline Permutation on the CUDA-enabled GPU" in IEICE TRANSACTIONS on Information, vol. E97-D, no. 12, pp. 3052-3062, December 2014, doi: 10.1587/transinf.2014PAP0010.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2014PAP0010/_p
