Asymptotically Optimal Merging on ManyCore GPUs

Arne KUTZNER; Pok-Son KIM; Won-Kwang PARK

doi:10.1587/transinf.E95.D.2769

Summary:

We propose a family of algorithms for efficient merging on contemporary GPUs, where each algorithm requires O(m log(n/m + 1)) element comparisons; m and n are the sizes of the input sequences, with m ≤ n. According to the lower bounds for merging, all proposed algorithms are asymptotically optimal with respect to the number of necessary comparisons. First we introduce a parallel-structured algorithm that splits a merging problem of size 2^l into 2^i subproblems of size 2^(l-i), for an arbitrary i with 0 ≤ i ≤ l. This algorithm is itself a merger for i = l, but it is rather inefficient in that case. Efficiency is improved by moving to a two-stage approach in which the splitting process stops at a predetermined level and hands control to several block-mergers operating in parallel. We formally prove the asymptotic optimality of the splitting process and show that, for symmetrically sized inputs, our approach delivers runtimes up to 4 times faster than the thrust::merge function of the Thrust library. To assess the value of our merging technique in the context of sorting, we construct and evaluate a MergeSort on top of it. In our benchmarks, the resulting MergeSort clearly outperforms the MergeSort implementation provided by the Thrust library as well as Cederman's GPU-optimized variant of QuickSort.
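The two-stage structure described in the summary (split one merge into independent subproblems, then merge each block separately) can be illustrated with a small CPU-side sketch. This is not the authors' algorithm: the summary does not spell out their splitting process, so the sketch substitutes a generic co-ranking binary search, and the names `co_rank`, `merge_block`, and `split_merge` are hypothetical. It also does not reproduce the O(m log(n/m + 1)) comparison bound, since the block merger here is a plain linear merge. On a GPU each block would be handled by its own thread block; here the blocks are simply processed in a loop.

```python
from typing import List, Sequence


def co_rank(k: int, a: Sequence[int], b: Sequence[int]) -> int:
    """Return i such that a[:i] and b[:k - i] hold the first k merged elements.

    Ties are resolved in favour of `a`, which keeps the merge stable.
    """
    lo = max(0, k - len(b))   # at least this many elements must come from a
    hi = min(k, len(a))       # at most this many elements can come from a
    while lo < hi:
        i = (lo + hi) // 2
        j = k - i
        if a[i] <= b[j - 1]:
            lo = i + 1        # a[i] precedes b[j - 1], so take more from a
        else:
            hi = i
    return lo


def merge_block(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Sequential two-pointer merge, standing in for a per-block merger."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def split_merge(a: Sequence[int], b: Sequence[int], parts: int) -> List[int]:
    """Split the merge of sorted a and b into `parts` independent blocks."""
    total = len(a) + len(b)
    # Output positions at which the merged sequence is cut.
    bounds = [p * total // parts for p in range(parts + 1)]
    # Matching cut positions (i, j) in the two inputs, with i + j == k.
    cuts = [(i, k - i) for k in bounds for i in [co_rank(k, a, b)]]
    out: List[int] = []
    for (i0, j0), (i1, j1) in zip(cuts, cuts[1:]):
        # Each block depends only on its own slices, so blocks could run in parallel.
        out.extend(merge_block(a[i0:i1], b[j0:j1]))
    return out
```

Calling `split_merge(a, b, 2**i)` on two sorted lists mirrors the idea of cutting a problem of size 2^l into 2^i subproblems of size 2^(l-i); with `parts=1` the sketch degenerates to a single sequential merge.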

Publication: IEICE TRANSACTIONS on Information Vol.E95-D No.12 pp.2769-2777

Publication Date: 2012/12/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.E95.D.2769

Type of Manuscript: Special Section PAPER (Special Section on Parallel and Distributed Computing and Networking)

Category: Parallel and Distributed Computing

Arne KUTZNER, Pok-Son KIM, Won-Kwang PARK, "Asymptotically Optimal Merging on ManyCore GPUs" in IEICE TRANSACTIONS on Information, vol. E95-D, no. 12, pp. 2769-2777, December 2012, doi: 10.1587/transinf.E95.D.2769.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E95.D.2769/_p

@ARTICLE{e95-d_12_2769,
author={Arne KUTZNER and Pok-Son KIM and Won-Kwang PARK},
journal={IEICE TRANSACTIONS on Information},
title={Asymptotically Optimal Merging on ManyCore GPUs},
year={2012},
volume={E95-D},
number={12},
pages={2769--2777},
abstract={We propose a family of algorithms for efficient merging on contemporary GPUs, where each algorithm requires $O(m \log(n/m + 1))$ element comparisons; $m$ and $n$ are the sizes of the input sequences, with $m \le n$. According to the lower bounds for merging, all proposed algorithms are asymptotically optimal with respect to the number of necessary comparisons. First we introduce a parallel-structured algorithm that splits a merging problem of size $2^l$ into $2^i$ subproblems of size $2^{l-i}$, for an arbitrary $i$ with $0 \le i \le l$. This algorithm is itself a merger for $i = l$, but it is rather inefficient in that case. Efficiency is improved by moving to a two-stage approach in which the splitting process stops at a predetermined level and hands control to several block-mergers operating in parallel. We formally prove the asymptotic optimality of the splitting process and show that, for symmetrically sized inputs, our approach delivers runtimes up to 4 times faster than the thrust::merge function of the Thrust library. To assess the value of our merging technique in the context of sorting, we construct and evaluate a MergeSort on top of it. In our benchmarks, the resulting MergeSort clearly outperforms the MergeSort implementation provided by the Thrust library as well as Cederman's GPU-optimized variant of QuickSort.},
keywords={},
doi={10.1587/transinf.E95.D.2769},
ISSN={1745-1361},
month={December},}

TY - JOUR
TI - Asymptotically Optimal Merging on ManyCore GPUs
T2 - IEICE TRANSACTIONS on Information
SP - 2769
EP - 2777
AU - Arne KUTZNER
AU - Pok-Son KIM
AU - Won-Kwang PARK
PY - 2012
DO - 10.1587/transinf.E95.D.2769
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E95-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2012
AB - We propose a family of algorithms for efficient merging on contemporary GPUs, where each algorithm requires O(m log(n/m + 1)) element comparisons; m and n are the sizes of the input sequences, with m ≤ n. According to the lower bounds for merging, all proposed algorithms are asymptotically optimal with respect to the number of necessary comparisons. First we introduce a parallel-structured algorithm that splits a merging problem of size 2^l into 2^i subproblems of size 2^(l-i), for an arbitrary i with 0 ≤ i ≤ l. This algorithm is itself a merger for i = l, but it is rather inefficient in that case. Efficiency is improved by moving to a two-stage approach in which the splitting process stops at a predetermined level and hands control to several block-mergers operating in parallel. We formally prove the asymptotic optimality of the splitting process and show that, for symmetrically sized inputs, our approach delivers runtimes up to 4 times faster than the thrust::merge function of the Thrust library. To assess the value of our merging technique in the context of sorting, we construct and evaluate a MergeSort on top of it. In our benchmarks, the resulting MergeSort clearly outperforms the MergeSort implementation provided by the Thrust library as well as Cederman's GPU-optimized variant of QuickSort.
ER -