By exploiting data-level parallelism, Graphics Processing Units (GPUs) have become a high-throughput, general-purpose computing platform. Many real-world applications, however, especially those following a stream-processing pattern, feature interleaved task-pipelined and data parallelism. Current GPUs are ill-equipped for such applications due to underutilization of computing resources and/or excessive off-chip memory traffic. In this paper, we focus on microarchitectural enhancements that enable task-pipelined execution of data-parallel kernels on GPUs. We propose an efficient adaptive dynamic scheduling mechanism and a moderately modified L2 cache design. With minor hardware overhead, our techniques orchestrate both task-pipelined and data parallelism in a unified manner. Simulation results obtained with a cycle-accurate simulator on real-world applications show that the proposed GPU microarchitecture improves computing throughput by 18% and reduces overall accesses to off-chip GPU memory by 13%.
Shuai MU
Tsinghua University
Dongdong LI
Tsinghua University
Yubei CHEN
Tsinghua University
Yangdong DENG
Tsinghua University
Zhihua WANG
Tsinghua University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Shuai MU, Dongdong LI, Yubei CHEN, Yangdong DENG, Zhihua WANG, "Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs" in IEICE TRANSACTIONS on Information,
vol. E96-D, no. 10, pp. 2194-2207, October 2013, doi: 10.1587/transinf.E96.D.2194.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E96.D.2194/_p
@ARTICLE{e96-d_10_2194,
author={Shuai MU and Dongdong LI and Yubei CHEN and Yangdong DENG and Zhihua WANG},
journal={IEICE TRANSACTIONS on Information},
title={Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs},
year={2013},
volume={E96-D},
number={10},
pages={2194--2207},
abstract={By exploiting data-level parallelism, Graphics Processing Units (GPUs) have become a high-throughput, general-purpose computing platform. Many real-world applications, however, especially those following a stream-processing pattern, feature interleaved task-pipelined and data parallelism. Current GPUs are ill-equipped for such applications due to underutilization of computing resources and/or excessive off-chip memory traffic. In this paper, we focus on microarchitectural enhancements that enable task-pipelined execution of data-parallel kernels on GPUs. We propose an efficient adaptive dynamic scheduling mechanism and a moderately modified L2 cache design. With minor hardware overhead, our techniques orchestrate both task-pipelined and data parallelism in a unified manner. Simulation results obtained with a cycle-accurate simulator on real-world applications show that the proposed GPU microarchitecture improves computing throughput by 18% and reduces overall accesses to off-chip GPU memory by 13%.},
doi={10.1587/transinf.E96.D.2194},
ISSN={1745-1361},
month=oct,
}
TY - JOUR
TI - Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs
T2 - IEICE TRANSACTIONS on Information
SP - 2194
EP - 2207
AU - MU, Shuai
AU - LI, Dongdong
AU - CHEN, Yubei
AU - DENG, Yangdong
AU - WANG, Zhihua
PY - 2013
DO - 10.1587/transinf.E96.D.2194
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E96-D
IS - 10
JA - IEICE TRANSACTIONS on Information
Y1 - 2013/10//
AB - By exploiting data-level parallelism, Graphics Processing Units (GPUs) have become a high-throughput, general-purpose computing platform. Many real-world applications, however, especially those following a stream-processing pattern, feature interleaved task-pipelined and data parallelism. Current GPUs are ill-equipped for such applications due to underutilization of computing resources and/or excessive off-chip memory traffic. In this paper, we focus on microarchitectural enhancements that enable task-pipelined execution of data-parallel kernels on GPUs. We propose an efficient adaptive dynamic scheduling mechanism and a moderately modified L2 cache design. With minor hardware overhead, our techniques orchestrate both task-pipelined and data parallelism in a unified manner. Simulation results obtained with a cycle-accurate simulator on real-world applications show that the proposed GPU microarchitecture improves computing throughput by 18% and reduces overall accesses to off-chip GPU memory by 13%.
ER -