By exploiting data-level parallelism, Graphics Processing Units (GPUs) have become a high-throughput, general-purpose computing platform. Many real-world applications, however, especially those following a stream-processing pattern, feature interleaved task-pipelined and data parallelism. Current GPUs are ill-equipped for such applications due to underutilization of computing resources and/or excessive off-chip memory traffic. In this paper, we focus on microarchitectural enhancements that enable task-pipelined execution of data-parallel kernels on GPUs. We propose an efficient adaptive dynamic scheduling mechanism and a moderately modified L2 cache design. With minor hardware overhead, our techniques orchestrate both task-pipelined and data parallelism in a unified manner. Simulation results obtained with a cycle-accurate simulator on real-world applications show that the proposed GPU microarchitecture improves computing throughput by 18% and reduces overall accesses to off-chip GPU memory by 13%.
Shuai MU
Tsinghua University
Dongdong LI
Tsinghua University
Yubei CHEN
Tsinghua University
Yangdong DENG
Tsinghua University
Zhihua WANG
Tsinghua University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Shuai MU, Dongdong LI, Yubei CHEN, Yangdong DENG, Zhihua WANG, "Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs" in IEICE TRANSACTIONS on Information,
vol. E96-D, no. 10, pp. 2194-2207, October 2013, doi: 10.1587/transinf.E96.D.2194.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E96.D.2194/_p
@ARTICLE{e96-d_10_2194,
author={Shuai MU and Dongdong LI and Yubei CHEN and Yangdong DENG and Zhihua WANG},
journal={IEICE TRANSACTIONS on Information},
title={Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs},
year={2013},
volume={E96-D},
number={10},
pages={2194--2207},
abstract={By exploiting data-level parallelism, Graphics Processing Units (GPUs) have become a high-throughput, general-purpose computing platform. Many real-world applications, however, especially those following a stream-processing pattern, feature interleaved task-pipelined and data parallelism. Current GPUs are ill-equipped for such applications due to underutilization of computing resources and/or excessive off-chip memory traffic. In this paper, we focus on microarchitectural enhancements that enable task-pipelined execution of data-parallel kernels on GPUs. We propose an efficient adaptive dynamic scheduling mechanism and a moderately modified L2 cache design. With minor hardware overhead, our techniques orchestrate both task-pipelined and data parallelism in a unified manner. Simulation results obtained with a cycle-accurate simulator on real-world applications show that the proposed GPU microarchitecture improves computing throughput by 18% and reduces overall accesses to off-chip GPU memory by 13%.},
doi={10.1587/transinf.E96.D.2194},
ISSN={1745-1361},
month=oct,
}
TY - JOUR
TI - Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs
T2 - IEICE TRANSACTIONS on Information
SP - 2194
EP - 2207
AU - MU, Shuai
AU - LI, Dongdong
AU - CHEN, Yubei
AU - DENG, Yangdong
AU - WANG, Zhihua
PY - 2013
DO - 10.1587/transinf.E96.D.2194
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E96-D
IS - 10
JA - IEICE TRANSACTIONS on Information
Y1 - 2013/10//
AB - By exploiting data-level parallelism, Graphics Processing Units (GPUs) have become a high-throughput, general-purpose computing platform. Many real-world applications, however, especially those following a stream-processing pattern, feature interleaved task-pipelined and data parallelism. Current GPUs are ill-equipped for such applications due to underutilization of computing resources and/or excessive off-chip memory traffic. In this paper, we focus on microarchitectural enhancements that enable task-pipelined execution of data-parallel kernels on GPUs. We propose an efficient adaptive dynamic scheduling mechanism and a moderately modified L2 cache design. With minor hardware overhead, our techniques orchestrate both task-pipelined and data parallelism in a unified manner. Simulation results obtained with a cycle-accurate simulator on real-world applications show that the proposed GPU microarchitecture improves computing throughput by 18% and reduces overall accesses to off-chip GPU memory by 13%.
ER -