A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU

Jingcheng SHEN; Fumihiko INO; Albert FARRÉS; Mauricio HANZICH

doi:10.1587/transinf.2020PAP0014

IEICE TRANSACTIONS on Information

Summary:

Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates the use of out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve such programming burdens, directive-based frameworks emerged, such as the pipelined accelerator (PACC); however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations to address data transfer problems. The first is a direct-mapping scheme that eliminates host (i.e., CPU) buffers, which intermediate between the original data and device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the length of original serial code 2.3-fold to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation.
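
To make the idea of out-of-core stencil processing concrete, the following is a minimal sketch in Python. It is not PACC and not the authors' implementation. It applies one sweep of a 3-point averaging stencil to a 1D array in chunks, where each chunk carries one halo cell on either side, and it counts how many elements are copied into a simulated device buffer (a plain Python list). The `reuse_halo` option models one plausible reading of region sharing, namely keeping the overlapping halo cells of the previous chunk on the device instead of copying them from the host again; this interpretation, the function names, and the parameters are all assumptions made for illustration.

```python
def stencil_in_core(a):
    """One sweep of a 3-point average; the two boundary cells are copied unchanged."""
    out = list(a)
    for i in range(1, len(a) - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out


def stencil_out_of_core(a, chunk, reuse_halo=False):
    """Same sweep, processing `chunk` interior cells at a time.

    Returns (result, transferred), where `transferred` counts the elements
    copied from the host array into the simulated device buffer.
    Hypothetical helper: it only illustrates chunking with halo cells.
    """
    n = len(a)
    out = list(a)
    transferred = 0
    prev_tail = []  # last two cells of the previous device buffer
    for start in range(1, n - 1, chunk):
        stop = min(start + chunk, n - 1)
        lo, hi = start - 1, stop + 1  # interior cells plus one halo cell per side
        if reuse_halo and prev_tail:
            # The first two cells of this range overlap the previous chunk,
            # so only the remaining cells are copied from the host.
            fresh = a[lo + 2:hi]
            device = prev_tail + fresh
        else:
            fresh = a[lo:hi]
            device = fresh
        transferred += len(fresh)
        # Compute on the device buffer and write the interior cells back.
        for j in range(1, len(device) - 1):
            out[lo + j] = (device[j - 1] + device[j] + device[j + 1]) / 3.0
        prev_tail = device[-2:]
    return out, transferred


if __name__ == "__main__":
    data = [float(i * i) for i in range(10)]
    reference = stencil_in_core(data)
    plain, copied_plain = stencil_out_of_core(data, chunk=3)
    shared, copied_shared = stencil_out_of_core(data, chunk=3, reuse_halo=True)
    print(plain == reference, shared == reference)
    print(copied_plain, copied_shared)
```

In this toy setting, reusing the overlap means each input element is copied to the device once, whereas the plain chunked version copies every overlapping halo cell twice. A real GPU implementation would additionally have to manage asynchronous transfers, multiple time steps, and multi-dimensional halos, none of which is modeled here.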

Publication: IEICE TRANSACTIONS on Information Vol.E103-D No.12 pp.2421-2434

Publication Date: 2020/12/01

Publicized: 2020/09/07

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2020PAP0014

Type of Manuscript: Special Section PAPER (Special Section on Parallel, Distributed, and Reconfigurable Computing, and Networking)

Category: Fundamentals of Information Systems

Authors

Jingcheng SHEN
  Osaka University
Fumihiko INO
  Osaka University
Albert FARRÉS
  Barcelona Supercomputing Center
Mauricio HANZICH
  Barcelona Supercomputing Center

Keywords

stencil computation, out-of-core computation, data-centric optimizations, GPU

Cite this

Jingcheng SHEN, Fumihiko INO, Albert FARRÉS, Mauricio HANZICH, "A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU" in IEICE TRANSACTIONS on Information, vol. E103-D, no. 12, pp. 2421-2434, December 2020, doi: 10.1587/transinf.2020PAP0014.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2020PAP0014/_p

@ARTICLE{e103-d_12_2421,
author={Jingcheng SHEN and Fumihiko INO and Albert FARRÉS and Mauricio HANZICH},
journal={IEICE TRANSACTIONS on Information},
title={A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU},
year={2020},
volume={E103-D},
number={12},
pages={2421--2434},
abstract={Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates the use of out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve such programming burdens, directive-based frameworks emerged, such as the pipelined accelerator (PACC); however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations to address data transfer problems. The first is a direct-mapping scheme that eliminates host (i.e., CPU) buffers, which intermediate between the original data and device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the length of original serial code 2.3-fold to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation.},
keywords={stencil computation, out-of-core computation, data-centric optimizations, GPU},
doi={10.1587/transinf.2020PAP0014},
ISSN={1745-1361},
month={December},
}

TY - JOUR
TI - A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU
T2 - IEICE TRANSACTIONS on Information
SP - 2421
EP - 2434
AU - Jingcheng SHEN
AU - Fumihiko INO
AU - Albert FARRÉS
AU - Mauricio HANZICH
PY - 2020
DO - 10.1587/transinf.2020PAP0014
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E103-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2020
AB - Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates the use of out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve such programming burdens, directive-based frameworks emerged, such as the pipelined accelerator (PACC); however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations to address data transfer problems. The first is a direct-mapping scheme that eliminates host (i.e., CPU) buffers, which intermediate between the original data and device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the length of original serial code 2.3-fold to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation.
ER -
