IEICE global.ieice.org Site

Keyword Search Result

[Keyword] stencil(5hit)

1-5hit

A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU
Jingcheng SHEN Fumihiko INO Albert FARRÉS Mauricio HANZICH

PAPER-Fundamentals of Information Systems

Pubricized:
2020/09/07
Vol:
E103-D No:12
Page(s):
2421-2434
Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates the use of out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve such programming burdens, directive-based frameworks emerged, such as the pipelined accelerator (PACC); however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations to address data transfer problems. The first is a direct-mapping scheme that eliminates host (i.e., CPU) buffers, which intermediate between the original data and device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the length of original serial code 2.3-fold to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation.
Performance Evaluation of a 3D-Stencil Library for Distributed Memory Array Accelerators
Yoshikazu INAGAKI Shinya TAKAMAEDA-YAMAZAKI Jun YAO Yasuhiko NAKASHIMA

PAPER-Architecture

Pubricized:
2015/09/15
Vol:
E98-D No:12
Page(s):
2141-2149
The Energy-aware Multi-mode Accelerator eXtension [24],[25] (EMAX) is equipped with distributed single-port local memories and ring-formed interconnections. The accelerator is designed to achieve extremely high throughput for scientific computations, big data, and image processing as well as low-power consumption. However, before mapping algorithms on the accelerator, application developers require sufficient knowledge of the hardware organization and specially designed instructions. They also need significant effort to tune the code for improving execution efficiency when no well-designed compiler or library is available. To address this problem, we focus on library support for stencil (nearest-neighbor) computations that represent a class of algorithms commonly used in many partial differential equation (PDE) solvers. In this research, we address the following topics: (1) system configuration, features, and mnemonics of EMAX; (2) instruction mapping techniques that reduce the amount of data to be read from the main memory; (3) performance evaluation of the library for PDE solvers. With the features of a library that can reuse the local data across the outer loop iterations and map many instructions by unrolling the outer loops, the amount of data to be read from the main memory is significantly reduced to a minimum of 1/7 compared with a hand-tuned code. In addition, the stencil library reduced the execution time 23% more than a general-purpose processor.
Performance Modeling of Stencil Computing on a Stream-Based FPGA Accelerator for Efficient Design Space Exploration
Keisuke DOHI Koji OKINA Rie SOEJIMA Yuichiro SHIBATA Kiyoshi OGURI

PAPER-Application

Pubricized:
2014/11/19
Vol:
E98-D No:2
Page(s):
298-308
In this paper, we discuss performance modeling of 3-D stencil computing on an FPGA accelerator with a high-level synthesis environment, aiming for efficient exploration of user-space design parameters. First, we analyze resource utilization and performance to formulate these relationships as mathematical models. Then, in order to evaluate our proposed models, we implement heat conduction simulations as a benchmark application, by using MaxCompiler, which is a high-level synthesis tool for FPGAs, and MaxGenFD, which is a domain specific framework of the MaxCompiler for finite-difference equation solvers. The experimental results with various settings of architectural design parameters show the best combination of design parameters for pipeline structure can be systematically found by using our models. The effects of changing arithmetic accuracy and using data stream compression are also discussed.
LSI Design Flow for Shot Reduction of Character Projection Electron Beam Direct Writing Using Combined Cell Stencil
Taisuke KAZAMA Makoto IKEDA Kunihiro ASADA

PAPER-Physical Design

Vol:
E89-A No:12
Page(s):
3546-3550
We propose a shot reduction technique of character projection (CP) Electron Beam Direct Writing (EBDW) using combined cell stencil (CCS) or the advanced process technology. CP EBDW is expected both to reduce mask costs and to realize quick turn around time. One of major issue of the conventional CP EBDW, however, is a throughput of lithography. The throughput is determined by numbers of shots, which are proportional to numbers of cell instances in LSIs. The conventional shot reduction techniques focus on optimization of cell stencil extraction, without any modifications on designed LSI mask patterns. The proposed technique employs the proposed combined cell stencil, with proposed modified design flow, for further shot reduction. We demonstrate 22.4% shot reduction within 4.3% area increase for a microprocessor and 28.6% shot reduction for IWLS benchmarks compared with the conventional technique.
Implementation of the Multicolored SOR Method on a Vector Supercomputer
Seiji FUJINO Ryutaro HIMENO Akira KOJIMA Kazuo TERADA

PAPER

Vol:
E80-D No:4
Page(s):
518-523
We describe the implementation of an iterative method with the goal of gaining a long vector length. The strategy for vectorization by means of multipoint stencils used for discretization of the partial differential equations is discussed. Numerical experiments show that the strategy that requires certain restrictions on the number of grid points in the x and y directions improves the performance on the vector supercomputer.

Keyword Search Result

[Keyword] stencil(5hit)

A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU

Performance Evaluation of a 3D-Stencil Library for Distributed Memory Array Accelerators

Performance Modeling of Stencil Computing on a Stream-Based FPGA Accelerator for Efficient Design Space Exploration

LSI Design Flow for Shot Reduction of Character Projection Electron Beam Direct Writing Using Combined Cell Stencil

Implementation of the Multicolored SOR Method on a Vector Supercomputer

Latest Issue

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles