1-5hit |
Jingcheng SHEN Fumihiko INO Albert FARRÉS Mauricio HANZICH
Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates the use of out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve such programming burdens, directive-based frameworks emerged, such as the pipelined accelerator (PACC); however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations to address data transfer problems. The first is a direct-mapping scheme that eliminates host (i.e., CPU) buffers, which intermediate between the original data and device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the length of original serial code 2.3-fold to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation.
Yoshikazu INAGAKI Shinya TAKAMAEDA-YAMAZAKI Jun YAO Yasuhiko NAKASHIMA
The Energy-aware Multi-mode Accelerator eXtension [24],[25] (EMAX) is equipped with distributed single-port local memories and ring-formed interconnections. The accelerator is designed to achieve extremely high throughput for scientific computations, big data, and image processing as well as low-power consumption. However, before mapping algorithms on the accelerator, application developers require sufficient knowledge of the hardware organization and specially designed instructions. They also need significant effort to tune the code for improving execution efficiency when no well-designed compiler or library is available. To address this problem, we focus on library support for stencil (nearest-neighbor) computations that represent a class of algorithms commonly used in many partial differential equation (PDE) solvers. In this research, we address the following topics: (1) system configuration, features, and mnemonics of EMAX; (2) instruction mapping techniques that reduce the amount of data to be read from the main memory; (3) performance evaluation of the library for PDE solvers. With the features of a library that can reuse the local data across the outer loop iterations and map many instructions by unrolling the outer loops, the amount of data to be read from the main memory is significantly reduced to a minimum of 1/7 compared with a hand-tuned code. In addition, the stencil library reduced the execution time 23% more than a general-purpose processor.
Keisuke DOHI Koji OKINA Rie SOEJIMA Yuichiro SHIBATA Kiyoshi OGURI
In this paper, we discuss performance modeling of 3-D stencil computing on an FPGA accelerator with a high-level synthesis environment, aiming for efficient exploration of user-space design parameters. First, we analyze resource utilization and performance to formulate these relationships as mathematical models. Then, in order to evaluate our proposed models, we implement heat conduction simulations as a benchmark application, by using MaxCompiler, which is a high-level synthesis tool for FPGAs, and MaxGenFD, which is a domain specific framework of the MaxCompiler for finite-difference equation solvers. The experimental results with various settings of architectural design parameters show the best combination of design parameters for pipeline structure can be systematically found by using our models. The effects of changing arithmetic accuracy and using data stream compression are also discussed.
Taisuke KAZAMA Makoto IKEDA Kunihiro ASADA
We propose a shot reduction technique of character projection (CP) Electron Beam Direct Writing (EBDW) using combined cell stencil (CCS) or the advanced process technology. CP EBDW is expected both to reduce mask costs and to realize quick turn around time. One of major issue of the conventional CP EBDW, however, is a throughput of lithography. The throughput is determined by numbers of shots, which are proportional to numbers of cell instances in LSIs. The conventional shot reduction techniques focus on optimization of cell stencil extraction, without any modifications on designed LSI mask patterns. The proposed technique employs the proposed combined cell stencil, with proposed modified design flow, for further shot reduction. We demonstrate 22.4% shot reduction within 4.3% area increase for a microprocessor and 28.6% shot reduction for IWLS benchmarks compared with the conventional technique.
Seiji FUJINO Ryutaro HIMENO Akira KOJIMA Kazuo TERADA
We describe the implementation of an iterative method with the goal of gaining a long vector length. The strategy for vectorization by means of multipoint stencils used for discretization of the partial differential equations is discussed. Numerical experiments show that the strategy that requires certain restrictions on the number of grid points in the x and y directions improves the performance on the vector supercomputer.