The search functionality is under construction.

Keyword Search Result

[Keyword] stencil (5 hits)

1-5 of 5 hits
  • A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU

    Jingcheng SHEN  Fumihiko INO  Albert FARRÉS  Mauricio HANZICH  

     
    PAPER-Fundamentals of Information Systems

      Publicized:
    2020/09/07
      Vol:
    E103-D No:12
      Page(s):
    2421-2434

    Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates the use of out-of-core computation to process excess data. Great programming effort is needed to manually implement efficient out-of-core stencil code. To relieve such programming burdens, directive-based frameworks emerged, such as the pipelined accelerator (PACC); however, they usually lack specific optimizations to reduce data transfer. In this paper, we extend PACC with two data-centric optimizations to address data transfer problems. The first is a direct-mapping scheme that eliminates host (i.e., CPU) buffers, which intermediate between the original data and device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. The extended PACC was applied to an acoustic wave propagator, automatically extending the length of original serial code 2.3-fold to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation.

  • Performance Evaluation of a 3D-Stencil Library for Distributed Memory Array Accelerators

    Yoshikazu INAGAKI  Shinya TAKAMAEDA-YAMAZAKI  Jun YAO  Yasuhiko NAKASHIMA  

     
    PAPER-Architecture

      Publicized:
    2015/09/15
      Vol:
    E98-D No:12
      Page(s):
    2141-2149

    The Energy-aware Multi-mode Accelerator eXtension [24],[25] (EMAX) is equipped with distributed single-port local memories and ring-formed interconnections. The accelerator is designed to achieve extremely high throughput for scientific computations, big data, and image processing as well as low-power consumption. However, before mapping algorithms on the accelerator, application developers require sufficient knowledge of the hardware organization and specially designed instructions. They also need significant effort to tune the code for improving execution efficiency when no well-designed compiler or library is available. To address this problem, we focus on library support for stencil (nearest-neighbor) computations that represent a class of algorithms commonly used in many partial differential equation (PDE) solvers. In this research, we address the following topics: (1) system configuration, features, and mnemonics of EMAX; (2) instruction mapping techniques that reduce the amount of data to be read from the main memory; (3) performance evaluation of the library for PDE solvers. With the features of a library that can reuse the local data across the outer loop iterations and map many instructions by unrolling the outer loops, the amount of data to be read from the main memory is significantly reduced to a minimum of 1/7 compared with a hand-tuned code. In addition, the stencil library reduced the execution time 23% more than a general-purpose processor.

  • Performance Modeling of Stencil Computing on a Stream-Based FPGA Accelerator for Efficient Design Space Exploration

    Keisuke DOHI  Koji OKINA  Rie SOEJIMA  Yuichiro SHIBATA  Kiyoshi OGURI  

     
    PAPER-Application

      Publicized:
    2014/11/19
      Vol:
    E98-D No:2
      Page(s):
    298-308

    In this paper, we discuss performance modeling of 3-D stencil computing on an FPGA accelerator with a high-level synthesis environment, aiming for efficient exploration of user-space design parameters. First, we analyze resource utilization and performance to formulate these relationships as mathematical models. Then, in order to evaluate our proposed models, we implement heat conduction simulations as a benchmark application, by using MaxCompiler, which is a high-level synthesis tool for FPGAs, and MaxGenFD, which is a domain specific framework of the MaxCompiler for finite-difference equation solvers. The experimental results with various settings of architectural design parameters show the best combination of design parameters for pipeline structure can be systematically found by using our models. The effects of changing arithmetic accuracy and using data stream compression are also discussed.

  • LSI Design Flow for Shot Reduction of Character Projection Electron Beam Direct Writing Using Combined Cell Stencil

    Taisuke KAZAMA  Makoto IKEDA  Kunihiro ASADA  

     
    PAPER-Physical Design

      Vol:
    E89-A No:12
      Page(s):
    3546-3550

    We propose a shot reduction technique for character projection (CP) Electron Beam Direct Writing (EBDW) using combined cell stencil (CCS) or the advanced process technology. CP EBDW is expected both to reduce mask costs and to realize quick turnaround time. One of the major issues of conventional CP EBDW, however, is lithography throughput. The throughput is determined by the number of shots, which is proportional to the number of cell instances in LSIs. Conventional shot reduction techniques focus on optimizing cell stencil extraction, without any modification to the designed LSI mask patterns. The proposed technique employs the combined cell stencil together with a modified design flow for further shot reduction. We demonstrate a 22.4% shot reduction within a 4.3% area increase for a microprocessor and a 28.6% shot reduction for IWLS benchmarks compared with the conventional technique.

  • Implementation of the Multicolored SOR Method on a Vector Supercomputer

    Seiji FUJINO  Ryutaro HIMENO  Akira KOJIMA  Kazuo TERADA  

     
    PAPER

      Vol:
    E80-D No:4
      Page(s):
    518-523

    We describe the implementation of an iterative method with the goal of gaining a long vector length. The strategy for vectorization by means of multipoint stencils used for discretization of the partial differential equations is discussed. Numerical experiments show that the strategy that requires certain restrictions on the number of grid points in the x and y directions improves the performance on the vector supercomputer.
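
As a rough illustration of why a multicolored ordering yields long vector lengths, the sketch below applies the simplest case, a two-color (red-black) SOR sweep, to a 5-point stencil for the 2-D Poisson equation. It is not the implementation from the paper above: the function name `red_black_sor`, its parameters, and the choice of NumPy are all assumptions made for this example. Because every point of one color depends only on points of the other color, each half-sweep can be written as a single whole-array operation.

```python
import numpy as np

def red_black_sor(u, f, h, omega=1.5, sweeps=100):
    """Two-color SOR for -laplace(u) = f on a uniform grid (hypothetical helper).

    The outermost rows and columns of u hold fixed Dirichlet boundary values.
    """
    u = u.copy()
    ii, jj = np.meshgrid(np.arange(u.shape[0]), np.arange(u.shape[1]),
                         indexing="ij")
    interior = np.zeros(u.shape, dtype=bool)
    interior[1:-1, 1:-1] = True
    for _ in range(sweeps):
        for colour in (0, 1):
            # Points whose index sum has the same parity never neighbor each
            # other in a 5-point stencil, so they can be updated together.
            mask = interior & ((ii + jj) % 2 == colour)
            gs = np.zeros_like(u)
            gs[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                                     + u[1:-1, :-2] + u[1:-1, 2:]
                                     + h * h * f[1:-1, 1:-1])
            # Over-relaxed update applied to one color as a single vector op.
            u[mask] += omega * (gs[mask] - u[mask])
    return u

# Usage: Laplace problem whose exact solution is u(x, y) = x.
n = 17
x = np.linspace(0.0, 1.0, n)
exact = np.tile(x[:, None], (1, n))
u0 = exact.copy()
u0[1:-1, 1:-1] = 0.0          # keep boundary values, zero the interior guess
u = red_black_sor(u0, np.zeros((n, n)), h=1.0 / (n - 1), sweeps=200)
```

With two colors, each vector operation touches roughly half of the interior points. Orderings with more colors serve the same purpose for wider multipoint stencils, at the cost of shorter vectors per color.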