
Author Search Result

[Author] Fumihiko INO (11 hits)

Results 1-11 of 11
  • Memory Efficient Load Balancing for Distributed Large-Scale Volume Rendering Using a Two-Layered Group Structure

    Marcus WALLDEN  Stefano MARKIDIS  Masao OKITA  Fumihiko INO  

     
    PAPER-Computer Graphics

    Publicized: 2019/09/09
    Vol: E102-D No:12
    Page(s): 2306-2316

    We propose a novel compositing pipeline and a dynamic load-balancing technique for volume rendering that use a two-layered group structure to achieve effective and scalable load balancing. The technique enables each process to render data from non-contiguous regions of the volume with minimal impact on the total render time. We demonstrate the effectiveness of the proposed technique through a set of experiments on a modern GPU cluster. The experiments show that the technique achieves up to 35.7% lower worst-case memory usage than a dynamic k-d tree load-balancing technique, while achieving similar or higher render performance. The proposed technique also lowered the amount of data transferred during the load-balancing stage by up to 72.2%. The technique has the potential to be used in many scenarios where other dynamic load-balancing techniques have proved inadequate, such as large-scale visualization.
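    A minimal sketch of the two-layered idea follows: bricks are dealt greedily first to groups and then to processes within each group, which is what lets a process hold non-contiguous regions. The brick costs, group layout, and all names are illustrative assumptions, not the paper's implementation.

    ```python
    def balance(bricks, groups):
        """bricks: {brick_id: cost}; groups: list of lists of process ids."""
        # Layer 1: give each brick to the group with the lowest per-process load.
        group_load = [0.0] * len(groups)
        group_bricks = [[] for _ in groups]
        for b in sorted(bricks, key=bricks.get, reverse=True):
            g = min(range(len(groups)),
                    key=lambda i: group_load[i] / len(groups[i]))
            group_bricks[g].append(b)
            group_load[g] += bricks[b]
        # Layer 2: inside each group, give each brick to the least-loaded process.
        assignment = {}
        for g, members in enumerate(groups):
            load = {p: 0.0 for p in members}
            for b in group_bricks[g]:
                p = min(load, key=load.get)
                assignment.setdefault(p, []).append(b)
                load[p] += bricks[b]
        return assignment

    print(balance({"b0": 3.0, "b1": 1.0, "b2": 2.0, "b3": 2.5},
                  [["p0", "p1"], ["p2", "p3"]]))
    ```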

  • Cache-Aware GPU Optimization for Out-of-Core Cone Beam CT Reconstruction of High-Resolution Volumes

    Yuechao LU  Fumihiko INO  Kenichi HAGIHARA  

     
    PAPER-Computer System

    Publicized: 2016/09/05
    Vol: E99-D No:12
    Page(s): 3060-3071

    This paper proposes a cache-aware optimization method to accelerate out-of-core cone beam computed tomography reconstruction on a graphics processing unit (GPU). Our method extends a previous method by increasing the cache hit rate so as to speed up the reconstruction of high-resolution volumes that exceed the capacity of device memory. More specifically, our approach accelerates the well-known Feldkamp-Davis-Kress algorithm with three strategies: (1) a loop organization strategy that identifies the best tradeoff point between the cache hit rate and the number of off-chip memory accesses; (2) a data structure that exploits high locality within a layered texture; and (3) a fully pipelined strategy that hides file input/output (I/O) time behind GPU execution and data transfer times. We implemented our method on NVIDIA's Maxwell architecture and provide tuning guidelines for the execution parameters, including the granularity and shape of thread blocks and the granularity of I/O data streamed through the pipeline, so as to maximize reconstruction performance. Our experimental results show that a 2048³-voxel volume was reconstructed from 1200 2048²-pixel projection images in less than three minutes on a single GPU; this translates to a speedup of approximately 1.47 over the previous method.
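    The fully pipelined strategy can be sketched as three stages connected by bounded queues, so file reads, transfers, and reconstruction overlap. The stage bodies below are stand-ins; only the overlap structure reflects the abstract.

    ```python
    import threading, queue

    def pipeline(chunks, read, transfer, compute, depth=2):
        """Three overlapped stages; bounded queues provide backpressure."""
        q1, q2 = queue.Queue(depth), queue.Queue(depth)

        def stage(src, fn, dst):
            while True:
                item = src.get()
                if item is None:              # sentinel: shut the stage down
                    if dst is not None:
                        dst.put(None)
                    return
                out = fn(item)
                if dst is not None:
                    dst.put(out)

        feed = queue.Queue()
        for c in chunks:
            feed.put(c)
        feed.put(None)
        stages = [(feed, read, q1), (q1, transfer, q2), (q2, compute, None)]
        threads = [threading.Thread(target=stage, args=s) for s in stages]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    pipeline(range(4),
             lambda c: f"proj{c}",            # stand-in for a file read
             lambda d: d + "->device",        # stand-in for a host-to-device copy
             lambda d: print("backproject", d))
    ```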

  • Evaluation of Performance Prediction Method for Master/Slave Parallel Programs

    Yasuharu MIZUTANI  Fumihiko INO  Kenichi HAGIHARA  

     
    PAPER-Computer Systems

    Vol: E87-D No:4
    Page(s): 967-975

    This paper describes the design and implementation of a testbed for predicting the performance of master/slave (M/S) programs written using the Message Passing Interface (MPI). The testbed, named M/S Emulator (MSE), aims to assist developers in evaluating the performance of M/S programs and dynamic load-balancing strategies on clusters of PCs. To realize this, MSE predicts communication time using a realistic parallel computational model, an extension of the LogGPS model. This extended model improves prediction accuracy on large numbers of processors because it captures the master's bottleneck: the overhead required for retrieving arriving messages from the slaves. The current MSE also employs a best-effort emulation method for predicting calculation time. In our experiments, MSE produced accurate predictions on clusters, especially on larger numbers of nodes. We therefore believe that our extended model enables analysis of the scalability of M/S program performance.
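    As a rough illustration of the kind of prediction such a model makes, the sketch below combines the standard LogGPS parameters (L: latency, o: CPU overhead, g: gap per message, G: gap per byte) with an assumed per-message retrieval cost r at the master; the formula is our simplification for illustration, not the paper's model.

    ```python
    def master_receive_time(num_slaves, msg_bytes, L, o, g, G, r):
        """Time until the master has retrieved one result from every slave."""
        per_msg = max(g, o + r) + G * msg_bytes  # retrievals serialize at the master
        return L + o + num_slaves * per_msg

    # With 64 slaves, the num_slaves * per_msg term dominates: the master-side
    # bottleneck the testbed is built to expose. Parameter values are made up.
    print(master_receive_time(64, 1024, L=5e-6, o=2e-6, g=4e-6, G=1e-9, r=3e-6))
    ```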

  • Cache-Aware, In-Place Rotation Method for Texture-Based Volume Rendering

    Yuji MISAKI  Fumihiko INO  Kenichi HAGIHARA  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2016/12/12
    Vol: E100-D No:3
    Page(s): 452-461

    We propose a cache-aware method to accelerate texture-based volume rendering on a graphics processing unit (GPU) that is compatible with the compute unified device architecture (CUDA). The proposed method extends a previous method so as to maximize the average rendering performance while the viewing direction rotates around the volume. To realize this, the proposed method performs in-place rotation of the volume data, which rearranges the order of voxels so that consecutive threads (warps) refer to voxels with minimum access strides. Experiments indicate that the proposed method raises the worst-case texture cache (TC) hit rate from 42% to 93% for a 1024³-voxel volume; consequently, the average frame rate increases by a factor of 1.6 over the previous method. Although the overhead of in-place rotation slightly decreases the frame rate from 2.0 frames per second (fps) to 1.9 fps, this slowdown occurs only for a few viewing directions.
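    The effect of voxel reordering is easy to see in NumPy: transposing the volume so the traversal axis becomes unit-stride is a stand-in for the paper's in-place GPU rotation (a real in-place version avoids allocating a second volume).

    ```python
    import numpy as np

    vol = np.zeros((256, 256, 256), dtype=np.float32)   # (z, y, x) order
    print(vol.strides)        # x is the unit-stride axis here

    # If a new viewing direction makes consecutive threads walk along z,
    # rearrange the layout so the old z axis becomes the fastest-varying one:
    rotated = np.ascontiguousarray(np.transpose(vol, (1, 2, 0)))
    print(rotated.strides)    # the old z axis is now unit-stride
    ```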

  • A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU

    Jingcheng SHEN  Fumihiko INO  Albert FARRÉS  Mauricio HANZICH  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2020/09/07
    Vol: E103-D No:12
    Page(s): 2421-2434

    Graphics processing units (GPUs) are highly efficient architectures for parallel stencil code; however, the small device (i.e., GPU) memory capacity (several tens of GBs) necessitates out-of-core computation to process excess data. Implementing efficient out-of-core stencil code by hand takes great programming effort. To relieve this burden, directive-based frameworks such as the pipelined accelerator (PACC) have emerged; however, they usually lack specific optimizations for reducing data transfer. In this paper, we extend PACC with two data-centric optimizations that address data transfer problems. The first is a direct-mapping scheme that eliminates the host (i.e., CPU) buffers that sit between the original data and the device buffers. The second is a region-sharing scheme that significantly reduces host-to-device data transfer. We applied the extended PACC to an acoustic wave propagator; it automatically extended the original serial code 2.3-fold in length to obtain the out-of-core code. Experimental results revealed that on a Tesla V100 GPU, the generated code ran 41.0, 22.1, and 3.6 times as fast as implementations based on Open Multi-Processing (OpenMP), Unified Memory, and the previous PACC, respectively. The generated code was also useful with small datasets that fit in device memory, running 1.3 times as fast as an in-core implementation.
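    The out-of-core pattern can be sketched with a 1-D three-point stencil: each chunk is copied with a one-cell halo, processed, and its interior copied back. NumPy stands in for the device, and the direct-mapping and region-sharing schemes themselves are elided; all names are illustrative.

    ```python
    import numpy as np

    def stencil_chunked(u, chunk, halo=1):
        """Apply a 3-point average chunk by chunk, as an out-of-core run would."""
        out = u.copy()
        n = len(u)
        for s in range(0, n, chunk):
            lo, hi = max(s - halo, 0), min(s + chunk + halo, n)
            tile = u[lo:hi].copy()                          # "host-to-device" copy
            avg = (tile[:-2] + tile[1:-1] + tile[2:]) / 3.0  # stencil on the tile
            out[lo + 1:lo + 1 + len(avg)] = avg             # "device-to-host" copy
        return out

    u = np.arange(16, dtype=np.float64)
    print(stencil_chunked(u, chunk=8))
    ```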

  • GPU-Chariot: A Programming Framework for Stream Applications Running on Multi-GPU Systems

    Fumihiko INO  Shinta NAKAGAWA  Kenichi HAGIHARA  

     
    PAPER

    Vol: E96-D No:12
    Page(s): 2604-2616

    This paper presents a stream programming framework, named GPU-chariot, for accelerating stream applications running on graphics processing units (GPUs). The main contribution of our framework is that it realizes efficient software pipelines on multi-GPU systems by enabling out-of-order execution of CPU functions, kernels, and data transfers. To achieve this out-of-order execution, we apply a runtime scheduler that maximizes the utilization of system resources while hiding the number of available GPUs from the programmer. In addition, we implement a load-balancing capability to flow data efficiently through multiple GPUs. Furthermore, a callback interface enables the overlapping execution of functions in third-party libraries. Using kernels with different performance bottlenecks, we show that our out-of-order execution is up to 20% faster than in-order execution. Finally, we conduct several case studies on a 4-GPU system and demonstrate the advantages of GPU-chariot over manually pipelined code. We conclude that GPU-chariot is useful for developing stream applications with software pipelines on multiple GPUs and CPUs.
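    A toy version of the out-of-order idea: a scheduler releases each step as soon as its dependencies are satisfied and a worker is free, so independent kernels overlap automatically. Thread pools stand in for CUDA streams, and all names are ours, not GPU-chariot's API.

    ```python
    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def run_out_of_order(tasks, workers=4):
        """tasks: {name: (fn, [dep names])}; runs each fn(*dep results)."""
        results, running = {}, {}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while len(results) < len(tasks):
                # Submit every task whose dependencies are already done.
                for name, (fn, deps) in tasks.items():
                    if (name not in results and name not in running
                            and all(d in results for d in deps)):
                        running[name] = pool.submit(fn, *(results[d] for d in deps))
                done, _ = wait(list(running.values()), return_when=FIRST_COMPLETED)
                for name in [n for n, f in running.items() if f in done]:
                    results[name] = running.pop(name).result()
        return results

    print(run_out_of_order({
        "load": (lambda: "frame", []),
        "h2d":  (lambda x: x + ">gpu", ["load"]),
        "k1":   (lambda x: x + ">k1", ["h2d"]),
        "k2":   (lambda x: x + ">k2", ["h2d"]),   # overlaps with k1
    }))
    ```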

  • Parallel Adaptive Estimation of Hip Range of Motion for Total Hip Replacement Surgery

    Yasuhiro KAWASAKI  Fumihiko INO  Yoshinobu SATO  Shinichi TAMURA  Kenichi HAGIHARA  

     
    PAPER-Parallel Image Processing

    Vol: E90-D No:1
    Page(s): 30-39

    This paper presents the design and implementation of a hip range of motion (ROM) estimation method that is capable of fine-grained estimation during total hip replacement (THR) surgery. Our method is based on two acceleration strategies: (1) adaptive mesh refinement (AMR) for complexity reduction and (2) parallelization for further acceleration. On the assumption that the hip ROM is a single closed region, the AMR strategy reduces the complexity for N×N×N stance configurations from O(N^3) to O(N^D), where 2 ≤ D ≤ 3 and D is a data-dependent value that can be approximated by 2 in most cases. The parallelization strategy employs the master/worker paradigm with multiple task queues, reducing synchronization between processors while balancing load. The experimental results indicate that the implementation on a cluster of 64 PCs completes estimation of 360×360×180 stance configurations in 20 seconds, playing a key role in selecting and aligning the optimal combination of artificial joint components during THR surgery.
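    The AMR strategy can be sketched in 2-D: a cell whose corners agree is assumed uniform and skipped, and only mixed cells are subdivided, so the work tracks the boundary surface rather than the full volume. The collision test and grid below are toy stand-ins, and the uniform-cell shortcut relies on the single-closed-region assumption stated above.

    ```python
    def rom_boundary(collides, size, depth=3):
        """Return sample points on the safe/colliding boundary of a 2-D grid."""
        def refine(x0, y0, step, d):
            corners = [(x0, y0), (x0 + step, y0),
                       (x0, y0 + step), (x0 + step, y0 + step)]
            vals = [collides(x, y) for x, y in corners]
            if all(vals) or not any(vals):
                return []                        # uniform cell: no boundary inside
            if d == 0:
                return [(x0 + step / 2, y0 + step / 2)]
            h = step / 2
            return sum((refine(x, y, h, d - 1)
                        for x in (x0, x0 + h) for y in (y0, y0 + h)), [])
        return refine(0.0, 0.0, float(size), depth)

    # Toy example: the "safe" region is a disc of radius 40 in a 100x100 grid.
    pts = rom_boundary(lambda x, y: x * x + y * y > 1600, 100, depth=4)
    print(len(pts), "cells refined down to the boundary")
    ```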

  • Block Randomized Singular Value Decomposition on GPUs

    Yuechao LU  Yasuyuki MATSUSHITA  Fumihiko INO  

     
    PAPER-Dependable Computing

    Publicized: 2020/06/08
    Vol: E103-D No:9
    Page(s): 1949-1959

    Fast computation of the singular value decomposition (SVD) is of great interest in various machine learning tasks. Recently, SVD methods based on randomized linear algebra have shown significant speedups in this regime. For processing large-scale data, computing systems with accelerators such as GPUs have become the mainstream approach. In such systems, access to the input data dominates the overall processing time; it is therefore necessary to design an out-of-core algorithm that dispatches the computation to the accelerators. This paper proposes an accurate two-pass randomized SVD, named block randomized SVD (BRSVD), designed for matrices with the slowly decaying singular spectrum often observed in image data. BRSVD fully utilizes modern computing system architectures and efficiently processes large-scale data in a parallel, out-of-core fashion. Our experiments show that BRSVD effectively moves the performance bottleneck from data transfer to computation, and thus outperforms existing randomized SVD methods in speed while retaining similar accuracy.
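    A plain-NumPy sketch of the two-pass, block-wise structure: the sample matrix and the projection are both accumulated one row block at a time, so the full matrix never needs to be resident at once (out-of-core when the blocks come from disk). Block size, oversampling, and variable names are illustrative; the paper's GPU pipeline is omitted.

    ```python
    import numpy as np

    def block_rsvd(A, k, block=1024, oversample=8):
        m, n = A.shape
        rng = np.random.default_rng(0)
        Omega = rng.standard_normal((n, k + oversample))
        Y = np.empty((m, k + oversample))
        for s in range(0, m, block):              # pass 1: sample row blocks
            Y[s:s + block] = A[s:s + block] @ Omega
        Q, _ = np.linalg.qr(Y)                    # orthonormal range basis
        B = np.zeros((k + oversample, n))
        for s in range(0, m, block):              # pass 2: project row blocks
            B += Q[s:s + block].T @ A[s:s + block]
        Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
        return (Q @ Ub)[:, :k], S[:k], Vt[:k]

    A = np.random.default_rng(1).standard_normal((4000, 300))
    U, S, Vt = block_rsvd(A, k=10)
    print(S[:3])                                  # close to np.linalg.svd(A)
    ```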

  • Accelerating the Held-Karp Algorithm for the Symmetric Traveling Salesman Problem

    Kazuro KIMURA  Shinya HIGA  Masao OKITA  Fumihiko INO  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2019/08/23
    Vol: E102-D No:12
    Page(s): 2329-2340

    In this paper, we propose an acceleration method for the Held-Karp algorithm, which solves the symmetric traveling salesman problem by dynamic programming. The proposed method achieves acceleration with two techniques. First, we locate data-independent subproblems so that they can be solved in parallel. Second, we reduce the number of subproblems with a meet-in-the-middle (MITM) technique, which computes the optimal path from both the clockwise and counterclockwise directions. We present a theoretical analysis of the impact of MITM on the time and space complexities. In experiments, the proposed method on an 8-core CPU was 9.5-10.5 times faster than a previous method on a single-core CPU, and the proposed method on a graphics processing unit (GPU) was a further 30-40 times faster than on the 8-core CPU. As a side effect, the proposed method reduced memory usage by 48%.
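    For reference, a compact sequential Held-Karp without the MITM reduction is sketched below; note that all entries for visited sets of the same size are mutually independent, which is the parallelism the paper exploits.

    ```python
    from itertools import combinations

    def held_karp(dist):
        """Exact TSP tour length by dynamic programming over visited sets."""
        n = len(dist)
        dp = {(1, 0): 0.0}                 # visited set {0}, path ends at city 0
        for size in range(2, n + 1):       # sets of one size are independent
            for subset in combinations(range(1, n), size - 1):
                S = 1 | sum(1 << i for i in subset)
                for j in subset:
                    prev = S ^ (1 << j)
                    dp[S, j] = min(dp[prev, i] + dist[i][j]
                                   for i in range(n) if (prev, i) in dp)
        full = (1 << n) - 1
        return min(dp[full, j] + dist[j][0] for j in range(1, n))

    d = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 3], [10, 4, 3, 0]]
    print(held_karp(d))                    # 18 for this instance
    ```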

  • Grid Resource Monitoring and Selection for Rapid Turnaround Applications

    Kensuke MURAKI  Yasuhiro KAWASAKI  Yasuharu MIZUTANI  Fumihiko INO  Kenichi HAGIHARA  

     
    PAPER-Computer Systems

    Vol: E89-D No:9
    Page(s): 2491-2501

    In this paper, we present a resource monitoring and selection method for grid applications that require rapid turnaround (for example, within 10 seconds). The novelty of our method is the distributed evaluation of resources, which rapidly selects appropriate idle resources. We integrate our method with a widely used resource management system, the Monitoring and Discovery System 2 (MDS2), and compare our method with the original MDS2 in terms of performance and scalability. Performance is measured on a 64-node cluster of PCs, and scalability is analyzed using a theoretical model together with the measured performance. The experimental results show that our method reduces the resource selection time by 82% compared with the original MDS2. The scalability analysis also indicates that our method keeps the resource selection time within 1 second for up to 500 nodes in local-area-network (LAN) environments. In addition, simulation results estimate the impact of our method in wide-area-network (WAN) environments.
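    The distributed-evaluation idea can be sketched as each node filtering itself against the request locally, so the selector only ranks nodes that already know they qualify instead of gathering raw metrics from everyone. The metrics, thresholds, and names below are illustrative, not MDS2's actual schema.

    ```python
    def node_evaluate(state, request):
        """Runs on each node; returns a bid only if the node is idle enough."""
        if (state["cpu_idle"] >= request["min_idle"]
                and state["free_mem"] >= request["min_mem"]):
            return {"node": state["name"], "score": state["cpu_idle"]}
        return None

    def select(nodes, request, k=1):
        bids = [b for s in nodes if (b := node_evaluate(s, request))]
        return sorted(bids, key=lambda b: -b["score"])[:k]

    cluster = [{"name": f"pc{i:02d}", "cpu_idle": 0.1 * i, "free_mem": 512 * i}
               for i in range(8)]
    print(select(cluster, {"min_idle": 0.5, "min_mem": 1024}, k=2))
    ```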

  • Accelerating Smith-Waterman Algorithm for Biological Database Search on CUDA-Compatible GPUs

    Yuma MUNEKAWA  Fumihiko INO  Kenichi HAGIHARA  

     
    PAPER-Parallel and Distributed Architecture

    Vol: E93-D No:6
    Page(s): 1479-1488

    This paper presents a fast method for accelerating the Smith-Waterman algorithm for biological database search on a cluster of graphics processing units (GPUs). Our method is implemented using the compute unified device architecture (CUDA), which is available on NVIDIA GPUs. Compared with previous methods, our method makes four major contributions. (1) It efficiently uses on-chip shared memory to reduce the amount of data transferred between off-chip video memory and the processing elements in the GPU. (2) It reduces the number of data fetches by applying a data-reuse technique to the query and database sequences. (3) A pipelined method overlaps GPU execution with database access. (4) Finally, a master/worker paradigm accelerates hundreds of database searches on a cluster system. In experiments, the peak performance on a GeForce GTX 280 card reaches 8.32 giga cell updates per second (GCUPS). We also find that our method reduces the number of data fetches to 1/140, achieving approximately three times the performance of a previous CUDA-based method. Our 32-node cluster version is approximately 28 times faster than the single-GPU version. Furthermore, the effective performance reaches 75.6 giga instructions per second (GIPS) using 32 GeForce 8800 GTX cards.
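    For reference, the Smith-Waterman recurrence with a linear gap penalty is sketched below in plain Python; on the GPU, all cells along an anti-diagonal are independent and can be evaluated in parallel. The scoring parameters are illustrative.

    ```python
    def smith_waterman(q, d, match=2, mismatch=-1, gap=-1):
        """Best local alignment score between query q and database sequence d."""
        H = [[0] * (len(d) + 1) for _ in range(len(q) + 1)]
        best = 0
        for i in range(1, len(q) + 1):
            for j in range(1, len(d) + 1):
                s = match if q[i - 1] == d[j - 1] else mismatch
                H[i][j] = max(0,                        # local: never go negative
                              H[i - 1][j - 1] + s,      # match/mismatch
                              H[i - 1][j] + gap,        # gap in d
                              H[i][j - 1] + gap)        # gap in q
                best = max(best, H[i][j])
        return best

    print(smith_waterman("GGTTGACTA", "TGTTACGG"))      # local alignment score
    ```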