
Keyword Search Result

[Keyword] GPU (89 hits)

Showing results 21-40 of 89

  • OpenACC Parallelization of Stochastic Simulations on GPUs

    Pilsung KANG  

     
    LETTER-Fundamentals of Information Systems

    Publicized: 2019/05/17 | Vol: E102-D No:8 | Page(s): 1565-1568

    We present an OpenACC-based parallel implementation of stochastic algorithms for simulating biochemical reaction networks on modern GPUs (graphics processing units). To investigate how effectively OpenACC can leverage the massive hardware parallelism of the GPU architecture, we carefully apply OpenACC's language constructs and mechanisms to implement a parallel version of stochastic simulation algorithms on the GPU. Comparing our OpenACC implementation against both the NVidia CUDA and CPU-based implementations, we report our initial experiences with OpenACC's performance and programming productivity in the context of GPU-accelerated scientific computing.
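
    For illustration, the realization-per-thread mapping that such simulations expose can be sketched in CUDA (the paper itself expresses this mapping with OpenACC directives rather than hand-written kernels). The toy decay model A -> 0 and all names below are illustrative, not the paper's code:

        #include <cuda_runtime.h>
        #include <curand_kernel.h>

        // Each thread runs one independent stochastic realization of the
        // decay reaction A -> 0 using Gillespie's direct method.
        __global__ void ssa_decay(int *final_count, int n_real, int a0,
                                  float k, float t_end,
                                  unsigned long long seed) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            if (tid >= n_real) return;
            curandState st;
            curand_init(seed, tid, 0, &st);  // independent RNG stream per thread
            int a = a0;
            float t = 0.0f;
            while (a > 0) {
                float prop = k * a;                             // propensity
                float tau = -logf(curand_uniform(&st)) / prop;  // next-event time
                if (t + tau > t_end) break;
                t += tau;
                --a;                                            // fire reaction
            }
            final_count[tid] = a;
        }

        int main() {
            const int n = 1 << 16;             // 65536 realizations in bulk
            int *d_out;
            cudaMalloc(&d_out, n * sizeof(int));
            ssa_decay<<<(n + 255) / 256, 256>>>(d_out, n, 1000, 0.1f, 10.0f,
                                                1234ULL);
            cudaDeviceSynchronize();
            cudaFree(d_out);
            return 0;
        }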

  • Fast Computation with Efficient Object Data Distribution for Large-Scale Hologram Generation on a Multi-GPU Cluster Open Access

    Takanobu BABA  Shinpei WATANABE  Boaz JESSIE JACKIN  Kanemitsu OOTSU  Takeshi OHKAWA  Takashi YOKOTA  Yoshio HAYASAKI  Toyohiko YATAGAI  

     
    PAPER-Human-computer Interaction

    Publicized: 2019/03/29 | Vol: E102-D No:7 | Page(s): 1310-1320

    The 3D holographic display has long been expected to serve as a future human interface, as it does not require users to wear special devices. However, its heavy computational requirements have prevented the realization of such displays. A recent study indicates that objects and holograms of several gigapixels must be processed in real time to achieve high resolution and a wide viewing angle. To address this problem, we first adapted a conventional FFT algorithm to a GPU cluster environment to avoid heavy inter-node communication. We then applied several single-node and multi-node optimization and parallelization techniques. The single-node optimizations include a revised object decomposition scheme, reduced data transfer between the CPU and GPU, kernel integration, stream processing, and the use of multiple GPUs within a node. The multi-node optimizations concern how object data are distributed from the host node to the other nodes. Experimental results show that the intra-node optimizations attain an 11.52-fold speed-up over the original single-node code. Further, the multi-node optimizations, using 8 nodes with 2 GPUs per node, attain an execution time of 4.28 s for generating a 1.6-gigapixel hologram from a 3.2-gigapixel object. This corresponds to a 237.92-fold speed-up over sequential CPU processing and a 41.78-fold speed-up over multi-threaded execution on a multicore CPU, using a conventional FFT-based algorithm.
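
    One of the single-node optimizations named above, overlapping CPU-GPU data transfer with computation via streams, can be sketched in CUDA as follows. This is a hedged stand-in, not the authors' code: the 1D cuFFT transforms, chunk counts, and buffer names are illustrative.

        #include <cuda_runtime.h>
        #include <cufft.h>

        int main() {
            const int kChunks = 4, kN = 1 << 20;  // 4 chunks of 2^20 points
            cufftComplex *h_buf, *d_buf;
            cudaMallocHost(&h_buf, kChunks * kN * sizeof(cufftComplex)); // pinned
            cudaMalloc(&d_buf, kChunks * kN * sizeof(cufftComplex));
            // ... fill h_buf with decomposed object data here ...

            cudaStream_t s[kChunks];
            cufftHandle plan[kChunks];
            for (int i = 0; i < kChunks; ++i) {
                cudaStreamCreate(&s[i]);
                cufftPlan1d(&plan[i], kN, CUFFT_C2C, 1);
                cufftSetStream(plan[i], s[i]);    // FFT runs in stream i
            }
            for (int i = 0; i < kChunks; ++i) {
                // Copy of chunk i overlaps with the FFTs of earlier chunks.
                cudaMemcpyAsync(d_buf + i * kN, h_buf + i * kN,
                                kN * sizeof(cufftComplex),
                                cudaMemcpyHostToDevice, s[i]);
                cufftExecC2C(plan[i], d_buf + i * kN, d_buf + i * kN,
                             CUFFT_FORWARD);
            }
            cudaDeviceSynchronize();
            for (int i = 0; i < kChunks; ++i) {
                cufftDestroy(plan[i]);
                cudaStreamDestroy(s[i]);
            }
            cudaFreeHost(h_buf);
            cudaFree(d_buf);
            return 0;
        }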

  • Accelerating Large-Scale Interconnection Network Simulation by Cellular Automata Concept

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Computer System

    Publicized: 2018/10/05 | Vol: E102-D No:1 | Page(s): 52-74

    State-of-the-art parallel systems employ a huge number of computing nodes connected by an interconnection network (ICN). The ICN plays an important role in a parallel system, since it determines the system's communication capability. In general, an ICN exhibits non-linear phenomena in its communication performance, most of which are caused by congestion. Designing a large-scale parallel system therefore requires thorough evaluation through repetitive simulation runs, which raises a further problem: simulating large-scale systems at reasonable cost. This paper presents a promising solution that introduces the cellular automata concept, which originates in our prior work. Assuming 2D-torus topologies to simplify the discussion, this paper covers the fundamental design of router functions in cellular-automaton terms, the data structure of packets, an alternative model of the router function, and miscellaneous optimizations. The proposed models map well onto GPGPU technology; as representative speed-up results, the GPU-based simulator accelerates simulation by up to about 1264 times relative to sequential execution on a single CPU. Furthermore, since the proposed models are also applicable to the shared-memory model, a multithreaded implementation of the proposed methods achieves speed-ups of up to about 162 times.
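
    The cellular-automaton view can be pictured with a minimal CUDA kernel: every router of a 2D torus is a cell updated synchronously from its four neighbors with double buffering. The "occupancy" rule below is a toy stand-in; the paper's router model is far more detailed.

        // cur/nxt: W*H occupancy grids; the caller swaps the pointers each step.
        __global__ void ca_step(const int *cur, int *nxt, int W, int H) {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= W || y >= H) return;
            int xm = (x + W - 1) % W, xp = (x + 1) % W;  // torus wrap-around
            int ym = (y + H - 1) % H, yp = (y + 1) % H;
            int inflow = cur[y * W + xm] + cur[y * W + xp] +
                         cur[ym * W + x] + cur[yp * W + x];
            // Toy rule: accept a quarter of neighbor traffic, cap at depth 8.
            nxt[y * W + x] = min(cur[y * W + x] + inflow / 4, 8);
        }

    Because each cell reads only its neighbors' previous state, all cells update in parallel, which is exactly the property that maps well onto both GPGPU and multithreaded shared-memory execution.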

  • Real-Time and Energy-Efficient Face Detection on CPU-GPU Heterogeneous Embedded Platforms

    Chanyoung OH  Saehanseul YI  Youngmin YI  

     
    PAPER-Real-time Systems

    Publicized: 2018/09/18 | Vol: E101-D No:12 | Page(s): 2878-2888

    As energy efficiency has become a major design constraint or objective, heterogeneous manycore architectures have emerged as mainstream target platforms not only in server systems but also in embedded systems. Manycore accelerators such as GPUs, alongside heterogeneous CPU cores, are also becoming popular in embedded domains. However, as the number of cores in an embedded GPU is far less than that of a server GPU, it is important to utilize both the heterogeneous multi-core CPUs and the GPU to achieve the desired throughput with minimal energy consumption. In this paper, we present a case study of mapping LBP-based face detection onto a recent CPU-GPU heterogeneous embedded platform, which exploits both task parallelism and data parallelism to achieve maximal energy efficiency under a real-time constraint. We first present the parallelization technique for each task in GPU execution; we then propose performance and energy models for both task-parallel and data-parallel execution on heterogeneous processors, which are used in a design space exploration for the optimal mapping. The design space is huge, since it covers not only processor heterogeneity (CPU-GPU and big.LITTLE) but also the various data-partitioning ratios for data-parallel execution on these heterogeneous processors. In our case study of LBP face detection on the Exynos 5422, the estimation errors of the proposed performance and energy models were on average -2.19% and -3.67%, respectively. By systematically finding the optimal mappings with the proposed models, we achieved 28.6% less energy consumption than the manual mapping while still meeting the real-time constraint.
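
    The exploration step can be pictured with a small host-side sketch: sweep the CPU/GPU data-partitioning ratio and keep the lowest-energy point that meets the deadline. The linear models below are hypothetical stand-ins for the paper's fitted performance and energy models:

        #include <algorithm>
        #include <cstdio>

        // Hypothetical models; r is the fraction of work mapped to the GPU.
        double time_ms(double r)   { return std::max(40.0 * (1.0 - r), 25.0 * r); }
        double energy_mj(double r) { return 30.0 * (1.0 - r) + 55.0 * r; }

        int main() {
            const double deadline_ms = 33.3;  // ~30 fps real-time constraint
            double best_r = -1.0, best_e = 1e30;
            for (int i = 0; i <= 100; ++i) {  // sweep ratios in 1% steps
                double r = i / 100.0;
                if (time_ms(r) <= deadline_ms && energy_mj(r) < best_e) {
                    best_e = energy_mj(r);
                    best_r = r;
                }
            }
            printf("best GPU share %.2f, energy %.1f mJ\n", best_r, best_e);
            return 0;
        }

    The real design space additionally enumerates big.LITTLE core choices and task-parallel mappings, but the minimize-energy-under-deadline pattern is the same.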

  • View Priority Based Threads Allocation and Binary Search Oriented Reweight for GPU Accelerated Real-Time 3D Ball Tracking

    Yilin HOU  Ziwei DENG  Xina CHENG  Takeshi IKENAGA  

     
    PAPER-Image Recognition, Computer Vision

    Publicized: 2018/08/31 | Vol: E101-D No:12 | Page(s): 3190-3198

    In real-time 3D ball tracking for sports analysis, the complex algorithms that assure accuracy can be time-consuming. Particle filter based algorithms have large potential for acceleration, since the per-particle computations can be parallelized on a heterogeneous CPU-GPU platform. Still, the target multi-view 3D ball tracking algorithm poses challenges: 1) a serial flowchart for each step of the algorithm; 2) repeated processing across multiple views; and 3) a low degree of parallelism in the reweight and resampling steps due to their sequential processing. For the CPU-GPU platform, this paper proposes a double-stream system flow, view-priority-based thread allocation, and binary-search-oriented reweighting. The double-stream system flow assigns tasks with no data dependency to different streams for each frame, achieving parallelism at the system-structure level. View-priority-based thread allocation manages threads in the multi-view observation task: the number of threads is the number of views multiplied by the number of particles, and assigning threads by view priority helps both memory access and computation proceed in parallel. Binary-search-oriented reweighting reduces the time complexity by avoiding the generation of a cumulative distribution function, implementing a binary search over an unordered array. The experiments are based on videos of the final game of an official volleyball match (the 2014 Inter-High School Games of Men's Volleyball, held at the Tokyo Metropolitan Gymnasium in August 2014); the test sequences were captured by a multi-view system of 4 cameras located at the four corners of the court. The success rate reaches 99.23%, the same as the target algorithm, while the processing time is reduced from 75.1 ms/frame in the CPU environment to 3.05 ms/frame in the proposed system, a 24.62-fold speed-up; it also achieves a 2.33-fold speed-up over a baseline GPU implementation.
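
    As context for the reweighting proposal, conventional GPU resampling binary-searches an ordered cumulative distribution of particle weights, as in the minimal CUDA sketch below; the paper's variant goes further by avoiding explicit CDF generation, which this baseline does not reproduce.

        // cdf: inclusive prefix sums of particle weights (ascending).
        // u: one uniform draw in [0,1) per output particle.
        __global__ void resample(const float *cdf, const float *u,
                                 int *pick, int n) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            if (tid >= n) return;
            float target = u[tid] * cdf[n - 1];  // scale draw by total weight
            int lo = 0, hi = n - 1;
            while (lo < hi) {                    // first index with cdf >= target
                int mid = (lo + hi) / 2;
                if (cdf[mid] < target) lo = mid + 1; else hi = mid;
            }
            pick[tid] = lo;                      // index of surviving particle
        }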

  • Cooperative GPGPU Scheduling for Consolidating Server Workloads

    Yusuke SUZUKI  Hiroshi YAMADA  Shinpei KATO  Kenji KONO  

     
    PAPER-Software System

    Publicized: 2018/08/30 | Vol: E101-D No:12 | Page(s): 3019-3037

    Graphics processing units (GPUs) have become an attractive platform for general-purpose computing (GPGPU) in various domains. Making GPUs a time-multiplexed resource is key to consolidating GPGPU applications (apps) on multi-tenant cloud platforms. However, advanced GPGPU apps pose a new challenge for consolidation: such highly functional GPGPU apps, referred to as GPU eaters, can easily monopolize a shared GPU and starve collocated GPGPU apps. This paper presents GLoop, a software runtime that enables the consolidation of GPGPU apps, including GPU eaters. GLoop offers an event-driven programming model, which allows GLoop-based apps to inherit the GPU eaters' high functionality while being proportionally scheduled on a shared GPU in an isolated manner. We implemented a prototype of GLoop and ported eight GPU eaters to it. The experimental results demonstrate that our prototype successfully schedules the consolidated GPGPU apps according to its scheduling policy and isolates resources among them.

  • Multi-Peak Estimation for Real-Time 3D Ping-Pong Ball Tracking with Double-Queue Based GPU Acceleration

    Ziwei DENG  Yilin HOU  Xina CHENG  Takeshi IKENAGA  

     
    PAPER-Machine Vision and its Applications

    Publicized: 2018/02/16 | Vol: E101-D No:5 | Page(s): 1251-1259

    3D ball tracking is of great significance in ping-pong game analysis and can be utilized in applications such as TV content production and tactical analysis, some of which require real-time implementation. This paper proposes a CPU-GPU platform based particle filter for multi-view ball tracking comprising four proposals. In the algorithm design, we propose multi-peak estimation and a ball-like observation model. Multi-peak estimation obtains a precise ball position when the particles' likelihood distribution has multiple peaks under complex circumstances. The ball-like observation model, with four different likelihood evaluations, utilizes the ball's unique features to evaluate each particle's similarity to the target. In the GPU implementation, we propose a double-queue structure and vectorized data combination. The double-queue structure achieves task parallelism between data-independent tasks. The vectorized data combination reduces memory access time by combining three different image buffers into one vector buffer. The experiments are based on ping-pong videos recorded at an official match by 4 cameras located at the four corners of the court. The tracking success rate reaches 99.59% on the CPU. With GPU acceleration, the processing time is 8.8 ms/frame, a 98-fold speed-up over the CPU version.
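
    The vectorized data combination can be sketched in CUDA as packing three per-pixel buffers into one 4-wide vector, so later kernels issue a single 128-bit load per pixel instead of three scattered 32-bit loads. Field meanings and weights below are illustrative assumptions, not the paper's code:

        __global__ void pack3(const float *gray, const float *motion,
                              const float *color, float4 *packed, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            packed[i] = make_float4(gray[i], motion[i], color[i], 0.0f);
        }

        __global__ void likelihood(const float4 *packed, float *out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            float4 v = packed[i];  // one coalesced vector load
            out[i] = 0.5f * v.x + 0.3f * v.y + 0.2f * v.z;  // toy weighting
        }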

  • GPU-Accelerated Stochastic Simulation of Biochemical Networks

    Pilsung KANG  

     
    LETTER-Fundamentals of Information Systems

    Publicized: 2017/12/20 | Vol: E101-D No:3 | Page(s): 786-790

    We present a GPU (graphics processing unit) accelerated implementation of a stochastic algorithm for simulating biochemical reaction networks on the latest NVidia architecture. To effectively utilize the massive parallelism offered by the NVidia Pascal hardware, we apply a set of performance tuning methods and guidelines, such as exploiting the architecture's memory hierarchy, in our algorithm implementation. Based on our experimental results and a comparative analysis against CPU-based implementations, we report our initial experiences with the performance of modern GPUs in the context of scientific computing.
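
    One memory-hierarchy tuning of the kind such guidelines cover is keeping small read-only model parameters in constant memory, which is cached and broadcast to a whole warp when all lanes read the same address. A minimal sketch under that assumption (toy model, not the paper's code):

        __constant__ float c_rates[4];  // rate constants of a toy 4-reaction model

        // One thread per realization; every thread reads c_rates[r] together,
        // so the constant cache broadcasts each value to the whole warp.
        __global__ void total_propensity(const int *pop, float *a0, int n_real) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            if (tid >= n_real) return;
            float sum = 0.0f;
            for (int r = 0; r < 4; ++r)
                sum += c_rates[r] * pop[tid * 4 + r];  // uniform c_rates read
            a0[tid] = sum;
        }

        // Host side: cudaMemcpyToSymbol(c_rates, h_rates, sizeof(h_rates));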

  • A GPU-Based Rasterization Algorithm for Boolean Operations on Polygons

    Yi GAO  Jianxin LUO  Hangping QIU  Bin TANG  Bo WU  Weiwei DUAN  

     
    LETTER-Fundamentals of Information Systems

    Publicized: 2017/09/29 | Vol: E101-D No:1 | Page(s): 234-238

    This paper presents a new GPU-based rasterization algorithm for Boolean operations that handles arbitrary closed polygons. We construct an efficient data structure for CPU-GPU interoperation and propose a fast GPU-based contour extraction method to ensure the performance of our algorithm. We then design a novel traversal strategy to achieve error-free intersection-point calculation for correct Boolean operations. Finally, we give a detailed evaluation; the results show that our algorithm outperforms existing algorithms on polygons with a large number of vertices.

  • Virtualizing Graphics Architecture of Android Mobile Platforms in KVM/ARM Environment

    Sejin PARK  Byungsu PARK  Unsung LEE  Chanik PARK  

     
    PAPER-Software System

    Publicized: 2017/04/18 | Vol: E100-D No:7 | Page(s): 1403-1415

    With the availability of virtualization extensions in mobile processors, e.g., the ARM Cortex-A15, multiple virtual execution domains can be efficiently supported on a mobile platform. Each execution domain requires high-performance graphics services for full-featured user interfaces such as smooth scrolling, background image blurring, and 3D images. However, the graphics service is hard to virtualize because multiple service components (e.g., ION and Fence) are involved. Moreover, the complexity of the Graphics Processing Unit (GPU) device driver makes virtualizing the graphics service even harder. In this paper, we propose a technique to virtualize the graphics architecture of the Android mobile platform in a KVM/ARM environment. The Android graphics architecture relies on underlying Linux kernel services such as the frame buffer memory allocator ION, the buffer synchronization service Fence, the GPU device driver, and the display synchronization service VSync. These kernel services are provided as device files in the Linux kernel. Our approach is to para-virtualize these device files based on a split device driver model. A major challenge is translating guest-view information into host-view information, e.g., memory address translation, file descriptor management, and GPU Memory Management Unit (MMU) manipulation. The experimental results show that the proposed graphics virtualization technique achieves 84%-100% of native application performance.

  • A Novel Procedure for Implementing a Turbo Decoder on a GPU with Coalesced Memory Access

    Heungseop AHN  Seungwon CHOI  

     
    PAPER-Communication Theory and Signals

    Vol: E100-A No:5 | Page(s): 1188-1196

    The sub-blocking algorithm is known as a core component in implementing a turbo decoder on a Graphics Processing Unit (GPU), as it allows as many GPU cores as possible to be used for parallel processing. However, even though the sub-blocking algorithm allows a large number of threads in a given GPU to process a large number of sub-blocks in parallel, each thread must access global memory with strided addresses, which results in uncoalesced memory access. Because uncoalesced memory access causes many unnecessary memory transactions, the memory bandwidth efficiency drops significantly, possibly to as low as 1/8 in the case of a Long Term Evolution (LTE) turbo decoder, depending on the compute capability of the GPU. In this paper, we present a novel method for converting uncoalesced memory access into coalesced access in a way that completely recovers the memory bandwidth efficiency to 100% without additional overhead. Our experimental tests, performed with NVIDIA's GeForce GTX 780 Ti GPU, show that the proposed method can enhance throughput by nearly 30% compared with a conventional turbo decoder that suffers from uncoalesced memory access. The throughput of the proposed method was observed to be 51.4 Mbps when the numbers of iterations and sub-blocks were set to 6 and 32, respectively, which far exceeds the performance of previous works implementing the Max-Log-MAP algorithm.
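
    The access-pattern problem can be made concrete with a pair of CUDA kernels (illustrative stand-ins, not the paper's decoder). With one thread per sub-block, the natural layout makes thread t read llr[t * len + i], so a warp's 32 loads land in 32 different memory segments; storing the same data transposed turns each step into 32 consecutive words. The paper's contribution is recovering such coalescing inside the turbo decoder without additional overhead.

        __global__ void sum_strided(const float *llr, float *out,
                                    int n_sub, int len) {
            int t = blockIdx.x * blockDim.x + threadIdx.x;
            if (t >= n_sub) return;
            float acc = 0.0f;
            for (int i = 0; i < len; ++i)
                acc += llr[t * len + i];      // stride = len: uncoalesced
            out[t] = acc;
        }

        __global__ void sum_coalesced(const float *llr_T, float *out,
                                      int n_sub, int len) {
            int t = blockIdx.x * blockDim.x + threadIdx.x;
            if (t >= n_sub) return;
            float acc = 0.0f;
            for (int i = 0; i < len; ++i)
                acc += llr_T[i * n_sub + t];  // warp reads consecutive words
            out[t] = acc;
        }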

  • Cache-Aware, In-Place Rotation Method for Texture-Based Volume Rendering

    Yuji MISAKI  Fumihiko INO  Kenichi HAGIHARA  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2016/12/12 | Vol: E100-D No:3 | Page(s): 452-461

    We propose a cache-aware method to accelerate texture-based volume rendering on a graphics processing unit (GPU) compatible with the compute unified device architecture. The proposed method extends a previous method such that it maximizes the average rendering performance while the viewing direction rotates around a volume. To realize this, the proposed method performs in-place rotation of the volume data, which rearranges the order of voxels so that consecutive threads (warps) refer to voxels with the minimum access strides. Experiments indicate that the proposed method replaces the worst texture cache (TC) hit rate of 42% with the best TC hit rate of 93% for a 1024³-voxel volume. Consequently, the average frame rate increases by a factor of 1.6 compared with the previous method. Although the overhead of in-place rotation slightly decreases the frame rate from 2.0 frames per second (fps) to 1.9 fps, this slowdown occurs only for a few viewing directions.
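
    The effect of the rotation can be sketched as an axis permutation: if consecutive threads sample consecutive x-voxels, loads are unit-stride, but once the viewing direction makes them march along z, the stride grows to W*H and the TC hit rate collapses. A simple out-of-place permutation is shown below for clarity (the paper's method performs the reordering in place, and this permutation kernel is itself left unoptimized):

        // src is x-fastest (x + W*(y + H*z)); dst is z-fastest, so threads
        // stepping along z afterwards access contiguous memory.
        __global__ void permute_xyz_to_zyx(const float *src, float *dst,
                                           int W, int H, int D) {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            int z = blockIdx.z;
            if (x >= W || y >= H || z >= D) return;
            dst[(x * H + y) * D + z] = src[(z * H + y) * W + x];
        }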

  • An Efficient Soft Shadow Mapping for Area Lights in Various Shapes and Colors

    Youngjae CHUN  Kyoungsu OH  

     
    LETTER-Computer Graphics

    Publicized: 2016/11/11 | Vol: E100-D No:2 | Page(s): 396-400

    Shadows are an important effect that makes virtual 3D scenes more realistic. In this paper, we propose a fast and correct soft shadow generation method for area lights of various shapes and colors. To conduct efficient and accurate visibility tests, we exploit the complexity of the shadow and the color of the area light.

  • Geometry Clipmaps Terrain Rendering Using Hardware Tessellation

    Ge SONG  Hongyu YANG  Yulong JI  

     
    LETTER-Computer Graphics

    Publicized: 2016/11/09 | Vol: E100-D No:2 | Page(s): 401-404

    To address the heavy rendering load and unstable frame rate when rendering large terrains, this paper proposes a geometry-clipmaps-based algorithm. Triangle meshes are generated from a few tessellation control points in the GPU tessellation shader. 'Cracks' caused by differing resolutions between adjacent levels are eliminated by modifying the outer tessellation level factor of the edges shared between levels. Experimental results show that the algorithm improves rendering efficiency and frame-rate stability in terrain navigation.

  • GPU-Accelerated Bulk Execution of Multiple-Length Multiplication with Warp-Synchronous Programming Technique

    Takumi HONDA  Yasuaki ITO  Koji NAKANO  

     
    PAPER-GPU computing

    Publicized: 2016/08/24 | Vol: E99-D No:12 | Page(s): 3004-3012

    In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique: we assign each multiple-length multiplication to one warp, which consists of 32 threads. In parallel processing using multiple threads, it is usually costly to synchronize thread execution and to communicate between threads. With the warp-synchronous programming technique, however, the execution of threads in a warp can be synchronized instruction by instruction, without any barrier synchronization operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on an NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication as a sub-routine for larger bit sizes; the GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
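
    The flavor of warp-synchronous arithmetic can be sketched in CUDA with a warp multiplying a 1024-bit integer (one 32-bit limb per lane) by a 32-bit scalar: __shfl_up_sync hands each lane its neighbor's high product word, and no shared memory or barrier is needed. This is a hedged sketch of the style, not the paper's algorithm; the final carry out of lane 31 is dropped for brevity.

        #include <cstdint>

        __device__ uint32_t warp_mul_small(uint32_t limb, uint32_t b) {
            const unsigned FULL = 0xffffffffu;
            int lane = threadIdx.x & 31;
            uint64_t p = (uint64_t)limb * b;
            uint32_t lo = (uint32_t)p;
            uint32_t hi = (uint32_t)(p >> 32);
            uint32_t hi_prev = __shfl_up_sync(FULL, hi, 1);  // from lane-1
            if (lane == 0) hi_prev = 0;
            uint32_t sum = lo + hi_prev;
            uint32_t carry = (sum < lo) ? 1u : 0u;     // did the addition wrap?
            while (__any_sync(FULL, carry)) {          // ripple carries up lanes
                uint32_t c = __shfl_up_sync(FULL, carry, 1);
                if (lane == 0) c = 0;
                uint32_t ns = sum + c;
                carry = (ns < sum) ? 1u : 0u;
                sum = ns;
            }
            return sum;  // lane i now holds limb i of the product
        }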

  • Cache-Aware GPU Optimization for Out-of-Core Cone Beam CT Reconstruction of High-Resolution Volumes

    Yuechao LU  Fumihiko INO  Kenichi HAGIHARA  

     
    PAPER-Computer System

    Publicized: 2016/09/05 | Vol: E99-D No:12 | Page(s): 3060-3071

    This paper proposes a cache-aware optimization method to accelerate out-of-core cone beam computed tomography reconstruction on a graphics processing unit (GPU). Our proposed method extends a previous method by increasing the cache hit rate so as to speed up the reconstruction of high-resolution volumes that exceed the capacity of device memory. More specifically, our approach accelerates the well-known Feldkamp-Davis-Kress algorithm by utilizing three strategies: (1) a loop organization strategy that identifies the best tradeoff point between the cache hit rate and the number of off-chip memory accesses; (2) a data structure that exploits the high locality within a layered texture; and (3) a fully pipelined strategy for hiding file input/output (I/O) time behind GPU execution and data transfer times. We implement our proposed method on NVIDIA's Maxwell architecture and provide tuning guidelines for adjusting the execution parameters, which include the granularity and shape of thread blocks as well as the granularity of the I/O data streamed through the pipeline, so as to maximize reconstruction performance. Our experimental results show that it took less than three minutes to reconstruct a 2048³-voxel volume from 1200 2048²-pixel projection images on a single GPU, a speedup of approximately 1.47 over the previous method.
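
    Strategy (3) can be sketched with double buffering: while the GPU works on chunk i, the host reads chunk i+1 from disk, and transfers ride a stream so they overlap kernel execution. process_chunk and the buffer layout below are hypothetical stand-ins for the FDK back-projection step:

        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void process_chunk(const float *d, int n) {
            // ... back-projection work on one projection chunk ...
        }

        void pipeline(FILE *f, int n_chunks, int chunk_elems) {
            float *h[2], *d[2];
            cudaStream_t s;
            cudaStreamCreate(&s);
            for (int b = 0; b < 2; ++b) {
                cudaMallocHost(&h[b], chunk_elems * sizeof(float)); // pinned
                cudaMalloc(&d[b], chunk_elems * sizeof(float));
            }
            fread(h[0], sizeof(float), chunk_elems, f);  // prime first chunk
            for (int i = 0; i < n_chunks; ++i) {
                int b = i & 1;
                cudaMemcpyAsync(d[b], h[b], chunk_elems * sizeof(float),
                                cudaMemcpyHostToDevice, s);
                process_chunk<<<256, 256, 0, s>>>(d[b], chunk_elems);
                if (i + 1 < n_chunks)            // disk read overlaps GPU work
                    fread(h[(i + 1) & 1], sizeof(float), chunk_elems, f);
                cudaStreamSynchronize(s);        // buffer b reusable next round
            }
            for (int b = 0; b < 2; ++b) { cudaFreeHost(h[b]); cudaFree(d[b]); }
            cudaStreamDestroy(s);
        }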

  • A Memory-Access-Efficient Implementation for Computing the Approximate String Matching Algorithm on GPUs

    Lucas Saad Nogueira NUNES  Jacir Luiz BORDIM  Yasuaki ITO  Koji NAKANO  

     
    PAPER-GPU computing

    Publicized: 2016/08/24 | Vol: E99-D No:12 | Page(s): 2995-3003

    The closeness of a match is an important measure with a number of practical applications, including computational biology, signal processing, and text retrieval. The approximate string matching (ASM) problem asks to find a substring of string Y of length n that is most similar to string X of length m. It is well known that the ASM problem can be solved by a dynamic programming technique that computes a table of size m×n. The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU. The proposed GPU implementation relies on warp shuffle instructions, which are used to accelerate the communication between threads without resorting to shared memory access. Despite the fact that O(mn) memory access operations are necessary to access all elements of a table of size n×m, the proposed implementation performs only $O(\frac{mn}{w})$ memory access operations, where w is the warp size. Experimental results carried out on a GeForce GTX 980 GPU show that the proposed implementation, called w-SCAN, provides a more than two-fold speed-up in computing the ASM compared to another prominent alternative.
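
    The warp-shuffle DP style can be sketched as follows: lane i owns row i of the DP table for pattern X, anti-diagonals advance one cell per step, and __shfl_up_sync passes the upper and upper-left cells down from lane i-1 with no shared memory. This hedged sketch caps the pattern at the warp size; the paper's w-SCAN handles general table sizes.

        // Returns the smallest edit distance between X (m <= 32) and any
        // substring of Y (semi-global matching: the top DP row is all zeros).
        __device__ int warp_asm(const char *X, int m, const char *Y, int n) {
            const unsigned FULL = 0xffffffffu;
            int lane = threadIdx.x & 31;
            char pc = (lane < m) ? X[lane] : 0;
            int cur = lane + 1;            // left boundary D[i+1][0] = i+1
            int diag = lane;               // D[i][0] = i
            int best = n + m;              // running min over the last row
            for (int t = 0; t < n + m - 1; ++t) {
                int up = __shfl_up_sync(FULL, cur, 1);  // D[i][j] from lane-1
                if (lane == 0) up = 0;                  // D[0][j] = 0
                int j = t - lane;          // text position this lane handles
                if (j >= 0 && j < n) {
                    cur = min(min(up, cur) + 1, diag + (pc != Y[j]));
                    if (lane == m - 1) best = min(best, cur);
                }
                diag = up;                 // this step's up is next step's diag
            }
            return best;
        }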

  • Fully Parallelized LZW Decompression for CUDA-Enabled GPUs

    Shunji FUNASAKA  Koji NAKANO  Yasuaki ITO  

     
    PAPER-GPU computing

    Publicized: 2016/08/25 | Vol: E99-D No:12 | Page(s): 2986-2994

    The main contribution of this paper is to present a work-optimal parallel algorithm for LZW decompression and to implement it on a CUDA-enabled GPU. Since sequential LZW decompression creates a dictionary table by reading the codes in a compressed file one by one, it is not easy to parallelize. We first present a work-optimal parallel LZW decompression algorithm on the CREW-PRAM (Concurrent-Read Exclusive-Write Parallel Random Access Machine), a standard theoretical parallel computing model with shared memory. We then present an efficient implementation of this parallel algorithm on a GPU. The experimental results show that our GPU implementation performs LZW decompression in 1.15 milliseconds for a grayscale TIFF image of 4096×3072 pixels stored in the global memory of a GeForce GTX 980. In contrast, sequential LZW decompression of the same image stored in the main memory of an Intel Core i7 CPU takes 50.1 milliseconds. Thus, our parallel LZW decompression working on the GPU's global memory is 43.6 times faster than sequential LZW decompression on the CPU's main memory for this image. To show the applicability of our GPU implementation of LZW decompression, we evaluated SSD-to-GPU data loading time for three scenarios. The experimental results show that the scenario using our LZW decompression on the GPU is faster than the others.
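
    A taste of why the dictionary can be built in parallel: every LZW entry >= 256 extends a parent code, and per-entry properties such as the first character are found by chasing parent links, which pointer jumping resolves in a logarithmic number of rounds. The sketch below illustrates that flavor only; the paper's work-optimal algorithm is more involved.

        // anc_in[i]: current known ancestor code of dictionary entry 256+i
        // (initialized to the entry's parent code). Codes < 256 are literals.
        __global__ void jump_once(const int *anc_in, int *anc_out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            int a = anc_in[i];
            anc_out[i] = (a >= 256) ? anc_in[a - 256] : a;  // hop toward literal
        }

        // Host: launch jump_once ceil(log2(n)) times, swapping the buffers
        // between launches; every entry then holds its first character.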

  • Performance Optimization of Light-Field Applications on GPU

    Yuttakon YUTTAKONKIT  Shinya TAKAMAEDA-YAMAZAKI  Yasuhiko NAKASHIMA  

     
    PAPER-Computer System

    Publicized: 2016/08/24 | Vol: E99-D No:12 | Page(s): 3072-3081

    Light-field image processing has been widely employed in many areas, from mobile devices to manufacturing applications. The fundamental process of extracting usable information requires significant computation on high-resolution raw image data. A graphics processing unit (GPU) is used to exploit the data parallelism, as in general image processing applications. However, the sparse memory access patterns of these applications reduce GPU performance for both architectural and algorithmic reasons. We therefore propose an optimization technique that redesigns the memory access pattern of the applications, alleviating the memory bottleneck of the rendering application and increasing data reusability in the depth extraction application. We evaluated our optimized implementations against state-of-the-art implementations on several GPUs, with all implementations optimally configured for each specific device. Our proposed optimization increased the performance of the rendering application on a GTX-780 GPU by 30%, and of the depth extraction application on GTX-780 and GTX-980 GPUs by 82% and 18%, respectively, compared with the original implementations.

  • Accuracy Assessment of FDTD Method for the Analysis of Sub-Wavelength Photonic Structures

    Yasuo OHTERA  

     
    PAPER

    Vol: E99-C No:7 | Page(s): 780-787

    The FDTD (Finite-Difference Time-Domain) method has been widely used for the analysis of photonic devices consisting of sub-wavelength structures. In recent years, increasing efforts have been made to implement FDTD on GPGPUs (General-Purpose Graphics Processing Units) to shorten simulation time. On the other hand, it is widely recognized that most middle- and low-end GPGPUs differ considerably in computational performance between single-precision and double-precision arithmetic. Therefore, the choice of single or double precision for the electromagnetic field variables in FDTD becomes a key issue for overall simulation performance. In this study, we investigated the difference in results between single-precision and double-precision arithmetic. As a most fundamental sub-wavelength photonic structure, we focused on an alternating multilayer (a one-dimensional periodic structure). The obtained results indicate that a significant difference appears in the amplitudes of higher-order spatial harmonic waves.
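
    The precision comparison at the heart of the study can be reproduced with a single templated update kernel instantiated for float and for double. A minimal 1D free-space Yee update is sketched below under stated assumptions (arrays of length nx, simple fixed ends); the paper analyzes 2D/3D sub-wavelength structures:

        // E update: the spatial difference of H drives E (ce = dt/(eps*dx)).
        template <typename Real>
        __global__ void update_e(Real *ez, const Real *hy, Real ce, int nx) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i <= 0 || i >= nx) return;
            ez[i] += ce * (hy[i] - hy[i - 1]);
        }

        // H update: the spatial difference of E drives H (ch = dt/(mu*dx)).
        template <typename Real>
        __global__ void update_h(const Real *ez, Real *hy, Real ch, int nx) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= nx - 1) return;
            hy[i] += ch * (ez[i + 1] - ez[i]);
        }

        // Run update_e<float>/update_h<float> and the <double> instantiations
        // on identical inputs, then compare harmonic amplitudes step by step.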
