
Keyword Search Result

[Keyword] GPGPU (17 hits)

Hits 1-17
  • GPGPU Implementation of Variational Bayesian Gaussian Mixture Models

    Hiroki NISHIMOTO  Renyuan ZHANG  Yasuhiko NAKASHIMA  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2021/11/24
    Vol: E105-D No:3
    Page(s): 611-622

    In this work, an efficient implementation strategy for accelerating high-quality clustering algorithms is developed on the basis of general-purpose graphics processing units (GPGPUs). Among various clustering algorithms, a Gaussian mixture model (GMM) whose parameters are estimated through the variational Bayesian (VB) mechanism is adopted because of its superior clustering performance. Since the VB-GMM methodology is computationally demanding, the GPGPU is employed to carry out the massive matrix computations. To efficiently migrate the conventional CPU-oriented VB-GMM scheme onto GPGPU platforms, an entire migration flow with thirteen stages is presented in detail. A CPU-GPGPU cooperation scheme, execution reordering, and memory access optimizations are proposed to improve GPGPU utilization and maximize the clustering speed. Five types of real-world applications along with relevant data sets are introduced for cross-validation. The experimental results verify the feasibility and practical benefit of implementing the VB-GMM algorithm on a GPGPU: the proposed migration achieves a speedup of up to 192x and, furthermore, correctly identifies the proper number of clusters, which the conventional EM algorithm can hardly do.
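
    As an illustration of the kind of data-parallel computation involved, the sketch below (CUDA C++) evaluates GMM responsibilities for diagonal-covariance components with one thread per data point; it is not the paper's thirteen-stage migration flow, and the kernel name, data layout, and parameters are hypothetical.

        // Hedged sketch: per-point GMM responsibility evaluation (E-step-like).
        // One thread handles one data point; K components, D dimensions,
        // diagonal covariances; log-sum-exp is used for numerical stability.
        #include <cfloat>

        __global__ void responsibilities(const float *x,      // [N*D] data points
                                         const float *mu,     // [K*D] component means
                                         const float *var,    // [K*D] diagonal variances
                                         const float *logpi,  // [K]   log mixing weights
                                         float *r,            // [N*K] output responsibilities
                                         int N, int D, int K)
        {
            int n = blockIdx.x * blockDim.x + threadIdx.x;
            if (n >= N) return;

            float maxlog = -FLT_MAX;
            for (int k = 0; k < K; ++k) {
                float lp = logpi[k];
                for (int d = 0; d < D; ++d) {                  // log N(x_n | mu_k, var_k)
                    float diff = x[n * D + d] - mu[k * D + d];
                    lp -= 0.5f * (logf(2.0f * 3.14159265f * var[k * D + d])
                                  + diff * diff / var[k * D + d]);
                }
                r[n * K + k] = lp;                             // keep log-numerators
                maxlog = fmaxf(maxlog, lp);
            }
            float sum = 0.0f;                                  // normalize via log-sum-exp
            for (int k = 0; k < K; ++k)
                sum += expf(r[n * K + k] - maxlog);
            for (int k = 0; k < K; ++k)
                r[n * K + k] = expf(r[n * K + k] - maxlog) / sum;
        }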

  • Instruction Prefetch for Improving GPGPU Performance

    Jianli CAO  Zhikui CHEN  Yuxin WANG  He GUO  Pengcheng WANG  

     
    PAPER-VLSI Design Technology and CAD

    Publicized: 2020/11/16
    Vol: E104-A No:5
    Page(s): 773-785

    Like many processors, GPGPUs suffer from the memory wall. The traditional solutions to this issue are to use efficient schedulers to hide long memory access latency or to use a data prefetch mechanism to reduce the latency caused by data transfer. In this paper, we study the instruction fetch stage of the GPU pipeline and analyze the relationship between the capacity of a GPU kernel and its instruction miss rate. We improve the next-line prefetch mechanism to fit the SIMT model of the GPU and determine the optimal parameters of the prefetch mechanism on the GPU through experiments. The experimental results show that the prefetch mechanism achieves a 12.17% performance improvement on average. Compared with the alternative of enlarging the I-cache, the prefetch mechanism benefits more kernels at a lower hardware cost.

  • A Rabin-Karp Implementation for Handling Multiple Pattern-Matching on the GPU

    Lucas Saad Nogueira NUNES  Jacir Luiz BORDIM  Yasuaki ITO  Koji NAKANO  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2020/09/24
    Vol: E103-D No:12
    Page(s): 2412-2420

    The volume of digital information is growing at an extremely fast pace, which, in turn, exacerbates the need for efficient mechanisms to find the presence of a pattern in an input text or a set of input strings. Combining the processing power of the Graphics Processing Unit (GPU) with matching algorithms is a natural way to speed up the string-matching process. This work proposes a Parallel Rabin-Karp implementation (PRK) that incorporates a fast parallel prefix-sums algorithm to maximize parallelization and accelerate the matching verification. Given an input text T of length n and p patterns of length m, the proposed implementation finds all occurrences of the patterns in T in O(m+q+n/τ+nm/q) time, where q is a sufficiently large prime number and τ is the available number of threads. Sequential and parallel versions of the PRK have been implemented. Experiments have been executed on p≥1 patterns of length m ∈ {10, 20, 30} characters, matched against a text string of length n=2^27. The results show that the parallel implementation of the PRK algorithm on an NVIDIA V100 GPU provides a speedup surpassing 372 times over the sequential implementation and a speedup of 12.59 times over an OpenMP implementation running on a multi-core server with 128 threads. Compared to another prominent GPU implementation, the PRK implementation attains a speedup surpassing 37 times.
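
    The core matching step, comparing window hashes modulo a large prime q and verifying only on a hash hit, can be sketched as a CUDA kernel with one thread per candidate position. This simplified fragment recomputes each window hash directly instead of using the paper's prefix-sums construction; the names and constants are illustrative.

        // Hedged sketch: one thread per candidate position i of text T (length n).
        // Each thread hashes T[i..i+m-1] mod Q and verifies the match on a hash hit.
        // match[] is assumed to be zero-initialized by the host.
        __global__ void rabin_karp(const unsigned char *text, int n,
                                   const unsigned char *pattern, int m,
                                   unsigned long long patHash,
                                   unsigned long long B, unsigned long long Q,
                                   int *match)   // match[i] = 1 if the pattern occurs at i
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i + m > n) return;

            unsigned long long h = 0;
            for (int k = 0; k < m; ++k)           // window hash (no rolling reuse here)
                h = (h * B + text[i + k]) % Q;

            if (h == patHash) {                    // verify to rule out hash collisions
                int k = 0;
                while (k < m && text[i + k] == pattern[k]) ++k;
                match[i] = (k == m) ? 1 : 0;
            }
        }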

  • Accelerating Large-Scale Interconnection Network Simulation by Cellular Automata Concept

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Computer System

    Publicized: 2018/10/05
    Vol: E102-D No:1
    Page(s): 52-74

    State-of-the-art parallel systems employ a huge number of computing nodes that are connected by an interconnection network (ICN). The ICN plays an important role in a parallel system, since it determines the system's communication capability. In general, an ICN exhibits non-linear phenomena in its communication performance, most of which are caused by congestion. Designing a large-scale parallel system therefore requires thorough exploration through repetitive simulation runs, which raises the further problem of simulating large-scale systems at a reasonable cost. This paper presents a promising solution based on the cellular automata concept, which originated in our prior work. Assuming 2D-torus topologies to simplify the discussion, this paper covers the fundamental design of router functions in terms of cellular automata, the data structure of packets, an alternative model of the router function, and miscellaneous optimizations. The proposed models have a good affinity with GPGPU technology and, as representative speed-up results, the GPU-based simulator accelerates simulation by up to about 1264 times over sequential execution on a single CPU. Furthermore, since the proposed models are also applicable in the shared-memory model, a multithreaded implementation of the proposed methods achieves speed-ups of up to about 162 times.
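
    The cellular-automaton view updates every cell from its own state and its four torus neighbours in one synchronous step, which maps naturally to one GPU thread per cell. The fragment below shows only the torus-wrapped neighbour access pattern with a hypothetical relaxation rule; it is not the router model defined in the paper.

        // Hedged sketch: one synchronous CA step on a W x H 2D torus, one thread per cell.
        // next[x][y] is computed from cur[x][y] and its four wrap-around neighbours.
        __global__ void ca_step(const float *cur, float *next, int W, int H)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= W || y >= H) return;

            int xm = (x + W - 1) % W, xp = (x + 1) % W;   // torus wrap in x
            int ym = (y + H - 1) % H, yp = (y + 1) % H;   // torus wrap in y

            float self  = cur[y  * W + x];
            float nbsum = cur[y  * W + xm] + cur[y  * W + xp]
                        + cur[ym * W + x ] + cur[yp * W + x ];

            // Hypothetical local rule: relax toward the neighbour average.
            next[y * W + x] = 0.5f * self + 0.125f * nbsum;
        }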

  • Cooperative GPGPU Scheduling for Consolidating Server Workloads

    Yusuke SUZUKI  Hiroshi YAMADA  Shinpei KATO  Kenji KONO  

     
    PAPER-Software System

    Publicized: 2018/08/30
    Vol: E101-D No:12
    Page(s): 3019-3037

    Graphics processing units (GPUs) have become an attractive platform for general-purpose computing (GPGPU) in various domains. Making GPUs a time-multiplexed resource is key to consolidating GPGPU applications (apps) in multi-tenant cloud platforms. However, advanced GPGPU apps pose a new challenge for consolidation. Such highly functional GPGPU apps, referred to as GPU eaters, can easily monopolize a shared GPU and starve collocated GPGPU apps. This paper presents GLoop, a software runtime that enables consolidation of GPGPU apps, including GPU eaters. GLoop offers an event-driven programming model, which allows GLoop-based apps to inherit the GPU eaters' high functionality while being proportionally scheduled on a shared GPU in an isolated manner. We implemented a prototype of GLoop and ported eight GPU eaters to it. The experimental results demonstrate that our prototype successfully schedules the consolidated GPGPU apps according to its scheduling policy and isolates resources among them.

  • GPU-Accelerated Bulk Execution of Multiple-Length Multiplication with Warp-Synchronous Programming Technique

    Takumi HONDA  Yasuaki ITO  Koji NAKANO  

     
    PAPER-GPU computing

    Publicized: 2016/08/24
    Vol: E99-D No:12
    Page(s): 3004-3012

    In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique. We assign each multiple-length multiplication to one warp, which consists of 32 threads. In parallel processing using multiple threads, it is usually costly to synchronize the threads' execution and to communicate between them. With the warp-synchronous programming technique, however, execution of the threads in a warp can be synchronized instruction by instruction, without any barrier synchronization operations. Also, inter-thread communication can be performed with warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on an NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication as a subroutine for larger bit lengths; the GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
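
    A minimal illustration of the warp-shuffle communication this relies on is the standard warp-wide reduction below: the 32 threads of a warp exchange partial values with __shfl_down_sync while staying in lockstep, with no barrier and no shared memory. It shows only the communication mechanism, not the multiple-length multiplication algorithm itself.

        // Hedged sketch: warp-synchronous sum of 32 values per warp using warp shuffles.
        // No __syncthreads() and no shared memory are needed within a warp.
        // Assumes blockDim.x is a multiple of 32 and the grid exactly covers the input.
        __global__ void warp_sum(const int *in, int *out)
        {
            int tid  = blockIdx.x * blockDim.x + threadIdx.x;
            int lane = threadIdx.x & 31;                    // lane id within the warp
            int val  = in[tid];

            // Tree reduction: each step folds the value of a higher lane into a lower one.
            for (int offset = 16; offset > 0; offset >>= 1)
                val += __shfl_down_sync(0xffffffffu, val, offset);

            if (lane == 0)                                  // lane 0 now holds the warp's sum
                out[tid >> 5] = val;
        }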

  • Accuracy Assessment of FDTD Method for the Analysis of Sub-Wavelength Photonic Structures

    Yasuo OHTERA  

     
    PAPER

    Vol: E99-C No:7
    Page(s): 780-787

    The FDTD (Finite-Difference Time-Domain) method has been widely used for the analysis of photonic devices consisting of sub-wavelength structures. In recent years, increasing efforts have been made to implement FDTD on GPGPUs (General-Purpose Graphics Processing Units) to shorten simulation time. On the other hand, it is widely recognized that most middle- and low-end GPGPUs show a large gap in computational performance between single-precision and double-precision arithmetic. The choice of single or double precision for the electromagnetic field variables in FDTD therefore becomes a key issue for the overall simulation performance. In this study we investigated how the results differ between single-precision and double-precision computation. As one of the most fundamental sub-wavelength photonic structures, we focused on an alternating multilayer (a one-dimensional periodic structure). The obtained results indicate that a significant difference appears in the amplitudes of the higher-order spatial harmonic waves.
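
    A compact way to probe the single- versus double-precision question is to template the field update on the floating-point type and compare the two runs. The sketch below is a host-side 1D FDTD toy (arbitrary grid, Courant number, and source), intended only to show where the precision choice enters the update equations; it is unrelated to the multilayer structure analyzed in the paper.

        // Hedged sketch: 1D FDTD (Ez/Hy) update templated on the real type, so the
        // same code can be run in float and double to compare the resulting fields.
        #include <cstdio>
        #include <vector>
        #include <cmath>

        template <typename Real>
        Real fdtd_1d(int cells, int steps)
        {
            std::vector<Real> ez(cells, Real(0)), hy(cells, Real(0));
            const Real c = Real(0.5);                 // Courant number (<= 1 in 1D)
            for (int t = 0; t < steps; ++t) {
                for (int i = 0; i < cells - 1; ++i)   // magnetic-field update
                    hy[i] += c * (ez[i + 1] - ez[i]);
                for (int i = 1; i < cells; ++i)       // electric-field update
                    ez[i] += c * (hy[i] - hy[i - 1]);
                ez[cells / 2] += std::exp(-Real(0.01) * (t - 30) * (t - 30)); // soft source
            }
            return ez[cells / 4];                     // probe one field value
        }

        int main()
        {
            double d = fdtd_1d<double>(400, 1000);
            float  f = fdtd_1d<float >(400, 1000);
            std::printf("double: %.12g\nfloat : %.12g\ndiff  : %.3g\n",
                        d, (double)f, d - (double)f);
            return 0;
        }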

  • 3D Objects Tracking by MapReduce GPGPU-Enhanced Particle Filter

    Jieyun ZHOU  Xiaofeng LI  Haitao CHEN  Rutong CHEN  Masayuki NUMAO  

     
    PAPER

    Publicized: 2015/01/21
    Vol: E98-D No:5
    Page(s): 1035-1044

    Object tracking methods have been widely used in the fields of video surveillance, motion monitoring, robotics, and so on. The particle filter is one of the most promising methods, but it is difficult to apply to real-time object tracking because of its high computation cost. In order to reduce the processing cost without sacrificing tracking quality, this paper proposes a new method for real-time 3D object tracking that parallelizes the particle filter algorithm with a MapReduce architecture running on a GPGPU. Our method is as follows. First, we use a Kinect to obtain 3D information about the objects. Unlike conventional 2D object tracking, 3D object tracking adds depth information: it can track not only along the x and y axes but also along the z axis, and the depth information can correct some errors of 2D object tracking. Second, to solve the high-computation-cost problem, we use the MapReduce architecture on the GPGPU to parallelize the particle filter algorithm. We implement the particle filter algorithms on the GPU and evaluate the performance by running a program on CUDA 5.5.

  • A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation

    Yasuaki ITO  Koji NAKANO  

     
    PAPER

    Vol: E96-D No:12
    Page(s): 2596-2603

    This paper presents a GPU (Graphics Processing Unit) implementation of dynamic programming for the optimal polygon triangulation. GPUs can now be used for general-purpose parallel computation; users can develop parallel programs running on GPUs using the programming architecture called CUDA (Compute Unified Device Architecture) provided by NVIDIA. The optimal polygon triangulation problem for a convex polygon is an optimization problem that finds a triangulation with minimum total weight. It is known that this problem for a convex n-gon can be solved using the dynamic programming technique in O(n^3) time using a work space of size O(n^2). In this paper, we propose an efficient parallel implementation of this O(n^3)-time algorithm on the GPU. In our implementation, we have used two new ideas to accelerate the dynamic programming. The first idea (adaptive granularity) is to partition the dynamic programming algorithm into many sequential kernel calls of CUDA, and to select the best parameters for the size and the number of blocks for each kernel call. The second idea (sliding and mirroring arrangements) is to arrange the working data for coalesced access to the global memory of the GPU to minimize the memory access overhead. Our implementation using these two ideas solves the optimal polygon triangulation problem for a convex 8192-gon in 5.57 seconds on the NVIDIA GeForce GTX 680, while a conventional CPU implementation runs in 1939.02 seconds. Thus, our GPU implementation attains a speedup factor of 348.02.
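
    The O(n^3)-time dynamic program referred to here fills a table m[i][j] with the minimum-weight triangulation cost of the sub-polygon spanned by vertices i through j. A plain sequential reference, using triangle perimeter as one common choice of weight, is sketched below; the paper's CUDA kernels, adaptive granularity, and memory arrangements are not reproduced.

        // Hedged sketch: sequential O(n^3) DP for minimum-weight convex polygon
        // triangulation, with triangle perimeter as the weight of each triangle.
        #include <cstdio>
        #include <cmath>
        #include <vector>
        #include <algorithm>

        struct Pt { double x, y; };

        static double dist(const Pt &a, const Pt &b)
        {
            return std::hypot(a.x - b.x, a.y - b.y);
        }

        double min_weight_triangulation(const std::vector<Pt> &v)
        {
            int n = (int)v.size();
            std::vector<std::vector<double>> m(n, std::vector<double>(n, 0.0));
            for (int len = 2; len < n; ++len)           // gap between vertices i and j
                for (int i = 0; i + len < n; ++i) {
                    int j = i + len;
                    m[i][j] = 1e300;
                    for (int k = i + 1; k < j; ++k) {   // try every apex k
                        double w = dist(v[i], v[k]) + dist(v[k], v[j]) + dist(v[j], v[i]);
                        m[i][j] = std::min(m[i][j], m[i][k] + m[k][j] + w);
                    }
                }
            return m[0][n - 1];
        }

        int main()
        {
            std::vector<Pt> square = {{0, 0}, {1, 0}, {1, 1}, {0, 1}};
            std::printf("total weight: %f\n", min_weight_triangulation(square));
            return 0;
        }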

  • HiCrypt: A Specialized Translator for Symmetric Block Cipher and GPGPU

    Keisuke IWAI  Naoki NISHIKAWA  Takakazu KUROKAWA  

     
    PAPER

    Vol: E96-D No:12
    Page(s): 2575-2586

    Many-core computer systems with GPUs are coming into mainstream use, from high-end computing, including supercomputers, to embedded processors. Consequently, the implementation of cryptographic methods on GPGPUs is also becoming popular because of such systems' performance. However, many factors affect the performance of GPUs. To cope with this problem, we developed a new translator, HiCrypt, which can generate an optimized GPGPU program, written in both CUDA and OpenCL, from a cipher program written in standard C with directives. Users must annotate only the variables and the encoding/decoding function, which are characteristic of cipher programs, with directives. To evaluate the translator, five representative cipher programs were translated into CUDA and OpenCL programs. The generated programs achieve throughput almost identical to that of hand-optimized programs for all five ciphers. HiCrypt will contribute to the development and evaluation of new and various symmetric block ciphers using GPGPU.

  • Periodic Pattern Coding for Last Level Cache Data Compression

    Haruhiko KANEKO  

     
    PAPER-Data Compression

    Vol: E96-A No:12
    Page(s): 2351-2359

    In spite of the continuous improvement in the computational power of multi/many-core processors, their memory access performance has not improved sufficiently, and thus the overall performance of recent processors is often restricted by the delay of off-chip memory accesses. Low-delay data compression for the last level cache (LLC) would be effective for improving processor performance, because compression increases the effective size of the LLC and thus reduces the number of off-chip memory accesses. This paper proposes a novel data compression method suitable for high-speed parallel decoding in the LLC. Since cache line data often have periodicity of certain lengths, such as 32- or 64-bit instructions, 32-bit integers, and 64-bit floating-point numbers, an information word is encoded as a base pattern and a differential pattern between the original word and the base pattern. Evaluation using a GPU simulator shows that the compression ratio of the proposed coding is comparable to LZSS coding and X-Match Pro, and superior to other conventional compression algorithms for cache memories. This paper also presents an experimental decoder designed for ASIC, and the synthesized result shows that the decoder can decompress cache line data of length 32 bytes in four clock cycles. Evaluation of the IPC on the GPU simulator shows that, for several benchmark programs, the IPC achieved by the proposed coding is higher than that of the conventional BΔI coding, with a maximum IPC improvement of 20%.
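
    The base-pattern/differential-pattern idea can be pictured with a toy host-side routine: view a 32-byte line as eight 32-bit words, take the first word as the base, and keep only one-byte deltas when every word stays close to the base. This mirrors the general base+delta family (as in BΔI) rather than the periodic pattern coding actually proposed in the paper.

        // Hedged sketch: base + one-byte-delta compression of a 32-byte line seen as
        // eight 32-bit words. Returns true and fills out[] (4-byte base + 8 deltas =
        // 12 bytes) when every word lies within +/-127 of the base word.
        #include <cstdio>
        #include <cstdint>
        #include <cstring>

        bool compress_base_delta(const uint32_t line[8], uint8_t out[12])
        {
            uint32_t base = line[0];
            int8_t deltas[8];
            for (int i = 0; i < 8; ++i) {
                int64_t d = (int64_t)line[i] - (int64_t)base;
                if (d < -127 || d > 127) return false;   // not compressible this way
                deltas[i] = (int8_t)d;
            }
            std::memcpy(out, &base, 4);
            std::memcpy(out + 4, deltas, 8);
            return true;
        }

        int main()
        {
            uint32_t line[8] = {1000, 1001, 1003, 999, 1010, 1000, 995, 1002};
            uint8_t packed[12];
            std::printf("compressible: %s\n",
                        compress_base_delta(line, packed) ? "yes (32 -> 12 bytes)" : "no");
            return 0;
        }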

  • GPU-Chariot: A Programming Framework for Stream Applications Running on Multi-GPU Systems

    Fumihiko INO  Shinta NAKAGAWA  Kenichi HAGIHARA  

     
    PAPER

    Vol: E96-D No:12
    Page(s): 2604-2616

    This paper presents a stream programming framework, named GPU-chariot, for accelerating stream applications running on graphics processing units (GPUs). The main contribution of our framework is that it realizes efficient software pipelines on multi-GPU systems by enabling out-of-order execution of CPU functions, kernels, and data transfers. To achieve this out-of-order execution, we apply a runtime scheduler that not only maximizes the utilization of system resources but also encapsulates the number of GPUs available in the system. In addition, we implement a load-balancing capability to flow data efficiently through multiple GPUs. Furthermore, a callback interface enables overlapping execution of functions in third-party libraries. By using kernels with different performance bottlenecks, we show that our out-of-order execution is up to 20% faster than in-order execution. Finally, we conduct several case studies on a 4-GPU system and demonstrate the advantages of GPU-chariot over a manually pipelined code. We conclude that GPU-chariot can be useful when developing stream applications with software pipelines on multiple GPUs and CPUs.
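
    The pipelined, out-of-order execution described here builds on the same primitives as ordinary CUDA streams: asynchronous copies and kernel launches issued to different streams may overlap. The minimal sketch below splits one array into chunks and pipelines host-to-device copy, kernel, and device-to-host copy across two streams; it is not GPU-chariot's runtime scheduler.

        // Hedged sketch: overlapping transfers and kernels with two CUDA streams.
        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void scale(float *d, int n, float s)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) d[i] *= s;
        }

        int main()
        {
            const int N = 1 << 20, CHUNK = N / 4;
            float *h, *d;
            cudaMallocHost(&h, N * sizeof(float));      // pinned memory enables async copies
            cudaMalloc(&d, N * sizeof(float));
            for (int i = 0; i < N; ++i) h[i] = 1.0f;

            cudaStream_t st[2];
            cudaStreamCreate(&st[0]);
            cudaStreamCreate(&st[1]);

            for (int c = 0; c < N / CHUNK; ++c) {       // pipeline chunks across streams
                cudaStream_t s = st[c % 2];
                float *hp = h + c * CHUNK, *dp = d + c * CHUNK;
                cudaMemcpyAsync(dp, hp, CHUNK * sizeof(float), cudaMemcpyHostToDevice, s);
                scale<<<(CHUNK + 255) / 256, 256, 0, s>>>(dp, CHUNK, 2.0f);
                cudaMemcpyAsync(hp, dp, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, s);
            }
            cudaDeviceSynchronize();
            std::printf("h[0] = %f (expect 2.0)\n", h[0]);

            cudaStreamDestroy(st[0]);
            cudaStreamDestroy(st[1]);
            cudaFreeHost(h);
            cudaFree(d);
            return 0;
        }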

  • Lossless Compression of Double-Precision Floating-Point Data for Numerical Simulations: Highly Parallelizable Algorithms for GPU Computing

    Mamoru OHARA  Takashi YAMAGUCHI  

     
    PAPER-Parallel and Distributed Computing

    Vol: E95-D No:12
    Page(s): 2778-2786

    In numerical simulations using massively parallel computing devices such as GPGPUs (General-Purpose computing on Graphics Processing Units), we often need to transfer computational results from external devices such as GPUs to the main memory or secondary storage of the host machine. Since the computation results are sometimes too large to hold, it is desirable to compress the data before storing it. In addition, considering the overhead of transferring data between the devices and host memory, it is preferable to compress the data as part of the parallel computation performed on the devices. Traditional compression methods for floating-point numbers do not always show good parallelism. In this paper, we propose a new compression method for massively parallel simulations running on GPUs, in which we combine a few successive floating-point numbers and interleave them to improve compression efficiency. We also present numerical examples of the compression ratio and throughput obtained from experimental implementations of the proposed method running on CPUs and GPUs.
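
    The combine-and-interleave step can be pictured as a byte transpose: take a small group of consecutive doubles and emit all of their first bytes, then all of their second bytes, and so on, so that the slowly varying sign/exponent bytes cluster together and compress better. The host-side sketch below shows only this reordering with a hypothetical group size; it is not the paper's full codec.

        // Hedged sketch: byte-interleave a group of G consecutive doubles so that
        // byte position b of every value is stored contiguously (a byte transpose).
        #include <cstdio>
        #include <cstdint>
        #include <cstring>

        void interleave_group(const double *in, uint8_t *out, int G)
        {
            for (int v = 0; v < G; ++v) {
                uint8_t bytes[8];
                std::memcpy(bytes, &in[v], 8);
                for (int b = 0; b < 8; ++b)
                    out[b * G + v] = bytes[b];     // gather byte b of every value
            }
        }

        int main()
        {
            const int G = 4;                       // hypothetical group size
            double vals[G] = {1.000, 1.001, 1.002, 1.003};   // slowly varying data
            uint8_t packed[8 * G];
            interleave_group(vals, packed, G);
            for (int i = 0; i < 8 * G; ++i)
                std::printf("%02x%s", packed[i], (i % G == G - 1) ? "\n" : " ");
            return 0;
        }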

  • Asymptotically Optimal Merging on ManyCore GPUs

    Arne KUTZNER  Pok-Son KIM  Won-Kwang PARK  

     
    PAPER-Parallel and Distributed Computing

    Vol: E95-D No:12
    Page(s): 2769-2777

    We propose a family of algorithms for efficiently merging on contemporary GPUs, such that each algorithm requires O(m log(n/m+1)) element comparisons, where m and n are the sizes of the input sequences with m ≤ n. According to the lower bound for merging, all proposed algorithms are asymptotically optimal with respect to the number of necessary comparisons. First we introduce a parallel algorithm that splits a merging problem of size 2^l into 2^i subproblems of size 2^(l-i), for some arbitrary i with 0 ≤ i ≤ l. For i=l this algorithm is itself a merger, but it is rather inefficient in that case. The efficiency is boosted by moving to a two-stage approach in which the splitting process stops at some predetermined level and transfers control to several block-mergers operating in parallel. We formally prove the asymptotic optimality of the splitting process and show that, for symmetrically sized inputs, our approach delivers runtimes up to 4 times faster than the thrust::merge function that is part of the Thrust library. To assess the value of our merging technique in the context of sorting, we construct and evaluate a MergeSort on top of it. In our benchmarks the resulting MergeSort clearly outperforms the MergeSort implementation provided by the Thrust library as well as Cederman's GPU-optimized variant of QuickSort.
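
    The splitting step can be made concrete with the usual merge-path (co-rank) binary search: for any output position k there is a split that takes i elements from A and k-i from B, so the merge decomposes into independent, equally sized subproblems. The host-side sketch below demonstrates the partitioning and merges each slice with std::merge; the paper's GPU block-mergers are not reproduced.

        // Hedged sketch: merge-path partitioning of two sorted arrays into independent,
        // equally sized merge subproblems (the splitting idea, not the paper's kernels).
        #include <cstdio>
        #include <vector>
        #include <algorithm>

        // Merge-path split: number of elements to take from A so that the first k merged
        // outputs are exactly that prefix of A plus the first k-i elements of B.
        int corank(int k, const std::vector<int> &A, const std::vector<int> &B)
        {
            int m = (int)A.size(), n = (int)B.size();
            int lo = std::max(0, k - n), hi = std::min(k, m);
            while (lo < hi) {
                int i = (lo + hi) / 2, j = k - i;
                if (A[i] < B[j - 1]) lo = i + 1;   // too few elements taken from A
                else                 hi = i;
            }
            return lo;
        }

        int main()
        {
            std::vector<int> A = {1, 3, 4, 8, 9, 12}, B = {2, 5, 6, 7, 10, 11};
            int total = (int)(A.size() + B.size()), parts = 4;
            std::vector<int> out(total);

            for (int p = 0; p < parts; ++p) {          // each part is independent work
                int k0 = p * total / parts, k1 = (p + 1) * total / parts;
                int i0 = corank(k0, A, B), i1 = corank(k1, A, B);
                std::merge(A.begin() + i0, A.begin() + i1,
                           B.begin() + (k0 - i0), B.begin() + (k1 - i1),
                           out.begin() + k0);          // merge just this slice
            }
            for (int v : out) std::printf("%d ", v);
            std::printf("\n");
            return 0;
        }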

  • Implementation and Optimization of Image Processing Algorithms on Embedded GPU

    Nitin SINGHAL  Jin Woo YOO  Ho Yeol CHOI  In Kyu PARK  

     
    PAPER-Image Processing and Video Processing

    Vol: E95-D No:5
    Page(s): 1475-1484

    In this paper, we analyze the key factors underlying the implementation, evaluation, and optimization of image processing and computer vision algorithms on an embedded GPU using the OpenGL ES 2.0 shader model. First, we present the characteristics of the embedded GPU and its inherent advantages compared to an embedded CPU. We then propose techniques for achieving increased performance through optimized shader design. To show the effectiveness of the proposed techniques, we employ cartoon-style non-photorealistic rendering (NPR), speeded-up robust feature (SURF) detection, and stereo matching as our example algorithms. Performance is evaluated in terms of the execution time and the speed-up achieved in comparison with the implementation on an embedded CPU.

  • Implementation of Scale and Rotation Invariant On-Line Object Tracking Based on CUDA

    Quan MIAO  Guijin WANG  Xinggang LIN  

     
    LETTER-Image Recognition, Computer Vision

    Vol: E94-D No:12
    Page(s): 2549-2552

    Object tracking is a major technique in image processing and computer vision, and tracking speed directly determines the quality of applications. This paper presents a parallel implementation of a recently proposed scale- and rotation-invariant on-line object tracking system. The implementation targets NVIDIA's Graphics Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA), following the single-instruction multiple-thread model. Specifically, we analyze the original algorithm and propose a GPU-based parallel design, with emphasis on exploiting data parallelism and memory usage. In addition, we apply optimization techniques to maximize the utilization of NVIDIA's GPU and to reduce the data transfer time. Experimental results show that our GPGPU-based method running on a GTX480 graphics card achieves up to a 12X speed-up, including I/O time, compared with an equivalent implementation on an Intel E8400 3.0 GHz CPU.

  • Design and Implementation of a Real-Time Video-Based Rendering System Using a Network Camera Array

    Yuichi TAGUCHI  Keita TAKAHASHI  Takeshi NAEMURA  

     
    PAPER-Image Processing and Video Processing

    Vol: E92-D No:7
    Page(s): 1442-1452

    We present a real-time video-based rendering system using a network camera array. Our system consists of 64 commodity network cameras connected to a single PC through gigabit Ethernet. To render a high-quality novel view, our system estimates a view-dependent per-pixel depth map in real time using a layered representation. The rendering algorithm is fully implemented on the GPU, which allows our system to efficiently perform the capturing and rendering processes as a pipeline by using the CPU and GPU independently. With QVGA input video resolution, our system renders a free-viewpoint video at up to 30 frames per second, depending on the output video resolution and the number of depth layers. Experimental results show high-quality images synthesized from various scenes.