
Keyword Search Result

[Keyword] GPGPU (17 hits)

Hits 1-17
  • GPGPU Implementation of Variational Bayesian Gaussian Mixture Models

    Hiroki NISHIMOTO  Renyuan ZHANG  Yasuhiko NAKASHIMA  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2021/11/24
    Vol: E105-D No:3
    Page(s): 611-622

    In this work, an efficient implementation strategy for accelerating high-quality clustering algorithms is developed on the basis of general-purpose graphics processing units (GPGPUs). Among various clustering algorithms, a Gaussian mixture model (GMM) whose parameters are estimated through the variational Bayesian (VB) mechanism is adopted because of its superior clustering performance. Since the VB-GMM methodology is computationally demanding, the GPGPU is employed to carry out the massive matrix computations. To efficiently migrate the conventional CPU-oriented VB-GMM scheme onto GPGPU platforms, an entire migration flow with thirteen stages is presented in detail. A CPU-GPGPU cooperation scheme, execution reordering, and memory access optimizations are proposed to improve GPGPU utilization and maximize the clustering speed. Five types of real-world applications along with relevant data sets are introduced for cross-validation. The experimental results verify the feasibility and practical benefit of implementing the VB-GMM algorithm on a GPGPU: the proposed migration achieves a speedup of up to 192x and, furthermore, correctly identifies the proper number of clusters, which the conventional EM algorithm can hardly do.
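
    As an illustration of the kind of data-parallel computation involved, the sketch below (CUDA C++) evaluates GMM responsibilities for diagonal-covariance components with one thread per data point; it is not the paper's thirteen-stage migration flow, and the kernel name, data layout, and parameters are hypothetical.

        // Hedged sketch: per-point GMM responsibility evaluation (E-step-like).
        // One thread handles one data point; K components, D dimensions,
        // diagonal covariances; log-sum-exp is used for numerical stability.
        #include <cfloat>

        __global__ void responsibilities(const float *x,      // [N*D] data points
                                         const float *mu,     // [K*D] component means
                                         const float *var,    // [K*D] diagonal variances
                                         const float *logpi,  // [K]   log mixing weights
                                         float *r,            // [N*K] output responsibilities
                                         int N, int D, int K)
        {
            int n = blockIdx.x * blockDim.x + threadIdx.x;
            if (n >= N) return;

            float maxlog = -FLT_MAX;
            for (int k = 0; k < K; ++k) {
                float lp = logpi[k];
                for (int d = 0; d < D; ++d) {                  // log N(x_n | mu_k, var_k)
                    float diff = x[n * D + d] - mu[k * D + d];
                    lp -= 0.5f * (logf(2.0f * 3.14159265f * var[k * D + d])
                                  + diff * diff / var[k * D + d]);
                }
                r[n * K + k] = lp;                             // keep log-numerators
                maxlog = fmaxf(maxlog, lp);
            }
            float sum = 0.0f;                                  // normalize via log-sum-exp
            for (int k = 0; k < K; ++k)
                sum += expf(r[n * K + k] - maxlog);
            for (int k = 0; k < K; ++k)
                r[n * K + k] = expf(r[n * K + k] - maxlog) / sum;
        }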

  • Instruction Prefetch for Improving GPGPU Performance

    Jianli CAO  Zhikui CHEN  Yuxin WANG  He GUO  Pengcheng WANG  

     
    PAPER-VLSI Design Technology and CAD

    Publicized: 2020/11/16
    Vol: E104-A No:5
    Page(s): 773-785

    Like many processors, GPGPUs suffer from the memory wall. The traditional solutions to this issue are to use efficient schedulers to hide long memory access latency or to use a data prefetch mechanism to reduce the latency caused by data transfer. In this paper, we study the instruction fetch stage of the GPU pipeline and analyze the relationship between the capacity of a GPU kernel and its instruction miss rate. We improve the next-line prefetch mechanism to fit the SIMT model of the GPU and determine the optimal parameters of the prefetch mechanism on the GPU through experiments. The experimental results show that the prefetch mechanism achieves a 12.17% performance improvement on average. Compared with the alternative of enlarging the I-cache, the prefetch mechanism benefits more kernels at a lower hardware cost.

  • A Rabin-Karp Implementation for Handling Multiple Pattern-Matching on the GPU

    Lucas Saad Nogueira NUNES  Jacir Luiz BORDIM  Yasuaki ITO  Koji NAKANO  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2020/09/24
    Vol: E103-D No:12
    Page(s): 2412-2420

    The volume of digital information is growing at an extremely fast pace, which, in turn, exacerbates the need for efficient mechanisms to find the presence of a pattern in an input text or a set of input strings. Combining the processing power of the Graphics Processing Unit (GPU) with matching algorithms is a natural way to speed up the string-matching process. This work proposes a Parallel Rabin-Karp implementation (PRK) that incorporates a fast parallel prefix-sums algorithm to maximize parallelization and accelerate the matching verification. Given an input text T of length n and p patterns of length m, the proposed implementation finds all occurrences of the patterns in T in O(m+q+n/τ+nm/q) time, where q is a sufficiently large prime number and τ is the available number of threads. Sequential and parallel versions of the PRK have been implemented. Experiments have been executed on p≥1 patterns of length m ∈ {10, 20, 30} characters, matched against a text string of length n=2^27. The results show that the parallel implementation of the PRK algorithm on an NVIDIA V100 GPU provides a speedup surpassing 372 times over the sequential implementation and a speedup of 12.59 times over an OpenMP implementation running on a multi-core server with 128 threads. Compared to another prominent GPU implementation, the PRK implementation attains a speedup surpassing 37 times.
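
    The core matching step, comparing window hashes modulo a large prime q and verifying only on a hash hit, can be sketched as a CUDA kernel with one thread per candidate position. This simplified fragment recomputes each window hash directly instead of using the paper's prefix-sums construction; the names and constants are illustrative.

        // Hedged sketch: one thread per candidate position i of text T (length n).
        // Each thread hashes T[i..i+m-1] mod Q and verifies the match on a hash hit.
        // match[] is assumed to be zero-initialized by the host.
        __global__ void rabin_karp(const unsigned char *text, int n,
                                   const unsigned char *pattern, int m,
                                   unsigned long long patHash,
                                   unsigned long long B, unsigned long long Q,
                                   int *match)   // match[i] = 1 if the pattern occurs at i
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i + m > n) return;

            unsigned long long h = 0;
            for (int k = 0; k < m; ++k)           // window hash (no rolling reuse here)
                h = (h * B + text[i + k]) % Q;

            if (h == patHash) {                    // verify to rule out hash collisions
                int k = 0;
                while (k < m && text[i + k] == pattern[k]) ++k;
                match[i] = (k == m) ? 1 : 0;
            }
        }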

  • Accelerating Large-Scale Interconnection Network Simulation by Cellular Automata Concept

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Computer System

    Publicized: 2018/10/05
    Vol: E102-D No:1
    Page(s): 52-74

    State-of-the-art parallel systems employ a huge number of computing nodes that are connected by an interconnection network (ICN). The ICN plays an important role in a parallel system, since it determines the system's communication capability. In general, an ICN exhibits non-linear phenomena in its communication performance, most of which are caused by congestion. Designing a large-scale parallel system therefore requires thorough exploration through repetitive simulation runs, which raises the further problem of simulating large-scale systems at a reasonable cost. This paper presents a promising solution based on the cellular automata concept, which originated in our prior work. Assuming 2D-torus topologies to simplify the discussion, this paper covers the fundamental design of router functions in terms of cellular automata, the data structure of packets, an alternative model of the router function, and miscellaneous optimizations. The proposed models have a good affinity with GPGPU technology and, as representative speed-up results, the GPU-based simulator accelerates simulation by up to about 1264 times over sequential execution on a single CPU. Furthermore, since the proposed models are also applicable in the shared-memory model, a multithreaded implementation of the proposed methods achieves speed-ups of up to about 162 times.
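
    The cellular-automaton view updates every cell from its own state and its four torus neighbours in one synchronous step, which maps naturally to one GPU thread per cell. The fragment below shows only the torus-wrapped neighbour access pattern with a hypothetical relaxation rule; it is not the router model defined in the paper.

        // Hedged sketch: one synchronous CA step on a W x H 2D torus, one thread per cell.
        // next[x][y] is computed from cur[x][y] and its four wrap-around neighbours.
        __global__ void ca_step(const float *cur, float *next, int W, int H)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= W || y >= H) return;

            int xm = (x + W - 1) % W, xp = (x + 1) % W;   // torus wrap in x
            int ym = (y + H - 1) % H, yp = (y + 1) % H;   // torus wrap in y

            float self  = cur[y  * W + x];
            float nbsum = cur[y  * W + xm] + cur[y  * W + xp]
                        + cur[ym * W + x ] + cur[yp * W + x ];

            // Hypothetical local rule: relax toward the neighbour average.
            next[y * W + x] = 0.5f * self + 0.125f * nbsum;
        }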

  • Cooperative GPGPU Scheduling for Consolidating Server Workloads

    Yusuke SUZUKI  Hiroshi YAMADA  Shinpei KATO  Kenji KONO  

     
    PAPER-Software System

    Publicized: 2018/08/30
    Vol: E101-D No:12
    Page(s): 3019-3037

    Graphics processing units (GPUs) have become an attractive platform for general-purpose computing (GPGPU) in various domains. Making GPUs a time-multiplexed resource is key to consolidating GPGPU applications (apps) in multi-tenant cloud platforms. However, advanced GPGPU apps pose a new challenge for consolidation. Such highly functional GPGPU apps, referred to as GPU eaters, can easily monopolize a shared GPU and starve collocated GPGPU apps. This paper presents GLoop, a software runtime that enables consolidation of GPGPU apps, including GPU eaters. GLoop offers an event-driven programming model, which allows GLoop-based apps to inherit the GPU eaters' high functionality while being proportionally scheduled on a shared GPU in an isolated manner. We implemented a prototype of GLoop and ported eight GPU eaters to it. The experimental results demonstrate that our prototype successfully schedules the consolidated GPGPU apps according to its scheduling policy and isolates resources among them.

  • GPU-Accelerated Bulk Execution of Multiple-Length Multiplication with Warp-Synchronous Programming Technique

    Takumi HONDA  Yasuaki ITO  Koji NAKANO  

     
    PAPER-GPU computing

    Publicized: 2016/08/24
    Vol: E99-D No:12
    Page(s): 3004-3012

    In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique. We assign each multiple-length multiplication to one warp, which consists of 32 threads. In parallel processing using multiple threads, it is usually costly to synchronize the threads' execution and to communicate between them. With the warp-synchronous programming technique, however, execution of the threads in a warp can be synchronized instruction by instruction, without any barrier synchronization operations. Also, inter-thread communication can be performed with warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on an NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication as a subroutine for larger bit lengths; the GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
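
    A minimal illustration of the warp-shuffle communication this relies on is the standard warp-wide reduction below: the 32 threads of a warp exchange partial values with __shfl_down_sync while staying in lockstep, with no barrier and no shared memory. It shows only the communication mechanism, not the multiple-length multiplication algorithm itself.

        // Hedged sketch: warp-synchronous sum of 32 values per warp using warp shuffles.
        // No __syncthreads() and no shared memory are needed within a warp.
        // Assumes blockDim.x is a multiple of 32 and the grid exactly covers the input.
        __global__ void warp_sum(const int *in, int *out)
        {
            int tid  = blockIdx.x * blockDim.x + threadIdx.x;
            int lane = threadIdx.x & 31;                    // lane id within the warp
            int val  = in[tid];

            // Tree reduction: each step folds the value of a higher lane into a lower one.
            for (int offset = 16; offset > 0; offset >>= 1)
                val += __shfl_down_sync(0xffffffffu, val, offset);

            if (lane == 0)                                  // lane 0 now holds the warp's sum
                out[tid >> 5] = val;
        }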

  • Accuracy Assessment of FDTD Method for the Analysis of Sub-Wavelength Photonic Structures

    Yasuo OHTERA  

     
    PAPER

    Vol: E99-C No:7
    Page(s): 780-787

    The FDTD (Finite-Difference Time-Domain) method has been widely used for the analysis of photonic devices consisting of sub-wavelength structures. In recent years, increasing efforts have been made to implement FDTD on GPGPUs (General-Purpose Graphics Processing Units) to shorten simulation time. On the other hand, it is widely recognized that most middle- and low-end GPGPUs show a large gap in computational performance between single-precision and double-precision arithmetic. The choice of single or double precision for the electromagnetic field variables in FDTD therefore becomes a key issue for the overall simulation performance. In this study we investigated how the results differ between single-precision and double-precision computation. As one of the most fundamental sub-wavelength photonic structures, we focused on an alternating multilayer (a one-dimensional periodic structure). The obtained results indicate that a significant difference appears in the amplitudes of the higher-order spatial harmonic waves.
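
    A compact way to probe the single- versus double-precision question is to template the field update on the floating-point type and compare the two runs. The sketch below is a host-side 1D FDTD toy (arbitrary grid, Courant number, and source), intended only to show where the precision choice enters the update equations; it is unrelated to the multilayer structure analyzed in the paper.

        // Hedged sketch: 1D FDTD (Ez/Hy) update templated on the real type, so the
        // same code can be run in float and double to compare the resulting fields.
        #include <cstdio>
        #include <vector>
        #include <cmath>

        template <typename Real>
        Real fdtd_1d(int cells, int steps)
        {
            std::vector<Real> ez(cells, Real(0)), hy(cells, Real(0));
            const Real c = Real(0.5);                 // Courant number (<= 1 in 1D)
            for (int t = 0; t < steps; ++t) {
                for (int i = 0; i < cells - 1; ++i)   // magnetic-field update
                    hy[i] += c * (ez[i + 1] - ez[i]);
                for (int i = 1; i < cells; ++i)       // electric-field update
                    ez[i] += c * (hy[i] - hy[i - 1]);
                ez[cells / 2] += std::exp(-Real(0.01) * (t - 30) * (t - 30)); // soft source
            }
            return ez[cells / 4];                     // probe one field value
        }

        int main()
        {
            double d = fdtd_1d<double>(400, 1000);
            float  f = fdtd_1d<float >(400, 1000);
            std::printf("double: %.12g\nfloat : %.12g\ndiff  : %.3g\n",
                        d, (double)f, d - (double)f);
            return 0;
        }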

  • 3D Objects Tracking by MapReduce GPGPU-Enhanced Particle Filter

    Jieyun ZHOU  Xiaofeng LI  Haitao CHEN  Rutong CHEN  Masayuki NUMAO  

     
    PAPER

    Publicized: 2015/01/21
    Vol: E98-D No:5
    Page(s): 1035-1044

    Object tracking methods have been widely used in the fields of video surveillance, motion monitoring, robotics, and so on. The particle filter is one of the most promising methods, but it is difficult to apply to real-time object tracking because of its high computation cost. In order to reduce the processing cost without sacrificing tracking quality, this paper proposes a new method for real-time 3D object tracking that parallelizes the particle filter algorithm with a MapReduce architecture running on a GPGPU. Our method is as follows. First, we use a Kinect to obtain 3D information about the objects. Unlike conventional 2D object tracking, 3D object tracking adds depth information: it can track not only along the x and y axes but also along the z axis, and the depth information can correct some errors of 2D object tracking. Second, to solve the high-computation-cost problem, we use the MapReduce architecture on the GPGPU to parallelize the particle filter algorithm. We implement the particle filter algorithms on the GPU and evaluate the performance by running a program on CUDA 5.5.

  • A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation

    Yasuaki ITO  Koji NAKANO  

     
    PAPER

    Vol: E96-D No:12
    Page(s): 2596-2603

    This paper presents a GPU (Graphics Processing Unit) implementation of dynamic programming for the optimal polygon triangulation. GPUs can now be used for general-purpose parallel computation; users can develop parallel programs running on GPUs using the programming architecture called CUDA (Compute Unified Device Architecture) provided by NVIDIA. The optimal polygon triangulation problem for a convex polygon is an optimization problem that finds a triangulation with minimum total weight. It is known that this problem for a convex n-gon can be solved using the dynamic programming technique in O(n^3) time using a work space of size O(n^2). In this paper, we propose an efficient parallel implementation of this O(n^3)-time algorithm on the GPU. In our implementation, we have used two new ideas to accelerate the dynamic programming. The first idea (adaptive granularity) is to partition the dynamic programming algorithm into many sequential kernel calls of CUDA, and to select the best parameters for the size and the number of blocks for each kernel call. The second idea (sliding and mirroring arrangements) is to arrange the working data for coalesced access to the global memory of the GPU to minimize the memory access overhead. Our implementation using these two ideas solves the optimal polygon triangulation problem for a convex 8192-gon in 5.57 seconds on the NVIDIA GeForce GTX 680, while a conventional CPU implementation runs in 1939.02 seconds. Thus, our GPU implementation attains a speedup factor of 348.02.
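
    The O(n^3)-time dynamic program referred to here fills a table m[i][j] with the minimum-weight triangulation cost of the sub-polygon spanned by vertices i through j. A plain sequential reference, using triangle perimeter as one common choice of weight, is sketched below; the paper's CUDA kernels, adaptive granularity, and memory arrangements are not reproduced.

        // Hedged sketch: sequential O(n^3) DP for minimum-weight convex polygon
        // triangulation, with triangle perimeter as the weight of each triangle.
        #include <cstdio>
        #include <cmath>
        #include <vector>
        #include <algorithm>

        struct Pt { double x, y; };

        static double dist(const Pt &a, const Pt &b)
        {
            return std::hypot(a.x - b.x, a.y - b.y);
        }

        double min_weight_triangulation(const std::vector<Pt> &v)
        {
            int n = (int)v.size();
            std::vector<std::vector<double>> m(n, std::vector<double>(n, 0.0));
            for (int len = 2; len < n; ++len)           // gap between vertices i and j
                for (int i = 0; i + len < n; ++i) {
                    int j = i + len;
                    m[i][j] = 1e300;
                    for (int k = i + 1; k < j; ++k) {   // try every apex k
                        double w = dist(v[i], v[k]) + dist(v[k], v[j]) + dist(v[j], v[i]);
                        m[i][j] = std::min(m[i][j], m[i][k] + m[k][j] + w);
                    }
                }
            return m[0][n - 1];
        }

        int main()
        {
            std::vector<Pt> square = {{0, 0}, {1, 0}, {1, 1}, {0, 1}};
            std::printf("total weight: %f\n", min_weight_triangulation(square));
            return 0;
        }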

  • HiCrypt: A Specialized Translator for Symmetric Block Cipher and GPGPU

    Keisuke IWAI  Naoki NISHIKAWA  Takakazu KUROKAWA  

     
    PAPER

    Vol: E96-D No:12
    Page(s): 2575-2586

    Many-core computer systems with GPUs are coming into mainstream use, from high-end computing, including supercomputers, to embedded processors. Consequently, the implementation of cryptographic methods on GPGPUs is also becoming popular because of such systems' performance. However, many factors affect the performance of GPUs. To cope with this problem, we developed a new translator, HiCrypt, which can generate an optimized GPGPU program, written in both CUDA and OpenCL, from a cipher program written in standard C with directives. Users must annotate only the variables and the encoding/decoding function, which are characteristic of cipher programs, with directives. To evaluate the translator, five representative cipher programs were translated into CUDA and OpenCL programs. The generated programs achieve throughput almost identical to that of hand-optimized programs for all five ciphers. HiCrypt will contribute to the development and evaluation of new and various symmetric block ciphers using GPGPU.

  • Periodic Pattern Coding for Last Level Cache Data Compression

    Haruhiko KANEKO  

     
    PAPER-Data Compression

    Vol: E96-A No:12
    Page(s): 2351-2359

    In spite of the continuous improvement in the computational power of multi/many-core processors, their memory access performance has not improved sufficiently, and thus the overall performance of recent processors is often restricted by the delay of off-chip memory accesses. Low-delay data compression for the last level cache (LLC) would be effective for improving processor performance, because compression increases the effective size of the LLC and thus reduces the number of off-chip memory accesses. This paper proposes a novel data compression method suitable for high-speed parallel decoding in the LLC. Since cache line data often have periodicity of certain lengths, such as 32- or 64-bit instructions, 32-bit integers, and 64-bit floating-point numbers, an information word is encoded as a base pattern and a differential pattern between the original word and the base pattern. Evaluation using a GPU simulator shows that the compression ratio of the proposed coding is comparable to LZSS coding and X-Match Pro, and superior to other conventional compression algorithms for cache memories. This paper also presents an experimental decoder designed for ASIC, and the synthesized result shows that the decoder can decompress cache line data of length 32 bytes in four clock cycles. Evaluation of the IPC on the GPU simulator shows that, for several benchmark programs, the IPC achieved by the proposed coding is higher than that of the conventional BΔI coding, with a maximum IPC improvement of 20%.
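
    The base-pattern/differential-pattern idea can be pictured with a toy host-side routine: view a 32-byte line as eight 32-bit words, take the first word as the base, and keep only one-byte deltas when every word stays close to the base. This mirrors the general base+delta family (as in BΔI) rather than the periodic pattern coding actually proposed in the paper.

        // Hedged sketch: base + one-byte-delta compression of a 32-byte line seen as
        // eight 32-bit words. Returns true and fills out[] (4-byte base + 8 deltas =
        // 12 bytes) when every word lies within +/-127 of the base word.
        #include <cstdio>
        #include <cstdint>
        #include <cstring>

        bool compress_base_delta(const uint32_t line[8], uint8_t out[12])
        {
            uint32_t base = line[0];
            int8_t deltas[8];
            for (int i = 0; i < 8; ++i) {
                int64_t d = (int64_t)line[i] - (int64_t)base;
                if (d < -127 || d > 127) return false;   // not compressible this way
                deltas[i] = (int8_t)d;
            }
            std::memcpy(out, &base, 4);
            std::memcpy(out + 4, deltas, 8);
            return true;
        }

        int main()
        {
            uint32_t line[8] = {1000, 1001, 1003, 999, 1010, 1000, 995, 1002};
            uint8_t packed[12];
            std::printf("compressible: %s\n",
                        compress_base_delta(line, packed) ? "yes (32 -> 12 bytes)" : "no");
            return 0;
        }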

  • GPU-Chariot: A Programming Framework for Stream Applications Running on Multi-GPU Systems

    Fumihiko INO  Shinta NAKAGAWA  Kenichi HAGIHARA  

     
    PAPER

    Vol: E96-D No:12
    Page(s): 2604-2616

    This paper presents a stream programming framework, named GPU-chariot, for accelerating stream applications running on graphics processing units (GPUs). The main contribution of our framework is that it realizes efficient software pipelines on multi-GPU systems by enabling out-of-order execution of CPU functions, kernels, and data transfers. To achieve this out-of-order execution, we apply a runtime scheduler that not only maximizes the utilization of system resources but also encapsulates the number of GPUs available in the system. In addition, we implement a load-balancing capability to flow data efficiently through multiple GPUs. Furthermore, a callback interface enables overlapping execution of functions in third-party libraries. By using kernels with different performance bottlenecks, we show that our out-of-order execution is up to 20% faster than in-order execution. Finally, we conduct several case studies on a 4-GPU system and demonstrate the advantages of GPU-chariot over a manually pipelined code. We conclude that GPU-chariot can be useful when developing stream applications with software pipelines on multiple GPUs and CPUs.
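
    The pipelined, out-of-order execution described here builds on the same primitives as ordinary CUDA streams: asynchronous copies and kernel launches issued to different streams may overlap. The minimal sketch below splits one array into chunks and pipelines host-to-device copy, kernel, and device-to-host copy across two streams; it is not GPU-chariot's runtime scheduler.

        // Hedged sketch: overlapping transfers and kernels with two CUDA streams.
        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void scale(float *d, int n, float s)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) d[i] *= s;
        }

        int main()
        {
            const int N = 1 << 20, CHUNK = N / 4;
            float *h, *d;
            cudaMallocHost(&h, N * sizeof(float));      // pinned memory enables async copies
            cudaMalloc(&d, N * sizeof(float));
            for (int i = 0; i < N; ++i) h[i] = 1.0f;

            cudaStream_t st[2];
            cudaStreamCreate(&st[0]);
            cudaStreamCreate(&st[1]);

            for (int c = 0; c < N / CHUNK; ++c) {       // pipeline chunks across streams
                cudaStream_t s = st[c % 2];
                float *hp = h + c * CHUNK, *dp = d + c * CHUNK;
                cudaMemcpyAsync(dp, hp, CHUNK * sizeof(float), cudaMemcpyHostToDevice, s);
                scale<<<(CHUNK + 255) / 256, 256, 0, s>>>(dp, CHUNK, 2.0f);
                cudaMemcpyAsync(hp, dp, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, s);
            }
            cudaDeviceSynchronize();
            std::printf("h[0] = %f (expect 2.0)\n", h[0]);

            cudaStreamDestroy(st[0]);
            cudaStreamDestroy(st[1]);
            cudaFreeHost(h);
            cudaFree(d);
            return 0;
        }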

  • Lossless Compression of Double-Precision Floating-Point Data for Numerical Simulations: Highly Parallelizable Algorithms for GPU Computing

    Mamoru OHARA  Takashi YAMAGUCHI  

     
    PAPER-Parallel and Distributed Computing

    Vol: E95-D No:12
    Page(s): 2778-2786

    In numerical simulations using massively parallel computing devices such as GPGPUs (General-Purpose computing on Graphics Processing Units), we often need to transfer computational results from external devices such as GPUs to the main memory or secondary storage of the host machine. Since the computation results are sometimes too large to hold, it is desirable to compress the data before storing it. In addition, considering the overhead of transferring data between the devices and host memory, it is preferable to compress the data as part of the parallel computation performed on the devices. Traditional compression methods for floating-point numbers do not always show good parallelism. In this paper, we propose a new compression method for massively parallel simulations running on GPUs, in which we combine a few successive floating-point numbers and interleave them to improve compression efficiency. We also present numerical examples of the compression ratio and throughput obtained from experimental implementations of the proposed method running on CPUs and GPUs.
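
    The combine-and-interleave step can be pictured as a byte transpose: take a small group of consecutive doubles and emit all of their first bytes, then all of their second bytes, and so on, so that the slowly varying sign/exponent bytes cluster together and compress better. The host-side sketch below shows only this reordering with a hypothetical group size; it is not the paper's full codec.

        // Hedged sketch: byte-interleave a group of G consecutive doubles so that
        // byte position b of every value is stored contiguously (a byte transpose).
        #include <cstdio>
        #include <cstdint>
        #include <cstring>

        void interleave_group(const double *in, uint8_t *out, int G)
        {
            for (int v = 0; v < G; ++v) {
                uint8_t bytes[8];
                std::memcpy(bytes, &in[v], 8);
                for (int b = 0; b < 8; ++b)
                    out[b * G + v] = bytes[b];     // gather byte b of every value
            }
        }

        int main()
        {
            const int G = 4;                       // hypothetical group size
            double vals[G] = {1.000, 1.001, 1.002, 1.003};   // slowly varying data
            uint8_t packed[8 * G];
            interleave_group(vals, packed, G);
            for (int i = 0; i < 8 * G; ++i)
                std::printf("%02x%s", packed[i], (i % G == G - 1) ? "\n" : " ");
            return 0;
        }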

  • Asymptotically Optimal Merging on ManyCore GPUs

    Arne KUTZNER  Pok-Son KIM  Won-Kwang PARK  

     
    PAPER-Parallel and Distributed Computing

    Vol: E95-D No:12
    Page(s): 2769-2777

    We propose a family of algorithms for efficiently merging on contemporary GPUs, such that each algorithm requires O(m log(n/m+1)) element comparisons, where m and n are the sizes of the input sequences with m ≤ n. According to the lower bound for merging, all proposed algorithms are asymptotically optimal with respect to the number of necessary comparisons. First we introduce a parallel algorithm that splits a merging problem of size 2^l into 2^i subproblems of size 2^(l-i), for some arbitrary i with 0 ≤ i ≤ l. For i=l this algorithm is itself a merger, but it is rather inefficient in that case. The efficiency is boosted by moving to a two-stage approach in which the splitting process stops at some predetermined level and transfers control to several block-mergers operating in parallel. We formally prove the asymptotic optimality of the splitting process and show that, for symmetrically sized inputs, our approach delivers runtimes up to 4 times faster than the thrust::merge function that is part of the Thrust library. To assess the value of our merging technique in the context of sorting, we construct and evaluate a MergeSort on top of it. In our benchmarks the resulting MergeSort clearly outperforms the MergeSort implementation provided by the Thrust library as well as Cederman's GPU-optimized variant of QuickSort.
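
    The splitting step can be made concrete with the usual merge-path (co-rank) binary search: for any output position k there is a split that takes i elements from A and k-i from B, so the merge decomposes into independent, equally sized subproblems. The host-side sketch below demonstrates the partitioning and merges each slice with std::merge; the paper's GPU block-mergers are not reproduced.

        // Hedged sketch: merge-path partitioning of two sorted arrays into independent,
        // equally sized merge subproblems (the splitting idea, not the paper's kernels).
        #include <cstdio>
        #include <vector>
        #include <algorithm>

        // Merge-path split: number of elements to take from A so that the first k merged
        // outputs are exactly that prefix of A plus the first k-i elements of B.
        int corank(int k, const std::vector<int> &A, const std::vector<int> &B)
        {
            int m = (int)A.size(), n = (int)B.size();
            int lo = std::max(0, k - n), hi = std::min(k, m);
            while (lo < hi) {
                int i = (lo + hi) / 2, j = k - i;
                if (A[i] < B[j - 1]) lo = i + 1;   // too few elements taken from A
                else                 hi = i;
            }
            return lo;
        }

        int main()
        {
            std::vector<int> A = {1, 3, 4, 8, 9, 12}, B = {2, 5, 6, 7, 10, 11};
            int total = (int)(A.size() + B.size()), parts = 4;
            std::vector<int> out(total);

            for (int p = 0; p < parts; ++p) {          // each part is independent work
                int k0 = p * total / parts, k1 = (p + 1) * total / parts;
                int i0 = corank(k0, A, B), i1 = corank(k1, A, B);
                std::merge(A.begin() + i0, A.begin() + i1,
                           B.begin() + (k0 - i0), B.begin() + (k1 - i1),
                           out.begin() + k0);          // merge just this slice
            }
            for (int v : out) std::printf("%d ", v);
            std::printf("\n");
            return 0;
        }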

  • Implementation and Optimization of Image Processing Algorithms on Embedded GPU

    Nitin SINGHAL  Jin Woo YOO  Ho Yeol CHOI  In Kyu PARK  

     
    PAPER-Image Processing and Video Processing

    Vol: E95-D No:5
    Page(s): 1475-1484

    In this paper, we analyze the key factors underlying the implementation, evaluation, and optimization of image processing and computer vision algorithms on an embedded GPU using the OpenGL ES 2.0 shader model. First, we present the characteristics of the embedded GPU and its inherent advantages compared to an embedded CPU. We then propose techniques for achieving increased performance through optimized shader design. To show the effectiveness of the proposed techniques, we employ cartoon-style non-photorealistic rendering (NPR), speeded-up robust feature (SURF) detection, and stereo matching as our example algorithms. Performance is evaluated in terms of the execution time and the speed-up achieved in comparison with the implementation on an embedded CPU.

  • Implementation of Scale and Rotation Invariant On-Line Object Tracking Based on CUDA

    Quan MIAO  Guijin WANG  Xinggang LIN  

     
    LETTER-Image Recognition, Computer Vision

    Vol: E94-D No:12
    Page(s): 2549-2552

    Object tracking is a major technique in image processing and computer vision, and tracking speed directly determines the quality of applications. This paper presents a parallel implementation of a recently proposed scale- and rotation-invariant on-line object tracking system. The implementation targets NVIDIA's Graphics Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA), following the single-instruction multiple-thread model. Specifically, we analyze the original algorithm and propose a GPU-based parallel design, with emphasis on exploiting data parallelism and memory usage. In addition, we apply optimization techniques to maximize the utilization of NVIDIA's GPU and to reduce the data transfer time. Experimental results show that our GPGPU-based method running on a GTX480 graphics card achieves up to a 12X speed-up, including I/O time, compared with an equivalent implementation on an Intel E8400 3.0 GHz CPU.

  • Design and Implementation of a Real-Time Video-Based Rendering System Using a Network Camera Array

    Yuichi TAGUCHI  Keita TAKAHASHI  Takeshi NAEMURA  

     
    PAPER-Image Processing and Video Processing

    Vol: E92-D No:7
    Page(s): 1442-1452

    We present a real-time video-based rendering system using a network camera array. Our system consists of 64 commodity network cameras connected to a single PC through gigabit Ethernet. To render a high-quality novel view, our system estimates a view-dependent per-pixel depth map in real time using a layered representation. The rendering algorithm is fully implemented on the GPU, which allows our system to efficiently perform the capturing and rendering processes as a pipeline by using the CPU and GPU independently. With QVGA input video resolution, our system renders a free-viewpoint video at up to 30 frames per second, depending on the output video resolution and the number of depth layers. Experimental results show high-quality images synthesized from various scenes.