
Keyword Search Result

[Keyword] GPU (89 hits)

61-80 of 89 hits

  • Periodic Pattern Coding for Last Level Cache Data Compression

    Haruhiko KANEKO

    PAPER-Data Compression
    Vol: E96-A No:12  Page(s): 2351-2359

    In spite of the continuous improvement in the computational power of multi/many-core processors, their memory access performance has not improved sufficiently, and thus the overall performance of recent processors is often restricted by the delay of off-chip memory accesses. Low-delay data compression for the last level cache (LLC) would be effective in improving processor performance, because compression increases the effective size of the LLC and thus reduces the number of off-chip memory accesses. This paper proposes a novel data compression method suitable for high-speed parallel decoding in the LLC. Since cache line data often have periodicity of certain lengths, such as 32- or 64-bit instructions, 32-bit integers, and 64-bit floating-point numbers, an information word is encoded as a base pattern and a differential pattern between the original word and the base pattern. Evaluation using a GPU simulator shows that the compression ratio of the proposed coding is comparable to LZSS coding and X-Match Pro and superior to other conventional compression algorithms for cache memories. This paper also presents an experimental decoder designed for ASIC, and the synthesis result shows that the decoder can decompress cache line data of length 32 bytes in four clock cycles. Evaluation of the IPC on the GPU simulator shows that, for several benchmark programs, the IPC achieved by the proposed coding is higher than that of the conventional BΔI coding, with a maximum IPC improvement of 20%.
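
    A minimal sketch of the base-plus-delta idea described above (closer in spirit to BΔI than to the paper's own periodic-pattern code, whose exact field layout is not given here): each 32-byte line of eight 32-bit words is stored as one base word plus eight narrow deltas when every delta fits, and kept raw otherwise. All field widths and the layout are assumptions made for illustration only.

      // base_delta_sketch.cu -- illustrative only; field widths are assumptions,
      // not the encoding defined in the paper.
      #include <cstdio>
      #include <cstdint>
      #include <cuda_runtime.h>

      // Try to encode 8 x 32-bit words as: one base word + 8 signed 8-bit deltas.
      // Returns true (compressible) and fills base/deltas on success.
      __host__ __device__ bool encode_line(const uint32_t w[8], uint32_t* base,
                                           int8_t deltas[8]) {
          *base = w[0];                              // first word is the base pattern
          for (int i = 0; i < 8; ++i) {
              int64_t d = (int64_t)w[i] - (int64_t)(*base);
              if (d < -128 || d > 127) return false; // delta does not fit
              deltas[i] = (int8_t)d;
          }
          return true;
      }

      __host__ __device__ void decode_line(uint32_t base, const int8_t deltas[8],
                                           uint32_t w[8]) {
          for (int i = 0; i < 8; ++i)
              w[i] = (uint32_t)((int64_t)base + deltas[i]);
      }

      __global__ void roundtrip(const uint32_t* in, uint32_t* out, int lines) {
          int l = blockIdx.x * blockDim.x + threadIdx.x;
          if (l >= lines) return;
          uint32_t base; int8_t d[8]; uint32_t tmp[8];
          if (encode_line(&in[l * 8], &base, d)) {
              decode_line(base, d, tmp);             // lines decode independently in parallel
              for (int i = 0; i < 8; ++i) out[l * 8 + i] = tmp[i];
          } else {
              for (int i = 0; i < 8; ++i) out[l * 8 + i] = in[l * 8 + i]; // stored raw
          }
      }

      int main() {
          uint32_t h[8] = {1000, 1001, 1003, 1000, 998, 1005, 1002, 1000};
          uint32_t *din, *dout, hout[8];
          cudaMalloc(&din, sizeof h); cudaMalloc(&dout, sizeof h);
          cudaMemcpy(din, h, sizeof h, cudaMemcpyHostToDevice);
          roundtrip<<<1, 32>>>(din, dout, 1);
          cudaMemcpy(hout, dout, sizeof h, cudaMemcpyDeviceToHost);
          printf("round-trip ok: %d\n", hout[3] == h[3]);
          cudaFree(din); cudaFree(dout);
          return 0;
      }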

  • GPU-Chariot: A Programming Framework for Stream Applications Running on Multi-GPU Systems

    Fumihiko INO  Shinta NAKAGAWA  Kenichi HAGIHARA

    PAPER
    Vol: E96-D No:12  Page(s): 2604-2616

    This paper presents a stream programming framework, named GPU-chariot, for accelerating stream applications running on graphics processing units (GPUs). The main contribution of our framework is that it realizes efficient software pipelines on multi-GPU systems by enabling out-of-order execution of CPU functions, kernels, and data transfers. To achieve this out-of-order execution, we apply a runtime scheduler that not only maximizes the utilization of system resources but also encapsulates the number of GPUs available in the system. In addition, we implement a load-balancing capability to flow data efficiently through multiple GPUs. Furthermore, a callback interface enables overlapping execution of functions in third-party libraries. By using kernels with different performance bottlenecks, we show that our out-of-order execution is up to 20% faster than in-order execution. Finally, we conduct several case studies on a 4-GPU system and demonstrate the advantages of GPU-chariot over a manually pipelined code. We conclude that GPU-chariot can be useful when developing stream applications with software pipelines on multiple GPUs and CPUs.
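
    GPU-chariot itself is not shown in this listing, but the kind of overlap it automates can be pictured with plain CUDA streams: work for every GPU is issued asynchronously before any synchronization, so transfers and kernels on different devices can proceed concurrently. The kernel, buffer sizes, and per-GPU layout below are placeholder assumptions, not the framework's API.

      // multi_gpu_streams_sketch.cu -- a minimal picture of per-GPU streams with
      // overlapped copies and kernels; not the GPU-chariot API.
      #include <cstdio>
      #include <vector>
      #include <cuda_runtime.h>

      __global__ void scale(float* x, int n, float a) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) x[i] *= a;
      }

      int main() {
          int ngpu = 0;
          cudaGetDeviceCount(&ngpu);
          const int n = 1 << 20;
          std::vector<float*> h(ngpu), d(ngpu);
          std::vector<cudaStream_t> s(ngpu);

          // Issue copy-in, kernel, and copy-out on every GPU without waiting, so
          // the runtime can overlap work across devices (and with the host).
          for (int g = 0; g < ngpu; ++g) {
              cudaSetDevice(g);
              cudaStreamCreate(&s[g]);
              cudaMallocHost(&h[g], n * sizeof(float));   // pinned memory => truly async copies
              cudaMalloc(&d[g], n * sizeof(float));
              for (int i = 0; i < n; ++i) h[g][i] = 1.0f;
              cudaMemcpyAsync(d[g], h[g], n * sizeof(float), cudaMemcpyHostToDevice, s[g]);
              scale<<<(n + 255) / 256, 256, 0, s[g]>>>(d[g], n, 2.0f);
              cudaMemcpyAsync(h[g], d[g], n * sizeof(float), cudaMemcpyDeviceToHost, s[g]);
          }
          // Only now wait; a real scheduler would instead poll events or use callbacks.
          for (int g = 0; g < ngpu; ++g) {
              cudaSetDevice(g);
              cudaStreamSynchronize(s[g]);
              printf("GPU %d: h[0] = %f\n", g, h[g][0]);
              cudaFree(d[g]); cudaFreeHost(h[g]); cudaStreamDestroy(s[g]);
          }
          return 0;
      }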

  • Simulating Cardiac Electrophysiology in the Era of GPU-Cluster Computing

    Jun CHAI  Mei WEN  Nan WU  Dafei HUANG  Jing YANG  Xing CAI  Chunyuan ZHANG  Qianming YANG

    PAPER
    Vol: E96-D No:12  Page(s): 2587-2595

    This paper presents a study of the applicability of clusters of GPUs to high-resolution 3D simulations of cardiac electrophysiology. By experimenting with representative cardiac cell models and ODE solvers, in association with solving the monodomain equation, we quantitatively analyze the obtainable computational capacity of GPU clusters. It is found that for a 501×501×101 3D mesh, which entails a 0.1mm spatial resolution, a 128-GPU cluster only needs a few minutes to carry out a 100,000-time-step cardiac excitation simulation that involves a four-variable cell model. Even higher spatial and temporal resolutions are achievable for such simplified mathematical models. On the other hand, our experiments also show that a dramatically larger cluster of GPUs is needed to handle a very detailed cardiac cell model.

  • Auto-Tuning of Thread Assignment for Matrix-Vector Multiplication on GPUs

    Jinwei WANG  Xirong MA  Yuanping ZHU  Jizhou SUN

    PAPER-Fundamentals of Information Systems
    Vol: E96-D No:11  Page(s): 2319-2326

    Modern GPUs have evolved into more general processors capable of executing scientific and engineering computations. They provide a highly parallel computing environment thanks to their large number of computing cores, which suits numerous data-parallel arithmetic computations, particularly linear algebra operations. Matrix-vector multiplication is one of the most important dense linear algebra operations. It is applied to a diverse set of applications in many fields and must therefore be fully optimized to achieve high performance. In this paper, we propose a novel auto-tuning method for matrix-vector multiplication on GPUs, in which the number of threads assigned to compute one element of the result vector is auto-tuned according to the size of the matrix. On an NVIDIA GTX 650 GPU with the recent Kepler architecture, we developed an auto-tuner that automatically selects the optimal number of assigned threads for the calculation. Based on the auto-tuner's result, we developed a versatile generic matrix-vector multiplication kernel with the CUDA programming model. A series of experiments on matrices of different shapes and sizes was conducted to compare the performance of our kernel with that of the kernels from CUBLAS 5.0, MAGMA 1.3, and a warp method. The experimental results show that the performance of our matrix-vector multiplication kernel approaches optimal behavior as the matrix size increases and depends very little on the matrix shape, a significant improvement over the other three kernels, which exhibit unstable performance for different matrix shapes.
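
    A minimal sketch of the thread-assignment idea, assuming a hand-picked team size T rather than the paper's auto-tuner: T threads cooperate on one element of y = A·x and combine their partial sums with a shared-memory reduction. Intuitively, a larger T helps short, wide matrices while T = 1 avoids reduction overhead for tall ones, which is the kind of trade-off an auto-tuner would search; the paper's tiling and tuning logic are omitted here.

      // matvec_threads_per_row.cu -- illustrative kernel where T threads cooperate
      // on one output element; T would be picked by an auto-tuner in the paper.
      #include <cstdio>
      #include <vector>
      #include <cuda_runtime.h>

      template <int T>                           // threads assigned per result element
      __global__ void matvec(const float* A, const float* x, float* y,
                             int rows, int cols) {
          int lane = threadIdx.x % T;            // position within the row's team
          int row  = (blockIdx.x * blockDim.x + threadIdx.x) / T;

          float sum = 0.0f;
          if (row < rows)
              for (int c = lane; c < cols; c += T)   // strided partial dot product
                  sum += A[row * cols + c] * x[c];

          extern __shared__ float buf[];
          buf[threadIdx.x] = sum;
          __syncthreads();
          for (int off = T / 2; off > 0; off /= 2) { // tree reduction per team
              if (lane < off) buf[threadIdx.x] += buf[threadIdx.x + off];
              __syncthreads();
          }
          if (lane == 0 && row < rows) y[row] = buf[threadIdx.x];
      }

      int main() {
          const int rows = 1024, cols = 1024;
          std::vector<float> A(rows * cols, 1.0f), x(cols, 2.0f), y(rows);
          float *dA, *dx, *dy;
          cudaMalloc(&dA, A.size() * sizeof(float));
          cudaMalloc(&dx, x.size() * sizeof(float));
          cudaMalloc(&dy, y.size() * sizeof(float));
          cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
          cudaMemcpy(dx, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);

          constexpr int T = 4, block = 128;      // 128/4 = 32 rows handled per block
          int grid = (rows * T + block - 1) / block;
          matvec<T><<<grid, block, block * sizeof(float)>>>(dA, dx, dy, rows, cols);
          cudaMemcpy(y.data(), dy, y.size() * sizeof(float), cudaMemcpyDeviceToHost);
          printf("y[0] = %f (expected %f)\n", y[0], 2.0f * cols);
          cudaFree(dA); cudaFree(dx); cudaFree(dy);
          return 0;
      }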

  • Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs

    Shuai MU  Dongdong LI  Yubei CHEN  Yangdong DENG  Zhihua WANG

    PAPER-Computer System
    Vol: E96-D No:10  Page(s): 2194-2207

    By exploiting data-level parallelism, Graphics Processing Units (GPUs) have become a high-throughput, general-purpose computing platform. Many real-world applications, however, especially those following a stream processing pattern, feature interleaved task-pipelined and data parallelism. Current GPUs are ill-equipped for such applications due to insufficient usage of computing resources and/or excessive off-chip memory traffic. In this paper, we focus on microarchitectural enhancements to enable task-pipelined execution of data-parallel kernels on GPUs. We propose an efficient adaptive dynamic scheduling mechanism and a moderately modified L2 design. With minor hardware overhead, our techniques orchestrate both task-pipelined and data parallelism in a unified manner. Simulation results obtained with a cycle-accurate simulator on real-world applications show that the proposed GPU microarchitecture improves computing throughput by 18% and reduces overall accesses to off-chip GPU memory by 13%.

  • High Throughput Parallelization of AES-CTR Algorithm

    Nhat-Phuong TRAN  Myungho LEE  Sugwon HONG  Seung-Jae LEE

    PAPER-Fundamentals of Information Systems
    Vol: E96-D No:8  Page(s): 1685-1695

    Data encryption and decryption are common operations in network-based application programs that must offer security. In order to keep pace with the high data input rate of network-based applications such as multimedia data streaming, real-time processing of the data encryption/decryption is crucial. In this paper, we propose a new parallelization approach to improve the throughput of the de-facto standard data encryption and decryption algorithm, AES-CTR (Counter mode of AES). The new approach extends the size of the block encrypted at one time across unit block boundaries, effectively encrypting multiple unit blocks at the same time. This reduces the associated parallelization overheads, such as the number of procedure calls, scheduling, and synchronization, compared with previous approaches, and leads to significant throughput improvements on a computing platform with a general-purpose multi-core processor and a Graphics Processing Unit (GPU).
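
    The structure of CTR-mode parallelism can be sketched independently of the AES internals: every keystream block depends only on the nonce and a counter, so any thread can produce any block, and each thread below handles several consecutive blocks to cut per-call overhead, roughly mirroring the enlarged-block idea above. The toy round function merely stands in for AES and is not cryptographically meaningful; the 64-bit block size and counter layout are also assumptions.

      // ctr_parallel_sketch.cu -- structure of CTR-mode parallelism only.
      // 'toy_block_cipher' is a stand-in for AES; do not use for real encryption.
      #include <cstdio>
      #include <cstdint>
      #include <vector>
      #include <cuda_runtime.h>

      __host__ __device__ uint64_t toy_block_cipher(uint64_t block, uint64_t key) {
          // A few mixing rounds; a placeholder for AES's 10/12/14 rounds.
          uint64_t x = block ^ key;
          for (int r = 0; r < 4; ++r) {
              x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 29; x += key + r;
          }
          return x;
      }

      // Each thread handles several consecutive counter blocks.
      __global__ void ctr_xcrypt(const uint64_t* in, uint64_t* out, size_t nblocks,
                                 uint64_t key, uint64_t nonce, int blocks_per_thread) {
          size_t t = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
          size_t first = t * blocks_per_thread;
          for (int k = 0; k < blocks_per_thread; ++k) {
              size_t i = first + k;
              if (i >= nblocks) return;
              uint64_t keystream = toy_block_cipher(nonce + i, key);  // counter = nonce + i
              out[i] = in[i] ^ keystream;        // the same kernel encrypts and decrypts
          }
      }

      int main() {
          const size_t nblocks = 1 << 20;
          const int bpt = 8, threads = 256;
          std::vector<uint64_t> h(nblocks, 0x0123456789abcdefULL), r(nblocks);
          uint64_t *din, *dout;
          cudaMalloc(&din, nblocks * sizeof(uint64_t));
          cudaMalloc(&dout, nblocks * sizeof(uint64_t));
          cudaMemcpy(din, h.data(), nblocks * sizeof(uint64_t), cudaMemcpyHostToDevice);

          int work_threads = (int)((nblocks + bpt - 1) / bpt);
          int grid = (work_threads + threads - 1) / threads;
          ctr_xcrypt<<<grid, threads>>>(din, dout, nblocks, 42, 7, bpt);   // encrypt
          ctr_xcrypt<<<grid, threads>>>(dout, dout, nblocks, 42, 7, bpt);  // decrypt in place
          cudaMemcpy(r.data(), dout, nblocks * sizeof(uint64_t), cudaMemcpyDeviceToHost);
          printf("round-trip ok: %d\n", r[12345] == h[12345]);
          cudaFree(din); cudaFree(dout);
          return 0;
      }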

  • Fast and Robust 3D Correspondence Matching and Its Application to Volume Registration Open Access

    Yuichiro TAJIMA  Kinya FUDANO  Koichi ITO  Takafumi AOKI

    PAPER-Medical Image Processing
    Vol: E96-D No:4  Page(s): 826-835

    This paper presents a fast and accurate volume correspondence matching method using 3D Phase-Only Correlation (POC). The proposed method employs (i) a coarse-to-fine strategy using multi-scale volume pyramids for correspondence search and (ii) high-accuracy POC-based local block matching for finding dense volume correspondence with sub-voxel displacement accuracy. This paper also proposes a GPU implementation to achieve fast and practical volume registration. Experimental evaluation shows that the proposed approach exhibits higher accuracy and lower computational cost than a conventional method. We also demonstrate that the GPU implementation can align two volumes in several seconds, which is suitable for practical use in image-guided radiation therapy.
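
    For reference, the textbook definition of Phase-Only Correlation, which the paper extends to 3D volumes (the symbols below are the common ones, not necessarily those used in the paper): the POC function is the inverse DFT of the normalized cross-power spectrum of the two inputs, and the location of its sharp peak gives their relative displacement.

      R(\mathbf{k}) \;=\; \frac{F(\mathbf{k})\,\overline{G(\mathbf{k})}}
                               {\bigl|F(\mathbf{k})\,\overline{G(\mathbf{k})}\bigr|},
      \qquad
      r(\mathbf{n}) \;=\; \mathrm{IDFT}\bigl[R(\mathbf{k})\bigr](\mathbf{n})

    Here F and G denote the (3D) DFTs of the two volumes; sub-voxel accuracy is typically obtained by fitting the analytic peak model around the maximum of r.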

  • Parallel Sparse Cholesky Factorization on a Heterogeneous Platform

    Dan ZOU  Yong DOU  Rongchun LI

    LETTER-Algorithms and Data Structures
    Vol: E96-A No:4  Page(s): 833-834

    We present a new approach for sparse Cholesky factorization on a heterogeneous platform with a graphics processing unit (GPU). The sparse Cholesky factorization is one of the core algorithms of numerous computing applications. We tuned the supernode data structure and used a parallelization method for GPU tasks to increase GPU utilization. Results show that our approach substantially reduces computational time.

  • Implementation of a GPU-Oriented Absorbing Boundary Condition for 3D-FDTD Electromagnetic Simulation

    Keisuke DOHI  Yuichiro SHIBATA  Kiyoshi OGURI  Takafumi FUJIMOTO

    PAPER-Parallel and Distributed Computing
    Vol: E95-D No:12  Page(s): 2787-2795

    In this paper, we propose and discuss efficient GPU implementation techniques for absorbing boundary conditions (ABCs) in a 3D finite-difference time-domain (FDTD) electromagnetic field simulation for antenna design. In view of the architectural nature of GPUs, the idea of a periodic boundary condition is introduced into the implementation of perfectly matched layers (PMLs), together with a transformation technique for the PML equations on partial boundaries. We also present an efficient implementation method for a non-uniform grid. Evaluation with a typical simulation model reveals that our proposed techniques almost double the simulation performance, eventually achieving 55.8% of the peak memory bandwidth of the target GPU.

  • Blocked United Algorithm for the All-Pairs Shortest Paths Problem on Hybrid CPU-GPU Systems

    Kazuya MATSUMOTO  Naohito NAKASATO  Stanislav G. SEDUKHIN

    PAPER-Parallel and Distributed Computing
    Vol: E95-D No:12  Page(s): 2759-2768

    This paper presents a blocked united algorithm for the all-pairs shortest paths (APSP) problem. The algorithm simultaneously computes both the shortest-path distance matrix and the shortest-path construction matrix of a graph, and is designed for high-speed APSP solution on hybrid CPU-GPU systems. In our implementation, the two most compute-intensive parts of the algorithm are performed on the GPU: solving the APSP sub-problem for a block of sub-matrices, and a matrix-matrix “multiplication” for the APSP problem. Moreover, the amount of data communication between CPU (host) memory and GPU memory is reduced by reusing blocks once sent to the GPU. When the problem size (the number of vertices in a graph) is large enough compared to the block size, our implementation of the blocked algorithm requires exchanging only three blocks between the CPU and the GPU during each block computation on the GPU. We measured the performance of the implementation on two different CPU-GPU systems. A system containing an Intel Sandy Bridge CPU (Core i7 2600K) and an AMD Cayman GPU (Radeon HD 6970) achieves up to 1.1 TFlop/s in single precision.
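
    The matrix-matrix “multiplication” for the APSP problem mentioned above is a min-plus (distance) product; a plain, untiled CUDA version is sketched below to make the operation concrete. The paper's blocked kernel additionally tiles through shared memory and maintains the path-construction matrix, both omitted here, and the chain-graph test input is an assumption for the example.

      // minplus_sketch.cu -- distance (min-plus) product used as the inner kernel
      // of blocked APSP; untiled for brevity, unlike the paper's version.
      #include <cstdio>
      #include <vector>
      #include <cuda_runtime.h>

      const float INF = 1e30f;

      __global__ void minplus(const float* A, const float* B, float* C, int n) {
          int col = blockIdx.x * blockDim.x + threadIdx.x;
          int row = blockIdx.y * blockDim.y + threadIdx.y;
          if (row >= n || col >= n) return;
          float best = A[row * n + col];             // keep the previously found path
          for (int k = 0; k < n; ++k) {
              float cand = A[row * n + k] + B[k * n + col];
              if (cand < best) best = cand;
          }
          C[row * n + col] = best;
      }

      int main() {
          const int n = 256;
          std::vector<float> h(n * n, INF);
          for (int i = 0; i < n; ++i) h[i * n + i] = 0.0f;          // self-distances
          for (int i = 0; i + 1 < n; ++i) h[i * n + i + 1] = 1.0f;  // simple chain graph

          float *dA, *dC;
          cudaMalloc(&dA, n * n * sizeof(float));
          cudaMalloc(&dC, n * n * sizeof(float));
          cudaMemcpy(dC, h.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

          dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
          // Repeated "squaring": after ceil(log2(n)) rounds dC holds all-pairs distances.
          for (int len = 1; len < n; len *= 2) {
              cudaMemcpy(dA, dC, n * n * sizeof(float), cudaMemcpyDeviceToDevice);
              minplus<<<grid, block>>>(dA, dA, dC, n);
          }
          cudaMemcpy(h.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
          printf("dist(0, n-1) = %.0f (expected %d)\n", h[n - 1], n - 1);
          cudaFree(dA); cudaFree(dC);
          return 0;
      }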

  • Lossless Compression of Double-Precision Floating-Point Data for Numerical Simulations: Highly Parallelizable Algorithms for GPU Computing

    Mamoru OHARA  Takashi YAMAGUCHI

    PAPER-Parallel and Distributed Computing
    Vol: E95-D No:12  Page(s): 2778-2786

    In numerical simulations using massively parallel computers such as GPGPUs (General-Purpose computing on Graphics Processing Units), we often need to transfer computational results from external devices such as GPUs to the main memory or secondary storage of the host machine. Since the computation results are sometimes too large to hold, it is desirable to compress the data before storing it. In addition, considering the overhead of transferring data between the devices and host memory, it is preferable to compress the data as part of the parallel computation performed on the devices. Traditional compression methods for floating-point numbers do not always show good parallelism. In this paper, we propose a new compression method for massively parallel simulations running on GPUs, in which we combine a few successive floating-point numbers and interleave them to improve compression efficiency. We also present numerical examples of the compression ratio and throughput obtained from experimental implementations of the proposed method running on CPUs and GPUs.
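
    One concrete way to combine a few successive floating-point numbers and interleave them is a byte-plane transpose: grouping the k-th byte of every double in a chunk places the highly similar sign/exponent bytes next to each other, which a downstream entropy coder can exploit. The chunk size and per-thread layout below are assumptions and not necessarily the paper's scheme; the compression stage itself is not shown.

      // byte_interleave_sketch.cu -- byte-plane interleaving of doubles as a
      // compression front end; chunking choices are illustrative assumptions.
      #include <cstdio>
      #include <cstdint>
      #include <vector>
      #include <cuda_runtime.h>

      #define CHUNK 64   // doubles per chunk; each thread rearranges one chunk

      __global__ void interleave(const double* in, uint8_t* out, int nchunks) {
          int c = blockIdx.x * blockDim.x + threadIdx.x;
          if (c >= nchunks) return;
          const uint8_t* src = reinterpret_cast<const uint8_t*>(in + c * CHUNK);
          uint8_t* dst = out + (size_t)c * CHUNK * 8;
          for (int b = 0; b < 8; ++b)               // byte plane
              for (int i = 0; i < CHUNK; ++i)       // element within the chunk
                  dst[b * CHUNK + i] = src[i * 8 + b];
      }

      __global__ void deinterleave(const uint8_t* in, double* out, int nchunks) {
          int c = blockIdx.x * blockDim.x + threadIdx.x;
          if (c >= nchunks) return;
          const uint8_t* src = in + (size_t)c * CHUNK * 8;
          uint8_t* dst = reinterpret_cast<uint8_t*>(out + c * CHUNK);
          for (int b = 0; b < 8; ++b)
              for (int i = 0; i < CHUNK; ++i)
                  dst[i * 8 + b] = src[b * CHUNK + i];
      }

      int main() {
          const int nchunks = 1024, n = nchunks * CHUNK;
          std::vector<double> h(n), r(n);
          for (int i = 0; i < n; ++i) h[i] = 1.0 + 1e-6 * i;  // smooth, simulation-like data
          double *dv, *dr; uint8_t *db;
          cudaMalloc(&dv, n * sizeof(double));
          cudaMalloc(&db, n * sizeof(double));   // same byte count, interleaved layout
          cudaMalloc(&dr, n * sizeof(double));
          cudaMemcpy(dv, h.data(), n * sizeof(double), cudaMemcpyHostToDevice);
          int threads = 128, grid = (nchunks + threads - 1) / threads;
          interleave<<<grid, threads>>>(dv, db, nchunks);
          deinterleave<<<grid, threads>>>(db, dr, nchunks);
          cudaMemcpy(r.data(), dr, n * sizeof(double), cudaMemcpyDeviceToHost);
          printf("lossless round-trip: %d\n", r[999] == h[999]);
          cudaFree(dv); cudaFree(db); cudaFree(dr);
          return 0;
      }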

  • Asymptotically Optimal Merging on ManyCore GPUs

    Arne KUTZNER  Pok-Son KIM  Won-Kwang PARK

    PAPER-Parallel and Distributed Computing
    Vol: E95-D No:12  Page(s): 2769-2777

    We propose a family of algorithms for efficiently merging on contemporary GPUs, so that each algorithm requires O(m log(n/m + 1)) element comparisons, where m and n are the sizes of the input sequences with m ≤ n. According to the lower bounds for merging, all proposed algorithms are asymptotically optimal regarding the number of necessary comparisons. First we introduce a parallel algorithm that splits a merging problem of size 2^l into 2^i subproblems of size 2^(l-i), for some arbitrary i with 0 ≤ i ≤ l. This algorithm represents a merger for i = l but is rather inefficient in that case. The efficiency is boosted by moving to a two-stage approach in which the splitting process stops at some predetermined level and transfers control to several block-mergers operating in parallel. We formally prove the asymptotic optimality of the splitting process and show that for symmetrically sized inputs our approach delivers up to 4 times faster runtimes than the thrust::merge function that is part of the Thrust library. To assess the value of our merging technique in the context of sorting, we construct and evaluate a MergeSort on top of it. In our benchmarking, the resulting MergeSort clearly outperforms the MergeSort implementation provided by the Thrust library as well as Cederman's GPU-optimized variant of QuickSort.
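
    The splitting step can be pictured with the standard co-rank (merge-path) search: for a cut position k in the merged output, a binary search determines how many of the first k elements come from each input, so the resulting sub-merges become fully independent. This is the textbook partitioning idea, not necessarily the exact splitter of the paper; the tile size and test data below are arbitrary.

      // merge_split_sketch.cu -- co-rank based splitting of a merge into independent
      // sub-merges; textbook partitioning, not the paper's exact algorithm.
      #include <cstdio>
      #include <vector>
      #include <cuda_runtime.h>

      // Returns i such that the first k merged elements are A[0..i) and B[0..k-i).
      __host__ __device__ int co_rank(int k, const int* A, int m, const int* B, int n) {
          int i = k < m ? k : m;                 // start with as many as possible from A
          int j = k - i;
          int i_low = (k - n) > 0 ? (k - n) : 0;
          int j_low = (k - m) > 0 ? (k - m) : 0;
          while (true) {
              if (i > 0 && j < n && A[i - 1] > B[j]) {         // took too many from A
                  int d = (i - i_low + 1) / 2;
                  j_low = j; i -= d; j += d;
              } else if (j > 0 && i < m && B[j - 1] >= A[i]) { // took too many from B
                  int d = (j - j_low + 1) / 2;
                  i_low = i; i += d; j -= d;
              } else {
                  return i;
              }
          }
      }

      // Each thread merges one independent output tile of size 'tile'.
      __global__ void merge_kernel(const int* A, int m, const int* B, int n,
                                   int* C, int tile) {
          int t = blockIdx.x * blockDim.x + threadIdx.x;
          int k0 = t * tile, k1 = k0 + tile;
          if (k0 >= m + n) return;
          if (k1 > m + n) k1 = m + n;
          int i = co_rank(k0, A, m, B, n), j = k0 - i;
          int iEnd = co_rank(k1, A, m, B, n), jEnd = k1 - iEnd;
          for (int k = k0; k < k1; ++k) {        // sequential merge of the tile
              if (i < iEnd && (j >= jEnd || A[i] <= B[j])) C[k] = A[i++];
              else                                          C[k] = B[j++];
          }
      }

      int main() {
          const int m = 1 << 16, n = 1 << 18, tile = 256;
          std::vector<int> A(m), B(n), C(m + n);
          for (int i = 0; i < m; ++i) A[i] = 3 * i;        // both inputs already sorted
          for (int i = 0; i < n; ++i) B[i] = 2 * i + 1;
          int *dA, *dB, *dC;
          cudaMalloc(&dA, m * sizeof(int)); cudaMalloc(&dB, n * sizeof(int));
          cudaMalloc(&dC, (m + n) * sizeof(int));
          cudaMemcpy(dA, A.data(), m * sizeof(int), cudaMemcpyHostToDevice);
          cudaMemcpy(dB, B.data(), n * sizeof(int), cudaMemcpyHostToDevice);
          int tiles = (m + n + tile - 1) / tile, threads = 128;
          merge_kernel<<<(tiles + threads - 1) / threads, threads>>>(dA, m, dB, n, dC, tile);
          cudaMemcpy(C.data(), dC, (m + n) * sizeof(int), cudaMemcpyDeviceToHost);
          bool ok = true;
          for (int k = 1; k < m + n; ++k) ok = ok && (C[k - 1] <= C[k]);
          printf("merged output sorted: %d\n", (int)ok);
          cudaFree(dA); cudaFree(dB); cudaFree(dC);
          return 0;
      }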

  • Parallel Dynamic Cloud Rendering Method Based on Physical Cellular Automata Model

    Liqiang ZHANG  Chao LI  Haoliang SUN  Changwen ZHENG  Pin LV

    PAPER-Parallel and Distributed Computing
    Vol: E95-D No:12  Page(s): 2750-2758

    Due to the complicated composition of clouds and their disordered transformation, current methods do not render clouds in a way that matches their actual appearance well. Based on the physical characteristics of clouds, a physical cellular automata model of dynamic clouds is designed according to their intrinsic factors; it describes the rules of hydro-movement, deposition, accumulation, and diffusion. A parallel computing architecture is then designed to compute the large-scale data set required for rendering dynamic clouds, and a GPU-based ray-casting algorithm is implemented to render the cloud volume data. The experiment shows that the cloud rendering method based on the physical cellular automata model is very efficient and able to adequately exhibit the detail of clouds.

  • Implementation and Optimization of Image Processing Algorithms on Embedded GPU

    Nitin SINGHAL  Jin Woo YOO  Ho Yeol CHOI  In Kyu PARK

    PAPER-Image Processing and Video Processing
    Vol: E95-D No:5  Page(s): 1475-1484

    In this paper, we analyze the key factors underlying the implementation, evaluation, and optimization of image processing and computer vision algorithms on an embedded GPU using the OpenGL ES 2.0 shader model. First, we present the characteristics of the embedded GPU and its inherent advantages compared to an embedded CPU. Additionally, we propose techniques to achieve increased performance with optimized shader design. To show the effectiveness of the proposed techniques, we employ cartoon-style non-photorealistic rendering (NPR), speeded-up robust feature (SURF) detection, and stereo matching as example algorithms. Performance is evaluated in terms of execution time and the speed-up achieved in comparison with an implementation on the embedded CPU.

  • A Parallel Implementation of the Gustafson-Kessel Clustering Algorithm with CUDA

    Jeong Bong SEO  Dae-Won KIM

    LETTER-Artificial Intelligence, Data Mining
    Vol: E95-D No:4  Page(s): 1162-1165

    Despite the benefits of the Gustafson-Kessel (GK) clustering algorithm, it becomes computationally inefficient when applied to high-dimensional data. In this letter, a parallel implementation of the GK algorithm on the GPU with CUDA is proposed. Using an optimized matrix multiplication algorithm with fast access to shared memory, the CUDA version achieved a maximum 240-fold speedup over the single-CPU version.

  • Implementation of Scale and Rotation Invariant On-Line Object Tracking Based on CUDA

    Quan MIAO  Guijin WANG  Xinggang LIN

    LETTER-Image Recognition, Computer Vision
    Vol: E94-D No:12  Page(s): 2549-2552

    Object tracking is a major technique in image processing and computer vision, and tracking speed directly determines the quality of applications. This paper presents a parallel implementation of a recently proposed scale- and rotation-invariant on-line object tracking system. The algorithm targets NVIDIA's Graphics Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA), following the single-instruction multiple-thread model. Specifically, we analyze the original algorithm and propose a GPU-based parallel design, with emphasis on exploiting data parallelism and memory usage. In addition, we apply optimization techniques to maximize the utilization of NVIDIA's GPU and to reduce the data transfer time. Experimental results show that our GPGPU-based method running on a GTX480 graphics card achieves up to a 12X speed-up, including I/O time, compared with an equivalent implementation on an Intel E8400 3.0 GHz CPU.

  • Computation-Communication Overlap of Linpack on a GPU-Accelerated PC Cluster

    Junichi OHMURA  Takefumi MIYOSHI  Hidetsugu IRIE  Tsutomu YOSHINAGA

    PAPER
    Vol: E94-D No:12  Page(s): 2319-2327

    In this paper, we propose an approach to obtaining enhanced performance of the Linpack benchmark on a GPU-accelerated PC cluster connected via relatively slow inter-node connections. For one node with a quad-core Intel Xeon W3520 processor and an NVIDIA Tesla C1060 GPU card, we implement a CPU–GPU parallel double-precision general matrix–matrix multiplication (dgemm) operation and achieve a performance improvement of 34% compared with the GPU-only case and 64% compared with the CPU-only case. For an entire 16-node cluster, each node of which is the same as the above and is connected with two gigabit Ethernet links, we use a computation-communication overlap scheme with GPU acceleration for the Linpack benchmark and achieve a performance improvement of 28% compared with the GPU-accelerated high-performance Linpack benchmark (HPL) without overlapping. In our solution, the main inter-node communication and the data transfers to GPU device memory are overlapped with the main computation task on the CPU cores. These overlaps rely on multi-core processors, which almost all of today's high-performance computers use. In particular, as well as dedicating a CPU core to communication tasks, we simultaneously use the other CPU cores and the GPU for computation tasks. In order to enable overlap between inter-node communication and computation, we eliminate their close dependence by breaking the main computation task into smaller tasks and rescheduling them. Because part of the CPU computation power is used for tasks other than computation, we experimentally find the optimal computation ratio for the CPUs; this ratio differs from that of the parallel dgemm operation on one node.

  • Evaluation of GPU-Based Empirical Mode Decomposition for Off-Line Analysis

    Pulung WASKITO  Shinobu MIWA  Yasue MITSUKURA  Hironori NAKAJO

    PAPER
    Vol: E94-D No:12  Page(s): 2328-2337

    In off-line analysis, the demand for high-precision signal processing has introduced a method called Empirical Mode Decomposition (EMD), which is used for analyzing complex data sets. Unfortunately, EMD is highly compute-intensive. In this paper, we show a parallel implementation of Empirical Mode Decomposition on a GPU. We propose the use of a “partial+total” switching method to increase performance while keeping the precision. We also focus on reducing the computational complexity of the above method from O(N) on a single CPU to O(N/P log(N)) on a GPU. Evaluation results show that our single-GPU implementation using a Tesla C2050 (Fermi architecture) achieves a 29.9x speedup for the partial method and an 11.8x speedup for the total method when compared to a single Intel dual-core CPU.

  • Parallel Implementation Strategy for CoHOG-Based Pedestrian Detection Using a Multi-Core Processor

    Ryusuke MIYAMOTO  Hiroki SUGANO

    PAPER-Image Processing
    Vol: E94-A No:11  Page(s): 2315-2322

    Pedestrian detection from visual images, which is used for driver assistance or video surveillance, is a recent challenging problem. Co-occurrence histograms of oriented gradients (CoHOG) is a powerful feature descriptor for pedestrian detection and achieves the highest detection accuracy; however, its calculation cost is too large for real-time processing on state-of-the-art processors. In this paper, to obtain an optimal parallel implementation for an NVIDIA GPU, several kinds of parallelism in CoHOG-based detection are presented and evaluated for their suitability for implementation. The experimental results show that, with the optimized parallel implementation, the detection process can be performed at 16.5 fps on QVGA images on an NVIDIA Tesla C1060. Our evaluation also shows that the optimal parallel implementation strategy for an NVIDIA GPU differs from that for an FPGA; we discuss the reasons and show the advantages of each device. To show the scalability and portability of the GPU implementation, the same object code is executed on other NVIDIA GPUs. The experimental results show that a GTX570 can perform CoHOG-based pedestrian detection at 21.3 fps on QVGA images.

  • Optimized Implementation of Pedestrian Tracking Using Multiple Cues on GPU

    Ryusuke MIYAMOTO  Hiroki SUGANO

    PAPER-Image Processing
    Vol: E94-A No:11  Page(s): 2323-2333

    Nowadays, pedestrian recognition for automotive and security applications, which requires accurate recognition in images taken from distant observation points, is a challenging problem in the field of computer vision. To achieve accurate recognition, both detection and tracking must be precise. For detection, some excellent schemes suitable for pedestrian recognition from distant observation points have been proposed; however, no tracking scheme achieves sufficient performance. To construct an accurate tracking scheme suitable for pedestrian recognition from distant observation points, we propose a novel pedestrian tracking scheme using multiple cues: HSV histograms and HOG features. Experimental results show that the proposed scheme can properly track a target pedestrian where tracking schemes using only a single cue fail. Moreover, we implement the proposed scheme on an NVIDIA® Tesla™ C1060 processor, one of the latest GPUs, to achieve real-time processing. Experimental results show that the computation time required by our implementation to track one frame is reduced to 8.80 ms, whereas an Intel® Core™ i7 CPU 975 @ 3.33 GHz requires 111 ms.

61-80 of 89 hits