Masayuki KUROSAKI Masateru MATSUO Yoshimitsu KUROKI Yuhei NAGAO Baiko SAI Hiroshi OCHI
In this paper, we propose a CUDA implementation of DWT for JPEG 2000 codec. We show that the performance of JPEG 2000 codec implemented by CUDA is better than CPU based implementation. The performance of the DWT implemented by CUDA is achieved 27.7 frame/second in 4K digital cinema.
Amedeo CAPOZZOLI Claudio CURCIO Antonio DI VICO Angelo LISENO
We develop an effective algorithm, based on the filtered backprojection (FBP) approach, for the imaging of vegetation. Under the FBP scheme, the reconstruction amounts at a non-trivial Fourier inversion, since the data are Fourier samples arranged on a non-Cartesian grid. The computational issue is efficiently tackled by Non-Uniform Fast Fourier Transforms (NUFFTs), whose complexity grows asymptotically as that of a standard FFT. Furthermore, significant speed-ups, as compared to fast CPU implementations, are obtained by a parallel versions of the NUFFT algorithm, purposely designed to be run on Graphic Processing Units (GPUs) by using the CUDA language. The performance of the parallel algorithm has been assessed in comparison to a CPU-multicore accelerated, Matlab implementation of the same routine, to other CPU-multicore accelerated implementations based on standard FFT and employing linear, cubic, spline and sinc interpolations and to a different, parallel algorithm exploiting a parallel linear interpolation stage. The proposed approach has resulted the most computationally convenient. Furthermore, an indoor, polarimetric experimental setup is developed, capable to isolate and introduce, one at a time, different non-idealities of a real acquisition, as the sources (wind, rain) of temporal decorrelation. Experimental far-field polarimetric measurements on a thuja plicata (western redcedar) tree point out the performance of the set up algorithm, its robustness against data truncation and temporal decorrelation as well as the possibility of discriminating scatterers with different features within the investigated scene.
This research aims to accelerate the computation module in max-plus algebra using CUDA technology on graphics processing units (GPUs) designed for high-performance computing. Our target is the Kleene star of a weighted adjacency matrix for directed acyclic graphs (DAGs). Using a inexpensive GPU card for our experiments, we obtained more than a 16-fold speedup compared with an Athlon 64 X2.
Sung Jae LEE Seog Chung SEO Dong-Guk HAN Seokhie HONG Sangjin LEE
This paper proposes methods for accelerating DPA by using the CPU and the GPU in a parallel manner. The overhead of naive DPA evaluation software increases excessively as the number of points in a trace or the number of traces is enlarged due to the rapid increase of file I/O overhead. This paper presents some techniques, with respect to DPA-arithmetic and file handling, which can make the overhead of DPA software become not extreme but gradual as the increase of the amount of trace data to be processed. Through generic experiments, we show that the software, equipped with the proposed methods, using both CPU and GPU can shorten the time for evaluating the DPA resistance of devices by almost half.
Yuma MUNEKAWA Fumihiko INO Kenichi HAGIHARA
This paper presents a fast method capable of accelerating the Smith-Waterman algorithm for biological database search on a cluster of graphics processing units (GPUs). Our method is implemented using compute unified device architecture (CUDA), which is available on the nVIDIA GPU. As compared with previous methods, our method has four major contributions. (1) The method efficiently uses on-chip shared memory to reduce the data amount being transferred between off-chip video memory and processing elements in the GPU. (2) It also reduces the number of data fetches by applying a data reuse technique to query and database sequences. (3) A pipelined method is also implemented to overlap GPU execution with database access. (4) Finally, a master/worker paradigm is employed to accelerate hundreds of database searches on a cluster system. In experiments, the peak performance on a GeForce GTX 280 card reaches 8.32 giga cell updates per second (GCUPS). We also find that our method reduces the amount of data fetches to 1/140, achieving approximately three times higher performance than a previous CUDA-based method. Our 32-node cluster version is approximately 28 times faster than a single GPU version. Furthermore, the effective performance reaches 75.6 giga instructions per second (GIPS) using 32 GeForce 8800 GTX cards.