Dongwann KANG Sang-Hyun SEO Seung-Taek RYOO Kyung-Hyun YOON
The main bottleneck of photomosaic algorithms is the search for the best-matching image. Unlike techniques that use fast approximate search to increase speed, we propose a parallel framework for fast photomosaic generation using a programmable GPU. This paper presents a vertex structure design for best-match searching on each cell of the photomosaic grid and shows a texture representation of the image database. The shader programs used for searching for the best match and rendering image tiles to the display are presented. In addition, simple duplicate reduction and color correction methods are proposed. Our algorithm not only offers a dramatic speed improvement, but also always guarantees the 'exact' result.
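The exhaustive best-match search that this abstract parallelizes can be illustrated on the CPU with a minimal sketch. The abstract does not state the matching metric, so the squared Euclidean distance between mean RGB colors used here is an assumption, and the function names (`squared_distance`, `best_match`, `photomosaic_indices`) are hypothetical. Because every database entry is compared, the result is exact rather than approximate.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two RGB triples (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_match(cell_color, database_colors):
    """Exhaustive search: index of the database image closest to the cell."""
    return min(
        range(len(database_colors)),
        key=lambda i: squared_distance(cell_color, database_colors[i]),
    )


def photomosaic_indices(grid_colors, database_colors):
    """Best-match index for every cell of the grid.

    Each cell is independent of the others, which is what makes the
    search a natural fit for per-cell parallel execution.
    """
    return [[best_match(c, database_colors) for c in row] for row in grid_colors]


if __name__ == "__main__":
    database = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
    grid = [[(250, 10, 5), (10, 10, 240)]]
    print(photomosaic_indices(grid, database))
```

In this sketch the per-cell loop is the part a GPU would run in parallel; the inner `min` over the database is the search itself.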
Amedeo CAPOZZOLI Claudio CURCIO Antonio DI VICO Angelo LISENO
We develop an effective algorithm, based on the filtered backprojection (FBP) approach, for the imaging of vegetation. Under the FBP scheme, the reconstruction amounts to a non-trivial Fourier inversion, since the data are Fourier samples arranged on a non-Cartesian grid. The computational issue is efficiently tackled by Non-Uniform Fast Fourier Transforms (NUFFTs), whose complexity grows asymptotically as that of a standard FFT. Furthermore, significant speed-ups, as compared to fast CPU implementations, are obtained by parallel versions of the NUFFT algorithm, purposely designed to run on Graphics Processing Units (GPUs) using the CUDA language. The performance of the parallel algorithm has been assessed in comparison to a CPU-multicore accelerated Matlab implementation of the same routine, to other CPU-multicore accelerated implementations based on the standard FFT and employing linear, cubic, spline and sinc interpolations, and to a different parallel algorithm exploiting a parallel linear interpolation stage. The proposed approach proved to be the most computationally convenient. Furthermore, an indoor polarimetric experimental setup is developed, capable of isolating and introducing, one at a time, different non-idealities of a real acquisition, such as the sources (wind, rain) of temporal decorrelation. Experimental far-field polarimetric measurements on a Thuja plicata (western redcedar) tree demonstrate the performance of the developed algorithm, its robustness against data truncation and temporal decorrelation, as well as the possibility of discriminating scatterers with different features within the investigated scene.
Ayaka YAMAMOTO Yoshio IWAI Hiroshi ISHIGURO
Background subtraction is widely used in detecting moving objects; however, changing illumination conditions, color similarity, and real-time performance remain important problems. In this paper, we introduce a sequential method for adaptively estimating background components using Kalman filters, and a novel method for detecting objects using margined sign correlation (MSC). By applying MSC to our adaptive background model, the proposed system can perform object detection robustly and accurately. The proposed method is suitable for implementation on a graphics processing unit (GPU) and as such, the system realizes real-time performance efficiently. Experimental results demonstrate the performance of the proposed system.
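As a rough illustration of sequential background estimation with a Kalman filter, the sketch below tracks a single pixel intensity with a scalar filter under a random-walk model. The class name `PixelKalman`, the noise variances, and the plain threshold test in `is_foreground` are assumptions made for illustration; in particular, the threshold test is only a stand-in and is not the margined sign correlation (MSC) described in the abstract.

```python
class PixelKalman:
    """Scalar Kalman filter for one background pixel (random-walk model)."""

    def __init__(self, initial, process_var=1.0, measurement_var=16.0):
        self.x = float(initial)   # background intensity estimate
        self.p = 1.0              # variance of the estimate
        self.q = process_var      # assumed drift of the true background
        self.r = measurement_var  # assumed sensor noise

    def update(self, z):
        """Fold one observed intensity z into the background estimate."""
        self.p += self.q                   # predict: uncertainty grows
        k = self.p / (self.p + self.r)     # Kalman gain
        self.x += k * (z - self.x)         # correct toward the observation
        self.p *= 1.0 - k                  # uncertainty shrinks
        return self.x


def is_foreground(z, background, threshold=30.0):
    """Stand-in detector (NOT MSC): flag large deviations from background."""
    return abs(z - background) > threshold


if __name__ == "__main__":
    pixel = PixelKalman(initial=0.0)
    for _ in range(200):
        estimate = pixel.update(100.0)
    print(round(estimate, 3), is_foreground(200.0, estimate))
```

Each pixel carries its own independent filter state, so the update maps directly onto one GPU thread per pixel.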
This research aims to accelerate the computation module in max-plus algebra using CUDA technology on graphics processing units (GPUs) designed for high-performance computing. Our target is the Kleene star of a weighted adjacency matrix for directed acyclic graphs (DAGs). Using an inexpensive GPU card for our experiments, we obtained more than a 16-fold speedup compared with an Athlon 64 X2.
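For reference, the Kleene star in max-plus algebra (where addition is `max` and multiplication is `+`) can be sketched on the CPU as follows. For a DAG with n nodes every path has at most n - 1 edges, so the star reduces to the finite sum A* = E ⊕ A ⊕ A² ⊕ ... ⊕ Aⁿ⁻¹. The function names and the convention that `a[i][j]` holds the weight of the edge from node i to node j are assumptions of this sketch, not taken from the abstract.

```python
NEG_INF = float("-inf")  # the max-plus "zero" element (no edge)


def maxplus_identity(n):
    """Max-plus identity matrix E: 0 on the diagonal, -inf elsewhere."""
    return [[0.0 if i == j else NEG_INF for j in range(n)] for i in range(n)]


def maxplus_add(a, b):
    """Element-wise max-plus addition (maximum)."""
    return [[max(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def maxplus_matmul(a, b):
    """Max-plus matrix product: (a ⊗ b)[i][j] = max_k (a[i][k] + b[k][j])."""
    inner = len(b)
    return [
        [max(a[i][k] + b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def kleene_star(a):
    """A* = E ⊕ A ⊕ ... ⊕ A^(n-1); valid when a describes a DAG."""
    n = len(a)
    result = maxplus_identity(n)
    power = maxplus_identity(n)
    for _ in range(n - 1):
        power = maxplus_matmul(power, a)
        result = maxplus_add(result, power)
    return result


if __name__ == "__main__":
    # Edges: 0 -> 1 (weight 2), 1 -> 2 (weight 3), 0 -> 2 (weight 4).
    a = [
        [NEG_INF, 2.0, 4.0],
        [NEG_INF, NEG_INF, 3.0],
        [NEG_INF, NEG_INF, NEG_INF],
    ]
    print(kleene_star(a)[0][2])  # longest path weight from node 0 to node 2
```

Under this convention, entry (i, j) of the result is the maximum total weight over all paths from i to j. The repeated matrix products are the part that benefits from GPU parallelism.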
Sung Jae LEE Seog Chung SEO Dong-Guk HAN Seokhie HONG Sangjin LEE
This paper proposes methods for accelerating differential power analysis (DPA) by using the CPU and the GPU in parallel. The overhead of naive DPA evaluation software increases excessively as the number of points in a trace or the number of traces grows, owing to the rapid increase in file I/O overhead. This paper presents techniques, covering DPA arithmetic and file handling, that make the overhead of DPA software grow gradually rather than steeply as the amount of trace data to be processed increases. Through generic experiments, we show that software equipped with the proposed methods and using both the CPU and the GPU can shorten the time for evaluating the DPA resistance of devices by almost half.
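The abstract does not detail its DPA arithmetic, so the following is only a generic illustration of the classic difference-of-means computation, written in a single-pass style that accumulates running sums and never holds all traces in memory at once. The function name `difference_of_means` and the input layout (one list of samples per trace, one 0/1 selection bit per trace) are assumptions of this sketch.

```python
def difference_of_means(traces, selection_bits):
    """Per-sample difference between the mean traces of the two partitions.

    traces: iterable of equal-length lists of power samples.
    selection_bits: iterable of 0/1 values, one per trace, derived from a
    key hypothesis (how they are derived is outside this sketch).
    """
    sums = [None, None]
    counts = [0, 0]
    for trace, bit in zip(traces, selection_bits):
        if sums[bit] is None:
            sums[bit] = [0.0] * len(trace)
        acc = sums[bit]
        for i, value in enumerate(trace):
            acc[i] += value           # running sum: one pass over the data
        counts[bit] += 1
    if counts[0] == 0 or counts[1] == 0:
        raise ValueError("both partitions need at least one trace")
    return [s1 / counts[1] - s0 / counts[0] for s0, s1 in zip(sums[0], sums[1])]


if __name__ == "__main__":
    traces = [[1, 2], [3, 6], [1, 4], [5, 8]]
    bits = [0, 1, 0, 1]
    print(difference_of_means(traces, bits))
```

Because `traces` can be any iterable, it could be a generator that streams traces from disk, which keeps memory use independent of the number of traces.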
Yuma MUNEKAWA Fumihiko INO Kenichi HAGIHARA
This paper presents a fast method capable of accelerating the Smith-Waterman algorithm for biological database search on a cluster of graphics processing units (GPUs). Our method is implemented using the compute unified device architecture (CUDA), which is available on NVIDIA GPUs. As compared with previous methods, our method makes four major contributions. (1) The method efficiently uses on-chip shared memory to reduce the amount of data transferred between off-chip video memory and processing elements in the GPU. (2) It also reduces the number of data fetches by applying a data reuse technique to query and database sequences. (3) A pipelined method is also implemented to overlap GPU execution with database access. (4) Finally, a master/worker paradigm is employed to accelerate hundreds of database searches on a cluster system. In experiments, the peak performance on a GeForce GTX 280 card reaches 8.32 giga cell updates per second (GCUPS). We also find that our method reduces the number of data fetches to 1/140, achieving approximately three times higher performance than a previous CUDA-based method. Our 32-node cluster version is approximately 28 times faster than a single GPU version. Furthermore, the effective performance reaches 75.6 giga instructions per second (GIPS) using 32 GeForce 8800 GTX cards.
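For readers unfamiliar with the algorithm being accelerated, here is a minimal CPU sketch of Smith-Waterman local alignment scoring. The scoring parameters (match 2, mismatch -1) and the linear gap penalty are assumptions chosen for brevity; the abstract does not state the scoring scheme used. The `gcups` helper, also hypothetical, shows how a cell-update rate of the kind quoted above is computed from the matrix size and the elapsed time.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, keeping only two rows of the matrix."""
    prev = [0] * (len(subject) + 1)
    best = 0
    for qc in query:
        curr = [0]
        for j, sc in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if qc == sc else mismatch)
            up = prev[j] + gap          # gap in the subject
            left = curr[j - 1] + gap    # gap in the query
            score = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def gcups(query_length, database_length, seconds):
    """Giga cell updates per second for one query-versus-database search."""
    return query_length * database_length / seconds / 1e9


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "TTACGTT"))
    print(gcups(1000, 1_000_000, 1.0))
```

Each cell depends only on its upper, left, and upper-left neighbors, so cells on the same anti-diagonal, and alignments against different database sequences, can be computed independently.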
Yuichi TAGUCHI Keita TAKAHASHI Takeshi NAEMURA
We present a real-time video-based rendering system using a network camera array. Our system consists of 64 commodity network cameras that are connected to a single PC through Gigabit Ethernet. To render a high-quality novel view, our system estimates a view-dependent per-pixel depth map in real time by using a layered representation. The rendering algorithm is fully implemented on the GPU, which allows our system to efficiently perform the capturing and rendering processes as a pipeline by using the CPU and GPU independently. Using QVGA input video resolution, our system renders a free-viewpoint video at up to 30 frames per second, depending on the output video resolution and the number of depth layers. Experimental results show high-quality images synthesized from various scenes.
This paper presents media processor architectures for automotive applications. First, media processing applications and their requirements for LSI implementation are described, covering vision-based driver assistance and graphical user interfaces for car navigation using 3D graphics. Then, parallel processing architectures for vision and graphics in these applications are reviewed along with their performance and cost. After that, future trends in automotive media processing, such as the integration of vision and 3D graphics functions, are shown with their applications and required performance. Moreover, parallel processing architectures for the integration of vision and graphics are discussed. Finally, a prospect of a next-generation media processing LSI for automotive applications is provided.
Christian NITSCHKE Atsushi NAKAZAWA Haruo TAKEMURA
Reconstruction of real-world scenes from a set of multiple images is a topic in computer vision and 3D computer graphics with many interesting applications. Attempts have been made at real-time reconstruction on PC cluster systems. While these provide enough performance, they are expensive and less flexible. Approaches that use GPU hardware acceleration on single workstations achieve real-time frame rates for novel-view synthesis, but do not provide an explicit volumetric representation. This work shows our efforts in developing a GPU hardware-accelerated framework for providing a photo-consistent reconstruction of a dynamic 3D scene. High performance is achieved by employing a shape-from-silhouette technique in advance. Since the entire processing is done on a single PC, the framework can be applied in mobile environments, enabling a wide range of further applications. We explain our approach using programmable vertex and fragment processors and compare it to highly optimized CPU implementations. We show that the new approach can outperform the latter by more than one order of magnitude and give an outlook on interesting future enhancements.