
Keyword Search Result

[Keyword] GPU (89 hits)

Showing results 21-40 of 89

  • OpenACC Parallelization of Stochastic Simulations on GPUs

    Pilsung KANG  

     
    LETTER-Fundamentals of Information Systems

    Publicized: 2019/05/17 | Vol: E102-D No:8 | Page(s): 1565-1568

    We present an OpenACC-based parallel implementation of stochastic algorithms for simulating biochemical reaction networks on modern GPUs (graphics processing units). To investigate how effectively OpenACC can leverage the massive hardware parallelism of the GPU architecture, we carefully apply OpenACC's language constructs and mechanisms to implement a parallel version of stochastic simulation algorithms on the GPU. Comparing our OpenACC implementation against both the NVidia CUDA and CPU-based implementations, we report our initial experiences with OpenACC's performance and programming productivity in the context of GPU-accelerated scientific computing.
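
    For illustration, the realization-per-thread mapping that such simulations expose can be sketched in CUDA (the paper itself expresses this mapping with OpenACC directives rather than hand-written kernels). The toy decay model A -> 0 and all names below are illustrative, not the paper's code:

        #include <cuda_runtime.h>
        #include <curand_kernel.h>

        // Each thread runs one independent stochastic realization of the
        // decay reaction A -> 0 using Gillespie's direct method.
        __global__ void ssa_decay(int *final_count, int n_real, int a0,
                                  float k, float t_end,
                                  unsigned long long seed) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            if (tid >= n_real) return;
            curandState st;
            curand_init(seed, tid, 0, &st);  // independent RNG stream per thread
            int a = a0;
            float t = 0.0f;
            while (a > 0) {
                float prop = k * a;                             // propensity
                float tau = -logf(curand_uniform(&st)) / prop;  // next-event time
                if (t + tau > t_end) break;
                t += tau;
                --a;                                            // fire reaction
            }
            final_count[tid] = a;
        }

        int main() {
            const int n = 1 << 16;             // 65536 realizations in bulk
            int *d_out;
            cudaMalloc(&d_out, n * sizeof(int));
            ssa_decay<<<(n + 255) / 256, 256>>>(d_out, n, 1000, 0.1f, 10.0f,
                                                1234ULL);
            cudaDeviceSynchronize();
            cudaFree(d_out);
            return 0;
        }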

  • Fast Computation with Efficient Object Data Distribution for Large-Scale Hologram Generation on a Multi-GPU Cluster Open Access

    Takanobu BABA  Shinpei WATANABE  Boaz JESSIE JACKIN  Kanemitsu OOTSU  Takeshi OHKAWA  Takashi YOKOTA  Yoshio HAYASAKI  Toyohiko YATAGAI  

     
    PAPER-Human-computer Interaction

    Publicized: 2019/03/29 | Vol: E102-D No:7 | Page(s): 1310-1320

    The 3D holographic display has long been expected to serve as a future human interface, as it does not require users to wear special devices. However, its heavy computational requirements have prevented the realization of such displays. A recent study indicates that objects and holograms of several gigapixels must be processed in real time to achieve high resolution and a wide viewing angle. To address this problem, we first adapted a conventional FFT algorithm to a GPU cluster environment to avoid heavy inter-node communication. We then applied several single-node and multi-node optimization and parallelization techniques. The single-node optimizations include a revised object decomposition scheme, reduced data transfer between the CPU and GPU, kernel integration, stream processing, and the use of multiple GPUs within a node. The multi-node optimizations concern how object data are distributed from the host node to the other nodes. Experimental results show that the intra-node optimizations attain an 11.52-fold speed-up over the original single-node code. Further, the multi-node optimizations, using 8 nodes with 2 GPUs per node, attain an execution time of 4.28 s for generating a 1.6-gigapixel hologram from a 3.2-gigapixel object. This corresponds to a 237.92-fold speed-up over sequential CPU processing and a 41.78-fold speed-up over multi-threaded execution on a multicore CPU, using a conventional FFT-based algorithm.
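
    One of the single-node optimizations named above, overlapping CPU-GPU data transfer with computation via streams, can be sketched in CUDA as follows. This is a hedged stand-in, not the authors' code: the 1D cuFFT transforms, chunk counts, and buffer names are illustrative.

        #include <cuda_runtime.h>
        #include <cufft.h>

        int main() {
            const int kChunks = 4, kN = 1 << 20;  // 4 chunks of 2^20 points
            cufftComplex *h_buf, *d_buf;
            cudaMallocHost(&h_buf, kChunks * kN * sizeof(cufftComplex)); // pinned
            cudaMalloc(&d_buf, kChunks * kN * sizeof(cufftComplex));
            // ... fill h_buf with decomposed object data here ...

            cudaStream_t s[kChunks];
            cufftHandle plan[kChunks];
            for (int i = 0; i < kChunks; ++i) {
                cudaStreamCreate(&s[i]);
                cufftPlan1d(&plan[i], kN, CUFFT_C2C, 1);
                cufftSetStream(plan[i], s[i]);    // FFT runs in stream i
            }
            for (int i = 0; i < kChunks; ++i) {
                // Copy of chunk i overlaps with the FFTs of earlier chunks.
                cudaMemcpyAsync(d_buf + i * kN, h_buf + i * kN,
                                kN * sizeof(cufftComplex),
                                cudaMemcpyHostToDevice, s[i]);
                cufftExecC2C(plan[i], d_buf + i * kN, d_buf + i * kN,
                             CUFFT_FORWARD);
            }
            cudaDeviceSynchronize();
            for (int i = 0; i < kChunks; ++i) {
                cufftDestroy(plan[i]);
                cudaStreamDestroy(s[i]);
            }
            cudaFreeHost(h_buf);
            cudaFree(d_buf);
            return 0;
        }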

  • Accelerating Large-Scale Interconnection Network Simulation by Cellular Automata Concept

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Computer System

    Publicized: 2018/10/05 | Vol: E102-D No:1 | Page(s): 52-74

    State-of-the-art parallel systems employ a huge number of computing nodes connected by an interconnection network (ICN). The ICN plays an important role in a parallel system, since it determines the system's communication capability. In general, an ICN exhibits non-linear phenomena in its communication performance, most of which are caused by congestion. Designing a large-scale parallel system therefore requires thorough evaluation through repetitive simulation runs, which raises a further problem: simulating large-scale systems at reasonable cost. This paper presents a promising solution that introduces the cellular automata concept, which originates in our prior work. Assuming 2D-torus topologies to simplify the discussion, this paper covers the fundamental design of router functions in cellular-automaton terms, the data structure of packets, an alternative model of the router function, and miscellaneous optimizations. The proposed models map well onto GPGPU technology; as representative speed-up results, the GPU-based simulator accelerates simulation by up to about 1264 times relative to sequential execution on a single CPU. Furthermore, since the proposed models are also applicable to the shared-memory model, a multithreaded implementation of the proposed methods achieves speed-ups of up to about 162 times.
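
    The cellular-automaton view can be pictured with a minimal CUDA kernel: every router of a 2D torus is a cell updated synchronously from its four neighbors with double buffering. The "occupancy" rule below is a toy stand-in; the paper's router model is far more detailed.

        // cur/nxt: W*H occupancy grids; the caller swaps the pointers each step.
        __global__ void ca_step(const int *cur, int *nxt, int W, int H) {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= W || y >= H) return;
            int xm = (x + W - 1) % W, xp = (x + 1) % W;  // torus wrap-around
            int ym = (y + H - 1) % H, yp = (y + 1) % H;
            int inflow = cur[y * W + xm] + cur[y * W + xp] +
                         cur[ym * W + x] + cur[yp * W + x];
            // Toy rule: accept a quarter of neighbor traffic, cap at depth 8.
            nxt[y * W + x] = min(cur[y * W + x] + inflow / 4, 8);
        }

    Because each cell reads only its neighbors' previous state, all cells update in parallel, which is exactly the property that maps well onto both GPGPU and multithreaded shared-memory execution.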

  • Real-Time and Energy-Efficient Face Detection on CPU-GPU Heterogeneous Embedded Platforms

    Chanyoung OH  Saehanseul YI  Youngmin YI  

     
    PAPER-Real-time Systems

    Publicized: 2018/09/18 | Vol: E101-D No:12 | Page(s): 2878-2888

    As energy efficiency has become a major design constraint or objective, heterogeneous manycore architectures have emerged as mainstream target platforms not only in server systems but also in embedded systems. Manycore accelerators such as GPUs, alongside heterogeneous CPU cores, are also becoming popular in embedded domains. However, as the number of cores in an embedded GPU is far less than that of a server GPU, it is important to utilize both the heterogeneous multi-core CPUs and the GPU to achieve the desired throughput with minimal energy consumption. In this paper, we present a case study of mapping LBP-based face detection onto a recent CPU-GPU heterogeneous embedded platform, which exploits both task parallelism and data parallelism to achieve maximal energy efficiency under a real-time constraint. We first present the parallelization technique for each task in GPU execution; we then propose performance and energy models for both task-parallel and data-parallel execution on heterogeneous processors, which are used in a design space exploration for the optimal mapping. The design space is huge, since it covers not only processor heterogeneity (CPU-GPU and big.LITTLE) but also the various data-partitioning ratios for data-parallel execution on these heterogeneous processors. In our case study of LBP face detection on the Exynos 5422, the estimation errors of the proposed performance and energy models were on average -2.19% and -3.67%, respectively. By systematically finding the optimal mappings with the proposed models, we achieved 28.6% less energy consumption than the manual mapping while still meeting the real-time constraint.
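
    The exploration step can be pictured with a small host-side sketch: sweep the CPU/GPU data-partitioning ratio and keep the lowest-energy point that meets the deadline. The linear models below are hypothetical stand-ins for the paper's fitted performance and energy models:

        #include <algorithm>
        #include <cstdio>

        // Hypothetical models; r is the fraction of work mapped to the GPU.
        double time_ms(double r)   { return std::max(40.0 * (1.0 - r), 25.0 * r); }
        double energy_mj(double r) { return 30.0 * (1.0 - r) + 55.0 * r; }

        int main() {
            const double deadline_ms = 33.3;  // ~30 fps real-time constraint
            double best_r = -1.0, best_e = 1e30;
            for (int i = 0; i <= 100; ++i) {  // sweep ratios in 1% steps
                double r = i / 100.0;
                if (time_ms(r) <= deadline_ms && energy_mj(r) < best_e) {
                    best_e = energy_mj(r);
                    best_r = r;
                }
            }
            printf("best GPU share %.2f, energy %.1f mJ\n", best_r, best_e);
            return 0;
        }

    The real design space additionally enumerates big.LITTLE core choices and task-parallel mappings, but the minimize-energy-under-deadline pattern is the same.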

  • View Priority Based Threads Allocation and Binary Search Oriented Reweight for GPU Accelerated Real-Time 3D Ball Tracking

    Yilin HOU  Ziwei DENG  Xina CHENG  Takeshi IKENAGA  

     
    PAPER-Image Recognition, Computer Vision

    Publicized: 2018/08/31 | Vol: E101-D No:12 | Page(s): 3190-3198

    In real-time 3D ball tracking for sports analysis, the complex algorithms that assure accuracy can be time-consuming. Particle filter based algorithms have large potential for acceleration, since the per-particle computations can be parallelized on a heterogeneous CPU-GPU platform. Still, the target multi-view 3D ball tracking algorithm poses challenges: 1) a serial flowchart for each step of the algorithm; 2) repeated processing across multiple views; and 3) a low degree of parallelism in the reweight and resampling steps due to their sequential processing. For the CPU-GPU platform, this paper proposes a double-stream system flow, view-priority-based thread allocation, and binary-search-oriented reweighting. The double-stream system flow assigns tasks with no data dependency to different streams for each frame, achieving parallelism at the system-structure level. View-priority-based thread allocation manages threads in the multi-view observation task: the number of threads is the number of views multiplied by the number of particles, and assigning threads by view priority helps both memory access and computation proceed in parallel. Binary-search-oriented reweighting reduces the time complexity by avoiding the generation of a cumulative distribution function, implementing a binary search over an unordered array. The experiments are based on videos of the final game of an official volleyball match (the 2014 Inter-High School Games of Men's Volleyball, held at the Tokyo Metropolitan Gymnasium in August 2014); the test sequences were captured by a multi-view system of 4 cameras located at the four corners of the court. The success rate reaches 99.23%, the same as the target algorithm, while the processing time is reduced from 75.1 ms/frame in the CPU environment to 3.05 ms/frame in the proposed system, a 24.62-fold speed-up; it also achieves a 2.33-fold speed-up over a baseline GPU implementation.
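
    As context for the reweighting proposal, conventional GPU resampling binary-searches an ordered cumulative distribution of particle weights, as in the minimal CUDA sketch below; the paper's variant goes further by avoiding explicit CDF generation, which this baseline does not reproduce.

        // cdf: inclusive prefix sums of particle weights (ascending).
        // u: one uniform draw in [0,1) per output particle.
        __global__ void resample(const float *cdf, const float *u,
                                 int *pick, int n) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            if (tid >= n) return;
            float target = u[tid] * cdf[n - 1];  // scale draw by total weight
            int lo = 0, hi = n - 1;
            while (lo < hi) {                    // first index with cdf >= target
                int mid = (lo + hi) / 2;
                if (cdf[mid] < target) lo = mid + 1; else hi = mid;
            }
            pick[tid] = lo;                      // index of surviving particle
        }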

  • Cooperative GPGPU Scheduling for Consolidating Server Workloads

    Yusuke SUZUKI  Hiroshi YAMADA  Shinpei KATO  Kenji KONO  

     
    PAPER-Software System

    Publicized: 2018/08/30 | Vol: E101-D No:12 | Page(s): 3019-3037

    Graphics processing units (GPUs) have become an attractive platform for general-purpose computing (GPGPU) in various domains. Making GPUs a time-multiplexed resource is key to consolidating GPGPU applications (apps) on multi-tenant cloud platforms. However, advanced GPGPU apps pose a new challenge for consolidation: such highly functional GPGPU apps, referred to as GPU eaters, can easily monopolize a shared GPU and starve collocated GPGPU apps. This paper presents GLoop, a software runtime that enables the consolidation of GPGPU apps, including GPU eaters. GLoop offers an event-driven programming model, which allows GLoop-based apps to inherit the GPU eaters' high functionality while being proportionally scheduled on a shared GPU in an isolated manner. We implemented a prototype of GLoop and ported eight GPU eaters to it. The experimental results demonstrate that our prototype successfully schedules the consolidated GPGPU apps according to its scheduling policy and isolates resources among them.

  • Multi-Peak Estimation for Real-Time 3D Ping-Pong Ball Tracking with Double-Queue Based GPU Acceleration

    Ziwei DENG  Yilin HOU  Xina CHENG  Takeshi IKENAGA  

     
    PAPER-Machine Vision and its Applications

    Publicized: 2018/02/16 | Vol: E101-D No:5 | Page(s): 1251-1259

    3D ball tracking is of great significance in ping-pong game analysis and can be utilized in applications such as TV content production and tactical analysis, some of which require real-time implementation. This paper proposes a CPU-GPU platform based particle filter for multi-view ball tracking comprising four proposals. In the algorithm design, we propose multi-peak estimation and a ball-like observation model. Multi-peak estimation obtains a precise ball position when the particles' likelihood distribution has multiple peaks under complex circumstances. The ball-like observation model, with four different likelihood evaluations, utilizes the ball's unique features to evaluate each particle's similarity to the target. In the GPU implementation, we propose a double-queue structure and vectorized data combination. The double-queue structure achieves task parallelism between data-independent tasks. The vectorized data combination reduces memory access time by combining three different image buffers into one vector buffer. The experiments are based on ping-pong videos recorded at an official match by 4 cameras located at the four corners of the court. The tracking success rate reaches 99.59% on the CPU. With GPU acceleration, the processing time is 8.8 ms/frame, a 98-fold speed-up over the CPU version.
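
    The vectorized data combination can be sketched in CUDA as packing three per-pixel buffers into one 4-wide vector, so later kernels issue a single 128-bit load per pixel instead of three scattered 32-bit loads. Field meanings and weights below are illustrative assumptions, not the paper's code:

        __global__ void pack3(const float *gray, const float *motion,
                              const float *color, float4 *packed, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            packed[i] = make_float4(gray[i], motion[i], color[i], 0.0f);
        }

        __global__ void likelihood(const float4 *packed, float *out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            float4 v = packed[i];  // one coalesced vector load
            out[i] = 0.5f * v.x + 0.3f * v.y + 0.2f * v.z;  // toy weighting
        }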

  • GPU-Accelerated Stochastic Simulation of Biochemical Networks

    Pilsung KANG  

     
    LETTER-Fundamentals of Information Systems

    Publicized: 2017/12/20 | Vol: E101-D No:3 | Page(s): 786-790

    We present a GPU (graphics processing unit) accelerated implementation of a stochastic algorithm for simulating biochemical reaction networks on the latest NVidia architecture. To effectively utilize the massive parallelism offered by the NVidia Pascal hardware, we apply a set of performance tuning methods and guidelines, such as exploiting the architecture's memory hierarchy, in our algorithm implementation. Based on our experimental results and a comparative analysis against CPU-based implementations, we report our initial experiences with the performance of modern GPUs in the context of scientific computing.
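
    One memory-hierarchy tuning of the kind such guidelines cover is keeping small read-only model parameters in constant memory, which is cached and broadcast to a whole warp when all lanes read the same address. A minimal sketch under that assumption (toy model, not the paper's code):

        __constant__ float c_rates[4];  // rate constants of a toy 4-reaction model

        // One thread per realization; every thread reads c_rates[r] together,
        // so the constant cache broadcasts each value to the whole warp.
        __global__ void total_propensity(const int *pop, float *a0, int n_real) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            if (tid >= n_real) return;
            float sum = 0.0f;
            for (int r = 0; r < 4; ++r)
                sum += c_rates[r] * pop[tid * 4 + r];  // uniform c_rates read
            a0[tid] = sum;
        }

        // Host side: cudaMemcpyToSymbol(c_rates, h_rates, sizeof(h_rates));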

  • A GPU-Based Rasterization Algorithm for Boolean Operations on Polygons

    Yi GAO  Jianxin LUO  Hangping QIU  Bin TANG  Bo WU  Weiwei DUAN  

     
    LETTER-Fundamentals of Information Systems

    Publicized: 2017/09/29 | Vol: E101-D No:1 | Page(s): 234-238

    This paper presents a new GPU-based rasterization algorithm for Boolean operations that handles arbitrary closed polygons. We construct an efficient data structure for CPU-GPU interoperation and propose a fast GPU-based contour extraction method to ensure the performance of our algorithm. We then design a novel traversal strategy to achieve error-free intersection-point calculation for correct Boolean operations. Finally, we give a detailed evaluation; the results show that our algorithm outperforms existing algorithms on polygons with a large number of vertices.

  • Virtualizing Graphics Architecture of Android Mobile Platforms in KVM/ARM Environment

    Sejin PARK  Byungsu PARK  Unsung LEE  Chanik PARK  

     
    PAPER-Software System

    Publicized: 2017/04/18 | Vol: E100-D No:7 | Page(s): 1403-1415

    With the availability of virtualization extensions in mobile processors, e.g., the ARM Cortex-A15, multiple virtual execution domains can be efficiently supported on a mobile platform. Each execution domain requires high-performance graphics services for full-featured user interfaces such as smooth scrolling, background image blurring, and 3D images. However, the graphics service is hard to virtualize because multiple service components (e.g., ION and Fence) are involved. Moreover, the complexity of the Graphics Processing Unit (GPU) device driver makes virtualizing the graphics service even harder. In this paper, we propose a technique to virtualize the graphics architecture of the Android mobile platform in a KVM/ARM environment. The Android graphics architecture relies on underlying Linux kernel services such as the frame buffer memory allocator ION, the buffer synchronization service Fence, the GPU device driver, and the display synchronization service VSync. These kernel services are provided as device files in the Linux kernel. Our approach is to para-virtualize these device files based on a split device driver model. A major challenge is translating guest-view information into host-view information, e.g., memory address translation, file descriptor management, and GPU Memory Management Unit (MMU) manipulation. The experimental results show that the proposed graphics virtualization technique achieves 84%-100% of native application performance.

  • A Novel Procedure for Implementing a Turbo Decoder on a GPU with Coalesced Memory Access

    Heungseop AHN  Seungwon CHOI  

     
    PAPER-Communication Theory and Signals

    Vol: E100-A No:5 | Page(s): 1188-1196

    The sub-blocking algorithm is known as a core component in implementing a turbo decoder on a Graphics Processing Unit (GPU), as it allows as many GPU cores as possible to be used for parallel processing. However, even though the sub-blocking algorithm allows a large number of threads in a given GPU to process a large number of sub-blocks in parallel, each thread must access global memory with strided addresses, which results in uncoalesced memory access. Because uncoalesced memory access causes many unnecessary memory transactions, the memory bandwidth efficiency drops significantly, possibly to as low as 1/8 in the case of a Long Term Evolution (LTE) turbo decoder, depending on the compute capability of the GPU. In this paper, we present a novel method for converting uncoalesced memory access into coalesced access in a way that completely recovers the memory bandwidth efficiency to 100% without additional overhead. Our experimental tests, performed with NVIDIA's GeForce GTX 780 Ti GPU, show that the proposed method can enhance throughput by nearly 30% compared with a conventional turbo decoder that suffers from uncoalesced memory access. The throughput of the proposed method was observed to be 51.4 Mbps when the numbers of iterations and sub-blocks were set to 6 and 32, respectively, which far exceeds the performance of previous works implementing the Max-Log-MAP algorithm.
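
    The access-pattern problem can be made concrete with a pair of CUDA kernels (illustrative stand-ins, not the paper's decoder). With one thread per sub-block, the natural layout makes thread t read llr[t * len + i], so a warp's 32 loads land in 32 different memory segments; storing the same data transposed turns each step into 32 consecutive words. The paper's contribution is recovering such coalescing inside the turbo decoder without additional overhead.

        __global__ void sum_strided(const float *llr, float *out,
                                    int n_sub, int len) {
            int t = blockIdx.x * blockDim.x + threadIdx.x;
            if (t >= n_sub) return;
            float acc = 0.0f;
            for (int i = 0; i < len; ++i)
                acc += llr[t * len + i];      // stride = len: uncoalesced
            out[t] = acc;
        }

        __global__ void sum_coalesced(const float *llr_T, float *out,
                                      int n_sub, int len) {
            int t = blockIdx.x * blockDim.x + threadIdx.x;
            if (t >= n_sub) return;
            float acc = 0.0f;
            for (int i = 0; i < len; ++i)
                acc += llr_T[i * n_sub + t];  // warp reads consecutive words
            out[t] = acc;
        }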

  • Cache-Aware, In-Place Rotation Method for Texture-Based Volume Rendering

    Yuji MISAKI  Fumihiko INO  Kenichi HAGIHARA  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2016/12/12 | Vol: E100-D No:3 | Page(s): 452-461

    We propose a cache-aware method to accelerate texture-based volume rendering on a graphics processing unit (GPU) compatible with the compute unified device architecture. The proposed method extends a previous method such that it maximizes the average rendering performance while the viewing direction rotates around a volume. To realize this, the proposed method performs in-place rotation of the volume data, which rearranges the order of voxels so that consecutive threads (warps) refer to voxels with the minimum access strides. Experiments indicate that the proposed method replaces the worst texture cache (TC) hit rate of 42% with the best TC hit rate of 93% for a 1024³-voxel volume. Consequently, the average frame rate increases by a factor of 1.6 compared with the previous method. Although the overhead of in-place rotation slightly decreases the frame rate from 2.0 frames per second (fps) to 1.9 fps, this slowdown occurs only for a few viewing directions.
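
    The effect of the rotation can be sketched as an axis permutation: if consecutive threads sample consecutive x-voxels, loads are unit-stride, but once the viewing direction makes them march along z, the stride grows to W*H and the TC hit rate collapses. A simple out-of-place permutation is shown below for clarity (the paper's method performs the reordering in place, and this permutation kernel is itself left unoptimized):

        // src is x-fastest (x + W*(y + H*z)); dst is z-fastest, so threads
        // stepping along z afterwards access contiguous memory.
        __global__ void permute_xyz_to_zyx(const float *src, float *dst,
                                           int W, int H, int D) {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            int z = blockIdx.z;
            if (x >= W || y >= H || z >= D) return;
            dst[(x * H + y) * D + z] = src[(z * H + y) * W + x];
        }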

  • An Efficient Soft Shadow Mapping for Area Lights in Various Shapes and Colors

    Youngjae CHUN  Kyoungsu OH  

     
    LETTER-Computer Graphics

    Publicized: 2016/11/11 | Vol: E100-D No:2 | Page(s): 396-400

    Shadows are an important effect that makes virtual 3D scenes more realistic. In this paper, we propose a fast and correct soft shadow generation method for area lights of various shapes and colors. To conduct efficient and accurate visibility tests, we exploit the complexity of the shadow and the color of the area light.

  • Geometry Clipmaps Terrain Rendering Using Hardware Tessellation

    Ge SONG  Hongyu YANG  Yulong JI  

     
    LETTER-Computer Graphics

    Publicized: 2016/11/09 | Vol: E100-D No:2 | Page(s): 401-404

    To address the heavy rendering load and unstable frame rate when rendering large terrains, this paper proposes a geometry-clipmaps-based algorithm. Triangle meshes are generated from a few tessellation control points in the GPU tessellation shader. 'Cracks' caused by differing resolutions between adjacent levels are eliminated by modifying the outer tessellation level factor of the edges shared between levels. Experimental results show that the algorithm improves rendering efficiency and frame-rate stability in terrain navigation.

  • GPU-Accelerated Bulk Execution of Multiple-Length Multiplication with Warp-Synchronous Programming Technique

    Takumi HONDA  Yasuaki ITO  Koji NAKANO  

     
    PAPER-GPU computing

    Publicized: 2016/08/24 | Vol: E99-D No:12 | Page(s): 3004-3012

    In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique: we assign each multiple-length multiplication to one warp, which consists of 32 threads. In parallel processing using multiple threads, it is usually costly to synchronize thread execution and to communicate between threads. With the warp-synchronous programming technique, however, the execution of threads in a warp can be synchronized instruction by instruction, without any barrier synchronization operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on an NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication as a sub-routine for larger bit sizes; the GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
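
    The flavor of warp-synchronous arithmetic can be sketched in CUDA with a warp multiplying a 1024-bit integer (one 32-bit limb per lane) by a 32-bit scalar: __shfl_up_sync hands each lane its neighbor's high product word, and no shared memory or barrier is needed. This is a hedged sketch of the style, not the paper's algorithm; the final carry out of lane 31 is dropped for brevity.

        #include <cstdint>

        __device__ uint32_t warp_mul_small(uint32_t limb, uint32_t b) {
            const unsigned FULL = 0xffffffffu;
            int lane = threadIdx.x & 31;
            uint64_t p = (uint64_t)limb * b;
            uint32_t lo = (uint32_t)p;
            uint32_t hi = (uint32_t)(p >> 32);
            uint32_t hi_prev = __shfl_up_sync(FULL, hi, 1);  // from lane-1
            if (lane == 0) hi_prev = 0;
            uint32_t sum = lo + hi_prev;
            uint32_t carry = (sum < lo) ? 1u : 0u;     // did the addition wrap?
            while (__any_sync(FULL, carry)) {          // ripple carries up lanes
                uint32_t c = __shfl_up_sync(FULL, carry, 1);
                if (lane == 0) c = 0;
                uint32_t ns = sum + c;
                carry = (ns < sum) ? 1u : 0u;
                sum = ns;
            }
            return sum;  // lane i now holds limb i of the product
        }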

  • Cache-Aware GPU Optimization for Out-of-Core Cone Beam CT Reconstruction of High-Resolution Volumes

    Yuechao LU  Fumihiko INO  Kenichi HAGIHARA  

     
    PAPER-Computer System

    Publicized: 2016/09/05 | Vol: E99-D No:12 | Page(s): 3060-3071

    This paper proposes a cache-aware optimization method to accelerate out-of-core cone beam computed tomography reconstruction on a graphics processing unit (GPU). Our proposed method extends a previous method by increasing the cache hit rate so as to speed up the reconstruction of high-resolution volumes that exceed the capacity of device memory. More specifically, our approach accelerates the well-known Feldkamp-Davis-Kress algorithm by utilizing three strategies: (1) a loop organization strategy that identifies the best tradeoff point between the cache hit rate and the number of off-chip memory accesses; (2) a data structure that exploits the high locality within a layered texture; and (3) a fully pipelined strategy for hiding file input/output (I/O) time behind GPU execution and data transfer times. We implement our proposed method on NVIDIA's Maxwell architecture and provide tuning guidelines for adjusting the execution parameters, which include the granularity and shape of thread blocks as well as the granularity of the I/O data streamed through the pipeline, so as to maximize reconstruction performance. Our experimental results show that it took less than three minutes to reconstruct a 2048³-voxel volume from 1200 2048²-pixel projection images on a single GPU, a speedup of approximately 1.47 over the previous method.
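
    Strategy (3) can be sketched with double buffering: while the GPU works on chunk i, the host reads chunk i+1 from disk, and transfers ride a stream so they overlap kernel execution. process_chunk and the buffer layout below are hypothetical stand-ins for the FDK back-projection step:

        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void process_chunk(const float *d, int n) {
            // ... back-projection work on one projection chunk ...
        }

        void pipeline(FILE *f, int n_chunks, int chunk_elems) {
            float *h[2], *d[2];
            cudaStream_t s;
            cudaStreamCreate(&s);
            for (int b = 0; b < 2; ++b) {
                cudaMallocHost(&h[b], chunk_elems * sizeof(float)); // pinned
                cudaMalloc(&d[b], chunk_elems * sizeof(float));
            }
            fread(h[0], sizeof(float), chunk_elems, f);  // prime first chunk
            for (int i = 0; i < n_chunks; ++i) {
                int b = i & 1;
                cudaMemcpyAsync(d[b], h[b], chunk_elems * sizeof(float),
                                cudaMemcpyHostToDevice, s);
                process_chunk<<<256, 256, 0, s>>>(d[b], chunk_elems);
                if (i + 1 < n_chunks)            // disk read overlaps GPU work
                    fread(h[(i + 1) & 1], sizeof(float), chunk_elems, f);
                cudaStreamSynchronize(s);        // buffer b reusable next round
            }
            for (int b = 0; b < 2; ++b) { cudaFreeHost(h[b]); cudaFree(d[b]); }
            cudaStreamDestroy(s);
        }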

  • A Memory-Access-Efficient Implementation for Computing the Approximate String Matching Algorithm on GPUs

    Lucas Saad Nogueira NUNES  Jacir Luiz BORDIM  Yasuaki ITO  Koji NAKANO  

     
    PAPER-GPU computing

    Publicized: 2016/08/24 | Vol: E99-D No:12 | Page(s): 2995-3003

    The closeness of a match is an important measure with a number of practical applications, including computational biology, signal processing, and text retrieval. The approximate string matching (ASM) problem asks to find a substring of string Y of length n that is most similar to string X of length m. It is well known that the ASM problem can be solved by a dynamic programming technique that computes a table of size m×n. The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU. The proposed GPU implementation relies on warp shuffle instructions, which are used to accelerate the communication between threads without resorting to shared memory access. Despite the fact that O(mn) memory access operations are necessary to access all elements of a table of size n×m, the proposed implementation performs only $O(\frac{mn}{w})$ memory access operations, where w is the warp size. Experimental results carried out on a GeForce GTX 980 GPU show that the proposed implementation, called w-SCAN, provides a more than two-fold speed-up in computing the ASM compared to another prominent alternative.
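
    The warp-shuffle DP style can be sketched as follows: lane i owns row i of the DP table for pattern X, anti-diagonals advance one cell per step, and __shfl_up_sync passes the upper and upper-left cells down from lane i-1 with no shared memory. This hedged sketch caps the pattern at the warp size; the paper's w-SCAN handles general table sizes.

        // Returns the smallest edit distance between X (m <= 32) and any
        // substring of Y (semi-global matching: the top DP row is all zeros).
        __device__ int warp_asm(const char *X, int m, const char *Y, int n) {
            const unsigned FULL = 0xffffffffu;
            int lane = threadIdx.x & 31;
            char pc = (lane < m) ? X[lane] : 0;
            int cur = lane + 1;            // left boundary D[i+1][0] = i+1
            int diag = lane;               // D[i][0] = i
            int best = n + m;              // running min over the last row
            for (int t = 0; t < n + m - 1; ++t) {
                int up = __shfl_up_sync(FULL, cur, 1);  // D[i][j] from lane-1
                if (lane == 0) up = 0;                  // D[0][j] = 0
                int j = t - lane;          // text position this lane handles
                if (j >= 0 && j < n) {
                    cur = min(min(up, cur) + 1, diag + (pc != Y[j]));
                    if (lane == m - 1) best = min(best, cur);
                }
                diag = up;                 // this step's up is next step's diag
            }
            return best;
        }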

  • Fully Parallelized LZW Decompression for CUDA-Enabled GPUs

    Shunji FUNASAKA  Koji NAKANO  Yasuaki ITO  

     
    PAPER-GPU computing

    Publicized: 2016/08/25 | Vol: E99-D No:12 | Page(s): 2986-2994

    The main contribution of this paper is to present a work-optimal parallel algorithm for LZW decompression and to implement it on a CUDA-enabled GPU. Since sequential LZW decompression creates a dictionary table by reading the codes in a compressed file one by one, it is not easy to parallelize. We first present a work-optimal parallel LZW decompression algorithm on the CREW-PRAM (Concurrent-Read Exclusive-Write Parallel Random Access Machine), a standard theoretical parallel computing model with shared memory. We then present an efficient implementation of this parallel algorithm on a GPU. The experimental results show that our GPU implementation performs LZW decompression in 1.15 milliseconds for a grayscale TIFF image of 4096×3072 pixels stored in the global memory of a GeForce GTX 980. In contrast, sequential LZW decompression of the same image stored in the main memory of an Intel Core i7 CPU takes 50.1 milliseconds. Thus, our parallel LZW decompression working on the GPU's global memory is 43.6 times faster than sequential LZW decompression on the CPU's main memory for this image. To show the applicability of our GPU implementation of LZW decompression, we evaluated SSD-to-GPU data loading time for three scenarios. The experimental results show that the scenario using our LZW decompression on the GPU is faster than the others.
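
    A taste of why the dictionary can be built in parallel: every LZW entry >= 256 extends a parent code, and per-entry properties such as the first character are found by chasing parent links, which pointer jumping resolves in a logarithmic number of rounds. The sketch below illustrates that flavor only; the paper's work-optimal algorithm is more involved.

        // anc_in[i]: current known ancestor code of dictionary entry 256+i
        // (initialized to the entry's parent code). Codes < 256 are literals.
        __global__ void jump_once(const int *anc_in, int *anc_out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            int a = anc_in[i];
            anc_out[i] = (a >= 256) ? anc_in[a - 256] : a;  // hop toward literal
        }

        // Host: launch jump_once ceil(log2(n)) times, swapping the buffers
        // between launches; every entry then holds its first character.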

  • Performance Optimization of Light-Field Applications on GPU

    Yuttakon YUTTAKONKIT  Shinya TAKAMAEDA-YAMAZAKI  Yasuhiko NAKASHIMA  

     
    PAPER-Computer System

    Publicized: 2016/08/24 | Vol: E99-D No:12 | Page(s): 3072-3081

    Light-field image processing has been widely employed in many areas, from mobile devices to manufacturing applications. The fundamental process of extracting usable information requires significant computation on high-resolution raw image data. A graphics processing unit (GPU) is used to exploit the data parallelism, as in general image processing applications. However, the sparse memory access patterns of these applications reduce GPU performance for both architectural and algorithmic reasons. We therefore propose an optimization technique that redesigns the memory access pattern of the applications, alleviating the memory bottleneck of the rendering application and increasing data reusability in the depth extraction application. We evaluated our optimized implementations against state-of-the-art implementations on several GPUs, with all implementations optimally configured for each specific device. Our proposed optimization increased the performance of the rendering application on a GTX-780 GPU by 30%, and of the depth extraction application on GTX-780 and GTX-980 GPUs by 82% and 18%, respectively, compared with the original implementations.

  • Accuracy Assessment of FDTD Method for the Analysis of Sub-Wavelength Photonic Structures

    Yasuo OHTERA  

     
    PAPER

    Vol: E99-C No:7 | Page(s): 780-787

    The FDTD (Finite-Difference Time-Domain) method has been widely used for the analysis of photonic devices consisting of sub-wavelength structures. In recent years, increasing efforts have been made to implement FDTD on GPGPUs (General-Purpose Graphics Processing Units) to shorten simulation time. On the other hand, it is widely recognized that most middle- and low-end GPGPUs differ considerably in computational performance between single-precision and double-precision arithmetic. Therefore, the choice of single or double precision for the electromagnetic field variables in FDTD becomes a key issue for overall simulation performance. In this study, we investigated the difference in results between single-precision and double-precision arithmetic. As a most fundamental sub-wavelength photonic structure, we focused on an alternating multilayer (a one-dimensional periodic structure). The obtained results indicate that a significant difference appears in the amplitudes of higher-order spatial harmonic waves.
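
    The precision comparison at the heart of the study can be reproduced with a single templated update kernel instantiated for float and for double. A minimal 1D free-space Yee update is sketched below under stated assumptions (arrays of length nx, simple fixed ends); the paper analyzes 2D/3D sub-wavelength structures:

        // E update: the spatial difference of H drives E (ce = dt/(eps*dx)).
        template <typename Real>
        __global__ void update_e(Real *ez, const Real *hy, Real ce, int nx) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i <= 0 || i >= nx) return;
            ez[i] += ce * (hy[i] - hy[i - 1]);
        }

        // H update: the spatial difference of E drives H (ch = dt/(mu*dx)).
        template <typename Real>
        __global__ void update_h(const Real *ez, Real *hy, Real ch, int nx) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= nx - 1) return;
            hy[i] += ch * (ez[i + 1] - ez[i]);
        }

        // Run update_e<float>/update_h<float> and the <double> instantiations
        // on identical inputs, then compare harmonic amplitudes step by step.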
