A Novel Procedure for Implementing a Turbo Decoder on a GPU with Coalesced Memory Access

Heungseop AHN; Seungwon CHOI

doi:10.1587/transfun.E100.A.1188


Summary:

The sub-blocking algorithm is known as a core component in implementing a turbo decoder on a Graphics Processing Unit (GPU), because it lets the decoder use as many GPU cores as possible for parallel processing. However, even though the sub-blocking algorithm allows a large number of threads in a given GPU to process a large number of sub-blocks in parallel, each thread must access the global memory with strided addresses, which results in uncoalesced memory access. Because uncoalesced memory access causes many unnecessary memory transactions, the memory bandwidth efficiency drops significantly, possibly to as low as 1/8 in the case of a Long Term Evolution (LTE) turbo decoder, depending upon the compute capability of the GPU. In this paper, we present a novel method for converting uncoalesced memory access into coalesced access in a way that completely recovers the memory bandwidth efficiency to 100% without additional overhead. Our experimental tests, performed with NVIDIA's GeForce GTX 780 Ti GPU, show that the proposed method can enhance the throughput by nearly 30% compared with a conventional turbo decoder that suffers from uncoalesced memory access. The throughput of the proposed method was observed to be 51.4 Mbps when the number of iterations and the number of sub-blocks were set to 6 and 32, respectively, which far exceeds the performance of previous works that implemented the Max-Log-MAP algorithm.
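
The 1/8 figure in the summary can be illustrated with a small model of how one warp-wide read maps onto memory segments. The sketch below is only an illustration under stated assumptions, not the authors' procedure: it assumes 32 threads per warp, 4-byte values, and 32-byte memory segments, and it takes a 6144-bit code block (the largest LTE turbo code block) split into 32 sub-blocks as an example. The interleaved layout shown is one generic way to obtain coalesced access; the summary does not describe the actual conversion used in the paper. All function and constant names are hypothetical.

```python
# Assumed sizes (not taken from the paper): 32 threads per warp,
# 4-byte values, and global memory served in 32-byte segments.
WARP_SIZE = 32
WORD_BYTES = 4
SEGMENT_BYTES = 32


def segments_touched(word_indices):
    """Count the distinct memory segments needed to serve one warp-wide read."""
    return len({(i * WORD_BYTES) // SEGMENT_BYTES for i in word_indices})


def efficiency(word_indices):
    """Useful bytes divided by bytes actually transferred."""
    useful = len(word_indices) * WORD_BYTES
    moved = segments_touched(word_indices) * SEGMENT_BYTES
    return useful / moved


def strided_indices(step, sub_block_len):
    # Sub-block layout: thread t owns a contiguous run of sub_block_len words,
    # so at a given step the warp's addresses are sub_block_len words apart.
    return [t * sub_block_len + step for t in range(WARP_SIZE)]


def interleaved_indices(step):
    # Interleaved layout (illustrative): word `step` of every sub-block is
    # stored side by side, so the warp reads 32 consecutive words.
    return [step * WARP_SIZE + t for t in range(WARP_SIZE)]


# Example only: 6144 values split into 32 sub-blocks of 192 values each.
SUB_BLOCK_LEN = 6144 // WARP_SIZE

strided = efficiency(strided_indices(0, SUB_BLOCK_LEN))
coalesced = efficiency(interleaved_indices(0))
print(strided, coalesced)  # prints "0.125 1.0"
```

Under these assumptions, the strided read touches 32 separate segments to fetch 128 useful bytes, while the interleaved read touches only 4, which matches the 1/8 versus 100% efficiency contrast in the summary.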

Publication: IEICE TRANSACTIONS on Fundamentals Vol.E100-A No.5 pp.1188-1196

Publication Date: 2017/05/01

Online ISSN: 1745-1337

DOI: 10.1587/transfun.E100.A.1188

Type of Manuscript: PAPER

Category: Communication Theory and Signals

Authors

Heungseop AHN
Hanyang University
Seungwon CHOI
Hanyang University

Keywords

GPU, CUDA, turbo decoder, coalesced memory access, SDR

Heungseop AHN, Seungwon CHOI, "A Novel Procedure for Implementing a Turbo Decoder on a GPU with Coalesced Memory Access" in IEICE TRANSACTIONS on Fundamentals, vol. E100-A, no. 5, pp. 1188-1196, May 2017, doi: 10.1587/transfun.E100.A.1188.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/transfun.E100.A.1188/_p

@ARTICLE{e100-a_5_1188,
author={Heungseop AHN and Seungwon CHOI},
journal={IEICE TRANSACTIONS on Fundamentals},
title={A Novel Procedure for Implementing a Turbo Decoder on a GPU with Coalesced Memory Access},
year={2017},
volume={E100-A},
number={5},
pages={1188--1196},
abstract={The sub-blocking algorithm is known as a core component in implementing a turbo decoder on a Graphics Processing Unit (GPU), because it lets the decoder use as many GPU cores as possible for parallel processing. However, even though the sub-blocking algorithm allows a large number of threads in a given GPU to process a large number of sub-blocks in parallel, each thread must access the global memory with strided addresses, which results in uncoalesced memory access. Because uncoalesced memory access causes many unnecessary memory transactions, the memory bandwidth efficiency drops significantly, possibly to as low as 1/8 in the case of a Long Term Evolution (LTE) turbo decoder, depending upon the compute capability of the GPU. In this paper, we present a novel method for converting uncoalesced memory access into coalesced access in a way that completely recovers the memory bandwidth efficiency to 100% without additional overhead. Our experimental tests, performed with NVIDIA's GeForce GTX 780 Ti GPU, show that the proposed method can enhance the throughput by nearly 30% compared with a conventional turbo decoder that suffers from uncoalesced memory access. The throughput of the proposed method was observed to be 51.4 Mbps when the number of iterations and the number of sub-blocks were set to 6 and 32, respectively, which far exceeds the performance of previous works that implemented the Max-Log-MAP algorithm.},
keywords={GPU, CUDA, turbo decoder, coalesced memory access, SDR},
doi={10.1587/transfun.E100.A.1188},
ISSN={1745-1337},
month={May},}

TY - JOUR
TI - A Novel Procedure for Implementing a Turbo Decoder on a GPU with Coalesced Memory Access
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 1188
EP - 1196
AU - Heungseop AHN
AU - Seungwon CHOI
PY - 2017
DO - 10.1587/transfun.E100.A.1188
JO - IEICE TRANSACTIONS on Fundamentals
SN - 1745-1337
VL - E100-A
IS - 5
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - May 2017
AB - The sub-blocking algorithm is known as a core component in implementing a turbo decoder on a Graphics Processing Unit (GPU), because it lets the decoder use as many GPU cores as possible for parallel processing. However, even though the sub-blocking algorithm allows a large number of threads in a given GPU to process a large number of sub-blocks in parallel, each thread must access the global memory with strided addresses, which results in uncoalesced memory access. Because uncoalesced memory access causes many unnecessary memory transactions, the memory bandwidth efficiency drops significantly, possibly to as low as 1/8 in the case of a Long Term Evolution (LTE) turbo decoder, depending upon the compute capability of the GPU. In this paper, we present a novel method for converting uncoalesced memory access into coalesced access in a way that completely recovers the memory bandwidth efficiency to 100% without additional overhead. Our experimental tests, performed with NVIDIA's GeForce GTX 780 Ti GPU, show that the proposed method can enhance the throughput by nearly 30% compared with a conventional turbo decoder that suffers from uncoalesced memory access. The throughput of the proposed method was observed to be 51.4 Mbps when the number of iterations and the number of sub-blocks were set to 6 and 32, respectively, which far exceeds the performance of previous works that implemented the Max-Log-MAP algorithm.
ER -