In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique. We assign each multiple-length multiplication to one warp that consists of 32 threads. In parallel processing using multiple threads, usually, it is costly to synchronize execution of threads and communicate within threads. In warp-synchronous programming technique, however, execution of threads in a warp can be synchronized instruction by instruction without any barrier synchronous operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication for larger size of bits as a sub-routine. The GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
Takumi HONDA
Hiroshima University
Yasuaki ITO
Hiroshima University
Koji NAKANO
Hiroshima University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Takumi HONDA, Yasuaki ITO, Koji NAKANO, "GPU-Accelerated Bulk Execution of Multiple-Length Multiplication with Warp-Synchronous Programming Technique" in IEICE TRANSACTIONS on Information,
vol. E99-D, no. 12, pp. 3004-3012, December 2016, doi: 10.1587/transinf.2016PAP0027.
Abstract: In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique. We assign each multiple-length multiplication to one warp that consists of 32 threads. In parallel processing using multiple threads, usually, it is costly to synchronize execution of threads and communicate within threads. In warp-synchronous programming technique, however, execution of threads in a warp can be synchronized instruction by instruction without any barrier synchronous operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication for larger size of bits as a sub-routine. The GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016PAP0027/_p
Copy
@ARTICLE{e99-d_12_3004,
author={Takumi HONDA, Yasuaki ITO, Koji NAKANO, },
journal={IEICE TRANSACTIONS on Information},
title={GPU-Accelerated Bulk Execution of Multiple-Length Multiplication with Warp-Synchronous Programming Technique},
year={2016},
volume={E99-D},
number={12},
pages={3004-3012},
abstract={In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique. We assign each multiple-length multiplication to one warp that consists of 32 threads. In parallel processing using multiple threads, usually, it is costly to synchronize execution of threads and communicate within threads. In warp-synchronous programming technique, however, execution of threads in a warp can be synchronized instruction by instruction without any barrier synchronous operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication for larger size of bits as a sub-routine. The GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.},
keywords={},
doi={10.1587/transinf.2016PAP0027},
ISSN={1745-1361},
month={December},}
Copy
TY - JOUR
TI - GPU-Accelerated Bulk Execution of Multiple-Length Multiplication with Warp-Synchronous Programming Technique
T2 - IEICE TRANSACTIONS on Information
SP - 3004
EP - 3012
AU - Takumi HONDA
AU - Yasuaki ITO
AU - Koji NAKANO
PY - 2016
DO - 10.1587/transinf.2016PAP0027
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E99-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2016
AB - In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique. We assign each multiple-length multiplication to one warp that consists of 32 threads. In parallel processing using multiple threads, usually, it is costly to synchronize execution of threads and communicate within threads. In warp-synchronous programming technique, however, execution of threads in a warp can be synchronized instruction by instruction without any barrier synchronous operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication for larger size of bits as a sub-routine. The GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
ER -