Fast Computation with Efficient Object Data Distribution for Large-Scale Hologram Generation on a Multi-GPU Cluster

Takanobu BABA; Shinpei WATANABE; Boaz JESSIE JACKIN; Kanemitsu OOTSU; Takeshi OHKAWA; Takashi YOKOTA; Yoshio HAYASAKI; Toyohiko YATAGAI

doi:10.1587/transinf.2018EDP7346

IEICE TRANSACTIONS on Information

Open Access

Summary:

The 3D holographic display has long been anticipated as a future human interface because it does not require users to wear special devices. However, its heavy computational requirements have prevented the realization of such displays. A recent study suggests that objects and holograms with several giga-pixels must be processed in real time to achieve high resolution and a wide viewing angle. To address this problem, we first adapted a conventional FFT algorithm to a GPU cluster environment in order to avoid heavy inter-node communication. We then applied several single-node and multi-node optimization and parallelization techniques. The single-node optimizations include a change in the object decomposition method, reduction of data transfer between the CPU and GPU, kernel integration, stream processing, and utilization of multiple GPUs within a node. The multi-node optimizations include methods for distributing object data from the host node to the other nodes. Experimental results show that the intra-node optimizations attain an 11.52-times speed-up over the original single-node code. Further, the multi-node optimizations using 8 nodes with 2 GPUs per node attain an execution time of 4.28 sec for generating a 1.6 giga-pixel hologram from a 3.2 giga-pixel object. This corresponds to a 237.92-times speed-up over sequential processing on a CPU and a 41.78-times speed-up over multi-threaded execution on a multicore CPU, using a conventional FFT-based algorithm.
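As a point of reference, a conventional FFT-based CGH computation evaluates diffraction as a convolution: a forward FFT of the object plane, multiplication by a propagation transfer function, and an inverse FFT. The sketch below is a minimal single-node NumPy illustration of that idea, not the authors' implementation. The summary does not state which diffraction formula the paper uses, so the Fresnel transfer function here is an assumed stand-in, and the function name `fresnel_propagate`, the wavelength, pixel pitch, and propagation distance are all hypothetical illustration values.

```python
import numpy as np


def fresnel_propagate(field, wavelength, pitch, distance):
    """Propagate a complex 2D field by `distance` with the Fresnel transfer function.

    Assumed stand-in for a "conventional FFT-based" diffraction step:
    FFT -> multiply by transfer function H -> inverse FFT.
    """
    ny, nx = field.shape
    # Spatial frequencies (cycles per metre) for a sampling pitch `pitch`.
    fx = np.fft.fftfreq(nx, d=pitch)
    fy = np.fft.fftfreq(ny, d=pitch)
    FX, FY = np.meshgrid(fx, fy)  # both have shape (ny, nx)
    k = 2.0 * np.pi / wavelength
    # Fresnel transfer function; |H| == 1, so total energy is preserved.
    H = np.exp(1j * k * distance) * np.exp(
        -1j * np.pi * wavelength * distance * (FX**2 + FY**2)
    )
    return np.fft.ifft2(np.fft.fft2(field) * H)


# Usage with hypothetical parameters: a random 256 x 256 object,
# 532 nm light, 8 um pixel pitch, 0.1 m propagation distance.
rng = np.random.default_rng(0)
obj = rng.random((256, 256)) * np.exp(2j * np.pi * rng.random((256, 256)))
hologram_field = fresnel_propagate(obj, 532e-9, 8e-6, 0.1)
# One simple way to obtain a displayable hologram: keep only the phase.
phase_hologram = np.angle(hologram_field)
```

The FFT route costs O(N log N) per plane rather than the O(N^2) of direct point-by-point summation, which is one reason FFT-based methods are attractive at giga-pixel scale. How the paper decomposes the object and distributes it across nodes and GPUs is not reproduced here.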

Publication: IEICE TRANSACTIONS on Information Vol.E102-D No.7 pp.1310-1320

Publication Date: 2019/07/01

Publicized: 2019/03/29

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2018EDP7346

Type of Manuscript: PAPER

Category: Human-computer Interaction

Authors

Takanobu BABA
  Utsunomiya University
Shinpei WATANABE
  Acs Co., Ltd.
Boaz JESSIE JACKIN
  National Institute of Information and Communications Technology
Kanemitsu OOTSU
  Utsunomiya University
Takeshi OHKAWA
  Utsunomiya University
Takashi YOKOTA
  Utsunomiya University
Yoshio HAYASAKI
  Utsunomiya University
Toyohiko YATAGAI
  Utsunomiya University

Keywords

computer generated holography, large-scale CGH, GPU cluster

Cite this

Takanobu BABA, Shinpei WATANABE, Boaz JESSIE JACKIN, Kanemitsu OOTSU, Takeshi OHKAWA, Takashi YOKOTA, Yoshio HAYASAKI, Toyohiko YATAGAI, "Fast Computation with Efficient Object Data Distribution for Large-Scale Hologram Generation on a Multi-GPU Cluster" in IEICE TRANSACTIONS on Information, vol. E102-D, no. 7, pp. 1310-1320, July 2019, doi: 10.1587/transinf.2018EDP7346.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2018EDP7346/_p