An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs

Tomoya ITSUBO; Michihiro KOIBUCHI; Hideharu AMANO; Hiroki MATSUTANI

doi:10.1587/transinf.2021PAP0008

IEICE TRANSACTIONS on Information

Summary:

Because deep learning workloads perform a large number of matrix operations on training data, GPUs (Graphics Processing Units) are especially efficient for the training phase. A cluster of computers, each equipped with multiple GPUs, can significantly accelerate these workloads. Training uses a back-propagation algorithm that follows a gradient descent approach. Although gradient computation remains a major bottleneck of training, gradient aggregation and optimization impose both communication and computation overheads, which should also be reduced to further shorten the training time. To address this issue, this paper interconnects multiple GPUs using a PCI Express (PCIe) over 10Gbit Ethernet (10GbE) technology. Because these remote GPUs are interconnected through network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between the remote GPUs; thus, gradient aggregation and parameter optimization are completed in the network. The proposed FPGA-based 10GbE switches with the four optimizers are implemented on a NetFPGA-SUME board. The PEs added for the optimizers increase resource utilization, and the switches consume up to 56% of the resources. Evaluation results using four remote GPUs connected via the proposed FPGA-based switch show that these optimizers are accelerated by up to 3.0x and 1.25x compared with CPU and GPU implementations, respectively. In addition, the gradient aggregation throughput of the FPGA-based switch reaches up to 98.3% of the 10GbE line rate.
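
The summary names the two steps that are moved into the switch (gradient aggregation and the optimizer update) but does not spell out the update rules. The following is a minimal software sketch of those two steps in Python, not the authors' FPGA design. It assumes that aggregation is an element-wise sum of the per-GPU gradients and uses commonly published formulations of SGD, AdaGrad, Adam, and SMORMS3; all function names, the state layout, and the hyperparameter defaults are hypothetical choices made for illustration.

```python
import math


def new_state(n):
    """Per-parameter optimizer state (layout is an assumption of this sketch)."""
    return {"h": [0.0] * n, "m": [0.0] * n, "v": [0.0] * n, "t": 0,
            "g": [0.0] * n, "g2": [0.0] * n, "mem": [1.0] * n}


def aggregate(gradients):
    """Element-wise sum of per-worker gradient vectors (assumed aggregation rule)."""
    return [sum(vals) for vals in zip(*gradients)]


def sgd_step(params, grad, lr=0.01):
    """Plain SGD: w <- w - lr * g."""
    return [p - lr * g for p, g in zip(params, grad)]


def adagrad_step(params, grad, state, lr=0.01, eps=1e-8):
    """AdaGrad: scale each step by the root of the accumulated squared gradient."""
    state["h"] = [h + g * g for h, g in zip(state["h"], grad)]
    return [p - lr * g / (math.sqrt(h) + eps)
            for p, g, h in zip(params, grad, state["h"])]


def adam_step(params, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with bias-corrected first and second moment estimates."""
    state["t"] += 1
    t = state["t"]
    state["m"] = [b1 * m + (1 - b1) * g for m, g in zip(state["m"], grad)]
    state["v"] = [b2 * v + (1 - b2) * g * g for v, g in zip(state["v"], grad)]
    out = []
    for p, m, v in zip(params, state["m"], state["v"]):
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        out.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return out


def smorms3_step(params, grad, state, lr=0.001, eps=1e-16):
    """SMORMS3: running averages with a per-parameter adaptive memory length."""
    out = []
    for i, (p, g) in enumerate(zip(params, grad)):
        r = 1.0 / (state["mem"][i] + 1.0)
        state["g"][i] = (1 - r) * state["g"][i] + r * g
        state["g2"][i] = (1 - r) * state["g2"][i] + r * g * g
        x = state["g"][i] ** 2 / (state["g2"][i] + eps)
        out.append(p - g * min(lr, x) / (math.sqrt(state["g2"][i]) + eps))
        state["mem"][i] = 1.0 + state["mem"][i] * (1.0 - x)
    return out


if __name__ == "__main__":
    # Four workers, mirroring the four remote GPUs in the evaluation.
    worker_grads = [[1.0, -2.0], [0.5, 0.5], [1.5, 0.5], [1.0, 1.0]]
    total = aggregate(worker_grads)
    weights = [1.0, 1.0]
    print("aggregated:", total)
    print("sgd:", sgd_step(weights, total, lr=0.1))
    print("adagrad:", adagrad_step(weights, total, new_state(2)))
    print("adam:", adam_step(weights, total, new_state(2)))
    print("smorms3:", smorms3_step(weights, total, new_state(2)))
```

In this sketch, aggregation needs only one addition per gradient element, while each optimizer keeps a small amount of per-parameter state between steps. That state is the part a switch-resident optimizer would have to hold locally so that only updated parameters travel back to the GPUs.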

Publication: IEICE TRANSACTIONS on Information Vol.E104-D No.12 pp.2057-2067

Publication Date: 2021/12/01

Publicized: 2021/07/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2021PAP0008

Type of Manuscript: Special Section PAPER (Special Section on Parallel, Distributed, and Reconfigurable Computing, and Networking)

Authors

Tomoya ITSUBO
  Keio University
Michihiro KOIBUCHI
  National Institute of Informatics
Hideharu AMANO
  Keio University
Hiroki MATSUTANI
  Keio University

Keywords

deep learning, FPGA switch, remote GPU

Cite this

Tomoya ITSUBO, Michihiro KOIBUCHI, Hideharu AMANO, Hiroki MATSUTANI, "An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs" in IEICE TRANSACTIONS on Information, vol. E104-D, no. 12, pp. 2057-2067, December 2021, doi: 10.1587/transinf.2021PAP0008.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021PAP0008/_p

@ARTICLE{e104-d_12_2057,
author={Tomoya ITSUBO and Michihiro KOIBUCHI and Hideharu AMANO and Hiroki MATSUTANI},
journal={IEICE TRANSACTIONS on Information},
title={An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs},
year={2021},
volume={E104-D},
number={12},
pages={2057-2067},
keywords={deep learning, FPGA switch, remote GPU},
doi={10.1587/transinf.2021PAP0008},
ISSN={1745-1361},
month={December}
}

TY - JOUR
TI - An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs
T2 - IEICE TRANSACTIONS on Information
SP - 2057
EP - 2067
AU - Tomoya ITSUBO
AU - Michihiro KOIBUCHI
AU - Hideharu AMANO
AU - Hiroki MATSUTANI
PY - 2021
DO - 10.1587/transinf.2021PAP0008
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E104-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2021
ER -