Hybrid Electrical/Optical Switch Architectures for Training Distributed Deep Learning in Large-Scale

Thao-Nguyen TRUONG, Ryousei TAKANO

Summary :

Data parallelism is the dominant method for training deep learning (DL) models on High-Performance Computing systems such as large-scale GPU clusters. When a DL model is trained on a large number of nodes, inter-node communication becomes a bottleneck because it has relatively higher latency and lower link bandwidth than intra-node communication. Although several communication techniques have been proposed to cope with this problem, all of these approaches address the large message size issue while diminishing the effect of the inter-node network's limitations. In this study, we investigate the benefit of increasing inter-node link bandwidth by using hybrid switching systems, i.e., Electrical Packet Switching and Optical Circuit Switching. We found that the typical data transfer in synchronous data-parallel training is long-lived and rarely changes, so it can be sped up with optical switching. Simulation results on the SimGrid simulator show that our approach reduces the training time of deep learning applications, especially at large scale.
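
As a rough illustration of why inter-node link bandwidth matters for synchronous data-parallel training, the sketch below estimates the time of one gradient all-reduce with the textbook alpha-beta (latency plus size over bandwidth) model for a ring all-reduce. This is not the cost model or the simulation setup of the paper; the function name, the 100 MB gradient size, the 5 microsecond latency, and the two link speeds are all hypothetical values chosen only for the example.

```python
def ring_allreduce_time(num_nodes, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate ring all-reduce time with the alpha-beta cost model.

    A ring all-reduce over p nodes takes 2 * (p - 1) steps
    (reduce-scatter followed by all-gather). In each step every
    node sends one chunk of message_bytes / p to its neighbor.
    """
    if num_nodes < 2:
        return 0.0  # a single node has nothing to exchange
    steps = 2 * (num_nodes - 1)
    chunk_bytes = message_bytes / num_nodes
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


if __name__ == "__main__":
    gradient_bytes = 100e6   # hypothetical: 100 MB of gradients per iteration
    latency_s = 5e-6         # hypothetical: 5 microseconds per message
    # Hypothetical link speeds in Gbit/s, converted to bytes/s below.
    links = {"lower-bandwidth link": 100, "higher-bandwidth link": 400}
    for nodes in (8, 64, 512):
        for name, gbit_per_s in links.items():
            bandwidth = gbit_per_s * 1e9 / 8
            t = ring_allreduce_time(nodes, gradient_bytes, latency_s, bandwidth)
            print(f"{nodes:4d} nodes, {name}: {t * 1e3:.2f} ms")
```

Under this simplified model the bandwidth term dominates for large gradients, so a faster link shortens every iteration, while the latency term grows with the number of nodes.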

Publication
IEICE TRANSACTIONS on Information Vol.E104-D No.8 pp.1332-1339
Publication Date
2021/08/01
Publicized
2021/04/23
Online ISSN
1745-1361
DOI
10.1587/transinf.2020EDP7201
Type of Manuscript
PAPER
Category
Information Network

Authors

Thao-Nguyen TRUONG
  National Institute of Advanced Industrial Science and Technology (AIST)
Ryousei TAKANO
  National Institute of Advanced Industrial Science and Technology (AIST)

Keyword