Blocked United Algorithm for the All-Pairs Shortest Paths Problem on Hybrid CPU-GPU Systems

Kazuya MATSUMOTO; Naohito NAKASATO; Stanislav G. SEDUKHIN

doi:10.1587/transinf.E95.D.2759

Blocked United Algorithm for the All-Pairs Shortest Paths Problem on Hybrid CPU-GPU Systems

Kazuya MATSUMOTO, Naohito NAKASATO, Stanislav G. SEDUKHIN

Full Text Views

0

Cite this

Summary :

This paper presents a blocked united algorithm for the all-pairs shortest paths (APSP) problem. This algorithm simultaneously computes both the shortest-path distance matrix and the shortest-path construction matrix for a graph. It is designed for a high-speed APSP solution on hybrid CPU-GPU systems. In our implementation, two most compute intensive parts of the algorithm are performed on the GPU. The first part is to solve the APSP sub-problem for a block of sub-matrices, and the other part is a matrix-matrix “multiplication” for the APSP problem. Moreover, the amount of data communication between CPU (host) memory and GPU memory is reduced by reusing blocks once sent to the GPU. When a problem size (the number of vertices in a graph) is large enough compared to a block size, our implementation of the blocked algorithm requires CPU GPU exchanging of three blocks during a block computation on the GPU. We measured the performance of the algorithm implementation on two different CPU-GPU systems. A system containing an Intel Sandy Bridge CPU (Core i7 2600K) and an AMD Cayman GPU (Radeon HD 6970) achieves the performance up to 1.1 TFlop/s in a single precision.

Publication: IEICE TRANSACTIONS on Information Vol.E95-D No.12 pp.2759-2768

Publication Date: 2012/12/01

Publicized

Online ISSN: 1745-1361

DOI: 10.1587/transinf.E95.D.2759

Type of Manuscript: Special Section PAPER (Special Section on Parallel and Distributed Computing and Networking)

Category: Parallel and Distributed Computing

Cite this

Copy

Kazuya MATSUMOTO, Naohito NAKASATO, Stanislav G. SEDUKHIN, "Blocked United Algorithm for the All-Pairs Shortest Paths Problem on Hybrid CPU-GPU Systems" in IEICE TRANSACTIONS on Information, vol. E95-D, no. 12, pp. 2759-2768, December 2012, doi: 10.1587/transinf.E95.D.2759.
Abstract: This paper presents a blocked united algorithm for the all-pairs shortest paths (APSP) problem. This algorithm simultaneously computes both the shortest-path distance matrix and the shortest-path construction matrix for a graph. It is designed for a high-speed APSP solution on hybrid CPU-GPU systems. In our implementation, two most compute intensive parts of the algorithm are performed on the GPU. The first part is to solve the APSP sub-problem for a block of sub-matrices, and the other part is a matrix-matrix “multiplication” for the APSP problem. Moreover, the amount of data communication between CPU (host) memory and GPU memory is reduced by reusing blocks once sent to the GPU. When a problem size (the number of vertices in a graph) is large enough compared to a block size, our implementation of the blocked algorithm requires CPU GPU exchanging of three blocks during a block computation on the GPU. We measured the performance of the algorithm implementation on two different CPU-GPU systems. A system containing an Intel Sandy Bridge CPU (Core i7 2600K) and an AMD Cayman GPU (Radeon HD 6970) achieves the performance up to 1.1 TFlop/s in a single precision.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E95.D.2759/_p

Copy

@ARTICLE{e95-d_12_2759,
author={Kazuya MATSUMOTO, Naohito NAKASATO, Stanislav G. SEDUKHIN, },
journal={IEICE TRANSACTIONS on Information},
title={Blocked United Algorithm for the All-Pairs Shortest Paths Problem on Hybrid CPU-GPU Systems},
year={2012},
volume={E95-D},
number={12},
pages={2759-2768},
abstract={This paper presents a blocked united algorithm for the all-pairs shortest paths (APSP) problem. This algorithm simultaneously computes both the shortest-path distance matrix and the shortest-path construction matrix for a graph. It is designed for a high-speed APSP solution on hybrid CPU-GPU systems. In our implementation, two most compute intensive parts of the algorithm are performed on the GPU. The first part is to solve the APSP sub-problem for a block of sub-matrices, and the other part is a matrix-matrix “multiplication” for the APSP problem. Moreover, the amount of data communication between CPU (host) memory and GPU memory is reduced by reusing blocks once sent to the GPU. When a problem size (the number of vertices in a graph) is large enough compared to a block size, our implementation of the blocked algorithm requires CPU GPU exchanging of three blocks during a block computation on the GPU. We measured the performance of the algorithm implementation on two different CPU-GPU systems. A system containing an Intel Sandy Bridge CPU (Core i7 2600K) and an AMD Cayman GPU (Radeon HD 6970) achieves the performance up to 1.1 TFlop/s in a single precision.},
keywords={},
doi={10.1587/transinf.E95.D.2759},
ISSN={1745-1361},
month={December},}

Copy

TY - JOUR
TI - Blocked United Algorithm for the All-Pairs Shortest Paths Problem on Hybrid CPU-GPU Systems
T2 - IEICE TRANSACTIONS on Information
SP - 2759
EP - 2768
AU - Kazuya MATSUMOTO
AU - Naohito NAKASATO
AU - Stanislav G. SEDUKHIN
PY - 2012
DO - 10.1587/transinf.E95.D.2759
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E95-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2012
AB - This paper presents a blocked united algorithm for the all-pairs shortest paths (APSP) problem. This algorithm simultaneously computes both the shortest-path distance matrix and the shortest-path construction matrix for a graph. It is designed for a high-speed APSP solution on hybrid CPU-GPU systems. In our implementation, two most compute intensive parts of the algorithm are performed on the GPU. The first part is to solve the APSP sub-problem for a block of sub-matrices, and the other part is a matrix-matrix “multiplication” for the APSP problem. Moreover, the amount of data communication between CPU (host) memory and GPU memory is reduced by reusing blocks once sent to the GPU. When a problem size (the number of vertices in a graph) is large enough compared to a block size, our implementation of the blocked algorithm requires CPU GPU exchanging of three blocks during a block computation on the GPU. We measured the performance of the algorithm implementation on two different CPU-GPU systems. A system containing an Intel Sandy Bridge CPU (Core i7 2600K) and an AMD Cayman GPU (Radeon HD 6970) achieves the performance up to 1.1 TFlop/s in a single precision.
ER -