A Tree-Based Checkpointing Architecture for the Dependability of FPGA Computing

Hoang-Gia VU; Shinya TAKAMAEDA-YAMAZAKI; Takashi NAKADA; Yasuhiko NAKASHIMA

doi:10.1587/transinf.2017RCP0010

Summary:

Modern FPGAs have been integrated into computing systems as accelerators for long-running applications. This integration puts more pressure on the fault tolerance of computing systems, and dependability becomes an essential requirement. As in CPU-based systems, checkpoint/restart techniques are expected to improve the dependability of FPGA-based computing. Three issues arise in this situation: how to checkpoint and restart FPGAs, how well this checkpoint/restart model works with the checkpoint/restart model of the whole computing system, and how to build the model with a software tool. In this paper, we first present a new checkpoint/restart architecture along with a checkpointing mechanism on FPGAs. Second, we propose a method to capture consistent snapshots of the FPGA and the rest of the computing system. Third, we provide “fine-grained” management for checkpointing to reduce performance degradation. For the host CPU, we also provide a stack that includes API functions to manage checkpoint/restart procedures on FPGAs. Fourth, we present a Python-based tool to insert the checkpointing infrastructure. Experimental results show that the checkpointing architecture incurs less than 10% degradation in maximum clock frequency, low checkpointing latencies, small memory footprints, and small increases in power consumption, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).

Publication: IEICE TRANSACTIONS on Information Vol.E101-D No.2 pp.288-302

Publication Date: 2018/02/01

Publicized: 2017/11/17

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2017RCP0010

Type of Manuscript: Special Section PAPER (Special Section on Reconfigurable Systems)

Category: Device and Architecture

Authors

Hoang-Gia VU
  Nara Institute of Science and Technology
Shinya TAKAMAEDA-YAMAZAKI
  Hokkaido University
Takashi NAKADA
  Nara Institute of Science and Technology
Yasuhiko NAKASHIMA
  Nara Institute of Science and Technology

Keyword

checkpointing, FPGA, dependability, tree-based

Cite this

Hoang-Gia VU, Shinya TAKAMAEDA-YAMAZAKI, Takashi NAKADA, Yasuhiko NAKASHIMA, "A Tree-Based Checkpointing Architecture for the Dependability of FPGA Computing" in IEICE TRANSACTIONS on Information, vol. E101-D, no. 2, pp. 288-302, February 2018, doi: 10.1587/transinf.2017RCP0010.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2017RCP0010/_p

@ARTICLE{e101-d_2_288,
author={Hoang-Gia VU and Shinya TAKAMAEDA-YAMAZAKI and Takashi NAKADA and Yasuhiko NAKASHIMA},
journal={IEICE TRANSACTIONS on Information},
title={A Tree-Based Checkpointing Architecture for the Dependability of FPGA Computing},
year={2018},
volume={E101-D},
number={2},
pages={288--302},
keywords={checkpointing, FPGA, dependability, tree-based},
doi={10.1587/transinf.2017RCP0010},
ISSN={1745-1361},
month={February}
}

TY - JOUR
TI - A Tree-Based Checkpointing Architecture for the Dependability of FPGA Computing
T2 - IEICE TRANSACTIONS on Information
SP - 288
EP - 302
AU - Hoang-Gia VU
AU - Shinya TAKAMAEDA-YAMAZAKI
AU - Takashi NAKADA
AU - Yasuhiko NAKASHIMA
PY - 2018
DO - 10.1587/transinf.2017RCP0010
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E101-D
IS - 2
JA - IEICE TRANSACTIONS on Information
Y1 - February 2018
ER -