A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems

Yonghwan KIM; Tadashi ARARAGI; Junya NAKAMURA; Toshimitsu MASUZAWA

doi:10.1587/transinf.E97.D.65

Summary:

Checkpoint-rollback recovery, which is a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm especially if the systems are large-scale, since repeatedly taking global snapshots of the whole system requires unacceptable communication cost. As a sophisticated snapshot algorithm, a partial snapshot algorithm has been introduced that takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator instead of a global snapshot of the whole system. In this paper, we modify the previous partial snapshot algorithm to create a new one that can take a partial snapshot more efficiently, especially when multiple nodes concurrently initiate the algorithm. Experiments show that the proposed algorithm greatly reduces the amount of communication needed for taking partial snapshots.
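
The idea of snapshotting only the initiator's subsystem can be illustrated with a minimal sketch. It assumes that "communication-related" means reachable from the initiator through nodes that have exchanged messages since their last checkpoints; the paper's precise definition and protocol may differ, and this is not the authors' algorithm. The function name `partial_snapshot_group` and the `(sender, receiver)` log format are hypothetical, chosen only for this illustration.

```python
from collections import deque


def partial_snapshot_group(initiator, exchanged):
    """Return the nodes transitively linked to `initiator` by message exchange.

    `exchanged` is a hypothetical log of (sender, receiver) pairs recorded
    since the last checkpoint. This is an illustration of the subsystem idea,
    not the paper's protocol.
    """
    # Build an undirected adjacency map: a message relates both endpoints.
    neighbors = {}
    for sender, receiver in exchanged:
        neighbors.setdefault(sender, set()).add(receiver)
        neighbors.setdefault(receiver, set()).add(sender)

    # Breadth-first search outward from the initiator.
    group = {initiator}
    queue = deque([initiator])
    while queue:
        node = queue.popleft()
        for peer in neighbors.get(node, ()):
            if peer not in group:
                group.add(peer)
                queue.append(peer)
    return group


messages = [("a", "b"), ("b", "c"), ("d", "e")]
print(sorted(partial_snapshot_group("a", messages)))  # ['a', 'b', 'c']
```

In this example nodes d and e never communicated with a, b, or c, so a snapshot initiated by a would not need to involve them. That is the source of the communication savings over a global snapshot.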

Publication: IEICE TRANSACTIONS on Information and Systems Vol.E97-D No.1 pp.65-76

Publication Date: 2014/01/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.E97.D.65

Type of Manuscript: PAPER

Category: Dependable Computing

Authors

Yonghwan KIM
  Osaka University
Tadashi ARARAGI
  NTT Corporation
Junya NAKAMURA
  Osaka University
Toshimitsu MASUZAWA
  Osaka University

Keyword

fault-tolerance, large-scale distributed system, concurrent snapshot, checkpoint, rollback

Cite this

Yonghwan KIM, Tadashi ARARAGI, Junya NAKAMURA, Toshimitsu MASUZAWA, "A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems," in IEICE TRANSACTIONS on Information and Systems, vol. E97-D, no. 1, pp. 65-76, January 2014, doi: 10.1587/transinf.E97.D.65.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E97.D.65/_p

@ARTICLE{e97-d_1_65,
author={Yonghwan KIM and Tadashi ARARAGI and Junya NAKAMURA and Toshimitsu MASUZAWA},
journal={IEICE TRANSACTIONS on Information and Systems},
title={A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems},
year={2014},
volume={E97-D},
number={1},
pages={65--76},
abstract={Checkpoint-rollback recovery, which is a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm especially if the systems are large-scale, since repeatedly taking global snapshots of the whole system requires unacceptable communication cost. As a sophisticated snapshot algorithm, a partial snapshot algorithm has been introduced that takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator instead of a global snapshot of the whole system. In this paper, we modify the previous partial snapshot algorithm to create a new one that can take a partial snapshot more efficiently, especially when multiple nodes concurrently initiate the algorithm. Experiments show that the proposed algorithm greatly reduces the amount of communication needed for taking partial snapshots.},
keywords={fault-tolerance, large-scale distributed system, concurrent snapshot, checkpoint, rollback},
doi={10.1587/transinf.E97.D.65},
ISSN={1745-1361},
month={January},}

TY - JOUR
TI - A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems
T2 - IEICE TRANSACTIONS on Information and Systems
SP - 65
EP - 76
AU - Yonghwan KIM
AU - Tadashi ARARAGI
AU - Junya NAKAMURA
AU - Toshimitsu MASUZAWA
PY - 2014
DO - 10.1587/transinf.E97.D.65
JO - IEICE TRANSACTIONS on Information and Systems
SN - 1745-1361
VL - E97-D
IS - 1
JA - IEICE TRANSACTIONS on Information and Systems
Y1 - January 2014
AB - Checkpoint-rollback recovery, which is a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm especially if the systems are large-scale, since repeatedly taking global snapshots of the whole system requires unacceptable communication cost. As a sophisticated snapshot algorithm, a partial snapshot algorithm has been introduced that takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator instead of a global snapshot of the whole system. In this paper, we modify the previous partial snapshot algorithm to create a new one that can take a partial snapshot more efficiently, especially when multiple nodes concurrently initiate the algorithm. Experiments show that the proposed algorithm greatly reduces the amount of communication needed for taking partial snapshots.
ER -