Checkpoint-rollback recovery, a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm, especially in large-scale systems, where repeatedly taking global snapshots of the whole system incurs unacceptable communication cost. One such algorithm is the partial snapshot algorithm, which takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator, instead of a global snapshot of the whole system. In this paper, we modify the previous partial snapshot algorithm to obtain a new one that takes partial snapshots more efficiently, especially when multiple nodes initiate the algorithm concurrently. Experiments show that the proposed algorithm greatly reduces the amount of communication needed to take partial snapshots.
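The idea of restricting a snapshot to the nodes that are communication-related to the initiator can be illustrated with a small sketch. This is not the algorithm from the paper: the abstract does not define "communication-related", so the sketch assumes it means reachability over message exchanges recorded since the last checkpoint. The function name `partial_snapshot_group` and the `exchanged` list are hypothetical and chosen only for illustration.

```python
from collections import deque


def partial_snapshot_group(initiator, exchanged):
    """Return the set of nodes a partial snapshot would cover.

    ASSUMPTION: two nodes are "communication-related" if they are linked,
    directly or transitively, by a recorded message exchange. The paper
    may define the relation differently.
    """
    # Build an undirected adjacency map from the recorded exchanges.
    neighbors = {}
    for a, b in exchanged:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Breadth-first search outward from the initiator.
    group = {initiator}
    queue = deque([initiator])
    while queue:
        node = queue.popleft()
        for peer in neighbors.get(node, ()):
            if peer not in group:
                group.add(peer)
                queue.append(peer)
    return group


# Hypothetical exchange log: A-B and B-C have communicated, D-E separately.
exchanged = [("A", "B"), ("B", "C"), ("D", "E")]
print(sorted(partial_snapshot_group("A", exchanged)))
```

Under this assumption, a snapshot initiated at `A` involves only `A`, `B`, and `C`, and nodes `D` and `E` are left undisturbed. Initiators `A` and `C` would compute the same group, which hints at why concurrent initiations can duplicate work; how the proposed algorithm coordinates them is not described in the abstract.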
Yonghwan KIM
Osaka University
Tadashi ARARAGI
NTT Corporation
Junya NAKAMURA
Osaka University
Toshimitsu MASUZAWA
Osaka University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Yonghwan KIM, Tadashi ARARAGI, Junya NAKAMURA, Toshimitsu MASUZAWA, "A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems," in IEICE TRANSACTIONS on Information, vol. E97-D, no. 1, pp. 65-76, January 2014, doi: 10.1587/transinf.E97.D.65.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E97.D.65/_p
@ARTICLE{e97-d_1_65,
author={Yonghwan KIM and Tadashi ARARAGI and Junya NAKAMURA and Toshimitsu MASUZAWA},
journal={IEICE TRANSACTIONS on Information},
title={A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems},
year={2014},
volume={E97-D},
number={1},
pages={65-76},
keywords={},
doi={10.1587/transinf.E97.D.65},
ISSN={1745-1361},
month={January},}
TY - JOUR
TI - A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems
T2 - IEICE TRANSACTIONS on Information
SP - 65
EP - 76
AU - Yonghwan KIM
AU - Tadashi ARARAGI
AU - Junya NAKAMURA
AU - Toshimitsu MASUZAWA
PY - 2014
DO - 10.1587/transinf.E97.D.65
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E97-D
IS - 1
JA - IEICE TRANSACTIONS on Information
Y1 - January 2014
ER -