Fault tolerance is an essential feature of distributed systems, where the probability of failure increases as the system grows. Despite extensive research over two decades, fault-tolerant systems have not succeeded in practical use, owing to the high overhead and inconvenience of previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and practice of fault-tolerant systems and to provide a checkpointing-recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source code, and it affects the MPICH communication characteristics as little as possible. Its distinguishing features are that it supports the direct message transfer mode and that the entire implementation resides at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Namyoon WOO, Hyungsoo JUNG, Heon Young YEOM, Taesoon PARK, Hyungwoo PARK, "MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes" in IEICE TRANSACTIONS on Information,
vol. E87-D, no. 7, pp. 1820-1828, July 2004.
Abstract: Fault tolerance is an essential feature of distributed systems, where the probability of failure increases as the system grows. Despite extensive research over two decades, fault-tolerant systems have not succeeded in practical use, owing to the high overhead and inconvenience of previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and practice of fault-tolerant systems and to provide a checkpointing-recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source code, and it affects the MPICH communication characteristics as little as possible. Its distinguishing features are that it supports the direct message transfer mode and that the entire implementation resides at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.
URL: https://global.ieice.org/en_transactions/information/10.1587/e87-d_7_1820/_p
@ARTICLE{e87-d_7_1820,
author={Namyoon WOO and Hyungsoo JUNG and Heon Young YEOM and Taesoon PARK and Hyungwoo PARK},
journal={IEICE TRANSACTIONS on Information},
title={MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes},
year={2004},
volume={E87-D},
number={7},
pages={1820-1828},
abstract={Fault tolerance is an essential feature of distributed systems, where the probability of failure increases as the system grows. Despite extensive research over two decades, fault-tolerant systems have not succeeded in practical use, owing to the high overhead and inconvenience of previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and practice of fault-tolerant systems and to provide a checkpointing-recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source code, and it affects the MPICH communication characteristics as little as possible. Its distinguishing features are that it supports the direct message transfer mode and that the entire implementation resides at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.},
month={July},
}
TY - JOUR
TI - MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes
T2 - IEICE TRANSACTIONS on Information
SP - 1820
EP - 1828
AU - Namyoon WOO
AU - Hyungsoo JUNG
AU - Heon Young YEOM
AU - Taesoon PARK
AU - Hyungwoo PARK
PY - 2004
JO - IEICE TRANSACTIONS on Information
VL - E87-D
IS - 7
JA - IEICE TRANSACTIONS on Information
Y1 - July 2004
AB - Fault tolerance is an essential feature of distributed systems, where the probability of failure increases as the system grows. Despite extensive research over two decades, fault-tolerant systems have not succeeded in practical use, owing to the high overhead and inconvenience of previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and practice of fault-tolerant systems and to provide a checkpointing-recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source code, and it affects the MPICH communication characteristics as little as possible. Its distinguishing features are that it supports the direct message transfer mode and that the entire implementation resides at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.
ER -