MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes

Namyoon WOO, Hyungsoo JUNG, Heon Young YEOM, Taesoon PARK, Hyungwoo PARK

Summary:

Fault tolerance is an essential feature of distributed systems, where the probability of failure increases as the system grows. Despite extensive research over two decades, fault-tolerant systems have seen little practical use, owing to the high overhead and unwieldiness of earlier designs. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to close the gap between the theory and practice of fault-tolerant systems and to provide a checkpointing and recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source code and affects the communication characteristics of MPICH as little as possible. Its distinguishing features are that it supports the direct message transfer mode and that the entire implementation resides at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.

Publication
IEICE TRANSACTIONS on Information Vol.E87-D No.7 pp.1820-1828
Publication Date
2004/07/01
Type of Manuscript
Special Section PAPER (Special Section on Hardware/Software Support for High Performance Scientific and Engineering Computing)
Category
Distributed, Grid and P2P Computing
