  • MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes

    Namyoon WOO  Hyungsoo JUNG  Heon Young YEOM  Taesoon PARK  Hyungwoo PARK  

     
    PAPER-Distributed, Grid and P2P Computing

    Vol: E87-D  No: 7  Page(s): 1820-1828

    Fault tolerance is an essential feature of distributed systems, where the probability of failure increases as the system grows. Despite extensive research over two decades, fault-tolerant systems have not been widely adopted in practice, owing to the high overhead and unwieldiness of earlier fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to bridge the gap between the theory and practice of fault-tolerant systems and to provide a checkpointing and recovery system for grids. To build a fault-tolerant version of MPICH, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source code, and it affects the communication characteristics of MPICH as little as possible. MPICH-GF supports the direct message transfer mode, and the entire implementation resides at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.