IEICE TRANSACTIONS on Information

WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs

Xinhai XU, Xuejun YANG, Yufei LIN

Summary:

As supercomputers grow in size, the mean time between failures (MTBF) of a system becomes shorter, and reliability becomes an increasingly serious problem. MPI is currently the de facto standard for building high-performance applications, and fault-tolerance methods for MPI remain an active research topic. However, because of the characteristics of MPI programs, most current checkpointing methods for MPI programs must either modify the MPI library (or even the operating system) or implement a complicated protocol that logs many messages. In this paper, we carry forward the idea of Application-Level Checkpointing (ALC). Based on the general fact that programmers are familiar with the communication characteristics of their applications, we have developed BC-ALC, a new portable blocking coordinated ALC for MPI programs. BC-ALC neither modifies the MPI library (or the operating system) nor logs any messages. It implements coordination using only Barrier operations rather than a complicated protocol. Furthermore, to reduce the cost of fault tolerance, we narrow the synchronization range of the barrier and design WBC-ALC, a weak blocking coordinated ALC that uses group synchronization instead of global synchronization, based on the communication relationships between processes. We also propose a fault-tolerance framework built on top of WBC-ALC and discuss its implementation. Experimental results on the NPB3.3-MPI benchmarks validate BC-ALC and WBC-ALC and show that, compared with BC-ALC, WBC-ALC reduces the average coordination time and the average backup time of a single checkpoint by 44.5% and 5.7%, respectively.
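
The group-synchronization idea in the summary can be illustrated with a small simulation. The sketch below is not the paper's implementation and does not use MPI: it models each process as a Python thread and each communication group as a `threading.Barrier`, which stands in (as an assumption) for a barrier on an MPI sub-communicator. The function name `run_group_checkpoint`, the toy state, and the group layout are all hypothetical. It also assumes the checkpoint is placed at a point where no messages inside a group are in flight, which is the kind of knowledge the summary expects the programmer to supply.

```python
import threading


def run_group_checkpoint(groups):
    """Simulate one checkpoint with group-level (weak blocking) coordination.

    groups: list of lists of rank ids. Ranks in the same list are assumed to
    communicate with each other, so they synchronize with each other only.
    Every rank must appear in exactly one group.
    """
    # One barrier per group, shared by all ranks in that group.
    barrier_of = {}
    for members in groups:
        barrier = threading.Barrier(len(members))
        for rank in members:
            barrier_of[rank] = barrier

    saved = {}                 # stands in for stable storage
    lock = threading.Lock()    # protects the shared dict

    def worker(rank):
        state = {"rank": rank, "iteration": 10}  # toy application state
        barrier_of[rank].wait()        # wait only for ranks in the same group
        with lock:
            saved[rank] = dict(state)  # "write" this rank's checkpoint
        barrier_of[rank].wait()        # group leaves the checkpoint together

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for members in groups
        for rank in members
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return saved


if __name__ == "__main__":
    # Two independent groups: ranks 0-1 never wait for ranks 2-3.
    checkpoints = run_group_checkpoint([[0, 1], [2, 3]])
    print(sorted(checkpoints))
```

Passing a single group that contains every rank, for example `run_group_checkpoint([[0, 1, 2, 3]])`, reduces the sketch to global barrier coordination, which corresponds to the BC-ALC style described above.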

Publication
IEICE TRANSACTIONS on Information Vol.E95-D No.3 pp.786-796
Publication Date
2012/03/01
Publicized
Online ISSN
1745-1361
DOI
10.1587/transinf.E95.D.786
Type of Manuscript
PAPER
Category
Computer System

Authors

Keyword