This work presents two novel algorithms to prevent rollback propagation for independent checkpointing: an efficient adaptive independent checkpointing algorithm and an optimized adaptive independent checkpointing algorithm. The last opportunity strategy that yields a better performance than the conservation strategy is also employed to prevent useless checkpoints for both causal rewinding paths and non-causal rewinding paths. The two methods proposed herein are domino effect-free and require only a limited amount of control information. They also take less unnecessary adaptive checkpoints than other algorithms. Furthermore, experimental results indicate that the checkpoint overhead of our techniques is lower than that of the coordinated checkpointing and domino effect-free algorithms for service-providing applications.
Tadashi DOHI Kouji NOMURA Naoto KAIO Shunji OSAKI
This paper considers two simulation models for simple unreliable file systems with checkpointing and rollback recovery. In Model 1, the checkpoint is generated at a pre-specified time and the information on the main memory since the last checkpoint is back-uped in a secondary medium. On the other hand, in Model 2, the checkpointing is executed at the time when the number of transactions completed for processing is achieved at a pre-determined level. However, it is difficult to treat such models analytically without employing any approximation method, if queueing effects related with arrival and processing of transactions can not be ignored. We apply the generalized stochastic Petri net (GSPN) to represent the stochastic behaviour of systems under two checkpointing schemes. Throughout GSPN simulation, we evaluate quantitatively the maintainability of checkpoint models under consideration and examine the dependence of model parameters in the optimal checkpoint policies and their associated system availabilities.
Adel CHERIF Masato SUZUKI Takuya KATAYAMA
We present a novel replication technique for parallel applications where instances of the replicated application are active on different group of processors called replicas. The replication technique is based on the FTAG (Fault Tolerant Attribute Grammar) computation model. FTAG is a functional and attribute based model. The developed replication technique implements "active parallel replication," that is, all replicas are active and compute concurrently a different piece of the application parallel code. In our model replicas cooperate not only to detect and mask failures but also to perform parallel computation. The replication mechanisms are supported by FTAG run time system and are fully application-transparent. Different novel mechanisms for checkpointing and recovery are developed. In our model during rollback recovery only that part of the computation that was detected faulty is discarded. The replication technique takes full advantage of parallel computing to reduce overall computation time.
Masanori ODAGIRI Naoto KAIO Shunji OSAKI
Checkpointing is one of the most powerful tools to operate a computer system with high reliability. We should execute the optimal checkpointing in some sense. This note shows the optimal checkpoint sequence minimizing the expected loss, Numerical examples are shown for illustration.