All existing sender-based message logging (SBML) protocols share a well-known limitation: they cannot tolerate concurrent failures. In this paper, we analyze the cause of this limitation in a unicast network environment and present an enhanced SBML protocol that overcomes this shortcoming while preserving the strengths of SBML. When processes on different nodes jointly execute a distributed application over a broadcast network, the new protocol replicates the log information of each message to the volatile storage of the other processes in the same broadcast network. It can reduce the communication overhead of log replication by taking advantage of the broadcast nature of the network. Simulation results show that our protocol outperforms a traditional protocol modified to tolerate concurrent failures in terms of failure-free execution time, regardless of the communication pattern of the distributed application.
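The abstract above does not spell out the protocol's data structures, so the following Python sketch only illustrates the core idea under stated assumptions: a sender records each outgoing message in its volatile log, and because the medium is a broadcast network, every other process on the same segment overhears the frame and keeps a replica of the log entry, so a crashed process's messages can be replayed even when their senders have also failed. The class names, log-entry fields, and recovery routine are all hypothetical, and the receive-sequence bookkeeping of full SBML is omitted.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LogEntry:
    sender: int     # id of the sending process
    seq: int        # send sequence number at the sender
    receiver: int   # id of the destination process
    payload: str    # message contents

@dataclass
class Process:
    pid: int
    send_seq: int = 0
    # volatile log: entries for messages this process sent or overheard
    volatile_log: list = field(default_factory=list)
    delivered: list = field(default_factory=list)

class BroadcastSegment:
    """All processes attached to the same broadcast network (illustrative)."""
    def __init__(self, pids):
        self.procs = {pid: Process(pid) for pid in pids}

    def send(self, src, dst, payload):
        p = self.procs[src]
        entry = LogEntry(src, p.send_seq, dst, payload)
        p.send_seq += 1
        # Every process on the segment overhears the broadcast frame,
        # so the log entry is replicated in their volatile storage too.
        for q in self.procs.values():
            q.volatile_log.append(entry)
        self.procs[dst].delivered.append(entry)

    def recover(self, failed_pid, survivor_pid):
        """Rebuild the delivery history of a crashed process from any
        surviving replica, even if the original senders also crashed."""
        survivor = self.procs[survivor_pid]
        return sorted((e for e in survivor.volatile_log if e.receiver == failed_pid),
                      key=lambda e: (e.sender, e.seq))

seg = BroadcastSegment([0, 1, 2])
seg.send(0, 1, "m1")
seg.send(2, 1, "m2")
# Processes 0 and 1 fail concurrently; process 2 still holds replicas
# of both log entries, so process 1's messages can be replayed.
print(seg.recover(failed_pid=1, survivor_pid=2))
```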
MaengSoon BAIK SungJin CHOI ChongSun HWANG JoonMin GIL ChanYeol PARK HeonChang YOO
Optimistic log-based rollback recovery protocols have been regarded as an attractive fault-tolerant solution for distributed systems based on the message-passing paradigm because of their low overhead during failure-free execution. These protocols rely on the Piecewise Deterministic (PWD) assumption and assume that all logged non-deterministic events in a consistent global recovery line can be deterministically replayed at recovery time. In this paper, we show that deterministic replay of the logged non-deterministic events in a consistent global recovery line is not always achievable, a difficulty we call the Ω Line Problem, because of the asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays, and no global time source. In addition, we propose a new optimistic log-based rollback recovery protocol that guarantees deterministic replay of all logged non-deterministic events belonging to a consistent global recovery line and thereby solves the Ω Line Problem at recovery time.
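As background for the abstract above, the short Python sketch below illustrates what deterministic replay of logged non-deterministic events means under the PWD assumption: if the delivery order of messages (the only source of non-determinism) is logged, re-applying the log reproduces the pre-failure state. It is a toy illustration with hypothetical names, not the proposed protocol, and it presumes every determinant on the recovery line is available, which is exactly what the Ω Line Problem calls into question.

```python
# Illustration (not the paper's protocol): under the PWD assumption, the only
# non-deterministic events are message deliveries, so logging the delivery
# order (the "determinants") is enough to replay a process deterministically.

def run(initial_state, deliveries, log=None):
    """Apply deliveries to a simple counter-like state; optionally record
    each delivery's determinant (sender, seq, value) in consumption order."""
    state = dict(initial_state)
    for sender, seq, value in deliveries:
        state["sum"] = state.get("sum", 0) + value
        state["last"] = (sender, seq)
        if log is not None:
            log.append((sender, seq, value))
    return state

original_log = []
before_crash = run({}, [("p1", 0, 5), ("p2", 0, 7)], log=original_log)

# After a failure, replaying the logged determinants in the logged order
# reproduces exactly the same state, provided every determinant on the
# recovery line is actually available (the crux of the Omega Line Problem).
replayed = run({}, original_log)
assert replayed == before_crash
print(replayed)
```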
Hyochang NAM Jong KIM Sung Je HONG Sunggu LEE
For checkpointing to be practical, it has to introduce low overhead for the target application. As a means of reducing the overhead of checkpointing, this paper proposes a probabilistic checkpointing method that uses block encoding to detect the memory area modified between two consecutive checkpoints. Since the proposed technique relies on block encoding to detect the modified area, aliasing may occur in the encoded words. However, this paper shows that the aliasing probability is near zero when an 8-byte encoded word is used. The performance of the proposed technique is analyzed and measured through experiments. An analytic model that predicts the checkpointing overhead is first constructed. Using this model, the block size that produces the best performance for a given target program is estimated. In most cases, medium block sizes, i.e., 128 or 256 bytes, show the best performance. The proposed technique has also been implemented on Unix-based systems, and its performance has been measured in real environments. According to the experimental results, the proposed technique reduces the overhead by 11.7% in the best case and increases it by 0.5% in the worst case in comparison with page-based incremental checkpointing.
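The paper's exact encoding function is not given in the abstract, so the sketch below uses an 8-byte BLAKE2b digest as a stand-in encoded word to show how block encoding detects the memory area modified between two checkpoints; the block size, function names, and parameter values are illustrative assumptions.

```python
import hashlib

BLOCK_SIZE = 128  # bytes; the abstract reports 128- or 256-byte blocks perform best

def encode_blocks(memory: bytes, block_size: int = BLOCK_SIZE):
    """Return one 8-byte encoded word per block (stand-in encoding)."""
    return [hashlib.blake2b(memory[i:i + block_size], digest_size=8).digest()
            for i in range(0, len(memory), block_size)]

def modified_blocks(prev_codes, memory, block_size=BLOCK_SIZE):
    """Indices of blocks whose encoded word changed since the last checkpoint.
    A modified block whose new code happens to equal the old one is 'aliased'
    and would be missed; for a well-mixed 8-byte code that chance is roughly 2**-64."""
    cur_codes = encode_blocks(memory, block_size)
    dirty = [i for i, (a, b) in enumerate(zip(prev_codes, cur_codes)) if a != b]
    return dirty, cur_codes

mem = bytearray(1024)
codes = encode_blocks(bytes(mem))   # encodings taken at checkpoint k
mem[5] = 0xFF                       # application dirties block 0
mem[300] = 0x01                     # ... and block 2
dirty, codes = modified_blocks(codes, bytes(mem))
print(dirty)                        # -> [0, 2]; only these blocks are saved
```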
Tadashi DOHI Kouji NOMURA Naoto KAIO Shunji OSAKI
This paper considers two simulation models for simple unreliable file systems with checkpointing and rollback recovery. In Model 1, a checkpoint is generated at a pre-specified time and the information in main memory since the last checkpoint is backed up to a secondary medium. In Model 2, on the other hand, a checkpoint is taken when the number of completed transactions reaches a pre-determined level. Such models are difficult to treat analytically without approximation, however, if the queueing effects associated with the arrival and processing of transactions cannot be ignored. We apply the generalized stochastic Petri net (GSPN) to represent the stochastic behaviour of systems under the two checkpointing schemes. Through GSPN simulation, we quantitatively evaluate the maintainability of the checkpoint models under consideration and examine how the model parameters affect the optimal checkpoint policies and the associated system availabilities.
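The paper models these policies with a GSPN; the plain-Python sketch below is not a Petri net but only contrasts the two checkpoint-triggering rules (time-based in Model 1, transaction-count-based in Model 2) under hypothetical exponential arrival and service times, with queueing, failures, and rollback deliberately left out.

```python
import random

def simulate(policy, horizon=1_000.0, mean_interarrival=1.0, mean_service=0.5,
             ckpt_interval=50.0, ckpt_count=40, seed=1):
    """Count checkpoints under the two triggering rules:
    'time'  : Model 1, checkpoint every ckpt_interval time units;
    'count' : Model 2, checkpoint after every ckpt_count completed transactions.
    Queueing, failures, and rollback are deliberately omitted."""
    rng = random.Random(seed)
    t, completed, checkpoints, next_ckpt = 0.0, 0, 0, ckpt_interval
    while t < horizon:
        t += rng.expovariate(1.0 / mean_interarrival)  # next transaction arrives
        t += rng.expovariate(1.0 / mean_service)       # ... and is processed
        completed += 1
        if policy == "time":
            while t >= next_ckpt:
                checkpoints += 1
                next_ckpt += ckpt_interval
        elif policy == "count" and completed % ckpt_count == 0:
            checkpoints += 1
    return checkpoints

print("Model 1 (time-triggered): ", simulate("time"))
print("Model 2 (count-triggered):", simulate("count"))
```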
Tadashi DOHI Takashi AOKI Naoto KAIO Shunji OSAKI
This paper considers a probabilistic model for a database recovery action with checkpoint generation, where system failures occur according to a renewal process whose renewal density depends on the cumulative operation period since the last checkpoint. Necessary and sufficient conditions for the existence of the optimal checkpoint interval that maximizes the ergodic availability are derived analytically, and solvable examples are given for well-known failure-time distributions. Further, several methods needed for numerical calculation are proposed for the case where the information on system failures is insufficient. We use four analytically tractable approximation methods to calculate the optimal checkpoint schedule. Finally, numerical comparisons show that the gamma approximation method yields the most precise approximate solution.
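The abstract does not reproduce the availability function or the gamma approximation itself, so the sketch below only illustrates the kind of optimization involved, using a standard first-order renewal-reward model (which reduces to Young's square-root rule) with hypothetical parameter values rather than the paper's model.

```python
import math

# Hypothetical parameters: checkpoint cost c, recovery time r,
# and failure rate lam (failures per unit of operation time).
c, r, lam = 5.0, 20.0, 1.0 / 500.0

def availability(T):
    """First-order renewal-reward model (not the paper's exact model):
    per T units of useful work we pay the checkpoint cost c, plus an
    expected lam*T failures, each costing recovery r and about T/2 of
    redone work since the last checkpoint."""
    return T / (T + c + lam * T * (r + T / 2.0))

# Coarse numerical search for the interval that maximizes availability.
Ts = [0.5 * k for k in range(1, 4001)]
T_num = max(Ts, key=availability)

# For this particular model the optimum has the familiar closed form
# T* = sqrt(2c / lam), i.e. Young's approximation.
T_closed = math.sqrt(2.0 * c / lam)

print(f"numerical optimum  ~ {T_num:.1f}, availability {availability(T_num):.4f}")
print(f"closed-form optimum = {T_closed:.1f}")
```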