MaengSoon BAIK SungJin CHOI ChongSun HWANG JoonMin GIL ChanYeol PARK HeonChang YOO
Optimistic log-based rollback recovery protocols have been regarded as an attractive fault-tolerant solution in distributed systems based on message-passing paradigm due to low overhead in failure-free time. These protocols are based on a Piecewise Deterministic (PWD) Assumption model. They, however, assumed that all logged non-deterministic events in a consistent global recovery line must be determinately replayed in recovery time. In this paper, we give the impossibility of deterministic replaying of logged non-deterministic event in a consistent global recovery line as a Ω Line Problem, because of asynchronous properties of distributed systems: no bound on the relative speeds of processes, no bound on message transmission delays and no global time source. In addition, we propose a new optimistic log-based rollback recovery protocol, which guarantees the deterministic replaying of all logged non-deterministic events belonged in a consistent global recovery line and solves a Ω Line Problem in recovery time.
Namyoon WOO Hyungsoo JUNG Heon Young YEOM Taesoon PARK Hyungwoo PARK
Fault-tolerance is an essential feature of the distributed systems where the possibility of a failure increases with the growth of the system. In spite of extensive researches over two decades, fault-tolerance systems have not succeeded in practical use. It is due to the high overhead and the unhandiness of the previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and the practice of fault-tolerance systems, and to provide a checkpointing-recovery system for grids. To build a fault-tolerant MPICH version, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source codes, and it affects the MPICH communication characteristics as less as possible. The features of MPICH-GF are that it supports the direct message transfer mode and that all of the implementation has been done at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.
Because it has desirable features such as no cascading rollback, fast output commit and asynchronous logging, causal message logging needs a consistent recovery algorithm to tolerate concurrent failures. For this purpose, Elnozahy proposed a centralized recovery algorithm to have two practical benefits, i.e. reducing the number of stable storage accesses and imposing no restriction on the execution of live processes during recovery. However, the algorithm with independent checkpointing may force the system to be in an inconsistent state when processes fail concurrently. In this paper, we identify these inconsistent cases and then present a recovery algorithm to have the two benefits and ensure the system consistency when integrated with any kind of checkpointing protocol. Also, our algorithm requires no additional message compared with Elnozahy's algorithm.
In this paper we study a classical firing squad synchronization problem on a model of fault-tolerant cellular automata that have possibly some defective cells. Several fault-tolerant time-efficient synchronization algorithms are developed based on a simple freezing-thawing technique. It is shown that, under some constraints on the distribution of defective cells, any cellular array of length n with p defective cell segments can be synchronized in 2n - 2 + p steps.
Determining consistent global checkpoints of a distributed computation has applications in the areas such as rollback recovery, distributed debugging, output commit and others. Netzer and Xu introduced the notion of zigzag paths and presented necessary and sufficient conditions for a set of checkpoints to be part of a consistent global checkpoint. This result also reveals that determining the existence of zigzag paths between checkpoints is crucial for determining consistent global checkpoints. Recent research also reveals that determining zigzag paths on-line is not possible. In this paper, we present an off-line method for determining the existence of zigzag paths between checkpoints.
Tadayoshi HORITA Itsuo TAKANAMI
The E-1-track switch torus array model and the "EAR" reconfiguration method are proposed for fault tolerance of mesh or torus-connected processor arrays, where the original idea of EAR is in EAM. The comparison among these and others is described in terms of the (run-time) array reliability, hardware overhead, and/or reconfiguration time. When a designer chooses one among fault tolerant methods, he should consider their features synthetically case by case, and we consider that the results given by this paper are useful for the choice.
In this paper, we propose a fault-tolerance mechanism for microprocessors, which detects transient faults and recovers from them. The investigation of fault-tolerance techniques for microprocessors is driven by two issues: One regards deep submicron fabrication technologies. Future semiconductor technologies could become more susceptible to alpha particles and other cosmic radiation. The other is the increasing popularity of mobile platforms. Cellular telephones are currently used for applications which are critical to our financial security, such as mobile banking, mobile trading, and making airline ticket reservations. Such applications demand that computer systems work correctly. In light of this, we propose a mechanism which is based on an instruction reissue technique for incorrect data speculation recovery and utilizes time redundancy, and evaluate our proposal using a timing simulator.
Tae-Suh PARK Chong-Ho LEE Duck-Jin CHUNG
This paper presents an evolutionary technique to build and maintain fault-recoverable digital circuits. As the synthesis of a circuit by genetic algorithm is progressed according to the circuit behavioral objectives and interactions with the environments, the knowledge regarding the architecture as well as the placement and routing processes is not the major concern of the proposed method. The evolutionary behavior of the circuit also prevents the circuit from stuck-at faults by continuously modifying the neighboring circuit blocks accordingly. This is done without the prior knowledge of where and how the faults occur because of the evolutionary nature. Thus, the overhead circuit blocks for fault diagnosis and redundancy are minimized with this design. The fault-recoverable evolvable hardware circuits are synthesized to build a few combinational logics by evolution and the fault recovery capabilities are shown with the reconfigurable FPGA.
In this paper we develop a robust control theory to achieve fault-tolerant behaviors of timed discrete event systems (DESs) with model uncertainty represented as a set of some possible models. To demonstrate the effectiveness of the proposed theory, we provide a case study of a resistance spot welding process.
Hiroyoshi MATSUI Michiko INOUE Toshimitsu MASUZAWA Hideo FUJIWARA
We investigate possibility of fault-tolerant and self-stabilizing protocols (ftss protocols) using an unreliable failure detector. Our main contribution is (1) to newly introduce k-accuracy of an unreliable failure detector, (2) to show that k-accuracy of a failure detector is necessary for any ftss k-group consensus protocol, and (3) to present three ftss k-group consensus protocols using a k-accurate and weakly complete failure detector under the read/write daemon on complete networks and on (n-k+1)-connected networks, and under the central daemon on complete networks.
Hagbae KIM Sangmoon LEE Taewha HONG
The reconfiguration latency defined as the time taken for reconfiguring a system upon failure detection or mode change. We evaluate it quantitatively for backup sparing, which is one of the most popular reconfiguration methods, by investigating the effects of key parameters.
Toshimitsu MASUZAWA Michiko INOUE
Distributed computation has attracted considerable attention and large-scale distributed systems have been designed and developed. A distributed system inherently has possibility of fault tolerance because of its redundancy. Thus, a great deal of investigation has been made to design fault-tolerant distributed algorithms. This paper introduces two promising paradigms, self-stabilization and wait-freedom, for designing fault-tolerant distributed algorithms and discusses some subjects important from the point of view of algorithm engineering.
Leqiang BAI Hiroyuki EBARA Hideo NAKANO Hajime MAEDA
This paper introduces an adaptive distributed routing algorithm for the faulty star graph. The algorithm is based on that the n-star graph has uniform node degree n-1 and is n-1-connected. By giving two routing rules based on the properties of nodes, an optimal routing function for the fault-free star graph is presented. For a given destination in the n-star graph, n-1 node-disjoint and edge-disjoint subgraphs, which are derived from n-1 adjacent edges of the destination, can be constructed by this routing function and the concept of Breadth First Search. When faults are encountered, according to that there are n-1 node-disjoint paths between two arbitrary nodes, the algorithm can route messages to the destination by finding a fault-free subgraphs based on the local failure information (the status of all its incident edges). As long as the number f of faults (node faults and/or edge faults) is less than the degree n-1 of the n-star graph, the algorithm can adaptively find a path of length at most d+4f to route messages successfully from a source to a destination, where d is the distance between source and destination.
Pierre U. TAGLE Neeraj K. SHARMA
Multicasting is an important feature for any switching network being intended to support broadband integrated services digital networks (B-ISDN). This paper proposes an improved multicast packet switch based on Lee's nonblocking copy network. The improved design retains the desirable features of Lee's network including its nonblocking property while adopting techniques to overcome the various limitations mentioned in various literature. The proposed network architecture utilizes d-dilated banyan networks to increase the amount of cells that can be replicated within the copy network. Cell splitting is used to optimize the utilization of the network's available bandwidth. Furthermore, the proposed architecture allows for the modular expansion in capacity to accomodate changing traffic patterns. The modular design of the proposed switch likewise offers easy handling and replacement of faulty modules.
In supervisory control, discrete event dynamic systems (DEDSs) are modeled by finite-state automata, and their behaviors described by the associated formal languages; control is exercised by a supervisor, whose control action is to enable or disable the controllable events. In this paper we present a general stability concept for DEDSs, stability in the sense of Lyapunov with resiliency, by incorporating Lyapunov stability concepts with the concept of stability in the sense of error recovery. We also provide algorithms for verifying stability and obtaining a domain of attraction. Relations between the notion of stability and the notion of fault-tolerance are addressed.
Yukihiro HAMADA Feng BAO Aohan MEI Yoshihide IGARASHI
A directed graph G = (V,E) is called the n-rotator graph if V = {a1a2
Feng BAO Yoshihide IGARASHI Sabine R. OHRING
In this paper we analyze the reliability of a simple broadcasting scheme for hypercubes (HCCAST) with random faults. We prove that HCCAST (n) (HCCAST for the n-dimensional hypercube) can tolerate Θ(2n/n) random faulty nodes with a very high probability although it can tolerate only n - 1 faulty nodes in the worst case. By showing that most of the f-fault configurations of the n dimensional hypercube cannot make HCCAST (n) fail unless f is too large, we illustrate that hypercubes are inherently strong enough for tolerating random faults. For a realistic n, the reliability of HCCAST (n) is much better than that of the broadcasting algorithm described in [6] although the latter can asymptotically tolerate faulty links of a constant fraction of all the links. Finally, we compare the fault-tolerant performance of the two broadcasting schemes for n = 15, 16, 17, 18, 19, 20, and we find that for those practical valuse, HCCAST (n) is very reliable.
Feng BAO Yoshihide IGARASHI Keiko KATANO
We study all-to-all broadcasting in hypercubes with randomly distributed Byzantine faults. We construct an efficient broadcasting scheme BC1-n-cube running on the n-dimensional hypercube (n-cube for short) in 2n rounds, where for communication by each node of the n-cube, only one of its links is used in each round. The scheme BC1-n-cube can tolerate (n-1)/2 Byzantine faults of nodes and/or links in the worst case. If there are exactly f Byzantine faulty nodes randomly distributed in the n-cabe, BC1-n-cube succeeds with a probability higher than 1(64nf/2n) n/2. In other words, if 1/(64nk) of all the nodes(i.e., 2n/(64nk) nodes) fail in Byzantine manner randomly in the n-cube, then the scheme succeeds with a probability higher than 1kn/2. We also consider the case where all nodes are faultless but links may fail randomly in the n-cube. Broadcasting by BC1-n-cube is successful with a probability hig her than 1kn/2 provided that not more than 1/(64(n1)k) of all the links in the n-cube fail in Byzantine manner randomly. For the case where only links may fail, we give another broadcasting scheme BC2-n-cube which runs in 2n2 rounds. Broadcasting by BC2-n-cube is successful with a high probability if the number of Byzantine faulty links randomly distributed in the n-cube is not more than a constant fraction of the total number of links. That is, it succeeds with a probability higher than 1nkn/2 if 1/(48k) of all the links in the n-cube fail randomly in Byzantine manner.
Management of control functions in large computer networks is a very difficult problem. One of the effective way to overcome the difficulty is to introduce hierarchical control structure (network cluster) in the management. When a fault occurred in the cluster, routing information at some nodes in the network must be updated in order to react the fault. However, the number of such nodes can be reduced by introducing ingenious topology into the cluster. This paper presents a fundamental discussion on network topology for a network cluster. First, L-FT is defined to represent a degree of fault-tolerance in a cluster with respect to link failures. Secondly, the minimum link problem M is defined to find the minimum number of links to make the cluster L-FT. The following results are obtained. (1) For a network cluster with the fault-tolerant topology 1-FT, at least 2n-2 links have to exist in the cluster where n is the number of border nodes in the cluster. (2) As far as connectivity of the whole network is held, for multiple L link failures in a L-FT cluster, the update of routing information at each node is localized within only the cluster containing the failed links. (3) Several hierarchical networks with fault-tolerant conditions are presented as case studies for a LAN and a MAN.
Masaharu AKATSU Tomohiro MURATA Kenzo KURIHARA
This paper proposes the Total High Performance Time as a performance-related reliability measure in degradable/recoverable real-time systems. This measure reflects the effect of system behavior in pending states that are temporary states between the normal state and degraded states where the system operates in a degraded mode as a consequence of component failures. Such systems have to perform not only normal procedures but also error/recovery procedures in pending states, so the performance there is lower than that in the degraded states. In real-time systems, if performance is less than a lower limit, the response time for on-line transactions cannot meet the deadline. The consequences of failing to meet the deadline could be system failure. Therefore, the system reliability is affected significantly by whether the performance there is higher than the lower limit or not. A state where the level of performance is higher than the lower limit is called a High Performance State. We define the Total High Performance Time as the total time that the system spends operating in High Performance States. Moreover, this paper explains how to utilize the Total High Performance Time in system design. We model a method of controlling a system in pending states by using Extended Stochastic Petri Nets and obtain the characteristics necessary for evaluating the Total High Performance Time by analyzing the model. This approach is applied to a storage system that controls mirrored disks, and shown to be helpful for designing a method of controlling a system in pending states, which has been considered difficult because of the trade-off between performance and reliability.