Joaquín GRACIA Juan C. BARAZA Daniel GIL Pedro J. GIL
Nowadays, the use of dependable systems is becoming widespread, and diagnosis is an important step during their design. Diagnosis in the early phases of the design cycle saves time and money. Fault injection can be used during the design process of the system, and using Hardware Description Languages, particularly VHDL, it is possible to accomplish this early diagnosis. In recent years, the Time-Triggered Architecture (TTA) has emerged as a hard real-time fault-tolerant architecture for embedded systems. This novel architecture is gaining adherents mainly in the avionics and automotive industries (x-by-wire). The TTA implements a synchronous protocol with static scheduling that has been specifically targeted at hard real-time fault-tolerant distributed systems. In this work, we present a study of the VHDL model of a communication controller based on the TTA, on which a number of fault injection campaigns have been carried out. We discuss the results obtained and suggest some solutions to the problems detected.
Fault-tolerant execution of a mobile agent is an important design issue in building a reliable mobile agent system. Several fault-tolerant schemes for single-agent systems have been proposed; however, there has been little research on multi-agent systems. For cooperating mobile agents, fault-tolerant schemes should consider the inter-agent dependency as well as the mobility, and try to localize the effect of a failure. In this paper, we investigate properties of inter-agent dependency and agent mobility, and then characterize rollback propagation caused by the dependency and the mobility. We then suggest some schemes to localize rollback propagation.
Toshinori TAKABATAKE Masato KITAKAMI Hideo ITO
In interconnection networks, deadlock recovery has been studied as part of the routing strategy. Routing strategies for deadlock recovery are intended to optimize the routing performance when deadlocks do not occur. On the other hand, it is also important to improve the routing performance by handling deadlocks when they do occur. In this paper, a routing strategy for suspensive deadlock recovery, called escape-restoration routing, is proposed and its performance is evaluated. In the proposed technique, a small exclusive buffer (escape-buffer) is provided at each router for handling one of the deadlocked packets. The transmission of that packet is suspended by temporarily escaping it to the escape-buffer. After the other deadlocked packets have been sent, the suspended transmission resumes by restoring the escaped packet. Evaluation results show that, when handling deadlocks, the proposed technique improves the routing performance over that of previous recovery-based techniques.
Deogkyoo LEE Daekeun MOON Ilgu YUN Hagbae KIM
Since component faults occurring at arbitrary places (primarily on the links) seriously affect network performance and reliability, multicomputers operating in harsh environments should be designed to guarantee normal network missions in the presence of those faults. One solution to this end is a fault-tolerant routing scheme, which enables messages to safely reach their destinations by avoiding failed links when the transmission of messages is blocked by certain faults. In this paper, we develop a fault-tolerant routing algorithm with deadlock freedom in an n-dimensional mesh network, and validate its efficiency and effectiveness through simulations. Fault tolerance is achieved by adding partial adaptiveness and detouring to the e-cube algorithm, while using wormhole routing as the backbone routing method. The deadlock that could arise from this adaptiveness is eliminated by dividing each physical channel into a pair of virtual channels.
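As an illustration of the baseline that the abstract builds on, the following is a minimal sketch of plain e-cube (dimension-order) routing in an n-dimensional mesh. It does not include the partial adaptiveness, detouring, or virtual channels of the proposed algorithm; the function name `ecube_route` and the tuple representation of nodes are assumptions made only for this example.

```python
from typing import List, Tuple

Node = Tuple[int, ...]  # hypothetical representation: one coordinate per dimension


def ecube_route(src: Node, dst: Node) -> List[Node]:
    """Return the dimension-order (e-cube) path from src to dst in a mesh.

    The offset in dimension 0 is corrected first, then dimension 1, and so on.
    No faults are considered here: this is only the fault-free baseline.
    """
    if len(src) != len(dst):
        raise ValueError("src and dst must have the same number of dimensions")
    path = [src]
    cur = list(src)
    for dim in range(len(cur)):
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step          # move one hop along the current dimension
            path.append(tuple(cur))
    return path


if __name__ == "__main__":
    # Route in a 2D mesh: dimension 0 is fully corrected before dimension 1.
    print(ecube_route((0, 0), (2, 1)))
```

Because the dimensions are always corrected in the same fixed order, plain e-cube routing is deterministic, which is why some adaptiveness has to be added before a message can route around a failed link.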
Koji HASHIMOTO Tatsuhiro TSUCHIYA Tohru KIKUNO
In this paper, we propose a new scheduling algorithm to achieve fault tolerance in multiprocessor systems. This algorithm first partitions a parallel program into subsets of tasks, based on the notion of the height of a task graph. For each subset, the algorithm then duplicates and schedules the tasks in the subset successively. We prove that schedules obtained by the proposed algorithm can tolerate a single processor failure and show that the computational complexity of the algorithm is O(|V|^4), where V is the set of nodes of a task graph. We conduct simulations by applying the algorithm to two kinds of practical task graphs (Gaussian elimination and LU-decomposition). The results of these experiments show that fault tolerance can be achieved at the cost of a small degree of time redundancy, and that performance in the case of a processor failure is improved compared to a previous algorithm.
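The partitioning step can be pictured with a small sketch. It assumes that the height of a task is the length of the longest path from any entry task (a task with no predecessors) to it, which is a common definition but is not spelled out in the abstract. The function name `partition_by_height` and the successor-list input format are hypothetical, and the duplication and scheduling steps are not shown.

```python
from collections import defaultdict
from typing import Dict, List


def partition_by_height(succ: Dict[str, List[str]]) -> List[List[str]]:
    """Group the tasks of a DAG by height (assumed: longest path from an entry task)."""
    nodes = set(succ)
    for targets in succ.values():
        nodes.update(targets)

    indeg = {v: 0 for v in nodes}
    for targets in succ.values():
        for v in targets:
            indeg[v] += 1

    height = {v: 0 for v in nodes}
    ready = [v for v in nodes if indeg[v] == 0]   # entry tasks have height 0
    visited = 0
    while ready:                                  # topological traversal
        u = ready.pop()
        visited += 1
        for v in succ.get(u, []):
            height[v] = max(height[v], height[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(nodes):
        raise ValueError("task graph contains a cycle")

    levels = defaultdict(list)
    for v, h in height.items():
        levels[h].append(v)
    return [sorted(levels[h]) for h in sorted(levels)]


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(partition_by_height(graph))
```

Under this definition, tasks with the same height have no precedence relation between them, so each subset can be handled as a group of mutually independent tasks.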
One major recent breakthrough in the communication community is the extension of networking from wired to wireless networks. This has made it possible to create a mobile distributed computing environment and has brought several new challenges in distributed protocol design. Wireless networks have some fundamental differences from wired networks that need special attention, such as lower communication bandwidth, limited electrical power due to battery capacity, and mobility of processes. These new issues make traditional recovery algorithms unsuitable. In this paper, we propose an efficient algorithm with O(n_r) message complexity, where n_r is the total number of mobile hosts (MHs) related to the failed MH. In addition, these MHs only need to roll back once and can immediately resume their operation without waiting for any coordination message from other MHs. During normal operation, an application message needs O(1) additional information when it is transmitted between MHs and mobile support stations (MSSs). Each MSS must keep an n_total_h × n_cell_h dependency matrix, where n_total_h is the total number of MHs in the system and n_cell_h is the total number of MHs in its cell. Finally, the related issue of resending lost messages is also considered.
Many researchers have used hypercube interconnection networks to construct parallel processing systems because of their good properties. However, as the number of processors increases, the probability that faulty nodes occur also increases. Hence, for hypercube interconnection networks which have faulty nodes, several efficient dynamic routing algorithms have been proposed which allow each node to hold status information of its neighbor nodes. In this paper, we propose an improved version of the algorithm proposed by Chiu and Wu by introducing the notion of full reachability. A fully reachable node is a node that can reach all nonfaulty nodes which have Hamming distance l from the node via paths of length l. In addition, we further improve the algorithm by classifying the possibilities of detours with respect to each Hamming distance between current and target nodes. We propose an initialization procedure which makes use of an equivalent condition to perform this classification efficiently. Moreover, we conduct a simulation to measure the improvement ratio and to compare our algorithms with others. The simulation results show that the algorithms are effective when they are applied to low-dimensional hypercube interconnection networks.
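The definition of full reachability can be checked by brute force on small hypercubes. The sketch below assumes that the condition must hold for every Hamming distance l, and it is not the initialization procedure or the routing algorithm of the paper; the function name `is_fully_reachable` and the encoding of nodes as integers whose bits are the coordinates are assumptions for illustration.

```python
from collections import deque
from typing import Set


def is_fully_reachable(node: int, dim: int, faulty: Set[int]) -> bool:
    """Brute-force check of full reachability in a dim-dimensional hypercube.

    A path can never be shorter than the Hamming distance, so a path of
    length l to a node at Hamming distance l exists exactly when the
    shortest fault-free path has length l.
    """
    if node in faulty:
        raise ValueError("the source node must be nonfaulty")

    # Breadth-first search restricted to nonfaulty nodes.
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for bit in range(dim):
            v = u ^ (1 << bit)            # neighbor across dimension `bit`
            if v not in faulty and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)

    for v in range(1 << dim):
        if v == node or v in faulty:
            continue
        hamming = bin(node ^ v).count("1")
        if dist.get(v) != hamming:        # unreachable, or only via a detour
            return False
    return True


if __name__ == "__main__":
    print(is_fully_reachable(0, 3, {7}))
    print(is_fully_reachable(0, 3, {1, 2}))
```

This check costs time exponential in the dimension for each node, so it is only useful as a reference for small cases.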
Heejo LEE Jong KIM Sung Je HONG
In this paper, we present two process allocation schemes to tolerate multiple faults when the primary-backup replication method is used. The first scheme, called the multiple backup scheme, runs multiple backup processes for each process to tolerate multiple faults. The second scheme, called the regenerative backup scheme, runs only one backup process for each process, but regenerates backup processes for processes that have lost their backup after a fault occurrence, so that each primary-backup process pair stays available. For both schemes, we propose heuristic process allocation methods for balancing loads in spite of the occurrence of faults. We then evaluate and compare the performance of the proposed heuristic process allocation methods using simulation. Next, we analyze the reliability of the two schemes based on their fault-tolerance capability. For the analysis of fault-tolerance capability, we find the degree of fault tolerance for each scheme. We then find the reliability of each scheme using Markov chains. The comparison of the two schemes indicates that the regenerative single backup process allocation scheme is more suitable than the multiple backup allocation scheme.
A fault-tolerant wormhole routing algorithm on mesh-connected processors is proposed. The proposed algorithm is based on the solid fault model and allows the fault polygons to overlap. The algorithm compares the position of the fault region relative to the current channel with the fault direction field of a misrouted message to route around overlapped fault polygons. A node deactivating algorithm to convert a non-solid fault region into a solid fault region is also proposed. The proposed routing algorithm uses four virtual channels and is deadlock-free and livelock-free.
Boon Keat TAN Toru OGAWA Ryuji YOSHIMURA Kenji TANIGUCHI
This paper describes a new DSP processor architecture, which consists of an n × n mesh multiprocessor for digital signal processing. A prototype chip, RCDSP9701, has been designed and implemented using a 0.6 µm CMOS process. This architecture has better performance compared to the traditional microprocessor solution for digital signal processing. The proposed method possesses remarkable flexibility compared to the ASIC (Application-Specific Integrated Circuit) approach for digital signal processing applications. In addition, the proposed architecture is fault tolerant and suitable for parallel computing applications. In this paper, a silicon-chip implementation of the new architecture is presented to give a better understanding of our work.
Jinsoo KIM Ji-Yun KIM Hyunsoo YOON Seung Ryoul MAENG Jung Wan CHO
We propose a fault-tolerant routing algorithm for 2D meshes. Our routing algorithm can tolerate any number of concave fault regions. It is based on xy-routing and uses the concept of the fault ring/chain composed of fault-free elements surrounding faults. Three virtual channels per physical link are used for deadlock-free routing on a fault ring. Four virtual channels are needed for a fault chain. In previous algorithms, fault-free nodes in the concave region of a concave fault ring are deactivated to avoid deadlock, which results in excessive loss of computational power. Our algorithm ensures deadlock freedom by restricting the virtual channel usage in the concave region, and it minimizes the loss of computational power. We also extend the proposed routing scheme to adaptive fault-tolerant routing. The adaptive version requires the same number of virtual channels as the deterministic one.
For a given N-vertex graph H, a graph G obtained from H by adding t vertices and some edges is called a t-FT (t-fault-tolerant) graph for H if even after deleting any t vertices from G, the remaining graph contains H as a subgraph. For the n-dimensional cube Q(n) with N vertices, a t-FT graph with an optimal number O(tN + t^2) of added edges and maximum degree of O(N + t), and a t-FT graph with O(tN log N) added edges and maximum degree of O(t log N) have been known. In this paper, we introduce some t-FT graphs for Q(n) with an optimal number O(tN + t^2) of added edges and small maximum degree. In particular, we show a t-FT graph for Q(n) with 2ctN + ct^2((log N)/C)^C added edges and maximum degree of O(N/(log^(C/2) N)) + 4ct.
This paper presents a practical fault-tolerant architecture for mesh parallel machines that has t spare processors and has 2(t+2) communication links per processor while tolerating at most t+1 processor and link faults. We also show that the architecture presented here can be laid out efficiently in a linear area with wire length at most O(t).
Toshinori YAMADA Koji YAMAMOTO Shuichi UENO
Motivated by the design of fault-tolerant multiprocessor interconnection networks, this paper considers the following problem: Given a positive integer t and a graph H, construct a graph G from H by adding a minimum number Δ(t, H) of edges such that even after deleting any t edges from G the remaining graph contains H as a subgraph. We estimate Δ(t, H) for the hypercube and torus, which are well known as important interconnection networks for multiprocessor systems. If we denote the hypercube and the square torus on N vertices by Q_N and D_N respectively, we show, among others, that Δ(t, Q_N) = O(tN log(log N/t + log 2e)) for any t and N (t 2), and Δ(1, D_N) = N/2 for N even.
A new high-performance fault-tolerant ATM switching network is proposed. This network contains the baseline network and has many redundant switching elements to enhance the fault tolerance and throughput of conventional multistage interconnection networks. The presented routing algorithm is very simple and can support a very large number of paths between each input-output pair. The paths can be used to route cells when internal cell contentions occur in switching elements. The redundant switching elements at the last stage offer two access points to the output ports to resolve output conflicts. Performance analysis and simulation results show that this network has better maximum throughput even under faulty conditions. Among various networks, it has the largest number of redundant paths, and the greatest unit node contribution and unit edge contribution.
Seongbae EUN Seung Ryoul MAENG Jung Wan CHO
The integration of real-time systems and fault-tolerant systems has emerged as one of the greatest challenges of this decade. Such an integrated system is called a responsive system, whose objective is to optimize both timeliness and reliability. The performance measure in responsive systems is responsiveness, which expresses how probable it is that a system executes correctly and on time in the presence of faults. While there have been some achievements in communication protocols and specification, we believe that scheduling problems in responsive systems are not yet sufficiently understood. In this paper, we discuss the scheduling problem in responsive systems. First, we investigate the issues in scheduling and propose a precise definition of responsiveness. We also suggest a scheduling algorithm called Responsive Earliest Deadline First (REDF) for preemptive aperiodic tasks in a uniprocessor system. We show that REDF is optimal in obtaining the maximum responsiveness, and the time complexity is analyzed to be
Hiroshi MASUYAMA Tetsuo ICHIMORI
In this paper we estimate the number of permutations realizable in fault-tolerant multistage interconnection networks designed to tolerate faults on any switching element. The Parallel Omega network and the INDRA network are representative types of fault-tolerant multistage interconnection networks designed to tolerate a single fault. In order to evaluate how much the hardware redundancy provided for fault tolerance enhances the function of the network, we estimate the number of permutations realizable in fault-tolerant networks. This result enables us to set up a standard for evaluating the hardware redundancy required to tolerate multiple faults from the viewpoint of the enhancement of network function. This paper concludes that, when the number of inputs is up to 32, the increase ratio of the number of realizable permutations is no more than 1/0.73 even if tolerance to multiple faults is provided instead of tolerance to a single fault.
Takeshi KASUGA Michitaka KAMEYAMA Tatsuo HIGUCHI
Robust fault tolerance is the property that a computational result remains nearly equal to the correct one when faults occur in a digital system. There are many cases where the safety of digital control systems can be maintained if this property is satisfied. In this paper, robust-fault-tolerant three-valued arithmetic modules such as an adder and a multiplier are proposed. The positive and negative integers are represented by the number of 1's and -1's, respectively. The design concept of the arithmetic modules is that a fault has only a linearly additive effect of small value on the final result. Each arithmetic module consists of identical submodules linearly connected, so that a multi-stage structure is formed to generate the final output from the last submodule. Between the input and output digits in the submodule, a simple functional relation is satisfied with respect to the number of 1's and -1's. Moreover, the output digit value depends on a very small portion of the submodules including the input digits. These properties keep the effect on the final result linearly additive and small even if multiple faults occur at the inputs and outputs of any gates in the submodules. Not only the direct three-valued representation but also the use of three-valued logic circuits is inherently suitable for efficient implementation of the arithmetic VLSI system. The robust-fault-tolerant three-valued arithmetic modules are evaluated with regard to chip size and speed using the standard CMOS design rule. As a result, it is made clear that the chip size can be greatly reduced.
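The number representation described above can be illustrated with a short sketch. It assumes digits drawn from {-1, 0, 1}, with the value of a word equal to the number of 1's minus the number of -1's; it models only the representation and a single corrupted digit, not the proposed adder or multiplier circuits. The helper names `encode`, `decode`, and `inject_fault` are hypothetical.

```python
from typing import List


def encode(value: int, width: int) -> List[int]:
    """Represent value with |value| digits of its sign, padded with zeros."""
    if abs(value) > width:
        raise ValueError("value does not fit in the given number of digits")
    sign = 1 if value >= 0 else -1
    return [sign] * abs(value) + [0] * (width - abs(value))


def decode(digits: List[int]) -> int:
    """Value = (number of 1's) - (number of -1's)."""
    return digits.count(1) - digits.count(-1)


def inject_fault(digits: List[int], pos: int, wrong: int) -> List[int]:
    """Return a copy of digits with the digit at pos forced to a wrong value."""
    faulty = list(digits)
    faulty[pos] = wrong
    return faulty


if __name__ == "__main__":
    word = encode(5, 8)
    # Worst single-digit error: a 1 flipped to -1 shifts the value by 2.
    print(decode(word), decode(inject_fault(word, 0, -1)))
```

In this unweighted representation every digit contributes at most 1 in magnitude, so one corrupted digit shifts the decoded value by at most 2, in contrast to a weighted binary word where a fault in the most significant bit changes the value by half the range.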
An FENG Tohru KIKUNO Koji TORII
When a group of developers is involved in the distributed development of a software product, they must communicate with one another frequently to exchange information about the product. To reduce the cost of communication, the support environment should provide developers with the information they need and update that information automatically while the product is modified by developers. Furthermore, the environment must meet the following requirements despite workstation failures: whether a specific piece of information is correct should always be decidable, and as much information as possible should be updated correctly and efficiently. This paper presents a framework for constructing such a fault-tolerant environment based on attribute grammars. In the framework, a product is represented by an attributed tree, which is partitioned into several subtrees {T1, ..., Tm}. Attribute values in each subtree Ti (1 ≤ i ≤ m) express the information about the product required by a developer. We introduce a set of redundant data and algorithms to meet the fault-tolerance requirements mentioned above. The correctness of an attribute value in Ti can then be decided in O(m n0 log n) time, where n0n, and n is the number of attribute instances in Ti. All available attribute values can be updated with time complexity O(m^2 n1 log n) and communication complexity O(m^2), where n1 is the number of attribute instances that must be reevaluated.
Manfred J. PFLUEGL Douglas M. BLOUGH
Synchronous clocks are an essential requirement for a variety of distributed system applications. Many of these applications are safety-critical and require fault tolerance. In this paper, a general probabilistic clock synchronization model is presented. This model is uniformly probabilistic, incorporating random message delays, random clock drifts, and random fault occurrences. The model allows faults in any system component and of any type. Also, a new Sliding Window Clock Synchronization Algorithm (SWA) providing increased fault tolerance is proposed. The probabilistic model is used for an evaluation of SWA, which shows that SWA is capable of tolerating significantly more faults than other algorithms and that its synchronization tightness is as good as or better than that of other algorithms.