The previous communication-induced checkpointing may considerably induce worthless forced checkpoints because each process receiving messages cannot obtain sufficient information related to non-causal Z-paths. This paper presents an enhanced sender-based message logging protocol applicable to any communication-induced checkpointing to lead to a high decrease of the forced checkpointing overhead of communication-induced checkpointing in an effective way while permitting no useless checkpoint. The protocol allows each process sending a message to know the exact timestamp of the receiver of the message in its logging procedures without any extra message. Simulation verifies their great efficiency of overhead alleviation regardless of communication patterns.
Takeharu IKEZOE Takuya KOJIMA Hideharu AMANO
Recent IoT devices require extremely low standby power consumption, while a certain performance is needed during the active time, and Coarse-Grained Reconfigurable Arrays (CGRAs) have received attention because of their high energy efficiency. For further reduction of the standby energy consumption of CGRAs, the leakage power for their configuration memory must be reduced. Although the power gating is a common technique, the lost data in flip-flops and memory must be retrieved after the wake-up. Recovering everything requires numerous state transitions and considerable overhead both on its execution time and energy. To address the problem, Non-volatile Cool Mega Array (NVCMA), a CGRA providing non-volatile flip-flops (NVFFs) with spin transfer torque type non-volatile memory (NVM) technology has been developed. However, in general, non-volatile memory technologies have problems with reliability. Some NVFFs are stacked-at-0/1, and cannot store the data in a certain possibility. To improve the chip yield, we propose a mapping algorithm to avoid faulty processing elements of the CGRA caused by the erroneous configuration data. Next, we also propose a method to add an error-correcting code (ECC) mechanism to NVFFs for the configuration and constant memory. The proposed method was applied to NVCMA to evaluate the availability rate and reduction of write time. By using both methods, the average availability ratio of 94.2% was achieved, while the average availability ratio of the nine applications was 0.056% when the probability of failure of the FF was 0.01. The energy for storing data becomes about 2.3 times because of the hardware overhead of ECC but the proposed method can save 8.6% of the writing power on average.
A vertex subset F ⊆ V(G) is called a cyclic vertex-cut set of a connected graph G if G-F is disconnected such that at least two components in G-F contain cycles. The cyclic vertex connectivity is the cardinality of a minimum cyclic vertex-cut set. In this paper, we show that the cyclic vertex connectivity of the trivalent Cayley graphs TGn is equal to eight for n ≥ 4.
Antoine BOSSARD Keiichi KANEKO
Extending the very popular tori interconnection networks[1]-[3], Torus-Connected Cycles (TCC) have been proposed as a novel network topology for massively parallel systems [5]. Here, the set-to-set disjoint paths routing problem in a TCC is solved. In a TCC(k,n), it is proved that paths of lengths at most kn2+2n can be selected in O(kn2) time.
Carlos PEREZ-LEGUIZAMO P. Josue HERNANDEZ-TORRES J.S. Guadalupe GODINEZ-BORJA Victor TAPIA-TEC
Recently, the Services Oriented Architectures (SOA) have been recognized as the key to the integration and interoperability of different applications and systems that coexist in an organization. However, even though the use of SOA has increased, some applications are unable to use it. That is the case of mission critical information applications, whose requirements such as high reliability, non-stop operation, high flexibility and high performance are not satisfied by conventional SOA infrastructures. In this article we present a novel approach of combining SOA with Autonomous Decentralized Systems (ADS) in order to provide an infrastructure that can satisfy those requirements. We have named this infrastructure Autonomous Decentralized Service Oriented Architecture (ADSOA). We present the concept and architecture of ADSOA, as well as the Loosely Couple Delivery Transaction and Synchronization Technology for assuring the data consistency and high reliability of the application. Moreover, a real implementation and evaluation of the proposal in a mission critical information system, the Uniqueness Verifying Public Key Infrastructure (UV-PKI), is shown in order to prove its effectiveness.
Go MATSUKAWA Yohei NAKATA Yasuo SUGURE Shigeru OHO Yuta KIMI Masafumi SHIMOZAWA Shuhei YOSHIDA Hiroshi KAWAGUCHI Masahiko YOSHIMOTO
This paper presents a novel architecture for a fault-tolerant and dual modular redundancy (DMR) system using a checkpoint recovery approach. The architecture features exploitation of SRAM with simultaneous copy and instantaneous compare function. It can perform low-latency data copying between dual cores. Therefore, it can carry out fast backup and rollback. Furthermore, it can reduce the power consumption during data comparison process compared to the cyclic redundancy check (CRC). Evaluation results show that, compared with the conventional checkpoint/restart DMR, the proposed architecture reduces the cycle overhead by 97.8% and achieves a 3.28% low-latency execution cycle even if a one-time fault occurs when executing the task. The proposed architecture provides high reliability for systems with a real-time requirement.
Yonghwan KIM Tadashi ARARAGI Junya NAKAMURA Toshimitsu MASUZAWA
Checkpoint-rollback recovery, which is a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm especially if the systems are large-scale, since repeatedly taking global snapshots of the whole system requires unacceptable communication cost. As a sophisticated snapshot algorithm, a partial snapshot algorithm has been introduced that takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator instead of a global snapshot of the whole system. In this paper, we modify the previous partial snapshot algorithm to create a new one that can take a partial snapshot more efficiently, especially when multiple nodes concurrently initiate the algorithm. Experiments show that the proposed algorithm greatly reduces the amount of communication needed for taking partial snapshots.
This paper presents a new scalable method to considerably reduce the rollback propagation effect of the conventional optimistic message logging by utilizing positive features of reliable FIFO group communication links. To satisfy this goal, the proposed method forces group members to replicate different receive sequence numbers (RSNs), which they assigned for each identical message to their group respectively, into their volatile memories. As the degree of redundancy of RSNs increases, the possibility of local recovery for each crashed process may significantly be higher. Experimental results show that our method can outperform the previous one in terms of the rollback distance of non-faulty processes with a little normal time overhead.
In this paper, we derive a simple formula to generate a wide-sense systematic generator matrix(we call it quasi-systematic) B for a Reed-Solomon code. This formula can be utilized to construct an efficient interpolation based erasure-only decoder with time complexity O(n2) and space complexity O(n). Specifically, the decoding algorithm requires 3kr + r2 - 2r field additions, kr + r2 + r field negations, 2kr + r2 - r + k field multiplications and kr + r field inversions. Compared to another interpolation based erasure-only decoding algorithm derived by D.J.J. Versfeld et al., our algorithm is much more efficient for high-rate Reed-Solomon codes.
Chaochao FENG Zhonghai LU Axel JANTSCH Minxuan ZHANG Xianju YANG
In this paper, we propose three Deflection-Routing-based Multicast (DRM) schemes for a bufferless NoC. The DRM scheme without packets replication (DRM_noPR) sends multicast packet through a non-deterministic path. The DRM schemes with adaptive packets replication (DRM_PR_src and DRM_PR_all) replicate multicast packets at the source or intermediate node according to the destination position and the state of output ports to reduce the average multicast latency. We also provide fault-tolerant supporting in these schemes through a reinforcement-learning-based method to reconfigure the routing table to tolerate permanent faulty links in the network. Simulation results illustrate that the DRM_PR_all scheme achieves 41%, 43% and 37% less latency on average than that of the DRM_noPR scheme and 27%, 29% and 25% less latency on average than that of the DRM_PR_src scheme under three synthetic traffic patterns respectively. In addition, all three fault-tolerant DRM schemes achieve acceptable performance degradation at various link fault rates without any packet lost.
Fuyuan XIAO Teruaki KITASUKA Masayoshi ARITSUGI
We present an economical and fault-tolerant load balancing strategy (EFTLBS) based on an operator replication mechanism and a load shedding method, that fully utilizes the network resources to realize continuous and highly-available data stream processing without dynamic operator migration over wide area networks. In this paper, we first design an economical operator distribution (EOD) plan based on a bin-packing model under the constraints of each stream bandwidth as well as each server's CPU capacity. Next, we devise super-operator (SO) that load balances multi-degree operator replicas. Moreover, for improving the fault-tolerance of the system, we color the SOs based on a coloring bin-packing (CBP) model that assigns peer operator replicas to different servers. To minimize the effects of input rate bursts upon the system, we take advantage of a load shedding method while keeping the QoS guarantees made by the system based on the SO scheme and the CBP model. Finally, we substantiate the utility of our work through experiments on ns-3.
Sender-based message logging (SBML) with checkpointing has its well-known beneficial feature, lowering highly failure-free overhead of synchronous logging with volatile logging at sender's memory. This feature encourages it to be applied into many distributed systems as a low-cost transparent rollback recovery technique. However, the original SBML recovery algorithm may no longer be progressing in some transient communication error cases. This paper proposes a consistent recovery algorithm to solve this problem by piggybacking small log information for unstable messages received on each acknowledgement message for returning the receive sequence number assigned to a message by its receiver. Our algorithm also enables all messages scheduled to be sent, but delayed because of some preceding unstable messages to be actually transmitted out much earlier than the existing ones.
Lihong SHANG Mi ZHOU Yu HU Erfu YANG
Field programmable gate arrays (FPGAs) are widely used in reliability-critical systems due to their reconfiguration ability. However, with the shrinking device feature size and increasing die area, nowadays FPGAs can be deeply affected by the errors induced by electromigration and radiation. To improve the reliability of FPGA-based reconfigurable systems, a permanent fault recovery approach using a domain partition model is proposed in this paper. In the proposed approach, the fault-tolerant FPGA recovery from faults is realized by reloading a proper configuration from a pool of multiple alternative configurations with overlaps. The overlaps are presented as a set of vectors in the domain partition model. To enhance the reliability, a technical procedure is also presented in which the set of vectors are heuristically filtered so that the corresponding small overlaps can be merged into big ones. Experimental results are provided to demonstrate the effectiveness of the proposed approach through applying it to several benchmark circuits. Compared with previous approaches, the proposed approach increased MTTF by up to 18.87%.
Tadayoshi HORITA Itsuo TAKANAMI Masatoshi MORI
Two simple but useful methods, called the deep learning methods, for making multilayer neural networks tolerant to multiple link-weight and neuron-output faults, are proposed. The methods make the output errors in learning phase smaller than those in practical use. The abilities of fault-tolerance of the multilayer neural networks in practical use, are analyzed in the relationship between the output errors in learning phase and in practical use. The analytical result shows that the multilayer neural networks have complete (100%) fault-tolerance to multiple weight-and-neuron faults in practical use. The simulation results concerning the rate of successful learnings, the ability of fault-tolerance, and the learning time, are also shown.
Fault-tolerance is an important design issue in building a reliable mobile computing system. This paper considers checkpointing recovery services for a mobile computing system based on the ad-hoc network environment. Since potential problems of this new environment are insufficient power and limited storage capacity, the proposed scheme tries to reduce disk access frequency for saving recovery information, and also the amount of information saved for recovery. A brief simulation study has been performed and the results show that the proposed scheme takes advantage of the existing checkpointing recovery schemes.
Daisuke TAKEMOTO Shigeaki TAGASHIRA Satoshi FUJITA
In this paper, we propose a new method to enhance the fault-tolerance of the Content Addressable Network (CAN), which is known as a typical pure P2P system based on the notion of Distributed Hash Table (DHT). The basic idea of the proposed method is to introduce redundancy to the management of index information distributed over the nodes in the given P2P network, by allowing each index to be assigned to several nodes, which was restricted to be one in the original CAN system. To keep the consistency among several copies of indices, we propose an efficient synchronization scheme based on the notion of labels assigned to each copy in a distinct manner. The performance of the proposed scheme is evaluated by simulation. The result of simulations indicates that the proposed scheme significantly enhances the fault-tolerance of the CAN system.
To guarantee the high reliability of video services, video servers usually adopt parity-encoding techniques in which data blocks and their associated parity blocks form a parity group. For real-time video service, all the blocks in a parity group are prefetched in order to cope with a possible disk failure, thereby incurring a buffering overhead. In this paper, we propose a new scheme called Round-level Parity Grouping (RPG) to reduce the buffer overhead while restoring VBR video streams in the presence of a faulty disk. RPG allows variable parity group sizes so that the exact amount of data is retrieved during each round. Based on RPG, we have developed a storage allocation algorithm for effective buffer management. Experimental results show that our proposed scheme reduces the buffer requirement by 20% to 25%.
Sayaka KAMEI Hirotsugu KAKUGAWA
Self-stabilization is a theoretical framework of non-masking fault-tolerant distributed algorithms. In this paper, we investigate a self-stabilizing distributed approximation for the minimum k-dominating set (KDS) problem in general networks. The minimum KDS problem is a generalization of the well-known dominating set problem in graph theory. For a graph G = (V,E), a set Dk
We present a built-in self-reconfiguring system for a mesh-connected processor array where faulty processor elements are compensated for by spare processing elements located in one row and one column. It has advantages in that the number of spare processing elements is small and additional control circuits and networks for changing interconnections of processing elements is so simple that hardware overhead for reconfiguration is also small. First, to indicate the motivation to the proposed reconfiguration scheme, we briefly describe other schemes with the same number of spares as that of the proposed scheme where faulty processing elements are replaced using straight shifts toward spares, and compare their reconfiguration probabilities to each other. Then, we show that a variant of the proposed scheme has the highest probability. Next, we present a built-in self-reconfiguring system for the scheme and formally prove that it works correctly. It can automatically replace faulty processors by spare processors on detecting faults of processors.
Deron LIANG Chen-Liang FANG Chyouhwa CHEN
Active replication is a common approach to building highly available and reliable distributed software applications. The redundant nested invocation (RNI) problem arises when servers in a replicated group issues nested invocations to other server groups in response to a client invocation. Automatic suppression of RNI is always a desirable solution, yet it is usually a difficult design issue. If the system has multithreading support, the difficulties of implementation increase dramatically. Intuitively, to design a deterministic thread execution control mechanism is a possible approach. Unfortunately, some modern operating systems implement thread on kernel level for execution fairness. For the kernel thread case, modification on thread control implies modifying the operating system kernel. This approach loses system portability which is one of the important requirements of CORBA or middleware. In this work, we propose a mechanism to perform the auto-suppression of redundant nested invocation in an active replication fault-tolerant (FT) CORBA system. Besides the mechanism design, we discuss the design correctness semantic and the correctness proof of our design.