The search functionality is under construction.

Keyword Search Result

[Keyword] fault-tolerance(44hit)

1-20hit(44hit)

  • Enhanced Sender-Based Message Logging for Reducing Forced Checkpointing Overhead in Distributed Systems

    Jinho AHN  

     
    LETTER-Dependable Computing

      Pubricized:
    2021/06/08
      Vol:
    E104-D No:9
      Page(s):
    1500-1505

    The previous communication-induced checkpointing may considerably induce worthless forced checkpoints because each process receiving messages cannot obtain sufficient information related to non-causal Z-paths. This paper presents an enhanced sender-based message logging protocol applicable to any communication-induced checkpointing to lead to a high decrease of the forced checkpointing overhead of communication-induced checkpointing in an effective way while permitting no useless checkpoint. The protocol allows each process sending a message to know the exact timestamp of the receiver of the message in its logging procedures without any extra message. Simulation verifies their great efficiency of overhead alleviation regardless of communication patterns.

  • Recovering Faulty Non-Volatile Flip Flops for Coarse-Grained Reconfigurable Architectures

    Takeharu IKEZOE  Takuya KOJIMA  Hideharu AMANO  

     
    PAPER

      Pubricized:
    2020/12/14
      Vol:
    E104-C No:6
      Page(s):
    215-225

    Recent IoT devices require extremely low standby power consumption, while a certain performance is needed during the active time, and Coarse-Grained Reconfigurable Arrays (CGRAs) have received attention because of their high energy efficiency. For further reduction of the standby energy consumption of CGRAs, the leakage power for their configuration memory must be reduced. Although the power gating is a common technique, the lost data in flip-flops and memory must be retrieved after the wake-up. Recovering everything requires numerous state transitions and considerable overhead both on its execution time and energy. To address the problem, Non-volatile Cool Mega Array (NVCMA), a CGRA providing non-volatile flip-flops (NVFFs) with spin transfer torque type non-volatile memory (NVM) technology has been developed. However, in general, non-volatile memory technologies have problems with reliability. Some NVFFs are stacked-at-0/1, and cannot store the data in a certain possibility. To improve the chip yield, we propose a mapping algorithm to avoid faulty processing elements of the CGRA caused by the erroneous configuration data. Next, we also propose a method to add an error-correcting code (ECC) mechanism to NVFFs for the configuration and constant memory. The proposed method was applied to NVCMA to evaluate the availability rate and reduction of write time. By using both methods, the average availability ratio of 94.2% was achieved, while the average availability ratio of the nine applications was 0.056% when the probability of failure of the FF was 0.01. The energy for storing data becomes about 2.3 times because of the hardware overhead of ECC but the proposed method can save 8.6% of the writing power on average.

  • Cyclic Vertex Connectivity of Trivalent Cayley Graphs

    Jenn-Yang KE  

     
    PAPER-Fundamentals of Information Systems

      Pubricized:
    2018/03/30
      Vol:
    E101-D No:7
      Page(s):
    1828-1834

    A vertex subset F ⊆ V(G) is called a cyclic vertex-cut set of a connected graph G if G-F is disconnected such that at least two components in G-F contain cycles. The cyclic vertex connectivity is the cardinality of a minimum cyclic vertex-cut set. In this paper, we show that the cyclic vertex connectivity of the trivalent Cayley graphs TGn is equal to eight for n ≥ 4.

  • Set-to-Set Disjoint Paths Routing in Torus-Connected Cycles

    Antoine BOSSARD  Keiichi KANEKO  

     
    LETTER-Dependable Computing

      Pubricized:
    2016/08/10
      Vol:
    E99-D No:11
      Page(s):
    2821-2823

    Extending the very popular tori interconnection networks[1]-[3], Torus-Connected Cycles (TCC) have been proposed as a novel network topology for massively parallel systems [5]. Here, the set-to-set disjoint paths routing problem in a TCC is solved. In a TCC(k,n), it is proved that paths of lengths at most kn2+2n can be selected in O(kn2) time.

  • Autonomous Decentralized Service Oriented Architecture Concept and Application for Mission Critical Information Systems

    Carlos PEREZ-LEGUIZAMO  P. Josue HERNANDEZ-TORRES  J.S. Guadalupe GODINEZ-BORJA  Victor TAPIA-TEC  

     
    PAPER

      Vol:
    E99-B No:4
      Page(s):
    803-811

    Recently, the Services Oriented Architectures (SOA) have been recognized as the key to the integration and interoperability of different applications and systems that coexist in an organization. However, even though the use of SOA has increased, some applications are unable to use it. That is the case of mission critical information applications, whose requirements such as high reliability, non-stop operation, high flexibility and high performance are not satisfied by conventional SOA infrastructures. In this article we present a novel approach of combining SOA with Autonomous Decentralized Systems (ADS) in order to provide an infrastructure that can satisfy those requirements. We have named this infrastructure Autonomous Decentralized Service Oriented Architecture (ADSOA). We present the concept and architecture of ADSOA, as well as the Loosely Couple Delivery Transaction and Synchronization Technology for assuring the data consistency and high reliability of the application. Moreover, a real implementation and evaluation of the proposal in a mission critical information system, the Uniqueness Verifying Public Key Infrastructure (UV-PKI), is shown in order to prove its effectiveness.

  • A Low-Latency DMR Architecture with Fast Checkpoint Recovery Scheme

    Go MATSUKAWA  Yohei NAKATA  Yasuo SUGURE  Shigeru OHO  Yuta KIMI  Masafumi SHIMOZAWA  Shuhei YOSHIDA  Hiroshi KAWAGUCHI  Masahiko YOSHIMOTO  

     
    PAPER

      Vol:
    E98-C No:4
      Page(s):
    333-339

    This paper presents a novel architecture for a fault-tolerant and dual modular redundancy (DMR) system using a checkpoint recovery approach. The architecture features exploitation of SRAM with simultaneous copy and instantaneous compare function. It can perform low-latency data copying between dual cores. Therefore, it can carry out fast backup and rollback. Furthermore, it can reduce the power consumption during data comparison process compared to the cyclic redundancy check (CRC). Evaluation results show that, compared with the conventional checkpoint/restart DMR, the proposed architecture reduces the cycle overhead by 97.8% and achieves a 3.28% low-latency execution cycle even if a one-time fault occurs when executing the task. The proposed architecture provides high reliability for systems with a real-time requirement.

  • A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems

    Yonghwan KIM  Tadashi ARARAGI  Junya NAKAMURA  Toshimitsu MASUZAWA  

     
    PAPER-Dependable Computing

      Vol:
    E97-D No:1
      Page(s):
    65-76

    Checkpoint-rollback recovery, which is a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm especially if the systems are large-scale, since repeatedly taking global snapshots of the whole system requires unacceptable communication cost. As a sophisticated snapshot algorithm, a partial snapshot algorithm has been introduced that takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator instead of a global snapshot of the whole system. In this paper, we modify the previous partial snapshot algorithm to create a new one that can take a partial snapshot more efficiently, especially when multiple nodes concurrently initiate the algorithm. Experiments show that the proposed algorithm greatly reduces the amount of communication needed for taking partial snapshots.

  • On Reducing Rollback Propagation Effect of Optimistic Message Logging for Group-Based Distributed Systems

    Jinho AHN  

     
    LETTER-Dependable Computing

      Vol:
    E96-D No:11
      Page(s):
    2473-2477

    This paper presents a new scalable method to considerably reduce the rollback propagation effect of the conventional optimistic message logging by utilizing positive features of reliable FIFO group communication links. To satisfy this goal, the proposed method forces group members to replicate different receive sequence numbers (RSNs), which they assigned for each identical message to their group respectively, into their volatile memories. As the degree of redundancy of RSNs increases, the possibility of local recovery for each crashed process may significantly be higher. Experimental results show that our method can outperform the previous one in terms of the rollback distance of non-faulty processes with a little normal time overhead.

  • An Efficient Interpolation Based Erasure-Only Decoder for High-Rate Reed-Solomon Codes

    Qian GUO  Haibin KAN  

     
    LETTER-Coding Theory

      Vol:
    E95-A No:5
      Page(s):
    978-981

    In this paper, we derive a simple formula to generate a wide-sense systematic generator matrix(we call it quasi-systematic) B for a Reed-Solomon code. This formula can be utilized to construct an efficient interpolation based erasure-only decoder with time complexity O(n2) and space complexity O(n). Specifically, the decoding algorithm requires 3kr + r2 - 2r field additions, kr + r2 + r field negations, 2kr + r2 - r + k field multiplications and kr + r field inversions. Compared to another interpolation based erasure-only decoding algorithm derived by D.J.J. Versfeld et al., our algorithm is much more efficient for high-rate Reed-Solomon codes.

  • Support Efficient and Fault-Tolerant Multicast in Bufferless Network-on-Chip

    Chaochao FENG  Zhonghai LU  Axel JANTSCH  Minxuan ZHANG  Xianju YANG  

     
    PAPER-Computer System

      Vol:
    E95-D No:4
      Page(s):
    1052-1061

    In this paper, we propose three Deflection-Routing-based Multicast (DRM) schemes for a bufferless NoC. The DRM scheme without packets replication (DRM_noPR) sends multicast packet through a non-deterministic path. The DRM schemes with adaptive packets replication (DRM_PR_src and DRM_PR_all) replicate multicast packets at the source or intermediate node according to the destination position and the state of output ports to reduce the average multicast latency. We also provide fault-tolerant supporting in these schemes through a reinforcement-learning-based method to reconfigure the routing table to tolerate permanent faulty links in the network. Simulation results illustrate that the DRM_PR_all scheme achieves 41%, 43% and 37% less latency on average than that of the DRM_noPR scheme and 27%, 29% and 25% less latency on average than that of the DRM_PR_src scheme under three synthetic traffic patterns respectively. In addition, all three fault-tolerant DRM schemes achieve acceptable performance degradation at various link fault rates without any packet lost.

  • Economical and Fault-Tolerant Load Balancing in Distributed Stream Processing Systems

    Fuyuan XIAO  Teruaki KITASUKA  Masayoshi ARITSUGI  

     
    PAPER-Data Engineering, Web Information Systems

      Vol:
    E95-D No:4
      Page(s):
    1062-1073

    We present an economical and fault-tolerant load balancing strategy (EFTLBS) based on an operator replication mechanism and a load shedding method, that fully utilizes the network resources to realize continuous and highly-available data stream processing without dynamic operator migration over wide area networks. In this paper, we first design an economical operator distribution (EOD) plan based on a bin-packing model under the constraints of each stream bandwidth as well as each server's CPU capacity. Next, we devise super-operator (SO) that load balances multi-degree operator replicas. Moreover, for improving the fault-tolerance of the system, we color the SOs based on a coloring bin-packing (CBP) model that assigns peer operator replicas to different servers. To minimize the effects of input rate bursts upon the system, we take advantage of a load shedding method while keeping the QoS guarantees made by the system based on the SO scheme and the CBP model. Finally, we substantiate the utility of our work through experiments on ns-3.

  • Lightweight Consistent Recovery Algorithm for Sender-Based Message Logging in Distributed Systems

    Jinho AHN  

     
    LETTER-Dependable Computing

      Vol:
    E94-D No:8
      Page(s):
    1712-1715

    Sender-based message logging (SBML) with checkpointing has its well-known beneficial feature, lowering highly failure-free overhead of synchronous logging with volatile logging at sender's memory. This feature encourages it to be applied into many distributed systems as a low-cost transparent rollback recovery technique. However, the original SBML recovery algorithm may no longer be progressing in some transient communication error cases. This paper proposes a consistent recovery algorithm to solve this problem by piggybacking small log information for unstable messages received on each acknowledgement message for returning the receive sequence number assigned to a message by its receiver. Our algorithm also enables all messages scheduled to be sent, but delayed because of some preceding unstable messages to be actually transmitted out much earlier than the existing ones.

  • A Domain Partition Model Approach to the Online Fault Recovery of FPGA-Based Reconfigurable Systems

    Lihong SHANG  Mi ZHOU  Yu HU  Erfu YANG  

     
    PAPER-Nonlinear Problems

      Vol:
    E94-A No:1
      Page(s):
    290-299

    Field programmable gate arrays (FPGAs) are widely used in reliability-critical systems due to their reconfiguration ability. However, with the shrinking device feature size and increasing die area, nowadays FPGAs can be deeply affected by the errors induced by electromigration and radiation. To improve the reliability of FPGA-based reconfigurable systems, a permanent fault recovery approach using a domain partition model is proposed in this paper. In the proposed approach, the fault-tolerant FPGA recovery from faults is realized by reloading a proper configuration from a pool of multiple alternative configurations with overlaps. The overlaps are presented as a set of vectors in the domain partition model. To enhance the reliability, a technical procedure is also presented in which the set of vectors are heuristically filtered so that the corresponding small overlaps can be merged into big ones. Experimental results are provided to demonstrate the effectiveness of the proposed approach through applying it to several benchmark circuits. Compared with previous approaches, the proposed approach increased MTTF by up to 18.87%.

  • Learning Algorithms Which Make Multilayer Neural Networks Multiple-Weight-and-Neuron-Fault Tolerant

    Tadayoshi HORITA  Itsuo TAKANAMI  Masatoshi MORI  

     
    PAPER-Biocybernetics, Neurocomputing

      Vol:
    E91-D No:4
      Page(s):
    1168-1175

    Two simple but useful methods, called the deep learning methods, for making multilayer neural networks tolerant to multiple link-weight and neuron-output faults, are proposed. The methods make the output errors in learning phase smaller than those in practical use. The abilities of fault-tolerance of the multilayer neural networks in practical use, are analyzed in the relationship between the output errors in learning phase and in practical use. The analytical result shows that the multilayer neural networks have complete (100%) fault-tolerance to multiple weight-and-neuron faults in practical use. The simulation results concerning the rate of successful learnings, the ability of fault-tolerance, and the learning time, are also shown.

  • Fault-Tolerance for the Mobile Ad-Hoc Environment

    Taesoon PARK  Kwangho KIM  

     
    LETTER-Reliability, Maintainability and Safety Analysis

      Vol:
    E91-A No:1
      Page(s):
    413-416

    Fault-tolerance is an important design issue in building a reliable mobile computing system. This paper considers checkpointing recovery services for a mobile computing system based on the ad-hoc network environment. Since potential problems of this new environment are insufficient power and limited storage capacity, the proposed scheme tries to reduce disk access frequency for saving recovery information, and also the amount of information saved for recovery. A brief simulation study has been performed and the results show that the proposed scheme takes advantage of the existing checkpointing recovery schemes.

  • A Fault-Tolerant Content Addressable Network

    Daisuke TAKEMOTO  Shigeaki TAGASHIRA  Satoshi FUJITA  

     
    PAPER-Networks

      Vol:
    E89-D No:6
      Page(s):
    1923-1930

    In this paper, we propose a new method to enhance the fault-tolerance of the Content Addressable Network (CAN), which is known as a typical pure P2P system based on the notion of Distributed Hash Table (DHT). The basic idea of the proposed method is to introduce redundancy to the management of index information distributed over the nodes in the given P2P network, by allowing each index to be assigned to several nodes, which was restricted to be one in the original CAN system. To keep the consistency among several copies of indices, we propose an efficient synchronization scheme based on the notion of labels assigned to each copy in a distinct manner. The performance of the proposed scheme is evaluated by simulation. The result of simulations indicates that the proposed scheme significantly enhances the fault-tolerance of the CAN system.

  • Minimizing the Buffer Size in Fault-Tolerant Video Servers for VBR Streams

    Minseok SONG  Heonshik SHIN  

     
    LETTER-Dependable Computing

      Vol:
    E88-D No:6
      Page(s):
    1294-1298

    To guarantee the high reliability of video services, video servers usually adopt parity-encoding techniques in which data blocks and their associated parity blocks form a parity group. For real-time video service, all the blocks in a parity group are prefetched in order to cope with a possible disk failure, thereby incurring a buffering overhead. In this paper, we propose a new scheme called Round-level Parity Grouping (RPG) to reduce the buffer overhead while restoring VBR video streams in the presence of a faulty disk. RPG allows variable parity group sizes so that the exact amount of data is retrieved during each round. Based on RPG, we have developed a storage allocation algorithm for effective buffer management. Experimental results show that our proposed scheme reduces the buffer requirement by 20% to 25%.

  • A Self-Stabilizing Approximation Algorithm for the Distributed Minimum k-Domination

    Sayaka KAMEI  Hirotsugu KAKUGAWA  

     
    PAPER

      Vol:
    E88-A No:5
      Page(s):
    1109-1116

    Self-stabilization is a theoretical framework of non-masking fault-tolerant distributed algorithms. In this paper, we investigate a self-stabilizing distributed approximation for the minimum k-dominating set (KDS) problem in general networks. The minimum KDS problem is a generalization of the well-known dominating set problem in graph theory. For a graph G = (V,E), a set Dk V is a KDS of G if and only if each vertex not in Dk is adjacent to at least k vertices in Dk. The approximation ratio of our algorithm is , where Δ is the maximum degree of G, in the networks of which the minimum degree is more than or equal to k.

  • Self-Reconfiguring of -Track-Switch Mesh Arrays with Spares on One Row and One Column by Simple Built-in Circuit

    Itsuo TAKANAMI  

     
    PAPER-Dependable Computing

      Vol:
    E87-D No:10
      Page(s):
    2318-2328

    We present a built-in self-reconfiguring system for a mesh-connected processor array where faulty processor elements are compensated for by spare processing elements located in one row and one column. It has advantages in that the number of spare processing elements is small and additional control circuits and networks for changing interconnections of processing elements is so simple that hardware overhead for reconfiguration is also small. First, to indicate the motivation to the proposed reconfiguration scheme, we briefly describe other schemes with the same number of spares as that of the proposed scheme where faulty processing elements are replaced using straight shifts toward spares, and compare their reconfiguration probabilities to each other. Then, we show that a variant of the proposed scheme has the highest probability. Next, we present a built-in self-reconfiguring system for the scheme and formally prove that it works correctly. It can automatically replace faulty processors by spare processors on detecting faults of processors.

  • A Nested Invocation Suppression Mechanism for Active Replication Fault-Tolerant CORBA

    Deron LIANG  Chen-Liang FANG  Chyouhwa CHEN  

     
    PAPER-Dependable Computing

      Vol:
    E87-D No:8
      Page(s):
    2070-2077

    Active replication is a common approach to building highly available and reliable distributed software applications. The redundant nested invocation (RNI) problem arises when servers in a replicated group issues nested invocations to other server groups in response to a client invocation. Automatic suppression of RNI is always a desirable solution, yet it is usually a difficult design issue. If the system has multithreading support, the difficulties of implementation increase dramatically. Intuitively, to design a deterministic thread execution control mechanism is a possible approach. Unfortunately, some modern operating systems implement thread on kernel level for execution fairness. For the kernel thread case, modification on thread control implies modifying the operating system kernel. This approach loses system portability which is one of the important requirements of CORBA or middleware. In this work, we propose a mechanism to perform the auto-suppression of redundant nested invocation in an active replication fault-tolerant (FT) CORBA system. Besides the mechanism design, we discuss the design correctness semantic and the correctness proof of our design.

1-20hit(44hit)