Keyword Search Result

[Keyword] fault (493 hits)

Hits 101-120 (of 493)

  • A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems

    Yonghwan KIM  Tadashi ARARAGI  Junya NAKAMURA  Toshimitsu MASUZAWA  

     
    PAPER-Dependable Computing

    Vol: E97-D No:1  Page(s): 65-76

    Checkpoint-rollback recovery, which is a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm especially if the systems are large-scale, since repeatedly taking global snapshots of the whole system requires unacceptable communication cost. As a sophisticated snapshot algorithm, a partial snapshot algorithm has been introduced that takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator instead of a global snapshot of the whole system. In this paper, we modify the previous partial snapshot algorithm to create a new one that can take a partial snapshot more efficiently, especially when multiple nodes concurrently initiate the algorithm. Experiments show that the proposed algorithm greatly reduces the amount of communication needed for taking partial snapshots.

  • A Method of Parallelizing Consensuses for Accelerating Byzantine Fault Tolerance

    Junya NAKAMURA  Tadashi ARARAGI  Toshimitsu MASUZAWA  Shigeru MASUYAMA  

     
    PAPER-Dependable Computing

    Vol: E97-D No:1  Page(s): 53-64

    We propose a new method that accelerates asynchronous Byzantine Fault Tolerant (BFT) protocols designed on the principle of state machine replication. State machine replication protocols ensure consistency among replicas by applying operations in the same order to all of them. A naive way to determine the application order of the operations is to repeatedly execute the BFT consensus to determine the next executed operation, but this may introduce inefficiency caused by waiting for the completion of the previous execution of the consensus protocol. To reduce this inefficiency, our method allows parallel execution of the consensuses while keeping consistency of the consensus results at the replicas. In this paper, we also prove the correctness of our method and experimentally compare it with the existing method in terms of latency and throughput. The evaluation results show that our method makes a BFT protocol three or four times faster than the existing one when some machines or message transmissions are delayed.

  • Cooperative VM Migration: A Symbiotic Virtualization Mechanism by Leveraging the Guest OS Knowledge

    Ryousei TAKANO  Hidemoto NAKADA  Takahiro HIROFUCHI  Yoshio TANAKA  Tomohiro KUDOH  

     
    PAPER

    Vol: E96-D No:12  Page(s): 2675-2683

    Virtual machine (VM) migration is useful for improving flexibility and maintainability in cloud computing environments. However, VM monitor (VMM)-bypass I/O technologies, including PCI passthrough and SR-IOV, which significantly reduce the overhead of I/O virtualization, make VM migration impossible. This paper proposes a novel and practical mechanism, called Symbiotic Virtualization (SymVirt), for enabling migration and checkpoint/restart on a virtualized cluster with VMM-bypass I/O devices, without virtualization overhead during normal operation. SymVirt allows a VMM to cooperate with a message passing layer on the guest OS, and then realizes VM-level migration and checkpoint/restart by combining user-level dynamic device configuration with coordination of distributed VMMs. We have implemented the proposed mechanism on top of QEMU/KVM and the Open MPI system. All PCI devices, including InfiniBand, Ethernet, and Myrinet, are supported without implementing specific para-virtualized drivers, and it is not necessary to modify either the MPI runtime or applications. Using the proposed mechanism, we demonstrate reactive and proactive fault tolerance (FT) mechanisms on a virtualized InfiniBand cluster. We have confirmed its effectiveness using both a memory-intensive microbenchmark and the NAS Parallel Benchmarks.

  • SAT-Based Test Generation for Open Faults Using Fault Excitation Caused by Effect of Adjacent Lines

    Jun YAMASHITA  Hiroyuki YOTSUYANAGI  Masaki HASHIZUME  Kozo KINOSHITA  

     
    PAPER-Logic Synthesis, Test and Verification

    Vol: E96-A No:12  Page(s): 2561-2567

    Open faults are difficult to test since the voltage at the floating line is unpredictable and depends on the voltage at the adjacent lines. The effect of an open fault can be easily excited if a test pattern provides the opposite logic value to most of the adjacent lines. In this paper, we present a procedure to generate tests of as high a quality as possible. We define the test quality for evaluating the effect of adjacent lines by assigning an opposite logic value to the faulty line. In our proposed test generation method, we utilize the SAT-based ATPG method. We generate test patterns that propagate the faulty effect to primary outputs and assign logic values to adjacent lines opposite to that of the faulty line. In order to estimate test quality for open faults, we define the excitation effectiveness Eeff. To reduce the test volume, we utilize open fault simulation: we calculate the excitation effectiveness by open fault simulation in order to eliminate unnecessary test patterns. The experimental results for the benchmark circuits demonstrate the effectiveness of our procedure.

  • On Reducing Rollback Propagation Effect of Optimistic Message Logging for Group-Based Distributed Systems

    Jinho AHN  

     
    LETTER-Dependable Computing

    Vol: E96-D No:11  Page(s): 2473-2477

    This paper presents a new scalable method to considerably reduce the rollback propagation effect of conventional optimistic message logging by utilizing positive features of reliable FIFO group communication links. To achieve this goal, the proposed method forces group members to replicate, in their volatile memories, the different receive sequence numbers (RSNs) that they respectively assigned to each identical message sent to their group. As the degree of redundancy of RSNs increases, the possibility of local recovery for each crashed process may become significantly higher. Experimental results show that our method can outperform the previous one in terms of the rollback distance of non-faulty processes with only a small overhead during normal operation.

  • An Efficient Test and Repair Flow for Yield Enhancement of One-Time-Programming NROM-Based ROMs

    Tsu-Lin LI  Masaki HASHIZUME  Shyue-Kung LU  

     
    LETTER

    Vol: E96-D No:9  Page(s): 2026-2030

    NROM is one of the emerging non-volatile-memory technologies, which is promising for replacing current floating-gate-based non-volatile memory such as flash memory. In order to raise the fabrication yield and enhance its reliability, a novel test and repair flow is proposed in this paper. Instead of the conventional fault replacement techniques, a novel fault masking technique is also exploited by considering the logical effects of physical defects when the customer's code is to be programmed. In order to maximize the possibilities of fault masking, a novel data inversion technique is proposed. The corresponding BIST architectures are also presented. According to experimental results, the repair rate and fabrication yield can be improved significantly. Moreover, the incurred hardware overhead is almost negligible.

  • Open-Fault Resilient Multiple-Valued Codes for Reliable Asynchronous Global Communication Links

    Naoya ONIZAWA  Atsushi MATSUMOTO  Takahiro HANYU  

     
    PAPER

    Vol: E96-D No:9  Page(s): 1952-1961

    This paper introduces open-wire fault-resilient multiple-valued codes for reliable asynchronous point-to-point global communication links. In the proposed encoding, two communication modules assign complementary codewords that change between two valid states when there is no open-wire fault. Under an open-wire fault, the codewords at each module do not reach one of the two valid states and remain in “invalid” states. Detecting the invalid states makes it possible to stop sending wrong codewords caused by an open-wire fault. The detectability of the open-wire fault based on the proposed encoding is proven for m-of-n codes. The proposed code used in the multiple-valued asynchronous global communication link is capable of detecting a single open-wire fault with 3.08 times higher coding efficiency than a conventional multiple-valued code used in a triple-modular redundancy (TMR) link that detects an open-wire fault under the same dynamic range of logical values.

  • Low-Overhead Fault-Secure Parallel Prefix Adder by Carry-Bit Duplication

    Nobutaka KITO  Naofumi TAKAGI  

     
    PAPER

    Vol: E96-D No:9  Page(s): 1962-1970

    We propose a low-overhead fault-secure parallel prefix adder. We duplicate carry bits for checking purposes. Only half of the normal carry bits are compared with the corresponding redundant carry bits, so the hardware overhead of the adder is low. For concurrent error detection, we also predict the parity of the result. Because the adder uses parity-based error detection, it is highly compatible with systems that have parity-based error detection. We can implement various fault-secure parallel prefix adders, such as the Sklansky, Brent-Kung, Han-Carlson, and Kogge-Stone adders. The area overhead of the proposed adder is about 15% lower than that of a previously proposed adder that compares all the carry bits.
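The parity prediction mentioned in the abstract above can be illustrated with a small Python sketch. It relies only on the standard identity that each sum bit is s_i = a_i XOR b_i XOR c_i (with c_i the carry into bit i), so the parity of the sum equals parity(A) XOR parity(B) XOR parity(C). The function names and the 8-bit default width are assumptions made for illustration; this is not the paper's circuit, and it does not model carry-bit duplication.

```python
def parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def carries_into_bits(a: int, b: int, width: int) -> int:
    """Return a word whose bit i is the carry INTO position i of a + b."""
    carry_word = 0
    carry = 0  # no external carry-in is assumed
    for i in range(width):
        carry_word |= carry << i
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        # Majority function: the carry out of bit i is the carry into bit i + 1.
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    return carry_word


def predicted_sum_parity(a: int, b: int, width: int) -> int:
    """Predict the parity of (a + b) mod 2**width from operands and carries."""
    return parity(a) ^ parity(b) ^ parity(carries_into_bits(a, b, width))


def parity_check_passes(a: int, b: int, observed_sum: int, width: int = 8) -> bool:
    """Compare the parity of an observed sum against the predicted parity."""
    mask = (1 << width) - 1
    predicted = predicted_sum_parity(a & mask, b & mask, width)
    return parity(observed_sum & mask) == predicted
```

A check of this kind flags any error that flips an odd number of sum bits. Errors in the carry chain itself can escape a pure parity check, which is presumably why the abstract pairs parity prediction with carry-bit comparison.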

  • Round Addition DFA on 80-bit Piccolo and TWINE

    Hideki YOSHIKAWA  Masahiro KAMINAGA  Arimitsu SHIKODA  Toshinori SUZUKI  

     
    LETTER

    Vol: E96-D No:9  Page(s): 2031-2035

    We present a round addition differential fault analysis (DFA) for some lightweight 80-bit block ciphers. It is shown that only one correct ciphertext and two faulty ciphertexts are required to reconstruct the secret keys in 80-bit Piccolo and TWINE, and that the reconstructions are easier than for 128-bit CLEFIA.

  • Fault Diagnosis and Reconfiguration Method for Network-on-Chip Based Multiple Processor Systems with Restricted Private Memories

    Masashi IMAI  Tomohiro YONEDA  

     
    PAPER

    Vol: E96-D No:9  Page(s): 1914-1925

    We propose a fault diagnosis and reconfiguration method based on the Pair and Swap scheme to improve the reliability and the MTTF (Mean Time To Failure) of network-on-chip based multiple processor systems in which each processor core has its own private memory. In the proposed scheme, two identical copies of a given task are executed on a pair of processor cores and the results are compared repeatedly in order to detect processor faults. If a fault is detected by a mismatch, the fault is identified and isolated using TMR (Triple Modular Redundancy), and the system is reconfigured using the redundant processor cores. We propose that each task be quadruplicated and statically assigned to private memories so that each memory holds only two different tasks. We evaluate the reliability of the proposed quadruplicated task allocation scheme from the viewpoint of MTTF. As a result, the MTTF of the proposed scheme is over 4.3 times longer than that of the duplicated task allocation scheme.
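The detect-then-identify flow described in the abstract above (compare a pair of results, then use a third execution to vote) can be sketched as follows. This is a generic duplicate-and-compare plus majority-vote illustration, not the paper's Pair and Swap scheme: the function name, the modelling of cores as plain Python callables, and the choice of spare core are all assumptions.

```python
from typing import Callable, Optional, Sequence, Tuple

# A "core" is modelled as a function that runs the task on an input value.
Core = Callable[[int], int]


def locate_faulty_core(
    cores: Sequence[Core],
    pair: Tuple[int, int],
    spare: int,
    task_input: int,
) -> Optional[int]:
    """Return the index of the faulty core in `pair`, or None if results match.

    Step 1 (detection): run the task on both cores of the pair and compare.
    Step 2 (identification): on a mismatch, run the task on a spare core and
    take a majority vote over the three results.
    """
    first, second = pair
    out_first = cores[first](task_input)
    out_second = cores[second](task_input)
    if out_first == out_second:
        return None  # no fault observed for this input

    out_spare = cores[spare](task_input)
    if out_spare == out_first:
        return second
    if out_spare == out_second:
        return first
    raise RuntimeError("no majority: more than one core may be faulty")
```

For example, with `cores = [lambda x: x * x, lambda x: x * x + 1, lambda x: x * x]`, calling `locate_faulty_core(cores, (0, 1), 2, 7)` identifies core 1 as the one that disagrees with the majority.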

  • Potential of Fault-Detection Coverage by means of On-Chip Redundancy - IEC61508: Are There Royal Roads to SIL 4?

    Nobuyasu KANEKAWA  

     
    PAPER

    Vol: E96-D No:9  Page(s): 1907-1913

    This paper investigates the potential to improve fault-detection coverage by means of on-chip redundancy. The international standard on functional safety, namely IEC61508 Ed. 2.0 Part 2 Annex E.3, prescribes that the upper bound of βIC (the ratio of common cause failures (CCFs) to all failures) be 0.25 to satisfy the frequency upper bound of dangerous failure in the safety function for SIL (Safety Integrity Level) 3. On the other hand, this paper argues that βIC does not necessarily have to be less than 0.25 for SIL 3, and that the upper bound of βIC can be determined depending on the failure rate λ and the CCF detection coverage. In other words, the frequency upper bound of dangerous failure for SIL 3 can also be satisfied with βIC higher than 0.25 if the failure rate λ is lower than 400 FIT. Moreover, the paper shows that on-chip redundancy has the potential to satisfy the SIL 4 requirement; the frequency upper bound of dangerous failure for SIL 4 can be satisfied with feasible ranges of βIC, λ, and CCF coverage, which can be realized by redundant code.

  • Field Slack Assessment for Predictive Fault Avoidance on Coarse-Grained Reconfigurable Devices

    Toshihiro KAMEDA  Hiroaki KONOURA  Dawood ALNAJJAR  Yukio MITSUYAMA  Masanori HASHIMOTO  Takao ONOYE  

     
    PAPER-Test and Verification

    Vol: E96-D No:8  Page(s): 1624-1631

    This paper proposes a procedure for avoiding delay faults in the field by assessing slack during standby time. The proposed procedure performs path delay testing and checks whether the slack is larger than a threshold value using a selectable delay embedded in basic elements (BEs). If the slack is smaller than the threshold, a pair of BEs to be replaced, which maximizes the path slack, is identified. Experimental results with two application circuits mapped onto a coarse-grained architecture show that, for aging-induced delay degradation, a small threshold slack (less than 1 ps in a test case) is enough to ensure delay fault prediction.

  • Selective Check of Data-Path for Effective Fault Tolerance

    Tanvir AHMED  Jun YAO  Yuko HARA-AZUMI  Shigeru YAMASHITA  Yasuhiko NAKASHIMA  

     
    PAPER-Design Methodology

    Vol: E96-D No:8  Page(s): 1592-1601

    Fault tolerance is playing an increasingly important role in covering the rising soft/hard error rates in electronic devices that accompany advances in process technologies. Research shows that wear-out faults have a gradual onset, starting with a timing fault and eventually leading to a permanent fault. Error detection is thus a required function to maintain execution correctness. Currently, however, many highly dependable methods for covering permanent faults are over-designed, using very frequent checking because they lack awareness of the fault possibility in the circuits used for the pending executions. In this research, to address the over-checking problem, we introduce a metric for permanent defects, operation defective probability (ODP), to quantitatively guide the placement of check operations only at critical positions. By using this selective checking approach, we can achieve near-100% dependability with about 53% fewer check operations than the ideal reliable method, which performs exhaustive checks to guarantee zero error propagation. By this means, we are able to reduce power consumption by 21.7% by avoiding the non-critical checking in the over-designed approach.

  • Dynamic Fault Tree Analysis for Systems with Nonexponential Failure Components

    Tetsushi YUGE  Shigeru YANAGI  

     
    PAPER-Reliability, Maintainability and Safety Analysis

    Vol: E96-A No:8  Page(s): 1730-1736

    A method of calculating the top event probability of a fault tree, where dynamic gates and repeated events are included and the occurrences of basic events follow nonexponential distributions, is proposed. The method is based on the Bayesian network formulation for a dynamic fault tree (DFT) proposed by Yuge and Yanagi [1]. That formulation had difficulty calculating a sequence probability when components have nonexponential failure distributions. We propose an alternative method to obtain the sequence probability in this paper. First, a method for the case of the Erlang distribution is discussed. Then, Tijms's fitting procedure is applied to deal with a general distribution. The procedure gives a mixture of two Erlang distributions as an approximation of a general distribution, given its mean and standard deviation. A numerical example shows that our method works well for complex systems.
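As a concrete illustration of fitting a mixture of two Erlang distributions to a given mean and standard deviation, the sketch below uses the two-moment fit commonly attributed to Tijms: a mixture of Erlang(k-1) and Erlang(k) with a common rate, valid when the squared coefficient of variation lies between 1/k and 1/(k-1). Whether the paper applies exactly this variant is an assumption; the function name and return layout are also hypothetical.

```python
import math
from typing import Tuple


def fit_erlang_mixture(mean: float, std: float) -> Tuple[float, int, float]:
    """Two-moment fit of a mixture of Erlang(k-1) and Erlang(k) distributions.

    Returns (p, k, rate): with probability p the variable is Erlang(k-1, rate),
    and with probability 1 - p it is Erlang(k, rate). Requires std <= mean.
    """
    if mean <= 0 or std <= 0:
        raise ValueError("mean and std must be positive")
    cv2 = (std / mean) ** 2  # squared coefficient of variation
    if cv2 > 1.0:
        raise ValueError("this fit assumes a coefficient of variation <= 1")

    # Choose k so that 1/k <= cv2 <= 1/(k-1).
    k = max(1, math.ceil(1.0 / cv2 - 1e-12))

    # Mixing probability from matching the first two moments; the max(0, ...)
    # guards against tiny negative values caused by rounding.
    disc = max(0.0, k * (1.0 + cv2) - k * k * cv2)
    p = (k * cv2 - math.sqrt(disc)) / (1.0 + cv2)
    p = min(1.0, max(0.0, p))

    rate = (k - p) / mean  # common rate parameter of both Erlang phases
    return p, k, rate
```

With `mean=1.0` and `std=0.6`, the squared coefficient of variation is 0.36, so the fit mixes Erlang(2) and Erlang(3); the mean and variance of the resulting mixture reproduce the inputs.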

  • Coverage of Irrelevant Components in Systems with Imperfect Fault Coverage

    Jianwen XIANG  Fumio MACHIDA  Kumiko TADANO  Yoshiharu MAENO  Kazuo YANOO  

     
    LETTER-Reliability, Maintainability and Safety Analysis

    Vol: E96-A No:7  Page(s): 1649-1652

    Traditional imperfect fault coverage models only consider the coverage (including identification and isolation) of faulty components; they do not consider the coverage of irrelevant (operational) components. One potential reason for the omission is that in these models the system is generally assumed to be coherent, meaning that each component is initially relevant. In this paper, we first point out that an initially relevant component can later become irrelevant due to the failures of other components, and thus it is important to consider the handling of irrelevancy even if the system is originally coherent. We propose an irrelevancy coverage model (IRCM) in which the coverage is extended to the irrelevant components in addition to the faulty components. The IRCM can not only significantly enhance system reliability by preventing future system failures resulting from uncovered failures of the irrelevant components, but may also play an important role in efficient energy use in practice by turning off the irrelevant components in a timely manner.

  • Test Generation for Delay Faults on Clock Lines under Launch-on-Capture Test Environment

    Yoshinobu HIGAMI  Hiroshi TAKAHASHI  Shin-ya KOBAYASHI  Kewal K. SALUJA  

     
    PAPER-Dependable Computing

    Vol: E96-D No:6  Page(s): 1323-1331

    This paper deals with delay faults on clock lines, assuming the launch-on-capture test. In this realistic fault model, the amount of delay at the FF driven by the faulty clock line is such that the scan shift operation can be performed correctly even in the presence of a fault, but during system clock operation, capturing functional value(s) at faulty FF(s), i.e., FF(s) driven by the delayed clock, is delayed and the correct value(s) may not be captured. We developed a fault simulator that can handle such faults, and using this simulator we investigate the relation between the duration of the delay and the difficulty of detecting clock delay faults in the launch-on-capture test. Next, we propose test generation methods for detecting clock delay faults that affect one or two FFs. Experimental results for benchmark circuits are given in order to establish the effectiveness of the proposed methods.

  • Detection and Localization of Link Quality Degradation in Transparent WDM Networks

    Wissarut YUTTACHAI  Poompat SAENGUDOMLERT  Wuttipong KUMWILAISAK  

     
    PAPER-Fiber-Optic Transmission for Communications

    Vol: E96-B No:6  Page(s): 1412-1424

    We consider the problem of detecting and localizing link quality degradations in transparent wavelength division multiplexing (WDM) networks. In particular, we consider the degradation of the optical signal-to-noise ratio (OSNR), which is a key parameter for link quality monitoring in WDM networks. With transparency in WDM networks, transmission lightpaths can bypass electronic processing at intermediate nodes. Accordingly, links cannot always be monitored by receivers at their end nodes. This paper proposes the use of optical multicast probes to monitor OSNR degradations on optical links. The proposed monitoring scheme consists of two steps. The first step is an off-line process to set up monitoring trees using integer linear programming (ILP). The set of monitoring trees is selected to guarantee that significant OSNR degradations can be identified on any link or links in the network. The second step uses optical performance monitors that are placed at the receivers identified in the first step. The information from these monitors is collected and input to the estimation algorithm to localize the degraded links. Numerical results indicate that the proposed monitoring algorithm is able to detect link degradations that cause significant OSNR changes. In addition, we demonstrate how the information obtained from monitoring can be used to detect a significant end-to-end OSNR degradation even when there is no significant OSNR degradation on any individual link.

  • Dynamic Fault Tree Analysis Using Bayesian Networks and Sequence Probabilities

    Tetsushi YUGE  Shigeru YANAGI  

     
    PAPER-Reliability, Maintainability and Safety Analysis

    Vol: E96-A No:5  Page(s): 953-962

    A method of calculating the exact top event probability of a fault tree with dynamic gates and repeated basic events is proposed. The top event probability of such a dynamic fault tree is usually obtained by converting the tree into an equivalent Markov model. However, the Markov-based method is not realistic for a complex system model because the number of states that must be considered in the Markov analysis increases explosively as the number of basic events in the model increases. To overcome this shortcoming, we propose an alternative method in this paper. It is a hybrid of a Bayesian network (BN) and an algebraic technique. First, modularization is applied to a dynamic fault tree. The detected modules are classified into two types: one satisfies the parental Markov condition and the other does not. A module without the parental Markov condition is replaced with an equivalent single event. The occurrence probability of this event is obtained as the sum of disjoint sequence probabilities. After the contraction of modules without the parental Markov condition, the BN algorithm is applied to the dynamic fault tree. The conditional probability tables for dynamic gates are presented. The BN is a standard one and has hierarchical and modular features. A numerical example shows that our method works well for complex systems.
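To make the idea of summing disjoint sequence probabilities concrete, the sketch below treats the simplest case: two independent basic events A and B with exponentially distributed occurrence times. The exponential assumption, the rates, and the function names are illustrative choices, not taken from the paper. The probability that A occurs before B and B occurs by time t has the closed form (1 - e^(-λb t)) - λb/(λa+λb) (1 - e^(-(λa+λb) t)), and the two orderings A-then-B and B-then-A are disjoint, so their probabilities add up to the probability that both events occur by t.

```python
import math


def prob_a_then_b_by(t: float, rate_a: float, rate_b: float) -> float:
    """P(A occurs before B, and B occurs by time t).

    Assumes independent exponential occurrence times with the given rates.
    Obtained by integrating f_B(y) * F_A(y) over y in [0, t].
    """
    total = rate_a + rate_b
    return (1.0 - math.exp(-rate_b * t)) - (rate_b / total) * (
        1.0 - math.exp(-total * t)
    )


def prob_both_by(t: float, rate_a: float, rate_b: float) -> float:
    """P(both A and B occur by t) as a sum of disjoint sequence probabilities."""
    a_then_b = prob_a_then_b_by(t, rate_a, rate_b)
    b_then_a = prob_a_then_b_by(t, rate_b, rate_a)
    return a_then_b + b_then_a
```

The first function corresponds to what a priority-AND gate over A and B would evaluate under these assumptions; the second serves as a consistency check, since it must agree with the product (1 - e^(-λa t))(1 - e^(-λb t)) for independent events.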

  • Improving Test Coverage by Measuring Path Delay Time Including Transmission Time of FF

    Wenpo ZHANG  Kazuteru NAMBA  Hideo ITO  

     
    LETTER-Dependable Computing

    Vol: E96-D No:5  Page(s): 1219-1222

    As technology scales to 45 nm and below, the reliability of VLSI declines due to small delay defects, which are hard to detect at the functional clock frequency. To detect small delay defects, a method that measures the delay time of a path in the circuit under test (CUT) was proposed. However, because a large number of FFs exist in recent VLSI, the probability that a resistive defect occurs in the FFs has increased. A test method that measures path delay time including the transmission time of FFs is therefore necessary. However, the path measured by the conventional on-chip path delay time measurement method does not include a part of a master latch. Thus, testing using the conventional measurement method cannot detect defects occurring in that part. This paper proposes an improved on-chip path delay time measurement method. Test coverage is improved by measuring the path delay time including the transmission time of a master latch. The proposed method uses a duty-cycle-modified clock signal. Evaluation results show that the proposed method improves test coverage by 5.25-11.28% with the same area overhead as the conventional method.

  • Understanding the Impact of BPRAM on Incremental Checkpoint

    Xu LI  Kai LU  Xiaoping WANG  Bin DAI  Xu ZHOU  

     
    PAPER-Dependable Computing

    Vol: E96-D No:3  Page(s): 663-672

    Existing large-scale systems suffer from various hardware/software failures, motivating research on fault-tolerance techniques. Checkpoint-restart techniques are widely applied fault-tolerance approaches, especially in scientific computing systems. However, the overhead of checkpointing largely influences overall system performance. Recently, emerging byte-addressable, persistent memory technologies, such as phase change memory (PCM), have made it possible to implement checkpointing at arbitrary data granularity. However, the impact of data granularity on the checkpointing cost has not been fully addressed. In this paper, we investigate how data granularity influences the performance of a checkpoint system. Further, we design and implement a high-performance checkpoint system named AG-ckpt. AG-ckpt is a hybrid-granularity incremental checkpointing scheme built on (1) low-cost modified-memory detection and (2) fine-grained memory duplication. Moreover, we formulate the performance-granularity relationship of checkpointing systems through a mathematical model, and further obtain the optimum solutions. We conduct experiments with several typical benchmarks to verify the performance gain of our design. Compared to conventional incremental checkpointing, our results show that AG-ckpt can reduce the amount of checkpoint data by up to 50% and provide a speedup of 1.2x-1.3x in checkpoint efficiency.
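The effect of data granularity on incremental checkpoint size, which the abstract above investigates, can be shown with a toy Python model. It compares two memory images block by block and counts the bytes that an incremental checkpoint would have to save at a given block size. This is not AG-ckpt: the function names are hypothetical, and comparing full images is only an assumption made to keep the sketch self-contained (it says nothing about how the paper detects modified memory).

```python
from typing import List


def dirty_blocks(previous: bytes, current: bytes, block_size: int) -> List[int]:
    """Return the start offsets of blocks that differ between two images."""
    if len(previous) != len(current):
        raise ValueError("memory images must have the same length")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    offsets = []
    for start in range(0, len(current), block_size):
        end = start + block_size
        if previous[start:end] != current[start:end]:
            offsets.append(start)
    return offsets


def checkpoint_bytes(previous: bytes, current: bytes, block_size: int) -> int:
    """Bytes an incremental checkpoint would save at this granularity."""
    total = 0
    for start in dirty_blocks(previous, current, block_size):
        # The last block may be shorter than block_size.
        total += min(block_size, len(current) - start)
    return total
```

For an 8 KiB image in which only two bytes change (at offsets 10 and 5000), a 4096-byte granularity saves the whole 8192 bytes, while a 64-byte granularity saves only 128 bytes. Finer granularity reduces the data volume but increases the number of blocks to track, which is the trade-off the abstract's performance-granularity model addresses.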
