IEICE global.ieice.org Site

Keyword Search Result

[Keyword] fault(493hit)

101-120hit(493hit)

A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems
Yonghwan KIM Tadashi ARARAGI Junya NAKAMURA Toshimitsu MASUZAWA

PAPER-Dependable Computing

Vol: E97-D No:1
Page(s): 65-76
Checkpoint-rollback recovery, which is a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm especially if the systems are large-scale, since repeatedly taking global snapshots of the whole system requires unacceptable communication cost. As a sophisticated snapshot algorithm, a partial snapshot algorithm has been introduced that takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator instead of a global snapshot of the whole system. In this paper, we modify the previous partial snapshot algorithm to create a new one that can take a partial snapshot more efficiently, especially when multiple nodes concurrently initiate the algorithm. Experiments show that the proposed algorithm greatly reduces the amount of communication needed for taking partial snapshots.
A Method of Parallelizing Consensuses for Accelerating Byzantine Fault Tolerance
Junya NAKAMURA Tadashi ARARAGI Toshimitsu MASUZAWA Shigeru MASUYAMA

PAPER-Dependable Computing

Vol: E97-D No:1
Page(s): 53-64
We propose a new method that accelerates asynchronous Byzantine Fault Tolerant (BFT) protocols designed on the principle of state machine replication. State machine replication protocols ensure consistency among replicas by applying operations in the same order to all of them. A naive way to determine the application order of the operations is to repeatedly execute the BFT consensus to determine the next executed operation, but this may introduce inefficiency caused by waiting for the completion of the previous execution of the consensus protocol. To reduce this inefficiency, our method allows parallel execution of the consensuses while keeping consistency of the consensus results at the replicas. In this paper, we also prove the correctness of our method and experimentally compare it with the existing method in terms of latency and throughput. The evaluation results show that our method makes a BFT protocol three or four times faster than the existing one when some machines or message transmissions are delayed.
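One consequence of running consensus instances in parallel is that decisions can complete out of order, while state machine replication still requires every replica to apply operations in the same order. The following is a minimal sketch of that ordering step only, not the protocol proposed in the paper; the slot numbering and the `InOrderApplier` class are hypothetical names introduced for illustration.

```python
class InOrderApplier:
    """Buffers decisions from parallel consensus instances and applies them in slot order."""

    def __init__(self):
        self.next_slot = 0   # lowest slot that has not been applied yet
        self.pending = {}    # slot -> decided operation, waiting for earlier slots
        self.log = []        # operations applied so far, in application order

    def on_decided(self, slot, operation):
        """Record the decision for `slot`, then apply every operation that is now contiguous."""
        self.pending[slot] = operation
        while self.next_slot in self.pending:
            self.log.append(self.pending.pop(self.next_slot))
            self.next_slot += 1


replica = InOrderApplier()
# Decisions arrive out of order, e.g. because slot 0 was delayed (assumed scenario).
for slot, op in [(2, "c"), (0, "a"), (1, "b")]:
    replica.on_decided(slot, op)
print(replica.log)
```

Buffering by slot keeps a slow instance from blocking the start of later ones; only the application step waits for the gap to fill.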
Cooperative VM Migration: A Symbiotic Virtualization Mechanism by Leveraging the Guest OS Knowledge
Ryousei TAKANO Hidemoto NAKADA Takahiro HIROFUCHI Yoshio TANAKA Tomohiro KUDOH

PAPER

Vol: E96-D No:12
Page(s): 2675-2683
Virtual machine (VM) migration is useful for improving flexibility and maintainability in cloud computing environments. However, VM monitor (VMM)-bypass I/O technologies, including PCI passthrough and SR-IOV, which significantly reduce the overhead of I/O virtualization, make VM migration impossible. This paper proposes a novel and practical mechanism, called Symbiotic Virtualization (SymVirt), for enabling migration and checkpoint/restart on a virtualized cluster with VMM-bypass I/O devices, without the virtualization overhead during normal operations. SymVirt allows a VMM to cooperate with a message passing layer on the guest OS, and then realizes VM-level migration and checkpoint/restart by combining user-level dynamic device configuration with coordination of distributed VMMs. We have implemented the proposed mechanism on top of QEMU/KVM and the Open MPI system. All PCI devices, including Infiniband, Ethernet, and Myrinet, are supported without implementing specific para-virtualized drivers, and it is not necessary to modify either the MPI runtime or applications. Using the proposed mechanism, we demonstrate reactive and proactive FT mechanisms on a virtualized Infiniband cluster. We have confirmed its effectiveness using both a memory-intensive micro benchmark and the NAS parallel benchmark.
SAT-Based Test Generation for Open Faults Using Fault Excitation Caused by Effect of Adjacent Lines
Jun YAMASHITA Hiroyuki YOTSUYANAGI Masaki HASHIZUME Kozo KINOSHITA

PAPER-Logic Synthesis, Test and Verification

Vol: E96-A No:12
Page(s): 2561-2567
Open faults are difficult to test since the voltage at the floating line is unpredictable and depends on the voltage at the adjacent lines. The effect of open faults can be easily excited if a test pattern provides the opposite logic value to most of the adjacent lines. In this paper, we present a procedure to generate as high a quality test as possible. We define the test quality for evaluating the effect of adjacent lines by assigning an opposite logic value to the faulty line. In our proposed test generation method, we utilize the SAT-based ATPG method. We generate test patterns that propagate the faulty effect to primary outputs and assign logic values to adjacent lines opposite that of the faulty line. In order to estimate test quality for open faults, we define the excitation effectiveness Eeff. To reduce the test volume, we utilize the open fault simulation. We calculate the excitation effectiveness by open fault simulation in order to eliminate unnecessary test patterns. The experimental results for the benchmark circuits prove the effectiveness of our procedure.
On Reducing Rollback Propagation Effect of Optimistic Message Logging for Group-Based Distributed Systems
Jinho AHN

LETTER-Dependable Computing

Vol: E96-D No:11
Page(s): 2473-2477
This paper presents a new scalable method to considerably reduce the rollback propagation effect of conventional optimistic message logging by utilizing positive features of reliable FIFO group communication links. To achieve this goal, the proposed method forces group members to replicate in their volatile memories the different receive sequence numbers (RSNs) that they each assigned to the same message sent to their group. As the degree of redundancy of RSNs increases, the possibility of local recovery for each crashed process may become significantly higher. Experimental results show that our method can outperform the previous one in terms of the rollback distance of non-faulty processes with only a little overhead during normal operation.
An Efficient Test and Repair Flow for Yield Enhancement of One-Time-Programming NROM-Based ROMs
Tsu-Lin LI Masaki HASHIZUME Shyue-Kung LU

LETTER

Vol: E96-D No:9
Page(s): 2026-2030
NROM is one of the emerging non-volatile-memory technologies, which is promising for replacing current floating-gate-based non-volatile memory such as flash memory. In order to raise the fabrication yield and enhance its reliability, a novel test and repair flow is proposed in this paper. Instead of the conventional fault replacement techniques, a novel fault masking technique is also exploited by considering the logical effects of physical defects when the customer's code is to be programmed. In order to maximize the possibilities of fault masking, a novel data inversion technique is proposed. The corresponding BIST architectures are also presented. According to experimental results, the repair rate and fabrication yield can be improved significantly. Moreover, the incurred hardware overhead is almost negligible.
Open-Fault Resilient Multiple-Valued Codes for Reliable Asynchronous Global Communication Links
Naoya ONIZAWA Atsushi MATSUMOTO Takahiro HANYU

PAPER

Vol: E96-D No:9
Page(s): 1952-1961
This paper introduces open-wire fault-resilient multiple-valued codes for reliable asynchronous point-to-point global communication links. In the proposed encoding, two communication modules assign complementary codewords that change between two valid states when there is no open-wire fault. Under an open-wire fault, the codewords at each module do not reach one of the two valid states and remain in “invalid” states. Detecting the invalid states makes it possible to stop sending wrong codewords caused by an open-wire fault. The detectability of the open-wire fault based on the proposed encoding is proven for m-of-n codes. The proposed code used in the multiple-valued asynchronous global communication link is capable of detecting a single open-wire fault with 3.08 times higher coding efficiency than a conventional multiple-valued code used in a triple-modular redundancy (TMR) link that detects an open-wire fault under the same dynamic range of logical values.
Low-Overhead Fault-Secure Parallel Prefix Adder by Carry-Bit Duplication
Nobutaka KITO Naofumi TAKAGI

PAPER

Vol: E96-D No:9
Page(s): 1962-1970
We propose a low-overhead fault-secure parallel prefix adder. We duplicate carry bits for checking purposes. Only one half of normal carry bits are compared with the corresponding redundant carry bits, and the hardware overhead of the adder is low. For concurrent error detection, we also predict the parity of the result. The adder uses parity-based error detection and it has high compatibility with systems that have parity-based error detection. We can implement various fault-secure parallel prefix adders such as Sklansky adder, Brent-Kung adder, Han-Carlson adder, and Kogge-Stone adder. The area overhead of the proposed adder is about 15% lower than that of a previously proposed adder that compares all the carry bits.
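Parity prediction for an adder rests on a standard identity: since each sum bit is `a_i XOR b_i XOR c_i`, the parity of the sum equals the parity of `a`, XOR the parity of `b`, XOR the parity of the carries entering each bit position. The sketch below checks that identity exhaustively for 8-bit operands. It is a software illustration only and does not model the authors' carry-duplication circuit; all function names are hypothetical.

```python
def parity(x):
    """Return the XOR of all bits of a non-negative integer."""
    return bin(x).count("1") & 1


def carries_into_bits(a, b, width):
    """Return an integer whose bit i is the carry into position i of a + b (carry-in 0)."""
    carry, vec = 0, 0
    for i in range(width):
        vec |= carry << i
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    return vec


def predicted_sum_parity(a, b, width):
    """Predict the parity of the width-bit sum without looking at the sum itself."""
    return parity(a) ^ parity(b) ^ parity(carries_into_bits(a, b, width))


WIDTH = 8
mask = (1 << WIDTH) - 1
# Compare the prediction with the parity of the actual (truncated) sum for all operand pairs.
ok = all(
    predicted_sum_parity(a, b, WIDTH) == parity((a + b) & mask)
    for a in range(1 << WIDTH)
    for b in range(1 << WIDTH)
)
print(ok)
```

In a checker, a mismatch between the predicted parity and the parity of the produced sum would indicate an error; a faulty carry could corrupt both sides consistently, which is why the carries themselves also need checking.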
Round Addition DFA on 80-bit Piccolo and TWINE
Hideki YOSHIKAWA Masahiro KAMINAGA Arimitsu SHIKODA Toshinori SUZUKI

LETTER

Vol: E96-D No:9
Page(s): 2031-2035
We present a round addition differential fault analysis (DFA) for some lightweight 80-bit block ciphers. It is shown that only one correct ciphertext and two faulty ciphertexts are required to reconstruct secret keys in 80-bit Piccolo and TWINE, and that the reconstructions are easier than for 128-bit CLEFIA.
Fault Diagnosis and Reconfiguration Method for Network-on-Chip Based Multiple Processor Systems with Restricted Private Memories
Masashi IMAI Tomohiro YONEDA

PAPER

Vol: E96-D No:9
Page(s): 1914-1925
We propose a fault diagnosis and reconfiguration method based on the Pair and Swap scheme to improve the reliability and the MTTF (Mean Time To Failure) of network-on-chip based multiple processor systems where each processor core has its private memory. In the proposed scheme, two identical copies of a given task are executed on a pair of processor cores and the results are compared repeatedly in order to detect processor faults. If a fault is detected by a mismatch, the fault is identified and isolated using TMR (Triple Modular Redundancy) and the system is reconfigured using the redundant processor cores. We propose that each task be quadruplicated and statically assigned to private memories so that each memory has only two different tasks. We evaluate the reliability of the proposed quadruplicated task allocation scheme from the viewpoint of MTTF. As a result, the MTTF of the proposed scheme is over 4.3 times longer than that of the duplicated task allocation scheme.
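The detect-then-identify flow described above can be sketched in a few lines: comparing two results only reveals that a fault exists, while a third result allows a majority vote that singles out the disagreeing core. This is a simplified illustration under assumed names (`core0`, `core1`, `core2`, and both helper functions are hypothetical) and does not reproduce the Pair and Swap scheme or its task allocation.

```python
from collections import Counter


def pair_mismatch(result_a, result_b):
    """Duplicated execution: a mismatch signals a fault but not which core is faulty."""
    return result_a != result_b


def locate_faulty(results):
    """Majority vote over three results; return (majority value, ids of disagreeing cores).

    `results` maps a core id to the value that core computed.
    """
    value, votes = Counter(results.values()).most_common(1)[0]
    if votes < 2:
        raise ValueError("no majority: more than one core disagrees")
    return value, [core for core, r in results.items() if r != value]


pair = {"core0": 42, "core1": 41}
if pair_mismatch(pair["core0"], pair["core1"]):
    # Re-run the task on a third (hypothetical) spare core to break the tie.
    value, faulty = locate_faulty({**pair, "core2": 42})
    print(value, faulty)
```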
Potential of Fault-Detection Coverage by means of On-Chip Redundancy - IEC61508: Are There Royal Roads to SIL 4?
Nobuyasu KANEKAWA

PAPER

Vol: E96-D No:9
Page(s): 1907-1913
This paper investigates the potential to improve fault-detection coverage by means of on-chip redundancy. The international standard on functional safety, namely IEC61508 Ed. 2.0 Part 2 Annex E.3, prescribes that the upper bound of βIC (the ratio of common cause failures (CCF) to all failures) is 0.25 in order to satisfy the frequency upper bound of dangerous failure in the safety function for SIL (Safety Integrity Level) 3. On the other hand, this paper argues that βIC does not necessarily have to be less than 0.25 for SIL 3, and that the upper bound of βIC can be determined depending on the failure rate λ and the CCF detection coverage. In other words, the frequency upper bound of dangerous failure for SIL 3 can also be satisfied with βIC higher than 0.25 if the failure rate λ is lower than 400 FIT. Moreover, the paper shows that on-chip redundancy has the potential to satisfy the SIL 4 requirement; the frequency upper bound of dangerous failure for SIL 4 can be satisfied with feasible ranges of βIC, λ and CCF coverage, which can be realized by redundant code.
Field Slack Assessment for Predictive Fault Avoidance on Coarse-Grained Reconfigurable Devices
Toshihiro KAMEDA Hiroaki KONOURA Dawood ALNAJJAR Yukio MITSUYAMA Masanori HASHIMOTO Takao ONOYE

PAPER-Test and Verification

Vol: E96-D No:8
Page(s): 1624-1631
This paper proposes a procedure for avoiding delay faults in the field with slack assessment during standby time. The proposed procedure performs path delay testing and checks whether the slack is larger than a threshold value using selectable delay embedded in basic elements (BEs). If the slack is smaller than the threshold, a pair of BEs to be replaced, which maximizes the path slack, is identified. Experimental results with two application circuits mapped on a coarse-grained architecture show that, for aging-induced delay degradation, a small threshold slack, less than 1 ps in a test case, is enough to ensure delay fault prediction.
Selective Check of Data-Path for Effective Fault Tolerance
Tanvir AHMED Jun YAO Yuko HARA-AZUMI Shigeru YAMASHITA Yasuhiko NAKASHIMA

PAPER-Design Methodology

Vol: E96-D No:8
Page(s): 1592-1601
Fault tolerance is playing a progressively important role in covering the increasing soft/hard error rates in electronic devices that accompany advances in process technologies. Research shows that wear-out faults have a gradual onset, starting with a timing fault and eventually leading to a permanent fault. Error detection is thus a required function to maintain execution correctness. Currently, however, many highly dependable methods to cover permanent faults are over-designed, using very frequent checking because they lack awareness of the fault possibility in the circuits used for the pending executions. In this research, to address the over-checking problem, we introduce a metric for permanent defects, the operation defective probability (ODP), to quantitatively guide the placement of check operations at critical positions only. By using this selective checking approach, we can achieve near-100% dependability with about 53% fewer check operations than the ideal reliable method, which performs exhaustive checks to guarantee zero error propagation. By this means, we are able to reduce power consumption by 21.7% by avoiding the non-critical checking inside the over-designed approach.
Dynamic Fault Tree Analysis for Systems with Nonexponential Failure Components
Tetsushi YUGE Shigeru YANAGI

PAPER-Reliability, Maintainability and Safety Analysis

Vol: E96-A No:8
Page(s): 1730-1736
A method of calculating the top event probability of a fault tree, where dynamic gates and repeated events are included and the occurrences of basic events follow nonexponential distributions, is proposed. The method is based on the Bayesian network formulation for a DFT proposed by Yuge and Yanagi [1]. That formulation had difficulty in calculating a sequence probability if components have nonexponential failure distributions. We propose an alternative method to obtain the sequence probability in this paper. First, a method for the case of the Erlang distribution is discussed. Then, Tijms's fitting procedure is applied to deal with a general distribution. The procedure gives a mixture of two Erlang distributions as an approximate distribution for a general distribution given the mean and standard deviation. A numerical example shows that our method works well for complex systems.
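To make the "mixture of two Erlang distributions given the mean and standard deviation" concrete, the sketch below uses a well-known two-moment fit that is commonly attributed to Tijms: for a squared coefficient of variation c² with 1/k ≤ c² ≤ 1/(k−1), it mixes Erlang(k−1) and Erlang(k) with a common rate. Whether the paper uses exactly this variant is an assumption; the function names are hypothetical, and only the case 0 < c² ≤ 1 is covered.

```python
import math


def fit_erlang_mixture(mean, std):
    """Fit a mixture of Erlang(k-1) and Erlang(k) with a common rate to a mean and std.

    Returns (k, p, rate): with probability p the stage count is k-1, otherwise k.
    Requires a squared coefficient of variation c2 with 0 < c2 <= 1.
    """
    c2 = (std / mean) ** 2
    if not 0 < c2 <= 1:
        raise ValueError("this sketch only covers 0 < c^2 <= 1")
    k = max(2, math.ceil(1 / c2))  # choose k so that 1/k <= c2 <= 1/(k-1)
    p = (k * c2 - math.sqrt(k * (1 + c2) - k * k * c2)) / (1 + c2)
    rate = (k - p) / mean
    return k, p, rate


def mixture_moments(k, p, rate):
    """Mean and variance of the fitted mixture, from Erlang first and second moments."""
    m1 = (p * (k - 1) + (1 - p) * k) / rate
    m2 = (p * (k - 1) * k + (1 - p) * k * (k + 1)) / rate ** 2
    return m1, m2 - m1 ** 2


k, p, rate = fit_erlang_mixture(mean=10.0, std=8.0)
fitted_mean, fitted_var = mixture_moments(k, p, rate)
print(k, p, rate, fitted_mean, fitted_var)
```

The second function recomputes the first two moments from the fitted parameters, which gives a quick check that the mixture reproduces the requested mean and variance.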
Coverage of Irrelevant Components in Systems with Imperfect Fault Coverage
Jianwen XIANG Fumio MACHIDA Kumiko TADANO Yoshiharu MAENO Kazuo YANOO

LETTER-Reliability, Maintainability and Safety Analysis

Vol: E96-A No:7
Page(s): 1649-1652
Traditional imperfect fault coverage models only consider the coverage (including identification and isolation) of faulty components, and they do not consider the coverage of irrelevant (operational) components. One potential reason for the omission is that in these models the system is generally assumed to be coherent, meaning that each component is initially relevant. In this paper, we first point out that an initially relevant component could become irrelevant afterwards due to the failures of some other components, and thus it is important to consider the handling of irrelevancy even if the system is originally coherent. We propose an irrelevancy coverage model (IRCM) in which the coverage is extended to the irrelevant components in addition to the faulty components. The IRCM can not only significantly enhance system reliability by preventing future system failures resulting from the not-covered failures of the irrelevant components, but may also play an important role in efficient energy use in practice by turning off the irrelevant components in a timely manner.
Test Generation for Delay Faults on Clock Lines under Launch-on-Capture Test Environment
Yoshinobu HIGAMI Hiroshi TAKAHASHI Shin-ya KOBAYASHI Kewal K. SALUJA

PAPER-Dependable Computing

Vol: E96-D No:6
Page(s): 1323-1331
This paper deals with delay faults on clock lines assuming the launch-on-capture test. In this realistic fault model, the amount of delay at the FF driven by the faulty clock line is such that the scan shift operation can perform correctly even in the presence of a fault, but during the system clock operation, capturing functional value(s) at faulty FF(s), i.e. FF(s) driven by the clock with delay, is delayed and correct value(s) may not be captured. We developed a fault simulator that can handle such faults and using this simulator we investigate the relation between the duration of the delay and the difficulty of detecting clock delay faults in the launch-on-capture test. Next, we propose test generation methods for detecting clock delay faults that affect a single or two FFs. Experimental results for benchmark circuits are given in order to establish the effectiveness of the proposed methods.
Detection and Localization of Link Quality Degradation in Transparent WDM Networks
Wissarut YUTTACHAI Poompat SAENGUDOMLERT Wuttipong KUMWILAISAK

PAPER-Fiber-Optic Transmission for Communications

Vol: E96-B No:6
Page(s): 1412-1424
We consider the problem of detecting and localizing link quality degradations in transparent wavelength division multiplexing (WDM) networks. In particular, we consider the degradation of the optical signal-to-noise ratio (OSNR), which is a key parameter for link quality monitoring in WDM networks. With transparency in WDM networks, transmission lightpaths can bypass electronic processing at intermediate nodes. Accordingly, links cannot always be monitored by receivers at their end nodes. This paper proposes the use of optical multicast probes to monitor OSNR degradations on optical links. The proposed monitoring scheme consists of two steps. The first step is an off-line process to set up monitoring trees using integer linear programming (ILP). The set of monitoring trees is selected to guarantee that significant OSNR degradations can be identified on any link or links in the network. The second step uses optical performance monitors that are placed at the receivers identified in the first step. The information from these monitors is collected and input to the estimation algorithm to localize the degraded links. Numerical results indicate that the proposed monitoring algorithm is able to detect link degradations that cause significant OSNR changes. In addition, we demonstrate how the information obtained from monitoring can be used to detect a significant end-to-end OSNR degradation even though there is no significant OSNR degradation on individual links.
Dynamic Fault Tree Analysis Using Bayesian Networks and Sequence Probabilities
Tetsushi YUGE Shigeru YANAGI

PAPER-Reliability, Maintainability and Safety Analysis

Vol: E96-A No:5
Page(s): 953-962
A method of calculating the exact top event probability of a fault tree with dynamic gates and repeated basic events is proposed. The top event probability of such a dynamic fault tree is obtained by converting the tree into an equivalent Markov model. However, the Markov-based method is not realistic for a complex system model because the number of states that should be considered in the Markov analysis increases explosively as the number of basic events in the model increases. To overcome this shortcoming, we propose an alternative method in this paper. It is a hybrid of a Bayesian network (BN) and an algebraic technique. First, modularization is applied to a dynamic fault tree. The detected modules are classified into two types: one satisfies the parental Markov condition and the other does not. A module without the parental Markov condition is replaced with an equivalent single event. The occurrence probability of this event is obtained as the sum of disjoint sequence probabilities. After the contraction of modules without the parental Markov condition, the BN algorithm is applied to the dynamic fault tree. The conditional probability tables for dynamic gates are presented. The BN is a standard one and has hierarchical and modular features. A numerical example shows that our method works well for complex systems.
Improving Test Coverage by Measuring Path Delay Time Including Transmission Time of FF
Wenpo ZHANG Kazuteru NAMBA Hideo ITO

LETTER-Dependable Computing

Vol: E96-D No:5
Page(s): 1219-1222
As technology scales to 45 nm and below, the reliability of VLSI declines due to small delay defects, which are hard to detect at the functional clock frequency. To detect small delay defects, a method which measures the delay time of a path in the circuit under test (CUT) was proposed. However, because a large number of FFs exist in recent VLSI, the probability that a resistive defect occurs in the FFs is increased. A test method measuring path delay time including the transmission time of FFs is therefore necessary. However, the path measured by the conventional on-chip path delay time measurement method does not include a part of a master latch. Thus, testing using the conventional measurement method cannot detect defects occurring in that part. This paper proposes an improved on-chip path delay time measurement method. Test coverage is improved by measuring the path delay time including the transmission time of a master latch. The proposed method uses a duty-cycle-modified clock signal. Evaluation results show that the proposed method improves test coverage by 5.25-11.28% with the same area overhead as the conventional method.
Understanding the Impact of BPRAM on Incremental Checkpoint
Xu LI Kai LU Xiaoping WANG Bin DAI Xu ZHOU

PAPER-Dependable Computing

Vol: E96-D No:3
Page(s): 663-672
Existing large-scale systems suffer from various hardware/software failures, motivating research on fault-tolerance techniques. Checkpoint-restart techniques are widely applied fault-tolerance approaches, especially in scientific computing systems. However, the overhead of checkpointing largely influences the overall system performance. Recently, emerging byte-addressable, persistent memory technologies, such as phase change memory (PCM), have made it possible to implement checkpointing at arbitrary data granularity. However, the impact of data granularity on the checkpointing cost has not been fully addressed. In this paper, we investigate how data granularity influences the performance of a checkpoint system. Further, we design and implement a high-performance checkpoint system named AG-ckpt. AG-ckpt is a hybrid-granularity incremental checkpointing scheme based on (1) low-cost modified-memory detection and (2) fine-grained memory duplication. Moreover, we formulate the performance-granularity relationship of checkpointing systems through a mathematical model, and further obtain the optimum solutions. We conduct experiments with several typical benchmarks to verify the performance gain of our design. Compared to conventional incremental checkpointing, our results show that AG-ckpt can reduce the amount of checkpoint data by up to 50% and provide a speedup of 1.2x-1.3x in checkpoint efficiency.
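The effect of data granularity on incremental checkpoint size can be illustrated with a toy model: split a memory image into fixed-size blocks and save only the blocks that changed since the previous checkpoint. This is not AG-ckpt and does not model its detection or duplication mechanisms; detecting changes by comparing two full copies is an assumption made purely for illustration, and the function names and block sizes are hypothetical.

```python
def dirty_blocks(old, new, block_size):
    """Return {offset: bytes} for every block of `new` that differs from `old`."""
    assert len(old) == len(new)
    return {
        off: new[off:off + block_size]
        for off in range(0, len(new), block_size)
        if old[off:off + block_size] != new[off:off + block_size]
    }


def checkpoint_bytes(old, new, block_size):
    """Total payload an incremental checkpoint would save at this granularity."""
    return sum(len(chunk) for chunk in dirty_blocks(old, new, block_size).values())


before = bytes(4096)
image = bytearray(before)
image[10] = 1      # two scattered single-byte writes
image[3000] = 1
after = bytes(image)

coarse = checkpoint_bytes(before, after, 4096)  # one page-sized block
fine = checkpoint_bytes(before, after, 64)      # 64-byte blocks
print(coarse, fine)
```

With sparse writes, the finer granularity saves far fewer bytes, but it also means tracking many more blocks, which is the kind of trade-off a performance-granularity model has to capture.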
