IEICE global.ieice.org Site

Keyword Search Result

[Keyword] fault-tolerance(44hit)

1-20hit(44hit)

Enhanced Sender-Based Message Logging for Reducing Forced Checkpointing Overhead in Distributed Systems
Jinho AHN

LETTER-Dependable Computing

Pubricized:
2021/06/08
Vol:
E104-D No:9
Page(s):
1500-1505
The previous communication-induced checkpointing may considerably induce worthless forced checkpoints because each process receiving messages cannot obtain sufficient information related to non-causal Z-paths. This paper presents an enhanced sender-based message logging protocol applicable to any communication-induced checkpointing to lead to a high decrease of the forced checkpointing overhead of communication-induced checkpointing in an effective way while permitting no useless checkpoint. The protocol allows each process sending a message to know the exact timestamp of the receiver of the message in its logging procedures without any extra message. Simulation verifies their great efficiency of overhead alleviation regardless of communication patterns.
Recovering Faulty Non-Volatile Flip Flops for Coarse-Grained Reconfigurable Architectures
Takeharu IKEZOE Takuya KOJIMA Hideharu AMANO

PAPER

Pubricized:
2020/12/14
Vol:
E104-C No:6
Page(s):
215-225
Recent IoT devices require extremely low standby power consumption, while a certain performance is needed during the active time, and Coarse-Grained Reconfigurable Arrays (CGRAs) have received attention because of their high energy efficiency. For further reduction of the standby energy consumption of CGRAs, the leakage power for their configuration memory must be reduced. Although the power gating is a common technique, the lost data in flip-flops and memory must be retrieved after the wake-up. Recovering everything requires numerous state transitions and considerable overhead both on its execution time and energy. To address the problem, Non-volatile Cool Mega Array (NVCMA), a CGRA providing non-volatile flip-flops (NVFFs) with spin transfer torque type non-volatile memory (NVM) technology has been developed. However, in general, non-volatile memory technologies have problems with reliability. Some NVFFs are stacked-at-0/1, and cannot store the data in a certain possibility. To improve the chip yield, we propose a mapping algorithm to avoid faulty processing elements of the CGRA caused by the erroneous configuration data. Next, we also propose a method to add an error-correcting code (ECC) mechanism to NVFFs for the configuration and constant memory. The proposed method was applied to NVCMA to evaluate the availability rate and reduction of write time. By using both methods, the average availability ratio of 94.2% was achieved, while the average availability ratio of the nine applications was 0.056% when the probability of failure of the FF was 0.01. The energy for storing data becomes about 2.3 times because of the hardware overhead of ECC but the proposed method can save 8.6% of the writing power on average.
Cyclic Vertex Connectivity of Trivalent Cayley Graphs
Jenn-Yang KE

PAPER-Fundamentals of Information Systems

Pubricized:
2018/03/30
Vol:
E101-D No:7
Page(s):
1828-1834
A vertex subset F ⊆ V(G) is called a cyclic vertex-cut set of a connected graph G if G-F is disconnected such that at least two components in G-F contain cycles. The cyclic vertex connectivity is the cardinality of a minimum cyclic vertex-cut set. In this paper, we show that the cyclic vertex connectivity of the trivalent Cayley graphs TGn is equal to eight for n ≥ 4.
Set-to-Set Disjoint Paths Routing in Torus-Connected Cycles
Antoine BOSSARD Keiichi KANEKO

LETTER-Dependable Computing

Pubricized:
2016/08/10
Vol:
E99-D No:11
Page(s):
2821-2823
Extending the very popular tori interconnection networks[1]-[3], Torus-Connected Cycles (TCC) have been proposed as a novel network topology for massively parallel systems [5]. Here, the set-to-set disjoint paths routing problem in a TCC is solved. In a TCC(k,n), it is proved that paths of lengths at most kn2+2n can be selected in O(kn2) time.
Autonomous Decentralized Service Oriented Architecture Concept and Application for Mission Critical Information Systems
Carlos PEREZ-LEGUIZAMO P. Josue HERNANDEZ-TORRES J.S. Guadalupe GODINEZ-BORJA Victor TAPIA-TEC

PAPER

Vol:
E99-B No:4
Page(s):
803-811
Recently, the Services Oriented Architectures (SOA) have been recognized as the key to the integration and interoperability of different applications and systems that coexist in an organization. However, even though the use of SOA has increased, some applications are unable to use it. That is the case of mission critical information applications, whose requirements such as high reliability, non-stop operation, high flexibility and high performance are not satisfied by conventional SOA infrastructures. In this article we present a novel approach of combining SOA with Autonomous Decentralized Systems (ADS) in order to provide an infrastructure that can satisfy those requirements. We have named this infrastructure Autonomous Decentralized Service Oriented Architecture (ADSOA). We present the concept and architecture of ADSOA, as well as the Loosely Couple Delivery Transaction and Synchronization Technology for assuring the data consistency and high reliability of the application. Moreover, a real implementation and evaluation of the proposal in a mission critical information system, the Uniqueness Verifying Public Key Infrastructure (UV-PKI), is shown in order to prove its effectiveness.
A Low-Latency DMR Architecture with Fast Checkpoint Recovery Scheme
Go MATSUKAWA Yohei NAKATA Yasuo SUGURE Shigeru OHO Yuta KIMI Masafumi SHIMOZAWA Shuhei YOSHIDA Hiroshi KAWAGUCHI Masahiko YOSHIMOTO

PAPER

Vol:
E98-C No:4
Page(s):
333-339
This paper presents a novel architecture for a fault-tolerant and dual modular redundancy (DMR) system using a checkpoint recovery approach. The architecture features exploitation of SRAM with simultaneous copy and instantaneous compare function. It can perform low-latency data copying between dual cores. Therefore, it can carry out fast backup and rollback. Furthermore, it can reduce the power consumption during data comparison process compared to the cyclic redundancy check (CRC). Evaluation results show that, compared with the conventional checkpoint/restart DMR, the proposed architecture reduces the cycle overhead by 97.8% and achieves a 3.28% low-latency execution cycle even if a one-time fault occurs when executing the task. The proposed architecture provides high reliability for systems with a real-time requirement.
A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems
Yonghwan KIM Tadashi ARARAGI Junya NAKAMURA Toshimitsu MASUZAWA

PAPER-Dependable Computing

Vol:
E97-D No:1
Page(s):
65-76
Checkpoint-rollback recovery, which is a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm especially if the systems are large-scale, since repeatedly taking global snapshots of the whole system requires unacceptable communication cost. As a sophisticated snapshot algorithm, a partial snapshot algorithm has been introduced that takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator instead of a global snapshot of the whole system. In this paper, we modify the previous partial snapshot algorithm to create a new one that can take a partial snapshot more efficiently, especially when multiple nodes concurrently initiate the algorithm. Experiments show that the proposed algorithm greatly reduces the amount of communication needed for taking partial snapshots.
On Reducing Rollback Propagation Effect of Optimistic Message Logging for Group-Based Distributed Systems
Jinho AHN

LETTER-Dependable Computing

Vol:
E96-D No:11
Page(s):
2473-2477
This paper presents a new scalable method to considerably reduce the rollback propagation effect of the conventional optimistic message logging by utilizing positive features of reliable FIFO group communication links. To satisfy this goal, the proposed method forces group members to replicate different receive sequence numbers (RSNs), which they assigned for each identical message to their group respectively, into their volatile memories. As the degree of redundancy of RSNs increases, the possibility of local recovery for each crashed process may significantly be higher. Experimental results show that our method can outperform the previous one in terms of the rollback distance of non-faulty processes with a little normal time overhead.
An Efficient Interpolation Based Erasure-Only Decoder for High-Rate Reed-Solomon Codes
Qian GUO Haibin KAN

LETTER-Coding Theory

Vol:
E95-A No:5
Page(s):
978-981
In this paper, we derive a simple formula to generate a wide-sense systematic generator matrix(we call it quasi-systematic) B for a Reed-Solomon code. This formula can be utilized to construct an efficient interpolation based erasure-only decoder with time complexity O(n2) and space complexity O(n). Specifically, the decoding algorithm requires 3kr + r2 - 2r field additions, kr + r2 + r field negations, 2kr + r2 - r + k field multiplications and kr + r field inversions. Compared to another interpolation based erasure-only decoding algorithm derived by D.J.J. Versfeld et al., our algorithm is much more efficient for high-rate Reed-Solomon codes.
Support Efficient and Fault-Tolerant Multicast in Bufferless Network-on-Chip
Chaochao FENG Zhonghai LU Axel JANTSCH Minxuan ZHANG Xianju YANG

PAPER-Computer System

Vol:
E95-D No:4
Page(s):
1052-1061
In this paper, we propose three Deflection-Routing-based Multicast (DRM) schemes for a bufferless NoC. The DRM scheme without packets replication (DRM_noPR) sends multicast packet through a non-deterministic path. The DRM schemes with adaptive packets replication (DRM_PR_src and DRM_PR_all) replicate multicast packets at the source or intermediate node according to the destination position and the state of output ports to reduce the average multicast latency. We also provide fault-tolerant supporting in these schemes through a reinforcement-learning-based method to reconfigure the routing table to tolerate permanent faulty links in the network. Simulation results illustrate that the DRM_PR_all scheme achieves 41%, 43% and 37% less latency on average than that of the DRM_noPR scheme and 27%, 29% and 25% less latency on average than that of the DRM_PR_src scheme under three synthetic traffic patterns respectively. In addition, all three fault-tolerant DRM schemes achieve acceptable performance degradation at various link fault rates without any packet lost.
Economical and Fault-Tolerant Load Balancing in Distributed Stream Processing Systems
Fuyuan XIAO Teruaki KITASUKA Masayoshi ARITSUGI

PAPER-Data Engineering, Web Information Systems

Vol:
E95-D No:4
Page(s):
1062-1073
We present an economical and fault-tolerant load balancing strategy (EFTLBS) based on an operator replication mechanism and a load shedding method, that fully utilizes the network resources to realize continuous and highly-available data stream processing without dynamic operator migration over wide area networks. In this paper, we first design an economical operator distribution (EOD) plan based on a bin-packing model under the constraints of each stream bandwidth as well as each server's CPU capacity. Next, we devise super-operator (SO) that load balances multi-degree operator replicas. Moreover, for improving the fault-tolerance of the system, we color the SOs based on a coloring bin-packing (CBP) model that assigns peer operator replicas to different servers. To minimize the effects of input rate bursts upon the system, we take advantage of a load shedding method while keeping the QoS guarantees made by the system based on the SO scheme and the CBP model. Finally, we substantiate the utility of our work through experiments on ns-3.
Lightweight Consistent Recovery Algorithm for Sender-Based Message Logging in Distributed Systems
Jinho AHN

LETTER-Dependable Computing

Vol:
E94-D No:8
Page(s):
1712-1715
Sender-based message logging (SBML) with checkpointing has its well-known beneficial feature, lowering highly failure-free overhead of synchronous logging with volatile logging at sender's memory. This feature encourages it to be applied into many distributed systems as a low-cost transparent rollback recovery technique. However, the original SBML recovery algorithm may no longer be progressing in some transient communication error cases. This paper proposes a consistent recovery algorithm to solve this problem by piggybacking small log information for unstable messages received on each acknowledgement message for returning the receive sequence number assigned to a message by its receiver. Our algorithm also enables all messages scheduled to be sent, but delayed because of some preceding unstable messages to be actually transmitted out much earlier than the existing ones.
A Domain Partition Model Approach to the Online Fault Recovery of FPGA-Based Reconfigurable Systems
Lihong SHANG Mi ZHOU Yu HU Erfu YANG

PAPER-Nonlinear Problems

Vol:
E94-A No:1
Page(s):
290-299
Field programmable gate arrays (FPGAs) are widely used in reliability-critical systems due to their reconfiguration ability. However, with the shrinking device feature size and increasing die area, nowadays FPGAs can be deeply affected by the errors induced by electromigration and radiation. To improve the reliability of FPGA-based reconfigurable systems, a permanent fault recovery approach using a domain partition model is proposed in this paper. In the proposed approach, the fault-tolerant FPGA recovery from faults is realized by reloading a proper configuration from a pool of multiple alternative configurations with overlaps. The overlaps are presented as a set of vectors in the domain partition model. To enhance the reliability, a technical procedure is also presented in which the set of vectors are heuristically filtered so that the corresponding small overlaps can be merged into big ones. Experimental results are provided to demonstrate the effectiveness of the proposed approach through applying it to several benchmark circuits. Compared with previous approaches, the proposed approach increased MTTF by up to 18.87%.
Learning Algorithms Which Make Multilayer Neural Networks Multiple-Weight-and-Neuron-Fault Tolerant
Tadayoshi HORITA Itsuo TAKANAMI Masatoshi MORI

PAPER-Biocybernetics, Neurocomputing

Vol:
E91-D No:4
Page(s):
1168-1175
Two simple but useful methods, called the deep learning methods, for making multilayer neural networks tolerant to multiple link-weight and neuron-output faults, are proposed. The methods make the output errors in learning phase smaller than those in practical use. The abilities of fault-tolerance of the multilayer neural networks in practical use, are analyzed in the relationship between the output errors in learning phase and in practical use. The analytical result shows that the multilayer neural networks have complete (100%) fault-tolerance to multiple weight-and-neuron faults in practical use. The simulation results concerning the rate of successful learnings, the ability of fault-tolerance, and the learning time, are also shown.
Fault-Tolerance for the Mobile Ad-Hoc Environment
Taesoon PARK Kwangho KIM

LETTER-Reliability, Maintainability and Safety Analysis

Vol:
E91-A No:1
Page(s):
413-416
Fault-tolerance is an important design issue in building a reliable mobile computing system. This paper considers checkpointing recovery services for a mobile computing system based on the ad-hoc network environment. Since potential problems of this new environment are insufficient power and limited storage capacity, the proposed scheme tries to reduce disk access frequency for saving recovery information, and also the amount of information saved for recovery. A brief simulation study has been performed and the results show that the proposed scheme takes advantage of the existing checkpointing recovery schemes.
A Fault-Tolerant Content Addressable Network
Daisuke TAKEMOTO Shigeaki TAGASHIRA Satoshi FUJITA

PAPER-Networks

Vol:
E89-D No:6
Page(s):
1923-1930
In this paper, we propose a new method to enhance the fault-tolerance of the Content Addressable Network (CAN), which is known as a typical pure P2P system based on the notion of Distributed Hash Table (DHT). The basic idea of the proposed method is to introduce redundancy to the management of index information distributed over the nodes in the given P2P network, by allowing each index to be assigned to several nodes, which was restricted to be one in the original CAN system. To keep the consistency among several copies of indices, we propose an efficient synchronization scheme based on the notion of labels assigned to each copy in a distinct manner. The performance of the proposed scheme is evaluated by simulation. The result of simulations indicates that the proposed scheme significantly enhances the fault-tolerance of the CAN system.
Minimizing the Buffer Size in Fault-Tolerant Video Servers for VBR Streams
Minseok SONG Heonshik SHIN

LETTER-Dependable Computing

Vol:
E88-D No:6
Page(s):
1294-1298
To guarantee the high reliability of video services, video servers usually adopt parity-encoding techniques in which data blocks and their associated parity blocks form a parity group. For real-time video service, all the blocks in a parity group are prefetched in order to cope with a possible disk failure, thereby incurring a buffering overhead. In this paper, we propose a new scheme called Round-level Parity Grouping (RPG) to reduce the buffer overhead while restoring VBR video streams in the presence of a faulty disk. RPG allows variable parity group sizes so that the exact amount of data is retrieved during each round. Based on RPG, we have developed a storage allocation algorithm for effective buffer management. Experimental results show that our proposed scheme reduces the buffer requirement by 20% to 25%.
A Self-Stabilizing Approximation Algorithm for the Distributed Minimum k-Domination
Sayaka KAMEI Hirotsugu KAKUGAWA

PAPER

Vol:
E88-A No:5
Page(s):
1109-1116
Self-stabilization is a theoretical framework of non-masking fault-tolerant distributed algorithms. In this paper, we investigate a self-stabilizing distributed approximation for the minimum k-dominating set (KDS) problem in general networks. The minimum KDS problem is a generalization of the well-known dominating set problem in graph theory. For a graph G = (V,E), a set Dk V is a KDS of G if and only if each vertex not in Dk is adjacent to at least k vertices in Dk. The approximation ratio of our algorithm is , where Δ is the maximum degree of G, in the networks of which the minimum degree is more than or equal to k.
Self-Reconfiguring of -Track-Switch Mesh Arrays with Spares on One Row and One Column by Simple Built-in Circuit
Itsuo TAKANAMI

PAPER-Dependable Computing

Vol:
E87-D No:10
Page(s):
2318-2328
We present a built-in self-reconfiguring system for a mesh-connected processor array where faulty processor elements are compensated for by spare processing elements located in one row and one column. It has advantages in that the number of spare processing elements is small and additional control circuits and networks for changing interconnections of processing elements is so simple that hardware overhead for reconfiguration is also small. First, to indicate the motivation to the proposed reconfiguration scheme, we briefly describe other schemes with the same number of spares as that of the proposed scheme where faulty processing elements are replaced using straight shifts toward spares, and compare their reconfiguration probabilities to each other. Then, we show that a variant of the proposed scheme has the highest probability. Next, we present a built-in self-reconfiguring system for the scheme and formally prove that it works correctly. It can automatically replace faulty processors by spare processors on detecting faults of processors.
A Nested Invocation Suppression Mechanism for Active Replication Fault-Tolerant CORBA
Deron LIANG Chen-Liang FANG Chyouhwa CHEN

PAPER-Dependable Computing

Vol:
E87-D No:8
Page(s):
2070-2077
Active replication is a common approach to building highly available and reliable distributed software applications. The redundant nested invocation (RNI) problem arises when servers in a replicated group issues nested invocations to other server groups in response to a client invocation. Automatic suppression of RNI is always a desirable solution, yet it is usually a difficult design issue. If the system has multithreading support, the difficulties of implementation increase dramatically. Intuitively, to design a deterministic thread execution control mechanism is a possible approach. Unfortunately, some modern operating systems implement thread on kernel level for execution fairness. For the kernel thread case, modification on thread control implies modifying the operating system kernel. This approach loses system portability which is one of the important requirements of CORBA or middleware. In this work, we propose a mechanism to perform the auto-suppression of redundant nested invocation in an active replication fault-tolerant (FT) CORBA system. Besides the mechanism design, we discuss the design correctness semantic and the correctness proof of our design.

1-20hit(44hit)

Keyword Search Result

[Keyword] fault-tolerance(44hit)

Enhanced Sender-Based Message Logging for Reducing Forced Checkpointing Overhead in Distributed Systems

Recovering Faulty Non-Volatile Flip Flops for Coarse-Grained Reconfigurable Architectures

Cyclic Vertex Connectivity of Trivalent Cayley Graphs

Set-to-Set Disjoint Paths Routing in Torus-Connected Cycles

Autonomous Decentralized Service Oriented Architecture Concept and Application for Mission Critical Information Systems

A Low-Latency DMR Architecture with Fast Checkpoint Recovery Scheme

A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems

On Reducing Rollback Propagation Effect of Optimistic Message Logging for Group-Based Distributed Systems

An Efficient Interpolation Based Erasure-Only Decoder for High-Rate Reed-Solomon Codes

Support Efficient and Fault-Tolerant Multicast in Bufferless Network-on-Chip

Economical and Fault-Tolerant Load Balancing in Distributed Stream Processing Systems

Lightweight Consistent Recovery Algorithm for Sender-Based Message Logging in Distributed Systems

A Domain Partition Model Approach to the Online Fault Recovery of FPGA-Based Reconfigurable Systems

Learning Algorithms Which Make Multilayer Neural Networks Multiple-Weight-and-Neuron-Fault Tolerant

Fault-Tolerance for the Mobile Ad-Hoc Environment

A Fault-Tolerant Content Addressable Network

Minimizing the Buffer Size in Fault-Tolerant Video Servers for VBR Streams

A Self-Stabilizing Approximation Algorithm for the Distributed Minimum k-Domination

Self-Reconfiguring of -Track-Switch Mesh Arrays with Spares on One Row and One Column by Simple Built-in Circuit

A Nested Invocation Suppression Mechanism for Active Replication Fault-Tolerant CORBA

Latest Issue

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles