Keyword Search Result

[Keyword] fault tolerance(100hit)

1-20hit(100hit)

  • Dynamic Fault Tolerance for Multi-Node Query Processing

    Yutaro BESSHO  Yuto HAYAMIZU  Kazuo GODA  Masaru KITSUREGAWA  

     
    PAPER

      Publicized:
    2022/02/03
      Vol:
    E105-D No:5
      Page(s):
    909-919

    Parallel processing is a typical approach to answering analytical queries on large databases. As the size of the database increases, we often try to increase the parallelism by incorporating more processing nodes. However, this approach increases the possibility of node failure as well. According to the conventional practice, if a failure occurs during query processing, the database system restarts the query processing from the beginning. Such temporal cost may be unacceptable to the user. This paper proposes a fault-tolerant query processing mechanism, named PhoeniQ, for analytical parallel database systems. PhoeniQ continuously takes a checkpoint for every operator pipeline and replicates the output of each stateful operator among different processing nodes. If a single processing node fails during query processing, another can promptly take over the processing. Hence, PhoeniQ allows the database system to efficiently resume query processing after a partial failure event. This paper presents a key design of PhoeniQ and prototype-based experiments to demonstrate that PhoeniQ imposes negligible performance overhead and efficiently continues query processing in the face of node failure.

  • Node-Disjoint Paths Problems in Directed Bijective Connection Graphs

    Keiichi KANEKO  

     
    PAPER-Fundamentals of Information Systems

      Publicized:
    2019/09/26
      Vol:
    E103-D No:1
      Page(s):
    93-100

    In this paper, we extend the notion of bijective connection graphs to introduce directed bijective connection graphs. We propose algorithms that solve the node-to-set node-disjoint paths problem and the node-to-node node-disjoint paths problem in a directed bijective connection graph. The time complexities of the algorithms are both O(n^4), and the maximum path lengths are both 2n-1.

  • Avoiding Performance Impacts by Re-Replication Workload Shifting in HDFS Based Cloud Storage

    Thanda SHWE  Masayoshi ARITSUGI  

     
    PAPER-Cloud Computing

      Publicized:
    2018/09/18
      Vol:
    E101-D No:12
      Page(s):
    2958-2967

    Data replication in cloud storage systems brings many benefits, such as fault tolerance, data availability, data locality and load balancing, from both reliability and performance perspectives. However, each time a datanode fails, data blocks stored on the failed datanode must be restored to maintain the replication level. This may be a large burden for a system in which resources are highly utilized by users' application workloads. Although there have been many proposals for replication, the approach of re-replication has not been properly addressed yet. In this paper, we present a deferred re-replication algorithm to dynamically shift the re-replication workload based on the current resource utilization status of the system. As the workload pattern varies depending on the time of day, simulation results from a synthetic workload demonstrate a large opportunity for minimizing impacts on users' application workloads with a simple algorithm that adjusts re-replication based on current resource utilization. Our approach can reduce performance impacts on users' application workloads while ensuring the same reliability level as default HDFS can provide.
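
    The general idea of deferring re-replication while the system is busy can be illustrated with a toy scheduler. This is only a sketch under stated assumptions, not the algorithm from the paper: the function name, the fixed utilization `threshold`, and the `batch_size` are all hypothetical, and the sketch ignores any deadline that a real system would need in order to preserve its reliability level.

```python
from collections import deque


def schedule_re_replication(pending_blocks, utilization_trace,
                            threshold=0.7, batch_size=2):
    """Toy deferral policy (hypothetical, not the paper's algorithm).

    At each time step, blocks are re-replicated only if the observed
    utilization is below `threshold`; otherwise the work is deferred.
    Returns (schedule, leftover) where schedule is a list of
    (time_step, block) pairs.
    """
    queue = deque(pending_blocks)
    schedule = []
    for step, utilization in enumerate(utilization_trace):
        if not queue:
            break
        if utilization >= threshold:
            continue  # system is busy: defer re-replication
        for _ in range(min(batch_size, len(queue))):
            schedule.append((step, queue.popleft()))
    return schedule, list(queue)


blocks = ["blk_1", "blk_2", "blk_3"]
trace = [0.9, 0.8, 0.4, 0.95, 0.3]  # assumed utilization per time step
plan, leftover = schedule_re_replication(blocks, trace)
print(plan, leftover)
```

    In this run the first two time steps are skipped because utilization is above the assumed threshold, so the re-replication work lands on the quieter steps.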

  • Low-Cost Adaptive and Fault-Tolerant Routing Method for 2D Network-on-Chip

    Ruilian XIE  Jueping CAI  Xin XIN  Bo YANG  

     
    LETTER-Computer System

      Publicized:
    2017/01/20
      Vol:
    E100-D No:4
      Page(s):
    910-913

    This letter presents a Preferable Mad-y (PMad-y) turn model and a Low-cost Adaptive and Fault-tolerant Routing (LAFR) method that use one and two virtual channels along the X and Y dimensions, respectively, for a 2D mesh Network-on-Chip (NoC). Applying PMad-y rules and using the link status of neighbor routers within 2 hops, LAFR can tolerate multiple faulty links and routers in more complicated faulty situations and improve the reliability of the network without losing network performance. Simulation results show that LAFR achieves better saturation throughput (0.98% on average) than other fault-tolerant routing methods and maintains high reliability of more than 99.56% on average. For achieving 100% network reliability, a Preferable LAFR (PLAFR) is proposed.

  • Placement of Virtual Storages for Distributed Robust Cloud Storage

    Yuya TARUTANI  Yuichi OHSITA  Masayuki MURATA  

     
    PAPER-Network Management/Operation

      Vol:
    E99-B No:4
      Page(s):
    885-893

    Cloud storage has become popular and is being used to hold important data. As a result, availability has become important; cloud storage providers should allow users to upload or download data even if some part of the system has failed. In this paper, we discuss distributed cloud storage that is robust against failures. In distributed cloud storage, multiple replicas of each data chunk are stored in virtual storage systems at geographically different locations. Thus, even if one of the virtual storage systems becomes unavailable, users can access the data chunk from another virtual storage system. In distributed cloud storage, the placement of the virtual storage systems is important; if the placement means that a large number of virtual storage systems could become unavailable from a failure, a large number of replicas of each data chunk must be prepared to maintain availability. In this paper, we propose a virtual storage placement method that assures availability with a small number of replicas. We evaluated our method by comparing it with three other methods. The evaluation shows that our method can maintain availability while requiring only 60% of the network costs required by the compared methods.

  • Living Will for Resilient Structured Overlay Networks

    Kimihiro MIZUTANI  Takeru INOUE  Toru MANO  Osamu AKASHI  Satoshi MATSUURA  Kazutoshi FUJIKAWA  

     
    PAPER

      Vol:
    E99-B No:4
      Page(s):
    830-840

    The routing efficiency of structured overlay networks depends on the consistency of pointers between nodes, where a pointer maps a node identifier to the corresponding address. This consistency can, however, break temporarily when some overlay nodes fail, since it takes time to repair the broken pointers in a distributed manner. Conventional solutions utilize “backpointers” to quickly discover any failure among the pointing nodes, which allow them to fix the pointers in a short time. Overlay nodes are, however, required to maintain backpointers for every pointing node, which incurs significant memory and consistency check overhead. This paper proposes a novel light-weight protocol; an overlay node gives a “living will” containing its acquaintances (backpointers) only to its successor, thus other nodes are freed from the need to maintain it. Our carefully-designed protocol guarantees that all acquaintances are registered via the living will, even in the presence of churn, and the successor notifies the acquaintances for the deceased. Even if the successor passes away and the living will is lost, the successor to the successor can identify the acquaintances with a high success ratio. Simulations show that our protocol greatly reduces memory overhead as well as the detection time for node failure with the cost being a slight increase in messaging load.

  • Failure Detection in P2P-Grid System

    Huan WANG  Hideroni NAKAZATO  

     
    PAPER-Grid System

      Publicized:
    2015/09/15
      Vol:
    E98-D No:12
      Page(s):
    2123-2131

    Peer-to-peer (P2P)-Grid systems are being investigated as a platform for converging the Grid and P2P networks in the construction of large-scale distributed applications. The highly dynamic nature of P2P-Grid systems greatly affects the execution of distributed programs. Uncertainty caused by arbitrary node failure and departure significantly affects the availability of computing resources and system performance. Checkpoint-and-restart is the most common scheme for fault tolerance because it periodically saves the execution progress onto stable storage. In this paper, we suggest a checkpoint-and-restart mechanism as a fault-tolerant method for applications on P2P-Grid systems. A failure detection mechanism is a necessary prerequisite to fault tolerance and fault recovery in general. Given the highly dynamic nature of nodes within P2P-Grid systems, any failure should be detected to ensure effective task execution. Therefore, the failure detection mechanism as an integral part of P2P-Grid systems was studied. We discuss how the design of various failure detection algorithms affects their performance in terms of the average failure detection time of nodes. Numerical analysis results and an implementation evaluation are also provided to show the different average failure detection times in real systems for various failure detection algorithms. The comparison shows the shortest average failure detection time, by 8.8s, on the basis of the WP failure detector. Our lowest mean time to recovery (MTTR) is also shown to have a distinct advantage, with a time consumption reduction of about 5.5s over its counterparts.
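
    For readers unfamiliar with failure detection, a timeout-based heartbeat detector is one of the simplest designs. The sketch below is a generic illustration and is not the WP failure detector or any algorithm evaluated in the paper; the class name, the `timeout` value, and the explicit `now` argument (used instead of a real clock so the example is deterministic) are assumptions.

```python
class HeartbeatDetector:
    """Generic timeout-based failure detector (illustrative only)."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before suspicion
        self.last_seen = {}      # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        """Record that `node` was heard from at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=2.0)        # only node "a" keeps reporting
print(detector.suspected(now=4.0))
```

    In such a design the timeout trades detection time against false suspicions: a shorter timeout detects failures sooner but is more likely to suspect a node that is merely slow.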

  • The Fault-Tolerant Hamiltonian Problems of Crossed Cubes with Path Faults

    Hon-Chan CHEN  Tzu-Liang KUNG  Yun-Hao ZOU  Hsin-Wei MAO  

     
    PAPER-Switching System

      Publicized:
    2015/09/15
      Vol:
    E98-D No:12
      Page(s):
    2116-2122

    In this paper, we investigate the fault-tolerant Hamiltonian problems of crossed cubes with a faulty path. More precisely, let P denote any path in an n-dimensional crossed cube CQ_n for n ≥ 5, and let V(P) be the vertex set of P. We show that CQ_n - V(P) is Hamiltonian if |V(P)| ≤ n and is Hamiltonian connected if |V(P)| ≤ n-1. Compared with the previous results showing that the crossed cube is (n-2)-fault-tolerant Hamiltonian and (n-3)-fault-tolerant Hamiltonian connected for arbitrary faults, the contribution of this paper indicates that the crossed cube can tolerate more faulty vertices if these vertices happen to form some specific types of structures.

  • Virtual Network Allocation for Fault Tolerance Balanced with Physical Resources Consumption in a Multi-Tenant Data Center

    Yukio OGAWA  Go HASEGAWA  Masayuki MURATA  

     
    PAPER

      Vol:
    E98-B No:11
      Page(s):
    2121-2131

    In a multi-tenant data center, nodes and links of tenants' virtual networks (VNs) share a single component of the physical substrate network (SN). The failure of a single SN component can thereby cause the simultaneous failures of multiple nodes and links in a single VN; this complex of failures must significantly disrupt the services offered on the VN. In the present paper, we clarify how the fault tolerance of each VN is affected by a single SN failure, especially from the perspective of VN allocation in the SN. We propose a VN allocation model for multi-tenant data centers and formulate a problem that deals with the bandwidth loss in a single VN due to a single SN failure. We conduct numerical simulations with a setting in which the bandwidth demand on each VN, denoted by C_i, is 1.7×10^8 bit/s. When each node in each VN is scattered and mapped to an individual physical server, each VN can have the minimum bandwidth loss (5.3×10^2 bit/s (3.0×10^-6 × C_i)) but the maximum required bandwidth between physical servers (1.0×10^9 bit/s (5.7 × C_i)). The balance between the bandwidth loss and the required physical resources can be optimized by assigning every four nodes of each VN to an individual physical server, meaning that we minimize the bandwidth loss without over-provisioning of core switches.
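
    The relation between the absolute figures and the multiples of C_i quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the inputs are the rounded values from the abstract (C_i = 1.7×10^8 bit/s, coefficients 3.0×10^-6 and 5.7), so the products are expected to match the reported 5.3×10^2 bit/s and 1.0×10^9 bit/s only to within rounding.

```python
C_i = 1.7e8                    # bandwidth demand per VN, bit/s (from the abstract)

min_loss = 3.0e-6 * C_i        # minimum bandwidth loss as a multiple of C_i
max_required = 5.7 * C_i       # maximum required inter-server bandwidth

# Relative deviation from the absolute figures reported in the abstract.
loss_deviation = abs(min_loss - 5.3e2) / 5.3e2
required_deviation = abs(max_required - 1.0e9) / 1.0e9

print(f"min loss     ~ {min_loss:.3g} bit/s (deviation {loss_deviation:.1%})")
print(f"max required ~ {max_required:.3g} bit/s (deviation {required_deviation:.1%})")
```

    Both products agree with the reported values to within a few percent, which is consistent with the coefficients having been rounded to two significant digits.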

  • Efficient Randomized Byzantine Fault-Tolerant Replication Based on Special Valued Coin Tossing

    Junya NAKAMURA  Tadashi ARARAGI  Shigeru MASUYAMA  Toshimitsu MASUZAWA  

     
    PAPER-Dependable Computing

      Vol:
    E97-D No:2
      Page(s):
    231-244

    We propose a fast and resource-efficient agreement protocol on a request set, which is used to realize Byzantine fault tolerant server replication. Although most existing randomized protocols for Byzantine agreement exploit a modular approach, that is, a combination of agreement on a bit value and a reduction of request set values to the bit values, our protocol directly solves the multi-valued agreement problem for request sets. We introduce a novel coin tossing scheme to select a candidate of an agreed request set randomly. This coin toss allows our protocol to reduce resource consumption and to attain faster response time than the existing representative protocols.

  • A Method of Parallelizing Consensuses for Accelerating Byzantine Fault Tolerance

    Junya NAKAMURA  Tadashi ARARAGI  Toshimitsu MASUZAWA  Shigeru MASUYAMA  

     
    PAPER-Dependable Computing

      Vol:
    E97-D No:1
      Page(s):
    53-64

    We propose a new method that accelerates asynchronous Byzantine Fault Tolerant (BFT) protocols designed on the principle of state machine replication. State machine replication protocols ensure consistency among replicas by applying operations in the same order to all of them. A naive way to determine the application order of the operations is to repeatedly execute the BFT consensus to determine the next executed operation, but this may introduce inefficiency caused by waiting for the completion of the previous execution of the consensus protocol. To reduce this inefficiency, our method allows parallel execution of the consensuses while keeping consistency of the consensus results at the replicas. In this paper, we also prove the correctness of our method and experimentally compare it with the existing method in terms of latency and throughput. The evaluation results show that our method makes a BFT protocol three or four times faster than the existing one when some machines or message transmissions are delayed.
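
    The ordering requirement of state machine replication can be sketched as follows: even if consensus instances for different slots complete out of order, each replica applies the decided operations strictly in slot order, so all replicas reach the same state. This is a simplified illustration and not the authors' protocol; the consensus step itself is omitted, and the `Replica` class, its key-value state, and the slot numbering are assumptions.

```python
class Replica:
    """Applies decided operations in slot order (consensus itself is stubbed out)."""

    def __init__(self):
        self.state = {}      # replicated key-value state
        self.next_slot = 0   # next slot whose operation may be applied
        self.decided = {}    # slot -> operation decided but not yet applied

    def on_decided(self, slot, op):
        """Called when the (omitted) consensus instance for `slot` finishes."""
        self.decided[slot] = op
        # Apply every operation that is now contiguous with the applied prefix.
        while self.next_slot in self.decided:
            key, value = self.decided.pop(self.next_slot)
            self.state[key] = value
            self.next_slot += 1


decisions = {0: ("x", 1), 1: ("x", 2), 2: ("y", 3)}

replica_a, replica_b = Replica(), Replica()
for slot in (0, 1, 2):            # decisions arrive in order
    replica_a.on_decided(slot, decisions[slot])
for slot in (2, 0, 1):            # decisions arrive out of order
    replica_b.on_decided(slot, decisions[slot])

print(replica_a.state, replica_b.state)
```

    Buffering decided-but-unapplied slots is what lets several consensus instances run concurrently without changing the order in which operations take effect.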

  • Cooperative VM Migration: A Symbiotic Virtualization Mechanism by Leveraging the Guest OS Knowledge

    Ryousei TAKANO  Hidemoto NAKADA  Takahiro HIROFUCHI  Yoshio TANAKA  Tomohiro KUDOH  

     
    PAPER

      Vol:
    E96-D No:12
      Page(s):
    2675-2683

    Virtual machine (VM) migration is useful for improving flexibility and maintainability in cloud computing environments. However, VM monitor (VMM)-bypass I/O technologies, including PCI passthrough and SR-IOV, in which the overhead of I/O virtualization can be significantly reduced, make VM migration impossible. This paper proposes a novel and practical mechanism, called Symbiotic Virtualization (SymVirt), for enabling migration and checkpoint/restart on a virtualized cluster with VMM-bypass I/O devices, without the virtualization overhead during normal operations. SymVirt allows a VMM to cooperate with a message passing layer on the guest OS; it then realizes VM-level migration and checkpoint/restart by using a combination of user-level dynamic device configuration and coordination of distributed VMMs. We have implemented the proposed mechanism on top of QEMU/KVM and the Open MPI system. All PCI devices, including Infiniband, Ethernet, and Myrinet, are supported without implementing specific para-virtualized drivers, and it is not necessary to modify either the MPI runtime or applications. Using the proposed mechanism, we demonstrate reactive and proactive FT mechanisms on a virtualized Infiniband cluster. We have confirmed the effectiveness using both a memory-intensive micro benchmark and the NAS parallel benchmark.

  • Open-Fault Resilient Multiple-Valued Codes for Reliable Asynchronous Global Communication Links

    Naoya ONIZAWA  Atsushi MATSUMOTO  Takahiro HANYU  

     
    PAPER

      Vol:
    E96-D No:9
      Page(s):
    1952-1961

    This paper introduces open-wire fault-resilient multiple-valued codes for reliable asynchronous point-to-point global communication links. In the proposed encoding, two communication modules assign complementary codewords that change between two valid states without an open-wire fault. Under an open-wire fault, at each module, the codewords do not reach one of the two valid states and remain in “invalid” states. The detection of the invalid states makes it possible to stop sending wrong codewords caused by an open-wire fault. The detectability of the open-wire fault based on the proposed encoding is proven for m-of-n codes. The proposed code used in the multiple-valued asynchronous global communication link is capable of detecting a single open-wire fault with 3.08-times higher coding efficiency compared with a conventional multiple-valued code used in a triple-modular redundancy (TMR) link that detects an open-wire fault under the same dynamic range of logical values.

  • Coverage of Irrelevant Components in Systems with Imperfect Fault Coverage

    Jianwen XIANG  Fumio MACHIDA  Kumiko TADANO  Yoshiharu MAENO  Kazuo YANOO  

     
    LETTER-Reliability, Maintainability and Safety Analysis

      Vol:
    E96-A No:7
      Page(s):
    1649-1652

    Traditional imperfect fault coverage models only consider the coverage (including identification and isolation) of faulty components, and they do not consider the coverage of irrelevant (operational) components. One potential reason for the omission is that in these models the system is generally assumed to be coherent, meaning that each component is initially relevant. In this paper, we first point out that an initially relevant component could become irrelevant afterwards due to the failures of some other components, and thus it is important to consider the handling of irrelevancy even if the system is originally coherent. We propose an irrelevancy coverage model (IRCM) in which the coverage is extended to the irrelevant components in addition to the faulty components. The IRCM can not only significantly enhance system reliability by preventing future system failures resulting from the not-covered failures of the irrelevant components, but may also play an important role in efficient energy use in practice by turning off the irrelevant components in a timely manner.

  • Understanding the Impact of BPRAM on Incremental Checkpoint

    Xu LI  Kai LU  Xiaoping WANG  Bin DAI  Xu ZHOU  

     
    PAPER-Dependable Computing

      Vol:
    E96-D No:3
      Page(s):
    663-672

    Existing large-scale systems suffer from various hardware/software failures, motivating research on fault-tolerance techniques. Checkpoint-restart techniques are widely applied fault-tolerance approaches, especially in scientific computing systems. However, the overhead of checkpointing largely influences the overall system performance. Recently, emerging byte-addressable, persistent memory technologies, such as phase change memory (PCM), have made it possible to implement checkpointing at arbitrary data granularity. However, the impact of data granularity on the checkpointing cost has not been fully addressed. In this paper, we investigate how data granularity influences the performance of a checkpoint system. Further, we design and implement a high-performance checkpoint system named AG-ckpt. AG-ckpt is a hybrid-granularity incremental checkpointing scheme based on: (1) low-cost modified-memory detection and (2) fine-grained memory duplication. Moreover, we also formulate the performance-granularity relationship of checkpointing systems through a mathematical model, and further obtain the optimum solutions. We conduct experiments with several typical benchmarks to verify the performance gain of our design. Compared to conventional incremental checkpointing, our results show that AG-ckpt can reduce the checkpoint data amount by up to 50% and provide a speedup of 1.2x-1.3x in checkpoint efficiency.
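
    The effect of granularity on the amount of checkpoint data can be seen in a small sketch of incremental checkpointing. It is not AG-ckpt: modified blocks are found here by naive comparison with the previous snapshot rather than by the low-cost detection the paper describes, and the function name and block sizes are assumptions chosen for illustration.

```python
def incremental_checkpoint(previous, current, block_size):
    """Return {offset: block} for each block of `current` that differs
    from the same block of `previous` (naive comparison, illustrative only)."""
    if len(previous) != len(current):
        raise ValueError("snapshots must have the same length")
    delta = {}
    for offset in range(0, len(current), block_size):
        block = current[offset:offset + block_size]
        if block != previous[offset:offset + block_size]:
            delta[offset] = block
    return delta


old = bytes(16)                 # 16 zero bytes as the previous snapshot
buf = bytearray(old)
buf[5] = 0xFF                   # a single byte is modified
new = bytes(buf)

coarse = incremental_checkpoint(old, new, block_size=8)
fine = incremental_checkpoint(old, new, block_size=2)
print(sum(len(b) for b in coarse.values()), sum(len(b) for b in fine.values()))
```

    With one modified byte, the coarse setting saves a whole 8-byte block while the fine setting saves only 2 bytes; finer blocks reduce the saved data but require tracking more blocks, which is the kind of trade-off a granularity model has to capture.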

  • Resco: Automatic Collection of Leaked Resources

    Ziying DAI  Xiaoguang MAO  Yan LEI  Xiaomin WAN  Kerong BEN  

     
    PAPER-Software Engineering

      Vol:
    E96-D No:1
      Page(s):
    28-39

    A garbage collector relieves programmers from manual memory management and improves productivity and program reliability. However, there are many other finite system resources that programmers must manage by themselves, such as sockets and database connections. Growing resource leaks can lead to performance degradation and even program crashes. This paper presents an automatic resource collection approach called Resco (RESource COllector) to tolerate non-memory resource leaks. Resco prevents performance degradation and crashes due to resource leaks in two steps. First, it utilizes monitors to count resource consumption and request resource collections independently of memory usage when resource limits are about to be violated. Second, it responds to a resource collection request by safely releasing leaked resources. We implement Resco based on a Java Virtual Machine for Java programs. The performance evaluation against standard benchmarks shows that Resco has a very low overhead, around 1% or 3%. Experiments on resource leak bugs show that Resco successfully prevents most of these programs from crashing with little increase in execution time.
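
    The two steps described above, counting consumption and triggering a collection when a limit is about to be violated, can be mimicked in a few lines. This is a toy model rather than Resco itself: Resco works inside a Java Virtual Machine and must decide safely which resources are leaked, whereas here the `leaked` list and the `collect_leaked` callback simply simulate that decision, and all names are hypothetical.

```python
class ResourceMonitor:
    """Counts resources in use and requests a collection at the limit."""

    def __init__(self, limit, collect):
        self.limit = limit
        self.in_use = 0
        self.collect = collect   # callback returning the number of resources freed

    def acquire(self):
        if self.in_use >= self.limit:
            # Limit about to be violated: request a resource collection.
            self.in_use -= self.collect()
        if self.in_use >= self.limit:
            raise RuntimeError("resource limit exceeded")
        self.in_use += 1


leaked = []   # stands in for resources the program no longer uses


def collect_leaked():
    """Simulated collection: release everything recorded as leaked."""
    released = len(leaked)
    leaked.clear()
    return released


monitor = ResourceMonitor(limit=3, collect=collect_leaked)
for i in range(5):
    monitor.acquire()
    leaked.append(f"socket-{i}")   # never released explicitly: a simulated leak

print(monitor.in_use, leaked)
```

    Without the collection callback, the fourth acquisition would fail; with it, the program keeps running, which mirrors the crash-avoidance behavior the abstract reports.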

  • Unified Constant Geometry Fault Tolerant DCT/IDCT for Image Codec System on a Display Panel

    Jaehee YOU  

     
    PAPER-Digital Signal Processing

      Vol:
    E95-A No:12
      Page(s):
    2396-2406

    System-on-display panel design methodologies are proposed with the purpose of integrating DCT and IDCT on display panels for image codec and peripheral systems so as to reduce the bus data rate, memory size and power consumption. Unified constant geometry algorithms and architectures including recursive additions are proposed for DCT and IDCT butterfly computation, recursive additions and interconnections between stages. These schemes facilitate VLSI implementation and improve fault tolerance, suitable for low-yield SOP processing technologies through duplicate use of a PE as all the butterfly and recursive addition stages are composed and interconnected in a regular fashion. Efficient redundancy replacement methodologies optimizing the computation speed and the amount of hardware in various application areas are also described with testability and reliability issues. Finally, a performance analysis of speed, hardware and interconnection complexity is described with the proposed work's advantages.

  • WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs

    Xinhai XU  Xuejun YANG  Yufei LIN  

     
    PAPER-Computer System

      Vol:
    E95-D No:3
      Page(s):
    786-796

    As supercomputers increase in size, the mean time between failures (MTBF) of a system becomes shorter, and the reliability problem of supercomputers becomes more and more serious. MPI is currently the de facto standard used to build high-performance applications, and research on fault tolerance methods for MPI remains a hot topic. However, due to the characteristics of MPI programs, most current checkpointing methods for MPI programs need to modify the MPI library (or even the operating system), or implement a complicated protocol by logging many messages. In this paper, we carry forward the idea of Application-Level Checkpointing (ALC). Based on the general fact that programmers are familiar with the communication characteristics of applications, we have developed BC-ALC, a new portable blocking coordinated ALC for MPI programs. BC-ALC neither modifies the MPI library (or the operating system) nor logs any message. It implements coordination only by Barrier operations instead of any complicated protocol. Furthermore, in order to reduce the cost of fault tolerance, we reduce the synchronization range of the barrier, and design WBC-ALC, a weak blocking coordinated ALC utilizing group synchronization instead of global synchronization based on the communication relationship between processes. We also propose a fault-tolerance framework developed on top of WBC-ALC and discuss an implementation of it. Experimental results on NPB3.3-MPI benchmarks validate BC-ALC and WBC-ALC, and show that compared with BC-ALC, the average coordination time and the average backup time of a single checkpoint in WBC-ALC are reduced by 44.5% and 5.7% respectively.

  • A Fault-Tolerant Architecture with Error Correcting Code for the Instruction-Level Temporal Redundancy

    Chao YAN  Hongjun DAI  Tianzhou CHEN  

     
    PAPER-Trust

      Vol:
    E95-D No:1
      Page(s):
    38-45

    Soft errors have become an increasingly significant concern in modern microprocessor design. It is reported that instruction-level temporal redundancy in out-of-order cores suffers a performance degradation of up to 45%. In this work, we propose a fault-tolerant architecture with fast error correcting codes (such as the two-dimensional code) based on double execution. Experimental results show that our scheme can gain back IPC loss between 9.1% and 10.2%, with an average around 9.2% compared with the conventional double execution architecture.

  • Maximal Interconnect Resilient Methodology for Fault Tolerance, Yield, and Reliability Improvement in Network on Chip

    Katherine Shu-Min LI  Chih-Yun PAI  Liang-Bi CHEN  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E94-A No:12
      Page(s):
    2649-2658

    This paper presents an interconnect resilient (IR) methodology with maximal interconnect fault tolerance, yield, and reliability for both single and multiple interconnect faults under stuck-at and open fault models. By exploiting multiple routes inherent in an interconnect structure, this method can tolerate faulty connections by efficiently finding alternative paths. The proposed approach is compatible with previous interconnect detection and diagnosis methods under oscillation ring schemes, and together they can be applied to implement a robust interconnect structure that may still provide correct communication even under multiple link faults in Network-on-Chips (NoCs). With such knowledge, designers can significantly improve interconnect reliability by augmenting vulnerable interconnect structures in NoCs. As a result, the experimental results show that alternative paths in NoCs can be found for almost all paths. Hence, the proposed method provides a good way to achieve fault tolerance and reliability/yield improvement.
