The search functionality is under construction.
The search functionality is under construction.

Keyword Search Result

[Keyword] checkpoint(36hit)

1-20hit(36hit)

  • Enhanced Sender-Based Message Logging for Reducing Forced Checkpointing Overhead in Distributed Systems

    Jinho AHN  

     
    LETTER-Dependable Computing

      Pubricized:
    2021/06/08
      Vol:
    E104-D No:9
      Page(s):
    1500-1505

    The previous communication-induced checkpointing may considerably induce worthless forced checkpoints because each process receiving messages cannot obtain sufficient information related to non-causal Z-paths. This paper presents an enhanced sender-based message logging protocol applicable to any communication-induced checkpointing to lead to a high decrease of the forced checkpointing overhead of communication-induced checkpointing in an effective way while permitting no useless checkpoint. The protocol allows each process sending a message to know the exact timestamp of the receiver of the message in its logging procedures without any extra message. Simulation verifies their great efficiency of overhead alleviation regardless of communication patterns.

  • Optimal Rejuvenation Policies for Non-Markovian Availability Models with Aperiodic Checkpointing

    Junjun ZHENG  Hiroyuki OKAMURA  Tadashi DOHI  

     
    PAPER-Dependable Computing

      Pubricized:
    2020/07/16
      Vol:
    E103-D No:10
      Page(s):
    2133-2142

    In this paper, we present non-Markovian availability models for capturing the dynamics of system behavior of an operational software system that undergoes aperiodic time-based software rejuvenation and checkpointing. Two availability models with rejuvenation are considered taking account of the procedure after the completion of rollback recovery operation. We further proceed to investigate whether there exists the optimal rejuvenation schedule that maximizes the steady-state system availability, which is derived by means of the phase expansion technique, since the resulting models are not the trivial stochastic models such as semi-Markov process and Markov regenerative process, so that it is hard to solve them by using the common approaches like Laplace-Stieltjes transform and embedded Markov chain techniques. The numerical experiments are conducted to determine the optimal rejuvenation trigger timing maximizing the steady-state system availability for each availability model, and to compare both two models.

  • Minimization of Vote Operations for Soft Error Detection in DMR Design with Error Correction by Operation Re-Execution

    Kazuhito ITO  Yuto ISHIHARA  Shinichi NISHIZAWA  

     
    PAPER

      Vol:
    E101-A No:12
      Page(s):
    2271-2279

    As LSI chips integrate more transistors and the operating power supply voltage decreases, LSI chips are becoming more vulnerable to the soft error caused by neutrons induced from cosmic rays. The soft error is detected by comparing the duplicated operation results in double modular redundancy (DMR) and the error is corrected by re-executing necessary operations. In this paper, based on the error recovery scheme of re-executing necessary operations, the minimization of the vote operations for error checking with respect to given resource constraints is considered. An ILP model for the optimal solution to the problem is presented and a heuristic algorithm is proposed to minimize the vote operations.

  • A Tree-Based Checkpointing Architecture for the Dependability of FPGA Computing

    Hoang-Gia VU  Shinya TAKAMAEDA-YAMAZAKI  Takashi NAKADA  Yasuhiko NAKASHIMA  

     
    PAPER-Device and Architecture

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    288-302

    Modern FPGAs have been integrated in computing systems as accelerators for long running applications. This integration puts more pressure on the fault tolerance of computing systems, and the requirement for dependability becomes essential. As in the case of CPU-based system, checkpoint/restart techniques are also expected to improve the dependability of FPGA-based computing. Three issues arise in this situation: how to checkpoint and restart FPGAs, how well this checkpoint/restart model works with the checkpoint/restart model of the whole computing system, and how to build the model by a software tool. In this paper, we first present a new checkpoint/restart architecture along with a checkpointing mechanism on FPGAs. We then propose a method to capture consistent snapshots of FPGA and the rest of the computing system. Third, we provide “fine-grained” management for checkpointing to reduce performance degradation. For the host CPU, we also provide a stack which includes API functions to manage checkpoint/restart procedures on FPGAs. Fourth, we present a Python-based tool to insert checkpointing infrastructure. Experimental results show that the checkpointing architecture causes less than 10% maximum clock frequency degradation, low checkpointing latencies, small memory footprints, and small increases in power consumption, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).

  • Energy-Performance Modeling of Speculative Checkpointing for Exascale Systems

    Muhammad ALFIAN AMRIZAL  Atsuya UNO  Yukinori SATO  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI  

     
    PAPER-High performance computing

      Pubricized:
    2017/07/14
      Vol:
    E100-D No:12
      Page(s):
    2749-2760

    Coordinated checkpointing is a widely-used checkpoint/restart protocol for fault-tolerance in large-scale HPC systems. However, this protocol will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that allows for temporal distribution of checkpointings to avoid I/O concentration. We propose execution time and energy models for speculative checkpointing, and investigate energy-performance characteristics when speculative checkpointing is adopted in exascale systems. Using these models, we study the benefit of speculative checkpointing over coordinated checkpointing under various realistic scenarios for exascale HPC systems. We show that, compared to coordinated checkpointing, speculative checkpointing can achieve up to a 11% energy reduction at the cost of a relatively-small increase in the execution time. In addition, a significant energy-performance trade-off is expected when the system scale exceeds 1.2 million nodes.

  • A Low-Latency DMR Architecture with Fast Checkpoint Recovery Scheme

    Go MATSUKAWA  Yohei NAKATA  Yasuo SUGURE  Shigeru OHO  Yuta KIMI  Masafumi SHIMOZAWA  Shuhei YOSHIDA  Hiroshi KAWAGUCHI  Masahiko YOSHIMOTO  

     
    PAPER

      Vol:
    E98-C No:4
      Page(s):
    333-339

    This paper presents a novel architecture for a fault-tolerant and dual modular redundancy (DMR) system using a checkpoint recovery approach. The architecture features exploitation of SRAM with simultaneous copy and instantaneous compare function. It can perform low-latency data copying between dual cores. Therefore, it can carry out fast backup and rollback. Furthermore, it can reduce the power consumption during data comparison process compared to the cyclic redundancy check (CRC). Evaluation results show that, compared with the conventional checkpoint/restart DMR, the proposed architecture reduces the cycle overhead by 97.8% and achieves a 3.28% low-latency execution cycle even if a one-time fault occurs when executing the task. The proposed architecture provides high reliability for systems with a real-time requirement.

  • A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems

    Yonghwan KIM  Tadashi ARARAGI  Junya NAKAMURA  Toshimitsu MASUZAWA  

     
    PAPER-Dependable Computing

      Vol:
    E97-D No:1
      Page(s):
    65-76

    Checkpoint-rollback recovery, which is a universal method for restoring distributed systems after faults, requires a sophisticated snapshot algorithm especially if the systems are large-scale, since repeatedly taking global snapshots of the whole system requires unacceptable communication cost. As a sophisticated snapshot algorithm, a partial snapshot algorithm has been introduced that takes a snapshot of a subsystem consisting only of the nodes that are communication-related to the initiator instead of a global snapshot of the whole system. In this paper, we modify the previous partial snapshot algorithm to create a new one that can take a partial snapshot more efficiently, especially when multiple nodes concurrently initiate the algorithm. Experiments show that the proposed algorithm greatly reduces the amount of communication needed for taking partial snapshots.

  • A Scalable Communication-Induced Checkpointing Algorithm for Distributed Systems

    Alberto CALIXTO SIMON  Saul E. POMARES HERNANDEZ  Jose Roberto PEREZ CRUZ  Pilar GOMEZ-GIL  Khalil DRIRA  

     
    PAPER-Fundamentals of Information Systems

      Vol:
    E96-D No:4
      Page(s):
    886-896

    Communication-induced checkpointing (CIC) has two main advantages: first, it allows processes in a distributed computation to take asynchronous checkpoints, and secondly, it avoids the domino effect. To achieve these, CIC algorithms piggyback information on the application messages and take forced local checkpoints when they recognize potentially dangerous patterns. The main disadvantages of CIC algorithms are the amount of overhead per message and the induced storage overhead. In this paper we present a communication-induced checkpointing algorithm called Scalable Fully-Informed (S-FI) that attacks the problem of message overhead. For this, our algorithm modifies the Fully-Informed algorithm by integrating it with the immediate dependency principle. The S-FI algorithm was simulated and the result shows that the algorithm is scalable since the message overhead presents an under-linear growth as the number of processes and/or the message density increase.

  • Understanding the Impact of BPRAM on Incremental Checkpoint

    Xu LI  Kai LU  Xiaoping WANG  Bin DAI  Xu ZHOU  

     
    PAPER-Dependable Computing

      Vol:
    E96-D No:3
      Page(s):
    663-672

    Existing large-scale systems suffer from various hardware/software failures, motivating the research of fault-tolerance techniques. Checkpoint-restart techniques are widely applied fault-tolerance approaches, especially in scientific computing systems. However, the overhead of checkpoint largely influences the overall system performance. Recently, the emerging byte-addressable, persistent memory technologies, such as phase change memory (PCM), make it possible to implement checkpointing in arbitrary data granularity. However, the impact of data granularity on the checkpointing cost has not been fully addressed. In this paper, we investigate how data granularity influences the performance of a checkpoint system. Further, we design and implement a high-performance checkpoint system named AG-ckpt. AG-ckpt is a hybrid-granularity incremental checkpointing scheme through: (1) low-cost modified-memory detection and (2) fine-grained memory duplication. Moreover, we also formulize the performance-granularity relationship of checkpointing systems through a mathematical model, and further obtain the optimum solutions. We conduct the experiments through several typical benchmarks to verify the performance gain of our design. Compared to conventional incremental checkpoint, our results show that AG-ckpt can reduce checkpoint data amount up to 50% and provide a speedup of 1.2x-1.3x on checkpoint efficiency.

  • Checkpoint Time Arrangement Rotation in Hybrid State Saving with a Limited Number of Periodical Checkpoints

    Ryo SUZUKI  Mamoru OHARA  Masayuki ARAI  Satoshi FUKUMOTO  Kazuhiko IWASAKI  

     
    LETTER-Dependable Computing

      Vol:
    E96-D No:1
      Page(s):
    141-145

    This paper discusses hybrid state saving for applications in which processes should create checkpoints at constant intervals and can hold a finite number of checkpoints. We propose a reclamation technique for checkpoint space, that provides effective checkpoint time arrangements for a rollback distance distribution. Numerical examples show that when we cannot use the optimal checkpoint interval due to the system requirements, the proposed technique can achieve lower expected overhead compared to the conventional technique without considering the form of the rollback distance distribution.

  • WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs

    Xinhai XU  Xuejun YANG  Yufei LIN  

     
    PAPER-Computer System

      Vol:
    E95-D No:3
      Page(s):
    786-796

    As supercomputers increase in size, the mean time between failures (MTBF) of a system becomes shorter, and the reliability problem of supercomputers becomes more and more serious. MPI is currently the de facto standard used to build high-performance applications, and researches on the fault tolerance methods of MPI are always hot topics. However, due to the characteristics of MPI programs, most current checkpointing methods for MPI programs need to modify the MPI library (even operating system), or implement a complicated protocol by logging lots of messages. In this paper, we carry forward the idea of Application-Level Checkpointing (ALC). Based on the general fact that programmers are familiar with the communication characteristics of applications, we have developed BC-ALC, a new portable blocking coordinated ALC for MPI programs. BC-ALC neither modifies the MPI library (even operating system) nor logs any message. It implements coordination only by the Barrier operations instead of any complicated protocol. Furthermore, in order to reduce the cost of fault-tolerance, we reduce the synchronization range of the barrier, and design WBC-ALC, a weak blocking coordinated ALC utilizing group synchronization instead of global synchronization based on the communication relationship between processes. We also propose a fault-tolerance framework developed on top of WBC-ALC and discuss an implementation of it. Experimental results on NPB3.3-MPI benchmarks validate BC-ALC and WBC-ALC, and show that compared with BC-ALC, the average coordination time and the average backup time of a single checkpoint in WBC-ALC are reduced by 44.5% and 5.7% respectively.

  • 7T SRAM Enabling Low-Energy Instantaneous Block Copy and Its Application to Transactional Memory

    Shunsuke OKUMURA  Yuki KAGIYAMA  Yohei NAKATA  Shusuke YOSHIMOTO  Hiroshi KAWAGUCHI  Masahiko YOSHIMOTO  

     
    PAPER-Circuit Design

      Vol:
    E94-A No:12
      Page(s):
    2693-2700

    This paper proposes 7T SRAM which realizes block-level simultaneous copying feature. The proposed SRAM can be used for data transfer between local memories such as checkpoint data storage and transactional memory. The 1-Mb SRAM is comprised of 32-kb blocks, in which 16-kb data can be copied in 33.3 ns at 1.2 V. The proposed scheme reduces energy consumption in copying by 92.7% compared to the conventional read-modify-write manner. By applying the proposed scheme to transactional memory, the number of write back cycles is possibly reduced by 98.7% compared with the conventional memory system.

  • Lightweight Consistent Recovery Algorithm for Sender-Based Message Logging in Distributed Systems

    Jinho AHN  

     
    LETTER-Dependable Computing

      Vol:
    E94-D No:8
      Page(s):
    1712-1715

    Sender-based message logging (SBML) with checkpointing has its well-known beneficial feature, lowering highly failure-free overhead of synchronous logging with volatile logging at sender's memory. This feature encourages it to be applied into many distributed systems as a low-cost transparent rollback recovery technique. However, the original SBML recovery algorithm may no longer be progressing in some transient communication error cases. This paper proposes a consistent recovery algorithm to solve this problem by piggybacking small log information for unstable messages received on each acknowledgement message for returning the receive sequence number assigned to a message by its receiver. Our algorithm also enables all messages scheduled to be sent, but delayed because of some preceding unstable messages to be actually transmitted out much earlier than the existing ones.

  • Random Checkpoint Models with N Tandem Tasks

    Toshio NAKAGAWA  Kenichiro NARUSE  Sayori MAEJI  

     
    PAPER

      Vol:
    E92-A No:7
      Page(s):
    1572-1577

    We have a job with N tandem tasks each of which is executed successively until it is completed. A double modular system of error detection for the processing of each task is adopted. Either type of checkpoints such as compare-checkpoint or compare-and-store-checkpoint can be placed at the end of tasks. Three schemes for the above process of a job are considered and the mean execution time of each scheme is obtained. Three schemes are compared and the best scheme is determined numerically. As an example, a job with 4 tasks is given and 6 types of schemes are compared numerically. Finally, we consider a majority decision system as an error masking system and compute the mean execution time for three schemes.

  • A Checkpointing Method with Small Checkpoint Latency

    Masato KITAKAMI  Bochuan CAI  Hideo ITO  

     
    LETTER-Dependable Computing

      Vol:
    E91-D No:3
      Page(s):
    857-861

    The cost of checkpointing consists of checkpoint overhead and checkpoint latency. The former is the time to stop the process for checkpointing. The latter is the time to complete the checkpointing including background checkpointing which stores memory pages. The large checkpoint latency increases the possibility that the error occurs in background checkpointing, which leads to long rollback distance. The method for small checkpoint latency has not been proposed yet. This paper proposes a checkpointing method which achieves small checkpoint latency. The proposed method divides a checkpoint interval into several subcheckpoint intervals. By using the history of memory page modification in subcheckpoint intervals, the proposed method saves some pages which are not expected to be modified in the rest of checkpoint interval in advance. Computer simulation says that the proposed method can reduce the checkpoint latency by 25% comparing to the existing methods.

  • Fault-Tolerance for the Mobile Ad-Hoc Environment

    Taesoon PARK  Kwangho KIM  

     
    LETTER-Reliability, Maintainability and Safety Analysis

      Vol:
    E91-A No:1
      Page(s):
    413-416

    Fault-tolerance is an important design issue in building a reliable mobile computing system. This paper considers checkpointing recovery services for a mobile computing system based on the ad-hoc network environment. Since potential problems of this new environment are insufficient power and limited storage capacity, the proposed scheme tries to reduce disk access frequency for saving recovery information, and also the amount of information saved for recovery. A brief simulation study has been performed and the results show that the proposed scheme takes advantage of the existing checkpointing recovery schemes.

  • A Low-Cost Recovery Mechanism for Processors with Large Instruction Windows

    In Pyo HONG  Byung In MOON  Yong Surk LEE  

     
    LETTER-VLSI Systems

      Vol:
    E89-D No:6
      Page(s):
    1967-1970

    The latest processors employ a large instruction window and longer pipelines to achieve higher performance. Although current branch predictors show high accuracy, the misprediction penalty is getting larger in proportion to the number of pipeline stages and pipeline width. This negative effect also happens in case of exceptions or interrupts. Therefore, it is important to recover processor state quickly and restart processing immediately. In this letter, we propose a low-cost recovery mechanism for processors with large instruction windows.

  • Controller/Precompiler for Portable Checkpointing

    Gabriel RODRIGUEZ  María J. MARTIN  Patricia GONZALEZ  Juan TOURIÑO  

     
    PAPER-Parallel/Distributed Programming Models, Paradigms and Tools

      Vol:
    E89-D No:2
      Page(s):
    408-417

    This paper presents CPPC (Controller/Precompiler for Portable Checkpointing), a checkpointing tool designed for heterogeneous clusters and Grid infrastructures through the use of portable protocols, portable checkpoint files and portable code. It works at variable level being user-directed, thus generating small checkpoint files. It allows parallel processes to checkpoint independently, without runtime coordination or message-logging. Consistency is achieved at restart time by negotiating the restart point. A directive-based checkpointing precompiler has also been implemented to ease up user's effort. CPPC was designed to work with parallel MPI programs, though it can be used with sequential ones, and easily extended to parallel programs written using different message-passing libraries, due to its highly modular design. Experimental results are shown using CPPC with different test applications.

  • MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes

    Namyoon WOO  Hyungsoo JUNG  Heon Young YEOM  Taesoon PARK  Hyungwoo PARK  

     
    PAPER-Distributed, Grid and P2P Computing

      Vol:
    E87-D No:7
      Page(s):
    1820-1828

    Fault-tolerance is an essential feature of the distributed systems where the possibility of a failure increases with the growth of the system. In spite of extensive researches over two decades, fault-tolerance systems have not succeeded in practical use. It is due to the high overhead and the unhandiness of the previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system for grid-enabled MPICH. Our objectives are to fill the gap between the theory and the practice of fault-tolerance systems, and to provide a checkpointing-recovery system for grids. To build a fault-tolerant MPICH version, we have designed task migration, dynamic process management, and atomic message transfer. MPICH-GF requires no modification of application source codes, and it affects the MPICH communication characteristics as less as possible. The features of MPICH-GF are that it supports the direct message transfer mode and that all of the implementation has been done at the lower layer, that is, the abstract device level. We have evaluated MPICH-GF using NPB applications on Globus middleware.

  • An Efficient Centralized Algorithm Ensuring Consistent Recovery in Causal Message Logging with Independent Checkpointing

    JinHo AHN  SungGi MIN  

     
    LETTER-Dependable Computing

      Vol:
    E87-D No:4
      Page(s):
    1039-1043

    Because it has desirable features such as no cascading rollback, fast output commit and asynchronous logging, causal message logging needs a consistent recovery algorithm to tolerate concurrent failures. For this purpose, Elnozahy proposed a centralized recovery algorithm to have two practical benefits, i.e. reducing the number of stable storage accesses and imposing no restriction on the execution of live processes during recovery. However, the algorithm with independent checkpointing may force the system to be in an inconsistent state when processes fail concurrently. In this paper, we identify these inconsistent cases and then present a recovery algorithm to have the two benefits and ensure the system consistency when integrated with any kind of checkpointing protocol. Also, our algorithm requires no additional message compared with Elnozahy's algorithm.

1-20hit(36hit)