The search functionality is under construction.

Keyword Search Result

[Keyword] checkpoint(36hit)

21-36hit(36hit)

  • Determining Consistent Global Checkpoints of a Distributed Computation

    Dakshnamoorthy MANIVANNAN  

     
    PAPER-Computer Systems

      Vol:
    E87-D No:1
      Page(s):
    164-174

    Determining consistent global checkpoints of a distributed computation has applications in areas such as rollback recovery, distributed debugging, and output commit. Netzer and Xu introduced the notion of zigzag paths and presented necessary and sufficient conditions for a set of checkpoints to be part of a consistent global checkpoint. This result also reveals that determining the existence of zigzag paths between checkpoints is crucial for determining consistent global checkpoints. Recent research also reveals that determining zigzag paths on-line is not possible. In this paper, we present an off-line method for determining the existence of zigzag paths between checkpoints.
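
    As an illustration of what an off-line zigzag check involves, the sketch below searches a recorded message log for a zigzag path in the Netzer and Xu sense. It is not the method of the paper: the `Msg` record, the interval numbering (interval x of a process lies between its checkpoints x and x+1), and both function names are assumptions made for this example.

```python
from collections import deque
from typing import NamedTuple


class Msg(NamedTuple):
    sender: int    # sending process id
    send_iv: int   # checkpoint interval of the sender in which it was sent
    receiver: int  # receiving process id
    recv_iv: int   # checkpoint interval of the receiver in which it arrived


def zigzag_path_exists(msgs, p, i, q, j):
    """True if a zigzag path runs from checkpoint C(p,i) to checkpoint C(q,j)."""
    # The first message must be sent by p after C(p,i).
    start = [k for k, m in enumerate(msgs) if m.sender == p and m.send_iv >= i]
    seen = set(start)
    queue = deque(start)
    while queue:
        m = msgs[queue.popleft()]
        # The last message must be received by q before C(q,j).
        if m.receiver == q and m.recv_iv < j:
            return True
        # The next message is sent by the receiver in the same or a later
        # interval; it may be sent before the current one is received.
        for k, n in enumerate(msgs):
            if k not in seen and n.sender == m.receiver and n.send_iv >= m.recv_iv:
                seen.add(k)
                queue.append(k)
    return False


def can_be_extended(msgs, checkpoints):
    """Netzer-Xu condition: no zigzag path between any two checkpoints of the
    set, including from a checkpoint to itself."""
    return not any(
        zigzag_path_exists(msgs, p, i, q, j)
        for (p, i) in checkpoints
        for (q, j) in checkpoints
    )


# Hypothetical log: P0 sends m1 after C(0,1); P1 receives it before C(1,1).
# P1 sends m2 in that same interval; P2 receives it before C(2,1).
log = [Msg(0, 1, 1, 0), Msg(1, 0, 2, 0)]
print(zigzag_path_exists(log, 0, 1, 2, 1))
print(can_be_extended(log, [(0, 1), (2, 1)]))
```

    The search is a breadth-first traversal over messages rather than checkpoints, so it needs the complete log, which is consistent with the off-line setting described in the abstract.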

  • Evaluation of Checkpointing Mechanism on SCore Cluster System

    Masaaki KONDO  Takuro HAYASHIDA  Masashi IMAI  Hiroshi NAKAMURA  Takashi NANYA  Atsushi HORI  

     
    PAPER-Dependable Software

      Vol:
    E86-D No:12
      Page(s):
    2553-2562

    Cluster systems are widely used because of their good performance/cost ratio. However, their reliability in practical environments has not been well discussed so far. As the number of commodity components in a cluster system increases, it is indispensable to support reliability through system software. SCore cluster system software is a parallel programming environment for High Performance Computing (HPC). SCore provides checkpointing and rollback-recovery mechanisms for high availability. In this paper, we analyze and evaluate the checkpointing and rollback-recovery mechanisms of SCore quantitatively. The experimental results reveal that the time required for checkpointing scales very well with respect to the number of computing nodes. However, the required time is quite long due to the low effective network bandwidth. Based on the results, we modify SCore and successfully make checkpointing and recovery 1.8 to 2.8 times and 3.7 to 5.0 times faster, respectively. This is very helpful for cluster systems to achieve high performance and high availability.

  • Fault-Tolerant Execution of Collaborating Mobile Agents

    Taesoon PARK  

     
    LETTER-Reliability, Maintainability and Safety Analysis

      Vol:
    E86-A No:11
      Page(s):
    2897-2900

    Fault-tolerant execution of a mobile agent is an important design issue in building a reliable mobile agent system. Several fault-tolerant schemes for single-agent systems have been proposed; however, there has been little research on multi-agent systems. For cooperating mobile agents, fault-tolerant schemes should consider inter-agent dependency as well as mobility, and should try to localize the effect of a failure. In this paper, we investigate properties of inter-agent dependency and agent mobility, and then characterize the rollback propagation caused by the dependency and the mobility. We then suggest some schemes to localize rollback propagation.

  • Cost Analysis of Optimistic Recovery Model for Forked Checkpointing

    Jiman HONG  Sangsu KIM  Yookun CHO  

     
    PAPER-Networking and Architectures

      Vol:
    E86-D No:9
      Page(s):
    1534-1541

    The forked checkpointing scheme has been proposed to achieve low checkpoint overhead. When a process wants to take a checkpoint in the forked checkpointing scheme, it creates a child process and continues its normal computation. Two recovery models can be used for forked checkpointing when the parent process fails before the child process establishes the checkpoint. One is the pessimistic recovery model, where the recovery process rolls back to the previous checkpoint state. The other is the optimistic recovery model, where the recovery process waits for the checkpoint to be established by the child process. In this paper, we present the recovery models for forked checkpointing by deriving the expected execution time of a process with and without checkpointing, and also show that the expected recovery time of the optimistic recovery model is smaller than that of the pessimistic recovery model.
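
    The basic mechanism (create a child, let it save the state, and keep computing in the parent) can be sketched on a Unix-like system as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of `pickle`, and the temporary file layout are assumptions, and `os.fork` is not available on Windows.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid."""
    pid = os.fork()
    if pid == 0:
        # Child: it sees the parent's memory as it was at the fork, so later
        # updates made by the parent are not visible here.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, path)  # rename so the checkpoint appears whole
            status = 0
        finally:
            os._exit(status)  # never fall through into the parent's code
    return pid


def demo():
    state = {"step": 0, "total": 0}
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.pkl")
        pid = forked_checkpoint(state, path)
        # Parent: continue the normal computation without waiting.
        for i in range(1, 6):
            state["step"] = i
            state["total"] += i
        os.waitpid(pid, 0)  # wait until the child has established the checkpoint
        with open(path, "rb") as f:
            saved = pickle.load(f)
    return saved, state


saved, current = demo()
print("checkpoint:", saved)
print("parent state:", current)
```

    The saved checkpoint reflects the state at the moment of the fork, while the parent's state has already advanced. The gap between the fork and the completed write is the window in which the two recovery models in the abstract differ.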

  • Two-Tier Checkpointing Algorithm Using MSS in Wireless Networks

    Kyue-Sup BYUN  Sung-Hwa LIM  Jai-Hoon KIM  

     
    PAPER-Network Management/Operation

      Vol:
    E86-B No:7
      Page(s):
    2136-2142

    This paper presents a two-tier coordinated checkpointing algorithm for mobile computing that reduces the number of messages by operating on two levels. Because mobile devices are highly mobile and lack resources (e.g., storage, bandwidth, and battery power), traditional distributed algorithms such as coordinated checkpointing algorithms cannot be applied properly in a mobile environment. In our proposed two-tier coordinated checkpointing algorithm, the messages to be transferred are requested by the mobile hosts and are handled by the appropriate MSSs (Mobile Support Stations). Broadcast messages are handled by the MSSs instead of being relayed directly to all the mobile hosts as in previous algorithms. This can reduce the communication cost and maintain overall system consistency. In a wireless cellular network, mobile computing based on the two-tier coordinated checkpointing algorithm reduces the number of synchronization messages. We compare performance by parametric analysis to show that the two-tier coordinated checkpointing algorithm can reduce communication cost compared to previous algorithms in which the messages are sent directly to the mobile hosts.

  • Probabilistic Checkpointing

    Hyochang NAM  Jong KIM  Sung Je HONG  Sunggu LEE  

     
    PAPER-Fault Tolerance

      Vol:
    E85-D No:7
      Page(s):
    1093-1104

    For checkpointing to be practical, it has to introduce low overhead for the targeted application. As a means of reducing the overhead of checkpointing, this paper proposes a probabilistic checkpointing method, which uses block encoding to detect the memory area modified between two consecutive checkpoints. Since the proposed technique uses block encoding to detect the modified area, the possibility of aliasing exists in encoded words. However, this paper shows that the aliasing probability is near zero when an 8-byte encoded word is used. The performance of the proposed technique is analyzed and measured experimentally. An analytic model which predicts the checkpointing overhead is first constructed. By using this model, the block size that produces the best performance for a given target program is estimated. In most cases, medium block sizes, i.e., 128 or 256 bytes, show the best performance. The proposed technique has also been implemented on Unix-based systems, and its performance has been measured in real environments. According to the experimental results, the proposed technique reduces the overhead by 11.7% in the best case and increases the overhead by 0.5% in the worst case in comparison with page-based incremental checkpointing.
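
    The idea of detecting modified blocks by comparing short encoded words can be sketched as follows. The abstract does not specify the encoding function, so this example substitutes an 8-byte BLAKE2b digest as a stand-in. The function names and the 128-byte default block size (one of the sizes the abstract reports as performing well) are choices made for the illustration.

```python
import hashlib


def encode_blocks(memory: bytes, block_size: int = 128) -> list:
    """Return one 8-byte encoded word per fixed-size block of `memory`."""
    return [
        hashlib.blake2b(memory[i:i + block_size], digest_size=8).digest()
        for i in range(0, len(memory), block_size)
    ]


def modified_blocks(prev_codes, memory: bytes, block_size: int = 128):
    """Compare new encoded words with those of the previous checkpoint.

    Returns (indices of blocks to save, new encoded words). A changed block
    whose word happens to collide with the old one would be missed; that is
    the aliasing case that makes the scheme probabilistic.
    """
    codes = encode_blocks(memory, block_size)
    changed = [
        k for k, code in enumerate(codes)
        if k >= len(prev_codes) or code != prev_codes[k]
    ]
    return changed, codes


memory = bytearray(1024)                  # hypothetical process memory image
codes = encode_blocks(bytes(memory))      # words kept from the first checkpoint
memory[300] = 1                           # write into block 2
memory[900] = 7                           # write into block 7
changed, codes = modified_blocks(codes, bytes(memory))
print(changed)
```

    Only the blocks listed in `changed` need to be written at the next checkpoint, which is finer-grained than saving every page that contains a modification.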

  • PQPCkpt: An Efficient Three Level Synchronous Checkpointing Scheme in Mobile Computing Systems

    Cheng-Min LIN  Chyi-Ren DOW  

     
    PAPER-Fault Tolerance

      Vol:
    E84-D No:11
      Page(s):
    1556-1567

    Distributed domino effect-free checkpointing techniques can be divided into two categories: coordinated and communication-induced checkpointing. The former is inappropriate for mobile computing systems because it either forces every mobile host to take a new checkpoint or blocks the underlying computation during the checkpointing process. The latter makes every mobile host take the checkpoint independently. However, each mobile host may need to store multiple local checkpoints in stable storage. This investigation presents a novel three-level synchronous checkpointing algorithm that combines the advantages of the above two methods for mobile computing systems. The algorithm utilizes pre-synchronization, quasi-synchronization, and post-synchronization techniques and has the following merits: (1) Consistent global checkpoints can be ensured. (2) No mobile host is blocked during checkpointing. (3) Only twice the checkpoint size is required. (4) Power consumption is low. (5) The disconnection problem of mobile hosts can be resolved. (6) Very few mobile hosts in doze mode are disturbed. (7) It is simple and easy to implement. Numerical results for the proposed algorithm are also provided in this work for comparison. The comparison reveals that our algorithm outperforms other algorithms in terms of checkpoint overhead, maintained checkpoints, power consumption, and disturbed mobile hosts.

  • Crash Recovery for Distributed Mobile Computing Systems

    Tong-Ying Tony JUANG  

     
    PAPER-Mobile Information Network and Personal Communications

      Vol:
    E84-A No:2
      Page(s):
    668-674

    One major recent breakthrough in the communications field is the extension of networking from wired to wireless networks. This has made it possible to create a mobile distributed computing environment and has brought several new challenges in distributed protocol design. Wireless networks differ from wired networks in some fundamental ways that need special attention, such as lower communication bandwidth, limited electrical power due to battery capacity, and mobility of processes. These new issues make traditional recovery algorithms unsuitable. In this paper, we propose an efficient algorithm with O(n_r) message complexity, where n_r is the total number of mobile hosts (MHs) related to the failed MH. In addition, these MHs only need to roll back once and can immediately resume their operation without waiting for any coordination message from other MHs. During normal operation, an application message needs O(1) additional information when it is transmitted between MHs and mobile support stations (MSSs). Each MSS must keep an n_total_h × n_cell_h dependency matrix, where n_total_h is the total number of MHs in the system and n_cell_h is the total number of MHs in its cell. Finally, the related issue of resending lost messages is also considered.

  • Efficient Techniques for Adaptive Independent Checkpointing in Distributed Systems

    Cheng-Min LIN  Chyi-Ren DOW  

     
    PAPER-Fault Tolerance

      Vol:
    E83-D No:8
      Page(s):
    1642-1653

    This work presents two novel algorithms to prevent rollback propagation for independent checkpointing: an efficient adaptive independent checkpointing algorithm and an optimized adaptive independent checkpointing algorithm. The last opportunity strategy, which yields better performance than the conservation strategy, is also employed to prevent useless checkpoints for both causal rewinding paths and non-causal rewinding paths. The two methods proposed herein are domino effect-free and require only a limited amount of control information. They also take fewer unnecessary adaptive checkpoints than other algorithms. Furthermore, experimental results indicate that the checkpoint overhead of our techniques is lower than that of the coordinated checkpointing and domino effect-free algorithms for service-providing applications.

  • A Simulation Study to Analyze Unreliable File Systems with Checkpointing and Rollback Recovery

    Tadashi DOHI  Kouji NOMURA  Naoto KAIO  Shunji OSAKI  

     
    PAPER

      Vol:
    E83-A No:5
      Page(s):
    804-811

    This paper considers two simulation models for simple unreliable file systems with checkpointing and rollback recovery. In Model 1, the checkpoint is generated at a pre-specified time and the information in main memory since the last checkpoint is backed up to a secondary medium. On the other hand, in Model 2, checkpointing is executed when the number of transactions completed reaches a pre-determined level. However, it is difficult to treat such models analytically without employing an approximation method if queueing effects related to the arrival and processing of transactions cannot be ignored. We apply the generalized stochastic Petri net (GSPN) to represent the stochastic behaviour of systems under the two checkpointing schemes. Through GSPN simulation, we quantitatively evaluate the maintainability of the checkpoint models under consideration and examine how the optimal checkpoint policies and their associated system availabilities depend on the model parameters.
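
    The difference between the two triggering rules can be shown with a small deterministic sketch. It is not the GSPN simulation of the paper: failures and queueing are ignored, the function names and sample completion times are invented for illustration, and Model 1 is read here as a fixed checkpoint period.

```python
def time_triggered(completion_times, period):
    """Model 1 style: checkpoint at fixed times period, 2*period, ..."""
    horizon = max(completion_times, default=0.0)
    checkpoints = []
    k = 1
    while k * period <= horizon:
        checkpoints.append(k * period)
        k += 1
    return checkpoints


def count_triggered(completion_times, batch):
    """Model 2 style: checkpoint when every `batch`-th transaction completes."""
    ordered = sorted(completion_times)
    return [t for n, t in enumerate(ordered, start=1) if n % batch == 0]


# Hypothetical transaction completion times (arbitrary time units).
completions = [0.5, 1.2, 1.9, 4.0, 4.1, 7.5]
print(time_triggered(completions, period=2.0))
print(count_triggered(completions, batch=2))
```

    Under the time rule the work at risk between checkpoints varies with the load, whereas under the count rule it is bounded by the batch size but the checkpoint instants become random.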

  • Computational Aspects of Optimal Checkpoint Strategy in Fault-Tolerant Database Management

    Tadashi DOHI  Takashi AOKI  Naoto KAIO  Shunji OSAKI  

     
    PAPER-Systems and Control

      Vol:
    E80-A No:10
      Page(s):
    2006-2015

    This paper considers a probabilistic model for a database recovery action with checkpoint generations when system failures occur according to a renewal process whose renewal density depends on the cumulative operation period since the last checkpoint. Necessary and sufficient conditions for the existence of the optimal checkpoint interval which maximizes the ergodic availability are analytically derived, and solvable examples are given for well-known failure time distributions. Further, several methods needed for numerical calculations are proposed for the case where the information on system failures is not sufficient. We use four analytical/tractable approximation methods to calculate the optimal checkpoint schedule. Finally, it is shown through numerical comparisons that the gamma approximation method is the best for finding a precise approximate solution.

  • A Novel Replication Technique for Detecting and Masking Failures for Parallel Software: Active Parallel Replication

    Adel CHERIF  Masato SUZUKI  Takuya KATAYAMA  

     
    PAPER-Fault Tolerance

      Vol:
    E80-D No:9
      Page(s):
    886-892

    We present a novel replication technique for parallel applications where instances of the replicated application are active on different groups of processors called replicas. The replication technique is based on the FTAG (Fault Tolerant Attribute Grammar) computation model. FTAG is a functional and attribute-based model. The developed replication technique implements "active parallel replication," that is, all replicas are active and each concurrently computes a different piece of the application's parallel code. In our model, replicas cooperate not only to detect and mask failures but also to perform parallel computation. The replication mechanisms are supported by the FTAG run-time system and are fully application-transparent. Different novel mechanisms for checkpointing and recovery are developed. In our model, during rollback recovery only the part of the computation that was detected as faulty is discarded. The replication technique takes full advantage of parallel computing to reduce overall computation time.

  • Achieving Fault Tolerance in Pipelined Multiprocessor Systems

    Jeng-Ping LIN  Sy-Yen KUO  

     
    PAPER-Fault Tolerant Computing

      Vol:
    E80-D No:6
      Page(s):
    665-671

    This paper focuses on recovering from transient processor faults in pipelined multiprocessor systems. A pipelined machine may employ out-of-order execution and branch prediction techniques to increase performance, so a precise computation state may not be available. We propose an efficient scheme to maintain the precise computation state in a pipelined machine. The goal of this paper is to implement checkpointing and rollback recovery utilizing the technique of precise interrupts in a pipelined system. Detailed analysis is included to demonstrate the effectiveness of this method.

  • Group Communications Algorithm for Dynamically Updating in Distributed Systems

    Hiroaki HIGAKI  

     
    PAPER-Computer Networks

      Vol:
    E78-D No:4
      Page(s):
    444-454

    This paper proposes a novel updating technique, dynamically updating, for extending or modifying functions in a distributed system. The usual updating technique requires synchronous suspension of multiple processes to avoid unspecified reception caused by conflicts between different versions of processes. Thus, this technique incurs very high overhead, and the types of distributed systems to which it can be applied are restricted to the RPC (remote procedure call) type or the client-server type. Using the proposed dynamically updating technique, updating management can be invoked asynchronously by each process with assurance of correct execution of the system, i.e., the system can cope with the effect of unspecified reception caused by a mixture of processes of different versions. Therefore, low-overhead updating can be achieved in partner-type distributed systems, a more general type that includes communication systems and computer networks. The dynamically updating technique is implemented by using a novel distributed algorithm that consists of group communication, checkpoint setting, and rollback recovery. By using the algorithm proposed in this paper, rollback recovery can be achieved with the lowest overhead, i.e., a set of checkpoints determines the last global state for consistent rollback recovery, and the set of processes that need to roll back simultaneously is the smallest one. This paper also proves the correctness of the proposed algorithm.

  • A Note on Optimal Checkpoint Sequence Taking Account of Preventive Maintenance

    Masanori ODAGIRI  Naoto KAIO  Shunji OSAKI  

     
    LETTER-Maintainability

      Vol:
    E77-A No:1
      Page(s):
    244-246

    Checkpointing is one of the most powerful tools for operating a computer system with high reliability. Checkpointing should be executed optimally in some sense. This note shows the optimal checkpoint sequence minimizing the expected loss. Numerical examples are shown for illustration.

  • Synthesis of Protocol Specifications for Design of Responsive Protocols

    Hirotaka IGARASHI  Yoshiaki KAKUDA  Tohru KIKUNO  

     
    PAPER

      Vol:
    E76-D No:11
      Page(s):
    1375-1385

    Responsive protocols are communication protocols which ensure timely and reliable recovery when error events occur. Protocol synthesis for the design of responsive protocols derives a protocol specification from a service specification. In the previous methods, if the service specification includes simultaneous transmission of primitives from a high layer to a low layer through different service access points, then the derived protocol specification includes protocol errors of unspecified reception caused by message collisions. Also, they only include a recovery function such as retransmission of messages. This is not enough for recovery from abnormal states due to coordination loss. This paper extends the class of derived protocol specifications to include message collisions, which usually occur in real communication protocols. Furthermore, this paper proposes a new method for synthesizing a responsive protocol specification from a service specification such that the derived protocol specification is free from protocol errors of unspecified reception caused by message collisions and includes two recovery functions: message retransmission and checkpoint restart.
