Cluster systems are becoming widely used because of their good performance/cost ratio. However, their reliability in practical environments has not been well discussed so far. As the number of commodity components in a cluster system increases, it becomes indispensable to support reliability in system software. The SCore cluster system software is a parallel programming environment for High Performance Computing (HPC). SCore provides a checkpointing and rollback-recovery mechanism for high availability. In this paper, we quantitatively analyze and evaluate the checkpointing and rollback-recovery mechanisms of SCore. The experimental results reveal that the time required for checkpointing scales very well with respect to the number of computing nodes. However, the required time is quite long due to the low effective network bandwidth. Based on the results, we modify SCore and successfully make checkpointing and recovery 1.8
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Masaaki KONDO, Takuro HAYASHIDA, Masashi IMAI, Hiroshi NAKAMURA, Takashi NANYA, Atsushi HORI, "Evaluation of Checkpointing Mechanism on SCore Cluster System" in IEICE TRANSACTIONS on Information,
vol. E86-D, no. 12, pp. 2553-2562, December 2003.
Abstract: Cluster systems are becoming widely used because of their good performance/cost ratio. However, their reliability in practical environments has not been well discussed so far. As the number of commodity components in a cluster system increases, it becomes indispensable to support reliability in system software. The SCore cluster system software is a parallel programming environment for High Performance Computing (HPC). SCore provides a checkpointing and rollback-recovery mechanism for high availability. In this paper, we quantitatively analyze and evaluate the checkpointing and rollback-recovery mechanisms of SCore. The experimental results reveal that the time required for checkpointing scales very well with respect to the number of computing nodes. However, the required time is quite long due to the low effective network bandwidth. Based on the results, we modify SCore and successfully make checkpointing and recovery 1.8
URL: https://global.ieice.org/en_transactions/information/10.1587/e86-d_12_2553/_p
@ARTICLE{e86-d_12_2553,
author={Masaaki KONDO and Takuro HAYASHIDA and Masashi IMAI and Hiroshi NAKAMURA and Takashi NANYA and Atsushi HORI},
journal={IEICE TRANSACTIONS on Information},
title={Evaluation of Checkpointing Mechanism on SCore Cluster System},
year={2003},
volume={E86-D},
number={12},
pages={2553-2562},
abstract={Cluster systems are becoming widely used because of their good performance/cost ratio. However, their reliability in practical environments has not been well discussed so far. As the number of commodity components in a cluster system increases, it becomes indispensable to support reliability in system software. The SCore cluster system software is a parallel programming environment for High Performance Computing (HPC). SCore provides a checkpointing and rollback-recovery mechanism for high availability. In this paper, we quantitatively analyze and evaluate the checkpointing and rollback-recovery mechanisms of SCore. The experimental results reveal that the time required for checkpointing scales very well with respect to the number of computing nodes. However, the required time is quite long due to the low effective network bandwidth. Based on the results, we modify SCore and successfully make checkpointing and recovery 1.8},
keywords={},
doi={},
ISSN={},
month={December},}
TY - JOUR
TI - Evaluation of Checkpointing Mechanism on SCore Cluster System
T2 - IEICE TRANSACTIONS on Information
SP - 2553
EP - 2562
AU - Masaaki KONDO
AU - Takuro HAYASHIDA
AU - Masashi IMAI
AU - Hiroshi NAKAMURA
AU - Takashi NANYA
AU - Atsushi HORI
PY - 2003
DO -
JO - IEICE TRANSACTIONS on Information
SN -
VL - E86-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2003
AB - Cluster systems are becoming widely used because of their good performance/cost ratio. However, their reliability in practical environments has not been well discussed so far. As the number of commodity components in a cluster system increases, it becomes indispensable to support reliability in system software. The SCore cluster system software is a parallel programming environment for High Performance Computing (HPC). SCore provides a checkpointing and rollback-recovery mechanism for high availability. In this paper, we quantitatively analyze and evaluate the checkpointing and rollback-recovery mechanisms of SCore. The experimental results reveal that the time required for checkpointing scales very well with respect to the number of computing nodes. However, the required time is quite long due to the low effective network bandwidth. Based on the results, we modify SCore and successfully make checkpointing and recovery 1.8
ER -