Efficient Recovery from Communication Errors in Distributed Shared Memory Systems

Jenn-Wei LIN; Sy-Yen KUO

IEICE TRANSACTIONS on Information

Summary:

This paper investigates communication errors in distributed shared memory (DSM) systems. Communication errors can introduce two critical problems: damage and loss. With damage, the transmitted data is corrupted, which leads to incorrect computational results. With loss, the transmitted data disappears in transit and is never received. The loss problem can be easily resolved using acknowledgements, so we focus on handling the damage problem efficiently. In DSM systems, the amount of data transferred between nodes is larger than the amount actually shared between them. That is, when a processing node receives data, not all of the data items it contains will be used. Based on this property, we present a new technique for resolving the data damage problem in DSM systems. The technique allows a processing node that receives damaged data to continue computation rather than blocking until the correct data arrives, so the latency of handling data damage can be hidden. The technique relies on an optimistic assumption, however; if this assumption does not hold, the latency is not hidden. To show the advantage and the overhead of the proposed technique, we perform extensive trace-driven simulations. The simulation results show that at least 62% of the latency for handling data damage can be hidden.
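The idea in the summary can be illustrated with a small sketch. This is not the authors' algorithm, which the summary does not spell out. It assumes a per-item CRC-32 checksum, a hypothetical `ReceivedPage` class, and one plausible reading of the optimistic assumption: that the node does not touch a damaged item before its corrected copy arrives. Under those assumptions, a read of an undamaged item never waits, and a stall is counted only when a still-damaged item is accessed.

```python
import zlib


def checksum(item: bytes) -> int:
    """CRC-32 of one data item (per-item checksums are an assumption here)."""
    return zlib.crc32(item)


class ReceivedPage:
    """Hypothetical receiver-side view of one transferred block of items."""

    def __init__(self, items, checksums):
        self.items = list(items)
        # Mark every item whose checksum does not match the sender's value.
        self.damaged = {
            i
            for i, (item, expected) in enumerate(zip(self.items, checksums))
            if checksum(item) != expected
        }
        self.stalls = 0  # number of reads that had to wait for correct data

    def repair(self, index, correct_item):
        """Install a retransmitted item that arrived in the background."""
        self.items[index] = correct_item
        self.damaged.discard(index)

    def read(self, index, fetch_correct):
        """Return an item; wait (count a stall) only if it is still damaged."""
        if index in self.damaged:
            self.stalls += 1
            self.repair(index, fetch_correct(index))
        return self.items[index]


if __name__ == "__main__":
    sent = [b"a0", b"b1", b"c2", b"d3"]
    sums = [checksum(x) for x in sent]
    wire = list(sent)
    wire[2] = b"c?"  # simulate damage to one item in transit

    page = ReceivedPage(wire, sums)
    page.read(0, lambda i: sent[i])   # undamaged item: computation continues
    page.repair(2, sent[2])           # corrected copy arrives meanwhile
    page.read(2, lambda i: sent[i])   # no stall, the latency was hidden
    print("damaged:", page.damaged, "stalls:", page.stalls)
```

If the node reads item 2 before `repair` is called, `stalls` increases, which corresponds to the case where the optimistic assumption fails and the latency is not hidden.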

Publication: IEICE TRANSACTIONS on Information Vol.E81-D No.11 pp.1213-1223

Publication Date: 1998/11/25

Category: Fault Tolerant Computing

Jenn-Wei LIN, Sy-Yen KUO, "Efficient Recovery from Communication Errors in Distributed Shared Memory Systems" in IEICE TRANSACTIONS on Information, vol. E81-D, no. 11, pp. 1213-1223, November 1998.
URL: https://global.ieice.org/en_transactions/information/10.1587/e81-d_11_1213/_p

@ARTICLE{e81-d_11_1213,
author={Jenn-Wei LIN and Sy-Yen KUO},
journal={IEICE TRANSACTIONS on Information},
title={Efficient Recovery from Communication Errors in Distributed Shared Memory Systems},
year={1998},
volume={E81-D},
number={11},
pages={1213-1223},
keywords={},
doi={},
ISSN={},
month={November},
}

TY - JOUR
TI - Efficient Recovery from Communication Errors in Distributed Shared Memory Systems
T2 - IEICE TRANSACTIONS on Information
SP - 1213
EP - 1223
AU - Jenn-Wei LIN
AU - Sy-Yen KUO
PY - 1998
DO -
JO - IEICE TRANSACTIONS on Information
SN -
VL - E81-D
IS - 11
JA - IEICE TRANSACTIONS on Information
Y1 - November 1998
ER -
