IEICE TRANSACTIONS on Information

Efficient Recovery from Communication Errors in Distributed Shared Memory Systems

Jenn-Wei LIN, Sy-Yen KUO

Summary:

This paper investigates the problem of communication errors in distributed shared memory (DSM) systems. Communication errors can introduce two critical problems: damage and loss. With damage, the transmitted data is corrupted in transit, which leads to incorrect computational results. With loss, the transmitted data disappears in transit and is never received. The loss problem, however, is easily resolved with acknowledgements, so we focus on how to handle the damage problem efficiently. In DSM systems, the amount of data transferred between nodes is larger than the amount actually shared between them. That is, when a processing node receives data, it will not use every data item in what it received. Based on this property, we present a new technique for resolving the data damage problem in DSM systems. When a processing node receives damaged data, the technique allows it to continue computing instead of blocking until the correct data arrives. The latency of handling the data damage can therefore be hidden. The technique does rely on an optimistic assumption, and if that assumption does not hold, the latency is not hidden. To show the advantage and the overhead of the proposed technique, we perform extensive trace-driven simulations. The simulation results show that at least 62% of the latency for handling data damage can be hidden.
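
The idea of continuing past damaged data can be illustrated with a small sketch. This is not the authors' algorithm, which the summary does not spell out. It is a minimal illustration under stated assumptions: damage is detected with a per-block CRC32, the names `BLOCK`, `ReceivedPage`, `receive`, and `refetch` are hypothetical, and the optimistic assumption is taken to mean that the damaged part of a transfer is not accessed before the correct copy is available.

```python
import zlib
from dataclasses import dataclass

BLOCK = 4  # hypothetical block size in bytes, kept tiny for illustration


def checksums(data: bytes) -> list[int]:
    """Return the CRC32 of each fixed-size block of data."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]


@dataclass
class ReceivedPage:
    data: bytearray
    damaged: set[int]  # indices of blocks whose checksum did not match
    stalls: int = 0    # number of reads that had to wait for correct data

    def read(self, offset: int, refetch) -> int:
        """Return the byte at offset, waiting only if its block is damaged."""
        block = offset // BLOCK
        if block in self.damaged:
            # Optimistic assumption failed: the damaged block is needed now,
            # so the latency of fetching the correct copy is exposed.
            self.stalls += 1
            start = block * BLOCK
            self.data[start:start + BLOCK] = refetch(block)
            self.damaged.discard(block)
        return self.data[offset]


def receive(payload: bytes, sent_sums: list[int]) -> ReceivedPage:
    """Accept a page even when damaged, recording which blocks are suspect."""
    got_sums = checksums(payload)
    bad = {i for i, (got, want) in enumerate(zip(got_sums, sent_sums)) if got != want}
    return ReceivedPage(bytearray(payload), bad)
```

In this sketch a read that touches only intact blocks never waits, which corresponds to the case where the recovery latency is hidden. A read that touches a damaged block has to call `refetch` first, which corresponds to the case where the optimistic assumption does not hold.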

Publication
IEICE TRANSACTIONS on Information Vol.E81-D No.11 pp.1213-1223
Publication Date
1998/11/25
Category
Fault Tolerant Computing
