
Keyword Search Result

[Keyword] failure recovery(13hit)

1-13hit
  • System Design for Traveling Maintenance in Wide-Area Telecommunication Networks

    Kouji HIRATA  Hiroshi YAMAMOTO  Shohei KAMAMURA  Toshiyuki OKA  Yoshihiko UEMATSU  Hideki MAEDA  Miki YAMAMOTO  

     
    PAPER

      Publicized:
    2019/10/25
      Vol:
    E103-B No:4
      Page(s):
    363-374

    This paper proposes a traveling maintenance method based on the resource pool concept, as a new network maintenance model. For failure recovery, the proposed method utilizes permissible time that is ensured by shared resource pools. In the proposed method, even if a failure occurs in a communication facility, maintenance staff wait for occurrence of successive failures in other communication facilities during the permissible time instead of immediately tackling the failure. Then, the maintenance staff successively visit the communication facilities that have faulty devices and collectively repair them. Therefore, the proposed method can reduce the amount of time that the maintenance staff take for fault recovery. Furthermore, this paper provides a system design that optimizes the proposed traveling maintenance according to system requirements determined by the design philosophy of telecommunication networks. Through simulation experiments, we show the effectiveness of the proposed method.

  • RLE-MRC: Robustness and Low-Energy Based Multiple Routing Configurations for Fast Failure Recovery

    Takayuki HATANAKA  Takuji TACHIBANA  

     
    PAPER-Network

      Publicized:
    2019/04/12
      Vol:
    E102-B No:10
      Page(s):
    2045-2053

    Energy consumption is one of the important issues in communication networks, and it is expected that network devices such as network interface cards will be turned off to decrease the energy consumption. Moreover, fast failure recovery is an important issue in large-scale communication networks to minimize the impact of failure on data transmission. In order to realize both low energy consumption and fast failure recovery, a method called LE-MRC (Low-Energy based Multiple Routing Configurations) has been proposed. However, LE-MRC can degrade network robustness because some link ports are turned off to reduce the energy consumption. Network robustness is nevertheless important for maintaining the performance of data transmission and the network functionality. In this paper, for realizing both low energy consumption and fast failure recovery while maintaining network robustness, we propose Robustness and Low-Energy based Multiple Routing Configurations (RLE-MRC). In RLE-MRC, some links are categorized as unnecessary links, and those links are turned off to lower the energy consumption. In particular, the number of excluded links is determined based on the network robustness. As a result, the energy consumption can be reduced without degrading the network robustness significantly. Simulations are conducted on some network topologies to evaluate the performance of RLE-MRC. We also use ns-3 to evaluate how the performance of data transmission and network robustness are changed by using RLE-MRC. Numerical examples show that low energy consumption and fast failure recovery can be achieved while maintaining network robustness by using RLE-MRC.

  • Single Failure Recovery Method for Erasure Coded Storage System with Heterogeneous Devices Open Access

    Yingxun FU  Junyi GUO  Li MA  Jianyong DUAN  

     
    LETTER-Data Engineering, Web Information Systems

      Publicized:
    2019/06/14
      Vol:
    E102-D No:9
      Page(s):
    1865-1869

    As the demand for data reliability grows, most of today's storage systems adopt erasure codes to ensure that data can be reconstructed after physical device failures. To quickly recover the data lost in a single failure, recovery optimization methods have attracted a lot of attention in recent years. However, most of the existing optimization methods focus on homogeneous devices, ignoring the fact that storage devices are usually heterogeneous. In this paper, we propose a new recovery optimization method named the HSR (Heterogeneous Storage Recovery) method, which uses both the loads and the speed rates of the physical devices as the optimization target, in order to further improve the recovery performance for heterogeneous devices. The experimental results show that, compared to existing popular recovery optimization methods, the HSR method achieves a much higher recovery speed over heterogeneous storage devices.

  • Designing Distributed SDN C-Plane Considering Large-Scale Disruption and Restoration Open Access

    Takahiro HIRAYAMA  Masahiro JIBIKI  Hiroaki HARAI  

     
    PAPER

      Publicized:
    2018/09/20
      Vol:
    E102-B No:3
      Page(s):
    452-463

    Software-defined networking (SDN) technology enables us to flexibly configure switches in a network. Previously, distributed SDN control methods have been discussed to improve scalability and robustness. Placing controllers in a distributed manner and having them back each other up enhances robustness. However, these techniques do not include an emergency measure against large-scale failures such as network separation induced by disasters. In this study, we first propose a network partitioning method to create a control plane (C-plane) that is robust against large-scale failures. In our approach, networks are partitioned into multiple sub-networks based on the robust topology coefficient (RTC). The RTC denotes the probability that nodes in a sub-network are isolated from controllers when a large-scale failure occurs. By placing a local controller in each sub-network, 6%-10% more controller-switch connections are retained after failure as compared to other approaches. Furthermore, we discuss reactive emergency reconstruction of a distributed SDN C-plane. Each node detects a disconnection from its controller. Then, the C-plane is reconstructed by the isolated switches and managed by a substitute controller. Meanwhile, our approach reconstructs the C-plane when network connectivity recovers. The main and substitute controllers detect network restoration and merge their C-planes without conflict. Simulation results reveal that our proposed method recovers C-plane logical connectivity with a probability of approximately 90% when failure occurs in 100-node networks. Furthermore, we demonstrate that the convergence time of our reconstruction mechanism is proportional to the network size.

  • Strip-Switched Deployment Method to Optimize Single Failure Recovery for Erasure Coded Storage Systems

    Yingxun FU  Shilin WEN  Li MA  Jianyong DUAN  

     
    LETTER-Computer System

      Publicized:
    2018/07/25
      Vol:
    E101-D No:11
      Page(s):
    2818-2822

    With the rapid growth in data scale and complexity, single disk failure recovery has become very important for erasure coded storage systems. In this paper, we propose a new strip-switched deployment method (SSDM), which exploits the fact that the strips within each stripe of an erasure code can be switched, and uses a simulated annealing algorithm to search for a strip deployment at the stack level that balances the read accesses, in order to improve the recovery performance. The analysis and experimental results show that SSDM can effectively improve the single failure recovery performance.

  • Future Nationwide Optical Network Architecture for Higher Availability and Operability Using Transport SDN Technologies Open Access

    Yoshihiko UEMATSU  Shohei KAMAMURA  Hiroki DATE  Hiroshi YAMAMOTO  Aki FUKUDA  Rie HAYASHI  Katsutoshi KODA  

     
    POSITION PAPER-Transmission Systems and Transmission Equipment for Communications

      Publicized:
    2017/08/08
      Vol:
    E101-B No:2
      Page(s):
    462-475

    An optical transport network is composed of optical transport systems deployed in thousands of office-buildings. As a common infrastructure to accommodate diversified communication services with drastic traffic growth, it is necessary not only to continuously convey the growing traffic but also to achieve high end-to-end communication quality and availability and provide flexible controllability in cooperation with service layer networks. To achieve high-speed and large-capacity transport systems cost-effectively, system configuration, applied devices, and the manufacturing process have recently begun to change, and the cause of failure or performance degradation has become more complex and diversified. The drastic traffic growth and pattern change of service networks increase the frequency and scale of transport-capacity increase and transport-network reconfiguration in cooperation with service networks. Therefore, drastic traffic growth affects both optical-transport-system configuration and its operational cycles. In this paper, we give an overview of the operational problems emerging in current nationwide optical transport networks, and based on trends analysis for system configuration and network-control schemes, we propose a vision of the future nationwide optical-transport-network architecture expressed using five target features.

  • ResilientFlow: Deployments of Distributed Control Channel Maintenance Modules to Recover SDN from Unexpected Failures

    Takuya OMIZO  Takuma WATANABE  Toyokazu AKIYAMA  Katsuyoshi IIDA  

     
    PAPER

      Vol:
    E99-B No:5
      Page(s):
    1041-1053

    Although SDN provides desirable characteristics such as manageability, flexibility and extensibility of networks, it has a considerable disadvantage in reliability due to its centralized architecture. To protect SDN-enabled networks under large-scale, unexpected link failures, we propose ResilientFlow, which deploys a distributed module called the Control Channel Maintenance Module (CCMM) on every switch and controller. The CCMMs enable switches to maintain their own control channels, which are a core and fundamental part of SDN. In this paper, we design, implement, and evaluate ResilientFlow.

  • Failure Detection in P2P-Grid System

    Huan WANG  Hideroni NAKAZATO  

     
    PAPER-Grid System

      Publicized:
    2015/09/15
      Vol:
    E98-D No:12
      Page(s):
    2123-2131

    Peer-to-peer (P2P)-Grid systems are being investigated as a platform for converging the Grid and P2P networks in the construction of large-scale distributed applications. The highly dynamic nature of P2P-Grid systems greatly affects the execution of distributed programs. Uncertainty caused by arbitrary node failure and departure significantly affects the availability of computing resources and system performance. Checkpoint-and-restart is the most common scheme for fault tolerance because it periodically saves the execution progress onto stable storage. In this paper, we suggest a checkpoint-and-restart mechanism as a fault-tolerant method for applications on P2P-Grid systems. A failure detection mechanism is a necessary prerequisite to fault tolerance and fault recovery in general. Given the highly dynamic nature of nodes within P2P-Grid systems, any failure should be detected to ensure effective task execution. Therefore, the failure detection mechanism was studied as an integral part of P2P-Grid systems. We discuss how the design of various failure detection algorithms affects their performance in terms of the average failure detection time of nodes. Numerical analysis results and an implementation evaluation are also provided to show the different average failure detection times in real systems for various failure detection algorithms. The comparison shows the shortest average failure detection time by 8.8s on the basis of the WP failure detector. Our lowest mean time to recovery (MTTR) is also shown to have a distinct advantage, with a time consumption reduction of about 5.5s over its counterparts.
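    The abstract above does not spell out its failure detection algorithms, so the following is only a generic heartbeat-timeout detector, sketched to show why the detection time depends on design parameters. The names `HeartbeatDetector`, `detection_time`, `period`, and `timeout` are assumptions introduced for this sketch and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Generic timeout-based failure detector (illustrative, not the paper's algorithm)."""
    timeout: float                                  # seconds of silence before a node is suspected
    last_seen: dict = field(default_factory=dict)   # node name -> time of latest heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from `node`.
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        # A node is suspected once its last heartbeat is older than `timeout`.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


def detection_time(crash_at: float, period: float, timeout: float) -> float:
    """Delay between a crash and its detection.

    Assumes heartbeats are sent at 0, period, 2*period, ... with no network
    delay, and that a heartbeat due exactly at the crash instant is still sent.
    """
    last_heartbeat = (crash_at // period) * period
    return last_heartbeat + timeout - crash_at


detector = HeartbeatDetector(timeout=5.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=3.0)
suspects_at_6 = detector.suspected(now=6.0)   # "a" has been silent for 6 s, "b" for 3 s
```

    Under these assumptions, a shorter heartbeat period or timeout lowers the detection time at the cost of more traffic or more false suspicions, which is the kind of trade-off a comparison of detectors has to weigh.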

  • Enhancing MPLS Protection Method with Adaptive Segment Repair

    Chin-Ling CHEN  

     
    PAPER-Network

      Vol:
    E92-B No:10
      Page(s):
    3126-3131

    We propose a novel adaptive segment repair mechanism to improve traditional MPLS (Multi-Protocol Label Switching) failure recovery. The proposed mechanism protects one or more contiguous high failure probability links by dynamic setup of segment protection. Simulations demonstrate that the proposed mechanism reduces failure recovery time while also increasing network resource utilization.

  • Improving Ethernet Reliability and Stability Using Global Open Ethernet Technology

    Masaki UMAYABASHI  Youichi HIDAKA  Nobuyuki ENOMOTO  Daisaku OGASAHARA  Kazuo TAKAGI  Atsushi IWATA  Akira ARUTAKI  

     
    PAPER

      Vol:
    E89-B No:3
      Page(s):
    675-682

    In this paper, the authors present new schemes of our proposed Global Open Ethernet (GOE) technology from the viewpoint of improving reliability in a metro-area Ethernet environment and show numerical evidence of their performance. Although several standardized or vendor-proprietary technologies have been proposed to improve Ethernet reliability, they still have reliability problems in terms of long failure recovery time (due to forwarding database (FDB) flush and recovery from a root bridge failure in the spanning tree protocol), broadcast storms, and packet loss during network reconfiguration. To solve these problems, we introduce three schemes: a Per Destination - Multiple Rapid Spanning Tree Protocol (PD-MRSTP), a GOE Virtual Switch Redundancy Protocol (GVSRP), and an In-Service Reconfiguration (ISR) scheme. The PD-MRSTP scheme reduces the failure recovery time by eliminating the need to flush the FDB and to recover from root bridge failures. The GVSRP scheme ensures the reliability of connections between a GOE domain and a legacy Ethernet domain. Combined with PD-MRSTP, GVSRP prevents broadcast storm problems due to loops in the inter-domain area. The ISR scheme enables in-service bridge replacement and upgrade without packet loss. Evaluating our prototype system, we obtained the following remarkable performance results. The GOE network using the PD-MRSTP scheme delivered fast failure recovery (4 ms) independent of the number of MAC address entries, whereas the legacy Ethernet network took 522 ms when a bridge had 6000 MAC address entries. Since we found that the failure recovery time increased in proportion to the number of MAC address entries, recovery in a large carrier network with one million MAC address entries would take several tens of seconds. Thus, using PD-MRSTP can reduce the failure recovery time to one ten-thousandth of that of legacy Ethernet. In addition, evaluation of the ISR scheme demonstrated that a network can be upgraded with zero packet loss. Therefore, a GOE-based VPN is a promising alternative to other Ethernet VPNs for its reliability and stability.
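    The extrapolation in the abstract above can be checked with a few lines of arithmetic. The sketch assumes strictly proportional scaling through the single quoted data point (522 ms at 6000 entries); the function name `legacy_recovery_ms` is introduced here for illustration only.

```python
# Measured values quoted in the abstract.
LEGACY_MS_AT_6000 = 522.0   # legacy Ethernet recovery time with 6000 MAC address entries
GOE_MS = 4.0                # PD-MRSTP recovery time, independent of the table size


def legacy_recovery_ms(mac_entries: int) -> float:
    # Assumption: recovery time scales linearly through the one quoted data point.
    return LEGACY_MS_AT_6000 / 6000 * mac_entries


carrier_ms = legacy_recovery_ms(1_000_000)   # legacy recovery time at one million entries
ratio = GOE_MS / carrier_ms                  # GOE recovery time relative to legacy Ethernet
print(f"legacy: {carrier_ms / 1000:.0f} s, GOE/legacy ratio: {ratio:.1e}")
```

    Under this assumption the legacy recovery time comes to about 87 s, consistent with "several tens of seconds", and the ratio is a few hundred-thousandths, the same order of magnitude as the one ten-thousandth stated in the abstract.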

  • Restoring Delivery Tree from Node Failures in Overlay Multicast

    Zongming FEI  Mengkun YANG  

     
    PAPER-Network

      Vol:
    E88-B No:5
      Page(s):
    2046-2053

    One of the important problems in overlay multicast is how to deal with node failures and ungraceful departures. When a non-leaf end host fails or leaves the multicast session, all downstream nodes are affected. In this paper, we adopt the proactive approach, which pre-calculates a candidate node (called the parent-to-be) for each node to connect to in case its current parent dies. The goal is to recover the overlay multicast tree quickly so that the disruption of service to the affected nodes is minimized. We combine the local and global parent-to-be locating schemes in order to take advantage of the lower interference of the local scheme and the flexibility of the global scheme. The quality of the recovered tree is improved while the responsiveness of the proactive approach is maintained.
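    The proactive idea in the abstract above, precomputing a parent-to-be for every node, can be sketched as follows. This is not the paper's local or global locating scheme: the selection policy (closest to the root, outside the failed parent's subtree, with spare out-degree) and the names `parents_to_be` and `max_degree` are assumptions made for this illustration.

```python
def descendants(children, node):
    """All nodes in the subtree rooted at `node`, including `node` itself."""
    seen, stack = set(), [node]
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(children.get(cur, []))
    return seen


def depth(parent, node):
    """Number of hops from `node` up to the root."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def parents_to_be(parent, max_degree):
    """Map each non-root node to a precomputed backup parent, or None if none exists.

    `parent` maps every node to its current parent (None for the root).
    """
    children = {}
    for child, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(child)

    backup = {}
    for node, par in parent.items():
        if par is None:
            continue  # the root has no parent to lose
        # Exclude the whole subtree of the current parent: if it dies, those
        # nodes are cut off too, and attaching to them could create a loop.
        cut_off = descendants(children, par)
        candidates = [
            c for c in parent
            if c not in cut_off and len(children.get(c, [])) < max_degree
        ]
        # Hypothetical policy: prefer the candidate closest to the root.
        # Simplification: several orphans may pick the same candidate.
        backup[node] = min(candidates, key=lambda c: (depth(parent, c), c), default=None)
    return backup


tree = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a"}
plan = parents_to_be(tree, max_degree=3)
```

    In this example, `c` and `d` would reattach to the root `r` if `a` fails, while `a` and `b` get no parent-to-be because the sketch does not handle failure of the root.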

  • Determining Consistent Global Checkpoints of a Distributed Computation

    Dakshnamoorthy MANIVANNAN  

     
    PAPER-Computer Systems

      Vol:
    E87-D No:1
      Page(s):
    164-174

    Determining consistent global checkpoints of a distributed computation has applications in areas such as rollback recovery, distributed debugging, and output commit. Netzer and Xu introduced the notion of zigzag paths and presented necessary and sufficient conditions for a set of checkpoints to be part of a consistent global checkpoint. This result also reveals that determining the existence of zigzag paths between checkpoints is crucial for determining consistent global checkpoints. Recent research also reveals that determining zigzag paths on-line is not possible. In this paper, we present an off-line method for determining the existence of zigzag paths between checkpoints.

  • Efficient Techniques for Adaptive Independent Checkpointing in Distributed Systems

    Cheng-Min LIN  Chyi-Ren DOW  

     
    PAPER-Fault Tolerance

      Vol:
    E83-D No:8
      Page(s):
    1642-1653

    This work presents two novel algorithms to prevent rollback propagation for independent checkpointing: an efficient adaptive independent checkpointing algorithm and an optimized adaptive independent checkpointing algorithm. The last opportunity strategy, which yields better performance than the conservation strategy, is also employed to prevent useless checkpoints for both causal rewinding paths and non-causal rewinding paths. The two methods proposed herein are domino effect-free and require only a limited amount of control information. They also take fewer unnecessary adaptive checkpoints than other algorithms. Furthermore, experimental results indicate that the checkpoint overhead of our techniques is lower than that of the coordinated checkpointing and domino effect-free algorithms for service-providing applications.