IEICE global.ieice.org Site

Keyword Search Result

[Keyword] failure recovery (13 hits)

System Design for Traveling Maintenance in Wide-Area Telecommunication Networks
Kouji HIRATA Hiroshi YAMAMOTO Shohei KAMAMURA Toshiyuki OKA Yoshihiko UEMATSU Hideki MAEDA Miki YAMAMOTO

PAPER

Publicized: 2019/10/25
Vol: E103-B No:4
Page(s): 363-374
This paper proposes a traveling maintenance method based on the resource pool concept as a new network maintenance model. For failure recovery, the proposed method utilizes the permissible time that is ensured by shared resource pools. In the proposed method, even if a failure occurs in a communication facility, maintenance staff do not tackle it immediately; instead, they wait during the permissible time for successive failures to occur in other communication facilities. The maintenance staff then visit the communication facilities that have faulty devices in succession and repair them collectively. The proposed method can therefore reduce the amount of time that the maintenance staff spend on fault recovery. Furthermore, this paper provides a system design that optimizes the proposed traveling maintenance according to system requirements determined by the design philosophy of telecommunication networks. Through simulation experiments, we show the effectiveness of the proposed method.
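The idea of waiting during the permissible time and repairing several failures in one trip can be illustrated with a minimal sketch. The grouping rule below (a trip covers every failure that occurs within the permissible time of the first failure in that trip) and the names `batch_failures` and `permissible` are assumptions made for illustration; the paper's actual system design and optimization are not reproduced here.

```python
from typing import List


def batch_failures(failure_times: List[float], permissible: float) -> List[List[float]]:
    """Group failure times into maintenance trips.

    Assumed rule: a trip is triggered by its first failure and also covers
    every later failure that occurs within `permissible` time units of it.
    """
    batches: List[List[float]] = []
    for t in sorted(failure_times):
        if batches and t <= batches[-1][0] + permissible:
            batches[-1].append(t)   # still inside the current trip's window
        else:
            batches.append([t])     # window expired: start a new trip
    return batches


# Hypothetical failure times (hours) and a 6-hour permissible time.
trips = batch_failures([0.0, 3.0, 5.0, 20.0, 22.0], permissible=6.0)
print(len(trips), "trips instead of 5:", trips)
```

With these made-up inputs, five failures are handled in two trips, which is the kind of saving the abstract describes.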
RLE-MRC: Robustness and Low-Energy Based Multiple Routing Configurations for Fast Failure Recovery
Takayuki HATANAKA Takuji TACHIBANA

PAPER-Network

Publicized: 2019/04/12
Vol: E102-B No:10
Page(s): 2045-2053
Energy consumption is one of the important issues in communication networks, and it is expected that network devices such as network interface cards will be turned off to decrease energy consumption. Moreover, fast failure recovery is an important issue in large-scale communication networks to minimize the impact of failures on data transmission. To realize both low energy consumption and fast failure recovery, a method called LE-MRC (Low-Energy based Multiple Routing Configurations) has been proposed. However, LE-MRC can degrade network robustness because some link ports are turned off to reduce energy consumption, and network robustness is also important for maintaining the performance of data transmission and the network functionality. In this paper, to realize both low energy consumption and fast failure recovery while maintaining network robustness, we propose Robustness and Low-Energy based Multiple Routing Configurations (RLE-MRC). In RLE-MRC, some links are categorized as unnecessary links, and those links are turned off to lower the energy consumption. In particular, the number of excluded links is determined based on the network robustness. As a result, the energy consumption can be reduced without significantly degrading the network robustness. Simulations are conducted on some network topologies to evaluate the performance of RLE-MRC. We also use ns-3 to evaluate how the performance of data transmission and network robustness are changed by using RLE-MRC. Numerical examples show that low energy consumption and fast failure recovery can be achieved while maintaining network robustness by using RLE-MRC.
Single Failure Recovery Method for Erasure Coded Storage System with Heterogeneous Devices Open Access
Yingxun FU Junyi GUO Li MA Jianyong DUAN

LETTER-Data Engineering, Web Information Systems

Publicized: 2019/06/14
Vol: E102-D No:9
Page(s): 1865-1869
As the demand for data reliability grows, most of today's storage systems adopt erasure codes to ensure that data can be reconstructed after physical device failures. To recover lost data quickly after a single failure, recovery optimization methods have attracted a lot of attention in recent years. However, most of the existing optimization methods focus on homogeneous devices, ignoring the fact that storage devices are usually heterogeneous. In this paper, we propose a new recovery optimization method named HSR (Heterogeneous Storage Recovery), which uses both the loads and the speed rates of physical devices as the optimization target in order to further improve the recovery performance for heterogeneous devices. Experimental results show that, compared to existing popular recovery optimization methods, HSR achieves much higher recovery speed on heterogeneous storage devices.
Designing Distributed SDN C-Plane Considering Large-Scale Disruption and Restoration Open Access
Takahiro HIRAYAMA Masahiro JIBIKI Hiroaki HARAI

PAPER

Publicized: 2018/09/20
Vol: E102-B No:3
Page(s): 452-463
Software-defined networking (SDN) technology enables us to flexibly configure switches in a network. Previously, distributed SDN control methods have been discussed to improve their scalability and robustness. Placing controllers in a distributed manner and having them back each other up enhances robustness. However, these techniques do not include an emergency measure against large-scale failures such as network separation induced by disasters. In this study, we first propose a network partitioning method to create a control plane (C-plane) that is robust against large-scale failures. In our approach, networks are partitioned into multiple sub-networks based on the robust topology coefficient (RTC). RTC denotes the probability that nodes in a sub-network become isolated from controllers when a large-scale failure occurs. By placing a local controller in each sub-network, 6%-10% more controller-switch connections are retained after a failure than with other approaches. Furthermore, we discuss reactive emergency reconstruction of a distributed SDN C-plane. Each node detects a disconnection from its controller. The C-plane is then reconstructed by the isolated switches and managed by a substitute controller. Our approach also reconstructs the C-plane when network connectivity recovers: the main and substitute controllers detect network restoration and merge their C-planes without conflict. Simulation results reveal that our proposed method recovers C-plane logical connectivity with a probability of approximately 90% when a failure occurs in 100-node networks. Furthermore, we demonstrate that the convergence time of our reconstruction mechanism is proportional to the network size.
Strip-Switched Deployment Method to Optimize Single Failure Recovery for Erasure Coded Storage Systems
Yingxun FU Shilin WEN Li MA Jianyong DUAN

LETTER-Computer System

Publicized: 2018/07/25
Vol: E101-D No:11
Page(s): 2818-2822
With the rapid growth in data scale and complexity, single disk failure recovery becomes very important for erasure coded storage systems. In this paper, we propose a new strip-switched deployment method (SSDM), which utilizes the feature that the strips of each stripe of erasure codes can be switched, and uses a simulated annealing algorithm to search for a proper strip deployment at the stack level that balances the read accesses, in order to improve the recovery performance. Analysis and experimental results show that SSDM can effectively improve the single failure recovery performance.
Future Nationwide Optical Network Architecture for Higher Availability and Operability Using Transport SDN Technologies Open Access
Yoshihiko UEMATSU Shohei KAMAMURA Hiroki DATE Hiroshi YAMAMOTO Aki FUKUDA Rie HAYASHI Katsutoshi KODA

POSITION PAPER-Transmission Systems and Transmission Equipment for Communications

Publicized: 2017/08/08
Vol: E101-B No:2
Page(s): 462-475
An optical transport network is composed of optical transport systems deployed in thousands of office buildings. As a common infrastructure that accommodates diversified communication services with drastic traffic growth, it must not only continuously convey the growing traffic but also achieve high end-to-end communication quality and availability and provide flexible controllability in cooperation with service layer networks. To achieve high-speed and large-capacity transport systems cost-effectively, system configuration, applied devices, and the manufacturing process have recently begun to change, and the causes of failure or performance degradation have become more complex and diversified. The drastic traffic growth and pattern change of service networks increase the frequency and scale of transport-capacity increases and transport-network reconfiguration in cooperation with service networks. Therefore, drastic traffic growth affects both optical-transport-system configuration and its operational cycles. In this paper, we give an overview of the operational problems emerging in current nationwide optical transport networks, and based on trend analysis for system configuration and network-control schemes, we propose a vision of the future nationwide optical-transport-network architecture expressed using five target features.
ResilientFlow: Deployments of Distributed Control Channel Maintenance Modules to Recover SDN from Unexpected Failures
Takuya OMIZO Takuma WATANABE Toyokazu AKIYAMA Katsuyoshi IIDA

PAPER

Vol: E99-B No:5
Page(s): 1041-1053
Although SDN provides desirable characteristics such as manageability, flexibility, and extensibility of networks, it has a considerable disadvantage in reliability due to its centralized architecture. To protect SDN-enabled networks under large-scale, unexpected link failures, we propose ResilientFlow, which deploys distributed modules called Control Channel Maintenance Modules (CCMMs) on every switch and controller. The CCMMs enable switches to maintain their own control channels, which are a core and fundamental part of SDN. In this paper, we design, implement, and evaluate ResilientFlow.
Failure Detection in P2P-Grid System
Huan WANG Hideroni NAKAZATO

PAPER-Grid System

Publicized: 2015/09/15
Vol: E98-D No:12
Page(s): 2123-2131
Peer-to-peer (P2P)-Grid systems are being investigated as a platform for converging the Grid and P2P networks in the construction of large-scale distributed applications. The highly dynamic nature of P2P-Grid systems greatly affects the execution of distributed programs. Uncertainty caused by arbitrary node failure and departure significantly affects the availability of computing resources and system performance. Checkpoint-and-restart is the most common scheme for fault tolerance because it periodically saves the execution progress onto stable storage. In this paper, we suggest a checkpoint-and-restart mechanism as a fault-tolerant method for applications on P2P-Grid systems. A failure detection mechanism is a necessary prerequisite to fault tolerance and fault recovery in general. Given the highly dynamic nature of nodes within P2P-Grid systems, any failure should be detected to ensure effective task execution. We therefore study the failure detection mechanism as an integral part of P2P-Grid systems. We discuss how the design of various failure detection algorithms affects their performance in terms of the average failure detection time of nodes. Numerical analysis results and an implementation evaluation are also provided to show the different average failure detection times in real systems for various failure detection algorithms. The comparison shows the shortest average failure detection time by 8.8 s on the basis of the WP failure detector. Our lowest mean time to recovery (MTTR) is also shown to have a distinct advantage, with a time consumption reduction of about 5.5 s over its counterparts.
Enhancing MPLS Protection Method with Adaptive Segment Repair
Chin-Ling CHEN

PAPER-Network

Vol: E92-B No:10
Page(s): 3126-3131
We propose a novel adaptive segment repair mechanism to improve traditional MPLS (Multi-Protocol Label Switching) failure recovery. The proposed mechanism protects one or more contiguous high failure probability links by dynamic setup of segment protection. Simulations demonstrate that the proposed mechanism reduces failure recovery time while also increasing network resource utilization.
Improving Ethernet Reliability and Stability Using Global Open Ethernet Technology
Masaki UMAYABASHI Youichi HIDAKA Nobuyuki ENOMOTO Daisaku OGASAHARA Kazuo TAKAGI Atsushi IWATA Akira ARUTAKI

PAPER

Vol: E89-B No:3
Page(s): 675-682
In this paper, the authors present new schemes of our proposed Global Open Ethernet (GOE) technology from the viewpoint of improving reliability in metro-area Ethernet environments and show numerical evidence of their performance. Although several standardized or vendor-proprietary technologies have been proposed to improve Ethernet reliability, they still have reliability problems in terms of long failure recovery time (due to forwarding database (FDB) flush and recovery from a root bridge failure in the spanning tree protocol), broadcast storms, and packet loss during network reconfiguration. To solve these problems, we introduce three schemes: Per Destination - Multiple Rapid Spanning Tree Protocol (PD-MRSTP), GOE Virtual Switch Redundancy Protocol (GVSRP), and In-Service Reconfiguration (ISR). The PD-MRSTP scheme reduces the failure recovery time by eliminating the need to flush the FDB and to recover from root bridge failures. The GVSRP scheme ensures the reliability of connections between a GOE domain and a legacy Ethernet domain. Combined with PD-MRSTP, GVSRP prevents broadcast storm problems due to loops in the inter-domain area. The ISR scheme enables in-service bridge replacement and upgrade without packet loss. Evaluating our prototype system, we obtained the following remarkable performance results. The GOE network using the PD-MRSTP scheme delivered fast failure recovery (4 ms) independent of the number of MAC address entries, whereas the legacy Ethernet network took 522 ms when a bridge had 6000 MAC address entries. Since we found that the failure recovery time increased in proportion to the number of MAC address entries, recovery in a large carrier network with one million MAC address entries would take several tens of seconds. Thus, using PD-MRSTP can reduce the failure recovery time to about one ten-thousandth of that of legacy Ethernet. In addition, evaluation of the ISR scheme demonstrated that a network can be upgraded with zero packet loss. Therefore, a GOE-based VPN is a promising alternative to other Ethernet VPNs for its reliability and stability.
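The extrapolation quoted in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures stated above (522 ms at 6000 entries, 4 ms for PD-MRSTP) and assumes strict proportionality through the origin, which is a simplification of "increased in proportion to"; the function name `legacy_recovery_ms` is illustrative.

```python
# Figures quoted in the abstract.
LEGACY_MS_AT_6000 = 522.0  # legacy Ethernet recovery time with 6000 MAC entries
GOE_MS = 4.0               # PD-MRSTP recovery time, independent of table size


def legacy_recovery_ms(mac_entries: int) -> float:
    """Extrapolate legacy recovery time, assuming strict proportionality."""
    return LEGACY_MS_AT_6000 * mac_entries / 6000


carrier_ms = legacy_recovery_ms(1_000_000)  # one million MAC address entries
ratio = GOE_MS / carrier_ms

print(f"legacy recovery at 1M entries: {carrier_ms / 1000:.0f} s")
print(f"GOE recovery is about 1/{1 / ratio:.0f} of that")
```

Under this assumption the legacy figure comes to 87 s, consistent with "several tens of seconds", and the ratio is of the order of one ten-thousandth.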
Restoring Delivery Tree from Node Failures in Overlay Multicast
Zongming FEI Mengkun YANG

PAPER-Network

Vol: E88-B No:5
Page(s): 2046-2053
One of the important problems in overlay multicast is how to deal with node failures and ungraceful departures. When a non-leaf end host fails or leaves the multicast session, all downstream nodes are affected. In this paper, we adopt the proactive approach, which pre-calculates a candidate node (called the parent-to-be) for each node to connect to in case its current parent dies. The goal is to recover the overlay multicast tree quickly so that the disruption of service to the affected nodes is minimized. We combine the local and global parent-to-be locating schemes in order to take advantage of the lower interference of the local scheme and the flexibility of the global scheme. The quality of the recovered tree is improved while the responsiveness of the proactive approach is maintained.
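The proactive idea (pick a parent-to-be before any failure, then reattach immediately when the parent dies) can be sketched as follows. The selection rule used here, taking the grandparent as the parent-to-be, is a placeholder assumption and is not the local or global locating scheme of the paper; the function names and the sample tree are also hypothetical.

```python
from typing import Dict, Optional

Tree = Dict[str, Optional[str]]  # node -> current parent (None for the root)


def precompute_parent_to_be(parent: Tree) -> Tree:
    """Choose a backup parent for every node before any failure happens.

    Placeholder rule (an assumption): use the grandparent. Nodes directly
    under the root get None, so root failure is not handled in this sketch.
    """
    return {n: (parent.get(p) if p is not None else None) for n, p in parent.items()}


def recover(parent: Tree, backup: Tree, failed: str) -> Tree:
    """Remove `failed` and reattach its children to their parent-to-be."""
    new_parent = {n: p for n, p in parent.items() if n != failed}
    for node, p in parent.items():
        if p == failed:
            new_parent[node] = backup[node]  # no search needed at failure time
    return new_parent


tree: Tree = {"root": None, "a": "root", "b": "a", "c": "a", "d": "b"}
backup = precompute_parent_to_be(tree)
after = recover(tree, backup, failed="a")
print(after)
```

Because the backup is computed ahead of time, recovery is a lookup rather than a search, which is the responsiveness benefit the abstract attributes to the proactive approach.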
Determining Consistent Global Checkpoints of a Distributed Computation
Dakshnamoorthy MANIVANNAN

PAPER-Computer Systems

Vol: E87-D No:1
Page(s): 164-174
Determining consistent global checkpoints of a distributed computation has applications in areas such as rollback recovery, distributed debugging, output commit, and others. Netzer and Xu introduced the notion of zigzag paths and presented necessary and sufficient conditions for a set of checkpoints to be part of a consistent global checkpoint. This result also reveals that determining the existence of zigzag paths between checkpoints is crucial for determining consistent global checkpoints. Recent research also reveals that determining zigzag paths on-line is not possible. In this paper, we present an off-line method for determining the existence of zigzag paths between checkpoints.
Efficient Techniques for Adaptive Independent Checkpointing in Distributed Systems
Cheng-Min LIN Chyi-Ren DOW

PAPER-Fault Tolerance

Vol: E83-D No:8
Page(s): 1642-1653
This work presents two novel algorithms to prevent rollback propagation for independent checkpointing: an efficient adaptive independent checkpointing algorithm and an optimized adaptive independent checkpointing algorithm. The last opportunity strategy, which yields better performance than the conservation strategy, is also employed to prevent useless checkpoints for both causal rewinding paths and non-causal rewinding paths. The two methods proposed herein are domino effect-free and require only a limited amount of control information. They also take fewer unnecessary adaptive checkpoints than other algorithms. Furthermore, experimental results indicate that the checkpoint overhead of our techniques is lower than that of the coordinated checkpointing and domino effect-free algorithms for service-providing applications.
