Kengo TAJIRI Ryoichi KAWAHARA Yoichi MATSUO
Machine learning (ML) has been used for various network operation tasks in recent years. However, as networks have grown in scale and the amount of generated data has increased, it has become increasingly difficult for network operators to carry out ML-based tasks on a single server. ML with edge-cloud cooperation has therefore been attracting attention as a way to process and analyze large amounts of data efficiently. In this setting, transmission latency, bandwidth congestion, and the accuracy of ML tasks all depend on how data processing is balanced between the edge servers and the cloud server, but the relationship is too complex to estimate. In this paper, we focus on monitoring anomalous traffic as an example of an ML task for network operations and formulate transmission latency, bandwidth congestion, and task accuracy under edge-cloud cooperation in terms of the ratio of the amount of data preprocessed in the edge servers to that preprocessed in the cloud server. Using this formulation, we also formulate an optimization problem, with constraints on transmission latency and bandwidth congestion, for selecting the proper ratio. Solving this problem yields the optimal load balance between the edge servers and the cloud server, together with an estimate of the accuracy of anomalous traffic monitoring. Our formulation and optimization framework can be applied to other ML tasks by considering the generating distribution of the data and the type of ML model. In accordance with our formulation, we simulated the optimal load balance of edge-cloud cooperation in a topology that mimicked a Japanese network, and we conducted an anomalous traffic detection experiment using real traffic data to compare the accuracy estimated by our formulation with the actual accuracy obtained in the experiment.
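The kind of constrained selection described in this abstract can be illustrated with a toy grid search over the edge preprocessing ratio. This is only a minimal sketch: the functions `latency`, `bandwidth`, and `accuracy`, their linear and quadratic shapes, and all numeric constants are hypothetical placeholders invented for illustration, not the formulation from the paper.

```python
# Toy sketch: pick the fraction r of data preprocessed at the edge (0 <= r <= 1)
# that maximizes accuracy subject to latency and bandwidth limits.
# Every model below is an invented placeholder, NOT the paper's formulation.

def latency(r):
    """Hypothetical transmission latency in ms; assumed to fall as r grows."""
    return 100.0 - 60.0 * r

def bandwidth(r):
    """Hypothetical upstream bandwidth use in Gbps; assumed to fall as r grows."""
    return 10.0 - 8.0 * r

def accuracy(r):
    """Hypothetical task accuracy; assumed to fall as more data is reduced at the edge."""
    return 0.95 - 0.10 * r * r

def best_ratio(max_latency, max_bandwidth, steps=100):
    """Return the feasible ratio with the highest accuracy, or None if none is feasible."""
    best = None
    for i in range(steps + 1):
        r = i / steps
        feasible = latency(r) <= max_latency and bandwidth(r) <= max_bandwidth
        if feasible and (best is None or accuracy(r) > accuracy(best)):
            best = r
    return best

if __name__ == "__main__":
    r = best_ratio(max_latency=70.0, max_bandwidth=6.0)
    print("selected ratio:", r)
    if r is not None:
        print("estimated accuracy:", accuracy(r))
```

A grid search is used here only because the toy problem has a single bounded variable; a real formulation with per-server ratios would likely need a proper solver.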
Yoichi MATSUO Tatsuaki KIMURA Ken NISHIMATSU
When a failure occurs in a network element, such as a switch, router, or server, network operators need to recognize its service impact, such as the time to recovery or the severity of the failure, since this is essential information for handling failures. In this paper, we propose a Deep learning based Service Impact Prediction system (DeepSIP), which predicts the service impact of a failure in a network element using a temporal multimodal convolutional neural network (CNN). More precisely, DeepSIP predicts the time to recovery from the failure and the loss of traffic volume due to the failure on the basis of information from syslog messages and traffic volume. Since the time to recovery is useful information for a service level agreement (SLA) and the loss of traffic volume is directly related to the severity of the failure, we regard these two quantities as the service impact. The service impact is challenging to predict because it depends on the type of network failure and on the traffic volume at the time the failure occurs. Moreover, network elements do not explicitly provide any information about the service impact. To extract the type of network failure and predict the service impact, we use syslog messages and past traffic volume. However, syslog messages and traffic volume are themselves challenging to analyze because these data are multimodal, strongly correlated, and temporally dependent. To extract useful features for prediction, we develop a temporal multimodal CNN. We experimentally evaluated the accuracy of DeepSIP by comparing it with other NN-based methods on synthetic and real datasets. For both datasets, the results show that DeepSIP outperformed the baselines.
Akio WATANABE Keisuke ISHIBASHI Tsuyoshi TOYONO Keishiro WATANABE Tatsuaki KIMURA Yoichi MATSUO Kohei SHIOMOTO Ryoichi KAWAHARA
In current large-scale IT systems, troubleshooting has become more complicated due to the diversification of the causes of failures, which has increased operational costs. Clarifying the troubleshooting process has therefore become important, but doing so is time-consuming. We propose a method of automatically extracting a workflow, a graph indicating a troubleshooting process, from multiple trouble tickets. Our method extracts an operator's actions from free-format texts and aligns related sentences across multiple trouble tickets. It uses a stochastic model to detect a resolution, a frequent action pattern that helps us understand how to solve a problem. We validated our method using real trouble-ticket data captured from a real network operation and showed that it can extract a workflow to identify the cause of a failure.
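The notion of a frequent action pattern shared by several tickets can be illustrated with a much simpler stand-in than the stochastic model the abstract mentions: counting contiguous action sequences that recur across tickets. In this sketch the function `frequent_action_patterns`, the action labels, and the thresholds are all hypothetical, and the tickets are assumed to have already been reduced to ordered lists of action names.

```python
from collections import Counter

def frequent_action_patterns(tickets, n=2, min_support=2):
    """Return (pattern, ticket_count) pairs for contiguous action n-grams
    that appear in at least min_support tickets.

    This is plain frequency counting, used as a simple stand-in; it is
    not the stochastic model described in the abstract.
    """
    counts = Counter()
    for actions in tickets:
        # Collect each n-gram once per ticket so repeats inside one ticket
        # do not inflate the support.
        seen = {tuple(actions[i:i + n]) for i in range(len(actions) - n + 1)}
        counts.update(seen)
    return [(p, c) for p, c in counts.most_common() if c >= min_support]

if __name__ == "__main__":
    # Hypothetical, hand-written action sequences for illustration only.
    tickets = [
        ["check_alarm", "ping_router", "restart_interface"],
        ["check_alarm", "show_log", "ping_router", "restart_interface"],
        ["check_alarm", "replace_card"],
    ]
    for pattern, support in frequent_action_patterns(tickets):
        print(" -> ".join(pattern), "in", support, "tickets")
```

Counting support per ticket rather than per occurrence keeps one long, repetitive ticket from dominating the result.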