Atsushi KOBAYASHI Shingo KASHIMA Hiroshi KURAKAMI Keisuke ISHIBASHI
An anomalous change in traffic distribution caused by an external inter-domain routing change can congest some network links, which then degrades network quality or disrupts traffic. Network operators therefore need to deal with these problems promptly by changing the routing policy or by soliciting the help of an involved or neighboring network operator through operator channels. In addition, they need to diagnose which customers are affected by the incident or which destinations have become unreachable. Although this task is indispensable, understanding the situation is difficult since the cause lies outside the operators' own network domains. To alleviate the load on operators, we developed a system for monitoring traffic shifts and the disruptions caused by BGP routing changes. Extracting valid information from the large volume of BGP update messages and traffic flow records is challenging. By correlating these data, the system provides meaningful reports and visualized traffic statistics, enabling operators to easily identify the cause of traffic changes and to investigate the extent of damage. We demonstrate the effectiveness of the system and evaluate its feasibility by applying it to an ISP backbone network. In addition, we present a case study of traffic changes that the system detected.
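The core of the system is the correlation of flow records with BGP routing information. As a rough illustration only (the system's actual data model and field names are not given in the abstract), the sketch below matches flow records to announced prefixes by longest-prefix match and compares per-origin-AS traffic volumes before and after a routing change; all identifiers and data are hypothetical.

```python
# Minimal sketch: correlate flow records with BGP routing data by longest-prefix
# match and compare per-origin-AS traffic volumes across two routing states.
import ipaddress
from collections import defaultdict

def longest_prefix_match(ip, rib):
    """Return (network, origin_as) for the longest announced prefix covering ip."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, origin_as in rib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, origin_as)
    return best

def volume_per_origin(flows, rib):
    """Aggregate flow bytes per origin AS of the matched prefix."""
    vol = defaultdict(int)
    for f in flows:                       # f = {"dst": "192.0.2.1", "bytes": 1500}
        match = longest_prefix_match(f["dst"], rib)
        if match:
            vol[match[1]] += f["bytes"]
    return dict(vol)

# Usage: compare the traffic distribution before and after a BGP update burst.
rib_before = {"192.0.2.0/24": 64501, "198.51.100.0/24": 64502}
rib_after = {"192.0.2.0/25": 64999, "192.0.2.0/24": 64501, "198.51.100.0/24": 64502}
flows = [{"dst": "192.0.2.10", "bytes": 1200}, {"dst": "198.51.100.9", "bytes": 800}]
print(volume_per_origin(flows, rib_before), volume_per_origin(flows, rib_after))
```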
Ryoichi KAWAHARA Keisuke ISHIBASHI Takuya ASAKA Shuichi SUMITA Takeo ABE
We develop a method of dimensioning and managing the bandwidth of a link on which TCP flows from access links are aggregated. To do this, we extend the application of the processor-sharing queue model to TCP performance evaluation by using flow statistics. To handle various factors that affect actual TCP behavior, such as round-trip time, window size, and restrictions other than access-link bandwidth, we extend the model by replacing the access-link bandwidth with the actual file-transfer speed of a flow observed when the aggregation link is not congested. We use only the number of active flows and the link utilization to estimate this file-transfer speed. Unlike previous studies, the extended model based on the actual transfer speed does not require any assumptions or predetermined values for file size, packet size, round-trip time, and so on. Using the extended model, we predict the TCP performance when the link utilization increases. We also show a method of dimensioning the bandwidth needed to maintain TCP performance. We show the effectiveness of our method through simulation analysis.
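As a minimal sketch of the kind of calculation involved, the following assumes the classic M/G/R processor-sharing delay factor (Lindberger's formula with the Erlang-C function); the abstract's extension replaces the nominal access rate with a per-flow transfer speed estimated from the number of active flows and link utilization. The numerical values are hypothetical, and the exact formula used in the paper may differ.

```python
# Sketch of M/G/R-PS transfer-time prediction with an estimated per-flow rate.
from math import factorial

def erlang_c(R, A):
    """Probability of waiting in an M/M/R system offered A Erlangs (requires A < R)."""
    top = (A ** R / factorial(R)) * (R / (R - A))
    bottom = sum(A ** k / factorial(k) for k in range(R)) + top
    return top / bottom

def mgr_ps_mean_transfer_time(file_size, per_flow_rate, link_capacity, utilization):
    """Mean transfer time under M/G/R-PS, with R = capacity / per-flow rate (utilization < 1)."""
    R = max(1, int(link_capacity / per_flow_rate))
    A = R * utilization
    delay_factor = 1.0 + erlang_c(R, A) / (R * (1.0 - utilization))
    return (file_size / per_flow_rate) * delay_factor

# Per-flow rate estimated as in the abstract: carried traffic / number of active flows.
capacity_bps, utilization_now, active_flows = 100e6, 0.6, 30
est_rate = utilization_now * capacity_bps / active_flows
# Predict the mean transfer time of a 1-MB file if utilization rises to 0.9.
print(mgr_ps_mean_transfer_time(8e6, est_rate, capacity_bps, 0.9))
```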
Keisuke ISHIBASHI Ryoichi KAWAHARA Takuya ASAKA Masaki AIDA Satoshi ONO Shoichiro ASANO
In this paper, we propose a method of detecting TCP performance degradation using only bottleneck-link utilization statistics: the mean and variance. The variance of link utilization normally increases as the mean link utilization increases. However, because link utilization cannot exceed 100%, as the mean approaches 100% the possible range of fluctuation narrows and the variance decreases toward zero. Using the M/G/R processor-sharing model, we relate this phenomenon to the behavior of flows. We also show that, by using this relationship, we can detect TCP performance degradation from the mean and variance of link utilization. In particular, this method enables a network operator to determine whether or not the degradation originates from congestion in his/her own network. Because our method requires us to measure only link utilization, the cost of performance management can be greatly decreased compared with the conventional method, which requires dedicated functions for directly measuring TCP performance.
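Purely as an illustration of the monitoring idea (not the paper's M/G/R-PS-based criterion), the sketch below tracks the mean and variance of link utilization over sliding windows and flags the regime in which the mean keeps rising while the variance shrinks; the window size and thresholds are hypothetical.

```python
# Illustrative monitor: high and rising mean utilization with shrinking variance.
from statistics import mean, pvariance

def degradation_suspected(samples, window=60, high_mean=0.9, var_drop=0.5):
    """samples: utilization values in [0, 1], ordered in time."""
    if len(samples) < 2 * window:
        return False
    prev, cur = samples[-2 * window:-window], samples[-window:]
    rising = mean(cur) > mean(prev) and mean(cur) > high_mean
    shrinking = pvariance(cur) < var_drop * pvariance(prev)
    return rising and shrinking

history = [0.6 + 0.05 * ((i % 5) - 2) for i in range(60)] \
        + [0.97 + 0.001 * (i % 3) for i in range(60)]
print(degradation_suspected(history))
```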
Masaki AIDA Keisuke ISHIBASHI Hiroyoshi MIWA Chisa TAKANO Shin-ichi KURIBAYASHI
The number of customers of a service for Internet access from cellular phones in Japan has been explosively increasing for some time. We analyze the relation between the number of customers and the volume of traffic, with a view to finding clues to the structure of human relations among the very large set of potential customers of the service. The traffic data reveals that this structure is a scale-free network, and we calculate the exponent that governs the distribution of node degree in this network. The data also indicates that people who have many friends tend to subscribe to the service at an earlier stage. These results are useful for investigating various fields, including marketing strategies, the propagation of rumors, the spread of computer viruses, and so on.
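A minimal sketch of how such an exponent can be estimated: fit the complementary CDF of node degrees on a log-log scale by least squares. The degree list below is a toy stand-in, not the paper's traffic data, and the paper's estimation procedure may differ.

```python
# Sketch: estimate a power-law (scale-free) exponent from a node-degree sample.
import math
from collections import Counter

def ccdf_slope(degrees):
    """Slope of log P(D >= d) versus log d; if p(d) ~ d^-gamma, then gamma ~ 1 - slope."""
    n = len(degrees)
    counts = Counter(degrees)
    xs, ys, seen = [], [], 0
    for d in sorted(counts):
        xs.append(math.log(d))
        ys.append(math.log((n - seen) / n))   # P(D >= d)
        seen += counts[d]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

degrees = [1] * 500 + [2] * 120 + [4] * 30 + [8] * 8 + [16] * 2
slope = ccdf_slope(degrees)
print(slope, 1 - slope)                       # slope and the implied degree exponent
```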
Tatsuaki KIMURA Akio WATANABE Tsuyoshi TOYONO Keisuke ISHIBASHI
Recent carrier-grade networks use many network elements (switches, routers) and servers for various network-based services (e.g., video on demand, online gaming) that demand higher quality and better reliability. Network log data generated from these elements, such as router syslogs, are rich sources for quickly detecting the signs of critical failures and thus maintaining service quality. However, log data consist of a huge number of text messages written in unstructured formats and covering various types of network events (e.g., operator logins, link downs); thus, log messages that are genuinely important for network operation are difficult to find automatically. We propose a proactive failure-detection system for large-scale networks. It automatically finds abnormal patterns of log messages in a massive amount of data without requiring prior knowledge of the data formats used and can detect the signs of critical failures before they occur. To handle unstructured log messages, the system has an online log-template-extraction part that automatically extracts the format of a log message. After template extraction, the system associates critical failures with the log data that appeared before them on the basis of supervised machine learning. By associating each log message with a log template, we can characterize the generation patterns of log messages, such as burstiness, rather than relying only on the keywords in log messages (e.g., ERROR, FAIL). We used real log data collected from a large production network to validate our system and evaluated its ability to detect signs of actual failures of network equipment through a case study.
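As a rough sketch of what log-template extraction does (not the paper's online algorithm), the following masks tokens that look variable (numbers, IP addresses, hexadecimal identifiers) and groups messages by the resulting template; the regular expressions and log lines are illustrative.

```python
# Sketch: reduce unstructured log messages to templates by masking variable tokens.
import re
from collections import Counter

VARIABLE = re.compile(r"""
    \b\d{1,3}(\.\d{1,3}){3}\b   # IPv4 address
  | \b0x[0-9a-fA-F]+\b          # hexadecimal identifier
  | \b\d+\b                     # plain number
""", re.VERBOSE)

def to_template(message):
    return VARIABLE.sub("<*>", message)

logs = [
    "Interface ge-0/0/1 down, peer 192.0.2.1",
    "Interface ge-0/0/2 down, peer 192.0.2.9",
    "User admin logged in from 198.51.100.7",
]
for template, count in Counter(to_template(m) for m in logs).items():
    print(count, template)
```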
Keisuke ISHIBASHI Tatsuya MORI Ryoichi KAWAHARA Yutaka HIROKAWA Atsushi KOBAYASHI Kimihiro YAMAMOTO Hitoaki SAKAMOTO Shoichiro ASANO
We propose an algorithm for finding heavy hitters in terms of cardinality (the number of distinct items in a set) in massive traffic data using a small amount of memory. Examples of such cardinality heavy hitters are hosts that send large numbers of flows, or hosts that communicate with large numbers of other hosts. Finding these hosts is crucial to the provision of good communication quality because they significantly affect the communications of other hosts via either malicious activities, such as worm scans, spam distribution, or botnet control, or normal activities, such as being a member of a flash crowd or performing peer-to-peer (P2P) communication. To precisely determine the cardinality of a host, we need a table of previously seen items for each host (e.g., a flow table for every host), which may be infeasible in a high-speed environment with a massive amount of traffic. In this paper, we use a cardinality estimation algorithm that does not require these tables but needs only a small amount of information called a cardinality summary. This is made possible by relaxing the goal from exact counting to estimation of cardinality. In addition, we propose an algorithm that does not need to maintain the cardinality summary for each host, but only for partitioned addresses of a host. As a result, the required number of tables can be significantly decreased. We evaluated our algorithm using actual backbone traffic data to find the heavy hitters in the number of flows and to estimate the number of these flows. We found that, while the accuracy degraded when estimating for hosts with few flows, the algorithm could accurately find the top-100 hosts in terms of the number of flows using a limited amount of memory. In addition, we found that the number of tables required to achieve a pre-defined accuracy increased logarithmically with respect to the total number of hosts, which indicates that our method is applicable to large traffic data with a very large number of hosts. We also introduce an application of our algorithm to anomaly detection. With actual traffic data, our method could successfully detect a sudden network scan.
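One standard way to realize a cardinality summary is a K-minimum-values (KMV) sketch; the paper's summary and its address-partitioning scheme may differ, so the following is only a minimal per-host illustration with hypothetical parameters.

```python
# Sketch: per-host cardinality summary (KMV) for finding cardinality heavy hitters.
import hashlib
from collections import defaultdict

K = 64  # summary size per host

def h(item):
    """Hash an item to a float in [0, 1)."""
    digest = hashlib.sha1(item.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class KMV:
    def __init__(self):
        self.mins = []                    # K smallest hash values seen, kept sorted
    def add(self, item):
        v = h(item)
        if v in self.mins:
            return                        # duplicate item, already counted
        if len(self.mins) < K:
            self.mins.append(v)
            self.mins.sort()
        elif v < self.mins[-1]:
            self.mins[-1] = v
            self.mins.sort()
    def estimate(self):
        if len(self.mins) < K:
            return len(self.mins)         # exact while the summary is not full
        return (K - 1) / self.mins[-1]    # standard KMV cardinality estimator

summaries = defaultdict(KMV)
for src, dst in [("10.0.0.1", f"203.0.113.{i % 200}") for i in range(10000)]:
    summaries[src].add(dst)
top = sorted(summaries, key=lambda s: summaries[s].estimate(), reverse=True)
print(top[0], round(summaries[top[0]].estimate()))
```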
Atsushi KOBAYASHI Keisuke ISHIBASHI
We present the development of a VoIP quality of service (QoS) measurement system that enables operators to identify the segment in which QoS degradation occurs. Our system uses a flow-based passive measurement method to fulfill the requirements for QoS measurement in large-scale IP networks. In particular, we adopt an access control list (ACL)-based filtering function that selects the traffic to monitor, and we develop a function for correlating signaling and media data records. This correlation function is required to dynamically configure ACL-based filtering for monitoring media streams whose port numbers are determined by a signaling protocol. To improve the scalability of existing measurement systems, we also develop a hardware-based filtering engine on a commercial switch, as well as a mediation box that performs QoS calculation based on traffic records exported by the engine in a distributed manner. We demonstrate the feasibility of the measurement system by evaluating a prototype system.
Ryoichi KAWAHARA Keisuke ISHIBASHI Takuya ASAKA Katsunori ORI
We propose a method of IP traffic management where the TCP performance at a bottleneck link is estimated from monitored data about the behavior of the number of active flows versus link utilization, which are both easy to measure. This method is based on our findings that (i) TCP performance remains constant as long as the link utilization is below some threshold value, but becomes degraded when it exceeds this value and (ii) the number of active flows increases linearly with link utilization up to the same value, and the increase becomes nonlinear above it. Though this threshold may vary depending on traffic/network conditions, our method requires neither predetermination of a threshold on the basis of assumed traffic conditions nor direct measurement of TCP performance.
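As an illustrative sketch of finding the utilization at which the flow-count-versus-utilization relation stops being linear (the abstract's finding (ii)), the following fits a line through the origin on low-utilization samples and reports where observations deviate from it; thresholds and data are hypothetical.

```python
# Sketch: locate the utilization where active-flow counts leave the linear regime.
def find_knee(samples, fit_below=0.5, tolerance=1.5):
    """samples: list of (utilization, active_flows) pairs."""
    low = [(u, n) for u, n in samples if 0 < u <= fit_below]
    slope = sum(n for _, n in low) / sum(u for u, _ in low)   # flows per unit utilization (linear region)
    for u, n in sorted(samples):
        if u > fit_below and n > tolerance * slope * u:       # deviation from the linear relation
            return u
    return None

samples = [(u / 100, u) for u in range(1, 80)] \
        + [(0.80 + i / 100, 200 + 80 * i) for i in range(15)]
print(find_knee(samples))
```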
Keisuke ISHIBASHI Kazumichi SATO
We introduce the notion of hierarchical aggregate entropy and apply it to identify DNS client hosts that wastefully consume server resources. The entropy of DNS query traffic can capture client query patterns, e.g., the concentration of queries on a specific domain or their dispersion over a large domain name space. However, entropy alone cannot capture the spatial structure of the traffic. That is, even if queries are dispersed over various domains that concentrate under the same upper domain, entropy computed over domain names provides no information on the upper-domain structure, which is an important characteristic of DNS traffic. On the other hand, entropies of aggregated upper domains lack detailed information on individual domains. To overcome this difficulty, we introduce the notion of hierarchical aggregate entropy, in which queries are recursively aggregated into upper domains along the DNS domain tree and their entropies are calculated. This method enables us to analyze the spatial characteristics of DNS traffic in a multi-resolution manner. We calculated the hierarchical aggregate entropies of actual DNS heavy-hitters and observed that the entropies of normal heavy-hitters were concentrated in a specific range. On the basis of this observation, we adopt the support vector machine method to identify this range and to classify DNS heavy-hitters as anomalous or normal. We show that hierarchical aggregate entropy can halve the classification error compared with non-hierarchical entropies.
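A minimal sketch of hierarchical aggregate entropy itself: aggregate a client's queried names up the DNS tree and compute the entropy of the query distribution at each aggregation level. The toy query list is illustrative, and the SVM classification step is omitted.

```python
# Sketch: entropy of a client's DNS queries at each domain-tree aggregation level.
import math
from collections import Counter

def aggregate(name, level):
    """Keep only the last `level` labels of a domain name (level=1 -> TLD)."""
    labels = name.rstrip(".").split(".")
    return ".".join(labels[-level:])

def entropy(counter):
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())

def hierarchical_entropies(queries, max_level=4):
    return [entropy(Counter(aggregate(q, lvl) for q in queries))
            for lvl in range(1, max_level + 1)]

queries = ["a.mail.example.com", "b.mail.example.com", "c.mail.example.com", "www.example.org"]
# Dispersion at deep levels, concentration at upper levels.
print(hierarchical_entropies(queries))
```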
Yousuke TAKAHASHI Keisuke ISHIBASHI Masayuki TSUJINO Noriaki KAMIYAMA Kohei SHIOMOTO Tatsuya OTOSHI Yuichi OHSITA Masayuki MURATA
To efficiently use network resources, Internet service providers need to conduct traffic engineering that dynamically controls traffic routes to accommodate traffic changes with limited network resources. The performance of traffic engineering (TE) depends on the accuracy of traffic prediction. However, the magnitude of traffic changes has been increasing drastically in recent years due to the growth of various types of network services, which has made traffic prediction difficult. Our approach to tackling this issue is to separate traffic into predictable and unpredictable parts and to apply different control policies to them. However, there are two challenges to achieving this: dynamically separating traffic according to predictability and dynamically controlling routes for each separated traffic part. In this paper, we propose a macroflow-based TE scheme that uses different routing policies in accordance with traffic predictability. We also propose a traffic-separation algorithm based on real-time traffic analysis and a framework for controlling separated traffic with software-defined networking technology, particularly OpenFlow. An evaluation of actual traffic measured in an Internet2 network shows that, compared with current TE schemes, the proposed scheme can reduce the maximum link load by 34% (at the most congested time) and the average link load by an average of 11%.
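As a rough illustration of separating traffic by predictability (not the paper's separation algorithm), the sketch below labels a traffic time series as predictable when a simple seasonal-naive prediction has a small relative error; the period, threshold, and series are hypothetical.

```python
# Sketch: classify a traffic time series as predictable or not by prediction error.
import random

def predictable(series, period=288, threshold=0.2):
    """series: per-interval traffic volumes; period: intervals per day (e.g., 5-min bins)."""
    if len(series) < 2 * period:
        return False
    errors = [abs(series[t] - series[t - period]) / max(series[t], 1e-9)
              for t in range(period, len(series))]
    return sum(errors) / len(errors) < threshold

daily = [100 + 50 * ((t % 288) > 144) for t in range(288 * 3)]   # repeating daily shape
random.seed(0)
bursty = [random.randint(50, 500) for _ in range(288 * 3)]       # no repeating structure
print(predictable(daily), predictable(bursty))
```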
Tatsuaki KIMURA Keisuke ISHIBASHI Tatsuya MORI Hiroshi SAWADA Tsuyoshi TOYONO Ken NISHIMATSU Akio WATANABE Akihiro SHIMODA Kohei SHIOMOTO
Network equipment, such as routers, switches, and RADIUS servers, generates various log messages induced by network events such as hardware failures and protocol flaps. In large production networks, analyzing the log messages is crucial for diagnosing network anomalies; however, this has become challenging for two reasons. First, the log messages are composed of unstructured text generated in accordance with vendor-specific rules. Second, the network events that induce the log messages span several geographical locations, network layers, protocols, and services. We developed a method to tackle these obstacles consisting of two techniques: statistical template extraction (STE) and log tensor factorization (LTF). The former leverages a statistical clustering technique to automatically extract primary templates from unstructured log messages. The latter builds a statistical model that captures spatial-temporal patterns of log messages. Such spatial-temporal patterns provide useful insights for understanding the impact and patterns of hidden network events. We evaluate our techniques using a massive amount of network log messages collected from a large operating network and confirm that our model fits the data well. We also investigate several case studies that validate the usefulness of our method.
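The paper factorizes a template-location-time tensor; as a simplified stand-in, the sketch below applies nonnegative matrix factorization with standard multiplicative updates to a template-by-time count matrix, which conveys the same idea of extracting latent event patterns. The toy matrix and rank are hypothetical.

```python
# Simplified analogue of log tensor factorization: NMF of a template-by-time matrix.
import numpy as np

def nmf(V, rank=2, iters=200, eps=1e-9):
    """Factorize a nonnegative matrix V into W @ H with nonnegative factors."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy count matrix: two groups of templates that burst at different times.
V = np.array([[9, 8, 0, 0], [10, 7, 1, 0], [0, 0, 6, 7], [0, 1, 5, 8]], dtype=float)
W, H = nmf(V)
print(np.round(W, 2))   # rows reveal which latent event each template belongs to
```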
Keisuke ISHIBASHI Mika ISHIZUKA Masaki AIDA Shin-ichi KURIBAYASHI
This paper discusses the capacity dimensioning of virtual private network (VPN) access links for elastic traffic, such as Web or FTP transfers. Assuming that the core VPN network is provisioned with sufficiently large capacity, managing the capacity of a VPN access link reduces to sharing bandwidth for elastic traffic between two potential bottlenecks: the ingress and egress access links. For a single bottleneck with limited access-link capacity, the processor-sharing model provides a simple formula for the mean transfer time, but the resulting value may underestimate the actual transfer time because multiple flows may compete for the bandwidth of both the ingress and egress links. In contrast, max-min fair sharing accurately models bandwidth sharing similar to that of TCP, but it is difficult to obtain a closed form for its performance statistics. We propose a closed-form approximation of the max-min fair sharing model for a specific but realistic topology, based on an investigation of the difference between the max-min and processor-sharing models. Using this approximation, we show how to dimension the capacity of VPN access links.
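A minimal sketch of the max-min fair sharing model being approximated: progressive filling over the ingress and egress access links, in which all flows grow at the same rate and freeze when a link they traverse saturates. Link names and capacities are hypothetical.

```python
# Sketch: max-min fair rates by progressive filling over shared access links.
def max_min_rates(flows, capacity):
    """flows: list of sets of link ids each flow traverses; capacity: {link: capacity}."""
    rates = [0.0] * len(flows)
    active = set(range(len(flows)))
    remaining = dict(capacity)
    while active:
        loaded = {l: [i for i in active if l in flows[i]] for l in remaining}
        loaded = {l: f for l, f in loaded.items() if f}
        if not loaded:
            break
        inc = min(remaining[l] / len(f) for l, f in loaded.items())  # grow all active flows equally
        for i in active:
            rates[i] += inc
        for l, f in loaded.items():
            remaining[l] -= inc * len(f)
            if remaining[l] < 1e-9:                                  # link saturated: freeze its flows
                active -= set(f)
    return rates

# Two flows share ingress link A; flow 0 is limited by its egress link B.
print(max_min_rates([{"A", "B"}, {"A", "C"}], {"A": 10.0, "B": 4.0, "C": 100.0}))
```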
Kazumichi SATO Keisuke ISHIBASHI Tsuyoshi TOYONO Haruhisa HASEGAWA Hideaki YOSHINO
Botnet threats, such as server attacks and the sending of spam e-mail, have been increasing. Therefore, infected hosts must be found and their malicious activities mitigated. An effective method for finding infected hosts is to use a blacklist of domain names. When a bot receives attack commands from a Command and Control (C&C) server, it attempts to resolve the domain names of C&C servers. We can thus detect infected hosts by finding hosts that send queries for black domain names. However, we cannot find all infected hosts because of the inaccuracy of blacklists. There are many black domain names, and their lifetimes are short; therefore, a blacklist cannot cover all black domain names. We thus present a method for finding unknown black domain names by using DNS query data and an existing blacklist of known black domain names. To achieve this, we focus on the DNS queries sent by infected hosts. One bot sends queries for several black domain names because of C&C server redundancy. We use the co-occurrence relation between two different domain names to find unknown black domain names and extend the blacklist. If a domain name frequently co-occurs with a known black domain name, we assume that the domain name is also black. A cross-validation evaluation of the proposed method showed that 91.2% of the domain names on the validation list scored in the top 1%.
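As a minimal sketch of the co-occurrence idea, the following scores each unknown domain by how often it is queried by the same host within the same time window as a known black domain; the window length, field layout, and example data are hypothetical.

```python
# Sketch: score unknown domains by co-occurrence with known black domains.
from collections import defaultdict

def cooccurrence_scores(queries, blacklist, window=300):
    """queries: list of (timestamp, host, domain); returns domain -> co-occurrence score."""
    per_host = defaultdict(list)
    for ts, host, dom in queries:
        per_host[host].append((ts, dom))
    hits, total = defaultdict(int), defaultdict(int)
    for host, qs in per_host.items():
        qs.sort()
        for ts, dom in qs:
            if dom in blacklist:
                continue
            total[dom] += 1
            if any(d in blacklist and abs(t - ts) <= window for t, d in qs):
                hits[dom] += 1
    return {dom: hits[dom] / total[dom] for dom in total}

blacklist = {"bad-c2.example"}
queries = [(0, "h1", "bad-c2.example"), (10, "h1", "backup-c2.example"),
           (5000, "h2", "news.example"), (5050, "h2", "cdn.example")]
print(cooccurrence_scores(queries, blacklist))
```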
Ryoichi KAWAHARA Tatsuya MORI Keisuke ISHIBASHI Noriaki KAMIYAMA Hideaki YOSHINO
Managing performance at the flow level through traffic measurement is crucial for effective network management. With the rapid rise in link speeds, collecting all packets has become difficult, so packet sampling has been attracting attention as a scalable means of measuring flow statistics. In this paper, we first propose a method of estimating the TCP flow rates of sampled flows from packet sampling, and then develop a method of detecting performance degradation at the TCP flow level from the estimated flow rates. In the rate-estimation method, we use the sequence numbers of sampled packets, which makes it possible to markedly improve the accuracy of estimating the rates of sampled flows. Using both an analytical model and measurement data, we show that this method gives accurate estimates. We also show that, by observing the estimated rates of sampled flows, we can detect TCP performance degradation. The detection method is based on the following two findings: (i) sampled flows tend to have high flow rates, and (ii) when a link becomes congested, the performance of high-rate flows degrades first. These characteristics indicate that sampled flows are sensitive to congestion, so we can detect the performance degradation of congestion-sensitive flows by observing the rates of sampled flows. We also show the effectiveness of our method using measurement data.
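A minimal sketch of the sequence-number idea: even with only a few sampled packets from a TCP flow, the bytes carried between the first and last sampled packets can be read from their sequence numbers, so the rate is roughly the sequence-number span divided by the time span. The sample data are hypothetical, and the paper's estimator may include further corrections.

```python
# Sketch: estimate a TCP flow's rate from the sequence numbers of sampled packets.
def estimate_rate(sampled_packets):
    """sampled_packets: list of (timestamp, tcp_sequence_number) for one flow."""
    if len(sampled_packets) < 2:
        return None                              # not enough samples to estimate
    pkts = sorted(sampled_packets)
    (t0, s0), (t1, s1) = pkts[0], pkts[-1]
    if t1 == t0:
        return None
    span = (s1 - s0) % 2**32                     # allow for sequence-number wrap
    return span / (t1 - t0)                      # bytes per second

samples = [(0.00, 1_000), (0.40, 601_000), (1.00, 1_501_000)]
print(estimate_rate(samples))                    # ~1.5 MB/s
```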
Akio WATANABE Keisuke ISHIBASHI Tsuyoshi TOYONO Keishiro WATANABE Tatsuaki KIMURA Yoichi MATSUO Kohei SHIOMOTO Ryoichi KAWAHARA
In current large-scale IT systems, troubleshooting has become more complicated due to the diversification of failure causes, which has increased operational costs. Thus, clarifying the troubleshooting process has also become important, though it is time-consuming. We propose a method of automatically extracting a workflow, a graph representing a troubleshooting process, from multiple trouble tickets. Our method extracts an operator's actions from free-format text and aligns related sentences across multiple trouble tickets. It uses a stochastic model to detect a resolution, a frequent action pattern that helps us understand how to solve a problem. We validated our method using trouble-ticket data captured from a real network operation and showed that it can extract a workflow that helps identify the cause of a failure.
Keisuke ISHIBASHI Masaki AIDA Shin-ichi KURIBAYASHI
We previously proposed a change-of-measure based performance measurement method that combines active and passive measurement to estimate the performance experienced by user packets, and we applied it to estimating packet delay. In this paper, we apply it to estimating the loss rate. Since packets are rarely lost in current networks, loss-rate measurement usually requires a huge number of probe packets, which imposes a non-negligible load on networks. We propose a loss-rate estimation method that requires significantly fewer probe packets. In our method, the correlation between delay and loss is measured in advance, and at measurement time, the time-averaged loss rate is estimated from the delay of probe packets and this correlation. We also apply our change-of-measure framework to estimate the loss rate experienced by user packets from this time-averaged loss rate. We prove that the mean square error of our method is lower than that of simple loss measurement, in which the loss rate is estimated by dividing the number of lost packets by the total number of packets sent. We evaluated our method through simulations and actual measurements and found that it can estimate packet loss rates below 10^-3 with only 900 probe packets.
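As a rough sketch of the estimation idea (not the paper's change-of-measure estimator), the following pre-measures the conditional loss probability per delay bin and then averages those probabilities over the delays of a small number of probes; the bin edges and calibration data are hypothetical.

```python
# Sketch: estimate a loss rate from probe delays via a pre-measured delay-loss table.
import bisect

def build_delay_loss_table(calibration):
    """calibration: list of (delay_ms, lost_flag) gathered in advance."""
    edges = [1, 2, 5, 10, 20, 50, 100]           # delay bin boundaries in ms
    sent = [0] * (len(edges) + 1)
    lost = [0] * (len(edges) + 1)
    for delay, was_lost in calibration:
        b = bisect.bisect_left(edges, delay)
        sent[b] += 1
        lost[b] += was_lost
    return edges, [l / s if s else 0.0 for l, s in zip(lost, sent)]

def estimate_loss_rate(probe_delays, edges, p_loss_given_bin):
    """Average the conditional loss probabilities over the observed probe delays."""
    return sum(p_loss_given_bin[bisect.bisect_left(edges, d)] for d in probe_delays) / len(probe_delays)

# Loss is much likelier at high delay in the calibration data.
calib = [(5, 0)] * 500 + [(30, 0)] * 90 + [(30, 1)] * 10
edges, table = build_delay_loss_table(calib)
print(estimate_loss_rate([4, 6, 30, 5, 35], edges, table))   # -> 0.04
```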
Akito SUZUKI Ryoichi KAWAHARA Masahiro KOBAYASHI Shigeaki HARADA Yousuke TAKAHASHI Keisuke ISHIBASHI
Network functions virtualization (NFV) enables telecommunications service providers to realize various network services by flexibly combining multiple virtual network functions (VNFs). To provide such services, an NFV control method should optimally allocate such VNFs into physical networks and servers by taking account of the combination(s) of objective functions and constraints for each metric defined for each VNF type, e.g., VNF placements and routes between the VNFs. The NFV control method should also be extendable for adding new metrics or changing the combination of metrics. One approach for NFV control to optimize allocations is to construct an algorithm that simultaneously solves the combined optimization problem. However, this approach is not extendable because the problem needs to be reformulated every time a new metric is added or a combination of metrics is changed. Another approach involves using an extendable network-control architecture that coordinates multiple control algorithms specified for individual metrics. However, to the best of our knowledge, no method has been developed that can optimize allocations through this kind of coordination. In this paper, we propose an extendable NFV-integrated control method by coordinating multiple control algorithms. We also propose an efficient coordination algorithm based on reinforcement learning. Finally, we evaluate the effectiveness of the proposed method through simulations.
Keisuke ISHIBASHI Ryoichi KAWAHARA Tatsuya MORI Tsuyoshi KONDOH Shoichiro ASANO
We quantitatively evaluate how sampling and spatio-temporal granularity in traffic monitoring affect the detectability of anomalous traffic. These parameters also affect the monitoring burden, so network operators face a trade-off between monitoring burden and detectability and need to know the optimal parameter values. We derive equations to calculate the false positive ratio and false negative ratio for given values of the sampling rate, granularity, statistics of normal traffic, and volume of anomalies to be detected. Specifically, assuming that normal traffic follows a Gaussian distribution parameterized by its mean and standard deviation, we analyze how sampling and monitoring granularity change these distribution parameters. This analysis is based on observation of backbone traffic, which is spatially uncorrelated but exhibits temporal long-range dependence. We then derive the equations for detectability. With these equations, we can answer practical questions that arise in actual network operations: what sampling rate should be set to find a given volume of anomaly, or, if that sampling rate is too high for actual operation, what granularity is optimal for finding the anomaly under a given lower limit on the sampling rate.
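A minimal sketch of the kind of detectability calculation described, assuming Gaussian normal traffic and independent packet sampling (binomial thinning); the paper's exact equations, which also account for monitoring granularity and long-range dependence, may differ. All numbers are hypothetical.

```python
# Sketch: false-positive/false-negative ratios for a count threshold under sampling.
from math import erfc, sqrt

def gaussian_tail(x):
    return 0.5 * erfc(x / sqrt(2))

def detectability(mean_pkts, std_pkts, anomaly_pkts, sampling_rate, threshold):
    """FPR/FNR for detecting an anomaly of anomaly_pkts packets added to normal traffic."""
    p = sampling_rate
    def sampled(mu, sigma):
        # Thinned mean and standard deviation (law of total variance for binomial thinning).
        return p * mu, sqrt(p * p * sigma ** 2 + p * (1 - p) * mu)
    mu0, s0 = sampled(mean_pkts, std_pkts)
    mu1, s1 = sampled(mean_pkts + anomaly_pkts, std_pkts)
    fpr = gaussian_tail((threshold - mu0) / s0)        # alarm although traffic is normal
    fnr = 1 - gaussian_tail((threshold - mu1) / s1)    # miss although the anomaly is present
    return fpr, fnr

print(detectability(mean_pkts=1e6, std_pkts=5e4, anomaly_pkts=3e5,
                    sampling_rate=0.01, threshold=11_500))
```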
Keisuke ISHIBASHI Shigeaki HARADA Ryoichi KAWAHARA
In this paper, we propose a CTRIL (Common Trend and Regression with Independent Loss) model to infer the latent traffic demand on overloaded links and how much it is reduced by QoS (Quality of Service) degradation. To appropriately provision link bandwidth for such overloaded links, we need to infer how much traffic would increase without QoS degradation. Because the original latent traffic demand cannot be observed, we propose a method that compares the traffic time series of the overloaded link with that of an underloaded link, assuming that the latent traffic demands of the overloaded and underloaded links follow a common trend and that the actual traffic demand on the overloaded link is reduced from this common pattern by the effect of QoS degradation. To realize this method, we developed the CTRIL model on the basis of a state-space model in which observed traffic is generated from a latent trend but is decreased by QoS degradation. By applying the CTRIL model to actual HTTP (Hypertext Transfer Protocol) traffic and QoS time series data, we reveal that 1% packet loss decreases traffic demand by 12.3% and that the estimated latent traffic demand is 23.0% larger than the observed one.
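As an illustrative sketch only (not the CTRIL state-space estimator), the following assumes the overloaded link shares a common latent trend with an underloaded reference link and that its observed traffic is reduced in proportion to packet loss; the reduction coefficient is then estimated by least squares and the latent demand recovered. The synthetic data and model form are hypothetical.

```python
# Sketch: estimate loss sensitivity and latent demand from a common-trend assumption.
import numpy as np

def fit_loss_sensitivity(overloaded, underloaded, loss):
    """All arguments are equal-length time series; returns (beta, latent_demand)."""
    ratio = np.asarray(overloaded) / np.asarray(underloaded)   # common trend cancels out
    loss = np.asarray(loss)
    # Model: ratio_t ~ alpha * (1 - beta * loss_t), fitted by least squares.
    A = np.column_stack([np.ones_like(loss), -loss])
    coef, *_ = np.linalg.lstsq(A, ratio, rcond=None)
    alpha, alpha_beta = coef
    beta = alpha_beta / alpha
    latent = np.asarray(overloaded) / (1 - beta * loss)        # undo the QoS-induced reduction
    return beta, latent

t = np.arange(100)
trend = 100 + 20 * np.sin(t / 10)
loss = np.clip((trend - 105) / 200, 0, None)                   # loss appears at traffic peaks
under = trend * 0.5                                            # underloaded reference link
over = trend * (1 - 10 * loss)                                 # overloaded link, demand suppressed
beta, latent = fit_loss_sensitivity(over, under, loss)
print(round(float(beta), 2), np.allclose(latent, trend))
```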