Noriaki KAMIYAMA Ryoichi KAWAHARA Tatsuya MORI Haruhisa HASEGAWA
In Video on Demand (VoD) services, the demand for content items changes greatly over the course of a day. Because service providers are required to maintain a stable service during peak hours, they need to design system resources on the basis of peak-time demand, so reducing the server load at peak times is important. To reduce the peak load of a content server, we propose to multicast popular content items to all users independently of actual requests as well as providing on-demand unicast delivery. With this solution, however, the hit ratio of pre-distributed content items is small, and large-capacity storage is required at each set-top box (STB). We can expect to cope with this problem by limiting the number of pre-distributed content items or clustering users based on their viewing histories. We evaluated the effect of these techniques by using actual VoD access log data. We also evaluated the total cost of the multicast pre-distribution VoD system with the two proposed techniques.
Yasuhiro IKEDA Ryoichi KAWAHARA Noriaki KAMIYAMA Tatsuaki KIMURA Tatsuya MORI
We analyze measured traffic data to investigate the characteristics of TCP quality metrics such as packet retransmission rate, roundtrip time (RTT), and throughput of connections classified by their type (client-server (C/S) or peer-to-peer (P2P)), or by the location of the connection host (domestic or overseas). Our findings are as follows. (i) The TCP quality metrics of the measured traffic data are not necessarily consistent with a theoretical formula proposed in a previous study. However, the average RTT and retransmission rate are negatively correlated with the throughput, which is similar to this formula. Furthermore, the maximum idle time, which is defined as the maximum length of the packet interarrival times, is negatively correlated with throughput. (ii) Each TCP quality metric of C/S connections is higher than that of P2P connections. Here “higher quality” means that either the throughput is higher, or the other TCP quality metrics lead to higher throughput; for example the average RTT is lower or the retransmission rate is lower. Specifically, the median throughput of C/S connections is 2.5 times higher than that of P2P connections in the incoming direction of domestic traffic. (iii) The characteristics of TCP quality metrics depend on the location of the host of the TCP connection. There are cases in which overseas servers might use a different TCP congestion control scheme. Even if we eliminate these servers, there is still a difference in the degree of impact the average RTT has on the throughput between domestic and overseas traffic. One reason for this is thought to be the difference in the maximum idle time, and another is the fact that congestion levels of these types of traffic differ, even if their average RTTs are the same.
Noriaki KAMIYAMA Tatsuya MORI Ryoichi KAWAHARA Haruhisa HASEGAWA
Recently, the number of users downloading video content on the Internet has dramatically increased, and it is highly anticipated that downloading huge size, rich content such as movie files will become a popular use of the Internet in the near future. The transmission bandwidth consumed by delivering rich content is enormous, so it is urgent for ISPs to design an efficient delivery system that minimizes the amount of network resources consumed. To deliver web content efficiently, a content delivery network (CDN) is often used. CDN providers collocate a huge number of servers within multiple ISPs without being informed of detailed network information, i.e., network topologies, from ISPs. Minimizing the amount of network resources consumed is difficult because a CDN provider selects a server for each request based on only rough estimates of response time. Therefore, an ordinary CDN is not suited for delivering rich content. P2P-based delivery systems are becoming popular as scalable delivery systems. However, by using a P2P-based system, we still cannot obtain the ideal delivery pattern that is optimal for ISPs because the server locations depend on users behaving selfishly. To provide rich content to users economically and efficiently, an ISP itself should optimally provide servers with huge storage capacities at a limited number of locations within its network. In this paper, we investigate the content deployment method, the content delivery process, and the server allocation method that are desirable for this ISP-operated CDN. Moreover, we evaluate the effectiveness of the ISP-operated CDN using the actual network topologies of commercial ISPs.
Naoya MAKI Takayuki NISHIO Ryoichi SHINKUMA Tatsuya MORI Noriaki KAMIYAMA Ryoichi KAWAHARA Tatsuro TAKAHASHI
In content services where people purchase and download large-volume content, minimizing network traffic is crucial for the service provider and the network operator since they want to lower the cost charged for bandwidth and the cost of network infrastructure, respectively. Traffic localization is an effective way of reducing network traffic. Network traffic is localized when a client can obtain the requested content files from a nearby altruistic client instead of the source servers. The concept of the peer-assisted content distribution network (CDN) can reduce the overall traffic with this mechanism and enable service providers to minimize traffic without deploying or borrowing distributed storage. To localize traffic effectively, content files that are likely to be requested by many clients should be cached locally. This paper presents a novel traffic engineering scheme for peer-assisted CDN models. Its key idea is to control the behavior of clients by using a content-oriented incentive mechanism. This approach enables us to optimize traffic flows by letting altruistic clients download the content files that are most likely to contribute to localizing traffic among clients. In order to let altruistic clients request the desired files, we combine content files while keeping the price equal to that of a single content file. This paper presents a solution for optimizing the selection of content files to be combined so that cross traffic in a network is minimized. We also give a model for analyzing the upper-bound performance and the numerical results.
Tatsuaki KIMURA Keisuke ISHIBASHI Tatsuya MORI Hiroshi SAWADA Tsuyoshi TOYONO Ken NISHIMATSU Akio WATANABE Akihiro SHIMODA Kohei SHIOMOTO
Network equipment, such as routers, switches, and RADIUS servers, generates various log messages induced by network events such as hardware failures and protocol flaps. In large production networks, analyzing the log messages is crucial for diagnosing network anomalies; however, it has become challenging for the following two reasons. First, the log messages are composed of unstructured text messages generated in accordance with vendor-specific rules. Second, network events that induce the log messages span several geographical locations, network layers, protocols, and services. We developed a method to tackle these obstacles consisting of two techniques: statistical template extraction (STE) and log tensor factorization (LTF). The former leverages a statistical clustering technique to automatically extract primary templates from unstructured log messages. The latter builds a statistical model that collects spatial-temporal patterns of log messages. Such spatial-temporal patterns provide useful insights into understanding the impact and patterns of hidden network events. We evaluate our techniques using a massive amount of network log messages collected from a large operating network and confirm that our model fits the data well. We also investigate several case studies that validate the usefulness of our method.
Bo SUN Akinori FUJINO Tatsuya MORI Tao BAN Takeshi TAKAHASHI Daisuke INOUE
Analyzing a malware sample requires much more time and cost than creating it. To understand the behavior of a given malware sample, security analysts often make use of API call logs collected by dynamic malware analysis tools such as sandboxes. As the log generated for a malware sample can become tremendously large, inspecting it is time-consuming. Meanwhile, antivirus vendors usually publish malware analysis reports (vendor reports) on their websites. These malware analysis reports are the results of careful analysis done by security experts. The problem is that even though there are such analyzed examples for malware samples, associating the vendor reports with the sandbox logs is difficult. This prevents security analysts from retrieving the useful information described in vendor reports. To address this issue, we developed a system called AMAR-Generator that aims to automate the generation of malware analysis reports based on sandbox logs by making use of existing vendor reports. Aiming at a convenient assistant tool for security analysts, our system employs techniques including template matching, API behavior mapping, and a malicious behavior database to produce concise human-readable reports that describe the malicious behaviors of malware programs. Through the performance evaluation, we first demonstrate that AMAR-Generator can generate human-readable reports that can be used by a security analyst as the first step of the malware analysis. We also demonstrate that AMAR-Generator can identify the malicious behaviors that are conducted by malware from the sandbox logs; the detection rates are up to 96.74%, 100%, and 74.87% on the sandbox logs collected in 2013, 2014, and 2015, respectively. We also show that it can detect malicious behaviors from unknown types of sandbox logs.
Takuya WATANABE Mitsuaki AKIYAMA Tetsuya SAKAI Hironori WASHIZAKI Tatsuya MORI
Permission warnings and privacy policy enforcement are widely used to inform mobile app users of privacy threats. These mechanisms disclose information about use of privacy-sensitive resources such as user location or contact list. However, it has been reported that very few users pay attention to these mechanisms during installation. Instead, a user may focus on a more user-friendly source of information: the text description, which is written by a developer who has an incentive to attract user attention. When a user searches for an app in a marketplace, his/her query keywords are generally searched on text descriptions of mobile apps. Then, users review the search results, often by reading the text descriptions; i.e., text descriptions are associated with user expectation. Given these observations, this paper aims to address the following research question: What are the primary reasons that text descriptions of mobile apps fail to refer to the use of privacy-sensitive resources? To answer the research question, we performed a large-scale empirical study using a huge volume of apps with our ACODE (Analyzing COde and DEscription) framework, which combines static code analysis and text analysis. We developed light-weight techniques so that we can handle hundreds of thousands of distinct text descriptions. We note that our text analysis technique does not require manually labeled descriptions; hence, it enables us to conduct a large-scale measurement study without requiring expensive labeling tasks.
Our analysis of 210,000 apps, including free and paid, and multilingual text descriptions collected from official and third-party Android marketplaces revealed four primary factors that are associated with the inconsistencies between text descriptions and the use of privacy-sensitive resources: (1) existence of app building services/frameworks that tend to add API permissions/code unnecessarily, (2) existence of prolific developers who publish many applications that unnecessarily install permissions and code, (3) existence of secondary functions that tend to be unmentioned, and (4) existence of third-party libraries that access privacy-sensitive resources. We believe that these findings will be useful for improving users' awareness of privacy on mobile software distribution platforms.
This work develops a system called CLAP that detects and classifies “potentially unwanted applications” (PUAs) such as adware or remote monitoring tools. Our approach leverages DNS queries made by apps. Using a large sample of Android apps from third-party marketplaces, we first reveal that DNS queries can provide useful information for detection and classification of PUAs. We then show that existing DNS blacklists are limited when performing these tasks. Finally, we demonstrate that the CLAP system performs with high accuracy.
Tatsuya MORI Tetsuya TAKINE Jianping PAN Ryoichi KAWAHARA Masato UCHIDA Shigeki GOTO
With the rapid increase of link speed in recent years, packet sampling has become a very attractive and scalable means of collecting flow statistics; however, it also makes inferring original flow characteristics much more difficult. In this paper, we develop techniques and schemes to identify flows with a very large number of packets (also known as heavy-hitter flows) from sampled flow statistics. Our approach follows a two-stage strategy. We first parametrically estimate the original flow length distribution from sampled flows. We then identify heavy-hitter flows with Bayes' theorem, where the flow length distribution estimated at the first stage is used as an a priori distribution. Our approach is validated and evaluated with publicly available packet traces. We show that our approach provides a very flexible framework for striking an appropriate balance between false positives and false negatives when the sampling frequency is given.
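The second stage described above can be illustrated with a minimal sketch. It is not the implementation from the paper: the truncated discrete Pareto prior, the binomial model for independent packet sampling, and the names `discrete_pareto_prior`, `binomial_likelihood`, and `heavy_hitter_posterior` are all assumptions chosen for illustration. Given a flow with y sampled packets, it computes the posterior probability that the original flow had at least `threshold` packets.

```python
from math import exp, lgamma, log


def discrete_pareto_prior(max_len, alpha):
    """Hypothetical prior: P(X = x) proportional to x^-(alpha+1), 1 <= x <= max_len."""
    weights = {x: x ** -(alpha + 1.0) for x in range(1, max_len + 1)}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def binomial_likelihood(y, x, p):
    """P(Y = y | X = x) when each packet is sampled independently with probability p.

    Assumes 0 < p < 1 and 0 <= y <= x. Computed in log space to avoid overflow.
    """
    log_like = (lgamma(x + 1) - lgamma(y + 1) - lgamma(x - y + 1)
                + y * log(p) + (x - y) * log(1.0 - p))
    return exp(log_like)


def heavy_hitter_posterior(y, p, prior, threshold):
    """Bayes' theorem: P(X >= threshold | Y = y) for one sampled flow."""
    evidence = 0.0  # sum over all x of P(Y = y | X = x) P(X = x)
    heavy = 0.0     # same sum restricted to x >= threshold
    for x, prior_x in prior.items():
        if x < y:
            continue  # a flow cannot have fewer packets than were sampled
        joint = binomial_likelihood(y, x, p) * prior_x
        evidence += joint
        if x >= threshold:
            heavy += joint
    return heavy / evidence if evidence > 0.0 else 0.0
```

A flow would then be declared a heavy hitter when this posterior exceeds a chosen cut-off; raising or lowering that cut-off is one way to trade false positives against false negatives.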
Ryoichi KAWAHARA Tatsuya MORI Keisuke ISHIBASHI Noriaki KAMIYAMA Hideaki YOSHINO
Managing the performance at the flow level through traffic measurement is crucial for effective network management. With the rapid rise in link speeds, collecting all packets has become difficult, so packet sampling has been attracting attention as a scalable means of measuring flow statistics. In this paper, we firstly propose a method of estimating TCP flow rates of sampled flows through packet sampling, and then develop a method of detecting performance degradation at the TCP flow level from the estimated flow rates. In the method of estimating flow rates, we use sequence numbers of sampled packets, which make it possible to improve markedly the accuracy of estimating the flow rates of sampled flows. Using both an analytical model and measurement data, we show that this method gives accurate estimations. We also show that, by observing the estimated rates of sampled flows, we can detect TCP performance degradation. The method of detecting performance degradation is based on the following two findings: (i) sampled flows tend to have high flow-rates and (ii) when a link becomes congested, the performance of high-rate flows becomes degraded first. These characteristics indicate that sampled flows are sensitive to congestion, so we can detect performance degradation of flows that are sensitive to congestion by observing the rate of sampled flows. We also show the effectiveness of our method using measurement data.
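The idea of using sequence numbers of sampled packets can be sketched as follows. This is a simplified illustration rather than the estimator from the paper: it assumes that at least two packets of the flow were sampled, that the sequence number wraps around at most once between the first and last sample, and that retransmissions and reordering can be ignored. The function name `estimate_flow_rate` and the input layout are hypothetical.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def estimate_flow_rate(samples):
    """Estimate a flow's rate in bytes per second from sampled packets.

    samples: list of (timestamp_seconds, tcp_sequence_number) for one flow.
    Returns None when the rate cannot be estimated.
    """
    if len(samples) < 2:
        return None
    ordered = sorted(samples)  # order by timestamp
    t_first, seq_first = ordered[0]
    t_last, seq_last = ordered[-1]
    duration = t_last - t_first
    if duration <= 0:
        return None
    # Bytes sent between the two samples, allowing for one wraparound.
    byte_span = (seq_last - seq_first) % SEQ_SPACE
    return byte_span / duration
```

Because the sequence number difference counts every byte sent between two sampled packets, the estimate does not depend on how many packets in between were missed by the sampler, which is why it can be more accurate than scaling up the sampled byte count.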
Yumehisa HAGA Yuta TAKATA Mitsuaki AKIYAMA Tatsuya MORI
Web tracking is widely used as a means to track users' behavior on websites. While web tracking provides new opportunities for e-commerce, it also carries certain risks such as privacy infringement. Therefore, analyzing such risks in the wild is meaningful for making threats to user privacy transparent. This work aims to understand how web tracking has been adopted by prominent websites. We also aim to understand its resilience to ad-blocking techniques. Web tracking-enabled websites collect information called web browser fingerprints, which can be used to identify users. We develop a scalable system that can detect fingerprinting by using both dynamic and static analyses. If a tracking site makes use of many strong fingerprints, the site is likely resilient to ad-blocking techniques. We also analyze the connectivity of the third-party tracking sites, which are linked from multiple websites. The link analysis allows us to extract the group of associated tracking sites and understand how influential these sites are. Based on the analyses of 100,000 websites, we quantify the potential risks of the web tracking-enabled websites. We reveal that there are 226 websites that adopt fingerprints that cannot be detected with most off-the-shelf anti-tracking tools. We also reveal that a major, resilient third-party tracking site is linked to 50.0% of the top-100,000 popular websites.
Takuya WATANABE Mitsuaki AKIYAMA Tatsuya MORI
We developed a novel, proof-of-concept side-channel attack framework called RouteDetector, which identifies a route for a train trip by simply reading smart device sensors: an accelerometer, magnetometer, and gyroscope. All these sensors are commonly used by many apps without requiring any permissions. The key technical components of RouteDetector can be summarized as follows. First, by applying a machine-learning technique to the data collected from sensors, RouteDetector detects the activity of a user, i.e., “walking,” “in moving vehicle,” or “other.” Next, it extracts departure/arrival times of vehicles from the sequence of the detected human activities. Finally, by correlating the detected departure/arrival times of the vehicle with timetables/route maps collected from all the railway companies in the rider's country, it identifies potential routes that can be used for a trip. We demonstrate that the strategy is feasible through field experiments and extensive simulation experiments using timetables and route maps for 9,090 railway stations of 172 railway companies.
An enormous number of malware samples pose a major threat to our networked society. Antivirus software and intrusion detection systems are widely implemented on the hosts and networks as fundamental countermeasures. However, they may fail to detect evasive malware. Thus, setting a high priority for new varieties of malware is necessary to conduct in-depth analyses and take preventive measures. In this paper, we present a traffic model for malware that can classify network behaviors of malware and identify new varieties of malware. Our model comprises malware-specific features and general traffic features that are extracted from packet traces obtained from a dynamic analysis of the malware. We apply a clustering analysis to generate a classifier and evaluate our proposed model using large-scale live malware samples. The results of our experiment demonstrate the effectiveness of our model in finding new varieties of malware.
Takuya WATANABE Mitsuaki AKIYAMA Fumihiro KANEI Eitaro SHIOJI Yuta TAKATA Bo SUN Yuta ISHII Toshiki SHIBAHARA Takeshi YAGI Tatsuya MORI
This paper reports a large-scale study that aims to understand how mobile application (app) vulnerabilities are associated with software libraries. We analyze both free and paid apps. Studying paid apps was quite meaningful because it helped us understand how differences in app development/maintenance affect the vulnerabilities associated with libraries. We analyzed 30k free and paid apps collected from the official Android marketplace. Our extensive analyses revealed that approximately 70%/50% of vulnerabilities of free/paid apps stem from software libraries, particularly from third-party libraries. Somewhat paradoxically, we found that more expensive/popular paid apps tend to have more vulnerabilities. This comes from the fact that more expensive/popular paid apps tend to have more functionality, i.e., more code and libraries, which increases the probability of vulnerabilities. Based on our findings, we provide suggestions to stakeholders of mobile app distribution ecosystems.
Takuya WATANABE Eitaro SHIOJI Mitsuaki AKIYAMA Keito SASAOKA Takeshi YAGI Tatsuya MORI
This paper presents a practical side-channel attack that identifies the social web service account of a visitor to an attacker's website. Our attack leverages the widely adopted user-blocking mechanism, abusing its inherent property that certain pages return different web content depending on whether a user is blocked by another user. Our key insight is that an account prepared by an attacker can hold an attacker-controllable binary state of blocking/non-blocking with respect to an arbitrary user on the same service; provided that the user is logged in to the service, this state can be retrieved as one-bit data through the conventional cross-site timing attack when a user visits the attacker's website. We generalize and refer to such a property as visibility control, which we consider as the fundamental assumption of our attack. Building on this primitive, we show that an attacker with a set of controlled accounts can gain complete and flexible control over the data leaked through the side channel. Using this mechanism, we show that it is possible to design and implement a robust, large-scale user identification attack on a wide variety of social web services. To verify the feasibility of our attack, we perform an extensive empirical study using 16 popular social web services and demonstrate that at least 12 of these are vulnerable to our attack. Vulnerable services include not only popular social networking sites such as Twitter and Facebook, but also other types of web services that provide social features, e.g., eBay and Xbox Live. We also demonstrate that the attack can achieve nearly 100% accuracy and can finish within a sufficiently short time in a practical setting. We discuss the fundamental principles, practical aspects, and limitations of the attack as well as possible defenses. We have successfully addressed this attack by collaborating with service providers and browser vendors.
Keika MORI Takuya WATANABE Yunao ZHOU Ayako AKIYAMA HASEGAWA Mitsuaki AKIYAMA Tatsuya MORI
This work aims to determine the propensity of password creation through the lens of language spheres. To this end, we consider four different countries, each with a different culture/language: China/Chinese, United Kingdom (UK) and India/English, and Japan/Japanese. We first employ a user study to verify whether language and culture are reflected in password creation. We found that users in India, Japan, and the UK prefer to create their passwords from base words, and the kinds of words incorporated into passwords vary between countries. We then test whether the findings obtained through the user study are reflected in a corpus of leaked passwords. We found that users in China and Japan prefer dates, while users in India, Japan, and the UK prefer names. We also found that cultural words (e.g., “sakura” in Japan and “football” in the UK) are frequently used to create passwords. Finally, we demonstrate that knowledge of the linguistic background of targeted users can be exploited to increase the speed of the password guessing process.
Keisuke ISHIBASHI Ryoichi KAWAHARA Tatsuya MORI Tsuyoshi KONDOH Shoichiro ASANO
We quantitatively evaluate how sampling and spatio/temporal granularity in traffic monitoring affect the detectability of anomalous traffic. Those parameters also affect the monitoring burden, so network operators face a trade-off between the monitoring burden and detectability and need to know the optimal parameter values. We derive equations to calculate the false positive ratio and false negative ratio for given values of the sampling rate, granularity, statistics of normal traffic, and volume of anomalies to be detected. Specifically, assuming that the normal traffic has a Gaussian distribution, which is parameterized by its mean and standard deviation, we analyze how sampling and monitoring granularity change these distribution parameters. This analysis is based on the observation that backbone traffic is spatially uncorrelated and exhibits temporal long-range dependence. Then we derive the equations for detectability. With those equations, we can answer the practical questions that arise in actual network operations: what sampling rate to set to find the given volume of anomaly, or, if the sampling rate is too high for actual operation, what granularity is optimal to find the anomaly for a given lower limit of sampling rate.
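The kind of calculation described above can be sketched for the sampling rate alone. The equations below are not those derived in the paper, which also cover spatio/temporal granularity and long-range dependence. They assume independent per-packet sampling, a simple threshold detector on the sampled count, an anomaly that adds a fixed volume to normal traffic, and a Gaussian approximation of the sampled count. All function and parameter names are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(x, mean, std):
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))


def sampled_stats(mean, std, rate):
    """Mean and standard deviation of the sampled packet count.

    With independent sampling at probability `rate`, the law of total variance
    gives Var = rate^2 * std^2 + rate * (1 - rate) * mean.
    """
    sampled_mean = rate * mean
    sampled_var = rate ** 2 * std ** 2 + rate * (1.0 - rate) * mean
    return sampled_mean, sqrt(sampled_var)


def error_ratios(mean, std, rate, anomaly, threshold):
    """False positive and false negative ratios of a threshold detector.

    `threshold` is expressed in sampled packets; an alarm is raised above it.
    """
    normal_mean, normal_std = sampled_stats(mean, std, rate)
    anomalous_mean, anomalous_std = sampled_stats(mean + anomaly, std, rate)
    false_positive = 1.0 - normal_cdf(threshold, normal_mean, normal_std)
    false_negative = normal_cdf(threshold, anomalous_mean, anomalous_std)
    return false_positive, false_negative
```

Under these assumptions, lowering the sampling rate adds sampling noise relative to the scaled-down anomaly volume, so both error ratios grow for a threshold placed midway between the normal and anomalous means.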
Kazuya TAKAHASHI Tatsuya MORI Yusuke HIROTA Hideki TODE Koso MURAKAMI
In recent years, real-time streaming has become widespread as a major service on the Internet. However, real-time streaming has a strict playback deadline. Application-level multicasts using multiple distribution trees, which are known as forests, are an effective approach for reducing delay and jitter. However, the failure or departure of nodes during forest-based multicast transfer can severely affect the performance of other nodes. Thus, the multimedia data quality is degraded until the distribution trees are repaired. This means that increasing the speed of recovery from isolation is very important, especially in real-time streaming services. In this paper, we propose three methods for resolving this problem. The first method is a random-based proactive method that achieves rapid recovery from isolation through efficient “Randomized Forwarding” via cooperation among distribution trees. Each node forwards the data it receives to child nodes in its tree and then randomly transfers it to other trees with a predetermined probability. The second method is a reactive method, which provides reliable isolation recovery with low overhead. In this method, an isolated node requests “Continuous Forwarding” from other nodes if it detects a problem with a parent node. Forwarding to the nearest nodes in the IP network ensures that this method is efficient. The third method is a hybrid method that combines these two methods to achieve further performance improvements. We evaluated the performance of these proposed methods using computer simulations. The simulation results demonstrated that our proposed methods delivered isolation recovery and that the hybrid method was the most suitable for real-time streaming.
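The forwarding rule of the first method can be sketched as below. The abstract does not state how the receiving node in another tree is chosen, so picking one uniformly at random per tree is an assumption, as are the function name `forward_targets` and its data layout.

```python
import random


def forward_targets(children, other_tree_nodes, prob, rng=random):
    """Return the nodes that receive a data unit from this node.

    children: child nodes in this node's own distribution tree.
    other_tree_nodes: dict mapping a tree id to candidate nodes in that tree.
    prob: predetermined probability of also forwarding into each other tree.
    """
    targets = list(children)  # regular forwarding inside the node's own tree
    for candidates in other_tree_nodes.values():
        # Randomized forwarding: with probability `prob`, send one extra copy
        # to a node of another tree (uniform choice is an assumption).
        if candidates and rng.random() < prob:
            targets.append(rng.choice(candidates))
    return targets
```

A larger `prob` means an isolated node is more likely to keep receiving data from other trees before its own tree is repaired, at the cost of more redundant traffic.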
Noriaki KAMIYAMA Tatsuya MORI Ryoichi KAWAHARA Shigeaki HARADA
We have proposed a method of identifying superspreaders by flow sampling and a method of filtering legitimate hosts from the identified superspreaders using a white list. However, the problem of how to optimally set three parameters remained unsolved: φ, the measurement period length; m*, the identification threshold of the flow count m within φ; and H*, the identification probability for hosts with m=m*. These three parameters seriously impact the ability to identify the spread of infection. Our contributions in this work are two-fold: (1) we propose a method of optimally designing these three parameters to satisfy the condition that the ratio of the number of active worm-infected hosts to the number of all vulnerable hosts is bounded by a given upper limit during the time T required to develop a patch or an anti-worm vaccine, and (2) the proposed method can optimize the identification accuracy of worm-infected hosts by making maximal use of the limited memory resources of monitors.
Takahiro MATSUDA Tatsuya MORITA Takanori KUDO Tetsuya TAKINE
In this paper, we study robust Principal Component Analysis (PCA)-based anomaly detection techniques in network traffic, which can detect traffic anomalies by projecting measured traffic data onto a normal subspace and an anomalous subspace. In PCA-based anomaly detection, outliers, i.e., anomalies with excessively large traffic volume, may contaminate the subspaces and degrade the performance of the detector. To solve this problem, robust PCA methods have been studied. In a robust PCA-based anomaly detection scheme, outliers can be removed from the measured traffic data before constructing the subspaces. Although the robust PCA methods are promising, they incur high computational cost to obtain the optimal location vector and scatter matrix for the subspace. We propose a novel anomaly detection scheme by extending the minimum covariance determinant (MCD) estimator, a robust PCA method. The proposed scheme utilizes the daily periodicity in traffic volume and attempts to detect anomalies for every period of measured traffic. In each period, before constructing the subspace, outliers are removed from the measured traffic data by using a location vector and a scatter matrix obtained in the preceding period. We validate the proposed scheme by applying it to measured traffic data in the Abilene network. Numerical results show that the proposed scheme provides robust anomaly detection with less computational cost.
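The subspace projection that underlies this kind of detector can be sketched with ordinary (non-robust) PCA. The sketch deliberately omits the MCD-based outlier removal and the period-by-period update proposed in the paper. It assumes NumPy is available, and the names `fit_normal_subspace` and `residual_energy` and the choice of `k` are hypothetical.

```python
import numpy as np


def fit_normal_subspace(traffic, k):
    """Fit the normal subspace from a (time bins x links) traffic matrix.

    Returns the column mean and a matrix whose k columns are the top principal
    directions, which span the normal subspace.
    """
    mean = traffic.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(traffic - mean, full_matrices=False)
    return mean, vt[:k].T


def residual_energy(traffic, mean, basis):
    """Squared norm of each time bin's projection onto the anomalous subspace."""
    centered = traffic - mean
    normal_part = centered @ basis @ basis.T  # projection onto normal subspace
    residual = centered - normal_part         # what the normal subspace misses
    return np.sum(residual ** 2, axis=1)
```

Time bins whose residual energy exceeds a chosen threshold would be flagged as anomalous. Because the mean and the principal directions are fitted on data that may contain large outliers, they can be distorted by those outliers, which is the contamination problem that robust PCA methods address.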