Daiki CHIBA Ayako AKIYAMA HASEGAWA Takashi KOIDE Yuta SAWABE Shigeki GOTO Mitsuaki AKIYAMA
Internationalized domain names (IDNs) are abused to create domain names that are visually similar to those of legitimate/popular brands. In this work, we systematize such domain names, which we call deceptive IDNs, and analyze the risks associated with them. In particular, we propose a new system called DomainScouter to detect various deceptive IDNs and calculate a deceptive IDN score, a new metric indicating the number of users that are likely to be misled by a deceptive IDN. We perform a comprehensive measurement study on the identified deceptive IDNs using over 4.4 million registered IDNs under 570 top-level domains (TLDs). The measurement results demonstrate that there are many previously unexplored deceptive IDNs targeting non-English brands or combining other domain squatting methods. Furthermore, we conduct online surveys to examine and highlight vulnerabilities in user perceptions when encountering such IDNs. Finally, we discuss the practical countermeasures that stakeholders can take against deceptive IDNs.
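As a rough illustration of how homoglyph-based deception can be screened for (a minimal sketch, not the DomainScouter system; the confusable table and brand list below are tiny illustrative stand-ins, and punycode conversion uses Python's built-in IDNA2003 codec):

```python
# Minimal sketch of homoglyph-based deceptive IDN screening (illustrative only).
CONFUSABLES = {"а": "a", "е": "e", "ο": "o", "ı": "i", "ѕ": "s"}  # hypothetical subset
BRANDS = {"google", "apple", "paypal"}                            # hypothetical subset

def skeleton(label: str) -> str:
    """Map visually confusable characters to their ASCII look-alikes."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label)

def is_deceptive(domain: str) -> bool:
    label = domain.split(".")[0]
    if label.startswith("xn--"):                 # decode punycode to Unicode
        label = label.encode("ascii").decode("idna")
    sk = skeleton(label)
    return sk != label and sk in BRANDS          # looks like a brand but is not

puny = "gοοgle".encode("idna").decode("ascii")   # Greek omicrons instead of "o"
print(puny, is_deceptive(puny + ".com"))         # xn--... True
```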
Toshiki SHIBAHARA Takeshi YAGI Mitsuaki AKIYAMA Daiki CHIBA Kunio HATO
Malware-infected hosts have typically been detected using network-based Intrusion Detection Systems on the basis of characteristic patterns of HTTP requests collected with dynamic malware analysis. Since attackers continuously modify malicious HTTP requests to evade detection, novel HTTP requests sent from new malware samples need to be exhaustively collected to maintain a high detection rate. However, analyzing all new malware samples for long periods is infeasible within a limited amount of time. Therefore, we propose a system for efficiently collecting HTTP requests with dynamic malware analysis. Specifically, our system analyzes a malware sample for a short period and then determines whether the analysis should be continued or suspended. It identifies malware samples whose analyses should be continued on the basis of the network behavior observed in their short-period analyses. To make this determination accurately, we focus on the fact that malware communications resemble natural language in terms of data structure, and we apply a recursive neural network, which has recently exhibited high classification performance in natural language processing. In an evaluation with 42,856 malware samples, our proposed system collected 94% of novel HTTP requests and reduced analysis time by 82% compared with a system that continues all analyses.
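The following toy sketch shows a forward pass of a recursive (tree-structured) neural network over the tokens of an HTTP request line, the general technique the paper applies; the weights are random and untrained, and the tokenization and class assignment are assumptions for illustration only:

```python
# Toy forward pass of a recursive neural network over HTTP request tokens.
import zlib
import numpy as np

DIM = 16
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (DIM, 2 * DIM))   # composition weights (untrained)
U = rng.normal(0, 0.1, (2, DIM))         # continue/suspend classifier head

def embed(token: str) -> np.ndarray:
    r = np.random.default_rng(zlib.crc32(token.encode()))  # deterministic per token
    return r.normal(0, 0.1, DIM)

def compose(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    return np.tanh(W @ np.concatenate([left, right]))

def encode(tokens: list) -> np.ndarray:
    vecs = [embed(t) for t in tokens]
    while len(vecs) > 1:                  # greedy left-to-right tree
        vecs = [compose(vecs[0], vecs[1])] + vecs[2:]
    return vecs[0]

def should_continue(request_line: str) -> bool:
    logits = U @ encode(request_line.split())
    return bool(logits[0] > logits[1])    # class 0 = continue analysis

print(should_continue("GET /gate.php?id=42 HTTP/1.1"))
```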
Takuya WATANABE Mitsuaki AKIYAMA Tetsuya SAKAI Hironori WASHIZAKI Tatsuya MORI
Permission warnings and privacy policy enforcement are widely used to inform mobile app users of privacy threats. These mechanisms disclose information about the use of privacy-sensitive resources such as user location or contact lists. However, it has been reported that very few users pay attention to these mechanisms during installation. Instead, a user may focus on a more user-friendly source of information: the text description, which is written by a developer who has an incentive to attract user attention. When a user searches for an app in a marketplace, his/her query keywords are generally matched against the text descriptions of mobile apps. Users then review the search results, often by reading the text descriptions; i.e., text descriptions are associated with user expectations. Given these observations, this paper aims to address the following research question: What are the primary reasons that text descriptions of mobile apps fail to refer to the use of privacy-sensitive resources? To answer this question, we performed an empirical large-scale study of a huge volume of apps with our ACODE (Analyzing COde and DEscription) framework, which combines static code analysis and text analysis. We developed lightweight techniques so that we can handle hundreds of thousands of distinct text descriptions. Notably, our text analysis technique does not require manually labeled descriptions; hence, it enables a large-scale measurement study without expensive labeling tasks. Our analysis of 210,000 apps, both free and paid, with multilingual text descriptions collected from official and third-party Android marketplaces revealed four primary factors associated with inconsistencies between text descriptions and the use of privacy-sensitive resources: (1) app-building services/frameworks that tend to add API permissions/code unnecessarily, (2) prolific developers who publish many applications that unnecessarily request permissions and include code, (3) secondary functions that tend to go unmentioned, and (4) third-party libraries that access privacy-sensitive resources. We believe that these findings will be useful for improving users' awareness of privacy on mobile software distribution platforms.
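A minimal sketch of the kind of description-vs-permission consistency check that motivates ACODE (not the paper's actual pipeline; the keyword lists and permission names below are illustrative stand-ins):

```python
# Flag declared permissions with no matching mention in the app description.
KEYWORDS = {  # hypothetical permission-to-keyword mapping
    "ACCESS_FINE_LOCATION": {"location", "gps", "map", "nearby"},
    "READ_CONTACTS": {"contact", "friend", "address book"},
}

def undocumented_permissions(description: str, permissions: list) -> list:
    text = description.lower()
    return [p for p in permissions
            if p in KEYWORDS and not any(k in text for k in KEYWORDS[p])]

print(undocumented_permissions(
    "A simple flashlight app.", ["ACCESS_FINE_LOCATION"]))  # ['ACCESS_FINE_LOCATION']
```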
Toshiki SHIBAHARA Kohei YAMANISHI Yuta TAKATA Daiki CHIBA Taiga HOKAGUCHI Mitsuaki AKIYAMA Takeshi YAGI Yuichi OHSITA Masayuki MURATA
The number of infected hosts on enterprise networks has increased owing to drive-by download attacks. In these attacks, users of compromised popular websites are redirected to websites that exploit vulnerabilities in a browser and its plugins. To prevent damage, researchers have begun studying the detection of infected hosts from proxy logs rather than blacklist-based filtering, because blacklists have become difficult to create owing to the short lifetime of malicious domains and the concealment of exploit code. To detect accesses to malicious websites from proxy logs, we propose a system for detecting malicious URL sequences on the basis of three key ideas: focusing on sequences of URLs that include artifacts of malicious redirections, designing new features related to software other than browsers, and generating new training data with data augmentation. To find an effective approach for classifying URL sequences, we compared three approaches: an individual-URL-based approach, a convolutional neural network (CNN), and our new event de-noising CNN (EDCNN). The EDCNN reduces the negative effects of benign URLs redirected from compromised websites that are included in malicious URL sequences. Evaluation results show that only the EDCNN with the proposed features and data augmentation achieved practical classification performance: a true positive rate of 99.1% and a false positive rate of 3.4%.
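To make the CNN-over-URL-sequences idea concrete, here is a toy one-dimensional convolution with max pooling over hand-picked per-URL features; the weights are random, the features are assumptions for illustration, and the EDCNN's de-noising component is not reproduced:

```python
# Toy 1-D convolution over a sequence of per-URL feature vectors.
import numpy as np

def url_features(url: str) -> np.ndarray:
    return np.array([len(url), url.count("/"), url.count("?"),
                     sum(c.isdigit() for c in url)], dtype=float)

def conv_score(urls: list, rng=np.random.default_rng(1)) -> float:
    X = np.stack([url_features(u) for u in urls])      # (seq_len, 4)
    K = rng.normal(0, 0.1, (3, 4))                     # width-3 kernel (untrained)
    windows = [np.tanh((X[i:i + 3] * K).sum()) for i in range(len(X) - 2)]
    return float(np.max(windows))                      # max pooling

print(conv_score(["http://a.example/", "http://b.example/r?x=1",
                  "http://c.example/exploit/123"]))
```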
Yumehisa HAGA Yuta TAKATA Mitsuaki AKIYAMA Tatsuya MORI
Web tracking is widely used to track users' behavior on websites. While web tracking provides new opportunities for e-commerce, it also entails risks such as privacy infringement; analyzing such risks in the wild is therefore meaningful for making users' privacy transparent. This work aims to understand how web tracking has been adopted by prominent websites and how resilient it is to ad-blocking techniques. Web tracking-enabled websites collect information called web browser fingerprints, which can be used to identify users. We develop a scalable system that detects fingerprinting by using both dynamic and static analyses. If a tracking site makes use of many strong fingerprints, the site is likely resilient to ad-blocking techniques. We also analyze the connectivity of third-party tracking sites, which are linked from multiple websites. This link analysis allows us to extract groups of associated tracking sites and understand how influential they are. Based on analyses of 100,000 websites, we quantify the potential risks of web tracking-enabled websites. We reveal 226 websites that adopt fingerprints that cannot be detected with most off-the-shelf anti-tracking tools. We also reveal that a major, resilient third-party tracking site is linked from 50.0% of the top 100,000 popular websites.
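A minimal sketch of the static-analysis side, flagging occurrences of well-known fingerprinting-related APIs in JavaScript source (the API list is a small illustrative subset, not the paper's detection logic):

```python
# Count occurrences of fingerprinting-surface APIs in JavaScript source.
FINGERPRINT_APIS = ["navigator.plugins", "navigator.userAgent", "screen.width",
                    "toDataURL", "getTimezoneOffset", "AudioContext"]

def fingerprint_hits(js_source: str) -> dict:
    return {api: js_source.count(api) for api in FINGERPRINT_APIS
            if api in js_source}

js = "var ua = navigator.userAgent; var img = canvas.toDataURL();"
print(fingerprint_hits(js))  # {'navigator.userAgent': 1, 'toDataURL': 1}
```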
Takuya WATANABE Mitsuaki AKIYAMA Tatsuya MORI
We developed a novel, proof-of-concept side-channel attack framework called RouteDetector, which identifies a route for a train trip by simply reading smart device sensors: an accelerometer, magnetometer, and gyroscope. All these sensors are commonly used by many apps without requiring any permissions. The key technical components of RouteDetector can be summarized as follows. First, by applying a machine-learning technique to the data collected from sensors, RouteDetector detects the activity of a user, i.e., “walking,” “in moving vehicle,” or “other.” Next, it extracts departure/arrival times of vehicles from the sequence of the detected human activities. Finally, by correlating the detected departure/arrival times of the vehicle with timetables/route maps collected from all the railway companies in the rider's country, it identifies potential routes that can be used for a trip. We demonstrate that the strategy is feasible through field experiments and extensive simulation experiments using timetables and route maps for 9,090 railway stations of 172 railway companies.
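The timetable-correlation step can be sketched as follows, assuming hypothetical departure times detected from sensor data and a toy timetable; this is not the RouteDetector implementation:

```python
# Match detected departure times against timetable entries within a tolerance.
from datetime import datetime

TIMETABLE = {  # hypothetical route -> departure times at successive stations
    "LineA": ["09:00", "09:04", "09:09"],
    "LineB": ["09:01", "09:07", "09:15"],
}

def to_dt(s: str) -> datetime:
    return datetime.strptime(s, "%H:%M")

def candidate_routes(detected: list, tol_sec: int = 60) -> list:
    det = [to_dt(t) for t in detected]
    out = []
    for route, times in TIMETABLE.items():
        ts = [to_dt(t) for t in times]
        if len(ts) == len(det) and all(
                abs((a - b).total_seconds()) <= tol_sec for a, b in zip(det, ts)):
            out.append(route)
    return out

print(candidate_routes(["09:00", "09:05", "09:09"]))  # ['LineA']
```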
Takuya WATANABE Mitsuaki AKIYAMA Fumihiro KANEI Eitaro SHIOJI Yuta TAKATA Bo SUN Yuta ISHII Toshiki SHIBAHARA Takeshi YAGI Tatsuya MORI
This paper reports a large-scale study of how mobile application (app) vulnerabilities are associated with software libraries. We analyze both free and paid apps; studying paid apps was particularly meaningful because it helped us understand how differences in app development/maintenance affect the vulnerabilities associated with libraries. We analyzed 30k free and paid apps collected from the official Android marketplace. Our extensive analyses revealed that approximately 70% of the vulnerabilities in free apps and 50% in paid apps stem from software libraries, particularly third-party libraries. Somewhat paradoxically, we found that more expensive/popular paid apps tend to have more vulnerabilities. This stems from the fact that more expensive/popular paid apps tend to have more functionality, i.e., more code and libraries, which increases the probability of vulnerabilities. Based on our findings, we provide suggestions to stakeholders of mobile app distribution ecosystems.
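A simplified sketch of attributing a vulnerability location to app code versus a third-party library via package-prefix matching (the prefix list is hypothetical, and the paper's analysis is considerably more involved):

```python
# Attribute a vulnerable class to app code or a known library by package prefix.
LIBRARY_PREFIXES = ["com.google.ads.", "com.facebook.", "org.apache."]  # hypothetical

def attribute(vuln_class: str) -> str:
    if any(vuln_class.startswith(p) for p in LIBRARY_PREFIXES):
        return "third-party library"
    return "app code"

print(attribute("com.google.ads.AdWebView"))  # third-party library
print(attribute("com.example.myapp.Login"))   # app code
```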
Takuya WATANABE Eitaro SHIOJI Mitsuaki AKIYAMA Keito SASAOKA Takeshi YAGI Tatsuya MORI
This paper presents a practical side-channel attack that identifies the social web service account of a visitor to an attacker's website. Our attack leverages the widely adopted user-blocking mechanism, abusing its inherent property that certain pages return different web content depending on whether one user has blocked another. Our key insight is that an account prepared by an attacker can hold an attacker-controllable binary state of blocking/non-blocking with respect to an arbitrary user on the same service; provided that the user is logged in to the service, this state can be retrieved as one bit of data through a conventional cross-site timing attack when the user visits the attacker's website. We generalize and refer to such a property as visibility control, which we consider the fundamental assumption of our attack. Building on this primitive, we show that an attacker with a set of controlled accounts can gain complete and flexible control over the data leaked through the side channel. Using this mechanism, we show that it is possible to design and implement a robust, large-scale user identification attack on a wide variety of social web services. To verify the feasibility of our attack, we perform an extensive empirical study using 16 popular social web services and demonstrate that at least 12 of them are vulnerable to our attack. Vulnerable services include not only popular social networking sites such as Twitter and Facebook, but also other types of web services that provide social features, e.g., eBay and Xbox Live. We also demonstrate that the attack can achieve nearly 100% accuracy and can finish within a sufficiently short time in a practical setting. We discuss the fundamental principles, practical aspects, and limitations of the attack as well as possible defenses. We have successfully addressed this attack by working collaboratively with service providers and browser vendors.
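Conceptually, the leaked one-bit states can be assembled into an identifier as below; the threshold and timings are simulated stand-ins, since real measurements depend on the target service:

```python
# Decode blocked(1)/non-blocked(0) bits, one per attacker-controlled account,
# from cross-site response times (values are simulated, not real measurements).
THRESHOLD_MS = 120.0  # hypothetical decision boundary between page variants

def bits_from_timings(timings_ms: list) -> int:
    """Assemble the per-account bits into an integer user identifier."""
    uid = 0
    for t in timings_ms:
        uid = (uid << 1) | (1 if t > THRESHOLD_MS else 0)
    return uid

print(bits_from_timings([80.2, 150.9, 143.1, 77.5]))  # 0b0110 -> 6
```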
Fumihiro KANEI Daiki CHIBA Kunio HATO Katsunari YOSHIOKA Tsutomu MATSUMOTO Mitsuaki AKIYAMA
While online advertising is widely used on the web and in mobile applications, the monetary damage caused by advertising fraud (ad fraud) has become a severe problem. Existing countermeasures against ad fraud are easily evaded because they rely on noticeable features (e.g., burstiness of ad requests) that attackers can easily change. We propose an ad-fraud-detection method that leverages features robust against attacker evasion. We designed novel features on the basis of statistics observed in an ad network, calculated from a large number of ad requests from legitimate users, such as the popularity of publisher websites and the tendencies of client environments. We assume that attackers can neither know nor manipulate these statistics and that features extracted from fraudulent ad requests tend to be outliers. These features are used to construct a machine-learning model for detecting fraudulent ad requests. We evaluated the proposed method using ad-request logs observed within an actual ad network. The results revealed that our features improved the recall rate by 10% and yielded about 100,000-160,000 fewer false negatives per day than conventional features based on the burstiness of ad requests. In addition, by evaluating detection performance on a long-term dataset, we confirmed that the proposed method is robust against performance degradation over time. Finally, we applied the method to a large dataset constructed on an ad network and found several characteristics of the latest ad frauds in the wild; for example, a large number of fraudulent ad requests are sent from cloud servers.
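A minimal sketch of outlier-based screening over such statistics, here using scikit-learn's IsolationForest on synthetic feature values (the paper's feature set and model are richer than this):

```python
# Flag ad requests whose statistics-derived features are outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# columns: log(publisher popularity), client-environment rarity score (synthetic)
legit = np.column_stack([rng.normal(8, 1, 500), rng.normal(0.2, 0.05, 500)])
model = IsolationForest(random_state=0).fit(legit)

suspect = np.array([[2.0, 0.9]])   # unpopular publisher, rare client setup
print(model.predict(suspect))      # [-1] -> flagged as an outlier
```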
Keika MORI Takuya WATANABE Yunao ZHOU Ayako AKIYAMA HASEGAWA Mitsuaki AKIYAMA Tatsuya MORI
This work aims to determine the propensities of password creation through the lens of language spheres. To this end, we consider four countries with different cultures and languages: China (Chinese), the United Kingdom (UK) and India (English), and Japan (Japanese). We first employ a user study to verify whether language and culture are reflected in password creation. We found that users in India, Japan, and the UK prefer to create their passwords from base words, and the kinds of words incorporated into passwords vary between countries. We then test whether the findings obtained through the user study are reflected in a corpus of leaked passwords. We found that users in China and Japan prefer dates, while users in India, Japan, and the UK prefer names. We also found that cultural words (e.g., "sakura" in Japan and "football" in the UK) are frequently used to create passwords. Finally, we demonstrate that knowledge of the linguistic background of targeted users can be exploited to speed up the password guessing process.
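As a toy illustration of exploiting linguistic background in guessing, the generator below prioritizes language-sphere-specific base words and date-like strings; the wordlists and suffixes are illustrative, not the study's data:

```python
# Generate password guesses ordered by language-sphere-specific base words.
import itertools

WORDLISTS = {  # hypothetical per-sphere base words
    "ja": ["sakura", "19870613", "yamada"],
    "uk": ["football", "james", "liverpool"],
}

def guesses(lang: str):
    for base in WORDLISTS[lang]:
        yield base
        for suffix in ("123", "!", "2024"):   # common mangling suffixes
            yield base + suffix

print(list(itertools.islice(guesses("ja"), 5)))
```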
Mitsuaki AKIYAMA Makoto IWAMURA Yuhei KAWAKOYA Kazufumi AOKI Mitsutaka ITOH
Nowadays, the number of web browser-targeted attacks that lead users to adversarial websites and exploit browser vulnerabilities is increasing, and clarification of their methods and countermeasures is urgently needed. In this paper, we introduce the design and implementation of a new client honeypot for drive-by-download attacks that can detect and investigate a variety of malicious websites. On the basis of the shortcomings of existing client honeypots, we enumerate four requirements for a client honeypot: 1) detection accuracy and variety, 2) collection variety, 3) performance efficiency, and 4) safety and stability, and we improve our system with regard to each. The key features of the developed system are stepwise detection focused on exploit phases, multiple crawler processing, tracking of malware distribution networks, and malware infection prevention. Our evaluation in laboratory and field experiments indicated that its detection variety and crawling performance are higher than those of existing client honeypots. In addition, our system can collect information for countermeasures and is secure and stable for continuous operation. We conclude that our system can investigate malicious websites comprehensively and support countermeasures.
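A schematic of stepwise, phase-focused detection might look like the following, with stub predicates standing in for the system's real detectors:

```python
# Evaluate exploit phases in order and report the first positive (stub checks).
def check_redirection(page): return "meta refresh" in page
def check_exploit(page):     return "unescape(" in page
def check_binary(page):      return "MZ" in page          # dropped PE header

PHASES = [("redirection", check_redirection),
          ("exploit", check_exploit),
          ("malware-download", check_binary)]

def stepwise_detect(page: str):
    for name, check in PHASES:
        if check(page):
            return name
    return None

print(stepwise_detect("<script>eval(unescape('%75...'))</script>"))  # exploit
```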
Takayuki TOMATSURI Daiki CHIBA Mitsuaki AKIYAMA Masato UCHIDA
On the Internet, there are many unused domain names that do not host any actual services. Domain parking is a monetization mechanism for displaying online advertisements on such unused domain names. Some domain names used in cyber attacks are known to leverage domain parking services after the attack. However, the temporal relationships between domain parking services and malicious domain names have not been studied well. In this study, we investigated how malicious domain names using domain parking services change over time. We conducted a large-scale measurement study of more than 66.8 million domain names that used domain parking services over the past 19 months. We reveal the existence of 3,964 domain names that became malicious after using domain parking and further identify what types of malicious activities (e.g., phishing and malware) such domain names tend to be used for. We also reveal the existence of 3.02 million domain names that utilized multiple parking services, either simultaneously or by switching between them. Our study can contribute to the efficient analysis of malicious domain names that use domain parking services.
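The temporal analysis can be sketched as a scan over per-domain state histories for a parking period followed by a malicious label (the snapshots below are hypothetical):

```python
# Find domains that were labeled malicious at some point after being parked.
HISTORY = {  # hypothetical per-domain state snapshots over time
    "example-unused.com": ["parked", "parked", "phishing"],
    "example-shop.com":   ["active", "parked", "parked"],
}

def malicious_after_parking(history: dict) -> list:
    bad = {"phishing", "malware"}
    return [d for d, states in history.items()
            if any(s == "parked" and any(t in bad for t in states[i + 1:])
                   for i, s in enumerate(states))]

print(malicious_after_parking(HISTORY))  # ['example-unused.com']
```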
Takashi KOIDE Daiki CHIBA Mitsuaki AKIYAMA Katsunari YOSHIOKA Tsutomu MATSUMOTO
Web-based social engineering (SE) attacks manipulate users into performing specific actions, such as downloading malware and exposing personal information. Aiming to effectively lure users, some SE attacks, which we call multi-step SE attacks, comprise a sequence of web pages starting from a landing page and require browser interactions at each page. Moreover, different browser interactions executed on a web page often branch into multiple sequences that redirect users to different SE attacks. Although common systems analyze only landing pages or conduct browser interactions limited to a specific attack, little effort has been made to follow such sequences of web pages to collect multi-step SE attacks. We propose STRAYSHEEP, a system that automatically crawls sequences of web pages and detects diverse multi-step SE attacks. We evaluate the effectiveness of STRAYSHEEP's three modules (landing-page collection, web crawling, and SE detection) in terms of the rate of collected landing pages leading to SE attacks, the efficiency of web crawling in reaching more SE attacks, and the accuracy of attack detection. Our experimental results indicate that STRAYSHEEP can reach 20% more SE attacks than Alexa top sites and search results for trending words, crawl five times more efficiently than a simple crawling module, and detect SE attacks with 95.5% accuracy. We demonstrate that STRAYSHEEP can collect various SE attacks, not only those of a specific type, and we clarify attackers' techniques for tricking users into browser interactions that redirect them to attacks.
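Crawling branching interaction sequences can be sketched as a breadth-first search over (page, interaction-path) states; the page graph below is a toy stand-in for real crawled pages:

```python
# BFS over pages where each browser interaction may branch to a different page.
from collections import deque

PAGE_GRAPH = {  # hypothetical: url -> {interaction: next_url}
    "landing": {"click_play": "fake_update", "click_ad": "survey"},
    "fake_update": {"click_download": "malware_dl"},
    "survey": {}, "malware_dl": {},
}

def crawl(start: str, max_depth: int = 3):
    seen, queue = set(), deque([(start, ())])
    while queue:
        url, path = queue.popleft()
        if url in seen or len(path) > max_depth:
            continue
        seen.add(url)
        print(url, "via", " -> ".join(path) or "(start)")
        for action, nxt in PAGE_GRAPH.get(url, {}).items():
            queue.append((nxt, path + (action,)))

crawl("landing")
```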
Yuta ISHII Takuya WATANABE Mitsuaki AKIYAMA Tatsuya MORI
Android is one of the most popular mobile device platforms. However, since Android apps can be disassembled easily, attackers inject additional advertisements or malicious code into original apps and redistribute them, and a non-negligible number of such repackaged apps exist. We generally call these maliciously repackaged apps "clones." However, some apps are not clones yet are similar to each other; we call such apps "relatives." In this work, we developed a framework called APPraiser that extracts similar apps from a large dataset and classifies them into clones and relatives. We used APPraiser to study over 1.3 million apps collected from both official and third-party marketplaces. Our extensive analysis revealed the following: in the official marketplace, 79% of similar apps were attributed to relatives, whereas in the third-party marketplace, 50% of similar apps were attributed to clones. The majority of relatives are apps developed by prolific developers in both marketplaces. We also found that, of the clones in the third-party market that were originally published in the official market, 76% are malware.
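A minimal sketch of the clone/relative distinction: pairs of apps with highly overlapping code fingerprints are split by whether they share a developer signing certificate (the data, threshold, and fingerprint representation are illustrative):

```python
# Classify a similar app pair as a clone or a relative by signing certificate.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def classify(app1: dict, app2: dict, threshold: float = 0.8) -> str:
    if jaccard(app1["methods"], app2["methods"]) < threshold:
        return "not similar"
    return "relative" if app1["cert"] == app2["cert"] else "clone"

a = {"cert": "dev1", "methods": {"m1", "m2", "m3", "m4"}}
b = {"cert": "dev2", "methods": {"m1", "m2", "m3", "m4"}}
print(classify(a, b))  # clone
```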
Yuta TAKATA Mitsuaki AKIYAMA Takeshi YAGI Takeshi YADA Shigeki GOTO
An incident response organization such as a CSIRT contributes to preventing the spread of malware infection by analyzing compromised websites and sending abuse reports with detected URLs to webmasters. However, abuse reports containing only URLs are not sufficient to clean up such websites. In addition, it is difficult to analyze malicious websites across different client environments because these websites change their behavior depending on the client environment. To expedite compromised-website clean-up, it is important to provide fine-grained information such as malicious URL relations, the precise position of compromised web content, and the target range of client environments. In this paper, we propose a new method of constructing a redirection graph with context, i.e., which web content redirects to malicious websites. The proposed method analyzes a website in a multi-client environment to identify which client environments are exposed to threats. We evaluated our system using crawling datasets of approximately 2,000 compromised websites. The results show that our system successfully identified malicious URL relations and compromised web content, reducing the number of URLs and the amount of web content to be analyzed by incident responders to 15.0% and 0.8%, respectively. Furthermore, it identified the target range of client environments in 30.4% of websites and, by leveraging the target information, a vulnerability that had been used in malicious websites. This fine-grained analysis by our system would contribute to improving the daily work of incident responders.
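Building a redirection graph with context can be sketched from crawl logs that record which content triggered each redirect (the log format and entries below are hypothetical):

```python
# Build and print a redirection graph annotated with the triggering content.
import collections

LOG = [  # hypothetical: (source_url, destination_url, triggering_content)
    ("site.example/index", "cdn.example/lib.js", "script tag"),
    ("cdn.example/lib.js", "evil.example/ek", "injected iframe"),
]

graph = collections.defaultdict(list)
for src, dst, ctx in LOG:
    graph[src].append((dst, ctx))

def trace(url: str, depth: int = 0):
    for dst, ctx in graph.get(url, []):
        print("  " * depth + f"{url} --[{ctx}]--> {dst}")
        trace(dst, depth + 1)

trace("site.example/index")
```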
Hiroki NAKANO Fumihiro KANEI Yuta TAKATA Mitsuaki AKIYAMA Katsunari YOSHIOKA
Android app developers sometimes copy code snippets posted on question-and-answer (Q&A) websites and use them in their apps. However, if a code snippet contains vulnerabilities, Android apps containing that snippet could have the same vulnerabilities. Despite this, the effect of such vulnerable snippets on Android apps has not been investigated in depth. In this paper, we investigate the correspondence between vulnerable code snippets and vulnerable apps: we collect code snippets from a Q&A website, extract possibly vulnerable snippets, and calculate the similarity between those snippets and the bytecode of vulnerable apps. Our experimental results show that 15.8% of the evaluated apps with SSL implementation vulnerabilities (improper hostname verification), 31.7% of those with SSL certificate verification vulnerabilities, and 3.8% of those with WebView remote code execution vulnerabilities contain possibly vulnerable code snippets from Stack Overflow. In the worst case, a single problematic snippet caused 4,844 apps to contain a vulnerability, accounting for 31.2% of all collected apps with that vulnerability.
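The snippet-to-app matching can be sketched as identifier-normalized token similarity, with difflib standing in for the paper's bytecode-level comparison (the keyword set and example snippet are illustrative):

```python
# Compare a Q&A snippet with app code after normalizing identifiers away.
import difflib
import re

KEYWORDS = {"return", "public", "void", "if", "new"}  # illustrative subset

def normalize(code: str) -> list:
    tokens = re.findall(r"[A-Za-z_]\w*|\S", code)
    return ["ID" if re.match(r"[A-Za-z_]\w*$", t) and t not in KEYWORDS else t
            for t in tokens]

def similarity(snippet: str, method: str) -> float:
    return difflib.SequenceMatcher(None, normalize(snippet),
                                   normalize(method)).ratio()

snippet = "public void onReceivedSslError(SslErrorHandler h) { h.proceed(); }"
method  = "public void onReceivedSslError(SslErrorHandler e) { e.proceed(); }"
print(round(similarity(snippet, method), 2))  # 1.0 after normalization
```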
Yuta TAKATA Mitsuaki AKIYAMA Takeshi YAGI Takeo HARIU Kazuhiko OHKUBO Shigeki GOTO
Security researchers/vendors detect malicious websites based on several website features extracted by honeyclient analysis. However, web-based attacks continue to grow more sophisticated along with the development of countermeasure techniques. Attackers detect the honeyclient and evade analysis using sophisticated JavaScript code that indirectly identifies vulnerable clients by abusing differences among JavaScript implementations. On the basis of the evasion results, attackers deliver malware only to targeted clients while avoiding honeyclient analysis. We are therefore faced with the problem that honeyclients cannot analyze such malicious websites. Nevertheless, the evasive behavior itself is observable: the results of accessing a malicious website with a targeted client differ from those obtained with a honeyclient. In this paper, we propose a method of extracting evasive code by leveraging these differences to investigate current evasion techniques. Our method analyzes HTTP transactions of the same website obtained using two types of clients: a real browser as a targeted client and a browser emulator as a honeyclient. By evaluating our method with 8,467 JavaScript samples executed on 20,272 malicious websites, we discovered previously unknown evasion techniques that abuse differences among JavaScript implementations. These findings will contribute to improving the analysis capabilities of conventional honeyclients.
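At its simplest, the differential idea reduces to comparing the sets of resources served to a real browser and to an emulator for the same site (the transaction data below is hypothetical):

```python
# Diff the URL sets fetched by the two client types for the same website.
real_browser = {"/index", "/check.js", "/exploit.js", "/payload.bin"}
emulator     = {"/index", "/check.js", "/benign.html"}

# Resources served only to the real browser hint at evasive branching.
print("served only to real browser:", sorted(real_browser - emulator))
print("served only to emulator:", sorted(emulator - real_browser))
```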
Bo SUN Mitsuaki AKIYAMA Takeshi YAGI Mitsuhiro HATADA Tatsuya MORI
Modern web users may encounter a browser security threat called drive-by-download attacks when surfing the Internet. Drive-by-download attacks use exploit code to take control of a user's web browser, and many web users do not take such underlying threats into account when clicking URLs. URL blacklisting is one practical approach to thwarting browser-targeted attacks; however, a blacklist cannot cope with previously unseen malicious URLs, so keeping it updated is crucial to its effectiveness. Given these observations, we propose a framework called automatic blacklist generator (AutoBLG) that automates the collection of new malicious URLs starting from an existing URL blacklist. The primary mechanism of AutoBLG is to expand the search space of web pages while reducing the number of URLs to be analyzed by applying several pre-filters, such as similarity search, to accelerate blacklist generation. AutoBLG consists of three primary components: URL expansion, URL filtration, and URL verification. Through extensive analysis using a high-performance web client honeypot, we demonstrate that AutoBLG can successfully discover new and previously unknown drive-by-download URLs from the vast web space.
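The three-stage pipeline can be sketched with stub components, where placeholder predicates stand in for the similarity pre-filter and the client-honeypot verification (all names and logic below are illustrative, not AutoBLG's implementation):

```python
# Expansion -> filtration -> verification pipeline with stub stages.
SEED_BLACKLIST = ["http://bad.example/ek1"]   # hypothetical seed

def expand(urls):                 # stub: e.g., same-IP / same-domain neighbors
    return [u + suffix for u in urls for suffix in ("", "/a", "/b")]

def similar_to_known_bad(url):    # stub for the similarity-search pre-filter
    return "ek" in url

def verify(url):                  # stub for client-honeypot verification
    return url.endswith("/a")

candidates = [u for u in expand(SEED_BLACKLIST) if similar_to_known_bad(u)]
new_blacklist = [u for u in candidates if verify(u)]
print(new_blacklist)              # ['http://bad.example/ek1/a']
```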
Ayako A. HASEGAWA Mitsuaki AKIYAMA Naomi YAMASHITA Daisuke INOUE Tatsuya MORI
Although security and privacy technologies are incorporated into every device and service, the complexity of these concepts confuses non-expert users. Prior research has shown that non-expert users ask strangers for advice about digital media use online. In this study, to clarify the security and privacy concerns of non-expert users in their daily lives, we investigated security- and privacy-related question posts on a Question-and-Answer (Q&A) site for non-expert users. We conducted a thematic analysis of 445 question posts. We identified seven themes among the questions and found that users asked about cyberattacks the most, followed by authentication and security software. We also found that there was a strong demand for answers, especially for questions related to privacy abuse and account/device management. Our findings provide key insights into what non-experts are struggling with when it comes to privacy and security and will help service providers and researchers make improvements to address these concerns.
Mitsuaki AKIYAMA Takeshi YAGI Youki KADOBAYASHI Takeo HARIU Suguru YAMAGUCHI
We investigated client honeypots for detecting and circumstantially analyzing drive-by download attacks. A client honeypot requires both improved inspection performance and in-depth analysis for inspecting and discovering malicious websites. However, OS overhead in client honeypot operation cannot be ignored when improving multiplication performance. We propose a client honeypot system that combines multi-OS and multi-process honeypot approaches, and we implemented this system to evaluate its performance. The process sandbox mechanism, a security measure for our multi-process approach, provides a virtually isolated environment for each web browser. It prevents system alteration by a compromised browser process through I/O redirection of file/registry accesses. To resolve the inconsistent file/registry views caused by I/O redirection, our process sandbox mechanism enables the web browser and its corresponding plug-ins to share a virtual system view, so multiple processes can run simultaneously without interfering with one another on a single OS. In a field trial, we confirmed that our multi-process approach was three or more times faster than a single process, and that our multi-OS approach improved system performance linearly with the number of honeypot instances. In addition, our long-term investigation indicated that 72.3% of exploitations target browser-helper processes. A honeypot that restricts all process creation events cannot identify exploitation targeting a browser-helper process; in contrast, our process sandbox mechanism permits the creation of browser-helper processes and can thus identify these exploitations without false negatives. Our proposed system with these multiplication approaches thus improves performance efficiency and enables in-depth analysis on high-interaction systems.
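The I/O-redirection idea can be sketched as copy-on-write path redirection into a per-process shadow root; this is a conceptual Python illustration of the semantics, not the system's OS-level mechanism, and the shadow path is hypothetical:

```python
# Divert file writes from a sandboxed process into a private shadow directory
# so a compromised process cannot alter the shared system view.
import os
import shutil

def redirected_open(path: str, mode: str, shadow_root: str = "/tmp/sandbox1"):
    if "w" in mode or "a" in mode:
        shadow = os.path.join(shadow_root, path.lstrip("/"))
        os.makedirs(os.path.dirname(shadow), exist_ok=True)
        if os.path.exists(path) and not os.path.exists(shadow):
            shutil.copy2(path, shadow)       # copy-on-write semantics
        return open(shadow, mode)
    return open(path, mode)                  # reads fall through to the real view

# Example (writes land in the shadow root, not the real file):
# redirected_open("/etc/hosts", "a")  # -> /tmp/sandbox1/etc/hosts
```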