Given the rapidly increasing number of academic papers, searching for and citing appropriate references has become a nontrivial task during manuscript composition. Recommending a handful of candidate papers for a working draft could ease the burden on authors. Conventional approaches to citation recommendation generally recommend a single ground-truth citation for a query context drawn from an input manuscript. However, it is common for a given context to be supported by two or more co-citation pairs. Here, we propose a novel scientific paper modeling approach for citation recommendation, namely the Multi-Positive BERT Model for Citation Recommendation (MP-BERT4REC), combined with a series of Multi-Positive Triplet objectives to recommend multiple positive citations for a query context. The proposed approach has the following advantages. First, the proposed multi-positive objectives are effective in recommending multiple positive candidates. Second, we adopt noise distributions based on historical co-citation frequencies; thus, MP-BERT4REC is not only effective in recommending high-frequency co-citation pairs, but also significantly improves the retrieval of low-frequency ones. Third, the proposed dynamic context sampling strategy captures macroscopic citing intents from a manuscript and makes the citation embeddings content-dependent, which further improves performance. Single and multiple positive recommendation experiments confirmed that MP-BERT4REC delivers significant improvements over current methods. It also retrieves the full list of co-citations and historically low-frequency pairs more effectively than prior work.
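The abstract does not spell out the Multi-Positive Triplet objectives, so the following is only a minimal sketch of one plausible form: a hinge-style triplet loss averaged over every (positive, negative) pair for a single query context. The function names, the Euclidean distance, and the margin value are all assumptions made for illustration, not the authors' formulation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def multi_positive_triplet_loss(query, positives, negatives, margin=1.0):
    """Hypothetical objective: mean hinge loss over all (positive, negative) pairs.

    Each positive citation embedding should be closer to the query context
    embedding than each negative one by at least `margin`.
    """
    losses = [
        max(0.0, euclidean(query, p) - euclidean(query, n) + margin)
        for p in positives
        for n in negatives
    ]
    return sum(losses) / len(losses)

# Toy embeddings (assumed values): two positives near the query, one far negative.
query = [0.0, 0.0]
positives = [[0.0, 1.0], [1.0, 0.0]]
negatives = [[3.0, 4.0]]
loss = multi_positive_triplet_loss(query, positives, negatives)
```

With these toy vectors every positive already sits well inside the margin, so the loss is zero; swapping the roles of the positives and negatives makes it positive.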
Ryusei NAGASAWA Keisuke FURUMOTO Makoto TAKITA Yoshiaki SHIRAISHI Takeshi TAKAHASHI Masami MOHRI Yasuhiro TAKANO Masakatu MORII
The Topics over Time (TOT) model allows users to be aware of changes in certain topics over time. The proposed method divides a dataset of security blog posts into fixed periods, with an overlap between consecutive periods, and inputs each division to the TOT model. The results suggest that the method extracts topics that include malware and attack campaign names, which are appropriate for the multi-labeling of cyber threat intelligence reports.
Yuki NOYORI Hironori WASHIZAKI Yoshiaki FUKAZAWA Hideyuki KANUKA Keishi OOSHIMA Shuhei NOJIRI Ryosuke TSUCHIYA
Resource limitations require that bugs be resolved efficiently. The bug modification process uses bug reports, which are generated from service user reports. Developers read these reports, fix the bugs, and discuss them by posting comments directly in the bug reports. Although several studies have investigated the initial report in bug reports, few have examined the comments. Our research focuses on the comments in bug reports. Currently, everyone is free to comment, but the bug fixing time may be affected by how people comment. Herein we investigate the topics of comments in bug reports. Mixed topics do not affect the bug fixing time. However, the bug fixing time tends to be shorter when the discussion of the phenomenon is short.
Chun-Jung WU Shin-Ying HUANG Katsunari YOSHIOKA Tsutomu MATSUMOTO
A drastic increase in cyberattacks targeting Internet of Things (IoT) devices using telnet protocols has been observed. IoT malware continues to evolve, and the diversity of operating systems and environments increases the difficulty of executing malware samples in an observation setting. To address this problem, we sought to develop an alternative means of investigation by using the telnet logs of IoT honeypots and analyzing malware without executing it. In this paper, we present a malware classification method based on malware binaries, command sequences, and meta-features. We employ both unsupervised and supervised learning algorithms, as well as text-mining algorithms for handling unstructured data. Clustering analysis is applied to find malware family members and reveal their inherent features for better explanation. First, the malware binaries are grouped using similarity analysis. Then, we extract key patterns of interaction behavior using an N-gram model. We also train a multiclass classifier to identify IoT malware categories based on common infection behavior. For misclassified subclasses, second-stage sub-training is performed using a file meta-feature. Our results demonstrate 96.70% accuracy, with high precision and recall. The clustering results reveal variant attack vectors and one denial of service (DoS) attack that used pure Linux commands.
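The N-gram step described above can be illustrated with a short sketch. It counts contiguous command n-grams across telnet sessions to surface frequent interaction patterns. The session contents and the function names are hypothetical; the abstract does not state the value of N or how patterns are selected.

```python
from collections import Counter

def command_ngrams(commands, n=2):
    """Return all contiguous n-grams of a command sequence as tuples."""
    return [tuple(commands[i:i + n]) for i in range(len(commands) - n + 1)]

def top_patterns(sessions, n=2, k=3):
    """Count n-grams over all sessions and return the k most frequent ones."""
    counts = Counter()
    for commands in sessions:
        counts.update(command_ngrams(commands, n))
    return counts.most_common(k)

# Hypothetical honeypot sessions, each a list of commands in arrival order.
sessions = [
    ["enable", "system", "shell", "sh"],
    ["enable", "system", "shell", "busybox"],
]
patterns = dict(top_patterns(sessions, n=2, k=10))
```

Frequent n-grams such as `("system", "shell")` could then serve as features for the clustering and classification stages.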
Chen CHANG Jianjun CAO Qin FENG Nianfeng WENG Yuling SHANG
Most existing truth discovery approaches are designed for structured data and cannot meet the strong need to extract trustworthy information from raw text data, owing to its unique characteristics such as the multifactorial property of text answers (i.e., an answer may contain multiple key factors) and the diversity of word usage (i.e., different words may have the same semantic meaning). Text answers are neither absolutely correct nor absolutely wrong; most answers are partially correct, which differs considerably from the situation in traditional truth discovery. To address these challenges, we propose an optimization-based text truth discovery model that jointly groups keywords extracted from the answers to a specific question into a set of multiple factors. Then, we select a subset of these factors as the identified truth set for each question by using a parallel ant colony synchronization optimization algorithm. After that, the answers to each question can be ranked based on the similarity between the factors each answer provides and the identified truth factors. Experimental results on a real dataset show that, although text data structures are complex, our model can still find reliable answers compared with retrieval-based and state-of-the-art approaches.
Jung-Been LEE Taek LEE Hoh Peter IN
Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and contain a mix of textual types within the natural text. These characteristics degrade the performance of topic modeling. Among several natural language pre-processing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches are used to generate effective stop words, the resulting lists are outdated or too general to apply to mining software artifacts. In addition, the performance of the topic model is sensitive to the datasets used in training for each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. By measuring topic coherence among the words in each topic using Pointwise Mutual Information (PMI), we add words with a low PMI score to our stop word list in every topic modeling loop. Through our experiment, we show that our stop word list yields higher topic model performance than lists from other approaches.
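As a rough illustration of the PMI-based selection described above, the sketch below computes document-level PMI, log(p(w1, w2) / (p(w1) p(w2))), and flags topic words whose mean PMI with the other words of the topic falls below a threshold. The document-level co-occurrence counting, the use of the mean, the threshold, and the function names are assumptions; the abstract does not give these details.

```python
import math

def pmi(w1, w2, documents):
    """Document-level PMI: log(p(w1, w2) / (p(w1) * p(w2)))."""
    n = len(documents)
    doc_sets = [set(d) for d in documents]
    c1 = sum(w1 in d for d in doc_sets)
    c2 = sum(w2 in d for d in doc_sets)
    c12 = sum(w1 in d and w2 in d for d in doc_sets)
    if c12 == 0:
        # The words never co-occur, so PMI is treated as negative infinity.
        return float("-inf")
    return math.log((c12 / n) / ((c1 / n) * (c2 / n)))

def low_pmi_words(topic_words, documents, threshold):
    """Return topic words whose mean PMI with the other topic words is below threshold."""
    flagged = []
    for w in topic_words:
        scores = [pmi(w, other, documents) for other in topic_words if other != w]
        if sum(scores) / len(scores) < threshold:
            flagged.append(w)
    return flagged

# Hypothetical tokenized artifacts; "the" co-occurs weakly with the other words.
docs = [["bug", "fix", "the"], ["bug", "fix"], ["the", "test"], ["the", "build"]]
stop_word_candidates = low_pmi_words(["bug", "fix", "the"], docs, threshold=0.0)
```

In a loop like the one the abstract describes, the flagged words would be appended to the stop word list before the topic model is retrained.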
Xiaoqiong ZHAO Shanping LI Huan YU Ye WANG Weiwei QIU
Background: The use of third-party libraries is an integral part of many applications, but choosing libraries is time-consuming even for experienced developers. Automated recommendation systems for libraries have been widely researched to help developers choose libraries. Aim: From a software engineering perspective, our research aims to give developers a reliable recommended list of third-party libraries in the early phase of the software development lifecycle to help them build their development environment faster. From a technical perspective, our research aims to build a generalizable recommendation system framework that combines collaborative filtering and topic modeling techniques in order to significantly improve the performance of library recommendation. Our work in this research: 1) we design a hybrid methodology that combines collaborative filtering and LDA text mining; 2) we build a recommendation system framework based on this hybrid methodology; 3) we conduct an experiment using the data of 1,013 mobile application projects to validate the methodology and framework; 4) we evaluate the results of the experiment. Conclusions: 1) the hybrid methodology combining collaborative filtering and LDA can significantly improve the performance of library recommendation; 2) the framework based on the hybrid methodology works well for library recommendation and helps developers choose libraries.
Further research is necessary to improve the performance of library recommendation, including: 1) using more accurate NLP technologies to improve the correlation analysis; 2) trying other similarity calculation methods for collaborative filtering to raise the accuracy; 3) in this research, we only introduced the time-series approach to the framework and ran an experiment as a comparative trial, and the results show that the performance improves continuously, so in further research we plan to use time-series data mining as the basic methodology to update the framework.
Youwei LU Shogo OKADA Katsumi NITTA
We propose a novel method, built upon the hierarchical Dirichlet process hidden semi-Markov model, to reveal the content structures of unstructured domain-specific texts. The content structures of texts consisting of sequential local contexts are useful for tasks such as text retrieval, classification, and text mining. The prominent feature of our model is the use of recursive uniform partitioning, a stochastic process that takes a view different from existing HSMMs in modeling state duration. We show that recursive uniform partitioning plays an important role in avoiding rapid switching between hidden states. Remarkably, our method greatly outperforms others in terms of ranking performance in our text retrieval experiments, and provides more accurate features for an SVM to achieve higher F1 scores in our text classification experiments. These experimental results suggest that our method can yield improved representations of domain-specific texts. Furthermore, we present a method of automatically discovering the local contexts that account for why a text is classified as a positive instance in supervised learning settings.
Kwanho KIM Josué OBREGON Jae-Yoon JUNG
With the recent growth of online social network services such as Facebook and Twitter, people are able to easily share information with each other by writing posts or commenting on others' posts. In this paper, we first suggest a method of discovering the information flows of posts on Facebook and their underlying contexts by combining process mining and text mining techniques. Based on comments collected from Facebook, the experimental results illustrate how the proposed method can be applied to analyze the information flows and contexts of posts on social network services.
Yoshitatsu MATSUDA Kazunori YAMAGUCHI
In order to implement multidimensional scaling (MDS) efficiently, we propose a new method named “global mapping analysis” (GMA), which applies stochastic approximation to minimizing MDS criteria. GMA can solve MDS more efficiently in both the linear case (classical MDS) and the non-linear case (e.g., ALSCAL), provided that the MDS criteria are polynomial. GMA separates the polynomial criteria into local factors and global factors. Because the global factors need to be calculated only once in each iteration, GMA is of linear order in the number of objects. Numerical experiments on artificial data verify the efficiency of GMA. It is also shown that GMA can discover various interesting structures in massive document collections.
Won-Gyo JUNG Sang-Sung PARK Dong-Sik JANG
Whether a patent is registered is usually based on the subjective judgment of patent examiners, who may decide according to their personal knowledge, background, and so on. In this paper, we propose a novel method for estimating patent registration based on patent data. The method estimates whether a patent will be registered by utilizing the objective past history of patent data instead of relying on subjective judgment as existing methods do. The proposed method constructs an estimation model by applying a multivariate statistical algorithm. In the prediction model, the application date, activity index, IPC code, and similarity of registration refusal are set as the input values, and patent registration and rejection are set as the output values. We believe that our method will contribute to improved reliability of patent registration in that it achieves highly reliable estimation results through the past history of patent data, in contrast to most previous methods, which rely on subjective judgments by patent agents.
Hideki KAWAI Adam JATOWT Katsumi TANAKA Kazuo KUNIEDA Keiji YAMADA
This paper introduces a future and past search engine, ChronoSeeker, which can help users develop long-term strategies for their organizations. To provide on-demand searches, we tackled two technical issues: (1) organizing efficient event searches and (2) filtering out noise from search results. Our system employs query expansion with typical expressions related to event information, such as year expressions, temporal modifiers, and context terms, for efficient event searches. For noise filtering, we use a machine-learning technique to classify candidates into event information or non-event information, using heuristic features and lexical patterns derived from a text-mining approach. Our experiment revealed that filtering achieved an 85% F-measure, and that query expansion could collect dozens more events than searches without expansion.
Hiroyuki SAKAI Shigeru MASUYAMA
We propose a method of assigning polarity to causal information extracted from Japanese financial articles concerning the business performance of companies. Our method assigns polarity (positive or negative) to causal information in accordance with business performance, e.g., "zidousya no uriage ga koutyou" (Sales of cars are good), to which positive polarity is assigned. Causal expressions with polarity assigned by our method can be used, for example, to analyze the content of articles concerning business performance in detail. First, our method classifies articles concerning business performance into positive articles and negative articles. Using them, our method assigns polarity (positive or negative) to causal information extracted from the set of articles concerning business performance. Although our method needs a training dataset for classifying articles concerning business performance into positive and negative ones, it does not need a training dataset for assigning polarity to causal information. Hence, even when causal information does not appear in the training dataset used for classifying articles, our method is able to assign it polarity by using statistical information from the classified sets of articles. We evaluated our method and confirmed that it attained 74.4% precision and 50.4% recall in assigning positive polarity, and 76.8% precision and 61.5% recall in assigning negative polarity.
Michiko YASUKAWA Hui Tian LIM Hidetoshi YOKOO
In the Malay language, there are no conjugations or declensions, and affixes have important grammatical functions. The same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversation, it is essential to use precise words in formal speech and written texts. In Malay, derivative words are used to make sentences clear, and derivation is achieved mainly through affixes. There are approximately a hundred possible derivative forms of a root word in the written language of educated Malay speakers. Therefore, the composition of Malay words can be complicated. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. Although several types of stemming algorithms are available for text processing in English and some other languages, they cannot be used to overcome the difficulties in Malay word stemming. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The set of rules is aimed at reducing the occurrence of under-stemming errors, while the dictionaries are intended to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. For the experiment, the text data were actual web pages collected from the World Wide Web to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
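The interplay between affix rules and the two dictionaries can be sketched as follows. This is not the authors' algorithm: the affix lists are a small illustrative subset, the lookup order is an assumption, and the dictionaries are toy examples. The idea shown is that rule-based stripping proposes candidate roots (reducing under-stemming), while a candidate is accepted only if it appears in the root-word dictionary (reducing over-stemming).

```python
# Illustrative subset of Malay affixes (assumed; not the authors' rule set).
PREFIXES = ("meng", "mem", "men", "me", "ber", "ter", "di", "pe")
SUFFIXES = ("kan", "an", "i")

def stem(word, roots, derivatives):
    """Return the root of `word`, or `word` itself if no valid root is found."""
    # 1) Known derivative words map directly to their root.
    if word in derivatives:
        return derivatives[word]
    # 2) Root words are returned unchanged, which guards against over-stemming.
    if word in roots:
        return word
    # 3) Try every prefix/suffix combination; keep only candidates that are known roots.
    candidates = []
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + SUFFIXES:
            if rest.endswith(suffix):
                candidate = rest[:len(rest) - len(suffix)]
                if candidate in roots:
                    candidates.append(candidate)
    # Prefer the longest valid root; otherwise leave the word untouched.
    return max(candidates, key=len) if candidates else word

# Toy dictionaries (hypothetical contents).
roots = {"makan", "main", "ajar"}
derivatives = {"pelajar": "ajar"}
stemmed = [stem(w, roots, derivatives) for w in ["makanan", "bermain", "dimakan", "pelajar", "kanan"]]
```

Here "kanan" is left intact because stripping "-an" would yield "kan", which is not in the root-word dictionary.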
Bui Quang HUNG Masanori OTSUBO Yoshinori HIJIKATA Shogo NISHIDA
Recently, the semantic text portion (STP) has been gaining attention in the field of Web mining. An STP is a text portion in the original page that is semantically related to the anchor pointing to the target page. STPs may include facts and people's opinions about the target pages. STPs can be used for various upper-level applications such as automatic summarization and document categorization. In this paper, we concentrate on extracting STPs. We conduct a survey of STPs to examine their positions in original pages and to identify HTML tags that can separate STPs from the other text portions in original pages. We then develop a method for extracting STPs based on the results of the survey. The experimental results show that our method achieves high performance.
Yukitaka KUSUMURA Yoshinori HIJIKATA Shogo NISHIDA
Net auctions have been widely utilized with the recent development of the Internet. However, there are too many items for bidders to easily select the most suitable one. We aim to support bidders in net auctions by automatically generating a table that contains the features of several items for comparison. We construct a system called NTM-Agent (Net auction Text Mining Agent). The system collects the web pages of items and extracts the items' features from the pages. After that, it generates a table that contains the extracted features. This research focuses on two problems in the process. The first problem is that if the system collects items automatically, the results contain items that differ from the user's target. The second problem is that the descriptions in net auctions are not uniform (there are different formats such as sentences, itemized lists, and tables, and the subjects of some sentences are omitted). Therefore, it is difficult to extract information from the descriptions using conventional information extraction methods. This research proposes methods to solve these problems. For the first problem, NTM-Agent filters the items using correlation rules about the keywords in the titles and the item descriptions. These rules are created semi-automatically by a support tool. For the second problem, NTM-Agent extracts the information by distinguishing the formats. It also learns feature values from plain examples for future extraction.
Akira TERADA Takenobu TOKUNAGA
Nominalization is a linguistic phenomenon in which events usually described in terms of clauses are expressed in the form of noun phrases. Extracting event structures is an important task in text mining applications. To achieve this goal, clauses are parsed and the argument structures of the main verbs are extracted from the parsed results. This kind of preprocessing has been common in past research. In order to extract event structures from nominalized phrases as well, we need to establish a technique for transforming nominalized phrases into clauses. In this paper, we propose a method to transform nominalized phrases into clauses using a corpus-based approach. The proposed method first enumerates possible predicate/argument structures by referring to a nominalized phrase (noun phrase) and then ranks them based on the frequency of each argument in the corpus. The algorithm based on this method was evaluated using a corpus consisting of 24,626 aviation safety reports in English, and it achieved 78% accuracy in transformation. The algorithm was also evaluated within a text mining application that extracts events and their cause-effect relations from the texts, where it improved the application's performance.
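The enumerate-then-rank idea can be sketched in a few lines. The example below assumes that a nominalized phrase has already been split into a verb and a noun argument, that candidate structures differ only in the grammatical role of the noun, and that corpus frequencies of (verb, role, noun) triples are available. All of these, including the counts, are hypothetical simplifications for illustration.

```python
from collections import Counter

def rank_structures(verb, noun, corpus_counts, roles=("subject", "object")):
    """Enumerate candidate (verb, role, noun) structures and rank them by corpus frequency."""
    candidates = [(verb, role, noun) for role in roles]
    # A Counter returns 0 for unseen triples, so unattested readings rank last.
    return sorted(candidates, key=lambda c: corpus_counts[c], reverse=True)

# Hypothetical counts, as if gathered from parsed clauses in a corpus.
corpus_counts = Counter({
    ("fail", "subject", "engine"): 12,
    ("fail", "object", "engine"): 1,
})

# "engine failure" -> the reading "the engine failed" ranks first.
ranking = rank_structures("fail", "engine", corpus_counts)
```

The top-ranked structure would then be rendered as a clause for downstream event extraction.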