
Keyword Search Result

[Keyword] text mining (17 hits)

1-17 of 17 hits
  • MP-BERT4REC: Recommending Multiple Positive Citations for Academic Manuscripts via Content-Dependent BERT and Multi-Positive Triplet

    Yang ZHANG  Qiang MA  

     
    PAPER-Natural Language Processing

    Publicized: 2022/08/08
    Vol: E105-D No:11
    Page(s): 1957-1968

    Considering the rapidly increasing number of academic papers, searching for and citing appropriate references has become a nontrivial task during manuscript composition. Recommending a handful of candidate papers for a working draft could ease the burden on authors. Conventional approaches to citation recommendation generally recommend one ground-truth citation from an input manuscript for a query context. However, it is common for a given context to be supported by two or more co-citation pairs. Here, we propose a novel scientific paper model for citation recommendation, the Multi-Positive BERT Model for Citation Recommendation (MP-BERT4REC), combined with a series of Multi-Positive Triplet objectives to recommend multiple positive citations for a query context. The proposed approach has the following advantages: First, the proposed multi-positive objectives are effective in recommending multiple positive candidates. Second, we adopt noise distributions based on historical co-citation frequencies; thus, MP-BERT4REC is not only effective in recommending high-frequency co-citation pairs but also significantly improves the retrieval of low-frequency ones. Third, the proposed dynamic context sampling strategy captures macroscopic citing intents from a manuscript and makes the citation embeddings content-dependent, which further improves performance. Experiments on both single and multiple positive recommendation confirmed that MP-BERT4REC delivers significant improvements over current methods, and that it retrieves full co-citation lists and historically low-frequency pairs better than prior works.
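    A minimal sketch of a multi-positive triplet objective is shown below: it averages a hinge loss over every (positive, negative) pair for one query context. The cosine distance, margin value, and random toy embeddings are illustrative assumptions; the paper's exact loss, noise distributions, and BERT encoder are not reproduced here.

```python
# Sketch of a multi-positive triplet objective (illustrative, not the paper's exact loss).
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def multi_positive_triplet_loss(query, positives, negatives, margin=0.2):
    """Mean hinge loss over every (positive, negative) pair for one query context."""
    losses = [max(0.0, margin + cosine_distance(query, p) - cosine_distance(query, n))
              for p in positives for n in negatives]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
query = rng.normal(size=64)                          # embedding of the citing context
positives = [rng.normal(size=64) for _ in range(3)]  # several ground-truth citations
negatives = [rng.normal(size=64) for _ in range(5)]  # sampled non-citations
print(multi_positive_triplet_loss(query, positives, negatives))
```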

  • Partition-then-Overlap Method for Labeling Cyber Threat Intelligence Reports by Topics over Time

    Ryusei NAGASAWA  Keisuke FURUMOTO  Makoto TAKITA  Yoshiaki SHIRAISHI  Takeshi TAKAHASHI  Masami MOHRI  Yasuhiro TAKANO  Masakatu MORII  

     
    LETTER

    Publicized: 2021/02/24
    Vol: E104-D No:5
    Page(s): 556-561

    The Topics over Time (TOT) model allows users to be aware of changes in certain topics over time. The proposed method divides a dataset of security blog posts into fixed-length periods that share an overlap period and inputs each partition to the TOT model. The results suggest that the extracted topics include malware and attack campaign names appropriate for multi-labeling cyber threat intelligence reports.
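    The partition-then-overlap step can be sketched as follows: a time-stamped corpus is cut into fixed-length periods that share an overlap, and each window's documents would then be fed to a TOT model. The window lengths, dates, and documents below are illustrative assumptions, not the paper's settings.

```python
# Sketch of overlapping fixed-period partitioning (window sizes are illustrative).
from datetime import date, timedelta

def overlapping_windows(start, end, period_days=90, overlap_days=30):
    """Yield (window_start, window_end) pairs covering [start, end] with overlap."""
    step = timedelta(days=period_days - overlap_days)
    cur = start
    while cur < end:
        yield cur, min(cur + timedelta(days=period_days), end)
        cur += step

posts = [(date(2020, 1, 15), "Emotet campaign resumes"),
         (date(2020, 3, 2), "New ransomware strain observed"),
         (date(2020, 5, 20), "Phishing kit update")]

for w_start, w_end in overlapping_windows(date(2020, 1, 1), date(2020, 6, 30)):
    window_docs = [text for d, text in posts if w_start <= d < w_end]
    print(w_start, w_end, window_docs)  # each batch would be passed to the TOT model
```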

  • What are the Features of Good Discussions for Shortening Bug Fixing Time?

    Yuki NOYORI  Hironori WASHIZAKI  Yoshiaki FUKAZAWA  Hideyuki KANUKA  Keishi OOSHIMA  Shuhei NOJIRI  Ryosuke TSUCHIYA  

     
    PAPER

    Publicized: 2020/09/18
    Vol: E104-D No:1
    Page(s): 106-116

    Resource limitations require that bugs be resolved efficiently. The bug modification process uses bug reports, which are generated from service user reports. Developers read these reports and fix bugs, discussing them by posting comments directly in the bug reports. Although several studies have investigated the initial report in bug reports, few have researched the comments. Our research focuses on these comments. Currently, everyone is free to comment, but bug fixing time may be affected by how comments are written. Herein we investigate the topics of comments in bug reports. We find that mixed topics do not affect bug fixing time; however, bug fixing time tends to be shorter when the discussion of the phenomenon is short.

  • IoT Malware Analysis and New Pattern Discovery Through Sequence Analysis Using Meta-Feature Information

    Chun-Jung WU  Shin-Ying HUANG  Katsunari YOSHIOKA  Tsutomu MATSUMOTO  

     
    PAPER-Fundamental Theories for Communications

    Publicized: 2019/08/05
    Vol: E103-B No:1
    Page(s): 32-42

    A drastic increase in cyberattacks targeting Internet of Things (IoT) devices using the telnet protocol has been observed. IoT malware continues to evolve, and the diversity of OSs and environments increases the difficulty of executing malware samples in an observation setting. To address this problem, we sought to develop an alternative means of investigation that uses the telnet logs of IoT honeypots and analyzes malware without executing it. In this paper, we present a malware classification method based on malware binaries, command sequences, and meta-features. We employ both unsupervised and supervised learning algorithms, along with text-mining algorithms for handling unstructured data. Clustering analysis is applied to find malware family members and reveal their inherent features for better explanation. First, the malware binaries are grouped using similarity analysis. Then, we extract key patterns of interaction behavior using an N-gram model. We also train a multiclass classifier to identify IoT malware categories based on common infection behavior. For misclassified subclasses, second-stage sub-training is performed using a file meta-feature. Our results demonstrate 96.70% accuracy, with high precision and recall. The clustering results reveal variant attack vectors and one denial of service (DoS) attack that used pure Linux commands.
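    As a rough illustration of the command-sequence step, the sketch below counts sliding N-grams over one honeypot telnet session. The commands are invented, and in the described pipeline such counts would be combined with binary-similarity and meta-feature evidence before clustering and classification.

```python
# Sketch of N-gram feature extraction from a telnet command sequence (toy session).
from collections import Counter

def command_ngrams(commands, n=2):
    """Count sliding N-grams over a session's ordered command sequence."""
    return Counter(tuple(commands[i:i + n]) for i in range(len(commands) - n + 1))

session = ["enable", "system", "shell", "cat /proc/mounts",
           "wget http://198.51.100.1/bot", "chmod 777 bot", "./bot"]
for gram, count in command_ngrams(session, n=2).most_common(3):
    print(gram, count)
```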

  • Truth Discovery of Multi-Source Text Data

    Chen CHANG  Jianjun CAO  Qin FENG  Nianfeng WENG  Yuling SHANG  

     
    LETTER-Fundamentals of Information Systems

    Publicized: 2019/08/22
    Vol: E102-D No:11
    Page(s): 2249-2252

    Most existing truth discovery approaches are designed for structured data and cannot meet the strong need to extract trustworthy information from raw text data, owing to its unique characteristics such as the multifactorial property of text answers (i.e., an answer may contain multiple key factors) and the diversity of word usage (i.e., different words may have the same semantic meaning). For text answers, there is no absolute correctness or error; most answers may be partially correct, which is quite different from the situation in traditional truth discovery. To address these challenges, we propose an optimization-based text truth discovery model that jointly groups keywords extracted from the answers to a specific question into a set of multiple factors. We then select a subset of these factors as the identified truth set for each question using a parallel ant colony synchronization optimization algorithm. After that, the answers to each question can be ranked based on the similarities between the factors an answer provides and the identified truth factors. Experimental results on a real dataset show that, although text data structures are complex, our model can still find reliable answers compared with retrieval-based and state-of-the-art approaches.
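    The final ranking step can be sketched with set overlap, using Jaccard similarity as a stand-in for the paper's similarity measure; the truth factors and answers below are toy examples.

```python
# Sketch of ranking answers by overlap with identified truth factors (toy data).
def jaccard(a, b):
    """Set-overlap similarity between two keyword sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

truth_factors = {"paris", "capital", "france"}
answers = {"ans1": {"paris", "capital", "france"},
           "ans2": {"paris", "city"},
           "ans3": {"london", "capital"}}
ranked = sorted(answers, key=lambda k: jaccard(answers[k], truth_factors), reverse=True)
print(ranked)  # answers ordered by similarity to the identified truth factors
```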

  • Automatic Stop Word Generation for Mining Software Artifact Using Topic Model with Pointwise Mutual Information

    Jung-Been LEE  Taek LEE  Hoh Peter IN  

     
    PAPER-Software Engineering

    Publicized: 2019/05/27
    Vol: E102-D No:9
    Page(s): 1761-1772

    Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and contain a mix of textual types within the natural text, and these characteristics worsen the performance of topic modeling. Among the various natural language pre-processing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches exist for generating effective stop word lists, those lists are outdated or too general to apply to mining software artifacts. In addition, the performance of the topic model is sensitive to the training datasets used in each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. By measuring topic coherence among the words in a topic using Pointwise Mutual Information (PMI), we add words with a low PMI score to our stop word list in every topic modeling loop. Our experiments show that our stop word list yields higher topic model performance than lists from other approaches.
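    A minimal sketch of the PMI-based filtering loop is given below, assuming document-level co-occurrence, a hypothetical floor score for word pairs that never co-occur, and an illustrative threshold of zero; the paper's exact coherence computation may differ.

```python
# Sketch of PMI-based stop word candidate detection (floor and threshold are illustrative).
import math
from itertools import combinations

def pmi(w1, w2, doc_sets, n_docs):
    """Pointwise mutual information from document co-occurrence counts."""
    p1 = len(doc_sets[w1]) / n_docs
    p2 = len(doc_sets[w2]) / n_docs
    p12 = len(doc_sets[w1] & doc_sets[w2]) / n_docs
    return math.log(p12 / (p1 * p2)) if p12 > 0 else -1.0  # floor for non-co-occurring pairs

def low_coherence_words(topic_words, doc_sets, n_docs, threshold=0.0):
    """Return topic words whose average PMI with the other topic words falls below threshold."""
    scores = {w: [] for w in topic_words}
    for w1, w2 in combinations(topic_words, 2):
        s = pmi(w1, w2, doc_sets, n_docs)
        scores[w1].append(s)
        scores[w2].append(s)
    return [w for w, ss in scores.items() if sum(ss) / len(ss) < threshold]

docs = [{"parser", "token", "grammar"}, {"parser", "token", "grammar"},
        {"parser", "token", "grammar"}, {"todo"}, {"alpha"}, {"beta"}]
doc_sets = {}
for i, words in enumerate(docs):
    for w in words:
        doc_sets.setdefault(w, set()).add(i)
print(low_coherence_words(["parser", "token", "grammar", "todo"], doc_sets, len(docs)))
# -> ['todo']: weakly co-occurring words become stop word candidates for the next loop
```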

  • Accurate Library Recommendation Using Combining Collaborative Filtering and Topic Model for Mobile Development

    Xiaoqiong ZHAO  Shanping LI  Huan YU  Ye WANG  Weiwei QIU  

     
    PAPER-Software Engineering

    Publicized: 2018/12/18
    Vol: E102-D No:3
    Page(s): 522-536

    Background: The use of third-party libraries is an integral part of many applications, but choosing libraries is time-consuming even for experienced developers, and automated library recommendation systems have been widely researched to help developers choose them. Aim: From a software engineering perspective, our research aims to give developers a reliable recommended list of third-party libraries at an early phase of the software development lifecycle to help them build their development environments faster; from a technical perspective, it aims to build a generalizable recommendation framework that combines collaborative filtering and topic modeling in order to significantly improve the performance of library recommendation. Our work in this research: 1) we design a hybrid methodology that combines collaborative filtering with LDA text mining; 2) we build a recommendation framework based on this hybrid methodology; 3) we conduct a well-designed experiment to validate the methodology and framework using data from 1,013 mobile application projects; 4) we evaluate the results of the experiment. Conclusions: 1) the hybrid methodology combining collaborative filtering and LDA can significantly improve the performance of library recommendation; 2) based on the hybrid methodology, the framework works well in recommending libraries to developers. Further research is needed to improve performance, including: 1) using more accurate NLP techniques to improve the correlation analysis; 2) trying other similarity calculation methods for collaborative filtering to raise accuracy; 3) in this research we only brought a time-series approach into the framework as a comparative trial, and the results show that performance improves continuously, so in future research we plan to use time-series data mining as the basic methodology to update the framework.
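    The blending step of such a hybrid can be sketched in a few lines; the linear weighting and the toy candidate scores below are assumptions for illustration, not the paper's formula.

```python
# Sketch of blending collaborative-filtering and topic-similarity evidence (toy scores).
def hybrid_score(cf_score, topic_similarity, alpha=0.6):
    """Linear blend of CF and topic-model evidence for one candidate library."""
    return alpha * cf_score + (1 - alpha) * topic_similarity

# candidate -> (CF score from similar projects, LDA topic similarity to this project)
candidates = {"okhttp": (0.82, 0.71), "retrofit": (0.78, 0.90), "gson": (0.40, 0.35)}
ranked = sorted(candidates, key=lambda lib: hybrid_score(*candidates[lib]), reverse=True)
print(ranked)  # libraries ordered by blended evidence
```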

  • Modeling Content Structures of Domain-Specific Texts with RUP-HDP-HSMM and Its Applications

    Youwei LU  Shogo OKADA  Katsumi NITTA  

     
    PAPER-Artificial Intelligence, Data Mining

    Publicized: 2017/06/09
    Vol: E100-D No:9
    Page(s): 2126-2137

    We propose a novel method, built upon the hierarchical Dirichlet process hidden semi-Markov model (HDP-HSMM), to reveal the content structures of unstructured domain-specific texts. The content structures of texts consisting of sequential local contexts are useful for tasks such as text retrieval, classification, and text mining. The prominent feature of our model is the use of recursive uniform partitioning, a stochastic process that takes a view different from existing HSMMs in modeling state duration. We show that recursive uniform partitioning plays an important role in avoiding rapid switching between hidden states. Remarkably, our method greatly outperforms others in ranking performance in our text retrieval experiments, and it provides more accurate features for SVMs, achieving higher F1 scores in our text classification experiments. These experimental results suggest that our method yields improved representations of domain-specific texts. Furthermore, we present a method for automatically discovering the local contexts that account for why a text is classified as a positive instance in supervised learning settings.

  • Analyzing Information Flow and Context for Facebook Fan Pages Open Access

    Kwanho KIM  Josué OBREGON  Jae-Yoon JUNG  

     
    LETTER

    Vol: E97-D No:4
    Page(s): 811-814

    With the recent growth of online social network services such as Facebook and Twitter, people can easily share information with each other by writing posts or commenting on others' posts. In this paper, we suggest a method of discovering the information flows of posts on Facebook and their underlying contexts by incorporating process mining and text mining techniques. Based on comments collected from Facebook, the experimental results illustrate how the proposed method can be applied to analyze the information flows and contexts of posts on social network services.
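    The process-mining half of such a combination can be illustrated with a directly-follows graph over time-ordered events per post; the traces below are invented, and the text-mining half would attach topics to the same events.

```python
# Sketch of a directly-follows graph, a basic process-mining structure (toy traces).
from collections import Counter

traces = {"post_1": ["post", "comment", "comment", "share"],
          "post_2": ["post", "share", "comment"]}

edges = Counter()
for events in traces.values():
    for a, b in zip(events, events[1:]):
        edges[(a, b)] += 1  # count each directly-follows transition
print(edges.most_common())  # the counted edges form the information-flow graph
```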

  • Global Mapping Analysis: Stochastic Gradient Algorithm in Multidimensional Scaling

    Yoshitatsu MATSUDA  Kazunori YAMAGUCHI  

     
    PAPER-Artificial Intelligence, Data Mining

    Vol: E95-D No:2
    Page(s): 596-603

    In order to implement multidimensional scaling (MDS) efficiently, we propose a new method named “global mapping analysis” (GMA), which applies stochastic approximation to minimizing MDS criteria. GMA can solve MDS more efficiently in both the linear case (classical MDS) and the non-linear one (e.g., ALSCAL), provided the MDS criteria are polynomial. GMA separates the polynomial criteria into local factors and global ones. Because the global factors need to be calculated only once in each iteration, GMA is of linear order in the number of objects. Numerical experiments on artificial data verify the efficiency of GMA. It is also shown that GMA can discover various interesting structures in massive document collections.
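    The stochastic-approximation idea can be sketched as follows: sample object pairs and nudge their coordinates toward the target dissimilarities. This shows plain stochastic-gradient descent on the raw stress criterion, not GMA's separation into local and global factors.

```python
# Sketch of stochastic-gradient MDS on the stress criterion (not GMA itself).
import numpy as np

def sgd_mds(D, dim=2, steps=2000, lr=0.05, seed=0):
    """D: symmetric (n, n) dissimilarity matrix; returns (n, dim) coordinates."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(scale=0.1, size=(n, dim))
    for _ in range(steps):
        i, j = rng.choice(n, size=2, replace=False)
        diff = X[i] - X[j]
        dist = np.linalg.norm(diff) + 1e-9
        grad = 2.0 * (dist - D[i, j]) * diff / dist  # gradient of (dist - D_ij)^2 w.r.t. X[i]
        X[i] -= lr * grad
        X[j] += lr * grad
    return X

D = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
print(sgd_mds(D))  # 2-D coordinates whose pairwise distances approximate D
```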

  • Patent Registration Prediction Methodology Using Multivariate Statistics

    Won-Gyo JUNG  Sang-Sung PARK  Dong-Sik JANG  

     
    PAPER-Artificial Intelligence, Data Mining

    Vol: E94-D No:11
    Page(s): 2219-2226

    Whether a patent is registered is usually based on the subjective judgment of patent examiners, who may decide according to their personal knowledge, backgrounds, and so on. In this paper, we propose a novel patent registration prediction method based on patent data. The method estimates whether a patent will be registered by utilizing the objective past history of patent data instead of the existing subjective judgments. The proposed method constructs an estimation model by applying a multivariate statistical algorithm. In the prediction model, the application date, activity index, IPC code, and similarity of registration refusal are set as input values, and patent registration and rejection are set as output values. We believe that our method will contribute to improved reliability of patent registration in that it achieves highly reliable estimation results through the past history of patent data, in contrast to most previous approaches based on subjective judgments by patent agents.
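    As a sketch of the estimation-model idea, a logistic model over the abstract's input features is shown below; the feature names and weights are invented for illustration, not fitted values from the paper.

```python
# Sketch of a multivariate registration-probability model (weights are invented).
import math

def registration_probability(features, weights, bias=0.0):
    """Logistic model over numeric patent features."""
    z = bias + sum(weights[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

patent = {"activity_index": 1.4, "refusal_similarity": 0.2, "application_age_years": 3.0}
weights = {"activity_index": 0.9, "refusal_similarity": -2.0, "application_age_years": -0.1}
print(f"P(registered) = {registration_probability(patent, weights):.2f}")
```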

  • Query Expansion and Text Mining for ChronoSeeker -- Search Engine for Future/Past Events --

    Hideki KAWAI  Adam JATOWT  Katsumi TANAKA  Kazuo KUNIEDA  Keiji YAMADA  

     
    PAPER

    Vol: E94-D No:3
    Page(s): 552-563

    This paper introduces a future and past search engine, ChronoSeeker, which can help users develop long-term strategies for their organizations. To provide on-demand search, we tackled two technical issues: (1) organizing efficient event searches and (2) filtering noise out of the search results. Our system employs query expansion with typical expressions related to event information, such as year expressions, temporal modifiers, and context terms, for efficient event searches. We utilize a machine-learning technique to filter noise, classifying candidates into event and non-event information using heuristic features and lexical patterns derived from a text-mining approach. Our experiments revealed that the filtering achieved an 85% F-measure and that query expansion could collect dozens more events than searches without expansion.
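    The query-expansion step can be sketched as follows; the modifier list and the example topic are illustrative assumptions, not ChronoSeeker's actual dictionaries.

```python
# Sketch of expanding an event query with year expressions and temporal modifiers.
def expand_event_query(topic, years, modifiers=("will", "plan", "scheduled")):
    """Build one expanded query string per (year, modifier) combination."""
    return [f'"{topic}" "{year}" {mod}' for year in years for mod in modifiers]

for q in expand_event_query("lunar base construction", [2030, 2035])[:4]:
    print(q)  # each expanded query is sent to a search engine to collect event candidates
```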

  • Assigning Polarity to Causal Information in Financial Articles on Business Performance of Companies

    Hiroyuki SAKAI  Shigeru MASUYAMA  

     
    PAPER-Document Analysis

    Vol: E92-D No:12
    Page(s): 2341-2350

    We propose a method of assigning polarity to causal information extracted from Japanese financial articles concerning the business performance of companies. Our method assigns polarity (positive or negative) to causal information in accordance with business performance, e.g., "zidousya no uriage ga koutyou (Sales of cars are good)" (the polarity positive is assigned in this example). Causal expressions assigned polarity by our method can be used, for example, to analyze the content of articles concerning business performance in detail. First, our method classifies articles concerning business performance into positive and negative articles. Using them, it then assigns polarity (positive or negative) to causal information extracted from this set of articles. Although our method needs a training dataset for classifying articles into positive and negative ones, it does not need a training dataset for assigning polarity to causal information. Hence, even if causal information that does not appear in the article-classification training dataset exists, our method is able to assign it polarity by using statistical information from the classified sets of articles. We evaluated our method and confirmed that it attained 74.4% precision and 50.4% recall in assigning positive polarity, and 76.8% precision and 61.5% recall in assigning negative polarity.
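    The underlying statistical idea can be sketched by comparing word frequencies across the two classified article sets; the toy counts and the simple difference score below are illustrative assumptions, not the paper's exact statistics.

```python
# Sketch of frequency-based polarity assignment (toy counts from classified articles).
from collections import Counter

pos_counts = Counter({"koutyou": 8, "zouka": 5, "uriage": 6})   # words in positive articles
neg_counts = Counter({"teimei": 7, "genshou": 4, "uriage": 5})  # words in negative articles

def polarity(expression_words):
    """Assign polarity by which classified article set the expression's words favor."""
    score = sum(pos_counts[w] - neg_counts[w] for w in expression_words)
    return "positive" if score > 0 else "negative"

print(polarity(["uriage", "koutyou"]))  # "sales are good" -> positive
```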

  • Stemming Malay Text and Its Application in Automatic Text Categorization

    Michiko YASUKAWA  Hui Tian LIM  Hidetoshi YOKOO  

     
    PAPER-Document Analysis

    Vol: E92-D No:12
    Page(s): 2351-2359

    In the Malay language there are no conjugations or declensions, and affixes have important grammatical functions. The same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversation, it is essential to use precise words in formal speech or written texts, so derivative words are used to make sentences clear. Derivation is achieved mainly by the use of affixes, and there are approximately a hundred possible derivative forms of a root word in the written language of educated Malays; the composition of Malay words may therefore be complicated. Although several types of stemming algorithms are available for text processing in English and some other languages, they cannot overcome the difficulties of Malay word stemming. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems, and it is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The set of rules is aimed at reducing under-stemming errors, while the dictionaries are intended to reduce over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software, using actual web pages collected from the World Wide Web as text data. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
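    A dictionary-checked affix-stripping step in this spirit might look like the sketch below; the affix lists and the root-word dictionary are a tiny invented subset, not the authors' rule set.

```python
# Sketch of rule-plus-dictionary Malay stemming (toy affixes and dictionary).
PREFIXES = ["mem", "men", "me", "ber", "di"]
SUFFIXES = ["kan", "an", "i"]
ROOT_WORDS = {"baca", "ajar", "main"}  # toy root-word dictionary

def stem_malay(word):
    """Strip candidate affixes, accepting only dictionary-confirmed roots."""
    candidates = [word]
    for p in PREFIXES:
        if word.startswith(p):
            candidates.append(word[len(p):])
    for c in list(candidates):
        for s in SUFFIXES:
            if c.endswith(s):
                candidates.append(c[:-len(s)])
    for c in candidates:
        if c in ROOT_WORDS:  # dictionary check guards against over-stemming
            return c
    return word              # leave unknown words unchanged to avoid errors

print(stem_malay("membaca"), stem_malay("ajaran"), stem_malay("bermain"))
# -> baca ajar main
```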

  • Extraction of Semantic Text Portion Related to Anchor Link

    Bui Quang HUNG  Masanori OTSUBO  Yoshinori HIJIKATA  Shogo NISHIDA  

     
    PAPER-Language

    Vol: E89-D No:6
    Page(s): 1834-1847

    Recently, the semantic text portion (STP) has been attracting attention in the field of Web mining. An STP is a text portion in an original page that is semantically related to the anchor pointing to the target page. STPs may include facts and people's opinions about the target pages, and they can be used for various upper-level applications such as automatic summarization and document categorization. In this paper, we concentrate on extracting STPs. We conduct a survey of STPs to identify their positions in original pages and to find HTML tags that can separate STPs from other text portions. We then develop a method for extracting STPs based on the results of the survey. The experimental results show that our method achieves high performance.
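    One way to picture tag-based STP extraction is the sketch below, which treats block-level tags as portion boundaries and returns the segment containing the anchor; the tag list and the toy page are illustrative, not the authors' surveyed tags.

```python
# Sketch of extracting the text portion around an anchor using block-tag boundaries.
import re

BLOCK_TAGS = r"</?(?:p|div|li|td|h[1-6])[^>]*>"  # tags treated as portion boundaries

def semantic_text_portion(html, target_url):
    """Return the block-level segment whose markup contains the anchor to target_url."""
    for segment in re.split(BLOCK_TAGS, html, flags=re.I):
        if target_url in segment:
            return " ".join(re.sub(r"<[^>]+>", " ", segment).split())  # strip inline tags
    return ""

page = ('<p>Unrelated intro.</p>'
        '<p>A good review of MDS methods: <a href="http://example.org/mds">here</a>,'
        ' covering stochastic variants.</p>')
print(semantic_text_portion(page, "http://example.org/mds"))
```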

  • NTM-Agent: Text Mining Agent for Net Auction

    Yukitaka KUSUMURA  Yoshinori HIJIKATA  Shogo NISHIDA  

     
    PAPER

    Vol: E87-D No:6
    Page(s): 1386-1396

    Net auctions have been widely utilized with the recent development of the Internet. However, there are too many items for bidders to easily select the most suitable one. We aim to support bidders on net auctions by automatically generating a table that contains the features of several items for comparison. We construct a system called NTM-Agent (Net auction Text Mining Agent). The system collects web pages of items, extracts the items' features from the pages, and then generates a table containing the extracted features. This research focuses on two problems in this process. The first is that if the system collects items automatically, the results contain items that differ from the user's target. The second is that descriptions in net auctions are not uniform (they appear in different formats such as sentences, itemized lists, and tables, and the subjects of some sentences are omitted), so it is difficult to extract information from them with conventional information extraction methods. This research proposes methods to solve these problems. For the first problem, NTM-Agent filters the items using correlation rules over the keywords in the titles and item descriptions; these rules are created semi-automatically by a support tool. For the second problem, NTM-Agent extracts the information by distinguishing the formats, and it also learns feature values from plain examples for future extraction.
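    The keyword correlation-rule filtering can be sketched as follows; the rule and the item texts are invented for illustration.

```python
# Sketch of correlation-rule filtering of collected auction items (toy rule and items).
RULES = [({"ipod"}, {"apple", "mp3"})]  # title keywords -> required description keywords

def matches_target(title_words, desc_words):
    """Keep an item only if a rule's title keys appear and its description keys overlap."""
    return any(title_keys <= title_words and desc_keys & desc_words
               for title_keys, desc_keys in RULES)

items = [("ipod nano 4gb", "apple mp3 player boxed"),
         ("ipod car charger", "12v accessory cable")]
for title, desc in items:
    print(title, "->", matches_target(set(title.split()), set(desc.split())))
# the first item passes; the charger is filtered out as off-target
```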

  • Corpus Based Method of Transforming Nominalized Phrases into Clauses for Text Mining Application

    Akira TERADA  Takenobu TOKUNAGA  

     
    PAPER

    Vol: E86-D No:9
    Page(s): 1736-1744

    Nominalization is a linguistic phenomenon in which events usually described in terms of clauses are expressed in the form of noun phrases. Extracting event structures is an important task in text mining applications. To achieve this goal, clauses are parsed and the argument structures of main verbs are extracted from the parsed results; this kind of preprocessing has been common in past research. In order to extract event structures from nominalized phrases as well, we need a technique for transforming nominalized phrases into clauses. In this paper, we propose such a transformation method using a corpus-based approach. The proposed method first enumerates possible predicate-argument structures for a nominalized phrase (noun phrase) and then ranks them based on the frequency of each argument in the corpus. The algorithm based on this method was evaluated on a corpus of 24,626 aviation safety reports in English and achieved 78% transformation accuracy. The algorithm was also evaluated by applying a text mining application that extracts events and their cause-effect relations from the texts, and the proposed transformation improved the application's performance.
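    The ranking step can be sketched with toy corpus counts: enumerate candidate predicate-argument readings of a phrase and score each by the summed corpus frequency of its arguments. The structures and counts below are invented for illustration.

```python
# Sketch of frequency-based ranking of clause readings for a nominalized phrase.
from collections import Counter

# toy corpus counts of (verb, role, argument) observations
corpus_counts = Counter({("inspect", "obj", "engine"): 12,
                         ("inspect", "subj", "engine"): 1})

def rank_candidates(candidates):
    """Score each candidate structure by the summed corpus frequency of its arguments."""
    scored = [(sum(corpus_counts[t] for t in cand), cand) for cand in candidates]
    return sorted(scored, reverse=True)

# two clause readings of the phrase "engine inspection"
candidates = [[("inspect", "obj", "engine")],   # "(someone) inspects the engine"
              [("inspect", "subj", "engine")]]  # "the engine inspects (something)"
for score, cand in rank_candidates(candidates):
    print(score, cand)  # the object reading wins on corpus frequency
```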