Keyword Search Result

[Keyword] topic model (19 hits)

Hits 1-19
  • Partition-then-Overlap Method for Labeling Cyber Threat Intelligence Reports by Topics over Time

    Ryusei NAGASAWA  Keisuke FURUMOTO  Makoto TAKITA  Yoshiaki SHIRAISHI  Takeshi TAKAHASHI  Masami MOHRI  Yasuhiro TAKANO  Masakatu MORII  

     
    LETTER

      Publicized:
    2021/02/24
      Vol:
    E104-D No:5
      Page(s):
    556-561

    The Topics over Time (TOT) model allows users to be aware of changes in certain topics over time. The proposed method divides a dataset of security blog posts into fixed-length periods that overlap with one another and feeds each partition to the TOT model. The results suggest that the method extracts topics that include malware and attack campaign names and are appropriate for the multi-labeling of cyber threat intelligence reports.
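
    The partitioning step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the period length, the overlap length, or the interface to TOT, so the function names `overlapping_windows` and `partition_posts` and all parameter values here are hypothetical; only the idea of fixed periods that share an overlap comes from the text.

```python
from datetime import date, timedelta


def overlapping_windows(start, end, period_days, overlap_days):
    """Return (window_start, window_end) pairs covering [start, end).

    Each window spans `period_days`; consecutive windows share `overlap_days`.
    Both values are illustrative assumptions, not settings from the paper.
    """
    if not 0 <= overlap_days < period_days:
        raise ValueError("need 0 <= overlap_days < period_days")
    step = timedelta(days=period_days - overlap_days)
    length = timedelta(days=period_days)
    windows = []
    cursor = start
    while cursor < end:
        windows.append((cursor, min(cursor + length, end)))
        if cursor + length >= end:
            break  # the last window already reaches the end of the range
        cursor += step
    return windows


def partition_posts(posts, windows):
    """Group (date, text) posts by window; a post in an overlap lands in both."""
    return [[text for day, text in posts if lo <= day < hi] for lo, hi in windows]
```

    Each resulting partition would then be passed to a topic model separately; a post dated inside an overlap contributes to two neighboring partitions.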

  • Automatic Stop Word Generation for Mining Software Artifact Using Topic Model with Pointwise Mutual Information

    Jung-Been LEE  Taek LEE  Hoh Peter IN  

     
    PAPER-Software Engineering

      Publicized:
    2019/05/27
      Vol:
    E102-D No:9
      Page(s):
    1761-1772

    Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and contain a mix of textual types within the natural text. These software artifact characteristics worsen the performance of topic modeling. Among several natural language pre-processing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches are used to generate effective stop words, the lists are outdated or too general to apply to mining software artifacts. In addition, the performance of the topic model is sensitive to the datasets used in the training for each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. By measuring topic coherence among words in the topic using Pointwise Mutual Information (PMI), we added words with a low PMI score to our stop words list for every topic modeling loop. Through our experiment, we proved that our stop words list results in a higher performance of the topic model than lists from other approaches.
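
    As a rough illustration of the PMI-based selection mentioned in this abstract, the sketch below computes PMI(a, b) = log(p(a, b) / (p(a) p(b))) from document-level co-occurrence and flags topic words whose mean PMI with the other topic words falls below a threshold. The abstract does not specify the co-occurrence window, the aggregation, or the threshold, so the mean-based rule and the names `pmi_scores` and `low_pmi_words` are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Return {(w1, w2): PMI} from document-level co-occurrence counts.

    Keys are lexicographically ordered word pairs that co-occur at least once.
    """
    n_docs = len(documents)
    word_df = Counter()
    pair_df = Counter()
    for doc in documents:
        vocab = sorted(set(doc))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    return {
        (a, b): math.log(
            (count / n_docs) / ((word_df[a] / n_docs) * (word_df[b] / n_docs))
        )
        for (a, b), count in pair_df.items()
    }


def low_pmi_words(topic_words, scores, threshold):
    """Return topic words whose mean PMI with the other topic words is low.

    Pairs that never co-occur are treated as PMI = -inf (an assumption).
    """
    flagged = []
    for word in topic_words:
        others = [o for o in topic_words if o != word]
        if not others:
            continue
        values = [
            scores.get(tuple(sorted((word, o))), float("-inf")) for o in others
        ]
        if sum(values) / len(values) < threshold:
            flagged.append(word)
    return flagged
```

    In an iterative setup, the flagged words would be appended to the stop word list before the topic model is retrained.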

  • A Novel Recommendation Algorithm Incorporating Temporal Dynamics, Reviews and Item Correlation

    Ting WU  Yong FENG  JiaXing SANG  BaoHua QIANG  YaNan WANG  

     
    PAPER-Data Engineering, Web Information Systems

      Publicized:
    2018/05/18
      Vol:
    E101-D No:8
      Page(s):
    2027-2034

    Recommender systems (RS) exploit user ratings on items and side information to make personalized recommendations. To recommend the right products to users, an RS must accurately model the implicit preferences of each user and the properties of each product. In reality, both user preferences and item properties change dynamically over time, so treating the historical decisions of a user or the comments received by an item as static is inappropriate. In addition, the review text accompanying a rating score can help us understand why a user likes or dislikes an item, so temporal dynamics and the text of reviews are important side information for recommender systems. Moreover, compared with the large number of available items, the number of items a user can buy is very limited, which is called the sparsity problem. Utilizing item correlation is a promising way to address this problem. Although well-known methods such as TimeSVD++, TopicMF and CoFactor partially take temporal dynamics, reviews and correlation into consideration, none of them combines all of this information for accurate recommendation. Therefore, in this paper we propose a novel combined model called TmRevCo, which is based on matrix factorization. Our model combines the dynamic user factor of TimeSVD++ with the hidden topic of each review text mined by the topic model of TopicMF through a new transformation function. Meanwhile, to support our five-scoring datasets, we use a more appropriate item correlation measure in CoFactor and associate the item factors of CoFactor with those of matrix factorization. Our model thus combines temporal dynamics, review information and item correlation in a single framework. Experimental results on three real-world datasets show that our proposed model leads to significant improvement compared with the baseline methods.

  • Sequential Bayesian Nonparametric Multimodal Topic Models for Video Data Analysis

    Jianfei XUE  Koji EGUCHI  

     
    PAPER

      Publicized:
    2018/01/18
      Vol:
    E101-D No:4
      Page(s):
    1079-1087

    Topic modeling is a well-known method that is widely applied not only to text data mining but also to multimedia data analysis, such as video data analysis. However, existing models cannot adequately handle time dependency and multimodal data modeling for video data, which generally contain both image information and speech information. In this paper, we therefore propose a novel topic model, sequential symmetric correspondence hierarchical Dirichlet processes (Seq-Sym-cHDP), extended from sequential conditionally independent hierarchical Dirichlet processes (Seq-CI-HDP) and sequential correspondence hierarchical Dirichlet processes (Seq-cHDP), to improve the multimodal data modeling mechanism by controlling the pivot assignments with a latent variable. An inference scheme for Seq-Sym-cHDP based on a posterior representation sampler is also developed in this work. Finally, we demonstrate through experiments that our model outperforms other baseline models.

  • Stochastic Divergence Minimization for Biterm Topic Models

    Zhenghang CUI  Issei SATO  Masashi SUGIYAMA  

     
    PAPER-Data Engineering, Web Information Systems

      Publicized:
    2017/12/20
      Vol:
    E101-D No:3
      Page(s):
    668-677

    With the emergence and thriving development of social networks, a huge number of short texts have accumulated and need to be processed. Inferring the latent topics of collected short texts is an essential task for understanding their hidden structure and predicting new content. A biterm topic model (BTM) was recently proposed for short texts to overcome the sparseness of document-level word co-occurrences by directly modeling the generation process of word pairs. Stochastic inference algorithms based on collapsed Gibbs sampling (CGS) and collapsed variational inference have been proposed for BTM. However, they either have high computational complexity or rely on very crude estimation that does not preserve sufficient statistics. In this work, we develop a stochastic divergence minimization (SDM) inference algorithm for BTM to achieve better predictive likelihood in a scalable way. Experiments show that SDM-BTM trained on 30% of the data outperforms the best existing algorithm trained on the full data.

  • Relation Prediction in Multilingual Data Based on Multimodal Relational Topic Models

    Yosuke SAKATA  Koji EGUCHI  

     
    PAPER

      Publicized:
    2017/01/17
      Vol:
    E100-D No:4
      Page(s):
    741-749

    There are increasing demands for improved analysis of multimodal data that consist of multiple representations, such as multilingual documents and text-annotated images. One promising approach for analyzing such multimodal data is latent topic models. In this paper, we propose conditionally independent generalized relational topic models (CI-gRTM) for predicting unknown relations across different representations of multimodal data. We developed CI-gRTM as a multimodal extension of discriminative relational topic models called generalized relational topic models (gRTM). We demonstrated through experiments with multilingual documents that CI-gRTM can predict both multilingual representations and relations between two different language representations more effectively than several state-of-the-art baseline models that can predict either multilingual representations or unimodal relations.

  • Inferring User Consumption Preferences from Social Media

    Yang LI  Jing JIANG  Ting LIU  

     
    PAPER-Artificial Intelligence, Data Mining

      Publicized:
    2016/12/09
      Vol:
    E100-D No:3
      Page(s):
    537-545

    Social media has already become a new arena of our lives, involving different aspects of our social presence. Users' personal information and activities on social media presumably reveal their personal interests, which offers great opportunities for many e-commerce applications. In this paper, we propose a principled latent variable model to infer user consumption preferences at the category level (e.g., inferring what categories of products a user would like to buy). Our model naturally links users' published content and following relations on microblogs with their consumption behaviors on e-commerce websites. Experimental results show that our model significantly outperforms state-of-the-art methods in inferring a new user's consumption preference. Our model can also learn meaningful consumption-specific topics automatically.

  • Video Data Modeling Using Sequential Correspondence Hierarchical Dirichlet Processes

    Jianfei XUE  Koji EGUCHI  

     
    PAPER

      Publicized:
    2016/10/07
      Vol:
    E100-D No:1
      Page(s):
    33-41

    Video data mining based on topic models is an emerging technique that has recently become a very popular research topic. In this paper, we present a novel topic model named sequential correspondence hierarchical Dirichlet processes (Seq-cHDP) to learn the hidden structure within video data. The Seq-cHDP model can be regarded as an extended hierarchical Dirichlet processes (HDP) model with two important features: one is the time-dependency mechanism, which connects neighboring video frames on the basis of a time-dependent Markovian assumption, and the other is the correspondence mechanism, which provides a solution for dealing with multimodal data such as the mixture of visual words and speech words extracted from video files. A cascaded Gibbs sampling method is applied to perform inference for Seq-cHDP. We present a comprehensive experimental evaluation of Seq-cHDP and demonstrate that it outperforms other baseline models.

  • Automated Duplicate Bug Report Detection Using Multi-Factor Analysis

    Jie ZOU  Ling XU  Mengning YANG  Xiaohong ZHANG  Jun ZENG  Sachio HIROKAWA  

     
    PAPER-Software Engineering

      Publicized:
    2016/04/01
      Vol:
    E99-D No:7
      Page(s):
    1762-1775

    Bug reports written in natural language are usually numerous, ambiguous, and poorly written, which makes duplicate bug report detection challenging. Current automatic duplicate bug report detection techniques have mainly focused on textual information and ignored some useful factors. To improve the detection accuracy, in this paper we propose a new approach called the LNG (LDA and N-gram) model, which takes advantage of the topic model LDA and the word-based N-gram model. LNG considers multiple factors that potentially affect the detection accuracy, including textual information, semantic correlation, word order, contextual connections, and categorical information. In addition, the N-gram model adopted in LNG is improved by modifying the similarity algorithm. The experiment is conducted on more than 230,000 real bug reports of the Eclipse project. In the evaluation, we propose a new evaluation metric, the exact-accuracy (EA) rate, which can be used to better understand the performance of duplicate detection. The evaluation results show that the recall rate, precision rate, and EA rate of the proposed method are all higher than those obtained by treating them separately. Also, the recall rate is improved by 2.96%-10.53% compared to the state-of-the-art approach DBTM.

  • Topic Representation of Researchers' Interests in a Large-Scale Academic Database and Its Application to Author Disambiguation

    Marie KATSURAI  Ikki OHMUKAI  Hideaki TAKEDA  

     
    PAPER

      Publicized:
    2016/01/14
      Vol:
    E99-D No:4
      Page(s):
    1010-1018

    It is crucial to promote interdisciplinary research and recommend collaborators from different research fields via academic database analysis. This paper addresses the problem of characterizing researchers' interests with a set of diverse research topics found in a large-scale academic database. Specifically, we first use latent Dirichlet allocation to extract topics as distributions over words from a training dataset. Then, we convert the textual features of a researcher's publications to topic vectors, and calculate the centroid of these vectors to summarize the researcher's interest as a single vector. In experiments conducted on CiNii Articles, which is the largest academic database in Japan, we show that the extracted topics reflect the diversity of the research fields in the database. The experimental results also indicate the applicability of the proposed topic representation to the author disambiguation problem.
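
    The centroid step in this abstract can be sketched as follows, assuming each publication has already been converted to a fixed-length topic vector. The function names are hypothetical, and the cosine similarity used to compare two researchers' interest vectors is an assumption for illustration; the abstract does not state which similarity measure the authors use for disambiguation.

```python
import math


def centroid(topic_vectors):
    """Average per-publication topic vectors into one interest vector."""
    if not topic_vectors:
        raise ValueError("at least one topic vector is required")
    dim = len(topic_vectors[0])
    n = len(topic_vectors)
    return [sum(vec[k] for vec in topic_vectors) / n for k in range(dim)]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors (assumed comparison measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

    Two author records with similar interest vectors could then be treated as candidates for the same person, subject to whatever threshold a given application chooses.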

  • A Hybrid Topic Model for Multi-Document Summarization

    JinAn XU  JiangMing LIU  Kenji ARAKI  

     
    PAPER-Natural Language Processing

      Publicized:
    2015/02/09
      Vol:
    E98-D No:5
      Page(s):
    1089-1094

    Topic features are useful in improving text summarization. However, independence among topics is a strong restriction in most topic models, and alleviating this restriction allows text structure to be captured more deeply. This paper proposes a hybrid topic model to generate multi-document summaries using a combination of the Hidden Topic Markov Model (HTMM), the surface texture model and the topic transition model. Based on the topic transition model, regular topic transition probability is used during summary generation. This approach eliminates the topic independence assumption of the Latent Dirichlet Allocation (LDA) model. The experimental results show the advantage of combining the three kinds of models. The contributions of this paper include alleviating topic independence and integrating the surface texture and shallow semantics of documents to improve summarization. In short, this paper attempts to realize an advanced summarization system.

  • Inferring Geographical Partitions by Exploiting User Mobility in Urban Area

    Feng XIANG  Benxiong HUANG  Lai TU  Duan HU  

     
    PAPER

      Vol:
    E97-D No:10
      Page(s):
    2623-2631

    Understanding the structure and evolution of spatial-temporal networks is crucial for fields ranging from urbanism to epidemiology. As location-based technologies are pervasively used in our daily life, a large amount of sensing data has brought opportunities to study human activities and city dynamics. Ubiquitous cell phones can serve as sensors for analyzing the social connections and boundaries of geographical regions. In this paper, we exploit user mobility based on large-scale mobile phone records to study urban areas. We collect call data records from 1 million anonymous subscribers over 8 weeks and study the user mobility flux between different regions. First, we model the urban areas as a spatial network and use a modularity detection algorithm to study the intrinsic connections between map areas. Second, a generative model that is widely used in linguistic contexts is adopted to explore the functions of regions. Based on mobile call records, we are able to derive partitions that match the boundaries of the administrative districts. Our results can also capture the dynamics of urban areas as a basis for city planning and policy making.

  • Multimedia Topic Models Considering Burstiness of Local Features (Open Access)

    Yang XIE  Koji EGUCHI  

     
    PAPER

      Vol:
    E97-D No:4
      Page(s):
    714-720

    A number of studies have been conducted on topic modeling for various types of data, including text and image data. In this paper, we focus on the burstiness of local features when modeling topics within video data. Burstiness is a phenomenon that is often discussed for text data: if a word is used once in a document, it is more likely to be used again within that document. It is also observed in video data; for example, an object or visual word is more likely to appear repeatedly within the same video. Based on this idea, we propose a new topic model, the Correspondence Dirichlet Compound Multinomial LDA (Corr-DCMLDA), which takes into account the burstiness of the local features in video data. The unknown parameters and latent variables in the model are estimated by collapsed Gibbs sampling, and the hyperparameters are estimated by fixed-point iterations. We demonstrate through experimentation on the genre classification of social video data that our model works more effectively than several baselines.

  • Hybrid Parallel Inference for Hierarchical Dirichlet Processes (Open Access)

    Tsukasa OMOTO  Koji EGUCHI  Shotaro TORA  

     
    LETTER

      Vol:
    E97-D No:4
      Page(s):
    815-820

    The hierarchical Dirichlet process (HDP) can provide a nonparametric prior for a mixture model with grouped data, where mixture components are shared across groups. However, the computational cost is generally very high in terms of both time and space complexity. Therefore, developing a method for fast inference of HDP remains a challenge. In this paper, we assume a symmetric multiprocessing (SMP) cluster, which has been widely used in recent years. To speed up the inference on an SMP cluster, we explore hybrid two-level parallelization of the Chinese restaurant franchise sampling scheme for HDP, especially focusing on the application to topic modeling. The methods we developed, Hybrid-AD-HDP and Hybrid-Diff-AD-HDP, make better use of SMP clusters, resulting in faster HDP inference. While conventional parallel algorithms with a full message-passing interface do not benefit from using SMP clusters due to higher communication costs, the proposed hybrid parallel algorithms have lower communication costs and make better use of the computational resources.

  • Characterizing Web APIs Combining Supervised Topic Model with Ontology

    Yuanbin HAN  Shizhan CHEN  Zhiyong FENG  

     
    LETTER-Data Engineering, Web Information Systems

      Vol:
    E96-D No:7
      Page(s):
    1548-1551

    This paper presents a novel topic modeling (TM) approach for discovering meaningful topics for Web APIs, which is a potential means of dimensionality reduction for efficient and effective classification, retrieval, organization, and management of numerous APIs. We explore the possibility of conducting TM on multi-labeled APIs by combining a supervised TM (known as Labeled LDA) with ontology. Experiments conducted on a real-world API data set show that the proposed method outperforms standard Labeled LDA with an average gain of 7.0% in the measured quality of the generated topics. In addition, we evaluate the similarity matching between topics generated by our method and by standard Labeled LDA, which demonstrates the significance of incorporating ontology.

  • MPI/OpenMP Hybrid Parallel Inference Methods for Latent Dirichlet Allocation – Approximation and Evaluation

    Shotaro TORA  Koji EGUCHI  

     
    PAPER-Advanced Search

      Vol:
    E96-D No:5
      Page(s):
    1006-1015

    Recently, probabilistic topic models have been applied to various types of data, including text, and their effectiveness has been demonstrated. Latent Dirichlet allocation (LDA) is a well-known topic model. Variational Bayesian inference or collapsed Gibbs sampling is often used to estimate parameters in LDA; however, these inference methods incur high computational cost for large-scale data. Therefore, highly efficient technology is needed for this purpose. We use parallel computation technology for efficient collapsed Gibbs sampling inference for LDA. We assume a symmetric multiprocessing (SMP) cluster, which has been widely used in recent years. In prior work on parallel inference for LDA, either MPI or OpenMP has often been used alone. For an SMP cluster, however, it is more suitable to adopt hybrid parallelization that uses message passing for communication between SMP nodes and loop directives for parallelization within each SMP node. We developed an MPI/OpenMP hybrid parallel inference method for LDA, and evaluated the performance of the inference under various settings of an SMP cluster. We further investigated an approximation that controls the inter-node communications, and found that it achieved a noticeable increase in inference speed while maintaining inference accuracy.

  • Spoken Document Retrieval Leveraging Unsupervised and Supervised Topic Modeling Techniques

    Kuan-Yu CHEN  Hsin-Min WANG  Berlin CHEN  

     
    PAPER-Speech Processing

      Vol:
    E95-D No:5
      Page(s):
    1195-1205

    This paper describes the application of two attractive categories of topic modeling techniques to the problem of spoken document retrieval (SDR), viz. document topic model (DTM) and word topic model (WTM). Apart from using the conventional unsupervised training strategy, we explore a supervised training strategy for estimating these topic models, envisioning a scenario in which user query logs, along with click-through information on relevant documents, can be utilized to build an SDR system. This attempt has the potential to associate relevant documents with queries even if they do not share any of the query words, thereby improving retrieval quality over the baseline system. Likewise, we also study a novel use of pseudo-supervised training to associate relevant documents with queries through a pseudo-feedback procedure. Moreover, in order to lessen SDR performance degradation caused by imperfect speech recognition, we investigate leveraging different levels of index features for topic modeling, including words, syllable-level units, and their combination. We provide a series of experiments conducted on the TDT (TDT-2 and TDT-3) Chinese SDR collections. The empirical results show that the methods deduced from our proposed modeling framework are very effective when compared with a few existing retrieval approaches.

  • Enhancing Digital Book Clustering by LDAC Model

    Lidong WANG  Yuan JIE  

     
    PAPER

      Vol:
    E95-D No:4
      Page(s):
    982-988

    In Digital Library (DL) applications, digital book clustering is an important and urgent research task. However, it is difficult to conduct effectively because of the great length of digital books. To cluster digital books correctly, a novel method based on a probabilistic topic model is proposed. First, we build a topic model named LDAC. The main goal of LDAC topic modeling is to effectively extract topics from digital books. Subsequently, Gibbs sampling is applied for parameter inference. Once the model parameters are learned, each book is assigned to the cluster that maximizes the posterior probability. Experimental results demonstrate that our approach based on LDAC achieves significant improvement over the related methods.

  • Entity Network Prediction Using Multitype Topic Models

    Hitohiro SHIOZAKI  Koji EGUCHI  Takenao OHKAWA  

     
    PAPER-Knowledge Discovery and Data Mining

      Vol:
    E91-D No:11
      Page(s):
    2589-2598

    Conveying information about who, what, when and where is a primary purpose of some genres of documents, typically news articles. Statistical models that capture dependencies between named entities and topics can play an important role in handling such information. Although some relationships between who and where should be mentioned in such a document, no statistical topic models explicitly address the textual interactions between a who-entity and a where-entity. This paper presents a statistical model that directly captures the dependencies between an arbitrary number of word types, such as who-entities, where-entities and topics, mentioned in each document. Through experiments on predicting who-entities and the links between them, we show that this multitype topic model performs better at making predictions on entity networks, in which each vertex represents an entity and each edge weight represents how closely the pair of entities at the incident vertices is related. We also demonstrate the scale-free property in the weighted networks of entities extracted from written mentions.