Ryusei NAGASAWA Keisuke FURUMOTO Makoto TAKITA Yoshiaki SHIRAISHI Takeshi TAKAHASHI Masami MOHRI Yasuhiro TAKANO Masakatu MORII
The Topics over Time (TOT) model lets users track how particular topics change over time. The proposed method divides a dataset of security blog posts into fixed-length periods that overlap with one another and inputs each division to the TOT model. The results suggest that the method extracts topics that include malware and attack campaign names and that are appropriate for the multi-labeling of cyber threat intelligence reports.
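The division step described above can be illustrated with a minimal sketch. It assumes posts are `(date, text)` pairs; the function name `split_with_overlap`, the period length, and the overlap length are hypothetical choices for illustration and are not taken from the paper.

```python
from datetime import date, timedelta

def split_with_overlap(posts, period_days, overlap_days):
    """Split (date, text) posts into fixed-length windows that overlap.

    Hypothetical helper: consecutive windows start
    (period_days - overlap_days) days apart, so each window shares
    overlap_days days with its successor.
    """
    if not 0 <= overlap_days < period_days:
        raise ValueError("overlap must be non-negative and shorter than the period")
    if not posts:
        return []
    posts = sorted(posts, key=lambda p: p[0])
    start, last = posts[0][0], posts[-1][0]
    step = timedelta(days=period_days - overlap_days)
    length = timedelta(days=period_days)
    windows = []
    while start <= last:
        end = start + length  # each window covers [start, end)
        windows.append([p for p in posts if start <= p[0] < end])
        start += step
    return windows

# Toy data: one post per day for ten days, 4-day windows with 2 days of overlap.
posts = [(date(2020, 1, d), f"post{d}") for d in range(1, 11)]
windows = split_with_overlap(posts, period_days=4, overlap_days=2)
```

Each resulting window would then be passed to a topic model separately; the overlap lets a topic that spans a period boundary appear in both neighboring windows.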
Jung-Been LEE Taek LEE Hoh Peter IN
Mining software artifacts is a useful way to understand the source code of software projects. Topic modeling in particular has been widely used to discover meaningful information from software artifacts. However, software artifacts are unstructured and contain a mix of textual types within the natural text. These characteristics worsen the performance of topic modeling. Among several natural language pre-processing tasks, removing stop words to reduce meaningless and uninteresting terms is an efficient way to improve the quality of topic models. Although many approaches are used to generate effective stop words, the resulting lists are outdated or too general to apply to mining software artifacts. In addition, the performance of the topic model is sensitive to the datasets used in the training for each approach. To resolve these problems, we propose an automatic stop word generation approach for topic models of software artifacts. By measuring topic coherence among the words in a topic using Pointwise Mutual Information (PMI), we add words with a low PMI score to our stop word list in every topic modeling loop. Our experiment shows that our stop word list yields higher topic model performance than lists from other approaches.
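One pass of the PMI-based selection can be sketched as follows. This is only an illustration under stated assumptions: probabilities are estimated from document-level co-occurrence, a word is scored by its mean PMI with the other top words of the topic, and pairs that never co-occur receive a fixed floor value. The function name `low_pmi_words`, the threshold, and the floor are hypothetical and are not the paper's definitions.

```python
import math
from itertools import combinations

def low_pmi_words(topic_words, documents, threshold, floor=-10.0):
    """Return topic words whose mean PMI with the other topic words is below threshold."""
    doc_sets = [set(doc) for doc in documents]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n

    scores = {w: [] for w in topic_words}
    for a, b in combinations(topic_words, 2):
        p_ab = prob(a, b)
        # Assumption: a pair that never co-occurs gets a fixed floor instead of -inf.
        pmi = math.log(p_ab / (prob(a) * prob(b))) if p_ab > 0 else floor
        scores[a].append(pmi)
        scores[b].append(pmi)
    return {w for w, s in scores.items() if s and sum(s) / len(s) < threshold}

# Toy corpus: "the" appears everywhere, so it carries no association with any word.
docs = [["bug", "fix", "the"], ["bug", "fix", "the"], ["test", "the"], ["test", "the"]]
stop_candidates = low_pmi_words(["bug", "fix", "the"], docs, threshold=0.1)
```

In an iterative setting, the returned words would be appended to the stop word list, removed from the corpus, and the topic model retrained before the next pass.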
Rungsiman NARARATWONG Natthawut KERTKEIDKACHORN Nagul COOHAROJANANONE Hitoshi OKADA
Word boundary ambiguity in word segmentation has long been a fundamental challenge within Thai language processing. The Conditional Random Fields (CRF) model is among the best-known methods to have achieved remarkably accurate segmentation. Nevertheless, current advancements appear to have left the problem of compound words unaccounted for. Compound words lose their meaning or context once segmented. Hence, we introduce a dictionary-based word-merging algorithm, which merges all kinds of compound words. Our evaluation shows that the algorithm can achieve high accuracy in word segmentation while preserving compound words. Moreover, it can also restore some incorrectly segmented words. Another problem involving a different word-chunking approach is sentence boundary ambiguity. In tackling the problem, utilizing the part of speech (POS) of a segmented word has previously been found to help boost the accuracy of CRF-based sentence segmentation. However, not all segmented words can be tagged. Thus, we propose a POS-based word-splitting algorithm, which splits words in order to increase the number of POS tags. We found that with more identifiable POS tags, the CRF model performs better in segmenting sentences. To demonstrate the contributions of both methods, we experimented with three of their applications. With the word-merging algorithm, we found that intact compound words in the product of topic extraction can help to preserve their intended meanings, offering more precise information for human interpretation. The algorithm, together with the POS-based word-splitting algorithm, can also be used to amend word-level Thai-English translations. In addition, the word-splitting algorithm improves sentence segmentation, thus enhancing text summarization.
Ting WU Yong FENG JiaXing SANG BaoHua QIANG YaNan WANG
Recommender systems (RS) exploit user ratings on items and side information to make personalized recommendations. In order to recommend the right products to users, RS must accurately model the implicit preferences of each user and the properties of each product. In reality, both user preferences and item properties change dynamically over time, so treating the historical decisions of a user or the received comments of an item as static is inappropriate. Besides, the review text accompanying a rating score can help us to understand why a user likes or dislikes an item, so temporal dynamics and text information in reviews are important side information for recommender systems. Moreover, compared with the large number of available items, the number of items a user can buy is very limited, which is called the sparsity problem. Utilizing item correlation provides a promising solution to this problem. Although well-known methods such as TimeSVD++, TopicMF and CoFactor partially take temporal dynamics, reviews and correlation into consideration, none of them combines all of this information for accurate recommendation. Therefore, in this paper we propose a novel combined model called TmRevCo, which is based on matrix factorization. Our model combines the dynamic user factor of TimeSVD++ with the hidden topic of each review text mined by the topic model of TopicMF through a new transformation function. Meanwhile, to support our five-scoring datasets, we use a more appropriate item correlation measure in CoFactor and associate the item factors of CoFactor with those of matrix factorization. Our model thus combines temporal dynamics, review information and item correlation simultaneously. Experimental results on three real-world datasets show that our proposed model leads to significant improvement compared with the baseline methods.
Qian LI Xiaojuan LI Bin WU Yunpeng XIAO
In social networks, predicting user behavior under social hotspots can aid in understanding the development trend of a topic. In this paper, we propose a retweeting prediction method for social hotspots based on tensor decomposition, using user information, relationship and behavioral data. The method can be used to predict the behavior of users and analyze the evolution of topics. First, we propose a tensor-based mechanism for mining user interaction, and we use the tensor to address the inaccuracy that arises when calculating interaction intensity from sparse user interaction data. At the same time, we can analyze the influence of the following relationship on the interaction between users based on characteristics of the tensor in data space conversion and projection. Second, a time decay function is introduced for the tensor to further quantify the evolution of user behavior in current social hotspots. That function can be fit to the behavior of a user dynamically, and can also account for the decay of interaction between users over time. Finally, we introduce time slices and discretization of the topic life cycle and construct a user retweeting prediction model based on logistic regression. In this way, we can both explore the temporal characteristics of user behavior in social hotspots and solve the problem of uneven interaction behavior between users. Experiments show that the proposed method can effectively improve the accuracy of user behavior prediction and aid in understanding the development trend of a topic.
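The abstract does not state the form of the time decay function. As an illustration only, the sketch below assumes an exponential decay with a hypothetical half-life parameter, so that older interactions contribute less to the interaction strength between two users. The function name `decayed_interaction` and its parameters are assumptions, not the paper's formulation.

```python
import math

def decayed_interaction(event_times, now, half_life):
    """Sum past interactions, each weighted by an exponential time decay.

    Assumption: an interaction that happened half_life time units ago
    counts half as much as one happening right now.
    """
    rate = math.log(2) / half_life
    return sum(math.exp(-rate * (now - t)) for t in event_times if t <= now)

# Toy example: three retweets at times 0, 5 and 10, evaluated at time 10.
strength = decayed_interaction([0, 5, 10], now=10, half_life=5)
```

A score of this kind could serve as one input feature, per time slice, to a logistic regression classifier that predicts whether a user retweets.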
Topic modeling is a well-known method that is widely applied not only to text data mining but also to multimedia data analysis, such as video data analysis. However, existing models cannot adequately handle time dependency and multimodal data modeling for video data, which generally contain image information and speech information. In this paper, we therefore propose a novel topic model, sequential symmetric correspondence hierarchical Dirichlet processes (Seq-Sym-cHDP), extended from sequential conditionally independent hierarchical Dirichlet processes (Seq-CI-HDP) and sequential correspondence hierarchical Dirichlet processes (Seq-cHDP), to improve the multimodal data modeling mechanism by controlling the pivot assignments with a latent variable. An inference scheme for Seq-Sym-cHDP based on a posterior representation sampler is also developed in this work. Finally, we demonstrate through experiments that our model outperforms other baseline models.
Zhenghang CUI Issei SATO Masashi SUGIYAMA
With the emergence and thriving development of social networks, a huge number of short texts have accumulated and need to be processed. Inferring the latent topics of collected short texts is an essential task for understanding their hidden structure and predicting new content. A biterm topic model (BTM) was recently proposed for short texts to overcome the sparseness of document-level word co-occurrences by directly modeling the generation process of word pairs. Stochastic inference algorithms based on collapsed Gibbs sampling (CGS) and collapsed variational inference have been proposed for BTM. However, they either have high computational complexity or rely on very crude estimation that does not preserve sufficient statistics. In this work, we develop a stochastic divergence minimization (SDM) inference algorithm for BTM to achieve better predictive likelihood in a scalable way. Experiments show that SDM-BTM trained on 30% of the data outperforms the best existing algorithm trained on the full data.
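The word pairs that BTM models can be illustrated with a short sketch. It assumes that each short text has already been tokenized and that every unordered pair of tokens within one text forms a biterm; the function names are hypothetical, and the inference algorithms discussed above are not shown.

```python
from collections import Counter
from itertools import combinations

def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one tokenized short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]

def count_biterms(corpus):
    """Aggregate biterm counts over the whole corpus of short texts."""
    counts = Counter()
    for tokens in corpus:
        counts.update(extract_biterms(tokens))
    return counts

# Toy corpus of two short texts.
corpus = [["apple", "store", "open"], ["apple", "store"]]
biterm_counts = count_biterms(corpus)
```

Because the counts are pooled over the corpus, co-occurrence statistics stay usable even when each individual text contains only a few words.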
E-mails, which vary in length, are a special form of text. The difference in the lengths of e-mails increases the difficulty of text analysis. To better analyze e-mail, our models must analyze not only long e-mails but also short e-mails. Unlike normal documents, short texts have some unique characteristics, such as data sparsity and ambiguity problems, making it difficult to obtain useful information from them. However, long text and short text cannot be analyzed in the same manner. Therefore, we have to analyze the characteristics of both. We present the Biterm Author Topic in the Sentences (BATS) model, which can discover the relevant topics of a corpus and accurately capture the relationship between the topics and authors of e-mails. The Author Topic (AT) model learns from a single word in a document, while BATS is modeled on word co-occurrence in the entire corpus. We assume that all words in a single sentence are generated from the same topic. Accordingly, our method uses only word co-occurrence patterns at the sentence level, rather than the document or corpus level. Experiments on the Enron data set indicate that our proposed method achieves better performance on e-mails than the baseline methods. Moreover, our method analyzes long texts effectively and solves the data sparsity problems of short texts.
There are increasing demands for improved analysis of multimodal data that consist of multiple representations, such as multilingual documents and text-annotated images. One promising approach for analyzing such multimodal data is latent topic models. In this paper, we propose conditionally independent generalized relational topic models (CI-gRTM) for predicting unknown relations across different multiple representations of multimodal data. We developed CI-gRTM as a multimodal extension of discriminative relational topic models called generalized relational topic models (gRTM). We demonstrated through experiments with multilingual documents that CI-gRTM can more effectively predict both multilingual representations and relations between two different language representations compared with several state-of-the-art baseline models that can predict either multilingual representations or unimodal relations.
Social media has already become a new arena of our lives and involves different aspects of our social presence. Users' personal information and activities on social media presumably reveal their personal interests, which offers great opportunities for many e-commerce applications. In this paper, we propose a principled latent variable model to infer user consumption preferences at the category level (e.g., inferring what categories of products a user would like to buy). Our model naturally links users' published content and following relations on microblogs with their consumption behaviors on e-commerce websites. Experimental results show that our model significantly outperforms the state-of-the-art methods in inferring a new user's consumption preference. Our model can also learn meaningful consumption-specific topics automatically.
Video data mining based on topic models is an emerging technique that has recently become a very popular research topic. In this paper, we present a novel topic model named sequential correspondence hierarchical Dirichlet processes (Seq-cHDP) to learn the hidden structure within video data. The Seq-cHDP model can be regarded as an extended hierarchical Dirichlet processes (HDP) model with two important features: one is the time-dependency mechanism that connects neighboring video frames on the basis of a time-dependent Markovian assumption, and the other is the correspondence mechanism that provides a solution for dealing with multimodal data, such as the mixture of visual words and speech words extracted from video files. A cascaded Gibbs sampling method is applied to implement the inference task of Seq-cHDP. We present a comprehensive evaluation of Seq-cHDP through experimentation and demonstrate that Seq-cHDP outperforms other baseline models.
Web search queries are usually vague, ambiguous, or tend to have multiple intents. Users have different search intents while issuing the same query. Understanding the intents through mining subtopics underlying a query has gained much interest in recent years. Query suggestions provided by search engines hold some intents of the original query; however, suggested queries are often noisy and contain a group of alternative queries with similar meaning. Therefore, identifying the subtopics covering possible intents behind a query is a formidable task. Moreover, because both the query and the subtopics are short in length, it is challenging to estimate the similarity between a pair of short texts and rank them accordingly. In this paper, we propose a method for mining and ranking subtopics in which we introduce multiple semantic and content-aware features, a bipartite graph-based ranking (BGR) method, and a similarity function for short texts. Given a query, we aggregate the suggested queries from search engines as candidate subtopics and estimate their relevance to the given query based on word embedding and content-aware features by modeling a bipartite graph. To estimate the similarity between two short texts, we propose a Jensen-Shannon divergence based similarity function using the probability distributions of the terms in the top retrieved documents from a search engine. A diversified ranked list of subtopics covering possible intents of a query is assembled by balancing relevance and novelty. We experimented with and evaluated our method on the NTCIR-10 INTENT-2 and NTCIR-12 IMINE-2 subtopic mining test collections. Our proposed method outperforms the baselines, known related methods, and the official participants of the INTENT-2 and IMINE-2 competitions.
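A similarity of the kind described above can be sketched as follows. The sketch assumes that each short text is represented by the tokenized top documents retrieved for it, and it maps the Jensen-Shannon divergence (base 2) to a similarity with `1 - JSD`; that mapping and the function names are assumptions made for illustration, not the paper's exact definition.

```python
import math
from collections import Counter

def term_distribution(docs):
    """Maximum-likelihood term distribution over a list of tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def js_similarity(p, q):
    """Return 1 - JSD(p, q) with log base 2, which lies in [0, 1]."""
    vocab = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl_to_mixture(a):
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    return 1.0 - 0.5 * (kl_to_mixture(p) + kl_to_mixture(q))

# Toy retrieved documents for two short texts.
p = term_distribution([["apple", "iphone", "price"], ["apple", "store"]])
q = term_distribution([["apple", "fruit", "price"]])
similarity = js_similarity(p, q)
```

Identical distributions score 1 and distributions with disjoint vocabularies score 0, so the value can be used directly when ranking candidate subtopics against a query.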
Jie ZOU Ling XU Mengning YANG Xiaohong ZHANG Jun ZENG Sachio HIROKAWA
Bug reports expressed in natural language text are usually numerous, ambiguous and poorly written, which makes duplicate bug report detection challenging. Current automatic duplicate bug report detection techniques have mainly focused on textual information and ignored some useful factors. To improve the detection accuracy, in this paper we propose a new approach called the LNG (LDA and N-gram) model, which takes advantage of the topic model LDA and the word-based N-gram model. LNG considers multiple factors, including textual information, semantic correlation, word order, contextual connections, and categorial information, that potentially affect the detection accuracy. In addition, the N-gram model adopted in LNG is improved by modifying the similarity algorithm. The experiment is conducted on more than 230,000 real bug reports from the Eclipse project. In the evaluation, we propose a new evaluation metric, namely the exact-accuracy (EA) rate, which can be used to enhance the understanding of the performance of duplicate detection. The evaluation results show that the recall rate, precision rate, and EA rate of the proposed method are all higher than those obtained when LDA and N-gram are used separately. Also, the recall rate is improved by 2.96%-10.53% compared to the state-of-the-art approach DBTM.
Marie KATSURAI Ikki OHMUKAI Hideaki TAKEDA
It is crucial to promote interdisciplinary research and recommend collaborators from different research fields via academic database analysis. This paper addresses the problem of characterizing researchers' interests with a set of diverse research topics found in a large-scale academic database. Specifically, we first use latent Dirichlet allocation to extract topics as distributions over words from a training dataset. Then, we convert the textual features of a researcher's publications to topic vectors, and calculate the centroid of these vectors to summarize the researcher's interest as a single vector. In experiments conducted on CiNii Articles, which is the largest academic database in Japan, we show that the extracted topics reflect the diversity of the research fields in the database. The experimental results also indicate the applicability of the proposed topic representation to the author disambiguation problem.
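The centroid step can be sketched in a few lines. The sketch assumes that each publication has already been converted to a topic vector of equal length (for example, by an LDA implementation); the cosine comparison between two researchers is an added assumption for illustration and is not described in the abstract.

```python
import math

def centroid(topic_vectors):
    """Average a researcher's per-publication topic vectors into one interest vector."""
    if not topic_vectors:
        raise ValueError("at least one topic vector is required")
    n = len(topic_vectors)
    return [sum(component) / n for component in zip(*topic_vectors)]

def cosine(u, v):
    """Cosine similarity between two interest vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy topic vectors over two topics for two hypothetical researchers.
researcher_a = centroid([[1.0, 0.0], [0.0, 1.0]])
researcher_b = centroid([[0.5, 0.5]])
similarity = cosine(researcher_a, researcher_b)
```

Summarizing each researcher as one vector keeps comparisons cheap, at the cost of blurring researchers whose publications fall into clearly separate topic clusters.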
JinAn XU JiangMing LIU Kenji ARAKI
Topic features are useful in improving text summarization. However, independence among topics is a strong restriction in most topic models, and alleviating this restriction allows text structure to be captured more deeply. This paper proposes a hybrid topic model to generate multi-document summaries using a combination of the Hidden Topic Markov Model (HTMM), the surface texture model and the topic transition model. Based on the topic transition model, regular topic transition probability is used when generating the summary. This approach eliminates the topic independence assumption in the Latent Dirichlet Allocation (LDA) model. Meanwhile, the results of experiments show the advantage of combining the three kinds of models. The contributions of this paper include alleviating topic independence and integrating surface texture and shallow semantics in documents to improve summarization. In short, this paper attempts to realize an advanced summarization system.
Feng XIANG Benxiong HUANG Lai TU Duan HU
Understanding the structure and evolution of spatial-temporal networks is crucial for different fields ranging from urbanism to epidemiology. As location-based technologies are pervasively used in our daily lives, large amounts of sensing data have brought opportunities to study human activities and city dynamics. Ubiquitous cell phones can serve as sensors for analyzing the social connections and boundaries of geographical regions. In this paper, we exploit user mobility based on large-scale mobile phone records to study urban areas. We collect call data records from 1 million anonymous subscribers over 8 weeks and study the user mobility flux between different regions. First, we model the urban areas as a spatial network and use a modularity detection algorithm to study the intrinsic connections between map areas. Second, a generative model that is widely used in linguistic contexts is adopted to explore the functions of regions. Based on mobile call records, we are able to derive partitions that match the boundaries of the administrative districts. Our results can also capture the dynamics of urban areas as a basis for city planning and policy making.
Xiaohong YANG Mingxing XU Yufang YANG
The research reported in this paper is an attempt to elucidate the predictors of pause duration in read-aloud discourse. Through simple linear regression analysis and stepwise multiple linear regression, we examined how different factors (namely, syntactic structure, discourse hierarchy, topic structure, preboundary length, and postboundary length) influenced pause duration both separately and jointly. Results from simple regression analysis showed that discourse hierarchy, syntactic structure, topic structure, and postboundary length had significant impacts on boundary pause duration. However, when these factors were tested in a stepwise regression analysis, only discourse hierarchy, syntactic structure, and postboundary length were found to have significant impacts on boundary pause duration. The regression model that best predicted boundary pause duration in discourse context was the one that first included syntactic structure, and then included discourse hierarchy and postboundary length. This model could account for about 80% of the variance of pause duration. Tests of mediation models showed that the effects of topic structure and discourse hierarchy were significantly mediated by syntactic structure, which was most closely correlated with pause duration. These results support an integrated model combining the influence of several factors and can be applied to text-to-speech systems.
In the process of production design, engineers usually find it difficult to find and reuse others' empirical knowledge, which takes the form of lesson-learned documents. This study proposes a novel approach that uses a semantic-based topic knowledge map system (STKMS) to support the timely and precise finding and reuse of lesson-learned documents. The architecture of STKMS comprises five major functional modules: lesson-learned document pre-processing, topic extraction, topic relation computation, topic weight computation, and topic knowledge map generation. The implementation of STKMS is then briefly introduced. We conducted two sets of experiments to evaluate the quality of the knowledge maps and the performance of utilizing STKMS in the outfitting design of a ship-building company. The first experiment shows that the knowledge maps generated by STKMS are accepted by domain experts, since precision and recall are high. The second experiment shows that the STKMS-based group outperforms the browse-based group in both learning score and satisfaction level, the two measures of the performance of utilizing STKMS. The promising results confirm the feasibility of STKMS in helping engineers to find needed lesson-learned documents and reuse related knowledge easily and precisely.
A number of studies have been conducted on topic modeling for various types of data, including text and image data. In this paper, we focus particularly on the burstiness of the local features in modeling topics within video data. Burstiness is a phenomenon that is often discussed for text data. The idea is that if a word is used once in a document, it is more likely to be used again within the document. It is also observed in video data; for example, an object or visual word in video data is more likely to appear repeatedly within the same video data. Based on this idea, we propose a new topic model, the Correspondence Dirichlet Compound Multinomial LDA (Corr-DCMLDA), which takes into account the burstiness of the local features in video data. The unknown parameters and latent variables in the model are estimated by collapsed Gibbs sampling, and the hyperparameters are estimated by fixed-point iterations. We demonstrate through experimentation on the genre classification of social video data that our model works more effectively than several baselines.
Tsukasa OMOTO Koji EGUCHI Shotaro TORA
The hierarchical Dirichlet process (HDP) can provide a nonparametric prior for a mixture model with grouped data, where mixture components are shared across groups. However, the computational cost is generally very high in terms of both time and space complexity. Therefore, developing a method for fast inference of HDP remains a challenge. In this paper, we assume a symmetric multiprocessing (SMP) cluster, which has been widely used in recent years. To speed up the inference on an SMP cluster, we explore hybrid two-level parallelization of the Chinese restaurant franchise sampling scheme for HDP, especially focusing on the application to topic modeling. The methods we developed, Hybrid-AD-HDP and Hybrid-Diff-AD-HDP, make better use of SMP clusters, resulting in faster HDP inference. While conventional parallel algorithms with a full message-passing interface do not benefit from using SMP clusters due to higher communication costs, the proposed hybrid parallel algorithms have lower communication costs and make better use of the computational resources.