IEICE TRANSACTIONS on Information

The Biterm Author Topic in the Sentences Model for E-Mail Analysis

Xiuze ZHOU, Shunxiang WU


Summary :

E-mails, which vary widely in length, are a special form of text, and this variation makes them harder to analyze. To analyze e-mail well, a model must handle both long and short messages. Unlike normal documents, short texts suffer from data sparsity and ambiguity, which makes it difficult to extract useful information from them. Long and short texts therefore cannot be analyzed in the same manner, and the characteristics of both must be taken into account. We present the Biterm Author Topic in the Sentences (BATS) model, which discovers the relevant topics of a corpus and accurately captures the relationship between topics and the authors of e-mails. The Author Topic (AT) model learns from single words in a document, whereas BATS is modeled on word co-occurrence in the entire corpus. We assume that all words in a single sentence are generated from the same topic. Accordingly, our method uses only word co-occurrence patterns at the sentence level, rather than at the document or corpus level. Experiments on the Enron data set indicate that the proposed method outperforms the baseline methods on e-mails. Moreover, our method analyzes long texts effectively and solves the data sparsity problem of short texts.
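
The sentence-level co-occurrence idea in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sentence_biterms`, the punctuation-based sentence splitter, and the lowercase tokenizer are all assumptions made for illustration, and a "biterm" is taken here to mean an unordered pair of distinct words that appear in the same sentence. The sketch only counts such pairs; it performs no topic inference and no author modeling.

```python
from collections import Counter
from itertools import combinations
import re


def sentence_biterms(text):
    """Count unordered word pairs that co-occur within the same sentence.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    counts = Counter()
    # Naive sentence split on ., ! and ? (an assumption, not the paper's method).
    for sentence in re.split(r"[.!?]+", text):
        # Naive tokenizer: lowercase alphabetic tokens, no stop-word removal.
        words = re.findall(r"[a-z']+", sentence.lower())
        # Every pair of word positions in one sentence yields one biterm;
        # pairs of identical words are skipped.
        for w1, w2 in combinations(words, 2):
            if w1 != w2:
                counts[tuple(sorted((w1, w2)))] += 1
    return counts


email = "Please review the contract. The contract needs review today."
biterms = sentence_biterms(email)
```

Because pairs are formed only inside a sentence, words from different sentences of the same e-mail (for example "please" and "today" above) never form a biterm, which is the restriction the summary describes. A real pipeline would likely also remove stop words and use a proper sentence segmenter.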

Publication
IEICE TRANSACTIONS on Information Vol.E100-D No.8 pp.1852-1859
Publication Date
2017/08/01
Publicized
2017/04/25
Online ISSN
1745-1361
DOI
10.1587/transinf.2016EDP7382
Type of Manuscript
PAPER
Category
Artificial Intelligence, Data Mining

Authors

Xiuze ZHOU
  Xiamen University
Shunxiang WU
  Xiamen University

Keyword