E-mails, which vary widely in length, are a special form of text, and this variation in length increases the difficulty of text analysis: to analyze e-mail well, a model must handle both long and short e-mails. Unlike normal documents, short texts exhibit unique characteristics, such as data sparsity and ambiguity, which make it difficult to extract useful information from them. Because long and short texts cannot be analyzed in the same manner, we must account for the characteristics of both. We present the Biterm Author Topic in the Sentences (BATS) model, which discovers the relevant topics of a corpus and accurately captures the relationship between topics and the authors of e-mails. Whereas the Author Topic (AT) model learns from single words in a document, BATS is built on word co-occurrence. We assume that all words in a single sentence are generated from the same topic; accordingly, our method uses word co-occurrence patterns at the sentence level rather than at the document or corpus level. Experiments on the Enron data set indicate that the proposed method outperforms the baseline methods on e-mails. Moreover, it analyzes long texts effectively and alleviates the data sparsity problem of short texts.
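The abstract's core assumption — that every pair of words within one sentence shares a topic — amounts to forming biterms per sentence rather than per document. The following is a minimal illustrative sketch of that biterm-construction step only, not the authors' implementation; the function name, the naive sentence splitter, and the regex tokenizer are all assumptions for demonstration.

```python
import itertools
import re

def sentence_biterms(email_text):
    """Extract unordered word pairs (biterms) within each sentence.

    Illustrates the sentence-level co-occurrence assumption: words in
    the same sentence are treated as topically related, so biterms are
    formed per sentence, never across sentence boundaries.
    """
    biterms = []
    # Naive sentence split on ., !, ? (a real pipeline would use a
    # proper sentence tokenizer and stop-word removal).
    for sentence in re.split(r"[.!?]+", email_text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        # All unordered pairs of word positions within this sentence.
        biterms.extend(itertools.combinations(words, 2))
    return biterms
```

In the full model, each such biterm would then be assigned a topic (and an author) during inference, e.g. by Gibbs sampling; this sketch covers only the preprocessing that distinguishes sentence-level biterms from document-level ones.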
Xiuze ZHOU
Xiamen University
Shunxiang WU
Xiamen University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Xiuze ZHOU, Shunxiang WU, "The Biterm Author Topic in the Sentences Model for E-Mail Analysis" in IEICE TRANSACTIONS on Information,
vol. E100-D, no. 8, pp. 1852-1859, August 2017, doi: 10.1587/transinf.2016EDP7382.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016EDP7382/_p
@ARTICLE{e100-d_8_1852,
author={Xiuze ZHOU and Shunxiang WU},
journal={IEICE TRANSACTIONS on Information},
title={The Biterm Author Topic in the Sentences Model for E-Mail Analysis},
year={2017},
volume={E100-D},
number={8},
pages={1852-1859},
abstract={E-mails, which vary widely in length, are a special form of text, and this variation in length increases the difficulty of text analysis: to analyze e-mail well, a model must handle both long and short e-mails. Unlike normal documents, short texts exhibit unique characteristics, such as data sparsity and ambiguity, which make it difficult to extract useful information from them. Because long and short texts cannot be analyzed in the same manner, we must account for the characteristics of both. We present the Biterm Author Topic in the Sentences (BATS) model, which discovers the relevant topics of a corpus and accurately captures the relationship between topics and the authors of e-mails. Whereas the Author Topic (AT) model learns from single words in a document, BATS is built on word co-occurrence. We assume that all words in a single sentence are generated from the same topic; accordingly, our method uses word co-occurrence patterns at the sentence level rather than at the document or corpus level. Experiments on the Enron data set indicate that the proposed method outperforms the baseline methods on e-mails. Moreover, it analyzes long texts effectively and alleviates the data sparsity problem of short texts.},
doi={10.1587/transinf.2016EDP7382},
ISSN={1745-1361},
month={August},}
TY - JOUR
TI - The Biterm Author Topic in the Sentences Model for E-Mail Analysis
T2 - IEICE TRANSACTIONS on Information
SP - 1852
EP - 1859
AU - Xiuze ZHOU
AU - Shunxiang WU
PY - 2017
DO - 10.1587/transinf.2016EDP7382
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E100-D
IS - 8
JA - IEICE TRANSACTIONS on Information
Y1 - 2017/08
AB - E-mails, which vary widely in length, are a special form of text, and this variation in length increases the difficulty of text analysis: to analyze e-mail well, a model must handle both long and short e-mails. Unlike normal documents, short texts exhibit unique characteristics, such as data sparsity and ambiguity, which make it difficult to extract useful information from them. Because long and short texts cannot be analyzed in the same manner, we must account for the characteristics of both. We present the Biterm Author Topic in the Sentences (BATS) model, which discovers the relevant topics of a corpus and accurately captures the relationship between topics and the authors of e-mails. Whereas the Author Topic (AT) model learns from single words in a document, BATS is built on word co-occurrence. We assume that all words in a single sentence are generated from the same topic; accordingly, our method uses word co-occurrence patterns at the sentence level rather than at the document or corpus level. Experiments on the Enron data set indicate that the proposed method outperforms the baseline methods on e-mails. Moreover, it analyzes long texts effectively and alleviates the data sparsity problem of short texts.
ER -