Text categorization, especially short text categorization, is a challenging task because text data is sparse and high-dimensional. Traditional text classification methods represent document texts with the “Bag of Words (BOW)” representation scheme, which is based on word co-occurrence and has many limitations. In this paper, we mapped document texts to Wikipedia concepts and used a Wikipedia-concept-based document representation in place of the traditional BOW model for text classification. To overcome the document representation model's weakness of ignoring the semantic relationships among terms, and to exploit the rich semantic knowledge in Wikipedia, we constructed a semantic matrix to enrich the Wikipedia-concept-based document representation. Experimental evaluation on five real datasets of long and short texts shows that our approach outperforms the traditional BOW method.
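The abstract does not state how the semantic matrix is applied, so the following is only a minimal sketch under one common assumption: the enriched vector is the product of the concept vector and a concept-to-concept relatedness matrix. The concept names, the scores in `S`, and the helpers `enrich` and `cosine` are all hypothetical and are not taken from the paper.

```python
import math

def enrich(doc, semantic_matrix):
    """Spread each concept's weight onto related concepts (assumed d' = d * S)."""
    n_out = len(semantic_matrix[0])
    return [
        sum(doc[i] * semantic_matrix[i][j] for i in range(len(doc)))
        for j in range(n_out)
    ]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary of three Wikipedia concepts.
concepts = ["Puma", "Cougar", "Automobile"]

# Hypothetical relatedness scores in [0, 1]; the diagonal is 1 so that
# every concept keeps its own weight after enrichment.
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [1.0, 0.0, 0.0]  # a short text that mentions only "Puma"
doc_b = [0.0, 1.0, 0.0]  # a short text that mentions only "Cougar"

raw_similarity = cosine(doc_a, doc_b)
enriched_similarity = cosine(enrich(doc_a, S), enrich(doc_b, S))
```

In this toy case the two documents share no concept, so their raw vectors are orthogonal, while the enriched vectors overlap because the two concepts are marked as related. This is the kind of sparsity problem in short texts that the abstract says the semantic matrix is meant to address.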
Xiang WANG
National University of Defense Technology
Yan JIA
National University of Defense Technology
Ruhua CHEN
National University of Defense Technology
Hua FAN
National University of Defense Technology
Bin ZHOU
National University of Defense Technology
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Xiang WANG, Yan JIA, Ruhua CHEN, Hua FAN, Bin ZHOU, "Improving Text Categorization with Semantic Knowledge in Wikipedia" in IEICE TRANSACTIONS on Information, vol. E96-D, no. 12, pp. 2786-2794, December 2013, doi: 10.1587/transinf.E96.D.2786.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E96.D.2786/_p
@ARTICLE{e96-d_12_2786,
author={Xiang WANG and Yan JIA and Ruhua CHEN and Hua FAN and Bin ZHOU},
journal={IEICE TRANSACTIONS on Information},
title={Improving Text Categorization with Semantic Knowledge in Wikipedia},
year={2013},
volume={E96-D},
number={12},
pages={2786-2794},
abstract={Text categorization, especially short text categorization, is a challenging task because text data is sparse and high-dimensional. Traditional text classification methods represent document texts with the “Bag of Words (BOW)” representation scheme, which is based on word co-occurrence and has many limitations. In this paper, we mapped document texts to Wikipedia concepts and used a Wikipedia-concept-based document representation in place of the traditional BOW model for text classification. To overcome the document representation model's weakness of ignoring the semantic relationships among terms, and to exploit the rich semantic knowledge in Wikipedia, we constructed a semantic matrix to enrich the Wikipedia-concept-based document representation. Experimental evaluation on five real datasets of long and short texts shows that our approach outperforms the traditional BOW method.},
keywords={},
doi={10.1587/transinf.E96.D.2786},
ISSN={1745-1361},
month={December},}
TY  - JOUR
TI  - Improving Text Categorization with Semantic Knowledge in Wikipedia
T2  - IEICE TRANSACTIONS on Information
SP  - 2786
EP  - 2794
AU  - Xiang WANG
AU  - Yan JIA
AU  - Ruhua CHEN
AU  - Hua FAN
AU  - Bin ZHOU
PY  - 2013
DO  - 10.1587/transinf.E96.D.2786
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E96-D
IS  - 12
JA  - IEICE TRANSACTIONS on Information
Y1  - December 2013
AB  - Text categorization, especially short text categorization, is a challenging task because text data is sparse and high-dimensional. Traditional text classification methods represent document texts with the “Bag of Words (BOW)” representation scheme, which is based on word co-occurrence and has many limitations. In this paper, we mapped document texts to Wikipedia concepts and used a Wikipedia-concept-based document representation in place of the traditional BOW model for text classification. To overcome the document representation model's weakness of ignoring the semantic relationships among terms, and to exploit the rich semantic knowledge in Wikipedia, we constructed a semantic matrix to enrich the Wikipedia-concept-based document representation. Experimental evaluation on five real datasets of long and short texts shows that our approach outperforms the traditional BOW method.
ER  - 