Corpus Expansion for Neural CWS on Microblog-Oriented Data with λ-Active Learning Approach

Jing ZHANG; Degen HUANG; Kaiyu HUANG; Zhuang LIU; Fuji REN

doi:10.1587/transinf.2017EDP7239

Summary:

Microblog data contains rich information about real-world events with great commercial value, so microblog-oriented natural language processing (NLP) tasks have attracted considerable attention from researchers. However, the performance of microblog-oriented Chinese Word Segmentation (CWS) based on deep neural networks (DNNs) is still unsatisfactory. One critical reason is that the existing microblog-oriented training corpus is inadequate for training effective weight matrices for DNNs. In this paper, we propose a novel active learning method to extend the scale of the training corpus for DNNs. However, because raw microblogs contain a large number of partially overlapping sentences, it is difficult to select samples with high annotation value during the active learning procedure. To select samples with higher annotation value, a parameter λ is introduced to control the number of repeatedly selected samples. Meanwhile, various strategies are adopted to measure the overall annotation value of a sample during the active learning procedure. Experiments on the NLPCC 2015 benchmark datasets show that our λ-active learning method outperforms the baseline system and the state-of-the-art method. In addition, the results demonstrate that the performance of DNNs trained on the extended corpus is significantly improved.
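The summary does not spell out how λ enters the selection step, so the following is only a rough sketch of one possible reading, not the authors' algorithm. It assumes that candidates are ranked by some annotation-value score, that two sentences count as "partially overlapped" when the Jaccard similarity of their character bigrams reaches a threshold, and that λ caps how many such near-repeats may enter one batch. The names `char_bigrams`, `overlap`, `select_batch`, and the `threshold` parameter are all hypothetical.

```python
from typing import Callable, List, Sequence, Set


def char_bigrams(sentence: str) -> Set[str]:
    """Return the set of character bigrams of a sentence."""
    return {sentence[i:i + 2] for i in range(len(sentence) - 1)}


def overlap(a: str, b: str) -> float:
    """Jaccard similarity of the character-bigram sets (an assumed measure)."""
    ga, gb = char_bigrams(a), char_bigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def select_batch(
    pool: Sequence[str],
    score: Callable[[str], float],
    batch_size: int,
    lam: int,
    threshold: float = 0.5,
) -> List[str]:
    """Pick up to batch_size sentences with the highest scores.

    At most `lam` of the picked sentences may overlap (>= threshold)
    with a sentence that was already picked in the same batch.
    """
    selected: List[str] = []
    repeats = 0
    for sentence in sorted(pool, key=score, reverse=True):
        if len(selected) == batch_size:
            break
        is_repeat = any(overlap(sentence, s) >= threshold for s in selected)
        if is_repeat:
            if repeats >= lam:
                continue  # repeat budget used up: skip this near-duplicate
            repeats += 1
        selected.append(sentence)
    return selected


# Toy pool: the first two sentences share most of their bigrams.
uncertainty = {"今天天气很好": 0.9, "今天天气很好啊": 0.8, "明天要下雨": 0.3}
pool = list(uncertainty)

strict = select_batch(pool, uncertainty.__getitem__, batch_size=2, lam=0)
relaxed = select_batch(pool, uncertainty.__getitem__, batch_size=2, lam=1)
print(strict)
print(relaxed)
```

With `lam=0` the near-duplicate is skipped in favour of a lower-scoring but novel sentence; with `lam=1` one near-duplicate is allowed into the batch. In a real setting the score would come from the segmentation model's uncertainty, which this sketch replaces with a fixed dictionary.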

Publication: IEICE TRANSACTIONS on Information Vol.E101-D No.3 pp.778-785

Publication Date: 2018/03/01

Publicized: 2017/12/08

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2017EDP7239

Type of Manuscript: PAPER

Category: Natural Language Processing

Authors

Jing ZHANG
  Dalian University of Technology
Degen HUANG
  Dalian University of Technology
Kaiyu HUANG
  Dalian University of Technology
Zhuang LIU
  Dalian University of Technology
Fuji REN
  Tokushima University

Keywords

Chinese word segmentation, active learning, deep neural networks, corpus expansion

Cite this

Jing ZHANG, Degen HUANG, Kaiyu HUANG, Zhuang LIU, Fuji REN, "Corpus Expansion for Neural CWS on Microblog-Oriented Data with λ-Active Learning Approach" in IEICE TRANSACTIONS on Information, vol. E101-D, no. 3, pp. 778-785, March 2018, doi: 10.1587/transinf.2017EDP7239.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2017EDP7239/_p

@ARTICLE{e101-d_3_778,
author={Jing ZHANG and Degen HUANG and Kaiyu HUANG and Zhuang LIU and Fuji REN},
journal={IEICE TRANSACTIONS on Information},
title={Corpus Expansion for Neural {CWS} on Microblog-Oriented Data with {$\lambda$}-Active Learning Approach},
year={2018},
volume={E101-D},
number={3},
pages={778--785},
abstract={Microblog data contains rich information about real-world events with great commercial value, so microblog-oriented natural language processing (NLP) tasks have attracted considerable attention from researchers. However, the performance of microblog-oriented Chinese Word Segmentation (CWS) based on deep neural networks (DNNs) is still unsatisfactory. One critical reason is that the existing microblog-oriented training corpus is inadequate for training effective weight matrices for DNNs. In this paper, we propose a novel active learning method to extend the scale of the training corpus for DNNs. However, because raw microblogs contain a large number of partially overlapping sentences, it is difficult to select samples with high annotation value during the active learning procedure. To select samples with higher annotation value, a parameter λ is introduced to control the number of repeatedly selected samples. Meanwhile, various strategies are adopted to measure the overall annotation value of a sample during the active learning procedure. Experiments on the NLPCC 2015 benchmark datasets show that our λ-active learning method outperforms the baseline system and the state-of-the-art method. In addition, the results demonstrate that the performance of DNNs trained on the extended corpus is significantly improved.},
keywords={Chinese word segmentation, active learning, deep neural networks, corpus expansion},
doi={10.1587/transinf.2017EDP7239},
ISSN={1745-1361},
month={March},}

TY  - JOUR
TI  - Corpus Expansion for Neural CWS on Microblog-Oriented Data with λ-Active Learning Approach
T2  - IEICE TRANSACTIONS on Information
SP  - 778
EP  - 785
AU  - Jing ZHANG
AU  - Degen HUANG
AU  - Kaiyu HUANG
AU  - Zhuang LIU
AU  - Fuji REN
PY  - 2018
DO  - 10.1587/transinf.2017EDP7239
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E101-D
IS  - 3
JA  - IEICE TRANSACTIONS on Information
Y1  - 2018/03/01
AB  - Microblog data contains rich information about real-world events with great commercial value, so microblog-oriented natural language processing (NLP) tasks have attracted considerable attention from researchers. However, the performance of microblog-oriented Chinese Word Segmentation (CWS) based on deep neural networks (DNNs) is still unsatisfactory. One critical reason is that the existing microblog-oriented training corpus is inadequate for training effective weight matrices for DNNs. In this paper, we propose a novel active learning method to extend the scale of the training corpus for DNNs. However, because raw microblogs contain a large number of partially overlapping sentences, it is difficult to select samples with high annotation value during the active learning procedure. To select samples with higher annotation value, a parameter λ is introduced to control the number of repeatedly selected samples. Meanwhile, various strategies are adopted to measure the overall annotation value of a sample during the active learning procedure. Experiments on the NLPCC 2015 benchmark datasets show that our λ-active learning method outperforms the baseline system and the state-of-the-art method. In addition, the results demonstrate that the performance of DNNs trained on the extended corpus is significantly improved.
ER  -