We present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging. The hybrid model performs well because it handles both known and unknown words. We describe strategies that balance learning the characteristics of known and unknown words, and propose an error-driven policy that achieves this balance by acquiring examples of unknown words from particular errors in a training corpus. We describe an efficient training framework based on the Margin Infused Relaxed Algorithm (MIRA), evaluate the approach on the Penn Chinese Treebank, and show that it outperforms the state-of-the-art approaches reported in the literature.
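For readers unfamiliar with MIRA, the following is a minimal sketch of a generic single-constraint (1-best) MIRA update on sparse feature vectors. It is not the training procedure of the paper, whose details are not given in the abstract. The function name `mira_update`, the feature strings, the loss value, and the cap `C` are all illustrative assumptions.

```python
from collections import Counter


def mira_update(weights, gold_feats, pred_feats, loss, C=1.0):
    """Apply one generic 1-best MIRA step in place and return the step size.

    weights, gold_feats, pred_feats are sparse vectors stored as dicts.
    This is an illustrative sketch, not the paper's exact formulation.
    """
    # Feature difference between the gold analysis and the predicted one.
    diff = Counter(gold_feats)
    diff.subtract(pred_feats)
    diff = {k: v for k, v in diff.items() if v != 0}

    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0:
        return 0.0  # identical feature vectors: nothing to update

    # Current score margin of the gold analysis over the prediction.
    margin = sum(weights.get(k, 0.0) * v for k, v in diff.items())

    # Smallest step that makes the margin at least `loss`, capped at C.
    tau = min(C, max(0.0, (loss - margin) / norm_sq))
    for k, v in diff.items():
        weights[k] = weights.get(k, 0.0) + tau * v
    return tau


# Hypothetical features: a word-level feature for the gold analysis and a
# character-level feature for the wrong prediction.
weights = {}
gold = {"word=北京/NR": 1.0}
pred = {"char=北/B-NN": 1.0}
tau = mira_update(weights, gold, pred, loss=1.0)
```

After one step the gold analysis outscores the prediction by exactly the loss, so repeating the same update leaves the weights unchanged.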
Canasai KRUENGKRAI, Kiyotaka UCHIMOTO, Jun'ichi KAZAMA, Yiou WANG, Kentaro TORISAWA, Hitoshi ISAHARA, "Joint Chinese Word Segmentation and POS Tagging Using an Error-Driven Word-Character Hybrid Model" in IEICE TRANSACTIONS on Information,
vol. E92-D, no. 12, pp. 2298-2305, December 2009, doi: 10.1587/transinf.E92.D.2298.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E92.D.2298/_p
@ARTICLE{e92-d_12_2298,
author={Canasai KRUENGKRAI and Kiyotaka UCHIMOTO and Jun'ichi KAZAMA and Yiou WANG and Kentaro TORISAWA and Hitoshi ISAHARA},
journal={IEICE TRANSACTIONS on Information},
title={Joint Chinese Word Segmentation and POS Tagging Using an Error-Driven Word-Character Hybrid Model},
year={2009},
volume={E92-D},
number={12},
pages={2298-2305},
keywords={},
doi={10.1587/transinf.E92.D.2298},
ISSN={1745-1361},
month={December},}
TY - JOUR
TI - Joint Chinese Word Segmentation and POS Tagging Using an Error-Driven Word-Character Hybrid Model
T2 - IEICE TRANSACTIONS on Information
SP - 2298
EP - 2305
AU - Canasai KRUENGKRAI
AU - Kiyotaka UCHIMOTO
AU - Jun'ichi KAZAMA
AU - Yiou WANG
AU - Kentaro TORISAWA
AU - Hitoshi ISAHARA
PY - 2009
DO - 10.1587/transinf.E92.D.2298
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E92-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2009
ER -