For statistical language model training, corpora matched to the target domain are required. However, training corpora often contain both sentences that match the target domain and sentences that do not. In such cases, training set selection is effective both for reducing model size and for improving model performance. In this paper, a training set selection method for statistical language model training is described. The method provides two advantages for training a language model: it improves the language model's performance, and it reduces the computational load of building and using the model. The method has four steps. 1) Sentence clustering is applied to all available corpora. 2) A language model is trained on each cluster. 3) Perplexity on the development set is calculated using each cluster's language model. 4) The final language model is trained on the clusters whose language models yield low perplexities. The experimental results indicate that a language model trained on the data selected by our method gives lower perplexity on an open test set than a language model trained on all available corpora.
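As a rough illustration, here is a minimal sketch of the four-step procedure described in the abstract. The specific choices below are assumptions, not the paper's: k-means over TF-IDF vectors stands in for the sentence-clustering step, an add-one-smoothed unigram model stands in for the per-cluster language models, and `n_clusters` and `n_keep` are illustrative parameters.

```python
# Sketch of perplexity-based training set selection, assuming k-means
# clustering and unigram LMs as stand-ins for the paper's actual components.
import math
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def train_unigram(sentences):
    """Train an add-one-smoothed unigram LM from whitespace-tokenized text."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    return counts, total, vocab


def perplexity(model, sentences):
    """Per-token perplexity of `sentences` under the unigram LM."""
    counts, total, vocab = model
    log_prob, n_tokens = 0.0, 0
    for s in sentences:
        for tok in s.split():
            p = (counts.get(tok, 0) + 1) / (total + vocab)  # add-one smoothing
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


def select_training_set(corpus, dev_set, n_clusters=10, n_keep=3):
    # Step 1: cluster all available training sentences.
    vectors = TfidfVectorizer().fit_transform(corpus)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    clusters = [[s for s, lab in zip(corpus, labels) if lab == c]
                for c in range(n_clusters)]

    # Steps 2-3: train one LM per cluster and score the dev set with each.
    scored = [(perplexity(train_unigram(c), dev_set), c) for c in clusters]

    # Step 4: keep the clusters whose LMs give the lowest dev-set perplexity.
    scored.sort(key=lambda x: x[0])
    return [s for _, cluster in scored[:n_keep] for s in cluster]
```

The sentences returned by `select_training_set` would then be used to train the final, full-scale language model with whatever toolkit and n-gram order the application calls for.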
Keiji YASUDA, Hirofumi YAMAMOTO, Eiichiro SUMITA, "Training Set Selection for Building Compact and Efficient Language Models" in IEICE TRANSACTIONS on Information, vol. E92-D, no. 3, pp. 506-511, March 2009, doi: 10.1587/transinf.E92.D.506.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E92.D.506/_p
@ARTICLE{e92-d_3_506,
author={Keiji YASUDA and Hirofumi YAMAMOTO and Eiichiro SUMITA},
journal={IEICE TRANSACTIONS on Information},
title={Training Set Selection for Building Compact and Efficient Language Models},
year={2009},
volume={E92-D},
number={3},
pages={506-511},
doi={10.1587/transinf.E92.D.506},
ISSN={1745-1361},
month={March}
}
TY - JOUR
TI - Training Set Selection for Building Compact and Efficient Language Models
T2 - IEICE TRANSACTIONS on Information
SP - 506
EP - 511
AU - Keiji YASUDA
AU - Hirofumi YAMAMOTO
AU - Eiichiro SUMITA
PY - 2009
DO - 10.1587/transinf.E92.D.506
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E92-D
IS - 3
JA - IEICE TRANSACTIONS on Information
Y1 - March 2009
ER -