Bilingual Cluster Based Models for Statistical Machine Translation

Hirofumi YAMAMOTO; Eiichiro SUMITA

doi:10.1093/ietisy/e91-d.3.588

IEICE TRANSACTIONS on Information

Bilingual Cluster Based Models for Statistical Machine Translation

Hirofumi YAMAMOTO, Eiichiro SUMITA

Full Text Views

0

Cite this

Summary :

We propose a domain specific model for statistical machine translation. It is well-known that domain specific language models perform well in automatic speech recognition. We show that domain specific language and translation models also benefit statistical machine translation. However, there are two problems with using domain specific models. The first is the data sparseness problem. We employ an adaptation technique to overcome this problem. The second issue is domain prediction. In order to perform adaptation, the domain must be provided, however in many cases, the domain is not known or changes dynamically. For these cases, not only the translation target sentence but also the domain must be predicted. This paper focuses on the domain prediction problem for statistical machine translation. In the proposed method, a bilingual training corpus, is automatically clustered into sub-corpora. Each sub-corpus is deemed to be a domain. The domain of a source sentence is predicted by using its similarity to the sub-corpora. The predicted domain (sub-corpus) specific language and translation models are then used for the translation decoding. This approach gave an improvement of 2.7 in BLEU score on the IWSLT05 Japanese to English evaluation corpus (improving the score from 52.4 to 55.1). This is a substantial gain and indicates the validity of the proposed bilingual cluster based models.

Publication: IEICE TRANSACTIONS on Information Vol.E91-D No.3 pp.588-597

Publication Date: 2008/03/01

Publicized

Online ISSN: 1745-1361

DOI: 10.1093/ietisy/e91-d.3.588

Type of Manuscript: Special Section PAPER (Special Section on Robust Speech Processing in Realistic Environments)

Category: Applications

Cite this

Copy

Hirofumi YAMAMOTO, Eiichiro SUMITA, "Bilingual Cluster Based Models for Statistical Machine Translation" in IEICE TRANSACTIONS on Information, vol. E91-D, no. 3, pp. 588-597, March 2008, doi: 10.1093/ietisy/e91-d.3.588.
Abstract: We propose a domain specific model for statistical machine translation. It is well-known that domain specific language models perform well in automatic speech recognition. We show that domain specific language and translation models also benefit statistical machine translation. However, there are two problems with using domain specific models. The first is the data sparseness problem. We employ an adaptation technique to overcome this problem. The second issue is domain prediction. In order to perform adaptation, the domain must be provided, however in many cases, the domain is not known or changes dynamically. For these cases, not only the translation target sentence but also the domain must be predicted. This paper focuses on the domain prediction problem for statistical machine translation. In the proposed method, a bilingual training corpus, is automatically clustered into sub-corpora. Each sub-corpus is deemed to be a domain. The domain of a source sentence is predicted by using its similarity to the sub-corpora. The predicted domain (sub-corpus) specific language and translation models are then used for the translation decoding. This approach gave an improvement of 2.7 in BLEU score on the IWSLT05 Japanese to English evaluation corpus (improving the score from 52.4 to 55.1). This is a substantial gain and indicates the validity of the proposed bilingual cluster based models.
URL: https://global.ieice.org/en_transactions/information/10.1093/ietisy/e91-d.3.588/_p

Copy

@ARTICLE{e91-d_3_588,
author={Hirofumi YAMAMOTO, Eiichiro SUMITA, },
journal={IEICE TRANSACTIONS on Information},
title={Bilingual Cluster Based Models for Statistical Machine Translation},
year={2008},
volume={E91-D},
number={3},
pages={588-597},
abstract={We propose a domain specific model for statistical machine translation. It is well-known that domain specific language models perform well in automatic speech recognition. We show that domain specific language and translation models also benefit statistical machine translation. However, there are two problems with using domain specific models. The first is the data sparseness problem. We employ an adaptation technique to overcome this problem. The second issue is domain prediction. In order to perform adaptation, the domain must be provided, however in many cases, the domain is not known or changes dynamically. For these cases, not only the translation target sentence but also the domain must be predicted. This paper focuses on the domain prediction problem for statistical machine translation. In the proposed method, a bilingual training corpus, is automatically clustered into sub-corpora. Each sub-corpus is deemed to be a domain. The domain of a source sentence is predicted by using its similarity to the sub-corpora. The predicted domain (sub-corpus) specific language and translation models are then used for the translation decoding. This approach gave an improvement of 2.7 in BLEU score on the IWSLT05 Japanese to English evaluation corpus (improving the score from 52.4 to 55.1). This is a substantial gain and indicates the validity of the proposed bilingual cluster based models.},
keywords={},
doi={10.1093/ietisy/e91-d.3.588},
ISSN={1745-1361},
month={March},}

Copy

TY - JOUR
TI - Bilingual Cluster Based Models for Statistical Machine Translation
T2 - IEICE TRANSACTIONS on Information
SP - 588
EP - 597
AU - Hirofumi YAMAMOTO
AU - Eiichiro SUMITA
PY - 2008
DO - 10.1093/ietisy/e91-d.3.588
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E91-D
IS - 3
JA - IEICE TRANSACTIONS on Information
Y1 - March 2008
AB - We propose a domain specific model for statistical machine translation. It is well-known that domain specific language models perform well in automatic speech recognition. We show that domain specific language and translation models also benefit statistical machine translation. However, there are two problems with using domain specific models. The first is the data sparseness problem. We employ an adaptation technique to overcome this problem. The second issue is domain prediction. In order to perform adaptation, the domain must be provided, however in many cases, the domain is not known or changes dynamically. For these cases, not only the translation target sentence but also the domain must be predicted. This paper focuses on the domain prediction problem for statistical machine translation. In the proposed method, a bilingual training corpus, is automatically clustered into sub-corpora. Each sub-corpus is deemed to be a domain. The domain of a source sentence is predicted by using its similarity to the sub-corpora. The predicted domain (sub-corpus) specific language and translation models are then used for the translation decoding. This approach gave an improvement of 2.7 in BLEU score on the IWSLT05 Japanese to English evaluation corpus (improving the score from 52.4 to 55.1). This is a substantial gain and indicates the validity of the proposed bilingual cluster based models.
ER -

IEICE TRANSACTIONS on Information

Bilingual Cluster Based Models for Statistical Machine Translation

Summary :

Authors

Keyword

Latest Issue

Contents

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles

IEICE TRANSACTIONS on Information

Bilingual Cluster Based Models for Statistical Machine Translation

Summary :

Authors

Keyword

Latest Issue

Contents

Copyrights notice of machine-translated contents

Cite this

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles