Automatically Extracting Parallel Sentences from Wikipedia Using Sequential Matching of Language Resources

Juryong CHEON; Youngjoong KO

doi:10.1587/transinf.2016EDL8135

Summary:

In this paper, we propose a method for finding similar sentences using language resources, with the goal of building an English-Korean parallel corpus from Wikipedia. Instead of a traditional machine-readable dictionary, we use a Wiki-dictionary consisting of Wikipedia document titles, together with bilingual example sentence pairs from a Web dictionary. We calculate the similarity between sentences by sequentially matching these language resources, and we evaluate the extracted parallel sentences. In the experiments, the proposed parallel sentence extraction method achieves an F1-score of 65.4%.
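As a rough illustration of the idea in the summary (not the authors' implementation), the sketch below scores an English-Korean sentence pair by consulting language resources in a fixed order: a title-based Wiki-dictionary first, then a lexicon derived from example sentence pairs. The toy dictionaries, the function names, the substring test against the Korean sentence, and the scoring rule (fraction of English tokens with a matched translation) are all assumptions made for this example; the summary does not specify them. The `f1_score` helper is the standard harmonic mean of precision and recall.

```python
from typing import Dict, List, Set

# Hypothetical toy resources. Real ones would be built from Wikipedia
# document titles and from Web-dictionary example sentence pairs.
WIKI_DICT: Dict[str, Set[str]] = {
    "seoul": {"서울"},
    "korea": {"한국", "대한민국"},
}
EXAMPLE_DICT: Dict[str, Set[str]] = {
    "capital": {"수도"},
    "korea": {"한국"},
}
# Resources are tried in this order for every token (assumed ordering).
RESOURCES: List[Dict[str, Set[str]]] = [WIKI_DICT, EXAMPLE_DICT]


def similarity(en_tokens: List[str], ko_sentence: str) -> float:
    """Fraction of English tokens whose translation occurs in the Korean sentence."""
    if not en_tokens:
        return 0.0
    matched = 0
    for token in en_tokens:
        for resource in RESOURCES:
            translations = resource.get(token.lower(), set())
            # Substring test so that attached Korean particles do not block a match.
            if any(t in ko_sentence for t in translations):
                matched += 1
                break  # stop at the first resource that yields a match
    return matched / len(en_tokens)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    en = ["Seoul", "is", "the", "capital", "of", "Korea"]
    ko = "서울은 한국의 수도이다"
    print(similarity(en, ko))
```

A sentence pair would then be kept as parallel when its score exceeds some threshold; the threshold value is a tuning choice and is not given in the summary.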

Publication: IEICE TRANSACTIONS on Information Vol.E100-D No.2 pp.405-408

Publication Date: 2017/02/01

Publicized: 2016/11/11

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2016EDL8135

Type of Manuscript: LETTER

Category: Natural Language Processing

Authors

Juryong CHEON
Dong-A University
Youngjoong KO
Dong-A University

Keyword

automatic parallel corpus construction, language resources, sentence similarity calculation, Wikipedia

Cite this

Juryong CHEON, Youngjoong KO, "Automatically Extracting Parallel Sentences from Wikipedia Using Sequential Matching of Language Resources" in IEICE TRANSACTIONS on Information, vol. E100-D, no. 2, pp. 405-408, February 2017, doi: 10.1587/transinf.2016EDL8135.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016EDL8135/_p

@ARTICLE{e100-d_2_405,
author={Juryong CHEON and Youngjoong KO},
journal={IEICE TRANSACTIONS on Information},
title={Automatically Extracting Parallel Sentences from Wikipedia Using Sequential Matching of Language Resources},
year={2017},
volume={E100-D},
number={2},
pages={405--408},
abstract={In this paper, we propose a method for finding similar sentences using language resources, with the goal of building an English-Korean parallel corpus from Wikipedia. Instead of a traditional machine-readable dictionary, we use a Wiki-dictionary consisting of Wikipedia document titles, together with bilingual example sentence pairs from a Web dictionary. We calculate the similarity between sentences by sequentially matching these language resources, and we evaluate the extracted parallel sentences. In the experiments, the proposed parallel sentence extraction method achieves an F1-score of 65.4\%.},
keywords={automatic parallel corpus construction, language resources, sentence similarity calculation, Wikipedia},
doi={10.1587/transinf.2016EDL8135},
ISSN={1745-1361},
month={February}
}

TY - JOUR
TI - Automatically Extracting Parallel Sentences from Wikipedia Using Sequential Matching of Language Resources
T2 - IEICE TRANSACTIONS on Information
SP - 405
EP - 408
AU - Juryong CHEON
AU - Youngjoong KO
PY - 2017
DO - 10.1587/transinf.2016EDL8135
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E100-D
IS - 2
JA - IEICE TRANSACTIONS on Information
Y1 - February 2017
AB - In this paper, we propose a method for finding similar sentences using language resources, with the goal of building an English-Korean parallel corpus from Wikipedia. Instead of a traditional machine-readable dictionary, we use a Wiki-dictionary consisting of Wikipedia document titles, together with bilingual example sentence pairs from a Web dictionary. We calculate the similarity between sentences by sequentially matching these language resources, and we evaluate the extracted parallel sentences. In the experiments, the proposed parallel sentence extraction method achieves an F1-score of 65.4%.
ER -