A Linguistics-Driven Approach to Statistical Parsing for Low-Resourced Languages

Prachya BOONKWAN; Thepchai SUPNITHI

doi:10.1587/transinf.2014DAP0024

Summary:

Developing a practical and accurate statistical parser for low-resourced languages is a hard problem, because it requires large-scale treebanks, which are expensive and labor-intensive to build from scratch. Unsupervised grammar induction theoretically offers a way to overcome this hurdle by learning hidden syntactic structures from raw text automatically. The accuracy of grammar induction is still impractically low because frequent collocations of non-linguistically associable units are commonly found, resulting in dependency attachment errors. We introduce a novel approach to building a statistical parser for low-resourced languages by using language parameters as a guide for grammar induction. The intuition of this paper is: most dependency attachment errors are frequently used word orders which can be captured by a small prescribed set of linguistic constraints, while the rest of the language can be learned statistically by grammar induction. We then show that covering the most frequent grammar rules via our language parameters has a strong impact on the parsing accuracy in 12 languages.

Publication: IEICE TRANSACTIONS on Information Vol.E98-D No.5 pp.1045-1052

Publication Date: 2015/05/01

Publicized: 2015/01/21

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2014DAP0024

Type of Manuscript: Special Section PAPER (Special Section on Data Engineering and Information Management)

Authors

Prachya BOONKWAN
Language and Semantic Technology Laboratory, National Electronics and Computer Technology Center
Thepchai SUPNITHI
Language and Semantic Technology Laboratory, National Electronics and Computer Technology Center

Keywords

statistical parsing, grammar induction, language parameters, Universal Grammar, treebank

Cite this

Prachya BOONKWAN, Thepchai SUPNITHI, "A Linguistics-Driven Approach to Statistical Parsing for Low-Resourced Languages" in IEICE TRANSACTIONS on Information, vol. E98-D, no. 5, pp. 1045-1052, May 2015, doi: 10.1587/transinf.2014DAP0024.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2014DAP0024/_p

@ARTICLE{e98-d_5_1045,
author={Prachya BOONKWAN and Thepchai SUPNITHI},
journal={IEICE TRANSACTIONS on Information},
title={A Linguistics-Driven Approach to Statistical Parsing for Low-Resourced Languages},
year={2015},
volume={E98-D},
number={5},
pages={1045--1052},
abstract={Developing a practical and accurate statistical parser for low-resourced languages is a hard problem, because it requires large-scale treebanks, which are expensive and labor-intensive to build from scratch. Unsupervised grammar induction theoretically offers a way to overcome this hurdle by learning hidden syntactic structures from raw text automatically. The accuracy of grammar induction is still impractically low because frequent collocations of non-linguistically associable units are commonly found, resulting in dependency attachment errors. We introduce a novel approach to building a statistical parser for low-resourced languages by using language parameters as a guide for grammar induction. The intuition of this paper is: most dependency attachment errors are frequently used word orders which can be captured by a small prescribed set of linguistic constraints, while the rest of the language can be learned statistically by grammar induction. We then show that covering the most frequent grammar rules via our language parameters has a strong impact on the parsing accuracy in 12 languages.},
keywords={statistical parsing, grammar induction, language parameters, Universal Grammar, treebank},
doi={10.1587/transinf.2014DAP0024},
ISSN={1745-1361},
month={May}
}

TY - JOUR
TI - A Linguistics-Driven Approach to Statistical Parsing for Low-Resourced Languages
T2 - IEICE TRANSACTIONS on Information
SP - 1045
EP - 1052
AU - Prachya BOONKWAN
AU - Thepchai SUPNITHI
PY - 2015
DO - 10.1587/transinf.2014DAP0024
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E98-D
IS - 5
JA - IEICE TRANSACTIONS on Information
Y1 - May 2015
AB - Developing a practical and accurate statistical parser for low-resourced languages is a hard problem, because it requires large-scale treebanks, which are expensive and labor-intensive to build from scratch. Unsupervised grammar induction theoretically offers a way to overcome this hurdle by learning hidden syntactic structures from raw text automatically. The accuracy of grammar induction is still impractically low because frequent collocations of non-linguistically associable units are commonly found, resulting in dependency attachment errors. We introduce a novel approach to building a statistical parser for low-resourced languages by using language parameters as a guide for grammar induction. The intuition of this paper is: most dependency attachment errors are frequently used word orders which can be captured by a small prescribed set of linguistic constraints, while the rest of the language can be learned statistically by grammar induction. We then show that covering the most frequent grammar rules via our language parameters has a strong impact on the parsing accuracy in 12 languages.
ER -