An EM-Based Approach for Mining Word Senses from Corpora

Thatsanee CHAROENPORN; Canasai KRUENGKRAI; Thanaruk THEERAMUNKONG; Virach SORNLERTLAMVANICH

doi:10.1093/ietisy/e90-d.4.775

IEICE TRANSACTIONS on Information

Summary :

Manually collecting contexts of a target word and grouping them by meaning yields a set of word senses, but the task is quite tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two modern techniques: the EM algorithm and principal component analysis (PCA). A spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms achieve promising performance in discriminating word senses, and that PCA-sGEM outperforms the other two methods to some extent.
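
As an illustration of the three evaluation measures named in the summary, the following is a minimal Python sketch, not the authors' code. The function names (`purity`, `clustering_entropy`, `nmi`) are hypothetical, the entropy measure is left unnormalized, and NMI is normalized by the geometric mean of the two entropies; the paper may define either measure differently.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    n = len(values)
    return -sum((k / n) * math.log(k / n) for k in Counter(values).values())


def _group_by_cluster(labels, clusters):
    """Map each cluster id to the list of gold labels assigned to it."""
    groups = {}
    for y, c in zip(labels, clusters):
        groups.setdefault(c, []).append(y)
    return groups


def purity(labels, clusters):
    """Fraction of items that belong to the majority gold sense of their cluster."""
    groups = _group_by_cluster(labels, clusters)
    majority = sum(max(Counter(g).values()) for g in groups.values())
    return majority / len(labels)


def clustering_entropy(labels, clusters):
    """Size-weighted average of the gold-sense entropy inside each cluster.

    Assumption: left unnormalized; lower is better, 0 means pure clusters.
    """
    n = len(labels)
    groups = _group_by_cluster(labels, clusters)
    return sum(len(g) / n * shannon_entropy(g) for g in groups.values())


def nmi(labels, clusters):
    """Mutual information divided by sqrt(H(labels) * H(clusters)).

    Assumption: geometric-mean normalization; other normalizations exist.
    """
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    n_label = Counter(labels)
    n_cluster = Counter(clusters)
    mi = sum(
        (k / n) * math.log(k * n / (n_label[y] * n_cluster[c]))
        for (y, c), k in joint.items()
    )
    denom = math.sqrt(shannon_entropy(labels) * shannon_entropy(clusters))
    return mi / denom if denom > 0 else 0.0


if __name__ == "__main__":
    # Toy gold senses and cluster assignments, invented for illustration only.
    gold = ["s1", "s1", "s2", "s2"]
    predicted = [0, 0, 1, 1]
    print(purity(gold, predicted), clustering_entropy(gold, predicted), nmi(gold, predicted))
```

Higher purity and NMI, and lower entropy, indicate clusters that agree more closely with the gold-standard senses.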

Publication: IEICE TRANSACTIONS on Information Vol.E90-D No.4 pp.775-782

Publication Date: 2007/04/01

Online ISSN: 1745-1361

DOI: 10.1093/ietisy/e90-d.4.775

Type of Manuscript: PAPER

Category: Natural Language Processing

Thatsanee CHAROENPORN, Canasai KRUENGKRAI, Thanaruk THEERAMUNKONG, Virach SORNLERTLAMVANICH, "An EM-Based Approach for Mining Word Senses from Corpora" in IEICE TRANSACTIONS on Information, vol. E90-D, no. 4, pp. 775-782, April 2007, doi: 10.1093/ietisy/e90-d.4.775.
Abstract: Manually collecting contexts of a target word and grouping them by meaning yields a set of word senses, but the task is quite tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two modern techniques: the EM algorithm and principal component analysis (PCA). A spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms achieve promising performance in discriminating word senses, and that PCA-sGEM outperforms the other two methods to some extent.
URL: https://global.ieice.org/en_transactions/information/10.1093/ietisy/e90-d.4.775/_p

@ARTICLE{e90-d_4_775,
author={Thatsanee CHAROENPORN and Canasai KRUENGKRAI and Thanaruk THEERAMUNKONG and Virach SORNLERTLAMVANICH},
journal={IEICE TRANSACTIONS on Information},
title={An EM-Based Approach for Mining Word Senses from Corpora},
year={2007},
volume={E90-D},
number={4},
pages={775--782},
abstract={Manually collecting contexts of a target word and grouping them by meaning yields a set of word senses, but the task is quite tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two modern techniques: the EM algorithm and principal component analysis (PCA). A spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms achieve promising performance in discriminating word senses, and that PCA-sGEM outperforms the other two methods to some extent.},
keywords={},
doi={10.1093/ietisy/e90-d.4.775},
ISSN={1745-1361},
month={April},}

TY - JOUR
TI - An EM-Based Approach for Mining Word Senses from Corpora
T2 - IEICE TRANSACTIONS on Information
SP - 775
EP - 782
AU - Thatsanee CHAROENPORN
AU - Canasai KRUENGKRAI
AU - Thanaruk THEERAMUNKONG
AU - Virach SORNLERTLAMVANICH
PY - 2007
DO - 10.1093/ietisy/e90-d.4.775
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E90-D
IS - 4
JA - IEICE TRANSACTIONS on Information
Y1 - April 2007
AB - Manually collecting contexts of a target word and grouping them by meaning yields a set of word senses, but the task is quite tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two modern techniques: the EM algorithm and principal component analysis (PCA). A spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms achieve promising performance in discriminating word senses, and that PCA-sGEM outperforms the other two methods to some extent.
ER -
