Manually collecting contexts of a target word and grouping them by meaning yields a set of word senses, but the task is quite tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two modern techniques: the EM algorithm and principal component analysis (PCA). The spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms achieve promising performance in discriminating word senses and that PCA-sGEM outperforms the other two methods to some extent.
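The three evaluation measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formulas, so the definitions below are the common textbook ones and should be read as assumptions: purity as the fraction of items in their cluster's majority sense, entropy as the size-weighted sense entropy within each cluster, and NMI normalized by the geometric mean of the two entropies. The function names `purity`, `cluster_entropy`, and `nmi` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def _entropy(counts, n):
    """Shannon entropy (natural log) of a count distribution over n items."""
    return -sum((v / n) * math.log(v / n) for v in counts if v > 0)


def _group(labels, clusters):
    """Map each cluster id to a Counter of the gold senses it contains."""
    groups = defaultdict(Counter)
    for sense, cluster in zip(labels, clusters):
        groups[cluster][sense] += 1
    return groups


def purity(labels, clusters):
    """Fraction of items that belong to the majority sense of their cluster."""
    groups = _group(labels, clusters)
    return sum(max(c.values()) for c in groups.values()) / len(labels)


def cluster_entropy(labels, clusters):
    """Size-weighted average of the sense entropy inside each cluster (lower is better)."""
    n = len(labels)
    total = 0.0
    for c in _group(labels, clusters).values():
        size = sum(c.values())
        total += (size / n) * _entropy(c.values(), size)
    return total


def nmi(labels, clusters):
    """Mutual information divided by sqrt(H(labels) * H(clusters)).

    Assumption: geometric-mean normalization; returns 0.0 by convention
    when either partition has zero entropy.
    """
    n = len(labels)
    label_counts = Counter(labels)
    cluster_counts = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    mi = sum(
        (v / n) * math.log(v * n / (label_counts[y] * cluster_counts[c]))
        for (y, c), v in joint.items()
    )
    h_labels = _entropy(label_counts.values(), n)
    h_clusters = _entropy(cluster_counts.values(), n)
    if h_labels == 0 or h_clusters == 0:
        return 0.0
    return mi / math.sqrt(h_labels * h_clusters)


if __name__ == "__main__":
    gold = ["bank/river", "bank/river", "bank/money", "bank/money"]  # toy senses
    predicted = [0, 0, 1, 1]  # toy cluster assignments
    print(purity(gold, predicted), cluster_entropy(gold, predicted), nmi(gold, predicted))
```

Higher purity and NMI and lower entropy indicate clusters that agree more closely with the gold-standard senses; the toy senses in the usage lines are illustrative only and are not the paper's data.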
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Thatsanee CHAROENPORN, Canasai KRUENGKRAI, Thanaruk THEERAMUNKONG, Virach SORNLERTLAMVANICH, "An EM-Based Approach for Mining Word Senses from Corpora" in IEICE TRANSACTIONS on Information,
vol. E90-D, no. 4, pp. 775-782, April 2007, doi: 10.1093/ietisy/e90-d.4.775.
URL: https://global.ieice.org/en_transactions/information/10.1093/ietisy/e90-d.4.775/_p
@ARTICLE{e90-d_4_775,
author={Thatsanee CHAROENPORN and Canasai KRUENGKRAI and Thanaruk THEERAMUNKONG and Virach SORNLERTLAMVANICH},
journal={IEICE TRANSACTIONS on Information},
title={An EM-Based Approach for Mining Word Senses from Corpora},
year={2007},
volume={E90-D},
number={4},
pages={775--782},
abstract={Manually collecting contexts of a target word and grouping them by meaning yields a set of word senses, but the task is quite tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two modern techniques: the EM algorithm and principal component analysis (PCA). The spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms achieve promising performance in discriminating word senses and that PCA-sGEM outperforms the other two methods to some extent.},
keywords={},
doi={10.1093/ietisy/e90-d.4.775},
ISSN={1745-1361},
month={April},}
TY - JOUR
TI - An EM-Based Approach for Mining Word Senses from Corpora
T2 - IEICE TRANSACTIONS on Information
SP - 775
EP - 782
AU - Thatsanee CHAROENPORN
AU - Canasai KRUENGKRAI
AU - Thanaruk THEERAMUNKONG
AU - Virach SORNLERTLAMVANICH
PY - 2007
DO - 10.1093/ietisy/e90-d.4.775
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E90-D
IS - 4
JA - IEICE TRANSACTIONS on Information
Y1 - April 2007
AB - Manually collecting contexts of a target word and grouping them by meaning yields a set of word senses, but the task is quite tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two modern techniques: the EM algorithm and principal component analysis (PCA). The spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms achieve promising performance in discriminating word senses and that PCA-sGEM outperforms the other two methods to some extent.
ER -