Author Search Result

[Author] Thatsanee CHAROENPORN (3 hits)

  • An EM-Based Approach for Mining Word Senses from Corpora

    Thatsanee CHAROENPORN  Canasai KRUENGKRAI  Thanaruk THEERAMUNKONG  Virach SORNLERTLAMVANICH  

     
    PAPER-Natural Language Processing

    Vol: E90-D  No: 4  Page(s): 775-782

    Manually collecting the contexts of a target word and grouping them by meaning yields a set of word senses, but the task is tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two techniques: the EM algorithm and principal component analysis (PCA). A spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms achieve promising performance in discriminating word senses and that PCA-sGEM outperforms the other two methods to some extent.

  • Statistical-Based Approach to Non-segmented Language Processing

    Virach SORNLERTLAMVANICH  Thatsanee CHAROENPORN  Shisanu TONGCHIM  Canasai KRUENGKRAI  Hitoshi ISAHARA  

     
    PAPER

    Vol: E90-D  No: 10  Page(s): 1565-1573

    Several approaches have been studied to cope with the exceptional features of non-segmented languages. When there is no explicit information about word boundaries, segmenting an input text is a formidable task in language processing. Not only a contemporary word list but also the usages of the words have to be maintained to cover their use in current texts. The accuracy and efficiency of higher-level processing rely heavily on this word boundary identification task. In this paper, we introduce some statistical approaches to tackle the problems caused by ambiguity in word segmentation. The word boundary identification problem is then defined as one part of a unified language processing task. To demonstrate the ability to conduct unified language processing, we selectively study the tasks of language identification, word extraction, and a dictionary-less search engine.

  • Construction of Thai Lexicon from Existing Dictionaries and Texts on the Web

    Thatsanee CHAROENPORN  Canasai KRUENGKRAI  Thanaruk THEERAMUNKONG  Virach SORNLERTLAMVANICH  

     
    PAPER-Natural Language Processing

    Vol: E89-D  No: 7  Page(s): 2286-2293

    A lexicon is an important linguistic resource needed for both shallow and deep language processing. Currently, few machine-readable Thai dictionaries are available, and most of them do not satisfy computational requirements. This paper presents the design of a Thai lexicon named the TCL's Computational Lexicon (TCLLEX) and proposes a method to construct a large-scale Thai lexicon by reusing two existing dictionaries and a large number of texts on the Internet. In addition to the morphological, syntactic, semantic case role, and logical information in the existing dictionaries, a kind of semantic constraint called selectional preference is automatically acquired by analyzing Thai texts on the web and then added to the lexicon. In acquiring the selectional preferences, the Bayesian Information Criterion (BIC) is applied as the measure in a tree cut model. Experiments are conducted to verify the feasibility and effectiveness of the obtained selectional preferences.