Stemming Malay Text and Its Application in Automatic Text Categorization

Michiko YASUKAWA, Hui Tian LIM, Hidetoshi YOKOO

Summary:

In the Malay language, there are no conjugations or declensions, and affixes have important grammatical functions. The same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversation, it is essential to use precise words in formal speech and written texts. To make sentences clear, Malay uses derivative words, and derivation is achieved mainly through affixes. A root word has approximately a hundred possible derivative forms in the written language of the educated Malay, so the composition of Malay words can be complicated. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. Although several types of stemming algorithms are available for text processing in English and some other languages, they cannot be used to overcome the difficulties of Malay word stemming. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The set of rules is intended to reduce under-stemming errors, while the dictionaries are intended to reduce over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. The text data used in the experiment were actual web pages collected from the World Wide Web, chosen to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
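
The interplay between affix rules and the two dictionaries can be illustrated with a minimal sketch. It is not the authors' algorithm: the rule order, the tiny word lists, the affix lists, and the function name `stem` are all assumptions made for illustration. It only shows the general idea that dictionary lookups can block over-stemming while affix rules handle derived forms that are not listed.

```python
# Toy resources: hypothetical, illustrative only (not the paper's dictionaries or rules).
ROOT_WORDS = {"ajar", "makan", "baca", "main"}
DERIVATIVE_WORDS = {"pelajaran": "ajar", "makanan": "makan"}
PREFIXES = ("mem", "ber", "pe", "di")
SUFFIXES = ("kan", "an", "i")


def stem(word):
    """Return a root form for `word`, or the word itself if none is found."""
    word = word.lower()

    # Root-word dictionary: a known root is returned untouched, so
    # "makan" is never cut to "mak" by the "-an" suffix rule.
    if word in ROOT_WORDS:
        return word

    # Derivative-word dictionary: listed derived forms map directly to a root.
    if word in DERIVATIVE_WORDS:
        return DERIVATIVE_WORDS[word]

    # Affix rules: build candidates by stripping a prefix, a suffix, or both.
    candidates = []
    for prefix in PREFIXES:
        if word.startswith(prefix):
            candidates.append(word[len(prefix):])
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidates.append(word[:-len(suffix)])
    for prefix in PREFIXES:
        for suffix in SUFFIXES:
            long_enough = len(word) > len(prefix) + len(suffix)
            if long_enough and word.startswith(prefix) and word.endswith(suffix):
                candidates.append(word[len(prefix):-len(suffix)])

    # Accept a candidate only if it is a known root; otherwise leave the word as is.
    for candidate in candidates:
        if candidate in ROOT_WORDS:
            return candidate
    return word


if __name__ == "__main__":
    for w in ["makan", "makanan", "membaca", "dimainkan", "komputer"]:
        print(w, "->", stem(w))
```

In this sketch, a word is stripped only when the result is confirmed by the root-word list, which trades some recall (unlisted roots stay unstemmed) for fewer over-stemming errors. A real stemmer would need far larger dictionaries and rules for Malay spelling changes at affix boundaries, which are omitted here.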

Publication
IEICE TRANSACTIONS on Information Vol.E92-D No.12 pp.2351-2359
Publication Date
2009/12/01
Publicized
Online ISSN
1745-1361
DOI
10.1587/transinf.E92.D.2351
Type of Manuscript
Special Section PAPER (Special Section on Natural Language Processing and its Applications)
Category
Document Analysis

Authors

Keyword