Determining Indexing Strings with Statistical Analysis

Yoshiyuki TAKEDA; Kyoji UMEMURA; Eiko YAMAMOTO

Summary:

Determining indexing strings is an important factor in information retrieval. Ideally, the strings should be words that represent documents or queries. Although any single word may be the first candidate for indexing strings for an English corpus, it may not be ideal due to the existence of compound nouns, which are often good indexing strings, and which often depend on the genre of the corpus used. The situation is even worse in Japanese or Chinese where the words are not separated by spaces. In this paper, we propose a method of determining indexing strings based on statistical analysis. The novel features of our method are to make the most of the statistical measure called "adaptation" and not to use language-dependent resources such as dictionaries and stop word lists. In evaluating our method using a Japanese test collection, we found that it actually improves the precision of information retrieval systems.

Publication: IEICE TRANSACTIONS on Information Vol.E86-D No.9 pp.1781-1787

Publication Date: 2003/09/01

Type of Manuscript: Special Section PAPER (Special Issue on Text Processing for Information Access)

Yoshiyuki TAKEDA, Kyoji UMEMURA, Eiko YAMAMOTO, "Determining Indexing Strings with Statistical Analysis" in IEICE TRANSACTIONS on Information, vol. E86-D, no. 9, pp. 1781-1787, September 2003.
URL: https://global.ieice.org/en_transactions/information/10.1587/e86-d_9_1781/_p

@ARTICLE{e86-d_9_1781,
author={Yoshiyuki TAKEDA and Kyoji UMEMURA and Eiko YAMAMOTO},
journal={IEICE TRANSACTIONS on Information},
title={Determining Indexing Strings with Statistical Analysis},
year={2003},
volume={E86-D},
number={9},
pages={1781--1787},
abstract={Determining indexing strings is an important factor in information retrieval. Ideally, the strings should be words that represent documents or queries. Although any single word may be the first candidate for indexing strings for an English corpus, it may not be ideal due to the existence of compound nouns, which are often good indexing strings, and which often depend on the genre of the corpus used. The situation is even worse in Japanese or Chinese where the words are not separated by spaces. In this paper, we propose a method of determining indexing strings based on statistical analysis. The novel features of our method are to make the most of the statistical measure called "adaptation" and not to use language-dependent resources such as dictionaries and stop word lists. In evaluating our method using a Japanese test collection, we found that it actually improves the precision of information retrieval systems.},
month={September}
}

TY - JOUR
TI - Determining Indexing Strings with Statistical Analysis
T2 - IEICE TRANSACTIONS on Information
SP - 1781
EP - 1787
AU - Yoshiyuki TAKEDA
AU - Kyoji UMEMURA
AU - Eiko YAMAMOTO
PY - 2003
JO - IEICE TRANSACTIONS on Information
VL - E86-D
IS - 9
JA - IEICE TRANSACTIONS on Information
Y1 - September 2003
AB - Determining indexing strings is an important factor in information retrieval. Ideally, the strings should be words that represent documents or queries. Although any single word may be the first candidate for indexing strings for an English corpus, it may not be ideal due to the existence of compound nouns, which are often good indexing strings, and which often depend on the genre of the corpus used. The situation is even worse in Japanese or Chinese where the words are not separated by spaces. In this paper, we propose a method of determining indexing strings based on statistical analysis. The novel features of our method are to make the most of the statistical measure called "adaptation" and not to use language-dependent resources such as dictionaries and stop word lists. In evaluating our method using a Japanese test collection, we found that it actually improves the precision of information retrieval systems.
ER -