Out-of-vocabulary (OOV) words create serious problems for automatic speech recognition (ASR) systems. Not only are they misrecognized as in-vocabulary (IV) words with similar phonetics, but these errors also spread to nearby words. The language models (LMs) of most open-vocabulary ASR systems treat all OOV words as a single entity, ignoring their linguistic information. In this paper we present a class-based n-gram LM that deals with OOV words by treating each of them individually, without retraining all the LM parameters. Each OOV word is assigned to a class of semantically similar IV words. The World Wide Web is used to acquire additional data for estimating the relation between OOV and IV words. An evaluation based on adjusted perplexity and word error rate was carried out on the Wall Street Journal corpus. The results suggest that using multiple classes for OOV words is preferable to a single unknown class.
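As a rough illustration of the class-based approach the abstract describes, the short Python sketch below factors a bigram probability through a word class, P(w | h) ≈ P(class(w) | h) · P(w | class(w)), and attaches a new OOV word to a class of semantically similar IV words without touching the rest of the model. All class assignments, probability values, and the example word "zyntex" are invented for demonstration; this is a minimal sketch of the general technique, not the authors' implementation or their Web-based similarity estimation.

# Minimal sketch of a class-based bigram LM that attaches an OOV word to a
# class of semantically similar in-vocabulary (IV) words. All numbers,
# classes, and the word "zyntex" are invented for illustration only.

# P(class | previous word), as estimated from training data.
p_class_given_prev = {
    ("stocks", "COMPANY"): 0.12,
    ("stocks", "NUMBER"): 0.30,
}

# P(word | class), the within-class emission probabilities.
p_word_given_class = {
    ("ibm", "COMPANY"): 0.05,
    ("google", "COMPANY"): 0.04,
}

# Hard class assignment for IV words.
word_class = {"ibm": "COMPANY", "google": "COMPANY"}

def assign_oov(oov_word, iv_class, mass_in_class=0.01):
    """Attach an OOV word to an existing IV class without retraining the LM.

    The within-class probability is an assumed constant here; in the paper it
    would instead be derived from OOV-to-IV similarity estimated on Web data.
    """
    word_class[oov_word] = iv_class
    p_word_given_class[(oov_word, iv_class)] = mass_in_class

def bigram_prob(word, prev_word):
    """P(word | prev_word) factored through the word's class."""
    c = word_class[word]
    return (p_class_given_prev.get((prev_word, c), 1e-6)
            * p_word_given_class.get((word, c), 1e-6))

# A company name unseen in training is treated as a COMPANY-class word
# instead of being collapsed into a single generic <unk> token.
assign_oov("zyntex", "COMPANY")
print(bigram_prob("zyntex", "stocks"))  # 0.12 * 0.01 = 0.0012

The key property mirrored here is that adding "zyntex" changes only the class assignment and one emission entry; the class-conditional n-gram statistics, and hence the bulk of the LM parameters, stay untouched.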
Welly NAPTALI, Masatoshi TSUCHIYA, Seiichi NAKAGAWA, "Class-Based N-Gram Language Model for New Words Using Out-of-Vocabulary to In-Vocabulary Similarity" in IEICE TRANSACTIONS on Information, vol. E95-D, no. 9, pp. 2308-2317, September 2012, doi: 10.1587/transinf.E95.D.2308.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E95.D.2308/_p
@ARTICLE{e95-d_9_2308,
author={Welly NAPTALI and Masatoshi TSUCHIYA and Seiichi NAKAGAWA},
journal={IEICE TRANSACTIONS on Information},
title={Class-Based N-Gram Language Model for New Words Using Out-of-Vocabulary to In-Vocabulary Similarity},
year={2012},
volume={E95-D},
number={9},
pages={2308-2317},
doi={10.1587/transinf.E95.D.2308},
ISSN={1745-1361},
month={September}
}
TY - JOUR
TI - Class-Based N-Gram Language Model for New Words Using Out-of-Vocabulary to In-Vocabulary Similarity
T2 - IEICE TRANSACTIONS on Information
SP - 2308
EP - 2317
AU - Welly NAPTALI
AU - Masatoshi TSUCHIYA
AU - Seiichi NAKAGAWA
PY - 2012
DO - 10.1587/transinf.E95.D.2308
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E95-D
IS - 9
JA - IEICE TRANSACTIONS on Information
Y1 - September 2012
ER -