Incremental Language Modeling for Automatic Transcription of Broadcast News

Katsutoshi OHTSUKI; Long NGUYEN

doi:10.1093/ietisy/e90-d.2.526

Summary:

In this paper, we address the task of incremental language modeling for automatic transcription of broadcast news speech. Daily broadcast news naturally contains new words that are not in the lexicon of the speech recognition system but are important for downstream applications such as information retrieval or machine translation. To recognize those new words, the lexicon and the language model of the speech recognition system need to be updated periodically. We propose a method of estimating a list of words to be added to the lexicon based on some time-series text data. The experimental results on the RT04 Broadcast News data and other TV audio data showed that this method provided an impressive and stable reduction in both out-of-vocabulary rates and speech recognition word error rates.
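The out-of-vocabulary (OOV) rate mentioned in the summary, and the general idea of choosing lexicon additions from time-ordered text, can be illustrated with a minimal sketch. This is not the authors' method: the recency-weighted count, the `decay` parameter, the function names, and the toy data are all assumptions made for illustration only.

```python
from collections import Counter


def oov_rate(tokens, lexicon):
    """Fraction of running words in `tokens` that are missing from `lexicon`."""
    if not tokens:
        return 0.0
    misses = sum(1 for t in tokens if t not in lexicon)
    return misses / len(tokens)


def select_new_words(daily_texts, lexicon, top_n, decay=0.5):
    """Rank OOV words by a recency-weighted count (hypothetical scoring).

    `daily_texts` is a list of token lists ordered oldest first; the most
    recent day gets weight 1, the day before gets `decay`, and so on.
    """
    scores = Counter()
    last = len(daily_texts) - 1
    for day, tokens in enumerate(daily_texts):
        weight = decay ** (last - day)
        for t in tokens:
            if t not in lexicon:
                scores[t] += weight
    return [word for word, _ in scores.most_common(top_n)]


# Toy data, invented for this example.
lexicon = {"the", "news", "today", "is"}
daily_texts = [
    ["the", "news", "today", "is", "tsunami"],
    ["tsunami", "is", "the", "news", "blog"],
]
test_tokens = ["tsunami", "is", "the", "blog", "news"]

before = oov_rate(test_tokens, lexicon)
added = select_new_words(daily_texts, lexicon, top_n=1)
after = oov_rate(test_tokens, lexicon | set(added))
```

In this toy run, adding the single highest-scoring word lowers the OOV rate on the test tokens; a real system would also have to update the language model probabilities for the added words, as the summary notes.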

Publication: IEICE TRANSACTIONS on Information Vol.E90-D No.2 pp.526-532

Publication Date: 2007/02/01

Online ISSN: 1745-1361

DOI: 10.1093/ietisy/e90-d.2.526

Type of Manuscript: PAPER

Category: Speech and Hearing

Katsutoshi OHTSUKI, Long NGUYEN, "Incremental Language Modeling for Automatic Transcription of Broadcast News" in IEICE TRANSACTIONS on Information, vol. E90-D, no. 2, pp. 526-532, February 2007, doi: 10.1093/ietisy/e90-d.2.526.
Abstract: In this paper, we address the task of incremental language modeling for automatic transcription of broadcast news speech. Daily broadcast news naturally contains new words that are not in the lexicon of the speech recognition system but are important for downstream applications such as information retrieval or machine translation. To recognize those new words, the lexicon and the language model of the speech recognition system need to be updated periodically. We propose a method of estimating a list of words to be added to the lexicon based on some time-series text data. The experimental results on the RT04 Broadcast News data and other TV audio data showed that this method provided an impressive and stable reduction in both out-of-vocabulary rates and speech recognition word error rates.
URL: https://global.ieice.org/en_transactions/information/10.1093/ietisy/e90-d.2.526/_p

@ARTICLE{e90-d_2_526,
author={Katsutoshi OHTSUKI and Long NGUYEN},
journal={IEICE TRANSACTIONS on Information},
title={Incremental Language Modeling for Automatic Transcription of Broadcast News},
year={2007},
volume={E90-D},
number={2},
pages={526-532},
abstract={In this paper, we address the task of incremental language modeling for automatic transcription of broadcast news speech. Daily broadcast news naturally contains new words that are not in the lexicon of the speech recognition system but are important for downstream applications such as information retrieval or machine translation. To recognize those new words, the lexicon and the language model of the speech recognition system need to be updated periodically. We propose a method of estimating a list of words to be added to the lexicon based on some time-series text data. The experimental results on the RT04 Broadcast News data and other TV audio data showed that this method provided an impressive and stable reduction in both out-of-vocabulary rates and speech recognition word error rates.},
keywords={},
doi={10.1093/ietisy/e90-d.2.526},
ISSN={1745-1361},
month={February},
}

TY - JOUR
TI - Incremental Language Modeling for Automatic Transcription of Broadcast News
T2 - IEICE TRANSACTIONS on Information
SP - 526
EP - 532
AU - Katsutoshi OHTSUKI
AU - Long NGUYEN
PY - 2007
DO - 10.1093/ietisy/e90-d.2.526
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E90-D
IS - 2
JA - IEICE TRANSACTIONS on Information
Y1 - February 2007
AB - In this paper, we address the task of incremental language modeling for automatic transcription of broadcast news speech. Daily broadcast news naturally contains new words that are not in the lexicon of the speech recognition system but are important for downstream applications such as information retrieval or machine translation. To recognize those new words, the lexicon and the language model of the speech recognition system need to be updated periodically. We propose a method of estimating a list of words to be added to the lexicon based on some time-series text data. The experimental results on the RT04 Broadcast News data and other TV audio data showed that this method provided an impressive and stable reduction in both out-of-vocabulary rates and speech recognition word error rates.
ER -