We attempted to estimate subjective scores of the Japanese Diagnostic Rhyme Test (DRT), a two-to-one forced selection speech intelligibility test. We used automatic speech recognizers with language models that force the output to be one of the words in the word pair, mimicking the human recognition process of the DRT. Initial testing used speaker-independent models, which gave significantly lower scores than the subjective scores. The acoustic models were then adapted to each speaker in the corpus, and then to noise at a specified SNR. Three types of noise were tested: white noise, multi-talker (babble) noise, and pseudo-speech noise. When the adapted noise level matched the tested level, the match between subjective and estimated scores improved significantly with noise-adapted models compared to both the speaker-independent and the speaker-adapted models. However, when the SNR conditions did not match, the recognition scores degraded, especially when the tested SNR was higher than the adapted noise level. Accordingly, we adapted the models to mixed levels of noise, i.e., multi-condition training. The resulting models gave estimated intelligibility that matched subjective intelligibility relatively well over all noise levels. The correlation between subjective and estimated intelligibility scores increased to 0.94 with multi-talker noise, 0.93 with white noise, and 0.89 with pseudo-speech noise, while the root mean square error (RMSE) decreased from more than 40 to 13.10, 13.05, and 16.06, respectively.
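The evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the use of per-word log-likelihoods for the forced choice, and the chance-adjusted scoring formula (the conventional DRT correction for two-choice guessing) are all assumptions made for illustration, and no values from the paper are reproduced.

```python
import math


def forced_choice(log_likelihoods):
    """Pick one word of a DRT word pair (hypothetical helper).

    `log_likelihoods` maps each of the two candidate words to a recognizer
    score; the recognizer is forced to answer with one of them.
    """
    if len(log_likelihoods) != 2:
        raise ValueError("a DRT item has exactly two candidate words")
    return max(log_likelihoods, key=log_likelihoods.get)


def drt_score(correct, total):
    """Chance-adjusted score in percent: 100 * (right - wrong) / total.

    Assumption: the conventional DRT correction for guessing is used.
    """
    wrong = total - correct
    return 100.0 * (correct - wrong) / total


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def rmse(x, y):
    """Root mean square error between two equal-length score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


if __name__ == "__main__":
    # Made-up scores for illustration only, one value per SNR condition.
    subjective = [95.0, 80.0, 55.0, 30.0]
    estimated = [90.0, 70.0, 50.0, 20.0]
    print(pearson_r(subjective, estimated), rmse(subjective, estimated))
```

A correlation near 1 together with a small RMSE would indicate that the recognizer-based estimate tracks the listeners' scores both in trend and in absolute level, which is why the abstract reports both figures.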
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Yusuke TAKANO, Kazuhiro KONDO, "Estimation of Speech Intelligibility Using Speech Recognition Systems" in IEICE TRANSACTIONS on Information,
vol. E93-D, no. 12, pp. 3368-3376, December 2010, doi: 10.1587/transinf.E93.D.3368.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E93.D.3368/_p
@ARTICLE{e93-d_12_3368,
author={Yusuke TAKANO and Kazuhiro KONDO},
journal={IEICE TRANSACTIONS on Information},
title={Estimation of Speech Intelligibility Using Speech Recognition Systems},
year={2010},
volume={E93-D},
number={12},
pages={3368-3376},
abstract={We attempted to estimate subjective scores of the Japanese Diagnostic Rhyme Test (DRT), a two-to-one forced selection speech intelligibility test. We used automatic speech recognizers with language models that force one of the words in the word-pair, mimicking the human recognition process of the DRT. Initial testing was done using speaker-independent models, and they showed significantly lower scores than subjective scores. The acoustic models were then adapted to each of the speakers in the corpus, and then adapted to noise at a specified SNR. Three different types of noise were tested: white noise, multi-talker (babble) noise, and pseudo-speech noise. The match between subjective and estimated scores improved significantly with noise-adapted models compared to speaker-independent models and the speaker-adapted models, when the adapted noise level and the tested level match. However, when SNR conditions do not match, the recognition scores degraded especially when tested SNR conditions were higher than the adapted noise level. Accordingly, we adapted the models to mixed levels of noise, i.e., multi-condition training. The adapted models now showed relatively high intelligibility matching subjective intelligibility performance over all levels of noise. The correlation between subjective and estimated intelligibility scores increased to 0.94 with multi-talker noise, 0.93 with white noise, and 0.89 with pseudo-speech noise, while the root mean square error (RMSE) reduced from more than 40 to 13.10, 13.05 and 16.06, respectively.},
keywords={},
doi={10.1587/transinf.E93.D.3368},
ISSN={1745-1361},
month={December},
}
TY - JOUR
TI - Estimation of Speech Intelligibility Using Speech Recognition Systems
T2 - IEICE TRANSACTIONS on Information
SP - 3368
EP - 3376
AU - Yusuke TAKANO
AU - Kazuhiro KONDO
PY - 2010
DO - 10.1587/transinf.E93.D.3368
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E93-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2010
AB - We attempted to estimate subjective scores of the Japanese Diagnostic Rhyme Test (DRT), a two-to-one forced selection speech intelligibility test. We used automatic speech recognizers with language models that force one of the words in the word-pair, mimicking the human recognition process of the DRT. Initial testing was done using speaker-independent models, and they showed significantly lower scores than subjective scores. The acoustic models were then adapted to each of the speakers in the corpus, and then adapted to noise at a specified SNR. Three different types of noise were tested: white noise, multi-talker (babble) noise, and pseudo-speech noise. The match between subjective and estimated scores improved significantly with noise-adapted models compared to speaker-independent models and the speaker-adapted models, when the adapted noise level and the tested level match. However, when SNR conditions do not match, the recognition scores degraded especially when tested SNR conditions were higher than the adapted noise level. Accordingly, we adapted the models to mixed levels of noise, i.e., multi-condition training. The adapted models now showed relatively high intelligibility matching subjective intelligibility performance over all levels of noise. The correlation between subjective and estimated intelligibility scores increased to 0.94 with multi-talker noise, 0.93 with white noise, and 0.89 with pseudo-speech noise, while the root mean square error (RMSE) reduced from more than 40 to 13.10, 13.05 and 16.06, respectively.
ER -