Connectionist Approaches to Large Vocabulary Continuous Speech Recognition

Hidefumi SAWAI; Yasuhiro MINAMI; Masanori MIYATAKE; Alex WAIBEL; Kiyohiro SHIKANO

IEICE TRANSACTIONS on Fundamentals

Summary:

This paper describes recent progress in a connectionist large-vocabulary continuous speech recognition system that integrates speech recognition and language processing. The speech recognition part consists of Large Phonemic Time-Delay Neural Networks (TDNNs), which automatically spot all 24 Japanese phonemes (18 consonants /b/, /d/, /g/, /p/, /t/, /k/, /m/, /n/, /N/, /s/, /sh/ ([ʃ]), /h/, /z/, /ch/ ([tʃ]), /ts/, /r/, /w/, /y/ ([j]); 5 vowels /a/, /i/, /u/, /e/, /o/; and the double consonant /Q/ or silence) simply by scanning the input speech, without any specific segmentation technique. The language processing part is a predictive LR parser, guided by an LR parsing table generated automatically from context-free grammar rules, which proceeds left to right without backtracking. Time alignment between the predicted phonemes and the sequence of TDNN phoneme outputs is carried out by DTW matching. We call this hybrid integrated recognition system the 'TDNN-LR' method. We report that large-vocabulary isolated word and continuous speech recognition using the TDNN-LR method gave excellent speaker-dependent recognition performance, and that incremental training with a small number of training tokens was very effective for adapting to speaking rate. We also report new achievements that extend the TDNN-LR method: (1) two proposed NN architectures provide phoneme recognition that is robust to variations in speaking manner, (2) speaker adaptation can be realized using a NN mapping function between the input speaker and a standard speaker, and (3) new architectures proposed for speaker-independent recognition perform nearly as well as speaker-dependent recognition.
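The time-alignment step mentioned in the summary can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the function name `dtw_align`, the score layout (one row of phoneme activations per frame), the use of summed activations as the path score, and the "stay or advance by one phoneme" step pattern are all assumptions made for illustration. It only shows how a phoneme sequence predicted by a parser could be aligned against frame-wise phoneme outputs.

```python
def dtw_align(scores, predicted):
    """Align a predicted phoneme sequence to frame-wise phoneme scores.

    scores:    list of T rows, each a list of per-phoneme activations
               (higher is better); a hypothetical stand-in for TDNN outputs.
    predicted: list of N phoneme indices (N <= T), a hypothetical stand-in
               for the phonemes predicted by the parser.
    Returns (total_score, path), where path[t] is the position in
    `predicted` assigned to frame t.
    """
    T, N = len(scores), len(predicted)
    if N == 0 or N > T:
        raise ValueError("need 1 <= len(predicted) <= number of frames")

    neg_inf = float("-inf")
    acc = [[neg_inf] * N for _ in range(T)]   # best accumulated score
    back = [[0] * N for _ in range(T)]        # predecessor position

    acc[0][0] = scores[0][predicted[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = acc[t - 1][n]                          # same phoneme
            advance = acc[t - 1][n - 1] if n > 0 else neg_inf  # next phoneme
            if advance > stay:
                best, prev = advance, n - 1
            else:
                best, prev = stay, n
            acc[t][n] = best + scores[t][predicted[n]]
            back[t][n] = prev

    # Trace back from the last frame, which must end on the last phoneme.
    n = N - 1
    path = [n]
    for t in range(T - 1, 0, -1):
        n = back[t][n]
        path.append(n)
    path.reverse()
    return acc[T - 1][N - 1], path


# Toy example: 5 frames, 3 phoneme classes, predicted sequence [0, 2].
frames = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.05, 0.90],
    [0.10, 0.00, 0.90],
]
total, path = dtw_align(frames, [0, 2])
print(path)  # frame-to-phoneme-position assignment
```

In a system of the kind the summary describes, a score like `total` would presumably be computed for each phoneme sequence the parser proposes, so that competing hypotheses can be ranked; how the actual system scores and prunes hypotheses is not stated in the summary.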

Publication: IEICE TRANSACTIONS on Fundamentals Vol.E74-A No.7 pp.1834-1844

Publication Date: 1991/07/25

Type of Manuscript: Special Section PAPER (Special Issue on Continuous Speech Recognition and Understanding)

Category: Continuous Speech Recognition

Cite this

Hidefumi SAWAI, Yasuhiro MINAMI, Masanori MIYATAKE, Alex WAIBEL, Kiyohiro SHIKANO, "Connectionist Approaches to Large Vocabulary Continuous Speech Recognition" in IEICE TRANSACTIONS on Fundamentals, vol. E74-A, no. 7, pp. 1834-1844, July 1991.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/e74-a_7_1834/_p

@ARTICLE{e74-a_7_1834,
author={Hidefumi SAWAI and Yasuhiro MINAMI and Masanori MIYATAKE and Alex WAIBEL and Kiyohiro SHIKANO},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Connectionist Approaches to Large Vocabulary Continuous Speech Recognition},
year={1991},
volume={E74-A},
number={7},
pages={1834-1844},
keywords={},
doi={},
ISSN={},
month={July},}

TY - JOUR
TI - Connectionist Approaches to Large Vocabulary Continuous Speech Recognition
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 1834
EP - 1844
AU - Hidefumi SAWAI
AU - Yasuhiro MINAMI
AU - Masanori MIYATAKE
AU - Alex WAIBEL
AU - Kiyohiro SHIKANO
PY - 1991
DO -
JO - IEICE TRANSACTIONS on Fundamentals
SN -
VL - E74-A
IS - 7
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - July 1991
ER -
