In a practical continuous speech recognition system, input speech includes many extraneous words, and detecting the beginning point of the target word is very difficult. Under these circumstances, word spotting is useful for extracting and recognizing the target speech from the input. Phoneme-based HMMs, in turn, are useful for large-vocabulary word recognition. When training speech is limited, training a phoneme-based HMM is easier and more stable than training a word-based HMM, because the training speech contains several times more phoneme tokens than word tokens. For these reasons, we use word spotting with phoneme-based HMMs and, for more precise modeling, adopt context-dependent phoneme modeling. This paper proposes a new clustering method for context-dependent phoneme HMMs. The method uses triphone context when training samples are sufficient, and automatically selects biphone and uniphone contexts when only a few training samples are available. Using this clustering method, context-dependent models were created and tested in phoneme recognition and word spotting experiments. The context-dependent models achieved 90.0% phoneme recognition accuracy, 7.6% higher than the context-independent models, and 69.2% word spotting accuracy, 7.0% higher than the context-independent models.
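The back-off idea in the abstract (triphone when data is plentiful, otherwise biphone, otherwise uniphone) can be illustrated with a simple count-based sketch. This is not the paper's clustering method, whose details are not given here; it is a plain frequency threshold swapped in for illustration. The function name `select_context_units`, the `min_samples` threshold, the `l-p+r` naming scheme, and the choice to try the left biphone before the right biphone are all assumptions.

```python
from collections import Counter


def select_context_units(tokens, min_samples=3):
    """Pick a model unit for each observed (left, phone, right) triple.

    Hypothetical helper: backs off from triphone to biphone to uniphone
    using raw sample counts. The threshold and back-off order are
    illustrative assumptions, not taken from the paper.
    """
    tri = Counter(tokens)
    left_bi = Counter((l, p) for l, p, _ in tokens)
    right_bi = Counter((p, r) for _, p, r in tokens)

    units = {}
    for l, p, r in tri:
        if tri[(l, p, r)] >= min_samples:
            units[(l, p, r)] = f"{l}-{p}+{r}"   # enough data: triphone
        elif left_bi[(l, p)] >= min_samples:
            units[(l, p, r)] = f"{l}-{p}"       # back off: left biphone
        elif right_bi[(p, r)] >= min_samples:
            units[(l, p, r)] = f"{p}+{r}"       # back off: right biphone
        else:
            units[(l, p, r)] = p                # last resort: uniphone
    return units


if __name__ == "__main__":
    # Toy training tokens (made up for the example).
    tokens = [("k", "a", "t")] * 3 + [("k", "a", "s"), ("m", "a", "s"), ("b", "i", "d")]
    for triple, unit in select_context_units(tokens).items():
        print(triple, "->", unit)
```

With the toy data, the well-observed triple keeps its full triphone context, while rarer triples fall back to a shared biphone or uniphone unit, which is the behavior the abstract describes at a high level.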
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Tatsuo MATSUOKA, "Word Spotting Using Context-Dependent Phoneme-Based HMMs" in IEICE TRANSACTIONS on Fundamentals,
vol. E74-A, no. 7, pp. 1768-1772, July 1991.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/e74-a_7_1768/_p
@ARTICLE{e74-a_7_1768,
author={Tatsuo MATSUOKA},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Word Spotting Using Context-Dependent Phoneme-Based HMMs},
year={1991},
volume={E74-A},
number={7},
pages={1768-1772},
month={July},
}
TY - JOUR
TI - Word Spotting Using Context-Dependent Phoneme-Based HMMs
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 1768
EP - 1772
AU - Tatsuo MATSUOKA
PY - 1991
JO - IEICE TRANSACTIONS on Fundamentals
VL - E74-A
IS - 7
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - July 1991
ER -