
Author Search Result

[Author] Tatsuo MATSUOKA (3 hits)

  • Word Spotting Using Context-Dependent Phoneme-Based HMMs

    Tatsuo MATSUOKA  

    PAPER-Phoneme Recognition and Word Spotting

    Vol: E74-A No:7  Page(s): 1768-1772

    In a practical continuous speech recognition system, the input speech includes many extraneous words, and detecting the beginning point of the target word is very difficult. Under these circumstances, word spotting is useful for extracting and recognizing the target speech from such input. On the other hand, phoneme-based HMMs are useful for large-vocabulary word recognition: training a phoneme-based HMM is easier and more stable than training a word-based HMM when training speech is limited, because the training speech contains several times more phoneme tokens than word tokens. For these reasons, we use word spotting with phoneme-based HMMs, and for more precise modeling we chose context-dependent phoneme models. This paper proposes a new clustering method for context-dependent phoneme HMMs. The method uses triphone contexts when training samples are sufficient, and automatically backs off to biphone or uniphone contexts when only a few training samples are available. Context-dependent models created with this clustering method were tested in phoneme recognition and word spotting experiments. They achieved 90.0% phoneme recognition accuracy, 7.6% higher than the context-independent models, and 69.2% word spotting accuracy, 7.0% higher than the context-independent models.
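
    A minimal sketch of the triphone-to-biphone-to-uniphone backoff described above, assuming a simple count threshold (MIN_SAMPLES) and "*" as a free-context marker; the threshold value and the backoff order are illustrative assumptions, not the paper's published clustering procedure:

        from collections import Counter

        MIN_SAMPLES = 50  # assumed threshold; the abstract does not state a value

        def count_contexts(transcripts):
            """Count triphone, biphone, and uniphone tokens in phoneme transcripts."""
            counts = Counter()
            for phones in transcripts:
                padded = ["#"] + phones + ["#"]      # "#" marks utterance boundaries
                for i in range(1, len(padded) - 1):
                    l, p, r = padded[i - 1], padded[i], padded[i + 1]
                    counts[(l, p, r)] += 1           # triphone
                    counts[(l, p, "*")] += 1         # left-context biphone
                    counts[("*", p, r)] += 1         # right-context biphone
                    counts[("*", p, "*")] += 1       # context-independent uniphone
            return counts

        def select_unit(l, p, r, counts):
            """Pick the widest context whose training-sample count is sufficient."""
            for unit in [(l, p, r), (l, p, "*"), ("*", p, r), ("*", p, "*")]:
                if counts[unit] >= MIN_SAMPLES:
                    return unit
            return ("*", p, "*")                     # uniphone always has the most tokens

    Each selected unit would then receive its own HMM, trained on all samples pooled under that context.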

  • Robustness of Phoneme-Based HMMs against Speaking-Style Variations

    Tatsuo MATSUOKA  Kiyohiro SHIKANO  

    PAPER-Phoneme Recognition and Word Spotting

    Vol: E74-A No:7  Page(s): 1761-1767

    In a practical continuous speech recognition system, the target speech is often spoken in a different speaking style (e.g., speed or loudness) from the training speech. Such speaking-style variations are difficult to cope with because the amount of training speech is limited; acoustic modeling should therefore be robust against different styles of speech in order to obtain high recognition performance from limited training data. This paper examines the robustness of six types of phoneme-based HMMs against speaking-style variations: VQ- and FVQ-based discrete HMMs, and single-Gaussian and mixture-Gaussian HMMs with either diagonal or full covariance matrices. They were investigated using isolated-word, phrase-by-phrase, and fluently spoken utterances, with different utterance types for training and testing. The experimental results show that the mixture-Gaussian HMM with diagonal covariance matrices is the most promising choice. The FVQ-based HMM and the single-Gaussian HMM with full covariance matrices also achieved good results. The mixture-Gaussian HMM with full covariance matrices sometimes achieved very high accuracies, but often suffered from "overtuning" or a lack of training data. Finally, this paper proposes a new model-adaptation technique that combines multiple models with appropriate weighting factors. Each model has different characteristics (e.g., coverage of speaking styles and sensitivity to data), and the weighting factors can be estimated using "deleted interpolation". When the mixture-Gaussian diagonal-covariance models were used as baseline models, this technique achieved better recognition accuracy than a model trained on all three utterance types at once. The advantage of this technique is that estimating the weighting factors is stable even with a limited amount of training speech, because there are few free parameters to estimate.
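
    A minimal sketch of weight estimation by deleted interpolation, under the assumption that each style-specific model assigns a likelihood to every held-out sample; the EM re-estimation below is the standard recipe, not code from the paper:

        def estimate_weights(model_likelihoods, n_iter=20):
            """model_likelihoods[i][t]: likelihood of held-out sample t under model i."""
            n_models = len(model_likelihoods)
            n_samples = len(model_likelihoods[0])
            w = [1.0 / n_models] * n_models          # start from uniform weights
            for _ in range(n_iter):
                counts = [0.0] * n_models
                for t in range(n_samples):
                    mix = sum(w[i] * model_likelihoods[i][t] for i in range(n_models))
                    mix = mix or 1e-300              # guard against all-zero likelihoods
                    for i in range(n_models):
                        # E-step: posterior probability that model i produced sample t
                        counts[i] += w[i] * model_likelihoods[i][t] / mix
                w = [c / n_samples for c in counts]  # M-step: renormalize expected counts
            return w

    With three style-specific models (isolated-word, phrase-by-phrase, fluent), the combined likelihood is the weighted sum of the three model likelihoods, so only three weights need to be estimated, which keeps the estimate stable on limited data.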

  • Topic Extraction based on Continuous Speech Recognition in Broadcast News Speech

    Katsutoshi OHTSUKI  Tatsuo MATSUOKA  Shoichi MATSUNAGA  Sadaoki FURUI  

    PAPER-Speech and Hearing

    Vol: E85-D No:7  Page(s): 1138-1144

    In this paper, we propose topic extraction models based on statistical relevance scores between topic words and words in articles, and report results of topic extraction experiments using continuous speech recognition of Japanese broadcast news utterances. We represent the topic of a news item as a combination of multiple topic words, which are important words in the news article or words relevant to the news. We statistically model the mapping from words in an article to topic words; using this mapping, the topic extraction model can extract topic words even if they do not appear in the article. We train a topic extraction model that computes the degree of relevance between a topic word and a word in an article using newspaper text covering a five-year period, calculating the degree of relevance with measures such as mutual information or the χ² statistic. In experiments extracting five topic words with a χ²-based model, we achieve 72% precision and 12% recall on speech recognition results. Speech recognition results generally include a number of recognition errors, which degrade topic extraction performance. To mitigate this, we employ N-best candidates and the likelihoods given by the acoustic and language models. In experiments, extracting five topic words using N-best candidates and likelihood values achieves significantly improved precision.
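
    A minimal sketch of χ²-based relevance scoring between article words and candidate topic words, assuming document-frequency and co-occurrence counts collected from the newspaper corpus; the additive scoring over article words is an illustrative assumption, not the paper's exact model:

        from collections import Counter

        def chi_square(n_wt, n_w, n_t, n_docs):
            """2x2 chi-square between article word w and topic word t over n_docs articles."""
            a = n_wt                        # articles containing both w and t
            b = n_w - n_wt                  # articles containing w only
            c = n_t - n_wt                  # articles containing t only
            d = n_docs - n_w - n_t + n_wt   # articles containing neither
            den = (a + b) * (c + d) * (a + c) * (b + d)
            return n_docs * (a * d - b * c) ** 2 / den if den else 0.0

        def extract_topics(article_words, topic_words, doc_freq, co_freq, n_docs, k=5):
            """Score every candidate topic word against the article; keep the top k."""
            scores = Counter()
            for t in topic_words:
                for w in article_words:
                    scores[t] += chi_square(co_freq.get((w, t), 0),
                                            doc_freq.get(w, 0),
                                            doc_freq.get(t, 0),
                                            n_docs)
            return [t for t, _ in scores.most_common(k)]

    Because the score relates topic words to article words statistically, a topic word can be extracted even when it never appears in the recognized article text.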