
Author Search Result

[Author] Konstantin MARKOV (7 hits)

Results 1-7 of 7
  • A Hybrid HMM/BN Acoustic Model Utilizing Pentaphone-Context Dependency

    Sakriani SAKTI  Konstantin MARKOV  Satoshi NAKAMURA  

     
    PAPER-Speech Recognition
    Vol: E89-D No:3  Page(s): 954-961

    The most widely used acoustic unit in current automatic speech recognition systems is the triphone, which includes the immediately preceding and following phonetic contexts. Although triphones have proved to be an efficient choice, it is believed that they are insufficient for capturing all of the coarticulation effects. A wider phonetic context seems more appropriate, but it often suffers from the data sparsity problem and memory constraints. Therefore, efficient modeling of wider contexts needs to be addressed to achieve a realistic automatic speech recognition application. This paper presents a new method of modeling pentaphone-context units using the hybrid HMM/BN acoustic modeling framework. Rather than modeling pentaphones explicitly, this approach incorporates the probabilistic dependencies between the triphone context unit and the second preceding/following contexts into the triphone state output distributions by means of the BN. The advantages of this approach are that the modeled phonetic context can be extended within the triphone framework, and a standard decoding system can be used by treating the second preceding/following context variables as hidden during recognition. To handle the increased number of parameters, tying based on knowledge-based phoneme classes and a data-driven clustering method is applied. The evaluation experiments indicate that the proposed model outperforms the standard HMM-based triphone model, achieving a 9-10% relative word error rate (WER) reduction.
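As a rough illustration of this idea (all numbers and distributions below are invented for illustration, not taken from the paper), when the second preceding/following context variable is hidden at decoding time, the BN-based state output density reduces to a mixture of context-conditioned Gaussians:

```python
import math

def gaussian(x, mean, var):
    """1-D Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical BN parameters for one triphone state: the discrete
# second-context variable c is hidden during recognition, so the state
# output density is the marginal over c.
context_prior = {"c1": 0.6, "c2": 0.4}                # P(c | state), assumed
context_gauss = {"c1": (0.0, 1.0), "c2": (2.0, 0.5)}  # (mean, var) per context

def state_likelihood(x):
    """p(x | state) = sum_c P(c | state) * N(x; mean_c, var_c)."""
    return sum(context_prior[c] * gaussian(x, *context_gauss[c])
               for c in context_prior)
```

Because this marginal is just a Gaussian mixture, the wider context is captured without changing the decoder, which matches the abstract's point that a standard decoding system suffices.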

  • ATR Parallel Decoding Based Speech Recognition System Robust to Noise and Speaking Styles

    Shigeki MATSUDA  Takatoshi JITSUHIRO  Konstantin MARKOV  Satoshi NAKAMURA  

     
    PAPER-Speech Recognition
    Vol: E89-D No:3  Page(s): 989-997

    In this paper, we describe a parallel decoding-based ASR system developed at ATR that is robust to noise type, SNR, and speaking style. It is difficult to recognize speech affected by various factors, especially when an ASR system contains only a single acoustic model. One solution is to employ multiple acoustic models, one for each condition. Even though the robustness of each individual acoustic model is limited, the whole ASR system can then handle various conditions appropriately. Our system consists of two recognition sub-systems that use different features: MFCC and Differential MFCC (DMFCC). Each sub-system has several acoustic models depending on SNR, speaker gender, and speaking style, and during recognition each acoustic model is adapted by fast noise adaptation. From each sub-system, one hypothesis is selected based on posterior probability, and the final recognition result is obtained by combining the best hypotheses from the two sub-systems. On the AURORA-2J task, widely used for evaluating noise robustness, our system achieved higher recognition performance than a system containing only a single model. Our system was also tested on normal and hyper-articulated speech contaminated by several background noises, and it exhibited high robustness to both noise and speaking styles.
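The selection/combination step can be sketched as follows (the hypotheses and posterior values are made up for illustration; the paper's word-level combination is richer than this simple argmax):

```python
def best_hypothesis(hyps):
    """Pick the hypothesis with the highest posterior probability."""
    return max(hyps, key=lambda h: h["posterior"])

# One scored N-best list per sub-system (illustrative values only).
mfcc_hyps = [{"text": "call home", "posterior": 0.7},
             {"text": "all home", "posterior": 0.3}]
dmfcc_hyps = [{"text": "call home", "posterior": 0.6},
              {"text": "call hope", "posterior": 0.4}]

# Step 1: select one hypothesis per sub-system by posterior probability.
finalists = [best_hypothesis(mfcc_hyps), best_hypothesis(dmfcc_hyps)]
# Step 2: combine across sub-systems; here the more probable finalist wins.
result = best_hypothesis(finalists)
```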

  • A Study on Acoustic Modeling of Pauses for Recognizing Noisy Conversational Speech

    Jin-Song ZHANG  Konstantin MARKOV  Tomoko MATSUI  Satoshi NAKAMURA  

     
    PAPER-Robust Speech Recognition and Enhancement
    Vol: E86-D No:3  Page(s): 489-496

    This paper presents a study on modeling inter-word pauses to improve the robustness of acoustic models for recognizing noisy conversational speech. When precise contextual modeling is used for pauses, their frequent appearance and varying acoustics in noisy conversational speech make it difficult to automatically generate an accurate phonetic transcription of the training data for developing robust acoustic models. We propose exploiting the reliable phonetic heuristics of pauses in speech to aid the detection of varying pauses. Based on this, a stepwise approach to optimizing pause HMMs was applied to the data of the DARPA SPINE2 project, and a more accurate phonetic transcription was achieved. The cross-word triphone HMMs developed using this method achieved an absolute 9.2% word error reduction compared with the conventional method using only context-free modeling of pauses. For the same pause modeling method, the use of the optimized phonetic segmentation brought about an absolute 5.2% improvement.
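One common way to realize inter-word pause modeling in the training transcription (sketched below with a conventional short-pause token name, which is an assumption, not taken from the paper) is to interleave an optional pause marker between adjacent words:

```python
def insert_optional_pauses(words, pause="sp"):
    """Interleave an optional short-pause marker between adjacent words."""
    out = []
    for i, w in enumerate(words):
        out.append(w)
        if i < len(words) - 1:   # no trailing pause after the last word
            out.append(pause)
    return out
```

Training can then keep or delete each marker depending on whether a pause was actually detected; the paper's contribution is guiding that detection with phonetic heuristics and stepwise optimization of the pause HMMs.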

  • Noise and Channel Distortion Robust ASR System for DARPA SPINE2 Task

    Konstantin MARKOV  Tomoko MATSUI  Rainer GRUHN  Jinsong ZHANG  Satoshi NAKAMURA  

     
    PAPER-Robust Speech Recognition and Enhancement
    Vol: E86-D No:3  Page(s): 497-504

    This paper presents the ATR speech recognition system designed for the DARPA SPINE2 evaluation task. The system is capable of dealing with speech from highly variable, real-world noisy conditions and communication channels. A number of robust techniques are implemented, such as differential spectrum mel-scale cepstrum features, on-line MLLR adaptation, and word-level hypothesis combination, which led to a significant reduction in the word error rate.

  • Improving Acoustic Model Precision by Incorporating a Wide Phonetic Context Based on a Bayesian Framework

    Sakriani SAKTI  Satoshi NAKAMURA  Konstantin MARKOV  

     
    PAPER-Speech Recognition
    Vol: E89-D No:3  Page(s): 946-953

    Over the last decade, the Bayesian approach has grown in popularity in many application areas. It uses a probabilistic framework that encodes our beliefs or actions in situations of uncertainty. Information from several models can also be combined within the Bayesian framework to achieve better inference and to better account for modeling uncertainty. The approach we adopt here utilizes the benefits of the Bayesian framework to improve acoustic model precision in speech recognition systems by modeling a wider-than-triphone context, approximating it using several less context-dependent models. This composition was developed to avoid the crucial problem of limited training data and to reduce model complexity. To enhance model reliability in the face of unseen contexts and limited training data, flooring and smoothing techniques are applied. Experimental results show that the proposed Bayesian pentaphone model improves word accuracy in comparison with the standard triphone model.
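The two reliability fixes mentioned in the abstract can be sketched as follows (the floor constant, interpolation weight, and likelihood values are assumptions for illustration; the paper's exact composition formula is not reproduced here):

```python
FLOOR = 1e-10  # assumed floor value for near-zero likelihoods

def floored(p):
    """Flooring: replace a vanishing likelihood from an unseen context."""
    return max(p, FLOOR)

def smoothed(p_specific, p_broad, lam=0.8):
    """Smoothing: interpolate a sparse, context-specific estimate with a
    more robust, broader-context one."""
    return lam * p_specific + (1.0 - lam) * p_broad

# An unseen wide context yields a zero likelihood; flooring and smoothing
# with a broader-context (e.g. triphone) estimate keep it usable.
p_component = floored(0.0)
p_composed = smoothed(p_component, 0.02)
```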

  • Using Hybrid HMM/BN Acoustic Models: Design and Implementation Issues

    Konstantin MARKOV  Satoshi NAKAMURA  

     
    PAPER-Speech Recognition
    Vol: E89-D No:3  Page(s): 981-988

    In recent years, the number of studies investigating new directions in speech modeling that go beyond the conventional HMM has increased considerably. One promising approach is to use Bayesian Networks (BN) as speech models. Full recognition systems based on Dynamic BN, as well as acoustic models using BN, have been proposed lately. Our group at ATR has been developing a hybrid HMM/BN model, which is an HMM whose state probability distribution is modeled by a BN instead of the commonly used mixtures of Gaussian functions. In this paper, we describe how to use hybrid HMM/BN acoustic models, with particular emphasis on design and implementation issues. The most essential part of HMM/BN model building is the choice of the state BN topology. Since it is chosen manually, several factors should be considered in this process, including, but not limited to, the type of data, the task, and the available additional information. When context-dependent models are used, the state-level structure can be obtained by traditional methods. HMM/BN parameter learning is based on the Viterbi training paradigm and consists of two alternating steps: BN training and HMM transition updates. For recognition, in some cases, BN inference is computationally equivalent to a mixture of Gaussians, which allows the HMM/BN model to be used in existing decoders without any modification. We present two examples of HMM/BN model applications in speech recognition systems. Evaluations under various conditions and for different tasks showed that the HMM/BN model gives consistently better performance than the conventional HMM.

  • A Hybrid HMM/BN Acoustic Model for Automatic Speech Recognition

    Konstantin MARKOV  Satoshi NAKAMURA  

     
    PAPER-Speech and Speaker Recognition
    Vol: E86-D No:3  Page(s): 438-445

    In current HMM based speech recognition systems, it is difficult to supplement acoustic spectrum features with additional information such as pitch, gender, articulator positions, etc. On the other hand, Bayesian Networks (BN) allow for easy combination of different continuous as well as discrete features by exploring conditional dependencies between them. However, the lack of efficient algorithms has limited their application in continuous speech recognition. In this paper we propose new acoustic model, where HMM are used for modeling of temporal speech characteristics and state probability model is represented by BN. In our experimental system based on HMM/BN model, in addition to speech observation variable, state BN has two more (hidden) variables representing noise type and SNR value. Evaluation results on AURORA2 database showed 36.4% word error rate reduction for closed noise test which is comparable with other much more complex systems utilizing effective adaptation and noise robust methods.