
Keyword Search Result

[Keyword] audio-visual speech recognition (2 hits)

Showing results 1-2 of 2
  • Investigation of DNN-Based Audio-Visual Speech Recognition

    Satoshi TAMURA, Hiroshi NINOMIYA, Norihide KITAOKA, Shin OSUGA, Yurie IRIBE, Kazuya TAKEDA, Satoru HAYAMIZU

     
    PAPER-Acoustic modeling

    Publicized: 2016/07/19
    Vol: E99-D No:10
    Page(s): 2444-2451

    Audio-Visual Speech Recognition (AVSR) is a technique for enhancing the robustness of speech recognizers in noisy or real-world environments. Meanwhile, Deep Neural Networks (DNNs) have recently attracted considerable attention from researchers in the speech recognition field, because they can drastically improve recognition performance. There are two ways to employ DNN techniques for speech recognition: a hybrid approach and a tandem approach. In the hybrid approach, the emission probability of each Hidden Markov Model (HMM) state is computed using a DNN, while in the tandem approach a DNN is incorporated into the feature extraction scheme. In this paper, we investigate and compare several DNN-based AVSR methods, mainly to clarify how the audio and visual modalities should be combined using DNNs. We carried out recognition experiments on the CENSREC-1-AV corpus and discuss the results to identify the best DNN-based AVSR model. It turns out that a tandem method combining audio Deep Bottle-Neck Features (DBNFs) and visual DBNFs with multi-stream HMMs is the most suitable, followed by the hybrid approach and another tandem scheme using joint audio-visual DBNFs.
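
    The tandem strategy the abstract favors can be made concrete with a short sketch. The following is a minimal illustration only, not the paper's implementation: the function names (extract_dbnf, multistream_log_emission), the bottleneck position, and the stream-weight values are hypothetical placeholders written for this listing.

        import numpy as np

        def relu(x):
            # Rectified-linear activation for the illustrative DNN below.
            return np.maximum(0.0, x)

        def extract_dbnf(frames, layers, bottleneck_index):
            # Tandem approach (sketch): forward acoustic or visual frames
            # (shape (T, D)) through a trained DNN and return the activations
            # of the narrow bottleneck layer as Deep Bottle-Neck Features
            # (DBNFs) for a conventional HMM back-end.
            h = frames
            for i, (W, b) in enumerate(layers):
                h = relu(h @ W + b)
                if i == bottleneck_index:
                    return h
            raise ValueError("bottleneck_index is out of range")

        def multistream_log_emission(log_b_audio, log_b_visual,
                                     w_audio=0.7, w_visual=0.3):
            # Multi-stream HMM emission score: a log-linear combination of
            # the per-stream log-likelihoods,
            #   log b(o) = w_a * log b_a(o_a) + w_v * log b_v(o_v).
            # The weights here are placeholders, not values from the paper.
            return w_audio * log_b_audio + w_visual * log_b_visual

    In the hybrid approach, by contrast, the DNN itself produces the per-state emission scores, so it replaces the HMM's Gaussian-mixture emission model rather than feeding it DNN-derived features.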

  • Audio-Visual Speech Recognition Based on Optimized Product HMMs and GMM Based-MCE-GPD Stream Weight Estimation

    Kenichi KUMATANI, Satoshi NAKAMURA

     
    PAPER-Speech and Speaker Recognition

    Vol: E86-D No:3
    Page(s): 454-463

    In this paper, we describe an adaptive integration method for an audio-visual speech recognition system that uses not only the speaker's audio speech signal but also visual speech signals such as lip images. Human beings communicate with each other by integrating multiple types of sensory information, such as hearing and vision, and the same kind of integration can be applied to automatic speech recognition. In integrating audio and visual speech features for speech recognition, there are two important issues: (1) a model that represents the synchronous and asynchronous characteristics of the audio and visual features, and that makes the best use of a whole database including uni-modal (audio-only or visual-only) data as well as audio-visual data; and (2) the adaptive estimation of reliability weights for the audio and visual information. This paper investigates these two issues and proposes a novel method to effectively integrate audio and visual information in an audio-visual Automatic Speech Recognition (ASR) system. First, as the model that integrates audio-visual speech information, we apply a product of hidden Markov models (product HMM), i.e., the product of an audio HMM and a visual HMM. We propose a new method that re-estimates the product HMM on synchronous audio-visual speech data so as to capture the synchronicity of the audio-visual information, whereas the original product HMM assumes independence between the audio and visual features. Second, for optimal estimation of the audio-visual reliability weights, we propose a Gaussian mixture model (GMM) based MCE-GPD (minimum classification error and generalized probabilistic descent) algorithm, which reduces both the amount of adaptation data and the computation required for the GMM estimation. Evaluation experiments show that the proposed audio-visual speech recognition system improves recognition accuracy over conventional ones even when the audio signals are clean.
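
    As a rough illustration of the two proposals above, the sketch below builds the composite state space of a product HMM and performs one simplified descent step on a scalar audio stream weight. This is a hedged sketch under assumptions: the Kronecker-product construction reflects only the independence-based initialization (the paper re-estimates the model on synchronous data afterwards), the scalar-weight update is a simplified stand-in for the paper's GMM-based MCE-GPD estimation, and all function and parameter names are hypothetical.

        import numpy as np

        def sigmoid(x):
            # Smooth 0-1 loss used by MCE training.
            return 1.0 / (1.0 + np.exp(-x))

        def product_hmm_transitions(A_audio, A_visual):
            # Product HMM (sketch): composite states are pairs
            # (audio state, visual state). Under the independence assumption
            # of the original product HMM,
            #   P((i,j) -> (k,l)) = P_a(i -> k) * P_v(j -> l),
            # i.e. the Kronecker product of the marginal transition matrices.
            return np.kron(A_audio, A_visual)

        def gpd_stream_weight_step(w, ll_a_correct, ll_v_correct,
                                   ll_a_rival, ll_v_rival,
                                   eps=0.01, gamma=1.0):
            # One generalized probabilistic descent (GPD) step on the audio
            # stream weight w (the visual weight is 1 - w). The
            # misclassification measure d is the rival-class score minus the
            # correct-class score; the loss is sigmoid(gamma * d), and w
            # moves against its gradient.
            g_correct = w * ll_a_correct + (1.0 - w) * ll_v_correct
            g_rival = w * ll_a_rival + (1.0 - w) * ll_v_rival
            s = sigmoid(gamma * (g_rival - g_correct))
            dloss_dd = gamma * s * (1.0 - s)
            dd_dw = (ll_a_rival - ll_v_rival) - (ll_a_correct - ll_v_correct)
            return float(np.clip(w - eps * dloss_dd * dd_dw, 0.0, 1.0))

    The clipping step simply keeps the stream weight inside [0, 1]; the paper's actual algorithm estimates the weights from GMMs rather than from a single utterance's scores as done here.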