Audio-Visual Speech Recognition (AVSR) is a technique for enhancing the robustness of speech recognizers in noisy or real-world environments. Meanwhile, Deep Neural Networks (DNNs) have recently attracted considerable attention from researchers in the speech recognition field, because DNNs can drastically improve recognition performance. There are two ways to employ DNN techniques for speech recognition: a hybrid approach and a tandem approach. In the hybrid approach, the emission probability of each Hidden Markov Model (HMM) state is computed using a DNN, while in the tandem approach a DNN is incorporated into the feature extraction scheme. In this paper, we investigate and compare several DNN-based AVSR methods, mainly to clarify how audio and visual modalities should be integrated using DNNs. We carried out recognition experiments using the CENSREC-1-AV corpus and discuss the results to identify the best DNN-based AVSR modeling. The results show that a tandem-based method combining audio and visual Deep Bottle-Neck Features (DBNFs) with multi-stream HMMs is the most suitable, followed by the hybrid approach and another tandem scheme using audio-visual DBNFs.
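The two ideas central to the abstract, extracting bottleneck features from a DNN (tandem approach) and combining audio and visual streams with weighted log-likelihoods (multi-stream HMMs), can be sketched as follows. This is a minimal illustrative sketch in numpy, not the authors' actual architecture: the layer sizes, the stream weight of 0.7, and all variable names are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Tandem approach: Deep Bottle-Neck Feature (DBNF) extraction ---
# A DNN trained for frame classification contains a narrow "bottleneck"
# hidden layer; its activations serve as features for a conventional HMM.
# Layer sizes (39 -> 512 -> 30 -> 512 -> 256) are illustrative only.
layer_sizes = [39, 512, 30, 512, 256]
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]
BOTTLENECK_INDEX = 1  # the layer of width 30

def extract_dbnf(frames):
    """Forward-propagate frames; return bottleneck-layer activations."""
    h = frames
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = np.tanh(h @ w + b)
        if i == BOTTLENECK_INDEX:
            return h  # shape: (num_frames, 30)
    return h

audio_frames = rng.standard_normal((100, 39))  # e.g. MFCCs with deltas
dbnf = extract_dbnf(audio_frames)
print(dbnf.shape)  # (100, 30)

# --- Multi-stream HMMs: stream-weighted log-likelihood combination ---
# Per-state log-likelihoods from the audio and visual streams are combined
# with weights summing to one; the visual weight is typically raised as
# acoustic noise increases.
def combine_streams(loglik_audio, loglik_visual, lambda_audio=0.7):
    return lambda_audio * loglik_audio + (1.0 - lambda_audio) * loglik_visual

la = np.array([-10.0, -12.5])   # audio log-likelihoods for two HMM states
lv = np.array([-11.0, -9.0])    # visual log-likelihoods for the same states
print(combine_streams(la, lv))  # [-10.3  -11.45]
```

The stream weights act as a tunable trust parameter between modalities, which is what allows a multi-stream system to degrade gracefully when one stream (usually audio) becomes unreliable.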
Satoshi TAMURA
Gifu University
Hiroshi NINOMIYA
Nagoya University
Norihide KITAOKA
Tokushima University
Shin OSUGA
Aisin Seiki Co., Ltd.
Yurie IRIBE
Aichi Prefectural University
Kazuya TAKEDA
Nagoya University
Satoru HAYAMIZU
Gifu University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Satoshi TAMURA, Hiroshi NINOMIYA, Norihide KITAOKA, Shin OSUGA, Yurie IRIBE, Kazuya TAKEDA, Satoru HAYAMIZU, "Investigation of DNN-Based Audio-Visual Speech Recognition" in IEICE TRANSACTIONS on Information,
vol. E99-D, no. 10, pp. 2444-2451, October 2016, doi: 10.1587/transinf.2016SLP0019.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016SLP0019/_p
@ARTICLE{e99-d_10_2444,
author={Satoshi TAMURA and Hiroshi NINOMIYA and Norihide KITAOKA and Shin OSUGA and Yurie IRIBE and Kazuya TAKEDA and Satoru HAYAMIZU},
journal={IEICE TRANSACTIONS on Information},
title={Investigation of DNN-Based Audio-Visual Speech Recognition},
year={2016},
volume={E99-D},
number={10},
pages={2444-2451},
keywords={},
doi={10.1587/transinf.2016SLP0019},
ISSN={1745-1361},
month={October},}
TY - JOUR
TI - Investigation of DNN-Based Audio-Visual Speech Recognition
T2 - IEICE TRANSACTIONS on Information
SP - 2444
EP - 2451
AU - Satoshi TAMURA
AU - Hiroshi NINOMIYA
AU - Norihide KITAOKA
AU - Shin OSUGA
AU - Yurie IRIBE
AU - Kazuya TAKEDA
AU - Satoru HAYAMIZU
PY - 2016
DO - 10.1587/transinf.2016SLP0019
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E99-D
IS - 10
JA - IEICE TRANSACTIONS on Information
Y1 - October 2016
ER -