An Isolated Word Speech Recognition Based on Fusion of Visual and Auditory Information Using 30-frame/s and 24-bit Color Image

Akio OGIHARA; Shinobu ASAO

Summary:

In the field of speech recognition, many researchers have proposed methods that use either auditory information, such as the acoustic signal, or visual information, such as the shape and motion of the lips. Auditory information provides effective features for speech recognition, but recognition is difficult in noisy environments. Visual information, on the other hand, is advantageous for recognition in noisy environments, but it is difficult to extract effective features from it. Thus, accurate speech recognition is difficult when either auditory or visual information is used alone. In this paper, we propose a method that fuses auditory and visual information to achieve more accurate speech recognition. The proposed method consists of two processes: (1) a probability is calculated for the auditory information and for the visual information by a hidden Markov model (HMM), and (2) the two probabilities are fused by linear combination. We performed isolated-word speech recognition experiments in which the auditory information (22.05 kHz sampling, 8-bit quantization) and the visual information (30 frames/s sampling, 24-bit quantization) were captured with a multimedia personal computer, and confirmed the validity of the proposed method.
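
The two-step fusion described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not state the weighting scheme or whether the combination is done on probabilities or log-probabilities, so the single weight `weight`, the use of per-word log-likelihood scores, and the names `fuse_scores` and `recognize` are all assumptions made here for illustration. The per-word scores are taken as already computed by separate auditory and visual HMMs.

```python
from typing import Dict


def fuse_scores(audio: Dict[str, float],
                visual: Dict[str, float],
                weight: float) -> Dict[str, float]:
    """Linearly combine per-word scores from two modalities.

    `audio` and `visual` map each candidate word to a score (assumed
    here to be an HMM log-likelihood). `weight` is a hypothetical
    mixing parameter in [0, 1]: 1.0 uses audio only, 0.0 visual only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if audio.keys() != visual.keys():
        raise ValueError("both modalities must score the same words")
    return {
        word: weight * audio[word] + (1.0 - weight) * visual[word]
        for word in audio
    }


def recognize(audio: Dict[str, float],
              visual: Dict[str, float],
              weight: float) -> str:
    """Return the candidate word with the highest fused score."""
    fused = fuse_scores(audio, visual, weight)
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Made-up scores for two candidate words, for illustration only.
    audio_scores = {"one": -120.0, "two": -118.0}
    visual_scores = {"one": -40.0, "two": -55.0}
    for w in (1.0, 0.5, 0.0):
        print(w, recognize(audio_scores, visual_scores, w))
```

With these made-up scores, the audio-only decision (`weight = 1.0`) and the fused decision (`weight = 0.5`) differ, which is the situation in which fusion can change the recognition result. How the weight should be chosen is not covered by the summary.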

Publication: IEICE TRANSACTIONS on Fundamentals Vol.E80-A No.8 pp.1417-1422

Publication Date: 1997/08/25

Type of Manuscript: Special Section PAPER (Special Section on Digital Signal Processing)

Akio OGIHARA, Shinobu ASAO, "An Isolated Word Speech Recognition Based on Fusion of Visual and Auditory Information Using 30-frame/s and 24-bit Color Image," in IEICE TRANSACTIONS on Fundamentals, vol. E80-A, no. 8, pp. 1417-1422, August 1997.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/e80-a_8_1417/_p

@ARTICLE{e80-a_8_1417,
author={Akio OGIHARA and Shinobu ASAO},
journal={IEICE TRANSACTIONS on Fundamentals},
title={An Isolated Word Speech Recognition Based on Fusion of Visual and Auditory Information Using 30-frame/s and 24-bit Color Image},
year={1997},
volume={E80-A},
number={8},
pages={1417-1422},
keywords={},
doi={},
ISSN={},
month={August},}

TY - JOUR
TI - An Isolated Word Speech Recognition Based on Fusion of Visual and Auditory Information Using 30-frame/s and 24-bit Color Image
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 1417
EP - 1422
AU - Akio OGIHARA
AU - Shinobu ASAO
PY - 1997
DO -
JO - IEICE TRANSACTIONS on Fundamentals
SN -
VL - E80-A
IS - 8
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - August 1997
ER -