This paper describes a method for hands-free speech recognition, in particular for the simultaneous recognition of multiple sound sources. The method extends the 3-D Viterbi search to a 3-D N-best search, which enables the recognition of multiple sound sources. The baseline system integrates two existing technologies, the 3-D Viterbi search and the conventional N-best search, into a complete system. A first evaluation of the 3-D N-best search-based system showed that new ideas were necessary to develop a system for the simultaneous recognition of multiple sound sources. It identified two factors that play important roles in system performance: the different likelihood ranges of the sound sources and the direction-based separation of the hypotheses. To address these problems, we incorporated likelihood normalization and a path distance-based clustering technique into the baseline 3-D N-best search-based system. We evaluated the system through experiments on simulated data for the case of two talkers. The experiments showed significant improvements when the two techniques were applied. The best results were obtained by applying both techniques and using a 32-channel microphone array: the Word Accuracy for the two talkers was higher than 80%, and the Simultaneous Word Accuracy (where both sources are correctly recognized simultaneously) was higher than 70%. These are very promising results.
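The Simultaneous Word Accuracy in the abstract counts a trial as correct only when both talkers are recognized correctly at the same time. The following is a minimal Python sketch of that idea, assuming one word per talker per trial. The function name `two_talker_accuracies` and the toy data are hypothetical, and the paper's actual scoring procedure is not given in this abstract.

```python
from typing import Sequence, Tuple


def two_talker_accuracies(
    refs_a: Sequence[str],
    hyps_a: Sequence[str],
    refs_b: Sequence[str],
    hyps_b: Sequence[str],
) -> Tuple[float, float, float]:
    """Return (accuracy of talker A, accuracy of talker B, simultaneous accuracy).

    Assumption (not from the paper): each trial has exactly one reference
    word and one recognized word per talker, so accuracy is the fraction
    of exact matches.
    """
    n = len(refs_a)
    if n == 0 or not (len(hyps_a) == len(refs_b) == len(hyps_b) == n):
        raise ValueError("all four sequences must be non-empty and equally long")

    # Per-trial correctness flags for each talker.
    correct_a = [ref == hyp for ref, hyp in zip(refs_a, hyps_a)]
    correct_b = [ref == hyp for ref, hyp in zip(refs_b, hyps_b)]

    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    # A trial counts as simultaneously correct only if both talkers are correct.
    simultaneous = sum(a and b for a, b in zip(correct_a, correct_b)) / n
    return acc_a, acc_b, simultaneous


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    refs_a = ["one", "two", "three", "four"]
    hyps_a = ["one", "two", "tree", "four"]
    refs_b = ["red", "green", "blue", "black"]
    hyps_b = ["red", "grey", "blue", "black"]
    print(two_talker_accuracies(refs_a, hyps_a, refs_b, hyps_b))
```

Under this definition the simultaneous figure can never exceed the lower of the two per-talker accuracies, which is consistent with the reported values (above 70% versus above 80%).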
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Panikos HERACLEOUS, Satoshi NAKAMURA, Takeshi YAMADA, Kiyohiro SHIKANO, "A Microphone Array-Based 3-D N-Best Search Method for Recognizing Multiple Sound Sources" in IEICE TRANSACTIONS on Information,
vol. E85-D, no. 6, pp. 994-1002, June 2002.
URL: https://global.ieice.org/en_transactions/information/10.1587/e85-d_6_994/_p
@ARTICLE{e85-d_6_994,
author={Panikos HERACLEOUS and Satoshi NAKAMURA and Takeshi YAMADA and Kiyohiro SHIKANO},
journal={IEICE TRANSACTIONS on Information},
title={A Microphone Array-Based 3-D N-Best Search Method for Recognizing Multiple Sound Sources},
year={2002},
volume={E85-D},
number={6},
pages={994-1002},
month={June},}
TY - JOUR
TI - A Microphone Array-Based 3-D N-Best Search Method for Recognizing Multiple Sound Sources
T2 - IEICE TRANSACTIONS on Information
SP - 994
EP - 1002
AU - Panikos HERACLEOUS
AU - Satoshi NAKAMURA
AU - Takeshi YAMADA
AU - Kiyohiro SHIKANO
PY - 2002
JO - IEICE TRANSACTIONS on Information
VL - E85-D
IS - 6
JA - IEICE TRANSACTIONS on Information
Y1 - June 2002
ER -