1-2hit |
Weifeng LI Chiyomi MIYAJIMA Takanori NISHINO Katsunobu ITOU Kazuya TAKEDA Fumitada ITAKURA
In this paper, we address issues in improving hands-free speech recognition performance in different car environments using multiple spatially distributed microphones. In the previous work, we proposed the multiple linear regression of the log spectra (MRLS) for estimating the log spectra of speech at a close-talking microphone. In this paper, the concept is extended to nonlinear regressions. Regressions in the cepstrum domain are also investigated. An effective algorithm is developed to adapt the regression weights automatically to different noise environments. Compared to the nearest distant microphone and adaptive beamformer (Generalized Sidelobe Canceller), the proposed adaptive nonlinear regression approach shows an advantage in the average relative word error rate (WER) reductions of 58.5% and 10.3%, respectively, for isolated word recognition under 15 real car environments.
Weifeng LI Tetsuya SHINDE Hiroshi FUJIMURA Chiyomi MIYAJIMA Takanori NISHINO Katunobu ITOU Kazuya TAKEDA Fumitada ITAKURA
This paper describes a new multi-channel method of noisy speech recognition, which estimates the log spectrum of speech at a close-talking microphone based on the multiple regression of the log spectra (MRLS) of noisy signals captured by distributed microphones. The advantages of the proposed method are as follows: 1) The method does not require a sensitive geometric layout, calibration of the sensors nor additional pre-processing for tracking the speech source; 2) System works in very small computation amounts; and 3) Regression weights can be statistically optimized over the given training data. Once the optimal regression weights are obtained by regression learning, they can be utilized to generate the estimated log spectrum in the recognition phase, where the speech of close-talking is no longer required. The performance of the proposed method is illustrated by speech recognition of real in-car dialogue data. In comparison to the nearest distant microphone and multi-microphone adaptive beamformer, the proposed approach obtains relative word error rate (WER) reductions of 9.8% and 3.6%, respectively.