1-9hit |
Po-Yi SHIH Po-Chuan LIN Jhing-Fa WANG
This paper describes a novel harmonic-based robust voice activity detection (H-RVAD) method with harmonic spectral local peak (HSLP) feature. HSLP is extracted by spectral amplitude analysis between the adjacent formants, and such characteristic can be used to identify and verify audio stream containing meaningful human speech accurately in low SNR environment. And, an enhanced low SNR noisy speech recognition system framework with wakeup module, speech recognition module and confirmation module is proposed. Users can determine or reject the system feedback while a recognition result was given in the framework, to prevent any chance that the voiced noise misleads the recognition result. The H-RVAD method is evaluated by the AURORA2 corpus in eight types of noise and three SNR levels and increased overall average performance from 4% to 20%. In home noise, the performance of H-RVAD method can be performed from 4% to 14% sentence recognition rate in average.
Tetsuji OGAWA Tetsunori KOBAYASHI
The accuracy of simulation-based assessments of speech recognition systems under noisy conditions is investigated with a focus on the influence of the Lombard effect on the speech recognition performances. This investigation was carried out under various recognition conditions of different sound pressure levels of ambient noise, for different recognition tasks, such as continuous speech recognition and spoken word recognition, and using different recognition systems, i.e., systems with and without adaptation of the acoustic models to ambient noise. Experimental results showed that accurate simulation was not always achieved when dry sources with neutral talking style were used, but it could be achieved if the dry sources that include the influence of the Lombard effect were used; the simulation in the latter case is accurate, irrespective of the recognition conditions.
Norihide KITAOKA Souta HAMAGUCHI Seiichi NAKAGAWA
To achieve high recognition performance for a wide variety of noise and for a wide range of signal-to-noise ratio, this paper presents methods for integration of four noise reduction algorithms: spectral subtraction with smoothing of time direction, temporal domain SVD-based speech enhancement, GMM-based speech estimation and KLT-based comb-filtering. In this paper, we proposed two types of combination methods of noise suppression algorithms: selection of front-end processor and combination of results from multiple recognition processes. Recognition results on the CENSREC-1 task showed the effectiveness of our proposed methods.
Md. Babul ISLAM Kazumasa YAMAMOTO Hiroshi MATSUMOTO
This paper proposes a Mel-Wiener filter to enhance Mel-LPC spectra in the presence of additive noise. The transfer function of the proposed filter is defined by using a first-order all-pass filter instead of unit delay. The filter coefficients are estimated based on minimization of the sum of the square error on the linear frequency scale without applying the bilinear transformation and efficiently implemented in the autocorrelation domain. The proposed filter does not require any time-frequency conversion, which saves a large amount of computational load. The performance of the proposed system is comparable to that of ETSI AFE. The optimum filter order is found to be 3, and thus filtering is computationally inexpensive. The computational cost of the proposed system except VAD is 53% of ETSI AFE.
Masakiyo FUJIMOTO Kazuya TAKEDA Satoshi NAKAMURA
This paper introduces a common database, an evaluation framework, and its baseline recognition results for in-car speech recognition, CENSREC-3, as an outcome of the IPSJ-SIG SLP Noisy Speech Recognition Evaluation Working Group. CENSREC-3, which is a sequel to AURORA-2J, has been designed as the evaluation framework of isolated word recognition in real car-driving environments. Speech data were collected using two microphones, a close-talking microphone and a hands-free microphone, under 16 carefully controlled driving conditions, i.e., combinations of three car speeds and six car conditions. CENSREC-3 provides six evaluation environments designed using speech data collected in these conditions.
Masakiyo FUJIMOTO Satoshi NAKAMURA
This paper addresses a speech recognition problem in non-stationary noise environments: the estimation of noise sequences. To solve this problem, we present a particle filter-based sequential noise estimation method for front-end processing of speech recognition in noise. In the proposed method, a noise sequence is estimated in three stages: a sequential importance sampling step, a residual resampling step, and finally a Markov chain Monte Carlo step with Metropolis-Hastings sampling. The estimated noise sequence is used in the MMSE-based clean speech estimation. We also introduce Polyak averaging and feedback into a state transition process for particle filtering. In the evaluation results, we observed that the proposed method improves speech recognition accuracy in the results of non-stationary noise environments a noise compensation method with stationary noise assumptions.
Satoshi NAKAMURA Kazuya TAKEDA Kazumasa YAMAMOTO Takeshi YAMADA Shingo KUROIWA Norihide KITAOKA Takanobu NISHIURA Akira SASOU Mitsunori MIZUMACHI Chiyomi MIYAJIMA Masakiyo FUJIMOTO Toshiki ENDO
This paper introduces an evaluation framework for Japanese noisy speech recognition named AURORA-2J. Speech recognition systems must still be improved to be robust to noisy environments, but this improvement requires development of the standard evaluation corpus and assessment technologies. Recently, the Aurora 2, 3 and 4 corpora and their evaluation scenarios have had significant impact on noisy speech recognition research. The AURORA-2J is a Japanese connected digits corpus and its evaluation scripts are designed in the same way as Aurora 2 with the help of European Telecommunications Standards Institute (ETSI) AURORA group. This paper describes the data collection, baseline scripts, and its baseline performance. We also propose a new performance analysis method that considers differences in recognition performance among speakers. This method is based on the word accuracy per speaker, revealing the degree of the individual difference of the recognition performance. We also propose categorization of modifications, applied to the original HTK baseline system, which helps in comparing the systems and in recognizing technologies that improve the performance best within the same category.
A data-driven approach that compensates the HMM parameters for the noisy speech recognition is proposed. Instead of assuming some statistical approximations as in the conventional methods such as the PMC, the various statistical information necessary for the HMM parameter adaptation is directly estimated by using the Baum-Welch algorithm. The proposed method has shown improved results compared with the PMC for the noisy speech recognition.
Kazuyuki TAKAGI Rei OGURO Kazuhiko OZEKI
Experiments were conducted to examine an approach from language modeling side to improving noisy speech recognition performance. By adopting appropriate word strings as new units of processing, speech recognition performance was improved by acoustic effects as well as by test-set perplexity reduction. Three kinds of word string language models were evaluated, whose additional lexical entries were selected based on combinations of part of speech information, word length, occurrence frequency, and log likelihood ratio of the hypotheses about the bigram frequency. All of the three word string models reduced errors in broadcast news speech recognition, and also lowered test-set perplexity. The word string model based on log likelihood ratio exhibited the best improvement for noisy speech recognition, by which deletion errors were reduced by 26%, substitution errors by 9.3%, and insertion errors by 13%, in the experiments using the speaker-dependent, noise-adapted triphone. Effectiveness of word string models on error reduction was more prominent for noisy speech than for studio-clean speech.