Yutaka TSUBOI Takehiro IHARA Kazuyuki TAKAGI Kazuhiko OZEKI
A solution to the problem of improving robustness to noise in automatic speech recognition is presented in the framework of multi-band, multi-SNR, and multi-path approaches. In our word recognizer, the whole frequency band is divided into seven overlapped sub-bands, and sub-band noisy phoneme HMMs are then trained on speech data mixed with filtered white Gaussian noise at multiple SNRs. The acoustic model of a word is built as a set of concatenations of clean and noisy sub-band phoneme HMMs arranged in parallel. A Viterbi decoder allows a search path to transit to another SNR condition at a phoneme boundary. The recognition scores of the sub-bands are then recombined to give the score for a word. Experiments show that the overlapped seven-band system yields the best performance under nonstationary ambient noises. It is also shown that the use of filtered white Gaussian noise is advantageous for training noisy phoneme HMMs.
An image quality assessment method measures the difference in quality between a reference image and its distorted version. In this paper, we propose a novel reduced-reference (RR) quality assessment method for JPEG-2000 compressed images, which exploits the statistical characteristics of context information extracted through partial entropy decoding or decoding. These statistical features, obtained in the process of JPEG-2000 encoding, are transmitted to the receiver as side information and used at the decompression side to estimate the quality of images transmitted over various noisy channels. In the framework of JPEG-2000, the context of a current coefficient is determined by the pattern of the significance and/or the sign of its neighbors in three bit-plane coding passes and four coding modes. As the context information represents the local properties of images, it can efficiently describe textured patterns and edge orientation. The quality of transmitted images is measured by the difference in entropy of context information between the received and original images. Moreover, the proposed quality assessment method can directly process images in the JPEG-2000 compressed domain without full decompression. Therefore, our proposed method can accelerate the work of assessing image quality. Through simulations, we demonstrate that our method achieves fairly good performance in terms of both quality measurement accuracy and computational complexity.
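The entropy comparison described in this abstract can be illustrated with a minimal sketch. It assumes the context information is available as a plain list of integer context labels per image; the function names and the sample labels are hypothetical and are not taken from the paper, which does not give its exact feature definition here.

```python
import math
from collections import Counter


def context_entropy(contexts):
    """Shannon entropy in bits of the empirical distribution of context labels."""
    counts = Counter(contexts)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_difference(original_contexts, received_contexts):
    """Absolute entropy gap, used here as an illustrative distortion score."""
    return abs(context_entropy(original_contexts) - context_entropy(received_contexts))


# Hypothetical context labels (assumption, for illustration only).
original = [0, 1, 2, 3, 0, 1, 2, 3]
received = [0, 0, 0, 0, 1, 1, 2, 3]
print(entropy_difference(original, received))
```

Only the entropy of the original image would need to travel as side information in such a scheme, which is what makes a measure of this kind reduced-reference.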
Norihide KITAOKA Souta HAMAGUCHI Seiichi NAKAGAWA
To achieve high recognition performance for a wide variety of noises and a wide range of signal-to-noise ratios, this paper presents methods for integrating four noise reduction algorithms: spectral subtraction with smoothing in the time direction, temporal domain SVD-based speech enhancement, GMM-based speech estimation, and KLT-based comb-filtering. We propose two types of combination methods for noise suppression algorithms: selection of the front-end processor and combination of results from multiple recognition processes. Recognition results on the CENSREC-1 task showed the effectiveness of our proposed methods.
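As a rough illustration of the first of the four algorithms, the following sketch shows textbook power spectral subtraction with a spectral floor, followed by a moving average over neighbouring frames. The over-subtraction factor `alpha`, the floor `beta`, and the window `width` are assumed parameters; the paper's own smoothing scheme is not specified in the abstract and may differ.

```python
def spectral_subtract(noisy_power, noise_power, alpha=1.0, beta=0.01):
    """Subtract a noise power estimate from one frame, bin by bin.

    The floor beta * noisy power keeps every bin positive (assumed values).
    """
    return [max(y - alpha * n, beta * y) for y, n in zip(noisy_power, noise_power)]


def smooth_over_time(frames, width=3):
    """Moving average of each frequency bin over neighbouring frames."""
    half = width // 2
    smoothed = []
    for t in range(len(frames)):
        window = frames[max(0, t - half): t + half + 1]
        smoothed.append([sum(col) / len(window) for col in zip(*window)])
    return smoothed


# Hypothetical two-bin frame and noise estimate.
print(spectral_subtract([4.0, 1.0], [1.0, 2.0]))
print(smooth_over_time([[1.0], [2.0], [3.0]]))
```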
Akinori ITO Takanobu OBA Takashi KONASHI Motoyuki SUZUKI Shozo MAKINO
Speech recognition in a noisy environment is one of the hottest topics in speech recognition research. Noise-tolerant acoustic models or noise reduction techniques are often used to improve recognition accuracy. In this paper, we propose a method to improve the accuracy of a spoken dialog system from a language model point of view. In the proposed method, the dialog system automatically changes its language model and dialog strategy according to the estimated recognition accuracy in a noisy environment in order to keep the performance of the system high. In a noise-free environment, the system accepts any utterance from a user. In a noisy environment, on the other hand, the system restricts its grammar and vocabulary. To realize this strategy, we investigated a method to avoid the user's out-of-grammar utterances through an instruction given by the system to the user. Furthermore, we developed a method to estimate recognition accuracy from features extracted from noise signals. Finally, we implemented the proposed dialog system based on these investigations.
Masakiyo FUJIMOTO Kentaro ISHIZUKA
This paper addresses the problem of voice activity detection (VAD) in noisy environments. The VAD method proposed in this paper is based on a statistical model approach, and estimates statistical models sequentially without a priori knowledge of noise. Namely, the proposed method constructs a clean speech / silence state transition model beforehand, and sequentially adapts the model to the noisy environment by using a switching Kalman filter when a signal is observed. In this paper, we carried out two evaluations. In the first, we observed that the proposed method significantly outperforms conventional methods as regards voice activity detection accuracy in simulated noise environments. Second, we evaluated the proposed method on a VAD evaluation framework, CENSREC-1-C. The evaluation results revealed that the proposed method significantly outperforms the baseline results of CENSREC-1-C as regards VAD accuracy in real environments. In addition, we confirmed that the proposed method helps to improve the accuracy of concatenated speech recognition in real environments.
Md. Babul ISLAM Kazumasa YAMAMOTO Hiroshi MATSUMOTO
This paper proposes a Mel-Wiener filter to enhance Mel-LPC spectra in the presence of additive noise. The transfer function of the proposed filter is defined by using a first-order all-pass filter instead of unit delay. The filter coefficients are estimated based on minimization of the sum of the square error on the linear frequency scale without applying the bilinear transformation and efficiently implemented in the autocorrelation domain. The proposed filter does not require any time-frequency conversion, which saves a large amount of computational load. The performance of the proposed system is comparable to that of ETSI AFE. The optimum filter order is found to be 3, and thus filtering is computationally inexpensive. The computational cost of the proposed system except VAD is 53% of ETSI AFE.
Kiichi NIITSU Noriyuki MIURA Mari INOUE Yoshihiro NAKAGAWA Masamoto TAGO Masayuki MIZUNO Takayasu SAKURAI Tadahiro KURODA
A daisy chain of current-driven transmitters in inductive-coupling complementary metal oxide semiconductor (CMOS) links is presented. Transmitter power can be reduced since current is reused by multiple transmitters. Eight transceivers are arranged with a pitch of 20 µm in 0.18 µm CMOS. Transmitter power is reduced by 35% without sacrificing either the data rate (1 Gb/s/ch) or BER (<10^-12) by using a 4-transmitter daisy chain. A coding technique for efficient use of daisy chain transmitters is also proposed. With the proposed coding technique, additional power reduction can be achieved.
Masakiyo FUJIMOTO Kazuya TAKEDA Satoshi NAKAMURA
This paper introduces a common database, an evaluation framework, and its baseline recognition results for in-car speech recognition, CENSREC-3, as an outcome of the IPSJ-SIG SLP Noisy Speech Recognition Evaluation Working Group. CENSREC-3, which is a sequel to AURORA-2J, has been designed as the evaluation framework of isolated word recognition in real car-driving environments. Speech data were collected using two microphones, a close-talking microphone and a hands-free microphone, under 16 carefully controlled driving conditions, i.e., combinations of three car speeds and six car conditions. CENSREC-3 provides six evaluation environments designed using speech data collected in these conditions.
Masakiyo FUJIMOTO Satoshi NAKAMURA
This paper addresses a speech recognition problem in non-stationary noise environments: the estimation of noise sequences. To solve this problem, we present a particle filter-based sequential noise estimation method for front-end processing of speech recognition in noise. In the proposed method, a noise sequence is estimated in three stages: a sequential importance sampling step, a residual resampling step, and finally a Markov chain Monte Carlo step with Metropolis-Hastings sampling. The estimated noise sequence is used in the MMSE-based clean speech estimation. We also introduce Polyak averaging and feedback into the state transition process for particle filtering. In the evaluation results, we observed that the proposed method improves speech recognition accuracy in non-stationary noise environments compared with a noise compensation method that assumes stationary noise.
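The residual resampling stage named in this abstract can be sketched generically as follows. This is the standard form of residual resampling, not the authors' implementation: each particle is first copied floor(N * w_i) times, and the remaining slots are drawn at random in proportion to the fractional residuals. The function name and the example weights are illustrative assumptions.

```python
import math
import random


def residual_resample(weights, rng=None):
    """Return particle indices chosen by residual resampling."""
    if rng is None:
        rng = random.Random()
    n = len(weights)
    total = sum(weights)
    scaled = [n * w / total for w in weights]      # expected number of copies
    copies = [math.floor(s) for s in scaled]       # deterministic part
    indices = [i for i, c in enumerate(copies) for _ in range(c)]
    remaining = n - len(indices)
    if remaining > 0:
        # Random part: draw the leftover slots in proportion to the residuals.
        residuals = [s - c for s, c in zip(scaled, copies)]
        indices.extend(rng.choices(range(n), weights=residuals, k=remaining))
    return indices


# Hypothetical importance weights for four particles.
print(residual_resample([0.5, 0.25, 0.125, 0.125], random.Random(0)))
```

Compared with plain multinomial resampling, the deterministic copies reduce the variance introduced by the resampling step.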
Randy GOMEZ Akinobu LEE Tomoki TODA Hiroshi SARUWATARI Kiyohiro SHIKANO
This paper describes a method of multi-template unsupervised speaker adaptation based on HMM-Sufficient Statistics that improves adaptation performance while keeping the adaptation time within a few seconds with just one arbitrary utterance. This adaptation scheme is mainly composed of two processes. The first part is done offline and involves the training of multiple class-dependent acoustic models and the creation of speakers' HMM-Sufficient Statistics based on gender and age. The second part is performed online, where adaptation begins using a single utterance of a test speaker. From this utterance, the system classifies the speaker's class and then selects the N-best neighbor speakers close to the utterance using Gaussian Mixture Models (GMM). The template model of the classified speaker's class is then adopted as a base model. From this template model, the adapted model is rapidly constructed using the N-best neighbor speakers' HMM-Sufficient Statistics. Experiments in noisy environment conditions with 20 dB, 15 dB and 10 dB SNR office, crowd, booth, and car noise are performed. The proposed multi-template method achieved an 89.5% word accuracy rate compared with 88.1% for the conventional single-template method, while the baseline recognition rate without adaptation is 86.4%. Moreover, experiments using Vocal Tract Length Normalization (VTLN) and supervised Maximum Likelihood Linear Regression (MLLR) are also compared.
Shoei SATO Kazuo ONOE Akio KOBAYASHI Toru IMAI
This paper proposes a new compensation method for acoustic scores in the Viterbi search for robust speech recognition. This method introduces noise models to represent a wide variety of noises and realizes robust decoding together with conventional techniques of subtraction and adaptation. This method uses likelihoods of noise models in two ways. One is to calculate a confidence factor for each input frame by comparing likelihoods of speech models and noise models. The weight of the acoustic score for a noisy frame is then reduced according to the value of the confidence factor for compensation. The other is to use the likelihood of a noise model as an alternative to that of a silence model when given noisy input. Since a lower confidence factor compresses acoustic scores, the decoder relies more on language scores and keeps more hypotheses within a fixed search depth for a noisy frame. An experiment using commentary transcriptions of a broadcast sports program (MLB: Major League Baseball) showed that the proposed method obtained a 6.7% relative word error reduction. The method also reduced the relative error rate of key words by 17.9%, and this is expected to lead to an improvement in metadata extraction accuracy.
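The frame-level weighting described in this abstract can be sketched as below. The abstract does not give the formula for the confidence factor, so the logistic mapping of the log-likelihood difference, the `slope` parameter, and the `lm_weight` parameter are all assumptions made purely for illustration.

```python
import math


def confidence_factor(speech_loglik, noise_loglik, slope=1.0):
    """Hypothetical confidence in (0, 1): logistic of the log-likelihood gap."""
    diff = slope * (speech_loglik - noise_loglik)
    diff = max(-50.0, min(50.0, diff))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-diff))


def frame_score(acoustic_loglik, language_logprob, confidence, lm_weight=1.0):
    """Combine scores, shrinking the acoustic term for low-confidence frames."""
    return confidence * acoustic_loglik + lm_weight * language_logprob


# A frame where speech and noise models fit equally well gets half weight.
c = confidence_factor(-10.0, -10.0)
print(c, frame_score(-40.0, -3.0, c))
```

Scaling the acoustic log-likelihood toward zero narrows the score gaps between competing hypotheses, which is consistent with the abstract's remark that more hypotheses survive within a fixed search depth on noisy frames.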
Han-Yu CHEN Guo-Wei HUANG Kun-Ming CHEN Chun-Yen CHANG
In this letter, a new computation method for the noise parameters of a linear noisy two-port network is introduced. A new error function, which considers noise figure and source admittance error simultaneously, is proposed to estimate the four noise parameters. The global optimization of the error function is searched directly by using a genetic algorithm.
Satoshi NAKAMURA Kazuya TAKEDA Kazumasa YAMAMOTO Takeshi YAMADA Shingo KUROIWA Norihide KITAOKA Takanobu NISHIURA Akira SASOU Mitsunori MIZUMACHI Chiyomi MIYAJIMA Masakiyo FUJIMOTO Toshiki ENDO
This paper introduces an evaluation framework for Japanese noisy speech recognition named AURORA-2J. Speech recognition systems must still be improved to be robust to noisy environments, but this improvement requires development of the standard evaluation corpus and assessment technologies. Recently, the Aurora 2, 3 and 4 corpora and their evaluation scenarios have had significant impact on noisy speech recognition research. The AURORA-2J is a Japanese connected digits corpus and its evaluation scripts are designed in the same way as Aurora 2 with the help of European Telecommunications Standards Institute (ETSI) AURORA group. This paper describes the data collection, baseline scripts, and its baseline performance. We also propose a new performance analysis method that considers differences in recognition performance among speakers. This method is based on the word accuracy per speaker, revealing the degree of the individual difference of the recognition performance. We also propose categorization of modifications, applied to the original HTK baseline system, which helps in comparing the systems and in recognizing technologies that improve the performance best within the same category.
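The per-speaker analysis mentioned in this abstract can be illustrated with a short sketch. It uses the usual HTK-style word accuracy, (N - S - D - I) / N, computed separately for each speaker; using the population standard deviation as the measure of individual difference is an assumption, and the speaker IDs and error counts are made up for the example.

```python
import statistics


def word_accuracy(n_words, substitutions, deletions, insertions):
    """HTK-style word accuracy in percent."""
    return 100.0 * (n_words - substitutions - deletions - insertions) / n_words


def per_speaker_accuracy(counts):
    """Map each speaker to a word accuracy from (N, S, D, I) tuples."""
    return {spk: word_accuracy(*c) for spk, c in counts.items()}


# Hypothetical error counts per speaker: (N, S, D, I).
counts = {
    "spk01": (100, 3, 1, 1),
    "spk02": (100, 10, 5, 5),
    "spk03": (200, 4, 2, 4),
}
acc = per_speaker_accuracy(counts)
spread = statistics.pstdev(list(acc.values()))  # assumed spread measure
print(acc, spread)
```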
A data-driven approach that compensates the HMM parameters for the noisy speech recognition is proposed. Instead of assuming some statistical approximations as in the conventional methods such as the PMC, the various statistical information necessary for the HMM parameter adaptation is directly estimated by using the Baum-Welch algorithm. The proposed method has shown improved results compared with the PMC for the noisy speech recognition.
The effects of noisy estimates of fading on turbo-coded modulation are studied in the presence of flat Rayleigh fading, and the channel capacity of the system is calculated to determine the limit above which no reliable transmission is guaranteed. This limit is then compared to the signal-to-noise ratio required for a turbo-coded modulation scheme to achieve a bit-error-rate of 10^-5. Numerical results are obtained, especially for QAM signals. Our results show that even slightly noisy estimates significantly degrade the theoretical limits related to channel capacities, and that an effective use of capacity-approaching codes can lower the sensitivity to noisy estimates, though noise that exceeds a certain threshold cannot be offset by the performance improvement associated with error-correcting capability.
Kazuyuki TAKAGI Rei OGURO Kazuhiko OZEKI
Experiments were conducted to examine a language-modeling approach to improving noisy speech recognition performance. By adopting appropriate word strings as new units of processing, speech recognition performance was improved by acoustic effects as well as by test-set perplexity reduction. Three kinds of word string language models were evaluated, whose additional lexical entries were selected based on combinations of part-of-speech information, word length, occurrence frequency, and log likelihood ratio of the hypotheses about the bigram frequency. All three word string models reduced errors in broadcast news speech recognition and also lowered test-set perplexity. The word string model based on log likelihood ratio exhibited the best improvement for noisy speech recognition, reducing deletion errors by 26%, substitution errors by 9.3%, and insertion errors by 13% in the experiments using the speaker-dependent, noise-adapted triphone. The effectiveness of word string models in error reduction was more prominent for noisy speech than for studio-clean speech.
This paper presents closed-form expressions to evaluate the average bit error rate (BER) of coherent binary phase shift keying (BPSK) and quadrature PSK (QPSK) systems in the presence of a Nakagami-m fading channel and a noisy phase reference. Performance degradation due to the noisy phase reference is investigated versus both the fading parameter m and the maximum phase error φ. When m is increased from 1 to 9 and φ = 30, the degradation at the average BER of 10^-3 for BPSK is increased from 0.3 dB to 0.48 dB. For φ increasing from 10 to 40 and m = 5, the degradation is increased from 0.06 dB to 0.92 dB. Degradation thus increases with increasing φ and m.
Shinji TSUZUKI Susumu YOSHIDA Saburo TAZAKI Yoshio YAMADA
In this paper we discuss binary spreading sequences whose spectral distributions are DC free and whose spectral shapes can be easily controlled by a parameter denoted by δ. The newly developed sequences, referred to as modified antisymmetric M-sequences, are modified versions of the conventional antisymmetric (AS) M-sequences. The proposed sequences are designed to increase the variety of spectral shapes and to improve the correlation properties compared to those of the FM coded M-sequences which have already been proposed by Tsuzuki et al. Some typical line coded M-sequences, i.e. the (differential) Manchester coded M-sequences and the FM coded M-sequences, and the conventional AS M-sequences are included in the set of proposed sequences. The improvement of the average BER (bit error rate) performance for asynchronous DS/SSMA (direct sequence/spread spectrum multiple access) systems using the proposed sequences in comparison to the system using the conventional AO/LSE (auto-optimal phase with least sidelobe energy) M-sequences is also shown.
Hanzhong GU Haruhisa TAKAHASHI
In this paper, we apply the method of relating learning to hypothesis testing [6] to study the average generalization performance of concept learning from noisy random training examples. A striking aspect of the method is that a learning problem with a so-called ill-disposed learning algorithm can equivalently be reduced to a simple one, and for this simple problem, even though a direct and exact calculation of the learning curves might still be impossible, a thorough empirical study can easily be performed. One of the main advantages of using the ill-disposed algorithm is that it models lower quality learning in real situations well, and hence the result can provide useful implications as far as reliable generalization is concerned. From a thorough empirical study, we provide empirical formulas for the learning curves as simple functions of the noise rate and the sample size; these formulas smoothly incorporate the results from noise-free analysis and are quite accurate and adequate for practical applications when the noise rate is relatively small. The resulting learning curve bounds are directly related to the number of system weights, are not pessimistic in practice, and apply to learning settings not necessarily within the Bayesian framework.
This paper presents an improved pragmatic approach to coded modulation design which provides higher coding gains, especially for very noisy channels including those with Rayleigh fading. A signal constellation using four equally utilized dimensions, implemented with two correlative carrier frequencies, is adopted to enhance the performance of the pragmatic approach previously proposed by Viterbi et al. The proposed scheme is shown to perform much better by analysis of system performance parameters and extensive computer simulation for practical channel conditions. The bandwidth and power efficiencies are also analyzed and discussed to provide more design flexibility for different communications environments.