Shigeki MATSUDA Takatoshi JITSUHIRO Konstantin MARKOV Satoshi NAKAMURA
In this paper, we describe a parallel decoding-based ASR system developed at ATR that is robust to noise type, SNR and speaking style. It is difficult to recognize speech affected by various factors, especially when an ASR system contains only a single acoustic model. One solution is to employ multiple acoustic models, one for each different condition. Even though the robustness of each acoustic model is limited, the whole ASR system can handle various conditions appropriately. Our system consists of two recognition sub-systems that use different features: MFCC and Differential MFCC (DMFCC). Each sub-system has several acoustic models depending on SNR, speaker gender and speaking style, and during recognition each acoustic model is adapted by fast noise adaptation. From each sub-system, one hypothesis is selected based on posterior probability. The final recognition result is obtained by combining the best hypotheses from the two sub-systems. On the AURORA-2J task, widely used for the evaluation of noise robustness, our system achieved higher recognition performance than a system containing only a single model. Our system was also tested on normal and hyper-articulated speech contaminated by several background noises, and exhibited high robustness to both noise and speaking style.
Takehiro IHARA Takayuki NAGAI Kazuhiko OZEKI Akira KUREMATSU
We present a novel approach to single-channel noise reduction of speech signals contaminated by additive noise. In this approach, the system requires speech samples uttered in advance by the same speaker as that of the input signal. These speech samples must have enough phonetic variety to reconstruct the input signal. The proposed method, which we refer to as referential reconstruction, uses a small database created from examples of speech, which we call reference signals. Referential reconstruction takes an example-based approach: the objective is to find, for each noisy input frame, the candidate frame most similar to the underlying clean frame. Once candidate frames are found, they become the final output without any further processing. To find the candidate frames, a correlation coefficient is used as the similarity measure. Automatic speech recognition experiments showed the proposed method to be effective, particularly for low-SNR speech signals corrupted by white noise or noise in high-frequency bands. Since a direct implementation of this method would require an infeasible computational cost for searching through the reference signals, a coarse-to-fine search strategy is also introduced in this paper.
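The core search step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes time-domain frames and a plain Pearson correlation, whereas the actual system operates on speech frames from the reference database with a coarse-to-fine search.

```python
import math

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length frames."""
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db) if da and db else 0.0

def best_reference_frame(noisy_frame, reference_frames):
    """Exhaustive example-based search: return the reference frame most
    correlated with the noisy input frame."""
    return max(reference_frames, key=lambda r: correlation(noisy_frame, r))
```

Because correlation is invariant to scaling, a matching clean frame can score highly even when the input is attenuated or offset by noise, which motivates its use as the similarity measure.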
Seiichi NAKAGAWA Wei ZHANG Mitsuo TAKAHASHI
We present a new text-independent/text-prompted speaker recognition method that combines a speaker-specific Gaussian Mixture Model (GMM) with a syllable-based HMM adapted by MLLR or MAP. In this paper, we evaluate the robustness of this speaker recognition method to changes in speaking style. We conducted speaker identification experiments on the NTT database, which consists of sentence data uttered at three speaking rates (normal, fast and slow) by 35 Japanese speakers (22 males and 13 females) over five sessions spanning ten months. Each speaker provided only 5 training utterances (about 20 seconds in total). The combination method reduced the identification error rate by about 50%. We obtained an accuracy of 98.8% for text-independent speaker identification across the three speaking modes (normal, fast, slow) using a short test utterance (about 4 seconds), and in particular an accuracy of 99.4% for the normal speaking mode. These results are superior to those of conventional methods on the same database. We show that this strong result stems from the complementary effect of the speaker-specific GMM and the speaker-adapted syllable-based HMM.
Carlos TRONCOSO Tatsuya KAWAHARA
We present a novel trigger-based language model adaptation method oriented to the transcription of meetings. In meetings, the topic is focused and consistent throughout the whole session, so keywords can be correlated over long distances. The trigger-based language model is designed to capture such long-distance dependencies, but it is typically constructed from a large corpus, which is usually too general to derive task-dependent trigger pairs. In the proposed method, we make use of the initial speech recognition results to extract task-dependent trigger pairs and to estimate their statistics. Moreover, we introduce a back-off scheme that also exploits the statistics estimated from a large corpus. The proposed model reduced the test-set perplexity considerably more than the typical trigger-based language model constructed from a large corpus, and achieved a remarkable perplexity reduction of 44% over the baseline when combined with an adapted trigram language model. In addition, a reduction in word error rate was obtained when using the proposed language model to rescore word graphs.
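The extraction of task-dependent trigger pairs from initial recognition results can be sketched as below. This is a simplified illustration under assumed details: raw co-occurrence counts over content words stand in for the statistics the paper estimates, and the stopword filter and `min_count` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def extract_trigger_pairs(hypothesis_words, stopwords, min_count=2):
    """Collect candidate trigger pairs: ordered pairs of distinct content
    words that co-occur in the initial recognition hypothesis, keeping
    those seen at least min_count times."""
    content = [w for w in hypothesis_words if w not in stopwords]
    pairs = Counter()
    for w1, w2 in combinations(content, 2):  # preserves word order
        if w1 != w2:
            pairs[(w1, w2)] += 1
    return {p: c for p, c in pairs.items() if c >= min_count}
```

In a real system these counts would be smoothed and backed off to large-corpus statistics, as the abstract describes.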
Yoshiyuki NAKAMURA Thomas CLOUQUEUR Kewal K. SALUJA Hideo FUJIWARA
In this paper, we provide a practical formulation of the problem of identifying all error occurrences and all failed scan cells in an at-speed scan-based BIST environment. We propose a method that can identify every error when the circuit test frequency is higher than the tester frequency. Our approach requires very little extra hardware for diagnosis, and the test application time required to identify errors is a linear function of the frequency ratio between the CUT and the tester.
Tetsuya TAKIGUCHI Masafumi NISHIMURA Yasuo ARIKI
This paper describes a hands-free speech recognition technique based on acoustic model adaptation to reverberant speech. In hands-free speech recognition, the recognition accuracy is degraded by reverberation, since each segment of speech is affected by the reflection energy of the preceding segment. To compensate for the reflection signal, we introduce a frame-by-frame adaptation method that adds the reflection signal to the means of the acoustic model. The reflection signal is approximated by a first-order linear prediction from the observation signal at the preceding frame, and the linear prediction coefficient is estimated with a maximum likelihood method by using the EM algorithm, which maximizes the likelihood of the adaptation data. Its effectiveness is confirmed by word recognition experiments on reverberant speech.
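The mean-adaptation step above can be sketched in a few lines. This is a schematic illustration only: it assumes the means and observations live in a domain where the reflection is additive, and `alpha` is taken as given, whereas the paper estimates it by maximum likelihood with the EM algorithm.

```python
def adapt_means(means, prev_observation, alpha):
    """Frame-by-frame adaptation: add a first-order prediction of the
    reflection signal (alpha times the previous-frame observation) to
    each acoustic-model mean vector component."""
    return [m + alpha * prev_observation[i] for i, m in enumerate(means)]
```

At each frame the decoder would score the observation against these shifted means, so the model tracks the reflection energy carried over from the preceding segment.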
Shoei SATO Kazuo ONOE Akio KOBAYASHI Toru IMAI
This paper proposes a new method for compensating acoustic scores in the Viterbi search for robust speech recognition. The method introduces noise models to represent a wide variety of noises and realizes robust decoding together with conventional subtraction and adaptation techniques. The likelihoods of the noise models are used in two ways. One is to calculate a confidence factor for each input frame by comparing the likelihoods of the speech models and the noise models; the weight of the acoustic score for a noisy frame is then reduced according to the confidence factor. The other is to use the likelihood of a noise model as an alternative to that of a silence model when the input is noisy. Since a lower confidence factor compresses acoustic scores, the decoder relies more on language scores and keeps more hypotheses within a fixed search depth for a noisy frame. An experiment using commentary transcriptions of a broadcast sports program (MLB: Major League Baseball) showed that the proposed method obtained a 6.7% relative word error reduction. The method also reduced the relative error rate of keywords by 17.9%, which is expected to lead to improved metadata extraction accuracy.
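The per-frame weighting described above can be sketched as follows. The logistic form of the confidence factor is a hypothetical stand-in; the paper's exact formula for comparing speech- and noise-model likelihoods may differ.

```python
import math

def frame_confidence(speech_loglik, noise_loglik):
    """Confidence factor in (0, 1): near 1 when the speech models fit the
    frame much better than the noise models, near 0 when they do not.
    (Hypothetical logistic form for illustration.)"""
    return 1.0 / (1.0 + math.exp(noise_loglik - speech_loglik))

def compensated_score(acoustic_score, confidence):
    """Compress the (log-domain) acoustic score for low-confidence frames,
    so the decoder leans more on language scores there."""
    return confidence * acoustic_score
```

Scaling a log-likelihood toward zero flattens the differences between competing hypotheses at that frame, which is what lets more hypotheses survive a fixed beam on noisy frames.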
Conventional confidence measures for assessing the reliability of ASR (automatic speech recognition) output are typically derived from "low-level" information obtained during speech recognition decoding. In contrast to these approaches, we propose a novel utterance verification framework which incorporates "high-level" knowledge sources. Specifically, we investigate two application-independent measures: in-domain confidence, the degree of match between the input utterance and the application domain of the back-end system, and discourse coherence, the consistency between consecutive utterances in a dialogue session. A joint confidence score is generated by combining these two measures with an orthodox measure based on GPP (generalized posterior probability). The proposed framework was evaluated on an utterance verification task for spontaneous dialogue performed via an (English/Japanese) speech-to-speech translation system. Incorporating the two proposed measures significantly improved utterance verification accuracy compared to using GPP alone, realizing reductions in CER (confidence error rate) of 11.4% and 8.1% for the English and Japanese sides, respectively. When negligible ASR errors (those that do not affect translation) were ignored, further improvement was achieved for the English side, realizing a reduction in CER of up to 14.6% compared to the GPP case.
Employing the noise masking threshold (NMT) to adapt a speech enhancement system has become popular due to the advantage of rendering the residual noise perceptually white. Most methods employ the NMT to empirically adjust the parameters of a speech enhancement system according to the various properties of the noise. In this article, without any predefined empirical factor, an explicit-form gain factor for each frequency bin is derived by perceptually constraining the residual noise below the NMT in the spectral domain. This perceptual constraint preserves the spectrum of noisy speech when the level of residual noise is below the NMT. If the level of residual noise exceeds the NMT, the spectrum of noisy speech is suppressed to reduce the corrupting noise. Experimental results show that the proposed approach efficiently removes the added noise under various noise corruptions and is almost free from musical residual noise.
Kouichi YAMAGUCHI Muneo FUKAISHI
This paper describes a BIST circuit for testing SoC integrated multi-channel serializer/deserializer (SerDes) macros. A newly developed packet-based PRBS generator enables the BIST to perform at-speed testing of asynchronous data transfers. In addition, a new technique for chained alignment checks between adjacent channels helps achieve a channel-count-independent architecture for verification of multi-channel alignment between SerDes macros. Fabricated in a 0.13-µm CMOS process and operating at > 500 MHz, the BIST has successfully verified all SerDes functions in at-speed testing of 5-Gbps 20-ch SerDes macros.
Since their inception almost fifty years ago, hidden Markov models (HMMs) have become the predominant methodology for automatic speech recognition (ASR) systems; today, most state-of-the-art speech systems are HMM-based. There have been a number of ways to explain HMMs and to list their capabilities, each having both advantages and disadvantages. In an effort to better understand what HMMs can do, this tutorial article analyzes HMMs by exploring a definition of HMMs in terms of random variables and conditional independence assumptions. We prefer this definition as it allows us to reason more thoroughly about the capabilities of HMMs. In particular, it is possible to deduce that there are, in theory at least, no limitations to the class of probability distributions representable by HMMs. This paper concludes that, in the search for a model to supersede the HMM (say, for ASR), rather than trying to correct for HMM limitations in the general case, new models should be sought based on their potential for better parsimony, computational requirements, and noise insensitivity.
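The random-variable definition of an HMM leads directly to the standard forward recursion for the probability of an observation sequence, which can be sketched as below. This is a textbook illustration, not code from the tutorial; states and observations are integer indices into the given matrices.

```python
def forward(pi, A, B, observations):
    """Forward algorithm: P(O_1..O_T) for an HMM with initial state
    distribution pi, transition matrix A[s'][s], and emission matrix
    B[s][o], following the conditional independence assumptions
    (Markov states, emissions depending only on the current state)."""
    n = len(pi)
    # alpha[s] = P(O_1..O_t, state_t = s)
    alpha = [pi[s] * B[s][observations[0]] for s in range(n)]
    for o in observations[1:]:
        alpha = [sum(alpha[sp] * A[sp][s] for sp in range(n)) * B[s][o]
                 for s in range(n)]
    return sum(alpha)
```

Each update marginalizes over the previous hidden state, which is exactly the factorization the conditional independence assumptions license.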
Kazuhiro KONDO Kiyoshi NAKAGAWA
We proposed and evaluated a speech packet loss concealment method which predicts lost segments from speech included in packets either before, or both before and after, the lost packet. The lost segments are predicted recursively by using linear prediction both in the forward direction from the packet preceding the loss, and in the backward direction from the packet succeeding the lost segment. Predicted samples in each direction are smoothed by averaging with linear weights to obtain the final interpolated signal. The adjacent segments are also smoothed extensively to significantly reduce the speech quality discontinuity between the interpolated signal and the received speech signal. Subjective quality comparisons between the proposed method and the packet loss concealment algorithm described in ITU standard G.711 Appendix I showed similar scores up to about 10% packet loss. However, the proposed method showed higher scores above this loss rate, with Mean Opinion Score ratings exceeding 2.4 even at an extremely high packet loss rate of 30%. Packet loss concealment of speech degraded with G.729 coding, and of speech mixed with babble noise, showed similar trends, with the proposed method yielding higher quality at high loss rates. We plan to further improve the performance by using an adaptive LPC prediction order depending on the estimated pitch, and adaptive LPC bandwidth expansion depending on the number of consecutive repetitive predictions, among many other improvements. We also plan to investigate complexity reduction using gradient LPC coefficient updates, and processing delay reduction using adaptive forward/bidirectional prediction modes depending on the measured packet loss ratio.
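The bidirectional prediction and linear-weight smoothing can be sketched as below. This is a deliberately reduced illustration: a first-order predictor estimated from lag-1 correlation stands in for the higher-order LPC the method actually uses, and the crossfade weights are the simple linear ramp described in the abstract.

```python
def ar1_coeff(x):
    """Estimate a first-order linear-prediction coefficient from samples
    (a simplification of full LPC analysis)."""
    num = sum(x[i] * x[i + 1] for i in range(len(x) - 1))
    den = sum(v * v for v in x[:-1])
    return num / den if den else 0.0

def conceal(before, after, n_lost):
    """Predict the lost segment forward from `before` and backward from
    `after`, then blend the two predictions with linear weights."""
    a_f = ar1_coeff(before)
    a_b = ar1_coeff(after[::-1])  # backward predictor on reversed samples
    fwd, s = [], before[-1]
    for _ in range(n_lost):
        s *= a_f
        fwd.append(s)
    bwd, s = [], after[0]
    for _ in range(n_lost):
        s *= a_b
        bwd.append(s)
    bwd.reverse()
    out = []
    for i, (f, b) in enumerate(zip(fwd, bwd)):
        w = i / (n_lost - 1) if n_lost > 1 else 0.5  # 0 -> forward, 1 -> backward
        out.append((1 - w) * f + w * b)
    return out
```

The linear weights give the forward prediction full weight at the start of the gap and the backward prediction full weight at the end, which is what suppresses the discontinuity at both boundaries.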
Yusuke HIWASAKI Toru MORINAGA Jotaro IKEDO Akitoshi KATAOKA
This paper presents a way of using a linear regression model to produce a single-valued criterion that indicates the perceived importance of each block in a stream of speech blocks. This method is superior to the conventional approach, voice activity detection (VAD), in that it provides a dynamically changing priority value for speech segments with finer granularity. The approach can be used in conjunction with scalable speech coding techniques in the context of IP QoS services to achieve a flexible form of quality control for speech transmission. A simple linear regression model is used to estimate a mean opinion score (MOS) of the various cases of missing speech segments. The estimated MOS is a continuous value that can be mapped to priority levels with arbitrary granularity. Through subjective evaluation, we show the validity of the calculated priority values.
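The regression-and-mapping pipeline above can be sketched as follows. The feature set, coefficients, and number of priority levels here are all hypothetical; the paper fits its regression model to subjective scores for various missing-segment cases.

```python
def estimate_mos(features, weights, bias):
    """Linear-regression estimate of the MOS for one speech block
    (hypothetical features and fitted coefficients)."""
    return bias + sum(w * f for w, f in zip(weights, features))

def to_priority(mos, n_levels=4, mos_min=1.0, mos_max=5.0):
    """Map the continuous MOS estimate onto n_levels discrete priority
    levels; granularity is arbitrary, per the abstract."""
    frac = (mos - mos_min) / (mos_max - mos_min)
    return min(n_levels - 1, max(0, int(frac * n_levels)))
```

Because the estimated MOS is continuous, the same model supports coarse or fine priority scales simply by changing `n_levels`, which is the flexibility the abstract highlights over binary VAD.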
Chun-Lung HSU Mean-Hom HO Chin-Feng LIN
This study presents a new current-mirror sense amplifier (CMSA) design for high-speed static random access memory (SRAM) applications. The proposed CMSA can directly sense the current of the memory cell and needs only two transistor stages cascaded from VDD to GND to achieve low-voltage operation. Moreover, the sensing speed of the proposed CMSA is independent of the bit-line capacitances and only slightly sensitive to the data-line capacitances. Based on simulations using TSMC 0.25-µm 2P4M CMOS process parameters, the proposed CMSA can operate effectively at 500 MHz-1 GHz with a working voltage as low as 1.5 V. Simulation results show that the proposed CMSA achieves a significant speed improvement over conventional sense amplifiers. The effectiveness of the proposed CMSA is also demonstrated with a read-cycle-only memory system, confirming its good performance for SRAM applications.
This study presents a fast adaptive algorithm for noise estimation in non-stationary environments. For noise estimation to adapt quickly to non-stationary noise environments, a robust entropy-based voice activity detector (VAD) is required. It is well known that an entropy-based measure defined in the spectral domain is very insensitive to the changing level of noise. To exploit the straight-line structures present in speech-only spectrograms, we present an improved spectral entropy measure, based on the spectrum entropy proposed by Shen et al., named band-splitting spectrum entropy (BSE). Consequently, the proposed recursive noise estimator incorporating the BSE-based VAD can update the noise power spectrum accurately even when the noise level changes quickly.
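The underlying idea of spectral-entropy VAD can be sketched as below. Note this shows plain spectral entropy only, not the proposed band-splitting (BSE) variant, and the decision threshold is a hypothetical value.

```python
import math

def spectral_entropy(power_spectrum):
    """Normalized entropy of a power spectrum in [0, 1]: low for peaky
    (voiced speech) spectra, high for flat (noise-like) spectra."""
    total = sum(power_spectrum)
    probs = [p / total for p in power_spectrum if p > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(power_spectrum))

def is_speech(power_spectrum, threshold=0.8):
    """Simple entropy-based VAD decision (hypothetical threshold)."""
    return spectral_entropy(power_spectrum) < threshold
```

Because the entropy depends only on the spectral shape, not its overall level, the decision is largely unaffected by changes in noise level, which is the property the abstract relies on.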
Hua ZHENG Shingo OMURA Koichi WADA
We consider a network in which a special piece of data called a certificate is issued between two users, and all certificates issued by the users in the network can be represented by a directed graph. For any two users u and v, when u needs to send a message to v securely, v's public key is needed. The user u can obtain v's public key using the certificates stored in u and v. We need to disperse the certificates to the users so that whenever a user wants to send a message to another user securely, there are enough certificates between them to obtain a reliable public key. In this paper, given a certificate graph and a set of communication requests, we consider the problem of dispersing the certificates among the nodes of the network such that the communication requests are satisfied and the total number of certificates stored in the nodes is minimized. We formulate this problem as MINIMUM CERTIFICATE DISPERSAL (MCD for short). We show that MCD is NP-complete even if its input graph is restricted to a strongly connected graph. We also present a polynomial-time 2-approximation algorithm, MinPivot, for strongly connected graphs when the communication requests satisfy certain restrictions. We introduce some graph classes for which MinPivot can compute optimal dispersals, such as trees, rings, and some Cartesian products of graphs.
This paper describes the author's perspective on multimedia quality prediction methodologies for multimedia communications in advanced mobile and internet protocol (IP)-based telephony, and reports related experiments and trials. First, the paper describes the need for perceptual QoS (Quality of Service) assessment in which various quality factors in multimedia communications for advanced mobile and IP-based telephony are analyzed. Then an objective quality prediction scheme is proposed from the viewpoints of quality measurement tools for each quality factor and an opinion model for compound quality factors in mobile and IP-based communications networks. Finally, the author's current trials of measurement tools and opinion models are described.
Hideaki YAMADA Norihiro FUKUMOTO
We present a quantitative evaluation of speech quality under a multiplexing scheme for the efficient transmission of voice signals, which reduces the number of IP packets carrying voice signals (VoIP packets). The multiplexing scheme is applicable to a variety of media gateways controlling bulk voice streams over IP-based networks based on VoIP technology. We speculated that the multiplexing scheme would reduce the degradation of speech quality due to packet loss, since it has an effect similar to interleaving the voice signal streams. However, this interleaving effect on speech quality in IP-based multiplexing has not previously been quantified. Through end-to-end speech quality evaluations of transmission via the multiplexing scheme using dedicated hardware, we confirm the advantages of the multiplexing scheme in achieving improved speech quality without increasing the processing delay under practical packet loss conditions in an IP-based network.
Hossein SHAMSI Omid SHOAEI Roghayeh DOOST
In this paper, using an exact analytic approach, the clock jitter in the feedback path of continuous-time Delta-Sigma modulators (CT DSM) is modeled as an additive jitter noise, providing a time-invariant model for a jittery CT DSM. Then, for various DAC waveforms, the power spectral density (PSD) of the clock jitter at the output of the DAC is derived, and by using an approximation the in-band power of the clock jitter at the output of the modulator is extracted. The simplicity and generality of the proposed approach are its main advantages. MATLAB and HSPICE simulation results confirm the validity of the proposed formulas.
Noriko Y. YAMASAKI Yoh TAKEI Kensuke MASUI Kazuhisa MITSUDA Toshimitsu MOROOKA Satoshi NAKAYAMA
In frequency-domain multiplexing (FDM) for TES signals, a magnetic field summation method utilizing a multi-input SQUID has the fundamental merit of small degradation of the signal-to-noise ratio. We formulated shifts of the operating point due to a common impedance and cross-talk currents. These effects are evaluated for several FDM methods, and the requirements for the bandwidth and filters are summarized. The design parameters of multi-input SQUIDs and flux-locked-loop driving circuits are also presented.