Shigeki MATSUDA Takatoshi JITSUHIRO Konstantin MARKOV Satoshi NAKAMURA
In this paper, we describe a parallel decoding-based ASR system developed at ATR that is robust to noise type, SNR and speaking style. It is difficult to recognize speech affected by various factors, especially when an ASR system contains only a single acoustic model. One solution is to employ multiple acoustic models, one for each different condition. Even though the robustness of each acoustic model is limited, the whole ASR system can handle various conditions appropriately. Our system consists of two recognition sub-systems that use different features: MFCC and Differential MFCC (DMFCC). Each sub-system has several acoustic models depending on SNR, speaker gender and speaking style, and during recognition each acoustic model is adapted by fast noise adaptation. From each sub-system, one hypothesis is selected based on posterior probability. The final recognition result is obtained by combining the best hypotheses from the two sub-systems. On the AURORA-2J task, widely used for the evaluation of noise robustness, our system achieved higher recognition performance than a system containing only a single model. Our system was also tested on normal and hyper-articulated speech contaminated by several background noises, and exhibited high robustness to both noise and speaking style.
Takehiro IHARA Takayuki NAGAI Kazuhiko OZEKI Akira KUREMATSU
We present a novel approach to single-channel noise reduction of speech signals contaminated by additive noise. In this approach, the system requires speech samples uttered in advance by the same speaker as that of the input signal. These speech samples must have enough phonetic variety to reconstruct the input signal. The proposed method, which we refer to as referential reconstruction, uses a small database created from examples of speech, which we call reference signals. Referential reconstruction takes an example-based approach: the objective is to find, for each noisy input frame, the candidate frame most similar to the underlying clean frame. Once candidate frames are found, they become the final output without any further processing. To find the candidate frames, a correlation coefficient is used as the similarity measure. Automatic speech recognition experiments showed the proposed method to be effective, particularly for low-SNR speech signals corrupted by white noise or noise in high-frequency bands. Since a direct implementation of this method would require an infeasible computational cost for searching through the reference signals, a coarse-to-fine search strategy is also introduced in this paper.
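The core search step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes time-domain frames and a plain Pearson correlation, whereas the actual system operates on speech frames from the reference database with a coarse-to-fine search.

```python
import math

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length frames."""
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db) if da and db else 0.0

def best_reference_frame(noisy_frame, reference_frames):
    """Exhaustive example-based search: return the reference frame most
    correlated with the noisy input frame."""
    return max(reference_frames, key=lambda r: correlation(noisy_frame, r))
```

Because correlation is invariant to scaling, a matching clean frame can score highly even when the input is attenuated or offset by noise, which motivates its use as the similarity measure.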
Seiichi NAKAGAWA Wei ZHANG Mitsuo TAKAHASHI
We present a new text-independent/text-prompted speaker recognition method that combines a speaker-specific Gaussian Mixture Model (GMM) with a syllable-based HMM adapted by MLLR or MAP. In this paper, we evaluate the robustness of this speaker recognition method to changes in speaking style. We conducted speaker identification experiments on the NTT database, which consists of sentence data uttered at three speaking rates (normal, fast and slow) by 35 Japanese speakers (22 males and 13 females) over five sessions spanning ten months. Each speaker provided only 5 training utterances (about 20 seconds in total). The combination method reduced the identification error rate by about 50%. We obtained an accuracy of 98.8% for text-independent speaker identification across the three speaking modes (normal, fast, slow) using a short test utterance (about 4 seconds), and in particular an accuracy of 99.4% for the normal speaking mode. These results are superior to those of conventional methods on the same database. We show that this strong result stems from the complementary effect of the speaker-specific GMM and the speaker-adapted syllable-based HMM.
Carlos TRONCOSO Tatsuya KAWAHARA
We present a novel trigger-based language model adaptation method oriented to the transcription of meetings. In meetings, the topic is focused and consistent throughout the whole session, so keywords can be correlated over long distances. The trigger-based language model is designed to capture such long-distance dependencies, but it is typically constructed from a large corpus, which is usually too general to derive task-dependent trigger pairs. In the proposed method, we make use of the initial speech recognition results to extract task-dependent trigger pairs and to estimate their statistics. Moreover, we introduce a back-off scheme that also exploits the statistics estimated from a large corpus. The proposed model reduced the test-set perplexity considerably more than the typical trigger-based language model constructed from a large corpus, and achieved a remarkable perplexity reduction of 44% over the baseline when combined with an adapted trigram language model. In addition, a reduction in word error rate was obtained when using the proposed language model to rescore word graphs.
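The extraction of task-dependent trigger pairs from initial recognition results can be sketched as below. This is a simplified illustration under assumed details: raw co-occurrence counts over content words stand in for the statistics the paper estimates, and the stopword filter and `min_count` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def extract_trigger_pairs(hypothesis_words, stopwords, min_count=2):
    """Collect candidate trigger pairs: ordered pairs of distinct content
    words that co-occur in the initial recognition hypothesis, keeping
    those seen at least min_count times."""
    content = [w for w in hypothesis_words if w not in stopwords]
    pairs = Counter()
    for w1, w2 in combinations(content, 2):  # preserves word order
        if w1 != w2:
            pairs[(w1, w2)] += 1
    return {p: c for p, c in pairs.items() if c >= min_count}
```

In a real system these counts would be smoothed and backed off to large-corpus statistics, as the abstract describes.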
Yoshiyuki NAKAMURA Thomas CLOUQUEUR Kewal K. SALUJA Hideo FUJIWARA
In this paper, we provide a practical formulation of the problem of identifying all error occurrences and all failed scan cells in an at-speed scan-based BIST environment. We propose a method that can identify every error when the circuit test frequency is higher than the tester frequency. Our approach requires very little extra hardware for diagnosis, and the test application time required to identify errors is a linear function of the frequency ratio between the CUT and the tester.
Tetsuya TAKIGUCHI Masafumi NISHIMURA Yasuo ARIKI
This paper describes a hands-free speech recognition technique based on acoustic model adaptation to reverberant speech. In hands-free speech recognition, the recognition accuracy is degraded by reverberation, since each segment of speech is affected by the reflection energy of the preceding segment. To compensate for the reflection signal, we introduce a frame-by-frame adaptation method that adds the reflection signal to the means of the acoustic model. The reflection signal is approximated by a first-order linear prediction from the observation signal at the preceding frame, and the linear prediction coefficient is estimated with a maximum likelihood method by using the EM algorithm, which maximizes the likelihood of the adaptation data. Its effectiveness is confirmed by word recognition experiments on reverberant speech.
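The mean-adaptation step above can be sketched in a few lines. This is a schematic illustration only: it assumes the means and observations live in a domain where the reflection is additive, and `alpha` is taken as given, whereas the paper estimates it by maximum likelihood with the EM algorithm.

```python
def adapt_means(means, prev_observation, alpha):
    """Frame-by-frame adaptation: add a first-order prediction of the
    reflection signal (alpha times the previous-frame observation) to
    each acoustic-model mean vector component."""
    return [m + alpha * prev_observation[i] for i, m in enumerate(means)]
```

At each frame the decoder would score the observation against these shifted means, so the model tracks the reflection energy carried over from the preceding segment.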
Shoei SATO Kazuo ONOE Akio KOBAYASHI Toru IMAI
This paper proposes a new method for compensating acoustic scores in the Viterbi search for robust speech recognition. The method introduces noise models to represent a wide variety of noises and realizes robust decoding together with conventional subtraction and adaptation techniques. The likelihoods of the noise models are used in two ways. One is to calculate a confidence factor for each input frame by comparing the likelihoods of the speech models and the noise models; the weight of the acoustic score for a noisy frame is then reduced according to the confidence factor. The other is to use the likelihood of a noise model as an alternative to that of a silence model when the input is noisy. Since a lower confidence factor compresses acoustic scores, the decoder relies more on language scores and keeps more hypotheses within a fixed search depth for a noisy frame. An experiment using commentary transcriptions of a broadcast sports program (MLB: Major League Baseball) showed that the proposed method obtained a 6.7% relative word error reduction. The method also reduced the relative error rate of keywords by 17.9%, which is expected to lead to improved metadata extraction accuracy.
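The per-frame weighting described above can be sketched as follows. The logistic form of the confidence factor is a hypothetical stand-in; the paper's exact formula for comparing speech- and noise-model likelihoods may differ.

```python
import math

def frame_confidence(speech_loglik, noise_loglik):
    """Confidence factor in (0, 1): near 1 when the speech models fit the
    frame much better than the noise models, near 0 when they do not.
    (Hypothetical logistic form for illustration.)"""
    return 1.0 / (1.0 + math.exp(noise_loglik - speech_loglik))

def compensated_score(acoustic_score, confidence):
    """Compress the (log-domain) acoustic score for low-confidence frames,
    so the decoder leans more on language scores there."""
    return confidence * acoustic_score
```

Scaling a log-likelihood toward zero flattens the differences between competing hypotheses at that frame, which is what lets more hypotheses survive a fixed beam on noisy frames.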
Conventional confidence measures for assessing the reliability of ASR (automatic speech recognition) output are typically derived from "low-level" information obtained during speech recognition decoding. In contrast to these approaches, we propose a novel utterance verification framework which incorporates "high-level" knowledge sources. Specifically, we investigate two application-independent measures: in-domain confidence, the degree of match between the input utterance and the application domain of the back-end system, and discourse coherence, the consistency between consecutive utterances in a dialogue session. A joint confidence score is generated by combining these two measures with an orthodox measure based on GPP (generalized posterior probability). The proposed framework was evaluated on an utterance verification task for spontaneous dialogue performed via an (English/Japanese) speech-to-speech translation system. Incorporating the two proposed measures significantly improved utterance verification accuracy compared to using GPP alone, realizing reductions in CER (confidence error rate) of 11.4% and 8.1% for the English and Japanese sides, respectively. When negligible ASR errors (those that do not affect translation) were ignored, further improvement was achieved for the English side, realizing a reduction in CER of up to 14.6% compared to the GPP case.
Employing the noise masking threshold (NMT) to adapt a speech enhancement system has become popular due to the advantage of rendering the residual noise perceptually white. Most methods employ the NMT to empirically adjust the parameters of a speech enhancement system according to the various properties of the noise. In this article, without any predefined empirical factor, an explicit-form gain factor for each frequency bin is derived by perceptually constraining the residual noise below the NMT in the spectral domain. This perceptual constraint preserves the spectrum of noisy speech when the level of residual noise is below the NMT. If the level of residual noise exceeds the NMT, the spectrum of noisy speech is suppressed to reduce the corrupting noise. Experimental results show that the proposed approach efficiently removes the added noise under various noise corruptions and is almost free from musical residual noise.
Kouichi YAMAGUCHI Muneo FUKAISHI
This paper describes a BIST circuit for testing SoC integrated multi-channel serializer/deserializer (SerDes) macros. A newly developed packet-based PRBS generator enables the BIST to perform at-speed testing of asynchronous data transfers. In addition, a new technique for chained alignment checks between adjacent channels helps achieve a channel-count-independent architecture for verification of multi-channel alignment between SerDes macros. Fabricated in a 0.13-µm CMOS process and operating at > 500 MHz, the BIST has successfully verified all SerDes functions in at-speed testing of 5-Gbps 20-ch SerDes macros.
Since their inception almost fifty years ago, hidden Markov models (HMMs) have become the predominant methodology for automatic speech recognition (ASR) systems; today, most state-of-the-art speech systems are HMM-based. There have been a number of ways to explain HMMs and to list their capabilities, each having both advantages and disadvantages. In an effort to better understand what HMMs can do, this tutorial article analyzes HMMs by exploring a definition of HMMs in terms of random variables and conditional independence assumptions. We prefer this definition as it allows us to reason more thoroughly about the capabilities of HMMs. In particular, it is possible to deduce that there are, in theory at least, no limitations to the class of probability distributions representable by HMMs. This paper concludes that, in the search for a model to supersede the HMM (say, for ASR), rather than trying to correct for HMM limitations in the general case, new models should be sought based on their potential for better parsimony, computational requirements, and noise insensitivity.
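The random-variable definition of an HMM leads directly to the standard forward recursion for the probability of an observation sequence, which can be sketched as below. This is a textbook illustration, not code from the tutorial; states and observations are integer indices into the given matrices.

```python
def forward(pi, A, B, observations):
    """Forward algorithm: P(O_1..O_T) for an HMM with initial state
    distribution pi, transition matrix A[s'][s], and emission matrix
    B[s][o], following the conditional independence assumptions
    (Markov states, emissions depending only on the current state)."""
    n = len(pi)
    # alpha[s] = P(O_1..O_t, state_t = s)
    alpha = [pi[s] * B[s][observations[0]] for s in range(n)]
    for o in observations[1:]:
        alpha = [sum(alpha[sp] * A[sp][s] for sp in range(n)) * B[s][o]
                 for s in range(n)]
    return sum(alpha)
```

Each update marginalizes over the previous hidden state, which is exactly the factorization the conditional independence assumptions license.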
Kazuhiro KONDO Kiyoshi NAKAGAWA
We proposed and evaluated a speech packet loss concealment method which predicts lost segments from speech included in packets either before, or both before and after, the lost packet. The lost segments are predicted recursively by using linear prediction both in the forward direction from the packet preceding the loss, and in the backward direction from the packet succeeding the lost segment. Predicted samples in each direction are smoothed by averaging with linear weights to obtain the final interpolated signal. The adjacent segments are also smoothed extensively to significantly reduce the speech quality discontinuity between the interpolated signal and the received speech signal. Subjective quality comparisons between the proposed method and the packet loss concealment algorithm described in ITU standard G.711 Appendix I showed similar scores up to about 10% packet loss. However, the proposed method showed higher scores above this loss rate, with Mean Opinion Score ratings exceeding 2.4 even at an extremely high packet loss rate of 30%. Packet loss concealment of speech degraded with G.729 coding, and of speech mixed with babble noise, showed similar trends, with the proposed method yielding higher quality at high loss rates. We plan to further improve the performance by using an adaptive LPC prediction order depending on the estimated pitch, and adaptive LPC bandwidth expansion depending on the number of consecutive repetitive predictions, among many other improvements. We also plan to investigate complexity reduction using gradient LPC coefficient updates, and processing delay reduction using adaptive forward/bidirectional prediction modes depending on the measured packet loss ratio.
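The bidirectional prediction and linear-weight smoothing can be sketched as below. This is a deliberately reduced illustration: a first-order predictor estimated from lag-1 correlation stands in for the higher-order LPC the method actually uses, and the crossfade weights are the simple linear ramp described in the abstract.

```python
def ar1_coeff(x):
    """Estimate a first-order linear-prediction coefficient from samples
    (a simplification of full LPC analysis)."""
    num = sum(x[i] * x[i + 1] for i in range(len(x) - 1))
    den = sum(v * v for v in x[:-1])
    return num / den if den else 0.0

def conceal(before, after, n_lost):
    """Predict the lost segment forward from `before` and backward from
    `after`, then blend the two predictions with linear weights."""
    a_f = ar1_coeff(before)
    a_b = ar1_coeff(after[::-1])  # backward predictor on reversed samples
    fwd, s = [], before[-1]
    for _ in range(n_lost):
        s *= a_f
        fwd.append(s)
    bwd, s = [], after[0]
    for _ in range(n_lost):
        s *= a_b
        bwd.append(s)
    bwd.reverse()
    out = []
    for i, (f, b) in enumerate(zip(fwd, bwd)):
        w = i / (n_lost - 1) if n_lost > 1 else 0.5  # 0 -> forward, 1 -> backward
        out.append((1 - w) * f + w * b)
    return out
```

The linear weights give the forward prediction full weight at the start of the gap and the backward prediction full weight at the end, which is what suppresses the discontinuity at both boundaries.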
Yusuke HIWASAKI Toru MORINAGA Jotaro IKEDO Akitoshi KATAOKA
This paper presents a way of using a linear regression model to produce a single-valued criterion that indicates the perceived importance of each block in a stream of speech blocks. This method is superior to the conventional approach, voice activity detection (VAD), in that it provides a dynamically changing priority value for speech segments with finer granularity. The approach can be used in conjunction with scalable speech coding techniques in the context of IP QoS services to achieve a flexible form of quality control for speech transmission. A simple linear regression model is used to estimate a mean opinion score (MOS) of the various cases of missing speech segments. The estimated MOS is a continuous value that can be mapped to priority levels with arbitrary granularity. Through subjective evaluation, we show the validity of the calculated priority values.
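The regression-and-mapping pipeline above can be sketched as follows. The feature set, coefficients, and number of priority levels here are all hypothetical; the paper fits its regression model to subjective scores for various missing-segment cases.

```python
def estimate_mos(features, weights, bias):
    """Linear-regression estimate of the MOS for one speech block
    (hypothetical features and fitted coefficients)."""
    return bias + sum(w * f for w, f in zip(weights, features))

def to_priority(mos, n_levels=4, mos_min=1.0, mos_max=5.0):
    """Map the continuous MOS estimate onto n_levels discrete priority
    levels; granularity is arbitrary, per the abstract."""
    frac = (mos - mos_min) / (mos_max - mos_min)
    return min(n_levels - 1, max(0, int(frac * n_levels)))
```

Because the estimated MOS is continuous, the same model supports coarse or fine priority scales simply by changing `n_levels`, which is the flexibility the abstract highlights over binary VAD.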
Chun-Lung HSU Mean-Hom HO Chin-Feng LIN
This study presents a new current-mirror sense amplifier (CMSA) design for high-speed static random access memory (SRAM) applications. The proposed CMSA can directly sense the current of the memory cell and needs only two transistor stages cascaded from VDD to GND to achieve low-voltage operation. Moreover, the sensing speed of the proposed CMSA is independent of the bit-line capacitances and only slightly sensitive to the data-line capacitances. Based on simulations using TSMC 0.25-µm 2P4M CMOS process parameters, the proposed CMSA can operate effectively at 500 MHz-1 GHz with a working voltage as low as 1.5 V. Simulation results show that the proposed CMSA achieves a significant speed improvement over conventional sense amplifiers. The effectiveness of the proposed CMSA is also demonstrated with a read-cycle-only memory system, confirming its good performance for SRAM applications.
This study presents a fast adaptive algorithm for noise estimation in non-stationary environments. For noise estimation to adapt quickly to non-stationary noise environments, a robust entropy-based voice activity detector (VAD) is required. It is well known that an entropy-based measure defined in the spectral domain is very insensitive to the changing level of noise. To exploit the straight-line structures present in speech-only spectrograms, we present an improved spectral entropy measure, based on the spectrum entropy proposed by Shen et al., named band-splitting spectrum entropy (BSE). Consequently, the proposed recursive noise estimator incorporating the BSE-based VAD can update the noise power spectrum accurately even when the noise level changes quickly.
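The underlying idea of spectral-entropy VAD can be sketched as below. Note this shows plain spectral entropy only, not the proposed band-splitting (BSE) variant, and the decision threshold is a hypothetical value.

```python
import math

def spectral_entropy(power_spectrum):
    """Normalized entropy of a power spectrum in [0, 1]: low for peaky
    (voiced speech) spectra, high for flat (noise-like) spectra."""
    total = sum(power_spectrum)
    probs = [p / total for p in power_spectrum if p > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(power_spectrum))

def is_speech(power_spectrum, threshold=0.8):
    """Simple entropy-based VAD decision (hypothetical threshold)."""
    return spectral_entropy(power_spectrum) < threshold
```

Because the entropy depends only on the spectral shape, not its overall level, the decision is largely unaffected by changes in noise level, which is the property the abstract relies on.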
Hua ZHENG Shingo OMURA Koichi WADA
We consider a network in which a special piece of data called a certificate is issued between two users, and all certificates issued by the users in the network can be represented by a directed graph. For any two users u and v, when u needs to send a message to v securely, v's public key is needed. The user u can obtain v's public key using the certificates stored in u and v. We need to disperse the certificates to the users so that whenever a user wants to send a message to another user securely, there are enough certificates between them to obtain a reliable public key. In this paper, given a certificate graph and a set of communication requests, we consider the problem of dispersing the certificates among the nodes of the network such that the communication requests are satisfied and the total number of certificates stored in the nodes is minimized. We formulate this problem as MINIMUM CERTIFICATE DISPERSAL (MCD for short). We show that MCD is NP-complete even if its input graph is restricted to a strongly connected graph. We also present a polynomial-time 2-approximation algorithm, MinPivot, for strongly connected graphs when the communication requests satisfy certain restrictions. We introduce some graph classes for which MinPivot can compute optimal dispersals, such as trees, rings, and some Cartesian products of graphs.
This paper describes the author's perspective on multimedia quality prediction methodologies for multimedia communications in advanced mobile and internet protocol (IP)-based telephony, and reports related experiments and trials. First, the paper describes the need for perceptual QoS (Quality of Service) assessment in which various quality factors in multimedia communications for advanced mobile and IP-based telephony are analyzed. Then an objective quality prediction scheme is proposed from the viewpoints of quality measurement tools for each quality factor and an opinion model for compound quality factors in mobile and IP-based communications networks. Finally, the author's current trials of measurement tools and opinion models are described.
Hideaki YAMADA Norihiro FUKUMOTO
We present a quantitative evaluation of speech quality under a multiplexing scheme for the efficient transmission of voice signals, which reduces the number of IP packets carrying voice signals (VoIP packets). The multiplexing scheme is applicable to a variety of media gateways controlling bulk voice streams over IP-based networks based on VoIP technology. We speculated that the multiplexing scheme would reduce the degradation of speech quality due to packet loss, since it has an effect similar to interleaving the voice signal streams. However, this interleaving effect on speech quality in IP-based multiplexing has not previously been quantified. Through end-to-end speech quality evaluations of transmission via the multiplexing scheme using dedicated hardware, we confirm the advantages of the multiplexing scheme in achieving improved speech quality without increasing the processing delay under practical packet loss conditions in an IP-based network.
Hossein SHAMSI Omid SHOAEI Roghayeh DOOST
In this paper, using an exact analytic approach, the clock jitter in the feedback path of continuous-time Delta-Sigma modulators (CT DSM) is modeled as an additive jitter noise, providing a time-invariant model for a jittery CT DSM. Then, for various DAC waveforms, the power spectral density (PSD) of the clock jitter at the output of the DAC is derived, and by using an approximation the in-band power of the clock jitter at the output of the modulator is extracted. The simplicity and generality of the proposed approach are its main advantages. MATLAB and HSPICE simulation results confirm the validity of the proposed formulas.
Noriko Y. YAMASAKI Yoh TAKEI Kensuke MASUI Kazuhisa MITSUDA Toshimitsu MOROOKA Satoshi NAKAYAMA
In frequency-domain multiplexing (FDM) for TES signals, a magnetic field summation method utilizing a multi-input SQUID has the fundamental merit of small degradation of the signal-to-noise ratio. We formulated shifts of the operating point due to a common impedance and cross-talk currents. These effects are evaluated for several FDM methods, and the requirements for the bandwidth and filters are summarized. The design parameters of multi-input SQUIDs and flux-locked-loop driving circuits are also presented.