Xiaoli XI Yongxing DU Jiangfan LIU Jinsheng ZHANG
The unconditionally stable finite-difference time-domain (US-FDTD) method based on Laguerre polynomial expansion and Galerkin temporal testing is used to model thin-film bulk acoustic wave resonators (TFBAR). Numerical results demonstrate the efficiency of the US-FDTD algorithm.
Hiroko MURAKAMI Koichi SHINODA Sadaoki FURUI
We propose an active learning framework for speech recognition that reduces the amount of data required for acoustic modeling. This framework consists of two steps. We first obtain a phone-error distribution using an acoustic model estimated from transcribed speech data. Then, from a text corpus we select a sentence whose phone-occurrence distribution is close to the phone-error distribution and collect its speech data. We repeat this process to increase the amount of transcribed speech data. We applied this framework to speaker adaptation and acoustic model training. Our evaluation results showed that it significantly reduced the amount of transcribed data while maintaining the same level of accuracy.
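As a hedged illustration of the sentence-selection step described above (not the authors' implementation), the sketch below picks, from a set of candidate sentences, the one whose phone-occurrence distribution is closest to the current phone-error distribution. The use of KL divergence as the closeness measure and the integer phone indexing are assumptions made for this example.

```python
# Illustrative sketch only: select the candidate sentence whose phone-occurrence
# distribution is closest (smallest KL divergence; an assumption) to the
# phone-error distribution.  `error_dist` is assumed to be a smoothed
# probability vector with no zero entries.
import numpy as np

def phone_distribution(phone_counts, num_phones, eps=1e-6):
    """Normalize raw phone counts into a smoothed probability distribution."""
    dist = np.full(num_phones, eps)
    for phone, count in phone_counts.items():   # phone: integer phone index
        dist[phone] += count
    return dist / dist.sum()

def select_sentence(error_dist, candidate_phone_counts, num_phones):
    """Return the index of the candidate sentence whose phone-occurrence
    distribution has the smallest KL divergence from the error distribution."""
    best_idx, best_kl = -1, np.inf
    for idx, counts in enumerate(candidate_phone_counts):
        occ_dist = phone_distribution(counts, num_phones)
        kl = np.sum(error_dist * np.log(error_dist / occ_dist))
        if kl < best_kl:
            best_idx, best_kl = idx, kl
    return best_idx
```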
Arata ITOH Sunao HARA Norihide KITAOKA Kazuya TAKEDA
A novel acoustic model training method based on speech feature generation is proposed for robust speaker-independent speech recognition. For decades, speaker adaptation methods have been widely used; all of these adaptation methods need adaptation data. In contrast, our proposed method aims to create speaker-independent acoustic models that cover not only known but also unknown speakers. We achieve this by adopting inverse maximum likelihood linear regression (MLLR) transformation-based feature generation, and then we train our models using these features. First, we obtain MLLR transformation matrices from a limited number of existing speakers. Then we extract the bases of the MLLR transformation matrices using PCA. The distribution of the weight parameters used to express the transformation matrices of the existing speakers is then estimated. Next, we construct pseudo-speaker transformations by sampling weight parameters from this distribution and apply them to the normalized features of the existing speakers to generate features of the pseudo-speakers. Finally, we train the acoustic models using these features. Evaluation results show that the acoustic models trained with the proposed method are robust to unknown speakers.
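The following is a minimal sketch, under stated assumptions, of the pseudo-speaker feature generation idea: PCA over vectorized MLLR transforms, Gaussian sampling of the PCA weights, and application of the reconstructed affine transform to normalized features. The matrix shapes, the independent-Gaussian weight model, and the function names are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of pseudo-speaker feature generation (assumed shapes and
# Gaussian weight model; not the paper's exact procedure).
import numpy as np

def build_bases(mllr_mats, num_bases):
    """mllr_mats: (S, d, d+1) MLLR transforms of S existing speakers."""
    S = mllr_mats.shape[0]
    vecs = mllr_mats.reshape(S, -1)
    mean = vecs.mean(axis=0)
    _, _, vt = np.linalg.svd(vecs - mean, full_matrices=False)
    bases = vt[:num_bases]                    # principal directions
    weights = (vecs - mean) @ bases.T         # per-speaker weight vectors
    return mean, bases, weights

def sample_pseudo_transform(mean, bases, weights, shape, rng):
    """Sample one pseudo-speaker transform from a per-dimension Gaussian
    fitted to the existing speakers' weight vectors (an assumption)."""
    mu, sigma = weights.mean(axis=0), weights.std(axis=0)
    w = rng.normal(mu, sigma)
    return (mean + w @ bases).reshape(shape)

def transform_features(W, feats):
    """Apply a (d, d+1) affine transform to (T, d) normalized features."""
    ext = np.hstack([np.ones((feats.shape[0], 1)), feats])   # rows [1, x]
    return ext @ W.T
```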
Xuemin ZHAO Yuhong GUO Jian LIU Yonghong YAN Qiang FU
In this paper, a logarithmic adaptive quantization projection (LAQP) algorithm for digital watermarking is proposed. Conventional quantization index modulation uses a fixed quantization step in the watermark embedding procedure, which leads to poor fidelity. Moreover, conventional methods are sensitive to value-metric scaling attacks. The LAQP method combines the quantization projection scheme with a perceptual model. In comparison with conventional quantization methods that use a perceptual model, LAQP only needs to calculate the perceptual model in the embedding procedure, avoiding the decoding errors introduced by differences between the perceptual models used in embedding and decoding. Experimental results show that the proposed watermarking scheme maintains better fidelity and is robust against common signal processing attacks. More importantly, the proposed scheme is invariant to value-metric scaling attacks.
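For orientation only, the sketch below shows generic quantization index modulation applied to the logarithm of a coefficient magnitude, so that a constant amplitude scaling becomes an additive offset in the quantized domain; it is not the authors' LAQP algorithm, and the fixed step size and dither rule are assumptions.

```python
# Generic log-domain QIM embedding/decoding (illustrative; not LAQP itself).
import numpy as np

def qim_embed_log(x, bit, step):
    """Embed one bit into a nonzero coefficient x by quantizing log|x|."""
    sign, mag = np.sign(x), np.abs(x)
    dither = bit * step / 2.0                 # two interleaved lattices
    q = step * np.round((np.log(mag) - dither) / step) + dither
    return sign * np.exp(q)

def qim_decode_log(y, step):
    """Decode the bit by finding the nearer of the two quantization lattices."""
    mag = np.log(np.abs(y))
    errs = []
    for bit in (0, 1):
        dither = bit * step / 2.0
        q = step * np.round((mag - dither) / step) + dither
        errs.append(abs(mag - q))
    return int(np.argmin(errs))
```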
Hwan Sik YUN Kiho CHO Nam Soo KIM
Acoustic data transmission is a technique that imperceptibly embeds data in a sound wave and detects it at a receiver. The data are embedded in an original audio signal and transmitted through the air by playing back the data-embedded audio through a loudspeaker. At the receiver, the data are extracted from the audio signal captured by a microphone. In our previous work, we proposed an acoustic data transmission system based on phase modification of the modulated complex lapped transform (MCLT) coefficients. In this paper, we propose the spectral magnitude adjustment (SMA) technique, which not only enhances the quality of the data-embedded audio signal but also improves the transmission performance of the system.
Masami AKAMINE Jitendra AJMERA
This paper proposes likelihood smoothing techniques to improve decision-tree-based acoustic models, where decision trees replace Gaussian mixture models in computing the observation likelihoods of a given HMM state in a speech recognition system. Decision trees have a number of advantageous properties, such as not imposing restrictions on the number or types of features and automatically performing feature selection. This paper describes basic configurations of decision-tree-based acoustic models and proposes two methods to improve the robustness of the basic model: DT mixture models and soft decisions for continuous features. Experimental results on the Aurora 2 speech database show that a system using decision trees offers state-of-the-art performance even without taking full advantage of its potential, and that soft decisions improve the performance of DT-based acoustic models with a 16.8% relative error rate reduction over hard decisions.
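A minimal sketch of the soft-decision idea for continuous features follows: each internal node weights its two branches with a sigmoid of the distance to the split threshold, and leaf likelihoods are averaged with those weights. The sigmoid slope `beta` and the node structure are illustrative assumptions rather than the paper's exact formulation.

```python
# Soft-decision traversal of a likelihood decision tree (illustrative sketch).
import math

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 likelihood=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.likelihood = likelihood          # set only at leaves

def soft_likelihood(node, x, beta=4.0):
    """Recursively compute a soft-decision likelihood for feature vector x."""
    if node.likelihood is not None:           # leaf: return its stored likelihood
        return node.likelihood
    # weight of the right branch rises smoothly as x crosses the threshold
    w_right = 1.0 / (1.0 + math.exp(-beta * (x[node.feature] - node.threshold)))
    return (w_right * soft_likelihood(node.right, x, beta)
            + (1.0 - w_right) * soft_likelihood(node.left, x, beta))
```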
Seokjin LEE Sang Ha PARK Koeng-Mo SUNG
In this paper, a geometric source separation system using nonnegative matrix factorization (NMF) is proposed. The adaptive beamformer is the best method for geometric source separation, but it suffers from a “target signal cancellation” problem in multi-path situations. We modified the HALS-NMF algorithm to decompose the signal into bases and developed an interference suppression module to cancel the interference bases. A performance comparison between the proposed system and the subband GSC-RLS algorithm was carried out in a MATLAB® simulation; the results show that the proposed system is robust in multi-path situations.
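The sketch below illustrates the underlying idea under simplifying assumptions: factorize a magnitude spectrogram into nonnegative bases and activations, mark some bases as interference, and reconstruct the target with the interference activations zeroed. Plain multiplicative-update NMF is used for brevity; the paper's modified HALS-NMF and its rule for identifying interference bases are not reproduced.

```python
# Basis decomposition and interference suppression via plain NMF (sketch only).
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Euclidean-distance NMF with multiplicative updates: V (F, T) >= 0."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W, H = rng.random((F, rank)), rng.random((rank, T))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def suppress_interference(W, H, interference_idx):
    """Reconstruct the spectrogram with the interference bases removed."""
    H_clean = H.copy()
    H_clean[list(interference_idx), :] = 0.0
    return W @ H_clean
```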
Toshio ITO Masanori SUGIMOTO Hiromichi HASHIZUME
This paper presents and evaluates a new acoustic imaging system that uses multicarrier signals for correlation division in synthetic transmit aperture (CD-STA). CD-STA is a method that transmits uncorrelated signals from different transducers simultaneously to achieve high-speed, high-resolution acoustic imaging. In CD-STA, autocorrelations and cross-correlations among the transmitted signals must be suppressed because they cause artifacts in the resulting images, which in turn narrow the dynamic range. To suppress this correlation noise, we previously proposed using multicarrier signals optimized by a genetic algorithm. Because the evaluation of the method in our previous reports was limited, we analyze it in more depth in this paper. We optimized three pairs of multicarrier waveforms of various lengths, corresponding to 5th-, 6th- and 7th-order M-sequence signals, respectively, and built a CD-STA imaging system that operates in air. Using the system, we conducted imaging experiments to evaluate the image quality and resolution of the multicarrier signals. We also investigated the ability of the proposed method to resolve both the positions and velocities of target scatterers; for that purpose, we carried out an experiment in which both moving and fixed targets were visualized by our system. The experiments confirmed that the multicarrier signals have lower artifact levels, better axial resolution, and greater tolerance to velocity mismatch than M-sequence signals, particularly for short signals.
Masato NAKAYAMA Shimpei HANABUSA Tetsuji UEBO Noboru NAKASAKO
Distance to a target is fundamental and very important information in numerous engineering fields. Many acoustic distance measurement methods use the time delay of a reflected wave measured with reference to the transmitted wave. Such methods, however, cannot measure short distances because the transmitted wave, which has not attenuated sufficiently by the time the reflected waves are received, masks the reflected waves. We therefore previously proposed an acoustic distance measurement method based on the interference between the transmitted and reflected waves, which can measure short distances. In real environments, this method requires cancellation of background components due to the spectrum of the transmitted wave and the transfer function of the measurement system; we refer to this processing as background component cancellation processing (BGCCP) and previously realized it by subtraction or whitening. However, that method was limited with respect to the transmitted wave and additive noise in real environments. In the present paper, we propose an acoustic distance measurement method based on a new BGCCP, which uses calibration of the real measurement system and whitening of the transmitted wave, and introduces the cepstrum to achieve robustness. Whereas the conventional BGCCP requires recording the transmitted wave in the absence of targets, the new BGCCP does not. Finally, we confirmed the effectiveness of the proposed method through experiments in real environments; the method was shown to be valid and effective even in noisy environments.
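As a rough, assumption-laden illustration of interference-based ranging (not the proposed BGCCP itself): the interference of the transmitted and reflected waves imposes a cosine ripple of period c/(2d) on the received power spectrum, so whitening by the transmitted-wave spectrum and taking a further FFT over frequency yields a peak at the round-trip delay 2d/c.

```python
# Illustrative interference-based range estimate (assumes received and
# transmitted recordings of the same length; calibration and the paper's
# cepstrum-based BGCCP are not reproduced).
import numpy as np

def estimate_distance(received, transmitted, fs, c=343.0, eps=1e-12):
    """Estimate target distance [m] from one interference measurement."""
    R = np.abs(np.fft.rfft(received)) ** 2
    T = np.abs(np.fft.rfft(transmitted)) ** 2
    ripple = R / (T + eps)                    # whitened spectrum: ~cosine in f
    ripple -= ripple.mean()                   # remove the DC background
    lag_spectrum = np.abs(np.fft.rfft(ripple))
    df = fs / len(received)                   # frequency resolution [Hz]
    lags = np.fft.rfftfreq(len(ripple), d=df) # round-trip delays [s]
    peak = np.argmax(lag_spectrum[1:]) + 1    # skip the zero-lag bin
    return c * lags[peak] / 2.0
```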
Atsunori OGAWA Satoshi TAKAHASHI Atsushi NAKAMURA
This paper proposes an efficient combination of state likelihood recycling and batch state likelihood calculation for accelerating acoustic likelihood calculation in an HMM-based speech recognizer. Recycling and batch calculation are based on different technical approaches: the former is a purely algorithmic technique, while the latter fully exploits the computer architecture. To accelerate the recognition process further by combining them efficiently, we introduce conditional fast processing and acoustic backing-off. Conditional fast processing is based on two criteria. The first, a potential activity criterion, is used to control not only the recycling of state likelihoods at the current frame but also the precalculation of state likelihoods for several succeeding frames. The second, a reliability criterion, together with acoustic backing-off, is used to control the choice between recycled and batch-calculated state likelihoods when they contradict each other in the combination and to prevent word accuracies from degrading. Large-vocabulary spontaneous speech recognition experiments using four different CPU machines under two environmental conditions showed that, compared with the baseline recognizer, recycling alone, and batch calculation alone, our combined acceleration technique further reduced both the acoustic likelihood calculation time and the total recognition time. We also performed detailed analyses to reveal each technique's acceleration and environmental-dependency mechanisms by classifying and counting the types of state likelihoods. The analysis results confirmed the effectiveness of the combined acceleration technique.
Yan DENG Wei-Qiang ZHANG Yan-Min QIAN Jia LIU
One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, various phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained for a known language and is assumed to extract complementary information for effective fusion. However, this approach is limited by the need for large amounts of training data with word- or phone-level transcriptions. Moreover, score fusion is not optimal, since fusion at the feature or model level retains more information than fusion at the score level. This paper presents a new strategy for building and fusing parallel phone recognizers (PPR): multiple acoustically diversified phone recognizers are trained and fused at the feature level. The phone recognizers are trained on the same speech data but with different acoustic features and model training techniques. For the acoustic features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are both employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For model training, we examine the use of the maximum likelihood and feature minimum phone error methods to train complementary acoustic models. In this study, we fuse the phonotactic features of the acoustically diversified phone recognizers using a simple linear fusion method to build the PPRVSM system, and a novel logistic regression optimized weighting (LROW) approach is introduced for fusion factor optimization. The experimental results show that fusion at the feature level is more effective than fusion at the score level, and the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. The best-performing system reported in this paper achieves equal error rates (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, under the closed-set test condition.
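The sketch below illustrates feature-level fusion under assumptions: phonotactic feature vectors from several recognizers are scaled by per-recognizer weights and concatenated, with the weights taken here from logistic-regression coefficients fitted on per-recognizer development scores as a simple stand-in for the paper's LROW optimization.

```python
# Hedged sketch of weighted feature-level fusion (not the LROW algorithm).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_fusion_weights(dev_scores, dev_labels):
    """dev_scores: (N, R) per-recognizer scores; dev_labels: binary labels.
    Returns R nonnegative weights summing to one."""
    lr = LogisticRegression().fit(dev_scores, dev_labels)
    w = np.clip(lr.coef_.ravel(), 0.0, None)  # keep weights nonnegative
    return w / (w.sum() + 1e-12)

def fuse_features(feature_vectors, weights):
    """Weighted concatenation of R phonotactic feature vectors."""
    return np.concatenate([w * f for w, f in zip(weights, feature_vectors)])
```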
This paper is concerned with the packet transmission scheduling problem for repeated all-to-all broadcasts in Underwater Sensor Networks (USN) in which there are n nodes within a transmission range. All-to-all communication is one of the densest communication patterns. It is assumed that every node has packets of the same size. Unlike terrestrial scenarios, the propagation time in underwater communications is not negligible. We define an all-to-all broadcast as one in which every node transmits packets to all the other nodes in the network except itself, so there are n(n - 1) packets in total to be transmitted for one all-to-all broadcast. Optimal transmission scheduling transmits all packets within the minimum time. In this paper, we propose an efficient packet transmission scheduling algorithm for underwater acoustic communications that exploits the long propagation delay.
Shoei SATO Takahiro OKU Shinichi HOMMA Akio KOBAYASHI Toru IMAI
We present a new discriminative method of acoustic model adaptation that deals with task-dependent speech variability. We focus on differences in expressions or speaking styles between tasks, and the objective of this method is to improve the recognition accuracy of indistinctly pronounced phrases that depend on the speaking style. The adaptation appends subword models for frequently observed variants of subwords in the task. To find the task-dependent variants, low-confidence words are statistically selected from the words occurring frequently in the task's adaptation data by using their word lattices. HMM parameters of the subword models associated with these words are discriminatively trained using linear transforms with a minimum phoneme error (MPE) criterion. For the MPE training, a subword accuracy measure discriminating between the variants and the originals is also investigated. In speech recognition experiments, the proposed adaptation with subword variants reduced the word error rate by 12.0% relative in a Japanese conversational broadcast task.
Seong-Jun HAHM Yuichi OHKAWA Masashi ITO Motoyuki SUZUKI Akinori ITO Shozo MAKINO
In this paper, we propose an acoustic model that is robust to multiple noise environments, as well as a method for adapting the acoustic model to a particular environment to improve it further. The model, called the "multi-mixture model," is based on a mixture of different HMMs, each of which is trained on speech under a different noise condition. Speech recognition experiments showed that the proposed model performs better than the conventional multi-condition model. The adaptation method is based on the aspect model, which is a "mixture-of-mixtures" model. To realize adaptation using an extremely small amount of adaptation data (i.e., a few seconds), we train a small number of mixture models, which can be interpreted as models for "clusters" of noise environments; the models are then mixed using weights determined from the adaptation data. The experimental results showed that the adaptation based on the aspect model improved the word accuracy in a heavy-noise environment and showed no performance deterioration under any noise condition, whereas the conventional methods either did not improve performance or improved it under some noise conditions while degrading it under others.
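A minimal sketch of the weight-adaptation step follows, assuming the per-frame likelihoods of the adaptation data under each cluster model are already available: the mixture weights over the cluster models are re-estimated with a few EM iterations. The aspect-model formulation itself is not reproduced.

```python
# Re-estimate mixture weights over fixed cluster models from a few seconds of
# adaptation data (illustrative EM sketch).
import numpy as np

def adapt_mixture_weights(frame_likelihoods, n_iter=20):
    """frame_likelihoods: (T, K) likelihoods of T frames under K cluster models."""
    T, K = frame_likelihoods.shape
    w = np.full(K, 1.0 / K)                   # start from uniform weights
    for _ in range(n_iter):
        joint = frame_likelihoods * w         # (T, K)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)  # E-step
        w = post.mean(axis=0)                 # M-step: new mixture weights
    return w
```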
In this paper, we propose a hybrid model adaptation approach in which pronunciation and acoustic models are adapted by incorporating the pronunciation and acoustic variabilities of non-native speech in order to improve the performance of non-native automatic speech recognition (ASR). Specifically, the proposed hybrid model adaptation can be performed at either the state-tying or the triphone-modeling level, depending on the level at which acoustic model adaptation is performed. In both methods, we first analyze the pronunciation variant rules of non-native speakers and then classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level hybrid method then adapts the pronunciation models and acoustic models by accommodating the pronunciation variants in the pronunciation dictionary and by clustering the states of the triphone acoustic models using the acoustic variants, respectively. The triphone-modeling level hybrid method initially adapts the pronunciation models in the same way as the state-tying level hybrid method; for the acoustic model adaptation, however, the triphone acoustic models are re-estimated based on the adapted pronunciation models and the states of the re-estimated triphone acoustic models are clustered using the acoustic variants. Korean-spoken English speech recognition experiments show that ASR systems employing the state-tying and triphone-modeling level adaptation methods reduce the average word error rate (WER) for non-native speech by 17.1% and 22.1% relative, respectively, compared to a baseline ASR system.
Statistical speech recognition using continuous-density hidden Markov models (CDHMMs) has yielded many practical applications. However, in general, mismatches between the training data and input data significantly degrade recognition accuracy. Various acoustic model adaptation techniques using a few input utterances have been employed to overcome this problem. In this article, we survey these adaptation techniques, including maximum a posteriori (MAP) estimation, maximum likelihood linear regression (MLLR), and eigenvoice. We also present a schematic view called the adaptation pyramid to illustrate how these methods relate to each other.
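As a hedged illustration of the MLLR idea mentioned in the survey, the sketch below estimates a single global affine transform of the Gaussian means from aligned adaptation frames and applies it to all means; identity covariances are assumed so that the estimation reduces to ordinary least squares, whereas full MLLR weights the statistics by occupancy and inverse covariances.

```python
# Simplified global MLLR-style mean adaptation (identity-covariance assumption).
import numpy as np

def estimate_global_mllr(obs, aligned_means):
    """obs, aligned_means: (T, d) adaptation frames and the means they align to."""
    ext = np.hstack([np.ones((aligned_means.shape[0], 1)), aligned_means])
    # Solve min_W sum_t ||o_t - W [1; mu_t]||^2 by ordinary least squares.
    W, *_ = np.linalg.lstsq(ext, obs, rcond=None)   # (d+1, d)
    return W.T                                      # (d, d+1)

def adapt_means(W, means):
    """Apply the affine transform to all Gaussian means (M, d)."""
    ext = np.hstack([np.ones((means.shape[0], 1)), means])
    return ext @ W.T
```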
Detection of transient signals is generally performed by examining the power and spectral variation of the received signal, but it becomes difficult when the background noise is strong. In this paper, we propose a robust transient detection algorithm using the EVRC noise suppression module: new parameters for transient detection are defined from the outputs of the module. Experiments with various types of underwater transients show that the proposed method outperforms the conventional energy-based method, improving the detection rate by 7% to 15% for various types of background noise.
In this paper, we propose a novel target acoustic signal detection approach based on non-negative matrix factorization (NMF). Target basis vectors are trained from the target signal database through NMF, and input vectors are projected onto the subspace spanned by these target basis vectors. By analyzing the distribution of the time-varying normalized projection error, an optimal threshold can be calculated to detect the target signal intervals over the entire input signal. Experimental results show that the proposed algorithm can detect the target signal successfully in various signal environments.
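The following sketch, under assumptions, shows the detection stage: input spectra are projected onto the subspace spanned by pre-trained target NMF bases by estimating nonnegative activations with the bases held fixed, and frames whose normalized projection error falls below a given threshold are flagged as target. The paper's procedure for choosing the optimal threshold is not reproduced.

```python
# Projection onto a pre-trained NMF basis and thresholding of the normalized
# projection error (illustrative detection sketch).
import numpy as np

def projection_error(V, W, n_iter=50, eps=1e-9):
    """V: (F, T) input spectra, W: (F, R) target bases; returns (T,) errors."""
    H = np.full((W.shape[1], V.shape[1]), 1e-3)
    for _ in range(n_iter):                   # multiplicative updates, W fixed
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    residual = V - W @ H
    return np.linalg.norm(residual, axis=0) / (np.linalg.norm(V, axis=0) + eps)

def detect_target(V, W, threshold):
    """Boolean mask of frames considered to contain the target signal."""
    return projection_error(V, W) < threshold
```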
Tomohiro NISHINO Ryo YAMAKI Akira HIROSE
Ultrasonic imaging is useful for seabed or lakebed observation. We can roughly estimate the sea depth from the echo generated at the boundary between the water and rocks or sand. However, the estimation quality is usually not sufficient to draw the seabed landscape, since the echo signal includes serious distortion caused by autointerference. This paper proposes a novel method to visualize the shape of distant boundaries, such as the seawater-rock/sand boundary, based on the complex-valued Markov random field (CMRF) model. Our method realizes adaptive compensation of the distortion without changing the global features in the measurement data, and obtains a higher-quality landscape with lower computational cost than conventional methods.
Minwoo LEE Yoonjae LEE Kihyeon KIM Hanseok KO
In this Letter, a residual acoustic echo suppression method is proposed to enhance the speech quality of hands-free communication in an automobile environment. In a hands-free communication environment, the echo signal is normally a human voice with harmonic characteristics. The proposed algorithm estimates the residual echo signal by emphasizing its harmonic components. The estimated residual echo is used to obtain the signal-to-interference ratio (SIR) at the acoustic echo canceller output, and an SIR-based Wiener post-filter is then constructed to reduce both the residual echo and the noise. The experimental results confirm that the proposed algorithm is superior to a conventional residual echo suppression algorithm in terms of echo return loss enhancement (ERLE) and segmental signal-to-noise ratio (SEGSNR).
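A minimal sketch of the SIR-based Wiener post-filter step, assuming an estimate of the residual-echo-plus-noise power at the echo canceller output is available: the per-frequency SIR gives a Wiener gain applied to the output spectrum. The harmonic-emphasis residual-echo estimator itself is not shown.

```python
# Per-frequency Wiener post-filter driven by an interference power estimate
# (illustrative sketch; the residual-echo estimator is assumed given).
import numpy as np

def wiener_postfilter(output_spec, interference_psd, eps=1e-12):
    """output_spec: complex STFT frame; interference_psd: same-size power estimate."""
    output_psd = np.abs(output_spec) ** 2
    sir = np.maximum(output_psd - interference_psd, 0.0) / (interference_psd + eps)
    gain = sir / (1.0 + sir)                  # Wiener gain per frequency bin
    return gain * output_spec
```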