Chuan CAO Ming LI Xiao WU Hongbin SUO Jian LIU Yonghong YAN
In this letter, we present an automatic approach of objective singing performance evaluation for untrained singers by relating acoustic measurements to perceptual ratings of singing voice quality. Several acoustic parameters and their combination features are investigated to find objective correspondences of the perceptual evaluation criteria. Experimental results show relative strong correlation between perceptual ratings and the combined features and the reliability of the proposed evaluation system is tested to be comparable to human judges.
Junbo ZHANG Fuping PAN Bin DONG Qingwei ZHAO Yonghong YAN
In this paper, we presented a novel method for automatic pronunciation quality assessment. Unlike the popular “Goodness of Pronunciation” (GOP) method, this method does not map the decoding confidence into pronunciation quality score, but differentiates the different pronunciation quality utterances directly. In this method, the student's utterance need to be decoded for two times. The first-time decoding was for getting the time points of each phone of the utterance by a forced alignment using a conventional trained acoustic model (AM). The second-time decoding was for differentiating the pronunciation quality for each triphone using a specially trained AM, where the triphones in different pronunciation qualities were trained as different units, and the model was trained in discriminative method to ensure the model has the best discrimination among the triphones whose names were same but pronunciation quality scores were different. The decoding network in the second-time decoding included different pronunciation quality triphones, so the phone-level scores can be obtained from the decoding result directly. The phone-level scores were combined into the sentence-level scores using maximum entropy criterion. The experimental results shows that the scoring performance was increased significantly compared to the GOP method, especially in sentence-level.
Chunyan LIANG Lin YANG Qingwei ZHAO Yonghong YAN
In this letter, we adopt a new factor analysis of neighborhood-preserving embedding (NPE) for speaker verification. NPE aims at preserving the local neighborhood structure on the data and defines a low-dimensional speaker space called neighborhood-preserving embedding space. We compare the proposed method with the state-of-the-art total variability approach on the telephone-telephone core condition of the NIST 2008 Speaker Recognition Evaluation (SRE) dataset. The experimental results indicate that the proposed NPE method outperforms the total variability approach, providing up to 24% relative improvement.
Hai YANG Yunfei XU Qinwei ZHAO Ruohua ZHOU Yonghong YAN
Sparse representation has been studied within the field of signal processing as a means of providing a compact form of signal representation. This paper introduces a sparse representation based framework named Sparse Probabilistic Linear Discriminant Analysis in speaker recognition. In this latent variable model, probabilistic linear discriminant analysis is modified to obtain an algorithm for learning overcomplete sparse representations by replacing the Gaussian prior on the factors with Laplace prior that encourages sparseness. For a given speaker signal, the dictionary obtained from this model has good representational power while supporting optimal discrimination of the classes. An expectation-maximization algorithm is derived to train the model with a variational approximation to a range of heavy-tailed distributions whose limit is the Laplace. The variational approximation is also used to compute the likelihood ratio score of all trials of speakers. This approach performed well on the core-extended conditions of the NIST 2010 Speaker Recognition Evaluation, and is competitive compared to the Gaussian Probabilistic Linear Discriminant Analysis, in terms of normalized Decision Cost Function and Equal Error Rate.
Shang CAI Yeming XIAO Jielin PAN Qingwei ZHAO Yonghong YAN
Mel Frequency Cepstral Coefficients (MFCC) are the most popular acoustic features used in automatic speech recognition (ASR), mainly because the coefficients capture the most useful information of the speech and fit well with the assumptions used in hidden Markov models. As is well known, MFCCs already employ several principles which have known counterparts in the peripheral properties of human hearing: decoupling across frequency, mel-warping of the frequency axis, log-compression of energy, etc. It is natural to introduce more mechanisms in the auditory periphery to improve the noise robustness of MFCC. In this paper, a k-nearest neighbors based frequency masking filter is proposed to reduce the audibility of spectra valleys which are sensitive to noise. Besides, Moore and Glasberg's critical band equivalent rectangular bandwidth (ERB) expression is utilized to determine the filter bandwidth. Furthermore, a new bandpass infinite impulse response (IIR) filter is proposed to imitate the temporal masking phenomenon of the human auditory system. These three auditory perceptual mechanisms are combined with the standard MFCC algorithm in order to investigate their effects on ASR performance, and a revised MFCC extraction scheme is presented. Recognition performances with the standard MFCC, RASTA perceptual linear prediction (RASTA-PLP) and the proposed feature extraction scheme are evaluated on a medium-vocabulary isolated-word recognition task and a more complex large vocabulary continuous speech recognition (LVCSR) task. Experimental results show that consistent robustness against background noise is achieved on these two tasks, and the proposed method outperforms both the standard MFCC and RASTA-PLP.
Hang REN Qingwei ZHAO Yonghong YAN
The optimization of spoken dialog management policies is a non-trivial task due to the erroneous inputs from speech recognition and language understanding modules. The dialog manager needs to ground uncertain semantic information at times to fully understand the need of human users and successfully complete the required dialog tasks. Approaches based on reinforcement learning are currently mainstream in academia and have been proved to be effective, especially when operating in noisy environments. However, in reinforcement learning the dialog strategy is often represented by complex numeric model and thus is incomprehensible to humans. The trained policies are very difficult for dialog system designers to verify or modify, which largely limits the deployment for commercial applications. In this paper we propose a novel framework for optimizing dialog policies specified in human-readable domain language using genetic algorithm. We present learning algorithms using user simulator and real human-machine dialog corpora. Empirical experimental results show that the proposed approach can achieve competitive performance on par with some state-of-the-art reinforcement learning algorithms, while maintaining a comprehensible policy structure.
Meixu SONG Jielin PAN Qingwei ZHAO Yonghong YAN
Introducing pronunciation models into decoding has been proven to be benefit to LVCSR. In this paper, a discriminative pronunciation modeling method is presented, within the framework of the Minimum Phone Error (MPE) training for HMM/GMM. In order to bring the pronunciation models into the MPE training, the auxiliary function is rewritten at word level and decomposes into two parts. One is for co-training the acoustic models, and the other is for discriminatively training the pronunciation models. On Mandarin conversational telephone speech recognition task, compared to the baseline using a canonical lexicon, the discriminative pronunciation models reduced the absolute Character Error Rate (CER) by 0.7% on LDC test set, and with the acoustic model co-training, 0.8% additional CER decrease had been achieved.
Heng ZHANG Qiang FU Yonghong YAN
In this letter, a two channel frequency domain speech enhancement algorithm is proposed. The algorithm is designed to achieve better overall performance with relatively small array size. An improved version of adaptive null-forming is used, in which noise cancelation is implemented in auditory subbands. And an OM-LSA based postfiltering stage further purifies the output. The algorithm also features interaction between the array processing and the postfilter to make the filter adaptation more robust. This approach achieves considerable improvement on signal-to-noise ratio (SNR) and subjective quality of the desired speech. Experiments confirm the effectiveness of the proposed system.
Fengpei GE Changliang LIU Jian SHAO Fuping PAN Bin DONG Yonghong YAN
In this paper we present our investigation into improving the performance of our computer-assisted language learning (CALL) system through exploiting the acoustic model and features within the speech recognition framework. First, to alleviate channel distortion, speaker-dependent cepstrum mean normalization (CMN) is adopted and the average correlation coefficient (average CC) between machine and expert scores is improved from 78.00% to 84.14%. Second, heteroscedastic linear discriminant analysis (HLDA) is adopted to enhance the discriminability of the acoustic model, which successfully increases the average CC from 84.14% to 84.62%. Additionally, HLDA causes the scoring accuracy to be more stable at various pronunciation proficiency levels, and thus leads to an increase in the speaker correct-rank rate from 85.59% to 90.99%. Finally, we use maximum a posteriori (MAP) estimation to tune the acoustic model to fit strongly accented test speech. As a result, the average CC is improved from 84.62% to 86.57%. These three novel techniques improve the accuracy of evaluating pronunciation quality.
Yanling LI Qingwei ZHAO Yonghong YAN
Semantic concept in an utterance is obtained by a fuzzy matching methods to solve problems such as words' variation induced by automatic speech recognition (ASR), or missing field of key information by users in the process of spoken language understanding (SLU). A two-stage method is proposed: first, we adopt conditional random field (CRF) for building probabilistic models to segment and label entity names from an input sentence. Second, fuzzy matching based on similarity function is conducted between the named entities labeled by a CRF model and the reference characters of a dictionary. The experiments compare the performances in terms of accuracy and processing speed. Dice similarity and cosine similarity based on TF score can achieve better accuracy performance among four similarity measures, which equal to and greater than 93% in F1-measure. Especially the latter one improved by 8.8% and 9% respectively compared to q-gram and improved edit-distance, which are two conventional methods for string fuzzy matching.
Jian SHAO Ta LI Qingqing ZHANG Qingwei ZHAO Yonghong YAN
This paper presents our developed decoder which adopts the idea of statically optimizing part of the knowledge sources while handling the others dynamically. The lexicon, phonetic contexts and acoustic model are statically integrated to form a memory-efficient state network, while the language model (LM) is dynamically incorporated on the fly by means of extended tokens. The novelties of our approach for constructing the state network are (1) introducing two layers of dummy nodes to cluster the cross-word (CW) context dependent fan-in and fan-out triphones, (2) introducing a so-called "WI layer" to store the word identities and putting the nodes of this layer in the non-shared mid-part of the network, (3) optimizing the network at state level by a sufficient forward and backward node-merge process. The state network is organized as a multi-layer structure for distinct token propagation at each layer. By exploiting the characteristics of the state network, several techniques including LM look-ahead, LM cache and beam pruning are specially designed for search efficiency. Especially in beam pruning, a layer-dependent pruning method is proposed to further reduce the search space. The layer-dependent pruning takes account of the neck-like characteristics of WI layer and the reduced variety of word endings, which enables tighter beam without introducing much search errors. In addition, other techniques including LM compression, lattice-based bookkeeping and lattice garbage collection are also employed to reduce the memory requirements. Experiments are carried out on a Mandarin spontaneous speech recognition task where the decoder involves a trigram LM and CW triphone models. A comparison with HDecode of HTK toolkits shows that, within 1% performance deviation, our decoder can run 5 times faster with half of the memory footprint.
Hongbin SUO Ming LI Ping LU Yonghong YAN
Robust automatic language identification (LID) is the task of identifying the language from a short utterance spoken by an unknown speaker. The mainstream approaches include parallel phone recognition language modeling (PPRLM), support vector machine (SVM) and the general Gaussian mixture models (GMMs). These systems map the cepstral features of spoken utterances into high level scores by classifiers. In this paper, in order to increase the dimension of the score vector and alleviate the inter-speaker variability within the same language, multiple data groups based on supervised speaker clustering are employed to generate the discriminative language characterization score vectors (DLCSV). The back-end SVM classifiers are used to model the probability distribution of each target language in the DLCSV space. Finally, the output scores of back-end classifiers are calibrated by a pair-wise posterior probability estimation (PPPE) algorithm. The proposed language identification frameworks are evaluated on 2003 NIST Language Recognition Evaluation (LRE) databases and the experiments show that the system described in this paper produces comparable results to the existing systems. Especially, the SVM framework achieves an equal error rate (EER) of 4.0% in the 30-second task and outperforms the state-of-art systems by more than 30% relative error reduction. Besides, the performances of proposed PPRLM and GMMs algorithms achieve an EER of 5.1% and 5.0% respectively.
Risheng XIA Junfeng LI Andrea PRIMAVERA Stefania CECCHI Yôiti SUZUKI Yonghong YAN
Real-time high-quality reverberation simulation plays an important role in many modern applications, such as computer games and virtual reality. Traditional physically and perceptually-based reverberation simulation techniques, however, suffer from the high computational cost and lack of the specific characteristics of the enclosure. In this paper, a hybrid reverberation simulation approach is proposed in which early reflections are reproduced by convolving with the impulse response modeled by the image-source method and late reverberation is reproduced by the feedback delay network. A parametric predictor of energy decay relief is presented for the modeled early reflections and then exploited to compute the parameters of the feedback delay network. This ensures the smooth transition in the time-frequency domain from early to late reflections. Numerical simulation and listening tests validate the effectiveness of this proposed hybrid reverberation simulation approach.
Dongni HU Chengxin CHEN Pengyuan ZHANG Junfeng LI Yonghong YAN Qingwei ZHAO
Recently, automated recognition and analysis of human emotion has attracted increasing attention from multidisciplinary communities. However, it is challenging to utilize the emotional information simultaneously from multiple modalities. Previous studies have explored different fusion methods, but they mainly focused on either inter-modality interaction or intra-modality interaction. In this letter, we propose a novel two-stage fusion strategy named modality attention flow (MAF) to model the intra- and inter-modality interactions simultaneously in a unified end-to-end framework. Experimental results show that the proposed approach outperforms the widely used late fusion methods, and achieves even better performance when the number of stacked MAF blocks increases.
Xuyang WANG Pengyuan ZHANG Qingwei ZHAO Jielin PAN Yonghong YAN
The introduction of deep neural networks (DNNs) leads to a significant improvement of the automatic speech recognition (ASR) performance. However, the whole ASR system remains sophisticated due to the dependent on the hidden Markov model (HMM). Recently, a new end-to-end ASR framework, which utilizes recurrent neural networks (RNNs) to directly model context-independent targets with connectionist temporal classification (CTC) objective function, is proposed and achieves comparable results with the hybrid HMM/DNN system. In this paper, we investigate per-dimensional learning rate methods, ADAGRAD and ADADELTA included, to improve the recognition of the end-to-end system, based on the fact that the blank symbol used in CTC technique dominates the output and these methods give frequent features small learning rates. Experiment results show that more than 4% relative reduction of word error rate (WER) as well as 5% absolute improvement of label accuracy on the training set are achieved when using ADADELTA, and fewer epochs of training are needed.
Anhao XING Qingwei ZHAO Yonghong YAN
This paper proposes a new quantization framework on activation function of deep neural networks (DNN). We implement fixed-point DNN by quantizing the activations into powers-of-two integers. The costly multiplication operations in using DNN can be replaced with low-cost bit-shifts to massively save computations. Thus, applying DNN-based speech recognition on embedded systems becomes much easier. Experiments show that the proposed method leads to no performance degradation.
Chenglong MA Qingwei ZHAO Jielin PAN Yonghong YAN
Short texts usually encounter the problem of data sparseness, as they do not provide sufficient term co-occurrence information. In this paper, we show how to mitigate the problem in short text classification through word embeddings. We assume that a short text document is a specific sample of one distribution in a Gaussian-Bayesian framework. Furthermore, a fast clustering algorithm is utilized to expand and enrich the context of short text in embedding space. This approach is compared with those based on the classical bag-of-words approaches and neural network based methods. Experimental results validate the effectiveness of the proposed method.
Xiao WU Ming LI Hongbin SUO Yonghong YAN
In this letter we focus on the task of selecting the melody track from a polyphonic MIDI file. Based on the intuition that music and language are similar in many aspects, we solve the selection problem by introducing an n-gram language model to learn the melody co-occurrence patterns in a statistical manner and determine the melodic degree of a given MIDI track. Furthermore, we propose the idea of using background model and posterior probability criteria to make modeling more discriminative. In the evaluation, the achieved 81.6% correct rate indicates the feasibility of our approach.
Yanqing SUN Yu ZHOU Qingwei ZHAO Pengyuan ZHANG Fuping PAN Yonghong YAN
In this paper, the robustness of the posterior-based confidence measures is improved by utilizing entropy information, which is calculated for speech-unit-level posteriors using only the best recognition result, without requiring a larger computational load than conventional methods. Using different normalization methods, two posterior-based entropy confidence measures are proposed. Practical details are discussed for two typical levels of hidden Markov model (HMM)-based posterior confidence measures, and both levels are compared in terms of their performances. Experiments show that the entropy information results in significant improvements in the posterior-based confidence measures. The absolute improvements of the out-of-vocabulary (OOV) rejection rate are more than 20% for both the phoneme-level confidence measures and the state-level confidence measures for our embedded test sets, without a significant decline of the in-vocabulary accuracy.
Yaohui QI Fuping PAN Fengpei GE Qingwei ZHAO Yonghong YAN
A smoothing method for minimum phone error linear regression (MPELR) is proposed in this paper. We show that the objective function for minimum phone error (MPE) can be combined with a prior mean distribution. When the prior mean distribution is based on maximum likelihood (ML) estimates, the proposed method is the same as the previous smoothing technique for MPELR. Instead of ML estimates, maximum a posteriori (MAP) parameter estimate is used to define the mode of prior mean distribution to improve the performance of MPELR. Experiments on a large vocabulary speech recognition task show that the proposed method can obtain 8.4% relative reduction in word error rate when the amount of data is limited, while retaining the same asymptotic performance as conventional MPELR. When compared with discriminative maximum a posteriori linear regression (DMAPLR), the proposed method shows improvement except for the case of limited adaptation data for supervised adaptation.