1-7hit |
Xiang ZHANG Hongbin SUO Qingwei ZHAO Yonghong YAN
In this letter, we propose a new approach to SVM based speaker recognition, which utilizes a kind of novel phonotactic information as the feature for SVM modeling. Gaussian mixture models (GMMs) have been proven extremely successful for text-independent speaker recognition. The GMM universal background model (UBM) is a speaker-independent model, each component of which can be considered as modeling some underlying phonetic sound classes. We assume that the utterances from different speakers should get different average posterior probabilities on the same Gaussian component of the UBM, and the supervector composed of the average posterior probabilities on all components of the UBM for each utterance should be discriminative. We use these supervectors as the features for SVM based speaker recognition. Experiment results on a NIST SRE 2006 task show that the proposed approach demonstrates comparable performance with the commonly used systems. Fusion results are also presented.
Chuan CAO Ming LI Xiao WU Hongbin SUO Jian LIU Yonghong YAN
In this letter, we present an automatic approach of objective singing performance evaluation for untrained singers by relating acoustic measurements to perceptual ratings of singing voice quality. Several acoustic parameters and their combination features are investigated to find objective correspondences of the perceptual evaluation criteria. Experimental results show relative strong correlation between perceptual ratings and the combined features and the reliability of the proposed evaluation system is tested to be comparable to human judges.
Hongbin SUO Ming LI Ping LU Yonghong YAN
Robust automatic language identification (LID) is the task of identifying the language from a short utterance spoken by an unknown speaker. The mainstream approaches include parallel phone recognition language modeling (PPRLM), support vector machine (SVM) and the general Gaussian mixture models (GMMs). These systems map the cepstral features of spoken utterances into high level scores by classifiers. In this paper, in order to increase the dimension of the score vector and alleviate the inter-speaker variability within the same language, multiple data groups based on supervised speaker clustering are employed to generate the discriminative language characterization score vectors (DLCSV). The back-end SVM classifiers are used to model the probability distribution of each target language in the DLCSV space. Finally, the output scores of back-end classifiers are calibrated by a pair-wise posterior probability estimation (PPPE) algorithm. The proposed language identification frameworks are evaluated on 2003 NIST Language Recognition Evaluation (LRE) databases and the experiments show that the system described in this paper produces comparable results to the existing systems. Especially, the SVM framework achieves an equal error rate (EER) of 4.0% in the 30-second task and outperforms the state-of-art systems by more than 30% relative error reduction. Besides, the performances of proposed PPRLM and GMMs algorithms achieve an EER of 5.1% and 5.0% respectively.
Xiao WU Ming LI Hongbin SUO Yonghong YAN
In this letter we focus on the task of selecting the melody track from a polyphonic MIDI file. Based on the intuition that music and language are similar in many aspects, we solve the selection problem by introducing an n-gram language model to learn the melody co-occurrence patterns in a statistical manner and determine the melodic degree of a given MIDI track. Furthermore, we propose the idea of using background model and posterior probability criteria to make modeling more discriminative. In the evaluation, the achieved 81.6% correct rate indicates the feasibility of our approach.
Xiang ZHANG Ping LU Hongbin SUO Qingwei ZHAO Yonghong YAN
In this letter, a recently proposed clustering algorithm named affinity propagation is introduced for the task of speaker clustering. This novel algorithm exhibits fast execution speed and finds clusters with low error. However, experiments show that the speaker purity of affinity propagation is not satisfying. Thus, we propose a hybrid approach that combines affinity propagation with agglomerative hierarchical clustering to improve the clustering performance. Experiments show that compared with traditional agglomerative hierarchical clustering, the hybrid method achieves better performance on the test corpora.
Changliang LIU Fuping PAN Fengpei GE Bin DONG Hongbin SUO Yonghong YAN
This paper describes a reading miscue detection system based on the conventional Large Vocabulary Continuous Speech Recognition (LVCSR) framework [1]. In order to incorporate the knowledge of reference (what the reader ought to read) and some error patterns into the decoding process, two methods are proposed: Dynamic Multiple Pronunciation Incorporation (DMPI) and Dynamic Interpolation of Language Model (DILM). DMPI dynamically adds some pronunciation variations into the search space to predict reading substitutions and insertions. To resolve the conflict between the coverage of error predications and the perplexity of the search space, only the pronunciation variants related to the reference are added. DILM dynamically interpolates the general language model based on the analysis of the reference and so keeps the active paths of decoding relatively near the reference. It makes the recognition more accurate, which further improves the detection performance. At the final stage of detection, an improved dynamic program (DP) is used to align the confusion network (CN) from speech recognition and the reference to generate the detecting result. The experimental results show that the proposed two methods can decrease the Equal Error Rate (EER) by 14% relatively, from 46.4% to 39.8%.
Xiang XIAO Xiang ZHANG Haipeng WANG Hongbin SUO Qingwei ZHAO Yonghong YAN
The GMM-UBM framework has been proved to be one of the most effective approaches to the automatic speaker verification (ASV) task in recent years. In this letter, we first propose an approximate decision function of traditional GMM-UBM, from which it is shown that the contribution to classification of each Gaussian component is equally important. However, research in speaker perception shows that a different speech sound unit defined by Gaussian component makes a different contribution to speaker verification. This motivates us to emphasize some sound units which have discriminability between speakers while de-emphasize the speech sound units which contain little information for speaker verification. Experiments on 2006 NIST SRE core task show that the proposed approach outperforms traditional GMM-UBM approach in classification accuracy.