The search functionality is under construction.

Author Search Result

[Author] Jielin PAN(7hit)

1-7hit
  • Multi-Task Learning in Deep Neural Networks for Mandarin-English Code-Mixing Speech Recognition

    Mengzhe CHEN  Jielin PAN  Qingwei ZHAO  Yonghong YAN  

     
    LETTER-Acoustic modeling

      Pubricized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2554-2557

    Multi-task learning in deep neural networks has been proven to be effective for acoustic modeling in speech recognition. In the paper, this technique is applied to Mandarin-English code-mixing recognition. For the primary task of the senone classification, three schemes of the auxiliary tasks are proposed to introduce the language information to networks and improve the prediction of language switching. On the real-world Mandarin-English test corpus in mobile voice search, the proposed schemes enhanced the recognition on both languages and reduced the relative overall error rates by 3.5%, 3.8% and 5.8% respectively.

  • Discriminative Approach to Build Hybrid Vocabulary for Conversational Telephone Speech Recognition of Agglutinative Languages

    Xin LI  Jielin PAN  Qingwei ZHAO  Yonghong YAN  

     
    LETTER-Speech and Hearing

      Vol:
    E96-D No:11
      Page(s):
    2478-2482

    Morphemes, which are obtained from morphological parsing, and statistical sub-words, which are derived from data-driven splitting, are commonly used as the recognition units for speech recognition of agglutinative languages. In this letter, we propose a discriminative approach to select the splitting result, which is more likely to improve the recognizer's performance, for each distinct word type. An objective function which involves the unigram language model (LM) probability and the count of misrecognized phones on the acoustic training data is defined and minimized. After determining the splitting result for each word in the text corpus, we select the frequent units to build a hybrid vocabulary including morphemes and statistical sub-words. Compared to a statistical sub-word based system, the hybrid system achieves 0.8% letter error rates (LERs) reduction on the test set.

  • Development of a Mandarin-English Bilingual Speech Recognition System for Real World Music Retrieval

    Qingqing ZHANG  Jielin PAN  Yang LIN  Jian SHAO  Yonghong YAN  

     
    PAPER-Acoustic Modeling

      Vol:
    E91-D No:3
      Page(s):
    514-521

    In recent decades, there has been a great deal of research into the problem of bilingual speech recognition - to develop a recognizer that can handle inter- and intra-sentential language switching between two languages. This paper presents our recent work on the development of a grammar-constrained, Mandarin-English bilingual Speech Recognition System (MESRS) for real world music retrieval. Two of the main difficult issues in handling the bilingual speech recognition systems for real world applications are tackled in this paper. One is to balance the performance and the complexity of the bilingual speech recognition system; the other is to effectively deal with the matrix language accents in embedded language. In order to process the intra-sentential language switching and reduce the amount of data required to robustly estimate statistical models, a compact single set of bilingual acoustic models derived by phone set merging and clustering is developed instead of using two separate monolingual models for each language. In our study, a novel Two-pass phone clustering method based on Confusion Matrix (TCM) is presented and compared with the log-likelihood measure method. Experiments testify that TCM can achieve better performance. Since potential system users' native language is Mandarin which is regarded as a matrix language in our application, their pronunciations of English as the embedded language usually contain Mandarin accents. In order to deal with the matrix language accents in embedded language, different non-native adaptation approaches are investigated. Experiments show that model retraining method outperforms the other common adaptation methods such as Maximum A Posteriori (MAP). With the effective incorporation of approaches on phone clustering and non-native adaptation, the Phrase Error Rate (PER) of MESRS for English utterances was reduced by 24.47% relatively compared to the baseline monolingual English system while the PER on Mandarin utterances was comparable to that of the baseline monolingual Mandarin system. The performance for bilingual utterances achieved 22.37% relative PER reduction.

  • Noise Robust Feature Scheme for Automatic Speech Recognition Based on Auditory Perceptual Mechanisms

    Shang CAI  Yeming XIAO  Jielin PAN  Qingwei ZHAO  Yonghong YAN  

     
    PAPER-Speech and Hearing

      Vol:
    E95-D No:6
      Page(s):
    1610-1618

    Mel Frequency Cepstral Coefficients (MFCC) are the most popular acoustic features used in automatic speech recognition (ASR), mainly because the coefficients capture the most useful information of the speech and fit well with the assumptions used in hidden Markov models. As is well known, MFCCs already employ several principles which have known counterparts in the peripheral properties of human hearing: decoupling across frequency, mel-warping of the frequency axis, log-compression of energy, etc. It is natural to introduce more mechanisms in the auditory periphery to improve the noise robustness of MFCC. In this paper, a k-nearest neighbors based frequency masking filter is proposed to reduce the audibility of spectra valleys which are sensitive to noise. Besides, Moore and Glasberg's critical band equivalent rectangular bandwidth (ERB) expression is utilized to determine the filter bandwidth. Furthermore, a new bandpass infinite impulse response (IIR) filter is proposed to imitate the temporal masking phenomenon of the human auditory system. These three auditory perceptual mechanisms are combined with the standard MFCC algorithm in order to investigate their effects on ASR performance, and a revised MFCC extraction scheme is presented. Recognition performances with the standard MFCC, RASTA perceptual linear prediction (RASTA-PLP) and the proposed feature extraction scheme are evaluated on a medium-vocabulary isolated-word recognition task and a more complex large vocabulary continuous speech recognition (LVCSR) task. Experimental results show that consistent robustness against background noise is achieved on these two tasks, and the proposed method outperforms both the standard MFCC and RASTA-PLP.

  • Discriminative Pronunciation Modeling Using the MPE Criterion

    Meixu SONG  Jielin PAN  Qingwei ZHAO  Yonghong YAN  

     
    LETTER-Speech and Hearing

      Pubricized:
    2014/12/02
      Vol:
    E98-D No:3
      Page(s):
    717-720

    Introducing pronunciation models into decoding has been proven to be benefit to LVCSR. In this paper, a discriminative pronunciation modeling method is presented, within the framework of the Minimum Phone Error (MPE) training for HMM/GMM. In order to bring the pronunciation models into the MPE training, the auxiliary function is rewritten at word level and decomposes into two parts. One is for co-training the acoustic models, and the other is for discriminatively training the pronunciation models. On Mandarin conversational telephone speech recognition task, compared to the baseline using a canonical lexicon, the discriminative pronunciation models reduced the absolute Character Error Rate (CER) by 0.7% on LDC test set, and with the acoustic model co-training, 0.8% additional CER decrease had been achieved.

  • Improved End-to-End Speech Recognition Using Adaptive Per-Dimensional Learning Rate Methods

    Xuyang WANG  Pengyuan ZHANG  Qingwei ZHAO  Jielin PAN  Yonghong YAN  

     
    LETTER-Acoustic modeling

      Pubricized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2550-2553

    The introduction of deep neural networks (DNNs) leads to a significant improvement of the automatic speech recognition (ASR) performance. However, the whole ASR system remains sophisticated due to the dependent on the hidden Markov model (HMM). Recently, a new end-to-end ASR framework, which utilizes recurrent neural networks (RNNs) to directly model context-independent targets with connectionist temporal classification (CTC) objective function, is proposed and achieves comparable results with the hybrid HMM/DNN system. In this paper, we investigate per-dimensional learning rate methods, ADAGRAD and ADADELTA included, to improve the recognition of the end-to-end system, based on the fact that the blank symbol used in CTC technique dominates the output and these methods give frequent features small learning rates. Experiment results show that more than 4% relative reduction of word error rate (WER) as well as 5% absolute improvement of label accuracy on the training set are achieved when using ADADELTA, and fewer epochs of training are needed.

  • Short Text Classification Based on Distributional Representations of Words

    Chenglong MA  Qingwei ZHAO  Jielin PAN  Yonghong YAN  

     
    LETTER-Text classification

      Pubricized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2562-2565

    Short texts usually encounter the problem of data sparseness, as they do not provide sufficient term co-occurrence information. In this paper, we show how to mitigate the problem in short text classification through word embeddings. We assume that a short text document is a specific sample of one distribution in a Gaussian-Bayesian framework. Furthermore, a fast clustering algorithm is utilized to expand and enrich the context of short text in embedding space. This approach is compared with those based on the classical bag-of-words approaches and neural network based methods. Experimental results validate the effectiveness of the proposed method.