
Author Search Result

[Author] Haizhou LI (5 hits)

  • FOREWORD Open Access

    Haizhou LI  

     
    FOREWORD

    Vol: E95-D No:5  Page(s): 1181-1181
  • Error Corrective Fusion of Classifier Scores for Spoken Language Recognition

    Omid DEHZANGI  Bin MA  Eng Siong CHNG  Haizhou LI  

     
    PAPER-Speech and Hearing

    Vol: E94-D No:12  Page(s): 2503-2512

    This paper investigates a new method for fusing the scores generated by multiple classification sub-systems in order to further reduce the classification error rate in Spoken Language Recognition (SLR). In recent studies, a variety of effective classification algorithms have been developed for SLR. Hence, it has been common practice in the National Institute of Standards and Technology (NIST) Language Recognition Evaluations (LREs) to fuse the results of several classification sub-systems to boost the performance of SLR systems. In this work, we introduce a discriminative performance measure to optimize the fusion of 7 language classifiers developed as IIR's submission to the 2009 NIST LRE. We present an Error Corrective Fusion (ECF) method in which we iteratively learn the fusion weights so as to minimize the error rate of the fusion system. Experiments conducted on the 2009 NIST LRE corpus demonstrate a significant improvement over the individual sub-systems. A comparison study is also conducted to show the effectiveness of the ECF method.
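
    As an illustration of the score-fusion idea, here is a minimal Python sketch (not the paper's exact ECF algorithm): fusion weights over the classifiers are tuned iteratively, one coordinate at a time over a small grid, to minimize the classification error rate on held-out scores. All array shapes and the random data below are assumptions for demonstration.

        import numpy as np

        def error_rate(weights, scores, labels):
            """Error of weighted-sum fusion. scores: (trials, classifiers, languages)."""
            fused = np.einsum("c,tcl->tl", weights, scores)
            return np.mean(fused.argmax(axis=1) != labels)

        def fit_fusion_weights(scores, labels, grid=np.linspace(0.0, 2.0, 21), n_iter=10):
            # Coordinate-wise search: re-tune each classifier's weight on a grid,
            # keeping the value that yields the lowest fusion error rate.
            w = np.ones(scores.shape[1])
            for _ in range(n_iter):
                for c in range(scores.shape[1]):
                    errs = [error_rate(np.where(np.arange(w.size) == c, g, w),
                                       scores, labels) for g in grid]
                    w[c] = grid[int(np.argmin(errs))]
            return w

        # Toy usage: random scores stand in for 7 classifiers over 6 target languages.
        rng = np.random.default_rng(0)
        scores = rng.normal(size=(500, 7, 6))
        labels = rng.integers(0, 6, size=500)
        w = fit_fusion_weights(scores, labels)
        print("weights:", w, "error:", error_rate(w, scores, labels))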

  • Broadcast News Story Segmentation Using Conditional Random Fields and Multimodal Features

    Xiaoxuan WANG  Lei XIE  Mimi LU  Bin MA  Eng Siong CHNG  Haizhou LI  

     
    PAPER-Speech Processing

    Vol: E95-D No:5  Page(s): 1206-1215

    In this paper, we propose the integration of multimodal features using conditional random fields (CRFs) for the segmentation of broadcast news stories. We study story boundary cues from the lexical, audio and video modalities: lexical features consist of lexical similarity, chain strength and overall cohesiveness; acoustic features involve pause duration, pitch, speaker change and audio event type; and visual features contain shot boundaries, anchor faces and news title captions. These features are extracted at a sequence of boundary candidate positions in the broadcast news. A linear-chain CRF is used to tag each candidate as boundary or non-boundary based on the multimodal features. Important inter-label relations and contextual feature information are effectively captured by the sequential learning framework of CRFs. Story segmentation experiments show that the CRF approach outperforms other popular classifiers, including decision trees (DTs), Bayesian networks (BNs), naive Bayes classifiers (NBs), multilayer perceptrons (MLPs), support vector machines (SVMs) and maximum entropy (ME) classifiers.
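
    As a rough illustration of this tagging setup, the sketch below uses the third-party sklearn-crfsuite package to label a sequence of boundary candidates as boundary/non-boundary with a linear-chain CRF. The feature names and the synthetic shows are stand-ins for the paper's lexical, acoustic and visual cues, not its actual feature extraction.

        import random
        import sklearn_crfsuite

        def candidate_features(c):
            # One CRF feature dict per boundary-candidate position; the names
            # are illustrative stand-ins for the paper's multimodal cues.
            return {"lexical_sim": c["lexical_sim"], "pause": c["pause"],
                    "speaker_change": c["speaker_change"],
                    "shot_boundary": c["shot_boundary"]}

        def toy_show(n=30):
            # Synthetic news show: boundaries get lower lexical similarity,
            # longer pauses, and more speaker/shot changes.
            show = []
            for _ in range(n):
                b = random.random() < 0.2
                show.append({"is_boundary": b,
                             "lexical_sim": random.gauss(0.3 if b else 0.7, 0.1),
                             "pause": random.gauss(1.5 if b else 0.2, 0.3),
                             "speaker_change": float(b or random.random() < 0.1),
                             "shot_boundary": float(b or random.random() < 0.2)})
            return show

        shows = [toy_show() for _ in range(20)]
        X = [[candidate_features(c) for c in s] for s in shows]
        y = [["B" if c["is_boundary"] else "N" for c in s] for s in shows]

        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
        crf.fit(X, y)
        print(crf.predict(X[:1]))  # boundary ("B") / non-boundary ("N") tag per candidate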

  • Cross-Lingual Phone Mapping for Large Vocabulary Speech Recognition of Under-Resourced Languages

    Van Hai DO  Xiong XIAO  Eng Siong CHNG  Haizhou LI  

     
    PAPER-Speech and Hearing

    Vol: E97-D No:2  Page(s): 285-295

    This paper presents a novel acoustic modeling technique for large vocabulary automatic speech recognition of under-resourced languages that leverages well-trained acoustic models of other languages (called source languages). The idea is to use a source language acoustic model to score the acoustic features of the target language, and then map these scores to the posteriors of the target phones using a classifier. The target phone posteriors are then used for decoding in the usual hybrid acoustic modeling fashion. The motivation for this strategy is that human languages usually share similar phone sets, so it may be easier to predict the target phone posteriors from the scores generated by source language acoustic models than to train an under-resourced language acoustic model from scratch. The proposed method is evaluated on the Aurora-4 task with less than 1 hour of training data. Two types of source language acoustic models are considered, i.e., hybrid HMM/MLP and conventional HMM/GMM models. In addition, we also use triphone tied states in the mapping. Our experimental results show that by leveraging well-trained Malay and Hungarian acoustic models, we achieve a 9.0% word error rate (WER) given 55 minutes of English training data. This is close to the 7.9% WER obtained by using the full 15 hours of training data and much better than the 14.4% WER obtained by conventional acoustic modeling techniques with the same 55 minutes of training data.
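
    A minimal sketch of the mapping step, under the assumption that per-frame source-model scores and frame-level target phone labels are already available (here simulated with random data); a small scikit-learn MLP plays the role of the score-to-posterior classifier, and decoding is left out.

        import numpy as np
        from sklearn.neural_network import MLPClassifier

        rng = np.random.default_rng(0)
        n_frames, n_src_scores, n_tgt_phones = 2000, 120, 40

        # Stand-ins: per-frame scores from a well-trained source-language
        # acoustic model, and aligned target-language phone labels.
        src_scores = rng.normal(size=(n_frames, n_src_scores))
        tgt_labels = rng.integers(0, n_tgt_phones, size=n_frames)

        # Learn the mapping from source-model scores to target phones.
        mapper = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
        mapper.fit(src_scores, tgt_labels)

        # Target phone posteriors per frame; in the hybrid setup these
        # would be fed to an HMM decoder.
        posteriors = mapper.predict_proba(src_scores[:10])
        print(posteriors.shape)  # (10, n_tgt_phones) if all phones occur in training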

  • Selective Gammatone Envelope Feature for Robust Sound Event Recognition

    Yi Ren LENG  Huy Dat TRAN  Norihide KITAOKA  Haizhou LI  

     
    PAPER-Audio Processing

    Vol: E95-D No:5  Page(s): 1229-1237

    Conventional features for Automatic Speech Recognition and Sound Event Recognition, such as Mel-Frequency Cepstral Coefficients (MFCCs), have been shown to perform poorly in noisy conditions. We introduce an auditory feature based on the gammatone filterbank, the Selective Gammatone Envelope Feature (SGEF), for robust sound event recognition, in which channel selection and the filterbank envelope are used to reduce the effect of noise in specific noise environments. In experiments with Hidden Markov Model (HMM) recognizers, we show that our feature significantly outperforms MFCCs in four different noisy environments at various signal-to-noise ratios.
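
    The sketch below illustrates the general gammatone-envelope idea, not the paper's exact SGEF pipeline: pass the signal through an IIR gammatone filterbank, take each channel's Hilbert envelope, and keep only the highest-energy channels as a crude stand-in for the paper's noise-dependent channel selection. SciPy >= 1.6 is assumed for scipy.signal.gammatone, and the test signal is synthetic.

        import numpy as np
        from scipy.signal import gammatone, hilbert, lfilter

        def gammatone_envelopes(x, fs, center_freqs):
            # Per-channel Hilbert envelope of x through an IIR gammatone filterbank.
            return np.array([np.abs(hilbert(lfilter(*gammatone(fc, "iir", fs=fs), x)))
                             for fc in center_freqs])

        def select_channels(envs, keep=12):
            # Illustrative selection: retain the channels with the most envelope energy.
            idx = np.argsort((envs ** 2).sum(axis=1))[-keep:]
            return envs[np.sort(idx)]

        fs = 16000
        t = np.arange(fs) / fs
        x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.default_rng(0).normal(size=fs)
        cfs = np.geomspace(100, 6000, 20)   # log-spaced filterbank center frequencies
        feat = select_channels(gammatone_envelopes(x, fs, cfs))
        print(feat.shape)                   # (12, 16000) selected-channel envelopes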