Author Search Result

[Author] Xia ZOU(8hit)

1-8hit
  • Speech Reconstruction from MFCC Based on Nonnegative and Sparse Priors

    Gang MIN  Xiong wei ZHANG  Ji bin YANG  Xia ZOU  Zhi song PAN  

     
    LETTER-Speech and Hearing

    Vol: E98-A No:7  Page(s): 1540-1543

    In this letter, high-quality speech reconstruction approaches from Mel-frequency cepstral coefficients (MFCC) are presented. Taking into account the nonnegative and sparse properties of the speech power spectrum, alternating direction method of multipliers (ADMM) based nonnegative l2 norm (NL2) and weighted nonnegative l2 norm (NWL2) minimization approaches are proposed to cope with the under-determined nature of the reconstruction problem. The phase spectrum is recovered by the well-known LSE-ISTFTM algorithm. Experimental results demonstrate that the NL2 and NWL2 approaches achieve substantially better quality for reconstructed speech than the conventional l2 norm minimization approach. With high-resolution MFCC, the reconstructed speech sounds very close to the original speech, and the PESQ score reaches 4.0.

  • On the Complementary Role of DNN Multi-Level Enhancement for Noisy Robust Speaker Recognition in an I-Vector Framework

    Xingyu ZHANG  Xia ZOU  Meng SUN  Penglong WU  Yimin WANG  Jun HE  

     
    LETTER-Speech and Hearing

    Vol: E103-A No:1  Page(s): 356-360

    To improve the noise robustness of automatic speaker recognition, many speech/feature enhancement techniques based on deep neural networks (DNN) have been explored. In this work, a DNN multi-level enhancement (DNN-ME), which consists of the stages of signal enhancement, cepstrum enhancement and i-vector enhancement, is proposed for text-independent speaker recognition. Because these enhancement methods are applied at different stages of the speaker recognition pipeline, it is worth exploring their complementary role, which helps in understanding the pros and cons of enhancement at each stage. To exploit the capabilities of DNN-ME as fully as possible, two kinds of methods, called Cascaded DNN-ME and joint input of DNNs, are studied. Weighted Gaussian mixture models (WGMMs) proposed in our previous work are also applied to further improve the model's performance. Experiments conducted on the Speakers in the Wild (SITW) database show that DNN-ME is significantly superior to systems with only a single enhancement for noise-robust speaker recognition. Compared with the i-vector baseline, the equal error rate (EER) was reduced from 5.75 to 4.01.

  • Speech Enhancement Combining NMF Weighted by Speech Presence Probability and Statistical Model

    Yonggang HU  Xiongwei ZHANG  Xia ZOU  Gang MIN  Meng SUN  Yunfei ZHENG  

     
    LETTER-Speech and Hearing

    Vol: E98-A No:12  Page(s): 2701-2704

    Conventional non-negative matrix factorization (NMF)-based speech enhancement is accomplished by iterative updates that use prior knowledge of the clean speech and noise spectral bases. Using a probabilistic estimate of whether speech is present in a given frame, this letter proposes a speech enhancement algorithm that incorporates the speech presence probability (SPP), obtained via noise estimation, into the NMF process. To take advantage of both the NMF-based and statistical model-based approaches, the final enhanced speech is obtained by applying a statistical model-based filter to the output of the SPP-weighted NMF. Objective evaluations using perceptual evaluation of speech quality (PESQ) on TIMIT with 20 noise types at various signal-to-noise ratio (SNR) levels demonstrate the superiority of the proposed algorithm over the conventional NMF and statistical model-based baselines.

  • Semi-Supervised Speech Enhancement Combining Nonnegative Matrix Factorization and Robust Principal Component Analysis

    Yonggang HU  Xiongwei ZHANG  Xia ZOU  Meng SUN  Yunfei ZHENG  Gang MIN  

     
    LETTER-Speech and Hearing

    Vol: E100-A No:8  Page(s): 1714-1719

    Nonnegative matrix factorization (NMF) is one of the most popular machine learning tools for speech enhancement. Supervised NMF-based speech enhancement is accomplished by iterative updates that use prior knowledge of the clean speech and noise spectral bases. However, in many real-world scenarios, it is not always possible to conduct any prior training. The traditional semi-supervised NMF (SNMF) version overcomes this shortcoming, but its performance degrades. In this letter, without any prior knowledge of the speech and noise, we present an improved semi-supervised NMF-based speech enhancement algorithm combining techniques of NMF and robust principal component analysis (RPCA). In this approach, fixed speech bases are obtained offline from training samples chosen from a public dataset. The noise samples used for training the noise bases, instead of being characterized a priori as usual, are obtained on the fly via the RPCA algorithm. This letter also studies whether the time length of the estimated noise samples affects the performance of the algorithm. Three metrics, PESQ, SDR and SNR, are applied to evaluate the performance of the algorithms in experiments on TIMIT with 20 noise types at various signal-to-noise ratio levels. Extensive experimental results demonstrate the superiority of the proposed algorithm over the competing speech enhancement algorithms.

  • Automatic Model Order Selection for Convolutive Non-Negative Matrix Factorization

    Yinan LI  Xiongwei ZHANG  Meng SUN  Chong JIA  Xia ZOU  

     
    LETTER-Speech and Hearing

    Vol: E99-A No:10  Page(s): 1867-1870

    Exploring a parsimonious model that is just enough to represent the temporal dependency of time series signals such as audio or speech is a practical requirement for many signal processing applications. A well-suited method for intuitively and efficiently representing magnitude spectra is to use convolutive non-negative matrix factorization (CNMF) to discover the temporal relationship among nearby frames. However, the model order selection problem in CNMF, i.e., the choice of the number of convolutive bases, has seldom been investigated. In this paper, we propose a novel Bayesian framework that can automatically learn the optimal model order through maximum a posteriori (MAP) estimation. The proposed method yields a parsimonious and low-rank approximation by removing the redundant bases iteratively. We conducted intuitive experiments to show that the proposed algorithm is very effective in automatically determining the correct model order.

  • FFT-Based Implementation of Sampling Rate Conversion with a Small Number of Delays

    Xiaoxia ZOU  Shogo MURAMATSU  Hitoshi KIYA  

     
    PAPER

    Vol: E80-A No:8  Page(s): 1367-1375

    This paper considers the block delay caused by using the fast Fourier transform (FFT) and the computational complexity of sampling rate conversion systems. The relationship between the number of block delays and the computational complexity is investigated. The proposed method completely avoids the redundant operations of sampling rate conversion and, moreover, provides a good trade-off between the number of block delays and the computational complexity. As a result, it is shown that with the proposed method, sampling rate conversion can be realized more efficiently with a small number of block delays.

  • Deep Neural Network Based Monaural Speech Enhancement with Low-Rank Analysis and Speech Present Probability

    Wenhua SHI  Xiongwei ZHANG  Xia ZOU  Meng SUN  Wei HAN  Li LI  Gang MIN  

     
    LETTER-Noise and Vibration

    Vol: E101-A No:3  Page(s): 585-589

    A monaural speech enhancement method combining a deep neural network (DNN) with low-rank analysis and speech present probability is proposed in this letter. Low-rank and sparse analysis is first applied to the noisy speech spectrogram to obtain an approximate low-rank representation of the noise. Then a joint feature training strategy for DNN-based speech enhancement is presented, which helps the DNN better predict the target speech. To reduce the residual noise in highly overlapping regions and the high frequency domain, speech present probability (SPP) weighted post-processing is employed to further improve the quality of the speech enhanced by the trained DNN model. Compared with supervised non-negative matrix factorization (NMF) and the conventional DNN method, the proposed method obtains improved speech enhancement performance under stationary and non-stationary conditions.

  • Improved Semi-Supervised NMF Based Real-Time Capable Speech Enhancement

    Yonggang HU  Xiongwei ZHANG  Xia ZOU  Meng SUN  Gang MIN  Yinan LI  

     
    LETTER-Speech and Hearing

    Vol: E99-A No:1  Page(s): 402-406

    Nonnegative matrix factorization (NMF) is one of the most popular tools for speech enhancement. In this letter, we present an improved semi-supervised NMF (ISNMF)-based speech enhancement algorithm combining techniques of noise estimation and incremental NMF (INMF). In this approach, fixed speech bases are obtained offline from training samples in advance, while noise bases are trained on the fly whenever a new noisy frame arrives. The INMF algorithm is adopted for learning the noise bases because it can overcome the difficulties that conventional NMF confronts in online processing. The proposed algorithm is real-time capable in the sense that it processes the time frames of the noisy speech one by one and the computational complexity is feasible. Four different objective evaluation measures at various signal-to-noise ratio (SNR) levels demonstrate the superiority of the proposed method over traditional semi-supervised NMF (SNMF) and the well-known robust principal component analysis (RPCA) algorithm.
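
    Several of the abstracts above rely on semi-supervised NMF, in which the speech bases are fixed and only the noise bases and the activations are learned from the noisy spectrogram. The sketch below illustrates that general idea only. It is not the authors' ISNMF or INMF algorithm: it assumes batch processing, a squared-error cost, and the standard multiplicative updates, and every name in it (`semi_supervised_nmf`, `reconstruction_error`, `W_speech`, `n_noise`) is hypothetical.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def scale(X, num, den, eps):
    """Elementwise multiplicative update: X * num / (den + eps)."""
    return [[x * n / (d + eps) for x, n, d in zip(xr, nr, dr)]
            for xr, nr, dr in zip(X, num, den)]


def semi_supervised_nmf(V, W_speech, n_noise, n_iter=200, eps=1e-9, seed=0):
    """Approximate V (F x T, nonnegative) as [W_speech, W_noise] @ H.

    W_speech (F x Ks) stays fixed; W_noise (F x n_noise) and H are learned
    with multiplicative updates for the squared-error cost.
    """
    rng = random.Random(seed)
    F, T = len(V), len(V[0])
    Ks = len(W_speech[0])
    W_noise = [[rng.random() + eps for _ in range(n_noise)] for _ in range(F)]
    H = [[rng.random() + eps for _ in range(T)] for _ in range(Ks + n_noise)]
    for _ in range(n_iter):
        # Concatenate fixed speech bases and current noise bases column-wise.
        W = [ws + wn for ws, wn in zip(W_speech, W_noise)]
        Wt = transpose(W)
        # Activations: H <- H * (W^T V) / (W^T W H)
        H = scale(H, matmul(Wt, V), matmul(Wt, matmul(W, H)), eps)
        # Noise bases only: W_n <- W_n * (V H_n^T) / (W H H_n^T)
        Hn_t = transpose(H[Ks:])
        W_noise = scale(W_noise, matmul(V, Hn_t),
                        matmul(matmul(W, H), Hn_t), eps)
    return W_noise, H


def reconstruction_error(V, W_speech, W_noise, H):
    """Squared Frobenius error between V and its factorization."""
    W = [ws + wn for ws, wn in zip(W_speech, W_noise)]
    R = matmul(W, H)
    return sum((v - r) ** 2 for vr, rr in zip(V, R) for v, r in zip(vr, rr))
```

    In such a scheme, the speech estimate would typically be formed from `W_speech` and the first `Ks` rows of `H`, for example as a Wiener-style mask applied to the noisy spectrogram; that step is omitted here.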