Author Search Result

[Author] Gang MIN(8hit)

1-8hit
  • Deep Neural Network Based Monaural Speech Enhancement with Low-Rank Analysis and Speech Present Probability

    Wenhua SHI  Xiongwei ZHANG  Xia ZOU  Meng SUN  Wei HAN  Li LI  Gang MIN  

     
    LETTER-Noise and Vibration

      Vol:
    E101-A No:3
      Page(s):
    585-589

    A monaural speech enhancement method combining a deep neural network (DNN) with low-rank analysis and speech present probability is proposed in this letter. Low-rank and sparse analysis is first applied to the noisy speech spectrogram to get an approximate low-rank representation of the noise. Then a joint feature training strategy for DNN-based speech enhancement is presented, which helps the DNN better predict the target speech. To reduce the residual noise in highly overlapping regions and in the high-frequency domain, speech present probability (SPP) weighted post-processing is employed to further improve the quality of the speech enhanced by the trained DNN model. Compared with supervised non-negative matrix factorization (NMF) and the conventional DNN method, the proposed method obtains improved speech enhancement performance under stationary and non-stationary conditions.

  • Improved Semi-Supervised NMF Based Real-Time Capable Speech Enhancement

    Yonggang HU  Xiongwei ZHANG  Xia ZOU  Meng SUN  Gang MIN  Yinan LI  

     
    LETTER-Speech and Hearing

      Vol:
    E99-A No:1
      Page(s):
    402-406

    Nonnegative matrix factorization (NMF) is one of the most popular tools for speech enhancement. In this letter, we present an improved semi-supervised NMF (ISNMF)-based speech enhancement algorithm combining techniques of noise estimation and incremental NMF (INMF). In this approach, fixed speech bases are obtained offline from training samples in advance, while noise bases are trained on the fly whenever a new noisy frame arrives. The INMF algorithm is adopted for learning the noise bases because it overcomes the difficulties that conventional NMF faces in online processing. The proposed algorithm is real-time capable in the sense that it processes the time frames of the noisy speech one by one and its computational complexity is feasible. Four different objective evaluation measures at various signal-to-noise ratio (SNR) levels demonstrate the superiority of the proposed method over traditional semi-supervised NMF (SNMF) and the well-known robust principal component analysis (RPCA) algorithm.

  • A Perceptually Motivated Approach for Speech Enhancement Based on Deep Neural Network

    Wei HAN  Xiongwei ZHANG  Gang MIN  Meng SUN  

     
    LETTER-Speech and Hearing

      Vol:
    E99-A No:4
      Page(s):
    835-838

    In this letter, a novel perceptually motivated single channel speech enhancement approach based on Deep Neural Network (DNN) is presented. Taking into account the good masking properties of the human auditory system, a new DNN architecture is proposed to reduce the perceptual effect of the residual noise. This new DNN architecture is directly trained to learn a gain function which is used to estimate the power spectrum of clean speech and shape the spectrum of the residual noise at the same time. Experimental results demonstrate that the proposed perceptually motivated speech enhancement approach could achieve better objective speech quality when tested with TIMIT sentences corrupted by various types of noise, no matter whether the noise conditions are included in the training set or not.

  • Sequence-Based Pronunciation Variation Modeling for Spontaneous ASR Using a Noisy Channel Approach

    Hansjorg HOFMANN  Sakriani SAKTI  Chiori HORI  Hideki KASHIOKA  Satoshi NAKAMURA  Wolfgang MINKER  

     
    PAPER-Speech and Hearing

      Vol:
    E95-D No:8
      Page(s):
    2084-2093

    The performance of English automatic speech recognition systems decreases when recognizing spontaneous speech, mainly due to multiple pronunciation variants in the utterances. Previous approaches address this problem by modeling the alteration of the pronunciation on a phoneme-to-phoneme level. However, the phonetic transformation effects induced by the pronunciation of the whole sentence have not yet been considered. In this article, the sequence-based pronunciation variation is modeled using a noisy channel approach where the spontaneous phoneme sequence is considered as a “noisy” string and the goal is to recover the “clean” string of the word sequence. Hereby, the whole word sequence and its effect on the alteration of the phonemes are taken into consideration. Moreover, the system learns not only the phoneme transformation but also the mapping from the phoneme to the word directly. In this study, the phonemes are first recognized with the present recognition system, and afterwards the pronunciation variation model based on the noisy channel approach maps from the phoneme to the word level. Two well-known natural language processing approaches are adopted and derived from the noisy channel model theory: joint-sequence models and statistical machine translation. Both of them are applied and various experiments are conducted on microphone and telephone spontaneous speech.

  • Joint Optimization of Perceptual Gain Function and Deep Neural Networks for Single-Channel Speech Enhancement

    Wei HAN  Xiongwei ZHANG  Gang MIN  Xingyu ZHOU  Meng SUN  

     
    LETTER-Noise and Vibration

      Vol:
    E100-A No:2
      Page(s):
    714-717

    In this letter, we explore joint optimization of perceptual gain function and deep neural networks (DNNs) for a single-channel speech enhancement task. A DNN architecture is proposed which incorporates the masking properties of the human auditory system to make the residual noise inaudible. This new DNN architecture directly trains a perceptual gain function which is used to estimate the magnitude spectrum of clean speech from noisy speech features. Experimental results demonstrate that the proposed speech enhancement approach can achieve significant improvements over the baselines when tested with TIMIT sentences corrupted by various types of noise, no matter whether the noise conditions are included in the training set or not.

  • Speech Reconstruction from MFCC Based on Nonnegative and Sparse Priors

    Gang MIN  Xiongwei ZHANG  Jibin YANG  Xia ZOU  Zhisong PAN  

     
    LETTER-Speech and Hearing

      Vol:
    E98-A No:7
      Page(s):
    1540-1543

    In this letter, high-quality speech reconstruction approaches from Mel-frequency cepstral coefficients (MFCC) are presented. Taking into account the nonnegative and sparse properties of the speech power spectrum, an alternating direction method of multipliers (ADMM) based nonnegative l2 norm (NL2) and weighted nonnegative l2 norm (NWL2) minimization approach is proposed to cope with the under-determined nature of the reconstruction problem. The phase spectrum is recovered by the well-known LSE-ISTFTM algorithm. Experimental results demonstrate that the NL2 and NWL2 approaches achieve substantially better quality for reconstructed speech than the conventional l2 norm minimization approach. When using high-resolution MFCC, the reconstructed speech sounds very close to the original speech and the PESQ score reaches 4.0.

  • Speech Enhancement Combining NMF Weighted by Speech Presence Probability and Statistical Model

    Yonggang HU  Xiongwei ZHANG  Xia ZOU  Gang MIN  Meng SUN  Yunfei ZHENG  

     
    LETTER-Speech and Hearing

      Vol:
    E98-A No:12
      Page(s):
    2701-2704

    Conventional non-negative matrix factorization (NMF)-based speech enhancement is accomplished by updating iteratively with prior knowledge of the clean speech and noise spectral bases. Using a probabilistic estimate of whether speech is present in a given frame, this letter proposes a speech enhancement algorithm that incorporates the speech presence probability (SPP), obtained via noise estimation, into the NMF process. To take advantage of both the NMF-based and statistical model-based approaches, the final enhanced speech is obtained by applying a statistical model-based filter to the output of the SPP-weighted NMF. Objective evaluations using perceptual evaluation of speech quality (PESQ) on TIMIT with 20 noise types at various signal-to-noise ratio (SNR) levels demonstrate the superiority of the proposed algorithm over the conventional NMF and statistical model-based baselines.

  • Semi-Supervised Speech Enhancement Combining Nonnegative Matrix Factorization and Robust Principal Component Analysis

    Yonggang HU  Xiongwei ZHANG  Xia ZOU  Meng SUN  Yunfei ZHENG  Gang MIN  

     
    LETTER-Speech and Hearing

      Vol:
    E100-A No:8
      Page(s):
    1714-1719

    Nonnegative matrix factorization (NMF) is one of the most popular machine learning tools for speech enhancement. Supervised NMF-based speech enhancement is accomplished by updating iteratively with prior knowledge of the clean speech and noise spectral bases. However, in many real-world scenarios, it is not always possible to conduct prior training. The traditional semi-supervised NMF (SNMF) version overcomes this shortcoming, but its performance degrades. In this letter, without any prior knowledge of the speech and noise, we present an improved semi-supervised NMF-based speech enhancement algorithm combining techniques of NMF and robust principal component analysis (RPCA). In this approach, fixed speech bases are obtained offline from training samples chosen from a public dataset. The noise samples used for training the noise bases, instead of being characterized a priori as usual, are obtained on the fly via the RPCA algorithm. This letter also studies whether the time length of the estimated noise samples affects the performance of the algorithm. Three metrics, PESQ, SDR and SNR, are used to evaluate the algorithms in experiments on TIMIT with 20 noise types at various signal-to-noise ratio levels. Extensive experimental results demonstrate the superiority of the proposed algorithm over the competing speech enhancement algorithms.
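
Several of the letters listed above rely on semi-supervised NMF, in which the speech bases are fixed in advance and only the noise bases are learned from the noisy input. The sketch below illustrates that general idea only; it is not the algorithm of any listed letter. It assumes a Euclidean cost with standard multiplicative updates and a ratio mask for the final estimate, and the names `semi_supervised_nmf`, `W_speech`, `n_noise` and the helper functions are hypothetical. It uses plain Python lists to stay dependency-free; a practical version would use an array library.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def mult_update(X, num, den, eps):
    """Element-wise multiplicative update X * num / (den + eps)."""
    return [[x * n / (d + eps) for x, n, d in zip(xr, nr, dr)]
            for xr, nr, dr in zip(X, num, den)]


def semi_supervised_nmf(V, W_speech, n_noise=2, n_iter=100, eps=1e-9, seed=0):
    """Approximate V (freq x frames) by [W_speech | W_noise] @ H.

    W_speech is kept fixed (assumed to be pre-trained offline);
    W_noise and all activations H are learned from V itself.
    Returns the masked speech magnitude estimate and the squared error.
    """
    rng = random.Random(seed)
    n_freq, n_frames = len(V), len(V[0])
    n_speech = len(W_speech[0])
    W_noise = [[rng.random() + 0.1 for _ in range(n_noise)] for _ in range(n_freq)]
    H = [[rng.random() + 0.1 for _ in range(n_frames)]
         for _ in range(n_speech + n_noise)]

    for _ in range(n_iter):
        W = [ws + wn for ws, wn in zip(W_speech, W_noise)]
        # Update all activations with the bases held fixed.
        Wt = transpose(W)
        H = mult_update(H, matmul(Wt, V), matmul(Wt, matmul(W, H)), eps)
        # Update only the noise bases; the speech bases never change.
        Hn_t = transpose(H[n_speech:])
        W_noise = mult_update(W_noise, matmul(V, Hn_t),
                              matmul(matmul(W, H), Hn_t), eps)

    W = [ws + wn for ws, wn in zip(W_speech, W_noise)]
    V_hat = matmul(W, H)                        # speech + noise model
    S_hat = matmul(W_speech, H[:n_speech])      # speech part of the model
    # Ratio mask: share of each bin that the model attributes to speech.
    speech = [[v * s / (vh + eps) for v, s, vh in zip(vr, sr, vhr)]
              for vr, sr, vhr in zip(V, S_hat, V_hat)]
    err = sum((v - vh) ** 2 for vr, vhr in zip(V, V_hat) for v, vh in zip(vr, vhr))
    return speech, err
```

Here `V` would be a nonnegative magnitude (or power) spectrogram given as a list of frequency rows, and `W_speech` a matrix of speech bases with the same number of rows. Because the mask is a ratio of nonnegative terms, the speech estimate never exceeds the noisy input in any bin.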