The search functionality is under construction.

Keyword Search Result

[Keyword] MFCC(8hit)

1-8hit
  • Home Activity Recognition by Sounds of Daily Life Using Improved Feature Extraction Method

    João Filipe PAPEL  Tatsuji MUNAKA  

     
    PAPER

      Publicized:
    2022/08/23
      Vol:
    E106-D No:4
      Page(s):
    450-458

    In recent years, with the aging of society, much research has been actively conducted on recognizing human activity in the home to watch over the elderly. Multiple kinds of sensors are used for activity recognition; however, privacy must be considered when using them. One candidate sensor that preserves privacy is a sound sensor. MFCC (Mel-Frequency Cepstral Coefficient) is widely used as a feature extraction algorithm for voice recognition. However, conventional MFCC is not well suited to activity recognition from sounds of daily life, which we refer to simply as "life sounds" in this paper. The reason is that conventional MFCC does not extract well several features of life sounds that appear at high frequencies. This paper proposes an improved MFCC and reports the evaluation results of activity recognition by an SVM (Support Vector Machine) classifier using features extracted by the improved MFCC.
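    As background for the improvement discussed above, the conventional MFCC pipeline can be sketched with a minimal NumPy implementation. The frame size, sampling rate, and filter counts below are illustrative assumptions; this is the generic baseline, not the authors' improved variant:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def mfcc(frame, sr=16000, n_filters=26, n_ceps=13):
    """Conventional MFCC of a single frame: window, power spectrum,
    mel filterbank, log, DCT-II truncated to n_ceps coefficients."""
    frame = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(frame)) ** 2
    fb = mel_filterbank(n_filters, len(frame), sr)
    log_energy = np.log(fb @ power + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_energy

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)  # toy 440 Hz tone
coeffs = mfcc(frame)
print(coeffs.shape)  # (13,)
```

    Because the mel scale spaces filters densely at low frequencies and sparsely at high ones, high-frequency detail is smoothed away, which is the limitation the paper targets.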

  • Pitch Estimation and Voicing Classification Using Reconstructed Spectrum from MFCC

    JianFeng WU  HuiBin QIN  YongZhu HUA  LingYan FAN  

     
    LETTER-Speech and Hearing

      Publicized:
    2017/11/15
      Vol:
    E101-D No:2
      Page(s):
    556-559

    In this paper, a novel method for pitch estimation and voicing classification is proposed using a spectrum reconstructed from Mel-frequency cepstral coefficients (MFCC). The proposed algorithm reconstructs the spectrum from MFCC with the Moore-Penrose pseudo-inverse of the Mel-scale weighting functions. The reconstructed spectrum is compressed and filtered in log-frequency. Pitch estimation is achieved by modeling the joint density of the pitch frequency and the filtered spectrum with a Gaussian Mixture Model (GMM). Voicing classification is also achieved by a GMM-based model, and the test results show that over 99% of frames can be correctly classified. The results of pitch estimation demonstrate that the proposed GMM-based pitch estimator has high accuracy, with a relative error of 6.68% on the TIMIT database.
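    The core reconstruction step can be sketched roughly as follows, assuming a standard triangular mel filterbank and a truncated DCT (all sizes here are illustrative assumptions; the paper's subsequent log-frequency compression and filtering are omitted):

```python
import numpy as np

def mel_matrix(n_filters=26, n_fft=512, sr=16000):
    """Standard triangular mel weighting matrix (illustrative design)."""
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    hz = 700 * (10 ** (np.linspace(0, mel(sr / 2), n_filters + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

n_filters, n_ceps = 26, 13
M = mel_matrix(n_filters)

# Forward path: power spectrum -> log mel energies -> truncated DCT -> MFCC
power = np.abs(np.fft.rfft(np.random.default_rng(0).standard_normal(512))) ** 2
k = np.arange(n_ceps)[:, None]
n = np.arange(n_filters)[None, :]
D = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
c = D @ np.log(M @ power + 1e-10)

# Inverse path: undo the truncated DCT with its pseudo-inverse, exponentiate,
# then map mel energies back to a linear-frequency spectrum via pinv(M)
log_mel_hat = np.linalg.pinv(D) @ c
spec_hat = np.linalg.pinv(M) @ np.exp(log_mel_hat)
print(spec_hat.shape)  # (257,)
```

    The reconstruction is under-determined (13 coefficients cannot pin down 257 spectral bins), so `spec_hat` is only a smoothed approximation, which is why it is then modeled statistically with a GMM rather than used directly.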

  • Speech Reconstruction from MFCC Based on Nonnegative and Sparse Priors

    Gang MIN  Xiongwei ZHANG  Jibin YANG  Xia ZOU  Zhisong PAN  

     
    LETTER-Speech and Hearing

      Vol:
    E98-A No:7
      Page(s):
    1540-1543

    In this letter, high-quality speech reconstruction approaches from Mel-frequency cepstral coefficients (MFCC) are presented. Taking into account the nonnegative and sparse properties of the speech power spectrum, an alternating direction method of multipliers (ADMM) based nonnegative l2 norm (NL2) and weighted nonnegative l2 norm (NWL2) minimization approach is proposed to cope with the under-determined nature of the reconstruction problem. The phase spectrum is recovered by the well-known LSE-ISTFTM algorithm. Experimental results demonstrate that the NL2 and NWL2 approaches achieve substantially better quality for the reconstructed speech than the conventional l2 norm minimization approach; the result sounds very close to the original speech when using high-resolution MFCC, with a PESQ score reaching 4.0.

  • Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions

    Longbiao WANG  Kazue MINAMI  Kazumasa YAMAMOTO  Seiichi NAKAGAWA  

     
    PAPER-Speaker Recognition

      Vol:
    E93-D No:9
      Page(s):
    2397-2406

    In this paper, we investigate the effectiveness of phase information for speaker recognition in noisy conditions and combine it with mel-frequency cepstral coefficients (MFCCs). To date, almost all speaker recognition methods have been based on MFCCs, even in noisy conditions. MFCCs, which dominantly capture vocal tract information, use only the magnitude of the Fourier transform of time-domain speech frames, and the phase information has been ignored. Because the phase information includes rich voice source information, it is expected to complement MFCCs well. Furthermore, some studies have reported that phase-based features are robust to noise. In our previous study, we proposed a phase information extraction method that normalizes the variation in the phase depending on the clipping position of the input speech; the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs alone. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method that skips frames with a low energy/signal-to-noise ratio (SNR), and noisy-speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database, with stationary/non-stationary noise added, were used to evaluate our proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. With clean-speech training models, the individual result of the phase information was even better than that of MFCCs in many cases. By deleting unreliable frames (frames with a low energy/SNR), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error reduction rate was about 30%-60% compared with the standard MFCC-based method.

  • Robust Feature Extraction Using Variable Window Function in Autocorrelation Domain for Speech Recognition

    Sangho LEE  Jeonghyun HA  Jaekeun HONG  

     
    LETTER-Speech and Hearing

      Vol:
    E92-A No:11
      Page(s):
    2917-2921

    This paper presents a new feature extraction method for robust speech recognition based on autocorrelation mel-frequency cepstral coefficients (AMFCCs) and a variable window. While the AMFCC feature extraction method uses a fixed double-dynamic-range (DDR) Hamming window for the higher-lag autocorrelation coefficients, which are least affected by noise, the proposed method applies a variable window depending on the frame energy and periodicity. The performance of the proposed method is verified on the Aurora-2 task, and the results confirm significantly improved performance under noisy conditions.

  • Multi-Input Feature Combination in the Cepstral Domain for Practical Speech Recognition Systems

    Yasunari OBUCHI  Nobuo HATAOKA  

     
    PAPER-Speech and Hearing

      Vol:
    E92-D No:4
      Page(s):
    662-670

    In this paper we describe a new framework of feature combination in the cepstral domain for multi-input robust speech recognition. The general framework of working in the cepstral domain has various advantages over working in the time or hypothesis domain: it is stable, easy to maintain, and less expensive because it does not require precise calibration, and it is easy to configure in a complex speech recognition system. However, it is not straightforward to improve the recognition performance by increasing the number of inputs, so we introduce the concept of variance re-scaling to compensate for the negative effect of averaging several input features. Finally, we take further advantage of working in the cepstral domain: the speech can be modeled using hidden Markov models, and the model can be used as prior knowledge. This approach is formulated as a new algorithm, referred to as Hypothesis-Based Feature Combination. The effectiveness of the various algorithms is evaluated using two sets of speech databases. We also describe automatic optimization of some parameters in the proposed algorithms.
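    The variance re-scaling idea can be illustrated with a toy NumPy sketch: averaging several noisy copies of the same cepstral features shrinks their variance, so each dimension is re-scaled back to single-channel statistics. All data, sizes, and the re-scaling target here are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_ceps, n_inputs = 200, 13, 4

# Each input channel observes the same underlying speech features plus
# independent noise (purely synthetic data for illustration)
clean = rng.standard_normal((n_frames, n_ceps))
inputs = [clean + 0.5 * rng.standard_normal((n_frames, n_ceps))
          for _ in range(n_inputs)]

# Averaging independent noise shrinks the per-dimension variance below what
# acoustic models trained on single-channel features expect
avg = np.mean(inputs, axis=0)

# Re-scale each cepstral dimension so its variance matches a single channel
mean = avg.mean(axis=0)
target_std = np.std(inputs[0], axis=0)
rescaled = (avg - mean) / np.std(avg, axis=0) * target_std + mean
print(rescaled.shape)  # (200, 13)
```

    The averaged features keep the noise reduction from combining channels, while the re-scaling removes the variance mismatch against the recognizer's models.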

  • Underwater Transient Signal Classification Using Binary Pattern Image of MFCC and Neural Network

    Taegyun LIM  Keunsung BAE  Chansik HWANG  Hyeonguk LEE  

     
    LETTER-Engineering Acoustics

      Vol:
    E91-A No:3
      Page(s):
    772-774

    This paper presents a new method for classifying underwater transient signals, which employs a binary image pattern of the mel-frequency cepstral coefficients as a feature vector and a feed-forward neural network as a classifier. The feature vector is obtained by taking the DCT of the square matrix of mel-frequency cepstral coefficients derived from frame-based cepstral analysis and applying 1-bit quantization. The classifier is a feed-forward neural network with one hidden layer and one output layer, and a back-propagation algorithm is used to update the weight vector of each layer. Experimental results with underwater transient signals demonstrate that the proposed method is very promising for classifying underwater transient signals.
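    The binary-pattern feature construction can be sketched as follows, assuming an orthonormal 2-D DCT and sign-based 1-bit quantization; the matrix size and the sign threshold are illustrative assumptions, not necessarily the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(2)
mfcc_matrix = rng.standard_normal((16, 16))  # 16 frames x 16 coefficients (toy data)

def dct2(x):
    """Orthonormal 2-D DCT-II built from a 1-D DCT matrix."""
    n = x.shape[0]
    k = np.arange(n)[:, None]
    D = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    D[0] *= 1 / np.sqrt(2)
    D *= np.sqrt(2 / n)
    return D @ x @ D.T

# 1-bit quantization by sign yields a binary image pattern; flattened,
# it serves as the input vector to the feed-forward classifier
binary_pattern = (dct2(mfcc_matrix) > 0).astype(np.uint8)
feature_vector = binary_pattern.ravel()
print(feature_vector.shape)  # (256,)
```

    Reducing each DCT coefficient to one bit makes the feature compact and insensitive to amplitude variation, at the cost of discarding magnitude information.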

  • A MFCC-Based CELP Speech Coder for Server-Based Speech Recognition in Network Environments

    Jae Sam YOON  Gil Ho LEE  Hong Kook KIM  

     
    PAPER-Speech/Audio Processing

      Vol:
    E90-A No:3
      Page(s):
    626-632

    Existing standard speech coders can provide high-quality speech communication. However, they tend to degrade the performance of automatic speech recognition (ASR) systems that use the reconstructed speech. The main cause of the degradation is that the linear predictive coefficients (LPCs), the typical spectral envelope parameters in speech coding, are optimized for speech quality rather than for speech recognition performance. In this paper, we propose a speech coder using mel-frequency cepstral coefficients (MFCCs) instead of LPCs to improve the performance of a server-based speech recognition system in network environments. To develop the proposed speech coder at a low bit rate, we first exploit the interframe correlation of MFCCs, which leads to predictive quantization of the MFCCs. Second, a safety-net scheme is proposed to make the MFCC-based speech coder robust to channel errors. As a result, we propose an 8.7 kbps MFCC-based CELP coder. It is shown that the proposed speech coder has speech quality comparable to 8 kbps G.729, and the ASR system using the proposed speech coder gives a relative word error rate reduction of 6.8% compared to the ASR system using G.729 on a large-vocabulary task (AURORA4).
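    Predictive quantization with a safety-net can be illustrated with a toy scalar-quantizer sketch: each frame is coded either as a quantized residual from the previous decoded frame (exploiting interframe correlation) or directly (the safety-net, which limits error propagation). The predictor, step sizes, and mode-selection rule here are illustrative assumptions, not the coder's actual design:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic MFCC trajectories with interframe correlation (random walk)
frames = np.cumsum(rng.standard_normal((50, 13)) * 0.2, axis=0)

def quantize(x, step):
    """Uniform scalar quantizer (toy stand-in for a real vector quantizer)."""
    return np.round(x / step) * step

prev, decoded, modes = np.zeros(13), [], []
for frame in frames:
    pred_err = quantize(frame - prev, step=0.1)  # predictive mode: code the residual
    direct = quantize(frame, step=0.1)           # safety-net mode: code the frame itself
    # Choose whichever mode reconstructs the frame more accurately
    if np.sum((prev + pred_err - frame) ** 2) <= np.sum((direct - frame) ** 2):
        recon, mode = prev + pred_err, "pred"
    else:
        recon, mode = direct, "safety"
    decoded.append(recon)
    modes.append(mode)
    prev = recon  # decoder state advances with the decoded frame

decoded = np.array(decoded)
print(decoded.shape)  # (50, 13)
```

    In a real coder the safety-net mode also re-synchronizes the decoder after channel errors, since it does not depend on the (possibly corrupted) previous frame.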