Speech Analysis Method Based on Source-Filter Model Using Multivariate Empirical Mode Decomposition

Surasak BOONKLA; Masashi UNOKI; Stanislav S. MAKHANOV; Chai WUTIWIWATCHAI

doi:10.1587/transfun.E99.A.1762

IEICE TRANSACTIONS on Fundamentals

Summary:

We propose a speech analysis method based on the source-filter model using multivariate empirical mode decomposition (MEMD). The proposed method takes multiple adjacent frames of a speech signal into account by combining their log spectra into multivariate signals, which are then decomposed into intrinsic mode functions (IMFs). The IMFs are divided into two groups using the peak of the autocorrelation function (ACF) of an IMF. The first group, characterized by spectral fine structure, is used to estimate the fundamental frequency F₀ from the ACF, whereas the second group, characterized by the frequency response of the vocal-tract filter, is used to estimate formant frequencies by peak picking. Using MEMD has two advantages: (i) the variation in the number of IMFs that occurs with single-frame empirical mode decomposition is eliminated, and (ii) information common to the adjacent frames aligns in IMFs of the same order because of the common mode alignment property of MEMD. These advantages make the analysis more accurate than with other methods. Unlike the conventional linear prediction (LP) and cepstrum methods, which rely on the LP order and cut-off frequency, respectively, the proposed method automatically separates the glottal source from the vocal-tract filter. The results showed that the proposed method exhibits the highest accuracy of F₀ estimation and correctly estimates the formant frequencies of the vocal-tract filter.
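
The summary describes two steps that are easy to illustrate in isolation: combining the log spectra of adjacent frames into one multivariate signal, and reading F₀ off the ACF of the spectral fine structure. The following is a minimal Python sketch of those two steps only, not the authors' implementation. It assumes NumPy, and every function name, frame size, and search range in it is hypothetical. In particular, the MEMD decomposition is not implemented: a moving-average subtraction is swapped in as a crude stand-in for selecting the fine-structure IMF group.

```python
import numpy as np

def log_spectrum(frame, n_fft):
    """Log-magnitude spectrum of one Hann-windowed frame (non-negative frequencies)."""
    windowed = frame * np.hanning(len(frame))
    return np.log(np.abs(np.fft.rfft(windowed, n_fft)) + 1e-10)

def stack_adjacent_log_spectra(signal, frame_len, hop, n_frames, n_fft):
    """Combine log spectra of adjacent frames into a multivariate signal.

    Returns an array of shape (n_fft // 2 + 1, n_frames): one channel per frame.
    """
    frames = [signal[i * hop:i * hop + frame_len] for i in range(n_frames)]
    return np.stack([log_spectrum(f, n_fft) for f in frames], axis=1)

def remove_envelope(log_spec, width=65):
    """Stand-in for the MEMD split (assumption): subtract a moving average
    so that only the fine structure remains. `width` must be odd."""
    kernel = np.ones(width) / width
    padded = np.pad(log_spec, width // 2, mode="edge")
    return log_spec - np.convolve(padded, kernel, mode="valid")

def f0_from_fine_structure(fine, fs, n_fft, f0_min=60.0, f0_max=400.0):
    """Estimate F0 as the harmonic spacing: the lag of the ACF peak
    taken along the frequency axis, converted from bins to Hz."""
    x = fine - fine.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    bin_hz = fs / n_fft
    lo = int(np.ceil(f0_min / bin_hz))
    hi = int(np.floor(f0_max / bin_hz))
    lag = lo + int(np.argmax(acf[lo:hi + 1]))
    return lag * bin_hz

# Synthetic voiced-like signal: harmonics of 125 Hz with a 1/k spectral tilt.
fs, n_fft, f0_true = 16000, 4096, 125.0
t = np.arange(int(0.1 * fs)) / fs
rng = np.random.default_rng(0)
harmonics = np.arange(1, int((fs / 2) // f0_true))
phases = rng.uniform(0.0, 2.0 * np.pi, harmonics.size)
signal = sum(np.cos(2.0 * np.pi * k * f0_true * t + p) / k
             for k, p in zip(harmonics, phases))

multivariate = stack_adjacent_log_spectra(
    signal, frame_len=640, hop=160, n_frames=3, n_fft=n_fft)
f0_estimates = np.array([
    f0_from_fine_structure(remove_envelope(multivariate[:, c]), fs, n_fft)
    for c in range(multivariate.shape[1])
])
print(multivariate.shape, f0_estimates)
```

Here the ACF is computed along the frequency axis, so its peak lag is the harmonic spacing in bins; multiplying by the bin width fs / n_fft gives an F₀ estimate whose resolution is one bin. The sketch estimates F₀ per frame for simplicity, whereas the method in the summary decomposes all channels jointly with MEMD.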

Publication: IEICE TRANSACTIONS on Fundamentals Vol.E99-A No.10 pp.1762-1773

Publication Date: 2016/10/01

Online ISSN: 1745-1337

DOI: 10.1587/transfun.E99.A.1762

Type of Manuscript: PAPER

Category: Speech and Hearing

Authors

Surasak BOONKLA
  Japan Advanced Institute of Science and Technology, Thammasat University
Masashi UNOKI
  Japan Advanced Institute of Science and Technology
Stanislav S. MAKHANOV
  Thammasat University
Chai WUTIWIWATCHAI
  National Electronics and Computer Technology Center

Keyword

multivariate empirical mode decomposition, speech analysis, fundamental frequency, formant frequency

Cite this

Surasak BOONKLA, Masashi UNOKI, Stanislav S. MAKHANOV, Chai WUTIWIWATCHAI, "Speech Analysis Method Based on Source-Filter Model Using Multivariate Empirical Mode Decomposition" in IEICE TRANSACTIONS on Fundamentals, vol. E99-A, no. 10, pp. 1762-1773, October 2016, doi: 10.1587/transfun.E99.A.1762.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/transfun.E99.A.1762/_p

@ARTICLE{e99-a_10_1762,
author={Surasak BOONKLA and Masashi UNOKI and Stanislav S. MAKHANOV and Chai WUTIWIWATCHAI},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Speech Analysis Method Based on Source-Filter Model Using Multivariate Empirical Mode Decomposition},
year={2016},
volume={E99-A},
number={10},
pages={1762--1773},
abstract={We propose a speech analysis method based on the source-filter model using multivariate empirical mode decomposition (MEMD). The proposed method takes multiple adjacent frames of a speech signal into account by combining their log spectra into multivariate signals. The multivariate signals are then decomposed into intrinsic mode functions (IMFs). The IMFs are divided into two groups using the peak of the autocorrelation function (ACF) of an IMF. The first group characterized by a spectral fine structure is used to estimate the fundamental frequency F₀ by using the ACF, whereas the second group characterized by the frequency response of the vocal-tract filter is used to estimate formant frequencies by using a peak picking technique. There are two advantages of using MEMD: (i) the variation in the number of IMFs is eliminated in contrast with single-frame based empirical mode decomposition and (ii) the common information of the adjacent frames aligns in the same order of IMFs because of the common mode alignment property of MEMD. These advantages make the analysis more accurate than with other methods. As opposed to the conventional linear prediction (LP) and cepstrum methods, which rely on the LP order and cut-off frequency, respectively, the proposed method automatically separates the glottal-source and vocal-tract filter. The results showed that the proposed method exhibits the highest accuracy of F0 estimation and correctly estimates the formant frequencies of the vocal-tract filter.},
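keywords={multivariate empirical mode decomposition, speech analysis, fundamental frequency, formant frequency},
doi={10.1587/transfun.E99.A.1762},
ISSN={1745-1337},
month={October}
}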

TY  - JOUR
TI  - Speech Analysis Method Based on Source-Filter Model Using Multivariate Empirical Mode Decomposition
T2  - IEICE TRANSACTIONS on Fundamentals
SP  - 1762
EP  - 1773
AU  - Surasak BOONKLA
AU  - Masashi UNOKI
AU  - Stanislav S. MAKHANOV
AU  - Chai WUTIWIWATCHAI
PY  - 2016
DO  - 10.1587/transfun.E99.A.1762
JO  - IEICE TRANSACTIONS on Fundamentals
SN  - 1745-1337
VL  - E99-A
IS  - 10
JA  - IEICE TRANSACTIONS on Fundamentals
Y1  - 2016/10/01
AB  - We propose a speech analysis method based on the source-filter model using multivariate empirical mode decomposition (MEMD). The proposed method takes multiple adjacent frames of a speech signal into account by combining their log spectra into multivariate signals. The multivariate signals are then decomposed into intrinsic mode functions (IMFs). The IMFs are divided into two groups using the peak of the autocorrelation function (ACF) of an IMF. The first group characterized by a spectral fine structure is used to estimate the fundamental frequency F₀ by using the ACF, whereas the second group characterized by the frequency response of the vocal-tract filter is used to estimate formant frequencies by using a peak picking technique. There are two advantages of using MEMD: (i) the variation in the number of IMFs is eliminated in contrast with single-frame based empirical mode decomposition and (ii) the common information of the adjacent frames aligns in the same order of IMFs because of the common mode alignment property of MEMD. These advantages make the analysis more accurate than with other methods. As opposed to the conventional linear prediction (LP) and cepstrum methods, which rely on the LP order and cut-off frequency, respectively, the proposed method automatically separates the glottal-source and vocal-tract filter. The results showed that the proposed method exhibits the highest accuracy of F₀ estimation and correctly estimates the formant frequencies of the vocal-tract filter.
ER  - 
