A Speech Intelligibility Estimation Method Using a Non-reference Feature Set

Toshihiro SAKANO; Yosuke KOBAYASHI; Kazuhiro KONDO

doi:10.1587/transinf.2014MUP0004

Summary:

We proposed and evaluated a speech intelligibility estimation method that does not require a clean speech reference signal. The proposed method uses the features defined in the ITU-T standard P.563, which estimates the overall quality of speech without the reference signal. We selected two sets of features from the P.563 features: the basic 9-feature set, which includes basic features that characterize both speech and background noise, e.g., cepstrum skewness and LPC kurtosis, and the extended 31-feature set with 22 additional features for a more accurate description of the degraded speech and noise, e.g., SNR, average pitch, and spectral clarity, among others. Four hundred noise samples were added to speech, and about 70% of these samples were used to train a support vector regression (SVR) model. The trained models were used to estimate the intelligibility of speech degraded by added noise. The proposed method showed a root mean square error (RMSE) value of about 10% and correlation with subjective intelligibility of about 0.93 for speech distorted with known noise type, and RMSE of about 16% and a correlation of about 0.84 for speech distorted with unknown noise type, both with either the 9- or the 31-dimension feature set. These results were better than the estimation using frequency-weighted SNR calculated in critical frequency bands, which requires the clean reference signal for its calculation. We believe this level of accuracy proves the proposed method to be applicable to real-time speech quality monitoring in the field.
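
The summary reports accuracy as RMSE and as correlation with subjective intelligibility, computed on samples held out from a roughly 70% training split. The sketch below shows one way such a split and the two metrics could be computed. It is an illustration only, not the authors' code: the function names, the fixed random seed, the use of Pearson correlation, and all score values are assumptions made for the example, and the SVR training step itself is left out.

```python
import math
import random


def split_train_test(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and split it into train and test parts.

    The 0.7 default mirrors the "about 70%" in the summary; the seed is an
    arbitrary choice made here for repeatability.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def rmse(estimated, subjective):
    """Root mean square error between two equal-length score lists."""
    if len(estimated) != len(subjective) or not estimated:
        raise ValueError("score lists must be non-empty and of equal length")
    squared = [(e - s) ** 2 for e, s in zip(estimated, subjective)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(estimated, subjective):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(estimated) != len(subjective) or len(estimated) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_e = sum(estimated) / len(estimated)
    mean_s = sum(subjective) / len(subjective)
    cov = sum((e - mean_e) * (s - mean_s) for e, s in zip(estimated, subjective))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    return cov / math.sqrt(var_e * var_s)


# Hypothetical intelligibility scores in percent (not data from the paper).
subjective_scores = [90.0, 75.0, 60.0, 40.0]
estimated_scores = [85.0, 80.0, 55.0, 45.0]

# 400 placeholder sample identifiers, matching the sample count in the summary.
train_ids, test_ids = split_train_test(range(400))

if __name__ == "__main__":
    print(len(train_ids), len(test_ids))
    print(rmse(estimated_scores, subjective_scores))
    print(pearson(estimated_scores, subjective_scores))
```

With scores expressed in percent, the RMSE is in percentage points, which is how the "about 10%" and "about 16%" figures in the summary read.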

Publication: IEICE TRANSACTIONS on Information Vol.E98-D No.1 pp.21-28

Publication Date: 2015/01/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2014MUP0004

Type of Manuscript: Special Section PAPER (Special Section on Enriched Multimedia)

Authors

Toshihiro SAKANO
  Yamagata University
Yosuke KOBAYASHI
  Yamagata University
Kazuhiro KONDO
  Yamagata University

Keyword

speech intelligibility, non-reference estimation, support vector regression, P.563, diagnostic rhyme test

Toshihiro SAKANO, Yosuke KOBAYASHI, Kazuhiro KONDO, "A Speech Intelligibility Estimation Method Using a Non-reference Feature Set" in IEICE TRANSACTIONS on Information, vol. E98-D, no. 1, pp. 21-28, January 2015, doi: 10.1587/transinf.2014MUP0004.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2014MUP0004/_p

@ARTICLE{e98-d_1_21,
author={Toshihiro SAKANO and Yosuke KOBAYASHI and Kazuhiro KONDO},
journal={IEICE TRANSACTIONS on Information},
title={A Speech Intelligibility Estimation Method Using a Non-reference Feature Set},
year={2015},
volume={E98-D},
number={1},
pages={21-28},
abstract={We proposed and evaluated a speech intelligibility estimation method that does not require a clean speech reference signal. The proposed method uses the features defined in the ITU-T standard P.563, which estimates the overall quality of speech without the reference signal. We selected two sets of features from the P.563 features: the basic 9-feature set, which includes basic features that characterize both speech and background noise, e.g., cepstrum skewness and LPC kurtosis, and the extended 31-feature set with 22 additional features for a more accurate description of the degraded speech and noise, e.g., SNR, average pitch, and spectral clarity, among others. Four hundred noise samples were added to speech, and about 70% of these samples were used to train a support vector regression (SVR) model. The trained models were used to estimate the intelligibility of speech degraded by added noise. The proposed method showed a root mean square error (RMSE) value of about 10% and correlation with subjective intelligibility of about 0.93 for speech distorted with known noise type, and RMSE of about 16% and a correlation of about 0.84 for speech distorted with unknown noise type, both with either the 9- or the 31-dimension feature set. These results were better than the estimation using frequency-weighted SNR calculated in critical frequency bands, which requires the clean reference signal for its calculation. We believe this level of accuracy proves the proposed method to be applicable to real-time speech quality monitoring in the field.},
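keywords={speech intelligibility, non-reference estimation, support vector regression, P.563, diagnostic rhyme test},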
doi={10.1587/transinf.2014MUP0004},
ISSN={1745-1361},
month={January},}

TY - JOUR
TI - A Speech Intelligibility Estimation Method Using a Non-reference Feature Set
T2 - IEICE TRANSACTIONS on Information
SP - 21
EP - 28
AU - Toshihiro SAKANO
AU - Yosuke KOBAYASHI
AU - Kazuhiro KONDO
PY - 2015
DO - 10.1587/transinf.2014MUP0004
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E98-D
IS - 1
JA - IEICE TRANSACTIONS on Information
Y1 - January 2015
AB - We proposed and evaluated a speech intelligibility estimation method that does not require a clean speech reference signal. The proposed method uses the features defined in the ITU-T standard P.563, which estimates the overall quality of speech without the reference signal. We selected two sets of features from the P.563 features: the basic 9-feature set, which includes basic features that characterize both speech and background noise, e.g., cepstrum skewness and LPC kurtosis, and the extended 31-feature set with 22 additional features for a more accurate description of the degraded speech and noise, e.g., SNR, average pitch, and spectral clarity, among others. Four hundred noise samples were added to speech, and about 70% of these samples were used to train a support vector regression (SVR) model. The trained models were used to estimate the intelligibility of speech degraded by added noise. The proposed method showed a root mean square error (RMSE) value of about 10% and correlation with subjective intelligibility of about 0.93 for speech distorted with known noise type, and RMSE of about 16% and a correlation of about 0.84 for speech distorted with unknown noise type, both with either the 9- or the 31-dimension feature set. These results were better than the estimation using frequency-weighted SNR calculated in critical frequency bands, which requires the clean reference signal for its calculation. We believe this level of accuracy proves the proposed method to be applicable to real-time speech quality monitoring in the field.
ER -