Integration of Spectral Feature Extraction and Modeling for HMM-Based Speech Synthesis

Kazuhiro NAKAMURA; Kei HASHIMOTO; Yoshihiko NANKAKU; Keiichi TOKUDA

doi:10.1587/transinf.E97.D.1438

IEICE TRANSACTIONS on Information

Summary:

This paper proposes a novel approach for integrating spectral feature extraction and acoustic modeling in hidden Markov model (HMM) based speech synthesis. The statistical modeling process of speech waveforms is typically divided into two component modules: the frame-by-frame feature extraction module and the acoustic modeling module. In the feature extraction module, the statistical mel-cepstral analysis technique has been used and the objective function is the likelihood of mel-cepstral coefficients for given speech waveforms. In the acoustic modeling module, the objective function is the likelihood of model parameters for given mel-cepstral coefficients. It is important to improve the performance of each component module for achieving higher quality synthesized speech. However, the final objective of speech synthesis systems is to generate natural speech waveforms from given texts, and the improvement of each component module does not always lead to the improvement of the quality of synthesized speech. Therefore, ideally all objective functions should be optimized based on an integrated criterion which well represents subjective speech quality of human perception. In this paper, we propose an approach to model speech waveforms directly and optimize the final objective function. Experimental results show that the proposed method outperformed the conventional methods in objective and subjective measures.
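
As a rough sketch of the contrast the summary draws (the notation is an assumption for illustration, not taken from the paper): let x denote the speech waveform, c the mel-cepstral coefficient sequence, and \lambda the acoustic model parameters. The conventional two-module pipeline optimizes the two likelihoods separately, while an integrated criterion would optimize the waveform likelihood directly, treating c as a latent variable. HMM state sequences and the text labels are omitted here for brevity.

\[
\hat{c} = \arg\max_{c} \, p(x \mid c), \qquad
\hat{\lambda} = \arg\max_{\lambda} \, p(\hat{c} \mid \lambda)
\]
\[
\hat{\lambda} = \arg\max_{\lambda} \, p(x \mid \lambda)
             = \arg\max_{\lambda} \int p(x \mid c) \, p(c \mid \lambda) \, dc
\]

The first line is the separate (conventional) optimization; the second is one plausible form of the integrated objective, in which improving the fit to c matters only insofar as it improves the fit to x.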

Publication: IEICE TRANSACTIONS on Information Vol.E97-D No.6 pp.1438-1448

Publication Date: 2014/06/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.E97.D.1438

Type of Manuscript: Special Section PAPER (Special Section on Advances in Modeling for Real-world Speech Information Processing and its Application)

Category: HMM-based Speech Synthesis

Authors

Kazuhiro NAKAMURA
  Nagoya Institute of Technology
Kei HASHIMOTO
  Nagoya Institute of Technology
Yoshihiko NANKAKU
  Nagoya Institute of Technology
Keiichi TOKUDA
  Nagoya Institute of Technology

Keywords

integrative model, HMM-based speech synthesis, acoustic modeling, mel-cepstral analysis, trajectory HMM

Cite this

Kazuhiro NAKAMURA, Kei HASHIMOTO, Yoshihiko NANKAKU, Keiichi TOKUDA, "Integration of Spectral Feature Extraction and Modeling for HMM-Based Speech Synthesis" in IEICE TRANSACTIONS on Information, vol. E97-D, no. 6, pp. 1438-1448, June 2014, doi: 10.1587/transinf.E97.D.1438.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E97.D.1438/_p

@ARTICLE{e97-d_6_1438,
author={Kazuhiro NAKAMURA and Kei HASHIMOTO and Yoshihiko NANKAKU and Keiichi TOKUDA},
journal={IEICE TRANSACTIONS on Information},
title={Integration of Spectral Feature Extraction and Modeling for HMM-Based Speech Synthesis},
year={2014},
volume={E97-D},
number={6},
pages={1438--1448},
abstract={This paper proposes a novel approach for integrating spectral feature extraction and acoustic modeling in hidden Markov model (HMM) based speech synthesis. The statistical modeling process of speech waveforms is typically divided into two component modules: the frame-by-frame feature extraction module and the acoustic modeling module. In the feature extraction module, the statistical mel-cepstral analysis technique has been used and the objective function is the likelihood of mel-cepstral coefficients for given speech waveforms. In the acoustic modeling module, the objective function is the likelihood of model parameters for given mel-cepstral coefficients. It is important to improve the performance of each component module for achieving higher quality synthesized speech. However, the final objective of speech synthesis systems is to generate natural speech waveforms from given texts, and the improvement of each component module does not always lead to the improvement of the quality of synthesized speech. Therefore, ideally all objective functions should be optimized based on an integrated criterion which well represents subjective speech quality of human perception. In this paper, we propose an approach to model speech waveforms directly and optimize the final objective function. Experimental results show that the proposed method outperformed the conventional methods in objective and subjective measures.},
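keywords={integrative model, HMM-based speech synthesis, acoustic modeling, mel-cepstral analysis, trajectory HMM},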
doi={10.1587/transinf.E97.D.1438},
ISSN={1745-1361},
month={June},}

TY - JOUR
TI - Integration of Spectral Feature Extraction and Modeling for HMM-Based Speech Synthesis
T2 - IEICE TRANSACTIONS on Information
SP - 1438
EP - 1448
AU - Kazuhiro NAKAMURA
AU - Kei HASHIMOTO
AU - Yoshihiko NANKAKU
AU - Keiichi TOKUDA
PY - 2014
DO - 10.1587/transinf.E97.D.1438
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E97-D
IS - 6
JA - IEICE TRANSACTIONS on Information
Y1 - June 2014
AB - This paper proposes a novel approach for integrating spectral feature extraction and acoustic modeling in hidden Markov model (HMM) based speech synthesis. The statistical modeling process of speech waveforms is typically divided into two component modules: the frame-by-frame feature extraction module and the acoustic modeling module. In the feature extraction module, the statistical mel-cepstral analysis technique has been used and the objective function is the likelihood of mel-cepstral coefficients for given speech waveforms. In the acoustic modeling module, the objective function is the likelihood of model parameters for given mel-cepstral coefficients. It is important to improve the performance of each component module for achieving higher quality synthesized speech. However, the final objective of speech synthesis systems is to generate natural speech waveforms from given texts, and the improvement of each component module does not always lead to the improvement of the quality of synthesized speech. Therefore, ideally all objective functions should be optimized based on an integrated criterion which well represents subjective speech quality of human perception. In this paper, we propose an approach to model speech waveforms directly and optimize the final objective function. Experimental results show that the proposed method outperformed the conventional methods in objective and subjective measures.
ER -
