DNN-Based Full-Band Speech Synthesis Using GMM Approximation of Spectral Envelope

Junya KOGUCHI; Shinnosuke TAKAMICHI; Masanori MORISE; Hiroshi SARUWATARI; Shigeki SAGAYAMA

doi:10.1587/transinf.2020EDP7075

Summary:

We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis. Each Gaussian function of a GMM fits the local resonance of the spectrum. The GMM retains the fine spectral envelope and achieves high controllability of the structure. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech. Moreover, a DNN-based TTS synthesis method using GMM-based approximation has not been formulated in spite of its excellent expressive ability. Therefore, we employ peak-picking-based initialization for full-band speech analysis to provide better initialization for iterative estimation of the GMM parameters. We introduce not only the prediction error of the GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN. Furthermore, we propose a method for multi-task learning based on minimizing these errors simultaneously. We also propose a post-filter based on variance scaling of the GMM for our framework to enhance synthetic speech. Experimental results from evaluating our framework indicated that 1) the initialization method of our framework outperformed the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improved the synthetic speech; and 3) our variance-scaling-based post-filter further improved the synthetic speech.
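
To make the idea of a GMM-shaped spectral envelope concrete, the following is a minimal Python sketch, not the authors' implementation. It evaluates a weighted sum of Gaussians over a frequency axis, applies a variance scaling, and computes a mean squared reconstruction error. The function names, the toy frequency grid, the component parameters, and the scaling factor of 0.5 are all assumptions chosen for illustration; the summary does not specify the paper's actual parameterization, post-filter rule, or loss.

```python
import math


def gmm_envelope(freqs, weights, means, variances):
    """Evaluate a weighted sum of normalized Gaussians at each frequency."""
    envelope = []
    for f in freqs:
        value = 0.0
        for w, mu, var in zip(weights, means, variances):
            norm = math.sqrt(2.0 * math.pi * var)
            value += w * math.exp(-0.5 * (f - mu) ** 2 / var) / norm
        envelope.append(value)
    return envelope


def scale_variances(variances, factor):
    """Hypothetical post-filter step: multiply every variance by `factor`.

    With normalized Gaussians, a factor below 1 narrows and raises each peak.
    """
    return [var * factor for var in variances]


def reconstruction_error(target, predicted):
    """Mean squared error between two envelopes sampled on the same grid."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


# Toy two-component envelope on a 0 to 8000 Hz grid (illustrative values only).
freqs = [i * 100.0 for i in range(81)]
weights = [0.6, 0.4]
means = [700.0, 2500.0]
variances = [150.0 ** 2, 300.0 ** 2]

env = gmm_envelope(freqs, weights, means, variances)
sharp = gmm_envelope(freqs, weights, means, scale_variances(variances, 0.5))
```

In this toy setting each Gaussian plays the role of one resonance, so shrinking the variances sharpens the peaks of the reconstructed envelope while leaving their positions unchanged.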

Publication: IEICE TRANSACTIONS on Information Vol.E103-D No.12 pp.2673-2681

Publication Date: 2020/12/01

Publicized: 2020/09/03

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2020EDP7075

Type of Manuscript: PAPER

Category: Speech and Hearing

Authors

Junya KOGUCHI
  Meiji University
Shinnosuke TAKAMICHI
  University of Tokyo
Masanori MORISE
  Meiji University
Hiroshi SARUWATARI
  University of Tokyo
Shigeki SAGAYAMA
  University of Electro-Communications

Keywords

Gaussian mixture model, spectral envelope, vocoder, deep neural network, text-to-speech synthesis

Cite this

Junya KOGUCHI, Shinnosuke TAKAMICHI, Masanori MORISE, Hiroshi SARUWATARI, Shigeki SAGAYAMA, "DNN-Based Full-Band Speech Synthesis Using GMM Approximation of Spectral Envelope" in IEICE TRANSACTIONS on Information, vol. E103-D, no. 12, pp. 2673-2681, December 2020, doi: 10.1587/transinf.2020EDP7075.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2020EDP7075/_p

@ARTICLE{e103-d_12_2673,
author={Junya KOGUCHI and Shinnosuke TAKAMICHI and Masanori MORISE and Hiroshi SARUWATARI and Shigeki SAGAYAMA},
journal={IEICE TRANSACTIONS on Information},
title={DNN-Based Full-Band Speech Synthesis Using GMM Approximation of Spectral Envelope},
year={2020},
volume={E103-D},
number={12},
pages={2673--2681},
abstract={We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis. Each Gaussian function of a GMM fits the local resonance of the spectrum. The GMM retains the fine spectral envelope and achieves high controllability of the structure. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech. Moreover, a DNN-based TTS synthesis method using GMM-based approximation has not been formulated in spite of its excellent expressive ability. Therefore, we employ peak-picking-based initialization for full-band speech analysis to provide better initialization for iterative estimation of the GMM parameters. We introduce not only the prediction error of the GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN. Furthermore, we propose a method for multi-task learning based on minimizing these errors simultaneously. We also propose a post-filter based on variance scaling of the GMM for our framework to enhance synthetic speech. Experimental results from evaluating our framework indicated that 1) the initialization method of our framework outperformed the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improved the synthetic speech; and 3) our variance-scaling-based post-filter further improved the synthetic speech.},
keywords={Gaussian mixture model, spectral envelope, vocoder, deep neural network, text-to-speech synthesis},
doi={10.1587/transinf.2020EDP7075},
ISSN={1745-1361},
month={December},}

TY  - JOUR
TI  - DNN-Based Full-Band Speech Synthesis Using GMM Approximation of Spectral Envelope
T2  - IEICE TRANSACTIONS on Information
SP  - 2673
EP  - 2681
AU  - Junya KOGUCHI
AU  - Shinnosuke TAKAMICHI
AU  - Masanori MORISE
AU  - Hiroshi SARUWATARI
AU  - Shigeki SAGAYAMA
PY  - 2020
DO  - 10.1587/transinf.2020EDP7075
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E103-D
IS  - 12
JA  - IEICE TRANSACTIONS on Information
Y1  - December 2020
AB  - We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis. Each Gaussian function of a GMM fits the local resonance of the spectrum. The GMM retains the fine spectral envelope and achieves high controllability of the structure. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech. Moreover, a DNN-based TTS synthesis method using GMM-based approximation has not been formulated in spite of its excellent expressive ability. Therefore, we employ peak-picking-based initialization for full-band speech analysis to provide better initialization for iterative estimation of the GMM parameters. We introduce not only the prediction error of the GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN. Furthermore, we propose a method for multi-task learning based on minimizing these errors simultaneously. We also propose a post-filter based on variance scaling of the GMM for our framework to enhance synthetic speech. Experimental results from evaluating our framework indicated that 1) the initialization method of our framework outperformed the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improved the synthetic speech; and 3) our variance-scaling-based post-filter further improved the synthetic speech.
ER  -