This paper presents a novel statistical sample-based approach for Gaussian Mixture Model (GMM)-based Voice Conversion (VC). Although GMM-based VC offers the promising flexibility of model adaptation, the quality of converted speech is significantly worse than that of natural speech. This paper addresses the problem of inaccurate modeling, which is one of the main causes of this quality degradation. Recently, we proposed statistical sample-based speech synthesis using rich context models for high-quality and flexible Hidden Markov Model (HMM)-based Text-To-Speech (TTS) synthesis. This method makes it possible not only to produce high-quality speech by introducing ideas from unit selection synthesis, but also to preserve the flexibility of the original HMM-based TTS. In this paper, we apply this idea to GMM-based VC. The rich context models are first trained for individual joint speech feature vectors, and then gathered mixture by mixture to form a Rich context-GMM (R-GMM). In conversion, an iterative generation algorithm using R-GMMs converts the speech parameters, after initialization using over-trained probability distributions. Because the proposed method utilizes individual speech features and its formulation is the same as that of conventional GMM-based VC, it can produce high-quality speech while keeping the flexibility of the original GMM-based VC. The experimental results demonstrate that the proposed method yields significant improvements in terms of speech quality and speaker individuality of the converted speech.
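The sketch below is a minimal, illustrative reading of the sample-based idea summarized in the abstract, not the authors' implementation: it fits a conventional joint-density GMM on stacked source/target vectors, keeps the individual training samples of each mixture as stand-ins for the rich context components, and converts a source frame by selecting within the most responsible mixture. Dynamic features, the tied-covariance acoustic models, and the iterative generation algorithm with over-trained initialization are omitted, and all function names (fit_joint_gmm, build_rich_context_gmm, convert_frame) are hypothetical.

# Conceptual sketch only (assumed names and simplifications; see lead-in above).
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture


def fit_joint_gmm(joint_feats, n_mix=4, seed=0):
    """Fit a conventional joint-density GMM on stacked [source; target] vectors."""
    gmm = GaussianMixture(n_components=n_mix, covariance_type="full", random_state=seed)
    gmm.fit(joint_feats)
    return gmm


def build_rich_context_gmm(gmm, joint_feats):
    """Group the individual joint feature vectors mixture by mixture; each group
    stands in for the rich context components of that mixture (the R-GMM idea)."""
    assign = gmm.predict(joint_feats)
    return {m: joint_feats[assign == m] for m in range(gmm.n_components)}


def source_responsibilities(gmm, x_src, dim_src):
    """Mixture posteriors given only the source part, using each joint Gaussian's
    source marginal (mean[:dim_src], covariance[:dim_src, :dim_src])."""
    lik = np.array([
        multivariate_normal.pdf(
            x_src,
            mean=gmm.means_[m, :dim_src],
            cov=gmm.covariances_[m, :dim_src, :dim_src],
        )
        for m in range(gmm.n_components)
    ])
    post = gmm.weights_ * lik
    return post / post.sum()


def convert_frame(gmm, rich, x_src, dim_src):
    """Convert one source frame: pick the most responsible mixture, then return
    the target part of the nearest rich-context (training) sample in that mixture."""
    m = int(np.argmax(source_responsibilities(gmm, x_src, dim_src)))
    samples = rich[m]
    if len(samples) == 0:  # fall back to the mixture mean if the group is empty
        return gmm.means_[m, dim_src:]
    idx = np.argmin(np.linalg.norm(samples[:, :dim_src] - x_src, axis=1))
    return samples[idx, dim_src:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 2
    src = rng.normal(size=(500, dim))                        # toy "source speaker" features
    tgt = src @ np.array([[0.8, 0.1], [0.0, 1.2]]) + 0.05 * rng.normal(size=(500, dim))
    joint = np.hstack([src, tgt])                            # joint [source; target] vectors
    gmm = fit_joint_gmm(joint)
    rich = build_rich_context_gmm(gmm, joint)
    print(convert_frame(gmm, rich, src[0], dim))             # converted target-side frame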
Shinnosuke TAKAMICHI
Nara Institute of Science and Technology
Tomoki TODA
Nagoya University
Graham NEUBIG
Nara Institute of Science and Technology
Sakriani SAKTI
Nara Institute of Science and Technology
Satoshi NAKAMURA
Nara Institute of Science and Technology
Shinnosuke TAKAMICHI, Tomoki TODA, Graham NEUBIG, Sakriani SAKTI, Satoshi NAKAMURA, "A Statistical Sample-Based Approach to GMM-Based Voice Conversion Using Tied-Covariance Acoustic Models" in IEICE TRANSACTIONS on Information, vol. E99-D, no. 10, pp. 2490-2498, October 2016, doi: 10.1587/transinf.2016SLP0020.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016SLP0020/_p
@ARTICLE{e99-d_10_2490,
author={Shinnosuke TAKAMICHI and Tomoki TODA and Graham NEUBIG and Sakriani SAKTI and Satoshi NAKAMURA},
journal={IEICE TRANSACTIONS on Information},
title={A Statistical Sample-Based Approach to GMM-Based Voice Conversion Using Tied-Covariance Acoustic Models},
year={2016},
volume={E99-D},
number={10},
pages={2490-2498},
keywords={},
doi={10.1587/transinf.2016SLP0020},
ISSN={1745-1361},
month={October},}
TY - JOUR
TI - A Statistical Sample-Based Approach to GMM-Based Voice Conversion Using Tied-Covariance Acoustic Models
T2 - IEICE TRANSACTIONS on Information
SP - 2490
EP - 2498
AU - Shinnosuke TAKAMICHI
AU - Tomoki TODA
AU - Graham NEUBIG
AU - Sakriani SAKTI
AU - Satoshi NAKAMURA
PY - 2016
DO - 10.1587/transinf.2016SLP0020
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E99-D
IS - 10
JA - IEICE TRANSACTIONS on Information
Y1 - October 2016
ER -