Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training

Junichi YAMAGISHI; Takao KOBAYASHI

doi:10.1093/ietisy/e90-d.2.533

Summary:

In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. For simultaneous adaptation of spectrum, F0 and phone duration within the HMM framework, we need to transform not only the state output distributions corresponding to spectrum and F0 but also the duration distributions corresponding to phone duration. However, it is not straightforward to adapt the state duration because the original HMM does not have explicit duration distributions. Therefore, we utilize the framework of the hidden semi-Markov model (HSMM), which is an HMM having explicit state duration distributions, and we apply an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system, and show their effectiveness from the results of subjective and objective evaluation tests.
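
As a rough illustration of what transforming both kinds of distribution means, the sketch below applies one affine transform to the mean of a Gaussian state output distribution and a separate affine transform to the mean of a Gaussian state duration distribution. The summary does not state the form of the transform: the affine (MLLR-style) form, the Gaussian assumption, and all names (`HsmmState`, `adapt_state`, `A`, `b`, `scale`, `shift`) are assumptions made for illustration only, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HsmmState:
    """One HSMM state, reduced to the two means that get adapted (hypothetical layout)."""
    output_mean: List[float]  # mean of the state output distribution (e.g. spectrum, F0)
    duration_mean: float      # mean of the explicit state duration distribution, in frames


def adapt_state(state: HsmmState,
                A: List[List[float]], b: List[float],
                scale: float, shift: float) -> HsmmState:
    """Move an average-voice state toward a target speaker (assumed affine form).

    Output mean:   mu' = A @ mu + b
    Duration mean: m'  = scale * m + shift
    Both are transformed in the same call, mirroring the simultaneous
    adaptation of output and duration distributions described above.
    """
    new_output = [
        sum(a_ij * mu_j for a_ij, mu_j in zip(row, state.output_mean)) + b_i
        for row, b_i in zip(A, b)
    ]
    new_duration = scale * state.duration_mean + shift
    return HsmmState(output_mean=new_output, duration_mean=new_duration)


# Toy values, chosen only to make the arithmetic easy to follow.
average = HsmmState(output_mean=[1.0, 2.0], duration_mean=5.0)
adapted = adapt_state(
    average,
    A=[[1.0, 0.0], [0.0, 2.0]],
    b=[0.5, -1.0],
    scale=1.5,
    shift=1.0,
)
print(adapted)
```

In a plain HMM there is no `duration_mean` to transform, since durations arise only implicitly from self-transition probabilities; this is the gap the explicit duration distributions of the HSMM are meant to close.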

Publication: IEICE TRANSACTIONS on Information Vol.E90-D No.2 pp.533-543

Publication Date: 2007/02/01

Online ISSN: 1745-1361

DOI: 10.1093/ietisy/e90-d.2.533

Type of Manuscript: PAPER

Category: Speech and Hearing

Junichi YAMAGISHI, Takao KOBAYASHI, "Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training" in IEICE TRANSACTIONS on Information, vol. E90-D, no. 2, pp. 533-543, February 2007, doi: 10.1093/ietisy/e90-d.2.533.
URL: https://global.ieice.org/en_transactions/information/10.1093/ietisy/e90-d.2.533/_p

@ARTICLE{e90-d_2_533,
author={Junichi YAMAGISHI and Takao KOBAYASHI},
journal={IEICE TRANSACTIONS on Information},
title={Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training},
year={2007},
volume={E90-D},
number={2},
pages={533--543},
abstract={In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. For simultaneous adaptation of spectrum, F0 and phone duration within the HMM framework, we need to transform not only the state output distributions corresponding to spectrum and F0 but also the duration distributions corresponding to phone duration. However, it is not straightforward to adapt the state duration because the original HMM does not have explicit duration distributions. Therefore, we utilize the framework of the hidden semi-Markov model (HSMM), which is an HMM having explicit state duration distributions, and we apply an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system, and show their effectiveness from the results of subjective and objective evaluation tests.},
keywords={},
doi={10.1093/ietisy/e90-d.2.533},
ISSN={1745-1361},
month={February}
}

TY - JOUR
TI - Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training
T2 - IEICE TRANSACTIONS on Information
SP - 533
EP - 543
AU - Junichi YAMAGISHI
AU - Takao KOBAYASHI
PY - 2007
DO - 10.1093/ietisy/e90-d.2.533
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E90-D
IS - 2
JA - IEICE TRANSACTIONS on Information
Y1 - February 2007
AB - In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. For simultaneous adaptation of spectrum, F0 and phone duration within the HMM framework, we need to transform not only the state output distributions corresponding to spectrum and F0 but also the duration distributions corresponding to phone duration. However, it is not straightforward to adapt the state duration because the original HMM does not have explicit duration distributions. Therefore, we utilize the framework of the hidden semi-Markov model (HSMM), which is an HMM having explicit state duration distributions, and we apply an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system, and show their effectiveness from the results of subjective and objective evaluation tests.
ER -