DNN-Based Speech Synthesis Using Speaker Codes

Nobukatsu HOJO, Yusuke IJIMA, Hideyuki MIZUNO

Summary:

Deep neural network (DNN)-based speech synthesis can produce more natural synthesized speech than conventional HMM-based speech synthesis. However, it has not been clarified whether synthesized speech quality can be further improved by utilizing a multi-speaker speech corpus. To address this problem, this paper proposes DNN-based speech synthesis using speaker codes as a method to improve on the conventional speaker-dependent DNN-based method. To model speaker variation in the DNN, an augmented feature (a speaker code) is fed to the hidden layer(s) of the conventional DNN. This paper investigates the effectiveness of introducing speaker codes into DNN acoustic models for speech synthesis on two tasks: multi-speaker modeling and speaker adaptation. For the multi-speaker modeling task, the proposed method trains the connection weights of the whole DNN using a multi-speaker speech corpus. When performing multi-speaker synthesis, the speaker code corresponding to the selected target speaker is fed to the DNN to generate that speaker's voice. When performing speaker adaptation, a set of connection weights of the multi-speaker model is re-estimated to generate a new target speaker's voice. We investigated the relationship between prediction performance and DNN architecture through objective measurements. Objective evaluation experiments revealed that the proposed model outperformed conventional methods (HMMs, speaker-dependent DNNs, and multi-speaker DNNs based on a shared-hidden-layer structure). Subjective evaluation results showed that the proposed model again outperformed the conventional methods (HMMs, speaker-dependent DNNs), especially when only a small number of target speaker utterances were available.
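
To make the speaker-code idea concrete, the following is a minimal sketch, not the authors' implementation: a feed-forward acoustic model whose first hidden layer receives linguistic features concatenated with a one-hot speaker code. The framework (PyTorch), all layer sizes, the one-hot code representation, and the name SpeakerCodeDNN are illustrative assumptions; the paper does not specify this code.

    # Minimal sketch, NOT the paper's implementation. Dimensions, layer
    # count, and the one-hot code representation are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeakerCodeDNN(nn.Module):
        """Feed-forward acoustic model with a speaker code fed to hidden layer 1."""
        def __init__(self, ling_dim=300, acoustic_dim=187,
                     hidden_dim=1024, n_speakers=10):
            super().__init__()
            self.n_speakers = n_speakers
            # The speaker code is concatenated to the first hidden layer's input.
            self.layer1 = nn.Linear(ling_dim + n_speakers, hidden_dim)
            self.layer2 = nn.Linear(hidden_dim, hidden_dim)
            self.layer3 = nn.Linear(hidden_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, acoustic_dim)

        def forward(self, ling_feats, speaker_id):
            # One-hot speaker code selects which speaker's voice to generate.
            code = F.one_hot(speaker_id, self.n_speakers).float()
            h = torch.sigmoid(self.layer1(torch.cat([ling_feats, code], dim=-1)))
            h = torch.sigmoid(self.layer2(h))
            h = torch.sigmoid(self.layer3(h))
            return self.out(h)  # predicted acoustic features

    model = SpeakerCodeDNN()

    # Multi-speaker synthesis: feed the target speaker's code at run time.
    ling = torch.randn(8, 300)                   # a batch of linguistic features
    spk = torch.full((8,), 3, dtype=torch.long)  # target speaker index 3
    acoustic = model(ling, spk)

    # Speaker adaptation: re-estimate a set of connection weights (here,
    # only the code-dependent first layer) on the new target speaker's
    # utterances, keeping the rest of the multi-speaker model frozen.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.layer1.parameters():
        p.requires_grad = True
    optimizer = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad], lr=0.01)

Which weights are re-estimated during adaptation is a design choice; freezing all but the code-dependent layer, as above, is one plausible reading of re-estimating "a set of connection weights of the multi-speaker model".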

Publication
IEICE TRANSACTIONS on Information Vol.E101-D No.2 pp.462-472
Publication Date
2018/02/01
Publicized
2017/11/01
Online ISSN
1745-1361
DOI
10.1587/transinf.2017EDP7165
Type of Manuscript
PAPER
Category
Speech and Hearing

Authors

Nobukatsu HOJO
  NTT Corporation
Yusuke IJIMA
  NTT Corporation
Hideyuki MIZUNO
  Tokyo University of Science