This paper proposes Deep Neural Network (DNN)-based Voice Conversion (VC) using input-to-output highway networks. VC is a speech synthesis technique that converts input features into output speech parameters, and DNN-based acoustic models for VC estimate the output speech parameters from the input speech parameters. Because the input and output in VC are often in the same domain (e.g., cepstrum), this paper proposes a VC method that uses highway networks connected from the input to the output. The acoustic models predict the weighted spectral differentials between the input and output spectral parameters. The architecture not only alleviates the over-smoothing effects that degrade speech quality, but also effectively represents the characteristics of spectral parameters. The experimental results demonstrate that the proposed architecture outperforms feed-forward neural networks in terms of both the speech quality and the speaker individuality of the converted speech.
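The input-to-output connection described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no equations, so the sketch assumes the converted features take the form x + T(x) * G(x), where G is a small feed-forward network predicting a spectral differential and T is a sigmoid gate that weights it. The layer sizes, the ReLU activation, the random initialization, and the names `init_params` and `highway_convert` are all hypothetical choices made for illustration, and no training is shown.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function, used here as the gate activation."""
    return 1.0 / (1.0 + np.exp(-z))


def init_params(dim, hidden, rng):
    """Randomly initialize a toy model (sizes and scales are assumptions)."""
    return {
        # Feed-forward branch G: predicts a spectral differential.
        "W1": rng.normal(0.0, 0.1, (dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, dim)),
        "b2": np.zeros(dim),
        # Gate branch T: per-dimension weights in (0, 1).
        "Wt": rng.normal(0.0, 0.1, (dim, dim)),
        "bt": np.zeros(dim),
    }


def highway_convert(x, p):
    """Return x + T(x) * G(x): the input plus a gated spectral differential."""
    hidden = np.maximum(0.0, x @ p["W1"] + p["b1"])  # ReLU hidden layer
    differential = hidden @ p["W2"] + p["b2"]        # G(x)
    gate = sigmoid(x @ p["Wt"] + p["bt"])            # T(x)
    return x + gate * differential                   # input-to-output path


rng = np.random.default_rng(0)
params = init_params(dim=24, hidden=64, rng=rng)
source = rng.normal(size=(5, 24))  # 5 frames of 24-dim stand-in features
converted = highway_convert(source, params)
print(converted.shape)
```

Under this assumed formulation, the network only has to model how the output differs from the input: if the predicted differential is zero, the input features pass through unchanged.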
Yuki SAITO
The University of Tokyo
Shinnosuke TAKAMICHI
The University of Tokyo
Hiroshi SARUWATARI
The University of Tokyo
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Yuki SAITO, Shinnosuke TAKAMICHI, Hiroshi SARUWATARI, "Voice Conversion Using Input-to-Output Highway Networks" in IEICE TRANSACTIONS on Information,
vol. E100-D, no. 8, pp. 1925-1928, August 2017, doi: 10.1587/transinf.2017EDL8034.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2017EDL8034/_p
@ARTICLE{e100-d_8_1925,
author={Yuki SAITO and Shinnosuke TAKAMICHI and Hiroshi SARUWATARI},
journal={IEICE TRANSACTIONS on Information},
title={Voice Conversion Using Input-to-Output Highway Networks},
year={2017},
volume={E100-D},
number={8},
pages={1925-1928},
abstract={This paper proposes Deep Neural Network (DNN)-based Voice Conversion (VC) using input-to-output highway networks. VC is a speech synthesis technique that converts input features into output speech parameters, and DNN-based acoustic models for VC are used to estimate the output speech parameters from the input speech parameters. Given that the input and output are often in the same domain (e.g., cepstrum) in VC, this paper proposes a VC using highway networks connected from the input to output. The acoustic models predict the weighted spectral differentials between the input and output spectral parameters. The architecture not only alleviates over-smoothing effects that degrade speech quality, but also effectively represents the characteristics of spectral parameters. The experimental results demonstrate that the proposed architecture outperforms Feed-Forward neural networks in terms of the speech quality and speaker individuality of the converted speech.},
keywords={},
doi={10.1587/transinf.2017EDL8034},
ISSN={1745-1361},
month={August},}
TY - JOUR
TI - Voice Conversion Using Input-to-Output Highway Networks
T2 - IEICE TRANSACTIONS on Information
SP - 1925
EP - 1928
AU - Yuki SAITO
AU - Shinnosuke TAKAMICHI
AU - Hiroshi SARUWATARI
PY - 2017
DO - 10.1587/transinf.2017EDL8034
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E100-D
IS - 8
JA - IEICE TRANSACTIONS on Information
Y1 - August 2017
AB - This paper proposes Deep Neural Network (DNN)-based Voice Conversion (VC) using input-to-output highway networks. VC is a speech synthesis technique that converts input features into output speech parameters, and DNN-based acoustic models for VC are used to estimate the output speech parameters from the input speech parameters. Given that the input and output are often in the same domain (e.g., cepstrum) in VC, this paper proposes a VC using highway networks connected from the input to output. The acoustic models predict the weighted spectral differentials between the input and output spectral parameters. The architecture not only alleviates over-smoothing effects that degrade speech quality, but also effectively represents the characteristics of spectral parameters. The experimental results demonstrate that the proposed architecture outperforms Feed-Forward neural networks in terms of the speech quality and speaker individuality of the converted speech.
ER -