Generative Moment Matching Network-Based Neural Double-Tracking for Synthesized and Natural Singing Voices

Hiroki TAMARU; Yuki SAITO; Shinnosuke TAKAMICHI; Tomoki KORIYAMA; Hiroshi SARUWATARI

doi:10.1587/transinf.2019EDP7228

Summary:

This paper proposes a post-filtering method based on a generative moment matching network (GMMN) that adds inter-utterance pitch variation to singing voices, and discusses its application to neural double-tracking (NDT), a mixing method we developed. When a human singer records the same song twice, the two recordings differ. This difference, called inter-utterance variation, enriches the performer's musical expression and the audience's experience. For example, it makes every concert special because no performance recurs in exactly the same manner. Inter-utterance variation also enables a mixing method called double-tracking (DT), in which the same phrase is recorded twice and the two recordings are mixed to give richness to the singing voice. However, synthesized singing voices, which are commonly used to create music, have no inter-utterance variation because the synthesis process is deterministic. There is also no inter-utterance variation when only one voice is recorded. A signal processing-based method called artificial DT (ADT) can layer singing voices, but the signal processing results in unnatural sound artifacts. To solve these problems, we propose a post-filtering method that randomly modulates synthesized or natural singing voices as if the singer had sung again. The post-filter built with our method models the inter-utterance pitch variation of human singing voices using a conditional GMMN. Evaluation results indicate that 1) the proposed method provides perceptible and natural inter-utterance variation to synthesized singing voices and that 2) our NDT exhibits higher double-trackedness than ADT when applied to both synthesized and natural singing voices.
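A GMMN is generally trained by minimizing the maximum mean discrepancy (MMD) between samples it generates and samples of real data. The summary does not state the kernel, the pitch features, or how the conditioning is implemented, so the following is only a minimal sketch of the moment-matching criterion itself, assuming a Gaussian kernel and short pitch-contour vectors. All function names, the kernel width, and the toy data are hypothetical and are not taken from the paper.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def squared_mmd(generated, natural, sigma=1.0):
    """Biased estimate of the squared MMD between two sets of vectors.

    A small value means the two sample sets have similar statistics;
    a GMMN-style generator would be trained to make this small.
    """
    return (mean_kernel(generated, generated, sigma)
            - 2.0 * mean_kernel(generated, natural, sigma)
            + mean_kernel(natural, natural, sigma))


# Toy data: each vector stands in for a short pitch-variation contour.
natural_contours = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.05]]
close_contours = [[0.05, 0.05], [0.0, 0.0], [-0.05, 0.1]]
distant_contours = [[3.0, 3.1], [3.1, 3.0], [2.9, 3.05]]

print("close  :", squared_mmd(close_contours, natural_contours))
print("distant:", squared_mmd(distant_contours, natural_contours))
```

In this sketch, generated contours whose statistics resemble the natural ones yield a smaller value than contours that lie far away, which is the property a moment-matching objective relies on. A conditional model would additionally feed conditioning features to the generator, which is not shown here.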

Publication: IEICE TRANSACTIONS on Information, Vol. E103-D, No. 3, pp. 639-647

Publication Date: 2020/03/01

Publicized: 2019/12/23

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2019EDP7228

Type of Manuscript: PAPER

Category: Speech and Hearing

Authors

Hiroki TAMARU
  The University of Tokyo
Yuki SAITO
  The University of Tokyo
Shinnosuke TAKAMICHI
  The University of Tokyo
Tomoki KORIYAMA
  The University of Tokyo
Hiroshi SARUWATARI
  The University of Tokyo

Keywords

DNN-based singing-voice synthesis, generative moment matching network, inter-utterance pitch variation, artificial double-tracking, modulation spectrum

Cite this

Hiroki TAMARU, Yuki SAITO, Shinnosuke TAKAMICHI, Tomoki KORIYAMA, Hiroshi SARUWATARI, "Generative Moment Matching Network-Based Neural Double-Tracking for Synthesized and Natural Singing Voices" in IEICE TRANSACTIONS on Information, vol. E103-D, no. 3, pp. 639-647, March 2020, doi: 10.1587/transinf.2019EDP7228.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2019EDP7228/_p

@ARTICLE{e103-d_3_639,
author={Hiroki TAMARU and Yuki SAITO and Shinnosuke TAKAMICHI and Tomoki KORIYAMA and Hiroshi SARUWATARI},
journal={IEICE TRANSACTIONS on Information},
title={Generative Moment Matching Network-Based Neural Double-Tracking for Synthesized and Natural Singing Voices},
year={2020},
volume={E103-D},
number={3},
pages={639-647},
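keywords={DNN-based singing-voice synthesis, generative moment matching network, inter-utterance pitch variation, artificial double-tracking, modulation spectrum},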
doi={10.1587/transinf.2019EDP7228},
ISSN={1745-1361},
month={March},}

TY - JOUR
TI - Generative Moment Matching Network-Based Neural Double-Tracking for Synthesized and Natural Singing Voices
T2 - IEICE TRANSACTIONS on Information
SP - 639
EP - 647
AU - Hiroki TAMARU
AU - Yuki SAITO
AU - Shinnosuke TAKAMICHI
AU - Tomoki KORIYAMA
AU - Hiroshi SARUWATARI
PY - 2020
DO - 10.1587/transinf.2019EDP7228
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E103-D
IS - 3
JA - IEICE TRANSACTIONS on Information
Y1 - March 2020
ER -