Improvements of Voice Timbre Control Based on Perceived Age in Singing Voice Conversion

Kazuhiro KOBAYASHI; Tomoki TODA; Tomoyasu NAKANO; Masataka GOTO; Satoshi NAKAMURA

doi:10.1587/transinf.2016EDP7234

IEICE TRANSACTIONS on Information

Summary:

A statistical voice timbre control technique based on perceived age has been developed as one of the techniques that enable individual singers to produce varieties of voice timbre beyond their own physical constraints. In this technique, the perceived age of a singing voice, i.e., the age of the singer as perceived by the listener, serves as an intuitively understandable measure for describing the voice characteristics of the singing voice. Statistical voice conversion (SVC) with a singer-dependent multiple-regression Gaussian mixture model (MR-GMM), which effectively models the voice timbre variations caused by a change in perceived age, makes it possible for individual singers to manipulate the perceived ages of their own singing voices while retaining their singer identities. However, several issues remain, e.g., 1) the controllable range of the perceived age is limited; 2) the quality of the converted singing voice is significantly degraded compared with that of a natural singing voice; and 3) each singer needs to sing the same phrase set as that sung by a reference singer to develop the singer-dependent MR-GMM. To address these issues, we propose the following three methods: 1) a method using gender-dependent modeling to expand the controllable range of the perceived age; 2) a method using direct waveform modification based on the spectrum differential to improve the quality of the converted singing voice; and 3) a rapid unsupervised adaptation method based on maximum a posteriori (MAP) estimation to easily develop the singer-dependent MR-GMM. The experimental results show that the proposed methods achieve a wider controllable range of the perceived age, a significant improvement in the quality of the converted singing voice, and the development of the singer-dependent MR-GMM using only a few arbitrary phrases as adaptation data.
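
As a rough illustration of how a multiple-regression model can expose perceived age as a single control value, the sketch below treats the mean vector of one mixture component as a linear function of a scalar perceived-age score. The linear form and the names `component_mean`, `bias`, `regression`, and `age_score` are assumptions made for illustration only; the summary above does not give the model equations, and the numbers are made up rather than taken from the paper.

```python
from typing import List


def component_mean(bias: List[float], regression: List[float], age_score: float) -> List[float]:
    """Return bias + regression * age_score for one mixture component.

    Hypothetical helper: assumes the component mean depends linearly
    on a scalar perceived-age score.
    """
    if len(bias) != len(regression):
        raise ValueError("bias and regression must have the same length")
    return [b + r * age_score for b, r in zip(bias, regression)]


if __name__ == "__main__":
    # Made-up 2-dimensional parameters, not values from the paper.
    bias = [1.0, 2.0]
    regression = [0.5, -0.5]
    for score in (-1.0, 0.0, 2.0):
        print(score, component_mean(bias, regression, score))
```

In this sketch, changing only `age_score` moves the mean along the regression direction while the other parameters stay fixed, which is the sense in which a single score could steer voice timbre.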

Publication: IEICE TRANSACTIONS on Information Vol.E99-D No.11 pp.2767-2777

Publication Date: 2016/11/01

Publicized: 2016/07/21

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2016EDP7234

Type of Manuscript: PAPER

Category: Speech and Hearing

Authors

Kazuhiro KOBAYASHI
  Nara Institute of Science and Technology (NAIST)
Tomoki TODA
  Nagoya University
Tomoyasu NAKANO
  National Institute of Advanced Industrial Science and Technology (AIST)
Masataka GOTO
  National Institute of Advanced Industrial Science and Technology (AIST)
Satoshi NAKAMURA
  Nara Institute of Science and Technology (NAIST)

Keywords

statistical singing voice conversion, perceived age, gender-dependent modeling, direct waveform modification, unsupervised adaptation

Cite this

Kazuhiro KOBAYASHI, Tomoki TODA, Tomoyasu NAKANO, Masataka GOTO, Satoshi NAKAMURA, "Improvements of Voice Timbre Control Based on Perceived Age in Singing Voice Conversion" in IEICE TRANSACTIONS on Information, vol. E99-D, no. 11, pp. 2767-2777, November 2016, doi: 10.1587/transinf.2016EDP7234.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016EDP7234/_p

@ARTICLE{e99-d_11_2767,
author={Kazuhiro KOBAYASHI and Tomoki TODA and Tomoyasu NAKANO and Masataka GOTO and Satoshi NAKAMURA},
journal={IEICE TRANSACTIONS on Information},
title={Improvements of Voice Timbre Control Based on Perceived Age in Singing Voice Conversion},
year={2016},
volume={E99-D},
number={11},
pages={2767-2777},
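keywords={statistical singing voice conversion, perceived age, gender-dependent modeling, direct waveform modification, unsupervised adaptation},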
doi={10.1587/transinf.2016EDP7234},
ISSN={1745-1361},
month={November},}

TY - JOUR
TI - Improvements of Voice Timbre Control Based on Perceived Age in Singing Voice Conversion
T2 - IEICE TRANSACTIONS on Information
SP - 2767
EP - 2777
AU - Kazuhiro KOBAYASHI
AU - Tomoki TODA
AU - Tomoyasu NAKANO
AU - Masataka GOTO
AU - Satoshi NAKAMURA
PY - 2016
DO - 10.1587/transinf.2016EDP7234
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E99-D
IS - 11
JA - IEICE TRANSACTIONS on Information
Y1 - November 2016
ER -
