INmfCA Algorithm for Training of Nonparallel Voice Conversion Systems Based on Non-Negative Matrix Factorization

Hitoshi SUDA, Gaku KOTANI, Daisuke SAITO

Summary:

In this paper, we propose a new training framework, the INmfCA algorithm, for nonparallel voice conversion (VC) systems. To train conversion models, traditional VC frameworks require parallel corpora, in which source and target speakers utter the same linguistic content. Although these frameworks have achieved high-quality VC, they are not applicable when parallel corpora are unavailable. Nonparallel methods, which acquire conversion models without parallel corpora, have therefore been widely studied. Although these methods achieve VC under nonparallel conditions, they tend to require extensive background knowledge or many training utterances, because linguistic and speaker information is difficult to disentangle without a large amount of data. In this work, we tackle this problem by exploiting non-negative matrix factorization (NMF), which can factorize acoustic features into time-variant and time-invariant components in an unsupervised manner. The proposed method acquires an alignment between the acoustic features of a source speaker's utterances and a target dictionary, and uses the obtained alignment as the NMF activation to train the source speaker's dictionary without parallel corpora. The acquisition method is based on the INCA algorithm, which obtains the alignment of nonparallel corpora. In contrast to the INCA algorithm, the alignment is not restricted to observed samples, and thus the proposed method can efficiently utilize small nonparallel corpora. The results of subjective experiments show that the combination of the proposed algorithm and the INCA algorithm outperformed not only an INCA-based nonparallel framework but also CycleGAN-VC, which performs nonparallel VC without any additional training data. The results also indicate that a one-shot VC framework, which requires no training for source speakers, can be constructed on the basis of the proposed method.
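
The two NMF building blocks mentioned in the summary (estimating an activation against a fixed dictionary, then training a dictionary against a fixed activation) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a Euclidean cost with standard multiplicative updates, uses random matrices in place of real acoustic features, and simplifies the alignment step to fitting the source features directly against the target dictionary. All function and variable names are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero in the updates


def update_activation(X, W, H, n_iter=100):
    """Estimate activation H with dictionary W held fixed (Euclidean cost)."""
    for _ in range(n_iter):
        H = H * (W.T @ X) / (W.T @ W @ H + EPS)
    return H


def update_dictionary(X, W, H, n_iter=100):
    """Estimate dictionary W with activation H held fixed (Euclidean cost)."""
    for _ in range(n_iter):
        W = W * (X @ H.T) / (W @ H @ H.T + EPS)
    return W


rng = np.random.default_rng(0)
n_feat, n_bases, n_frames = 20, 8, 50

# Stand-ins for real data (assumption): a target speaker's dictionary and
# the non-negative acoustic features of a source speaker's utterances.
W_target = rng.random((n_feat, n_bases))
X_source = rng.random((n_feat, n_frames))

# Step 1 (simplified alignment): activation of the target dictionary
# that best explains the source features.
H_init = rng.random((n_bases, n_frames))
err_h_before = np.linalg.norm(X_source - W_target @ H_init)
H = update_activation(X_source, W_target, H_init)
err_h_after = np.linalg.norm(X_source - W_target @ H)

# Step 2: train a source dictionary that shares the same activation.
W_init = rng.random((n_feat, n_bases))
err_w_before = np.linalg.norm(X_source - W_init @ H)
W_source = update_dictionary(X_source, W_init, H)
err_w_after = np.linalg.norm(X_source - W_source @ H)

# Conversion in this toy setting: reuse the activation (time-variant part)
# with the target dictionary (time-invariant, speaker-dependent part).
X_converted = W_target @ H
```

In this toy setting the activation plays the role of the time-variant component shared between speakers, while each dictionary holds the time-invariant, speaker-dependent component. The INCA-inspired iteration described in the summary is not reproduced here.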

Publication
IEICE TRANSACTIONS on Information and Systems Vol.E105-D No.6 pp.1196-1210
Publication Date
2022/06/01
Publicized
2022/03/03
Online ISSN
1745-1361
DOI
10.1587/transinf.2021EDP7234
Type of Manuscript
PAPER
Category
Speech and Hearing

Authors

Hitoshi SUDA
  the University of Tokyo
Gaku KOTANI
  the University of Tokyo
Daisuke SAITO
  the University of Tokyo

Keyword