Speech Chain VC: Linking Linguistic and Acoustic Levels via Latent Distinctive Features for RBM-Based Voice Conversion

Takuya KISHIDA, Toru NAKASHIKA

Summary:

This paper proposes a voice conversion (VC) method based on a model that links linguistic and acoustic representations via latent phonological distinctive features. Our method, called speech chain VC, is inspired by the concept of the speech chain, in which speech communication consists of a chain of events linking the speaker's brain with the listener's brain. We assume that speaker identity information, which appears at the acoustic level, is embedded in two steps: the encoding of phonological information into articulatory movements (linguistic to physiological) and the generation of sound waves by those movements (physiological to acoustic). Speech chain VC represents these links with an adaptive restricted Boltzmann machine (ARBM) in which phoneme labels and acoustic features form two classes of visible units and latent phonological distinctive features associated with articulatory movements form the hidden units. Subjective evaluation experiments showed that the intelligibility of the converted speech improved significantly over the conventional ARBM-based method, while the speaker-identity conversion quality of the proposed method was comparable to that of a Gaussian mixture model (GMM)-based method. Analyses of the hidden-layer representations of the speech chain VC model supported the view that some of the hidden units indeed correspond to phonological distinctive features. The final part of this paper proposes approaches to one-shot VC using the speech chain VC model. Subjective evaluation experiments showed that when the target speaker is of the same gender as the source speaker, the proposed methods achieve one-shot VC from a single utterance by each of the source and target speakers.
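To make the described topology concrete, the following is a minimal NumPy sketch of an ARBM-style model as summarized above: phoneme labels and acoustic features as two classes of visible units, latent distinctive features as hidden units, and a per-speaker adaptation matrix applied on the acoustic side. All dimensions, variable names, and the specific adaptation and reconstruction scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (NOT the paper's implementation) of the speech chain VC
# topology: two visible classes (phoneme labels, acoustic features), one
# hidden layer of latent distinctive features, per-speaker adaptation.
import numpy as np

rng = np.random.default_rng(0)

D_PHONE = 40    # phoneme-label visible units (one-hot, linguistic level) - assumed size
D_ACOUS = 32    # acoustic-feature visible units (e.g., mel-cepstra) - assumed size
D_HIDDEN = 24   # latent phonological distinctive features - assumed size
N_SPEAKERS = 4

# Speaker-independent weights linking each visible class to the hidden layer.
W_phone = rng.normal(0.0, 0.01, (D_PHONE, D_HIDDEN))
W_acous = rng.normal(0.0, 0.01, (D_ACOUS, D_HIDDEN))
b_hidden = np.zeros(D_HIDDEN)

# Speaker-dependent adaptation matrices (one per speaker); initialized to
# identity here purely for the sketch.
A = np.stack([np.eye(D_ACOUS) for _ in range(N_SPEAKERS)])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(phone_onehot, acoustic, speaker_id):
    """P(h=1 | v): infer latent distinctive features from both visible classes."""
    adapted = A[speaker_id] @ acoustic          # normalize away speaker identity
    act = phone_onehot @ W_phone + adapted @ W_acous + b_hidden
    return sigmoid(act)

def reconstruct_acoustic(h, speaker_id):
    """Project hidden features back to the acoustic level for a chosen speaker."""
    # Inverse adaptation re-embeds the target speaker's identity (assumption:
    # the adaptation matrix is invertible).
    return np.linalg.inv(A[speaker_id]) @ (W_acous @ h)

# Conversion: encode with the source speaker, decode with the target speaker.
phone = np.zeros(D_PHONE); phone[5] = 1.0       # dummy phoneme label
acoustic_src = rng.normal(size=D_ACOUS)         # dummy source-speaker features
h = hidden_probs(phone, acoustic_src, speaker_id=0)
acoustic_tgt = reconstruct_acoustic(h, speaker_id=1)
```

In this reading, conversion amounts to inferring the speaker-independent hidden representation from the source speaker's inputs and reconstructing the acoustic features with the target speaker's adaptation parameters; training of the weights (e.g., by contrastive divergence) is omitted from the sketch.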

Publication
IEICE TRANSACTIONS on Information and Systems Vol.E103-D No.11 pp.2340-2350
Publication Date
2020/11/01
Publicized
2020/08/06
Online ISSN
1745-1361
DOI
10.1587/transinf.2020EDP7032
Type of Manuscript
PAPER
Category
Speech and Hearing

Authors

Takuya KISHIDA
  The University of Electro-Communications
Toru NAKASHIKA
  The University of Electro-Communications

Keyword