Non-Native Text-to-Speech Preserving Speaker Individuality Based on Partial Correction of Prosodic and Phonetic Characteristics

Yuji OSHIMA; Shinnosuke TAKAMICHI; Tomoki TODA; Graham NEUBIG; Sakriani SAKTI; Satoshi NAKAMURA

doi:10.1587/transinf.2016EDP7231

IEICE TRANSACTIONS on Information

Summary:

This paper presents a novel non-native speech synthesis technique that preserves the individuality of a non-native speaker. Cross-lingual speech synthesis based on voice conversion or Hidden Markov Model (HMM)-based speech synthesis is a technique to synthesize foreign language speech using a target speaker's natural speech uttered in his/her mother tongue. Although the technique holds promise to improve a wide variety of applications, it tends to cause degradation of target speaker's individuality in synthetic speech compared to intra-lingual speech synthesis. This paper proposes a new approach to speech synthesis that preserves speaker individuality by using non-native speech spoken by the target speaker. Although the use of non-native speech makes it possible to preserve the speaker individuality in the synthesized target speech, naturalness is significantly degraded as the synthesized speech waveform is directly affected by unnatural prosody and pronunciation often caused by differences in the linguistic systems of the source and target languages. To improve naturalness while preserving speaker individuality, we propose (1) a prosody correction method based on model adaptation, and (2) a phonetic correction method based on spectrum replacement for unvoiced consonants. The experimental results using English speech uttered by native Japanese speakers demonstrate that (1) the proposed methods are capable of significantly improving naturalness while preserving the speaker individuality in synthetic speech, and (2) the proposed methods also improve intelligibility as confirmed by a dictation test.

Publication: IEICE TRANSACTIONS on Information Vol.E99-D No.12 pp.3132-3139

Publication Date: 2016/12/01

Publicized: 2016/08/30

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2016EDP7231

Type of Manuscript: PAPER

Category: Speech and Hearing

Authors

Yuji OSHIMA
  Nara Institute of Science and Technology
Shinnosuke TAKAMICHI
  The University of Tokyo
Tomoki TODA
  Nara Institute of Science and Technology, Nagoya University
Graham NEUBIG
  Nara Institute of Science and Technology
Sakriani SAKTI
  Nara Institute of Science and Technology
Satoshi NAKAMURA
  Nara Institute of Science and Technology

Keywords

cross-lingual speech synthesis, English-Read-by-Japanese, speaker individuality, HMM-based speech synthesis, prosody correction, phonetic correction

Cite this

Yuji OSHIMA, Shinnosuke TAKAMICHI, Tomoki TODA, Graham NEUBIG, Sakriani SAKTI, Satoshi NAKAMURA, "Non-Native Text-to-Speech Preserving Speaker Individuality Based on Partial Correction of Prosodic and Phonetic Characteristics" in IEICE TRANSACTIONS on Information, vol. E99-D, no. 12, pp. 3132-3139, December 2016, doi: 10.1587/transinf.2016EDP7231.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016EDP7231/_p

@ARTICLE{e99-d_12_3132,
author={Yuji OSHIMA and Shinnosuke TAKAMICHI and Tomoki TODA and Graham NEUBIG and Sakriani SAKTI and Satoshi NAKAMURA},
journal={IEICE TRANSACTIONS on Information},
title={Non-Native Text-to-Speech Preserving Speaker Individuality Based on Partial Correction of Prosodic and Phonetic Characteristics},
year={2016},
volume={E99-D},
number={12},
pages={3132--3139},
abstract={This paper presents a novel non-native speech synthesis technique that preserves the individuality of a non-native speaker. Cross-lingual speech synthesis based on voice conversion or Hidden Markov Model (HMM)-based speech synthesis is a technique to synthesize foreign language speech using a target speaker's natural speech uttered in his/her mother tongue. Although the technique holds promise to improve a wide variety of applications, it tends to cause degradation of target speaker's individuality in synthetic speech compared to intra-lingual speech synthesis. This paper proposes a new approach to speech synthesis that preserves speaker individuality by using non-native speech spoken by the target speaker. Although the use of non-native speech makes it possible to preserve the speaker individuality in the synthesized target speech, naturalness is significantly degraded as the synthesized speech waveform is directly affected by unnatural prosody and pronunciation often caused by differences in the linguistic systems of the source and target languages. To improve naturalness while preserving speaker individuality, we propose (1) a prosody correction method based on model adaptation, and (2) a phonetic correction method based on spectrum replacement for unvoiced consonants. The experimental results using English speech uttered by native Japanese speakers demonstrate that (1) the proposed methods are capable of significantly improving naturalness while preserving the speaker individuality in synthetic speech, and (2) the proposed methods also improve intelligibility as confirmed by a dictation test.},
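keywords={cross-lingual speech synthesis, English-Read-by-Japanese, speaker individuality, HMM-based speech synthesis, prosody correction, phonetic correction},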
doi={10.1587/transinf.2016EDP7231},
ISSN={1745-1361},
month={December},
}

TY - JOUR
TI - Non-Native Text-to-Speech Preserving Speaker Individuality Based on Partial Correction of Prosodic and Phonetic Characteristics
T2 - IEICE TRANSACTIONS on Information
SP - 3132
EP - 3139
AU - Yuji OSHIMA
AU - Shinnosuke TAKAMICHI
AU - Tomoki TODA
AU - Graham NEUBIG
AU - Sakriani SAKTI
AU - Satoshi NAKAMURA
PY - 2016
DO - 10.1587/transinf.2016EDP7231
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E99-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2016
AB - This paper presents a novel non-native speech synthesis technique that preserves the individuality of a non-native speaker. Cross-lingual speech synthesis based on voice conversion or Hidden Markov Model (HMM)-based speech synthesis is a technique to synthesize foreign language speech using a target speaker's natural speech uttered in his/her mother tongue. Although the technique holds promise to improve a wide variety of applications, it tends to cause degradation of target speaker's individuality in synthetic speech compared to intra-lingual speech synthesis. This paper proposes a new approach to speech synthesis that preserves speaker individuality by using non-native speech spoken by the target speaker. Although the use of non-native speech makes it possible to preserve the speaker individuality in the synthesized target speech, naturalness is significantly degraded as the synthesized speech waveform is directly affected by unnatural prosody and pronunciation often caused by differences in the linguistic systems of the source and target languages. To improve naturalness while preserving speaker individuality, we propose (1) a prosody correction method based on model adaptation, and (2) a phonetic correction method based on spectrum replacement for unvoiced consonants. The experimental results using English speech uttered by native Japanese speakers demonstrate that (1) the proposed methods are capable of significantly improving naturalness while preserving the speaker individuality in synthetic speech, and (2) the proposed methods also improve intelligibility as confirmed by a dictation test.
ER -
