Code-Switching ASR and TTS Using Semisupervised Learning with Machine Speech Chain

Sahoko NAKAYAMA; Andros TJANDRA; Sakriani SAKTI; Satoshi NAKAMURA

doi:10.1587/transinf.2021EDP7005

Summary:

The phenomenon in which a speaker mixes two or more languages within the same conversation is called code-switching (CS). Handling CS is challenging for automatic speech recognition (ASR) and text-to-speech (TTS) because it requires coping with multilingual input. Although CS text or speech can be found on social media, datasets of CS speech paired with the corresponding CS transcriptions are hard to obtain, even though supervised training requires them. This work adopts a deep learning-based machine speech chain to train CS ASR and CS TTS jointly through semisupervised learning. After supervised learning with monolingual data, the machine speech chain is carried out with unsupervised learning on either CS text or CS speech. The results show that the machine speech chain trains ASR and TTS together and improves performance without requiring paired CS speech and CS text. We also integrate language embedding and language identification into the CS machine speech chain to handle CS better by supplying language information. We demonstrate that the proposed approach improves performance on both a single CS language pair and multiple CS language pairs, including unknown CS that was excluded from the training data.
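The two training phases described in the summary can be sketched as a control-flow skeleton. This is only an illustrative sketch, not the authors' implementation: `ToyModel`, `supervised_phase`, `speech_chain_phase`, and all sample strings are hypothetical names invented here, and the toy model merely logs which (input, target) pairs it would be trained on instead of computing a real loss.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyModel:
    """Hypothetical stand-in for an ASR or TTS model; it only logs its updates."""
    name: str
    updates: List[Tuple[str, str]] = field(default_factory=list)

    def predict(self, source: str) -> str:
        # A real model would decode here; the toy just tags its input.
        return f"{self.name}({source})"

    def train_step(self, source: str, target: str) -> None:
        # A real model would backpropagate a loss between its output and target.
        self.updates.append((source, target))


def supervised_phase(asr: ToyModel, tts: ToyModel,
                     paired_mono: List[Tuple[str, str]]) -> None:
    """Phase 1: paired monolingual (speech, text) data trains both models."""
    for speech, text in paired_mono:
        asr.train_step(speech, text)   # speech -> text
        tts.train_step(text, speech)   # text -> speech


def speech_chain_phase(asr: ToyModel, tts: ToyModel,
                       cs_speech: List[str], cs_text: List[str]) -> None:
    """Phase 2: unpaired CS data; each model produces inputs for the other."""
    for speech in cs_speech:
        pseudo_text = asr.predict(speech)     # ASR transcribes speech-only data
        tts.train_step(pseudo_text, speech)   # TTS learns to reconstruct the speech
    for text in cs_text:
        pseudo_speech = tts.predict(text)     # TTS synthesizes text-only data
        asr.train_step(pseudo_speech, text)   # ASR learns to recover the text


# Placeholder data: the strings stand in for waveforms and transcriptions.
asr, tts = ToyModel("asr"), ToyModel("tts")
supervised_phase(asr, tts, [("mono_wav_1", "mono text 1"), ("mono_wav_2", "mono text 2")])
speech_chain_phase(asr, tts, cs_speech=["cs_wav"], cs_text=["cs text"])
print(asr.updates[-1])
print(tts.updates[-1])
```

In this sketch, speech-only CS data ends up updating the TTS side and text-only CS data ends up updating the ASR side, which is why no paired CS speech and text is needed in the second phase.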

Publication: IEICE TRANSACTIONS on Information, Vol. E104-D, No. 10, pp. 1661-1677

Publication Date: 2021/10/01

Publicized: 2021/07/08

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2021EDP7005

Type of Manuscript: PAPER

Category: Speech and Hearing

Authors

Sahoko NAKAYAMA
  Nara Institute of Science and Technology, RIKEN, Center for Advanced Intelligence Project AIP
Andros TJANDRA
  Nara Institute of Science and Technology
Sakriani SAKTI
  Nara Institute of Science and Technology, RIKEN, Center for Advanced Intelligence Project AIP
Satoshi NAKAMURA
  Nara Institute of Science and Technology, RIKEN, Center for Advanced Intelligence Project AIP

Keywords

ASR, code-switching, language identification, semisupervised learning, TTS, machine speech chain

Cite this

Sahoko NAKAYAMA, Andros TJANDRA, Sakriani SAKTI, Satoshi NAKAMURA, "Code-Switching ASR and TTS Using Semisupervised Learning with Machine Speech Chain" in IEICE TRANSACTIONS on Information, vol. E104-D, no. 10, pp. 1661-1677, October 2021, doi: 10.1587/transinf.2021EDP7005.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021EDP7005/_p

@ARTICLE{e104-d_10_1661,
author={Sahoko NAKAYAMA and Andros TJANDRA and Sakriani SAKTI and Satoshi NAKAMURA},
journal={IEICE TRANSACTIONS on Information},
title={Code-Switching ASR and TTS Using Semisupervised Learning with Machine Speech Chain},
year={2021},
volume={E104-D},
number={10},
pages={1661-1677},
abstract={The phenomenon where a speaker mixes two or more languages within the same conversation is called code-switching (CS). Handling CS is challenging for automatic speech recognition (ASR) and text-to-speech (TTS) because it requires coping with multilingual input. Although CS text or speech may be found in social media, the datasets of CS speech and corresponding CS transcriptions are hard to obtain even though they are required for supervised training. This work adopts a deep learning-based machine speech chain to train CS ASR and CS TTS with each other with semisupervised learning. After supervised learning with monolingual data, the machine speech chain is then carried out with unsupervised learning of either the CS text or speech. The results show that the machine speech chain trains ASR and TTS together and improves performance without requiring the pair of CS speech and corresponding CS text. We also integrate language embedding and language identification into the CS machine speech chain in order to handle CS better by giving language information. We demonstrate that our proposed approach can improve the performance on both a single CS language pair and multiple CS language pairs, including the unknown CS excluded from training data.},
keywords={ASR, code-switching, language identification, semisupervised learning, TTS, machine speech chain},
doi={10.1587/transinf.2021EDP7005},
ISSN={1745-1361},
month={October},
}

TY  - JOUR
TI  - Code-Switching ASR and TTS Using Semisupervised Learning with Machine Speech Chain
T2  - IEICE TRANSACTIONS on Information
SP  - 1661
EP  - 1677
AU  - Sahoko NAKAYAMA
AU  - Andros TJANDRA
AU  - Sakriani SAKTI
AU  - Satoshi NAKAMURA
PY  - 2021
DO  - 10.1587/transinf.2021EDP7005
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E104-D
IS  - 10
JA  - IEICE TRANSACTIONS on Information
Y1  - October 2021
AB  - The phenomenon where a speaker mixes two or more languages within the same conversation is called code-switching (CS). Handling CS is challenging for automatic speech recognition (ASR) and text-to-speech (TTS) because it requires coping with multilingual input. Although CS text or speech may be found in social media, the datasets of CS speech and corresponding CS transcriptions are hard to obtain even though they are required for supervised training. This work adopts a deep learning-based machine speech chain to train CS ASR and CS TTS with each other with semisupervised learning. After supervised learning with monolingual data, the machine speech chain is then carried out with unsupervised learning of either the CS text or speech. The results show that the machine speech chain trains ASR and TTS together and improves performance without requiring the pair of CS speech and corresponding CS text. We also integrate language embedding and language identification into the CS machine speech chain in order to handle CS better by giving language information. We demonstrate that our proposed approach can improve the performance on both a single CS language pair and multiple CS language pairs, including the unknown CS excluded from training data.
ER  -