End-to-End Multilingual Speech Recognition System with Language Supervision Training

Danyang LIU; Ji XU; Pengyuan ZHANG

doi:10.1587/transinf.2019EDL8214

IEICE TRANSACTIONS on Information

Summary:

End-to-end (E2E) multilingual automatic speech recognition (ASR) systems aim to recognize speech in multiple languages within a unified framework. In current E2E multilingual ASR frameworks, the output prediction for a specific language is not constrained to that language's modeling units. This paper proposes a language supervision training strategy that uses language masks to constrain the neural network's output distribution. To simulate the multilingual ASR scenario in which the language identity is unknown, a language identification (LID) classifier is applied to estimate the language masks. On four Babel corpora, the proposed E2E multilingual ASR system achieved an average absolute word error rate (WER) reduction of 2.6% compared with the multilingual baseline system.
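
The summary does not say how the language masks are applied, so the following is only a minimal sketch of one plausible reading: a per-language 0/1 mask over the shared output units is folded into the softmax, and when the language is unknown the mask is estimated as an LID-posterior-weighted mixture of the per-language masks. The toy vocabulary, the language names `lang_A` and `lang_B`, and the functions `masked_softmax` and `estimated_mask` are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy inventory of shared output units (not from the paper).
VOCAB = ["<blank>", "a", "b", "ç", "ñ"]

# Assumed per-language masks: 1 if the unit belongs to the language, else 0.
LANGUAGE_MASKS = {
    "lang_A": [1, 1, 1, 1, 0],
    "lang_B": [1, 1, 1, 0, 1],
}


def masked_softmax(logits, mask, eps=1e-12):
    """Softmax in which each unit's unnormalised probability is scaled by mask[i]."""
    # Adding log(mask) to a logit multiplies exp(logit) by mask; eps avoids log(0).
    shifted = [x + math.log(max(m, eps)) for x, m in zip(logits, mask)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def estimated_mask(lid_posterior):
    """Mix the per-language masks, weighting each by its LID posterior probability."""
    return [
        sum(p * LANGUAGE_MASKS[lang][i] for lang, p in lid_posterior.items())
        for i in range(len(VOCAB))
    ]


# Example logits whose unconstrained argmax is a unit outside lang_A.
logits = [0.1, 2.0, 1.0, 0.5, 3.0]

# Known language: hard mask removes (almost all) probability from foreign units.
hard = masked_softmax(logits, LANGUAGE_MASKS["lang_A"])

# Unknown language: soft mask built from an assumed LID posterior.
soft = masked_softmax(logits, estimated_mask({"lang_A": 0.9, "lang_B": 0.1}))
```

With the hard mask, the unit that belongs only to `lang_B` receives negligible probability, so the prediction falls back to the best in-language unit. The soft mask keeps some probability on that unit in proportion to the LID classifier's uncertainty.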

Publication: IEICE TRANSACTIONS on Information, Vol. E103-D, No. 6, pp. 1427-1430

Publication Date: 2020/06/01

Publicized: 2020/03/19

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2019EDL8214

Type of Manuscript: LETTER

Category: Speech and Hearing

Authors

Danyang LIU
  Chinese Academy of Sciences, University of Chinese Academy of Sciences
Ji XU
  Chinese Academy of Sciences, University of Chinese Academy of Sciences
Pengyuan ZHANG
  Chinese Academy of Sciences, University of Chinese Academy of Sciences

Keywords

multilingual speech recognition, language-adaptive training, hybrid attention/CTC

Cite this

Danyang LIU, Ji XU, Pengyuan ZHANG, "End-to-End Multilingual Speech Recognition System with Language Supervision Training" in IEICE TRANSACTIONS on Information, vol. E103-D, no. 6, pp. 1427-1430, June 2020, doi: 10.1587/transinf.2019EDL8214.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2019EDL8214/_p

@ARTICLE{e103-d_6_1427,
author={Liu, Danyang and Xu, Ji and Zhang, Pengyuan},
journal={IEICE TRANSACTIONS on Information},
title={End-to-End Multilingual Speech Recognition System with Language Supervision Training},
year={2020},
volume={E103-D},
number={6},
pages={1427--1430},
abstract={End-to-end (E2E) multilingual automatic speech recognition (ASR) systems aim to recognize multilingual speeches in a unified framework. In the current E2E multilingual ASR framework, the output prediction for a specific language lacks constraints on the output scope of modeling units. In this paper, a language supervision training strategy is proposed with language masks to constrain the neural network output distribution. To simulate the multilingual ASR scenario with unknown language identity information, a language identification (LID) classifier is applied to estimate the language masks. On four Babel corpora, the proposed E2E multilingual ASR system achieved an average absolute word error rate (WER) reduction of 2.6% compared with the multilingual baseline system.},
keywords={multilingual speech recognition, language-adaptive training, hybrid attention/CTC},
doi={10.1587/transinf.2019EDL8214},
ISSN={1745-1361},
month={June},
}

TY  - JOUR
TI  - End-to-End Multilingual Speech Recognition System with Language Supervision Training
T2  - IEICE TRANSACTIONS on Information
SP  - 1427
EP  - 1430
AU  - Danyang LIU
AU  - Ji XU
AU  - Pengyuan ZHANG
PY  - 2020
DO  - 10.1587/transinf.2019EDL8214
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E103-D
IS  - 6
JA  - IEICE TRANSACTIONS on Information
Y1  - June 2020
AB  - End-to-end (E2E) multilingual automatic speech recognition (ASR) systems aim to recognize multilingual speeches in a unified framework. In the current E2E multilingual ASR framework, the output prediction for a specific language lacks constraints on the output scope of modeling units. In this paper, a language supervision training strategy is proposed with language masks to constrain the neural network output distribution. To simulate the multilingual ASR scenario with unknown language identity information, a language identification (LID) classifier is applied to estimate the language masks. On four Babel corpora, the proposed E2E multilingual ASR system achieved an average absolute word error rate (WER) reduction of 2.6% compared with the multilingual baseline system.
ER  - 
