Cross-Lingual Phone Mapping for Large Vocabulary Speech Recognition of Under-Resourced Languages

Van Hai DO; Xiong XIAO; Eng Siong CHNG; Haizhou LI

doi:10.1587/transinf.E97.D.285

IEICE TRANSACTIONS on Information

Summary:

This paper presents a novel acoustic modeling technique for large vocabulary automatic speech recognition of under-resourced languages that leverages well-trained acoustic models of other languages (called source languages). The idea is to use a source-language acoustic model to score the acoustic features of the target language, and then map these scores to the posteriors of the target phones using a classifier. The target phone posteriors are then used for decoding in the usual way of hybrid acoustic modeling. The motivation for this strategy is that human languages usually share similar phone sets, so it may be easier to predict the target phone posteriors from the scores generated by source-language acoustic models than to train an acoustic model for the under-resourced language from scratch. The proposed method is evaluated on the Aurora-4 task with less than 1 hour of training data. Two types of source-language acoustic models are considered: hybrid HMM/MLP and conventional HMM/GMM models. In addition, we also use triphone tied states in the mapping. Our experimental results show that by leveraging well-trained Malay and Hungarian acoustic models, we achieved a 9.0% word error rate (WER) given 55 minutes of English training data. This is close to the WER of 7.9% obtained using the full 15 hours of training data and much better than the WER of 14.4% obtained by conventional acoustic modeling techniques with the same 55 minutes of training data.
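
The pipeline described in the summary (score target-language frames with a source-language model, map the scores to target phone posteriors with a classifier, then decode in the hybrid fashion) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the softmax-regression classifier, the function names, the dimensions, and the synthetic data that stands in for real source-model scores.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_phone_mapping(source_scores, target_labels, n_target, lr=0.5, epochs=200):
    """Fit a softmax-regression mapping (an assumed stand-in classifier)
    from source-model scores to target phone posteriors."""
    n_frames, n_source = source_scores.shape
    W = np.zeros((n_source, n_target))
    b = np.zeros(n_target)
    onehot = np.eye(n_target)[target_labels]
    for _ in range(epochs):
        probs = softmax(source_scores @ W + b)
        grad = (probs - onehot) / n_frames  # cross-entropy gradient w.r.t. logits
        W -= lr * (source_scores.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def map_to_target_posteriors(source_scores, W, b):
    """Map per-frame source scores to per-frame target phone posteriors."""
    return softmax(source_scores @ W + b)

def scaled_log_likelihoods(posteriors, priors, eps=1e-10):
    """Hybrid decoding convention: likelihood is proportional to posterior / prior."""
    return np.log(posteriors + eps) - np.log(priors + eps)

# Synthetic stand-in for source-language model outputs (hypothetical sizes).
rng = np.random.default_rng(0)
n_source, n_target, n_frames = 6, 4, 400
target_labels = rng.integers(0, n_target, size=n_frames)
pattern = rng.random((n_target, n_source))  # how each target phone excites source phones
noise = rng.normal(0.0, 0.5, size=(n_frames, n_source))
source_scores = softmax(5.0 * pattern[target_labels] + noise)

W, b = train_phone_mapping(source_scores, target_labels, n_target)
target_post = map_to_target_posteriors(source_scores, W, b)
priors = np.bincount(target_labels, minlength=n_target) / n_frames
loglik = scaled_log_likelihoods(target_post, priors)
```

In a real system the scaled log-likelihoods would be passed to an HMM decoder in place of GMM likelihoods. The sketch stops at that point because the decoder itself is outside what the summary describes.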

Publication: IEICE TRANSACTIONS on Information Vol.E97-D No.2 pp.285-295

Publication Date: 2014/02/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.E97.D.285

Type of Manuscript: PAPER

Category: Speech and Hearing

Authors

Van Hai DO
  Nanyang Technological University
Xiong XIAO
  Nanyang Technological University
Eng Siong CHNG
  Nanyang Technological University
Haizhou LI
  Nanyang Technological University, Institute for Infocomm Research

Keyword

speech recognition, under-resourced language, cross-lingual LVCSR, context-dependent, phone mapping

Cite this

Van Hai DO, Xiong XIAO, Eng Siong CHNG, Haizhou LI, "Cross-Lingual Phone Mapping for Large Vocabulary Speech Recognition of Under-Resourced Languages" in IEICE TRANSACTIONS on Information, vol. E97-D, no. 2, pp. 285-295, February 2014, doi: 10.1587/transinf.E97.D.285.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E97.D.285/_p

@ARTICLE{e97-d_2_285,
author={Van Hai DO and Xiong XIAO and Eng Siong CHNG and Haizhou LI},
journal={IEICE TRANSACTIONS on Information},
title={Cross-Lingual Phone Mapping for Large Vocabulary Speech Recognition of Under-Resourced Languages},
year={2014},
volume={E97-D},
number={2},
pages={285--295},
abstract={This paper presents a novel acoustic modeling technique for large vocabulary automatic speech recognition of under-resourced languages that leverages well-trained acoustic models of other languages (called source languages). The idea is to use a source-language acoustic model to score the acoustic features of the target language, and then map these scores to the posteriors of the target phones using a classifier. The target phone posteriors are then used for decoding in the usual way of hybrid acoustic modeling. The motivation for this strategy is that human languages usually share similar phone sets, so it may be easier to predict the target phone posteriors from the scores generated by source-language acoustic models than to train an acoustic model for the under-resourced language from scratch. The proposed method is evaluated on the Aurora-4 task with less than 1 hour of training data. Two types of source-language acoustic models are considered: hybrid HMM/MLP and conventional HMM/GMM models. In addition, we also use triphone tied states in the mapping. Our experimental results show that by leveraging well-trained Malay and Hungarian acoustic models, we achieved a 9.0% word error rate (WER) given 55 minutes of English training data. This is close to the WER of 7.9% obtained using the full 15 hours of training data and much better than the WER of 14.4% obtained by conventional acoustic modeling techniques with the same 55 minutes of training data.},
keywords={speech recognition, under-resourced language, cross-lingual LVCSR, context-dependent, phone mapping},
doi={10.1587/transinf.E97.D.285},
ISSN={1745-1361},
month={February},}

TY  - JOUR
TI  - Cross-Lingual Phone Mapping for Large Vocabulary Speech Recognition of Under-Resourced Languages
T2  - IEICE TRANSACTIONS on Information
SP  - 285
EP  - 295
AU  - Van Hai DO
AU  - Xiong XIAO
AU  - Eng Siong CHNG
AU  - Haizhou LI
PY  - 2014
DO  - 10.1587/transinf.E97.D.285
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E97-D
IS  - 2
JA  - IEICE TRANSACTIONS on Information
Y1  - February 2014
AB  - This paper presents a novel acoustic modeling technique for large vocabulary automatic speech recognition of under-resourced languages that leverages well-trained acoustic models of other languages (called source languages). The idea is to use a source-language acoustic model to score the acoustic features of the target language, and then map these scores to the posteriors of the target phones using a classifier. The target phone posteriors are then used for decoding in the usual way of hybrid acoustic modeling. The motivation for this strategy is that human languages usually share similar phone sets, so it may be easier to predict the target phone posteriors from the scores generated by source-language acoustic models than to train an acoustic model for the under-resourced language from scratch. The proposed method is evaluated on the Aurora-4 task with less than 1 hour of training data. Two types of source-language acoustic models are considered: hybrid HMM/MLP and conventional HMM/GMM models. In addition, we also use triphone tied states in the mapping. Our experimental results show that by leveraging well-trained Malay and Hungarian acoustic models, we achieved a 9.0% word error rate (WER) given 55 minutes of English training data. This is close to the WER of 7.9% obtained using the full 15 hours of training data and much better than the WER of 14.4% obtained by conventional acoustic modeling techniques with the same 55 minutes of training data.
ER  -