Unsupervised Prosodic Labeling of Speech Synthesis Databases Using Context-Dependent HMMs

Chen-Yu YANG; Zhen-Hua LING; Li-Rong DAI

doi:10.1587/transinf.E97.D.1449

IEICE TRANSACTIONS on Information

Summary :

This paper proposes an automatic, unsupervised method that uses context-dependent hidden Markov models (CD-HMMs) for the prosodic labeling of speech synthesis databases. The method consists of three main steps: initialization, model training, and prosodic labeling. The initial prosodic labels are obtained by unsupervised clustering of acoustic features designed according to the characteristics of the prosodic descriptor to be labeled. Using these initial labels, CD-HMMs of the spectral parameters, F0s, and phone durations are then estimated in a manner similar to HMM-based parametric speech synthesis. The labels are further updated by Viterbi decoding under the maximum likelihood criterion, given the acoustic feature sequences and the trained CD-HMMs. Model training and prosodic labeling are repeated iteratively until convergence. The proposed method is evaluated on Mandarin speech synthesis databases for two prosodic descriptors: the prosodic phrase boundary and the emphasis expression. In our implementation, the prosodic phrase boundary labels are initialized by clustering the durations of the pauses between every two consecutive prosodic words, and the emphasis expression labels are initialized by examining the differences between the original and the synthetic F0 trajectories. Experimental results show that the proposed method labels prosodic phrase boundary positions much more accurately than the text-analysis-based method, without requiring any manually labeled training data. A unit selection speech synthesis system built with the prosodic phrase boundary labels generated by the proposed method achieves performance similar to that of a system built with manual labels. Furthermore, a unit selection speech synthesis system built with the emphasis expression labels generated by the proposed method conveys emphasis information effectively while maintaining the naturalness of the synthetic speech.
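
The initialize, train, and relabel loop described in the summary can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: one Gaussian per label stands in for the CD-HMMs, a per-item maximum likelihood decision stands in for Viterbi decoding, and the pause duration is the only feature. The function names, the variance floor, and the toy pause durations are all assumptions made for illustration.

```python
import math
from statistics import mean, pvariance

VAR_FLOOR = 1e-4  # assumed variance floor so the Gaussians stay well defined


def init_labels(pauses, max_iter=50):
    """Initialization: two-cluster 1-D k-means on pause durations.

    Label 1 marks the long-pause cluster (candidate phrase boundary).
    """
    c0, c1 = min(pauses), max(pauses)
    labels = []
    for _ in range(max_iter):
        new_labels = [1 if abs(p - c1) < abs(p - c0) else 0 for p in pauses]
        if new_labels == labels:
            break
        labels = new_labels
        short = [p for p, l in zip(pauses, labels) if l == 0]
        long_ = [p for p, l in zip(pauses, labels) if l == 1]
        if short:
            c0 = mean(short)
        if long_:
            c1 = mean(long_)
    return labels


def fit_gaussian(values):
    """Model training stand-in: one Gaussian per label (not a CD-HMM)."""
    return mean(values), max(pvariance(values), VAR_FLOOR)


def log_likelihood(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def relabel_until_convergence(features, labels, max_iter=20):
    """Alternate model training and maximum likelihood relabeling."""
    for _ in range(max_iter):
        groups = {k: [x for x, l in zip(features, labels) if l == k]
                  for k in (0, 1)}
        if not groups[0] or not groups[1]:
            break  # one class is empty, so no model can be trained for it
        models = {k: fit_gaussian(v) for k, v in groups.items()}
        new_labels = [
            max((0, 1), key=lambda k: log_likelihood(x, *models[k]))
            for x in features
        ]
        if new_labels == labels:
            break  # labels stopped changing: converged
        labels = new_labels
    return labels


# Hypothetical pause durations (seconds) between consecutive prosodic words.
pauses = [0.02, 0.03, 0.01, 0.04, 0.35, 0.40, 0.30, 0.02, 0.45]
initial = init_labels(pauses)
final = relabel_until_convergence(pauses, initial)
print(initial, final)
```

On this toy data the clusters are well separated, so the initial labels already form a fixed point of the loop. The method in the paper differs in that relabeling draws on spectral, F0, and duration models in full phonetic context, which is what allows labels to move away from their pause-based initialization.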

Publication: IEICE TRANSACTIONS on Information Vol.E97-D No.6 pp.1449-1460

Publication Date: 2014/06/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.E97.D.1449

Type of Manuscript: Special Section PAPER (Special Section on Advances in Modeling for Real-world Speech Information Processing and its Application)

Category: Speech Synthesis and Related Topics

Authors

Chen-Yu YANG
  University of Science and Technology of China
Zhen-Hua LING
  University of Science and Technology of China
Li-Rong DAI
  University of Science and Technology of China

Keywords

speech synthesis, prosodic labeling, hidden Markov model, prosodic phrase boundary, emphasis expression

Cite this

Chen-Yu YANG, Zhen-Hua LING, Li-Rong DAI, "Unsupervised Prosodic Labeling of Speech Synthesis Databases Using Context-Dependent HMMs" in IEICE TRANSACTIONS on Information, vol. E97-D, no. 6, pp. 1449-1460, June 2014, doi: 10.1587/transinf.E97.D.1449.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E97.D.1449/_p

@ARTICLE{e97-d_6_1449,
author={Chen-Yu YANG and Zhen-Hua LING and Li-Rong DAI},
journal={IEICE TRANSACTIONS on Information},
title={Unsupervised Prosodic Labeling of Speech Synthesis Databases Using Context-Dependent HMMs},
year={2014},
volume={E97-D},
number={6},
pages={1449-1460},
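keywords={speech synthesis, prosodic labeling, hidden Markov model, prosodic phrase boundary, emphasis expression},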
doi={10.1587/transinf.E97.D.1449},
ISSN={1745-1361},
month={June},}

TY - JOUR
TI - Unsupervised Prosodic Labeling of Speech Synthesis Databases Using Context-Dependent HMMs
T2 - IEICE TRANSACTIONS on Information
SP - 1449
EP - 1460
AU - Chen-Yu YANG
AU - Zhen-Hua LING
AU - Li-Rong DAI
PY - 2014
DO - 10.1587/transinf.E97.D.1449
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E97-D
IS - 6
JA - IEICE TRANSACTIONS on Information
Y1 - June 2014
ER -
