Context-Dependent Boundary Model for Refining Boundaries Segmentation of TTS Units

Lijuan WANG; Yong ZHAO; Min CHU; Frank K. SOONG; Jianlai ZHOU; Zhigang CAO

doi:10.1093/ietisy/e89-d.3.1082

IEICE TRANSACTIONS on Information

Summary:

To produce high-quality synthesis, a concatenation-based Text-to-Speech (TTS) system usually requires a large number of segmental units to cover various acoustic-phonetic contexts. However, careful manual labeling and segmentation by human experts, which is still the most reliable way to prepare such units, is labor-intensive. In this paper we adopt a two-step procedure to automate the labeling, segmentation and refinement process. In the first step, coarse segmentation of the speech data is performed by aligning speech signals with the corresponding sequence of Hidden Markov Models (HMMs). In the second step, segment boundaries are refined with a proposed Context-Dependent Boundary Model (CDBM). A Classification and Regression Tree (CART) is adopted to organize the available data into a structured hierarchical tree, in which acoustically similar boundaries are clustered together to train tied CDBMs for boundary refinement. Optimal CDBM parameters and training conditions are found through a series of experimental studies. Compared with a manual segmentation reference, segmentation accuracy (within a tolerance of 20 ms) is improved by the CDBMs from 78.1% (baseline) to 94.8% in Mandarin Chinese and from 81.4% to 92.7% in English, with about 1,000 manually segmented sentences used in training the models. To further reduce the amount of manual data needed to train CDBMs for a new speaker, we adapt a well-trained CDBM via efficient adaptation algorithms. With only 10-20 manually segmented sentences as adaptation data, the adapted CDBM achieves a segmentation accuracy of 90%.
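The accuracy figures in the summary are the share of automatically placed boundaries that fall within 20 ms of the manually labeled reference. A minimal sketch of how such a metric could be computed is given below. The function name `boundary_accuracy`, the one-to-one pairing of boundaries, and the inclusive comparison (`<=`) are assumptions made for illustration; the summary does not specify them, and this is not the authors' evaluation code.

```python
from typing import Sequence


def boundary_accuracy(
    auto_ms: Sequence[float],
    ref_ms: Sequence[float],
    tol_ms: float = 20.0,
) -> float:
    """Fraction of automatic boundaries within tol_ms of the reference.

    Assumes (hypothetically) that auto_ms[i] and ref_ms[i] refer to the
    same boundary, i.e. both segmentations share one unit sequence.
    """
    if len(auto_ms) != len(ref_ms):
        raise ValueError("boundary lists must have the same length")
    if not ref_ms:
        raise ValueError("at least one boundary is required")
    # Count boundaries whose absolute deviation is inside the tolerance.
    hits = sum(1 for a, r in zip(auto_ms, ref_ms) if abs(a - r) <= tol_ms)
    return hits / len(ref_ms)


# Made-up boundary times in milliseconds, for illustration only.
automatic = [100, 250, 400, 610]
reference = [105, 275, 400, 600]
print(boundary_accuracy(automatic, reference))
```

In this made-up example the deviations are 5, 25, 0 and 10 ms, so three of the four boundaries count as correct at the 20 ms tolerance.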

Publication: IEICE TRANSACTIONS on Information Vol.E89-D No.3 pp.1082-1091

Publication Date: 2006/03/01

Online ISSN: 1745-1361

DOI: 10.1093/ietisy/e89-d.3.1082

Type of Manuscript: Special Section PAPER (Special Section on Statistical Modeling for Speech Processing)

Category: Speech Synthesis

Lijuan WANG, Yong ZHAO, Min CHU, Frank K. SOONG, Jianlai ZHOU, Zhigang CAO, "Context-Dependent Boundary Model for Refining Boundaries Segmentation of TTS Units" in IEICE TRANSACTIONS on Information, vol. E89-D, no. 3, pp. 1082-1091, March 2006, doi: 10.1093/ietisy/e89-d.3.1082.
URL: https://global.ieice.org/en_transactions/information/10.1093/ietisy/e89-d.3.1082/_p

@ARTICLE{e89-d_3_1082,
author={Lijuan WANG and Yong ZHAO and Min CHU and Frank K. SOONG and Jianlai ZHOU and Zhigang CAO},
journal={IEICE TRANSACTIONS on Information},
title={Context-Dependent Boundary Model for Refining Boundaries Segmentation of TTS Units},
year={2006},
volume={E89-D},
number={3},
pages={1082-1091},
abstract={For producing high quality synthesis, a concatenation-based Text-to-Speech (TTS) system usually requires a large number of segmental units to cover various acoustic-phonetic contexts. However, careful manual labeling and segmentation by human experts, which is still the most reliable way to prepare such units, is labor intensive. In this paper we adopt a two-step procedure to automate the labeling, segmentation and refinement process. In the first step, coarse segmentation of speech data is performed by aligning speech signals with the corresponding sequence of Hidden Markov Models (HMMs). Then in the second step, segment boundaries are refined with a proposed Context-Dependent Boundary Model (CDBM). Classification and Regression Tree (CART) is adopted to organize available data into a structured hierarchical tree, where acoustically similar boundaries are clustered together to train tied CDBM models for boundary refinement. Optimal CDBM parameters and training conditions are found through a series of experimental studies. Comparing with manual segmentation reference, segmentation accuracy (within a tolerance of 20 ms) is improved by the CDBMs from 78.1% (baseline) to 94.8% in Mandarin Chinese and from 81.4% to 92.7% in English, with about 1,000 manually segmented sentences used in training the models. To further reduce the amount of manual data for training CDBMs of a new speaker, we adapt a well-trained CDBM via efficient adaptation algorithms. With only 10-20 manually segmented sentences as adaptation data, the adapted CDBM achieves a segmentation accuracy of 90%.},
keywords={},
doi={10.1093/ietisy/e89-d.3.1082},
ISSN={1745-1361},
month={March}
}

TY - JOUR
TI - Context-Dependent Boundary Model for Refining Boundaries Segmentation of TTS Units
T2 - IEICE TRANSACTIONS on Information
SP - 1082
EP - 1091
AU - Lijuan WANG
AU - Yong ZHAO
AU - Min CHU
AU - Frank K. SOONG
AU - Jianlai ZHOU
AU - Zhigang CAO
PY - 2006
DO - 10.1093/ietisy/e89-d.3.1082
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E89-D
IS - 3
JA - IEICE TRANSACTIONS on Information
Y1 - March 2006
AB - For producing high quality synthesis, a concatenation-based Text-to-Speech (TTS) system usually requires a large number of segmental units to cover various acoustic-phonetic contexts. However, careful manual labeling and segmentation by human experts, which is still the most reliable way to prepare such units, is labor intensive. In this paper we adopt a two-step procedure to automate the labeling, segmentation and refinement process. In the first step, coarse segmentation of speech data is performed by aligning speech signals with the corresponding sequence of Hidden Markov Models (HMMs). Then in the second step, segment boundaries are refined with a proposed Context-Dependent Boundary Model (CDBM). Classification and Regression Tree (CART) is adopted to organize available data into a structured hierarchical tree, where acoustically similar boundaries are clustered together to train tied CDBM models for boundary refinement. Optimal CDBM parameters and training conditions are found through a series of experimental studies. Comparing with manual segmentation reference, segmentation accuracy (within a tolerance of 20 ms) is improved by the CDBMs from 78.1% (baseline) to 94.8% in Mandarin Chinese and from 81.4% to 92.7% in English, with about 1,000 manually segmented sentences used in training the models. To further reduce the amount of manual data for training CDBMs of a new speaker, we adapt a well-trained CDBM via efficient adaptation algorithms. With only 10-20 manually segmented sentences as adaptation data, the adapted CDBM achieves a segmentation accuracy of 90%.
ER -
