Ryuki TACHIBANA, Tohru NAGANO, Gakuto KURATA, Masafumi NISHIMURA, Noboru BABAGUCHI
Automatic prosody labeling is the task of automatically annotating speech corpora with prosodic labels such as syllable stresses or break indices. Prosody-labeled corpora are important for speech synthesis and automatic speech understanding, but the subtlety of the relevant physical features makes accurate labeling difficult. Since errors in the prosodic labels can lead to incorrect prosody estimation and unnatural synthetic speech, label accuracy is a key factor for text-to-speech (TTS) systems. In particular, mora accent labels, which relate to pitch, are very important for Japanese, since Japanese is a pitch-accent language and Japanese speakers have a particularly keen sense of pitch accents. However, determining the mora accents of Japanese is in some respects more difficult than detecting English stress. This is because the context of a word changes the mora accents within it, whereas English stress normally falls on the lexical primary stress of a word. In this paper, we propose a method that accurately determines the prosodic labels of Japanese using both acoustic and linguistic models. A speaker-independent linguistic model provides mora-level knowledge about the possible correct accentuations in Japanese and reduces the size of the speaker-dependent speech corpus required to train the other stochastic models. Our experiments show the effectiveness of combining these models.
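As a rough illustration of what combining an acoustic and a linguistic model can mean, the sketch below picks the mora accent label sequence with the highest weighted sum of log-scores from the two models. This is a generic log-linear combination, not necessarily the formulation used in the paper. The function name `best_accent_labels`, the `lm_weight` parameter, the H/L label strings, and all probabilities are hypothetical and chosen only for illustration.

```python
import math


def best_accent_labels(acoustic_logp, linguistic_logp, lm_weight=1.0):
    """Return the candidate label sequence with the highest combined score.

    Both arguments map a candidate label sequence (one H/L symbol per mora)
    to a log-probability. This is an assumed, generic score combination,
    not the paper's actual model.
    """
    # Only candidates scored by both models can be compared.
    shared = acoustic_logp.keys() & linguistic_logp.keys()
    if not shared:
        raise ValueError("no candidate is scored by both models")
    return max(
        shared,
        key=lambda c: acoustic_logp[c] + lm_weight * linguistic_logp[c],
    )


# Toy scores for a hypothetical three-mora word (made-up numbers).
acoustic = {
    "LHH": math.log(0.40),
    "HLL": math.log(0.35),
    "LHL": math.log(0.25),
}
linguistic = {
    "LHH": math.log(0.10),
    "HLL": math.log(0.70),
    "LHL": math.log(0.20),
}

# The acoustic scores alone slightly favor "LHH"; adding the linguistic
# scores shifts the decision to "HLL".
chosen = best_accent_labels(acoustic, linguistic)
```

In this toy case the acoustic evidence is ambiguous, so the linguistic prior decides the outcome. Setting `lm_weight=0.0` falls back to the acoustic model alone.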