Junya KOGUCHI Shinnosuke TAKAMICHI Masanori MORISE Hiroshi SARUWATARI Shigeki SAGAYAMA
We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis. Each Gaussian function of a GMM fits a local resonance of the spectrum. The GMM retains the fine spectral envelope and achieves high controllability of the structure. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech. Moreover, a DNN-based TTS synthesis method using GMM-based approximation has not been formulated, in spite of the DNN's excellent expressive ability. Therefore, we employ peak-picking-based initialization for full-band speech analysis to provide a better starting point for the iterative estimation of the GMM parameters. We introduce not only the prediction error of the GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN. Furthermore, we propose a multi-task learning method based on minimizing these errors simultaneously. We also propose a post-filter based on variance scaling of the GMM for our framework to enhance synthetic speech. Experimental results from evaluating our framework indicated that 1) the initialization method of our framework outperformed the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improved the quality of the synthetic speech; and 3) our variance-scaling-based post-filter further improved the synthetic speech.
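To make the approach concrete, here is a minimal Python sketch of fitting a 1-D GMM to a spectral envelope treated as a density over frequency, with peak-picking-based initialization of the component means; the function name, the EM details, and the variance floor are illustrative assumptions, not the paper's actual estimator.

```python
import numpy as np
from scipy.signal import find_peaks

def fit_gmm_to_envelope(freqs, env, n_iter=50):
    """Approximate a spectral envelope with a 1-D GMM over frequency,
    initializing one Gaussian per picked spectral peak (hedged sketch)."""
    p = env / env.sum()                        # envelope as a density over frequency
    peaks, _ = find_peaks(env)                 # peak-picking-based initialization
    mu = freqs[peaks].astype(float)
    K = len(mu)
    var = np.full(K, (freqs[-1] / (2.0 * K)) ** 2)
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):                    # weighted EM over frequency bins
        dens = np.exp(-0.5 * (freqs[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        nk = (p[:, None] * resp).sum(axis=0)
        mu = (p[:, None] * resp * freqs[:, None]).sum(axis=0) / nk
        var = (p[:, None] * resp * (freqs[:, None] - mu) ** 2).sum(axis=0) / nk
        var = np.maximum(var, 1.0)             # floor keeps components stable
        w = nk
    return w, mu, var
```

In this representation, a variance-scaling post-filter could simply multiply the estimated variances by a factor below one to sharpen the formant peaks.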
Daiki SEKIZAWA Shinnosuke TAKAMICHI Hiroshi SARUWATARI
This article proposes a prosody correction method based on partial model adaptation for Chinese-accented Japanese hidden Markov model (HMM)-based text-to-speech synthesis. Although text-to-speech synthesis built from non-native speech accurately reproduces the speaker's individuality in synthetic speech, the naturalness of the synthetic speech is strongly degraded. In the proposed method, to improve naturalness while preserving the speaker individuality of Chinese-accented Japanese text-to-speech synthesis, we partially utilize HMM parameters of native Japanese speech to synthesize prosody-corrected speech. Results of an experimental evaluation demonstrate that duration and F0 correction are significantly effective in improving naturalness.
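As an illustration of the partial-adaptation idea, here is a toy Python sketch: prosody-related parameter streams of the non-native model are replaced with those of the native model, while the spectral streams that carry speaker identity are kept. The flat-dictionary model structure and the stream names are assumptions for illustration only.

```python
def correct_prosody(nonnative_model, native_model, streams=("duration", "lf0")):
    """Partially adapt a non-native HMM set: borrow native prosody streams
    (duration, F0) while keeping the non-native spectral streams (toy sketch;
    real HMM sets are far more structured than a flat dict)."""
    corrected = dict(nonnative_model)              # keep spectrum -> speaker identity
    for stream in streams:
        corrected[stream] = native_model[stream]   # borrow native prosody
    return corrected
```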
Masayuki SUZUKI Ryo KUROIWA Keisuke INNAMI Shumpei KOBAYASHI Shinya SHIMIZU Nobuaki MINEMATSU Keikichi HIROSE
When synthesizing speech from Japanese text, correct assignment of accent nuclei for input text with arbitrary content is indispensable for obtaining natural-sounding synthetic speech. A phenomenon called accent sandhi occurs in utterances of Japanese: when a word is uttered in a sentence, its accent nucleus may change depending on the contexts of the preceding and succeeding words. This paper describes a statistical method for automatically predicting the accent nucleus changes due to accent sandhi. First, as the basis of the research, a database of Japanese text was constructed with labels of accent phrase boundaries and accent nucleus positions as uttered in sentences. A single native speaker of Tokyo-dialect Japanese annotated all the labels for 6,344 Japanese sentences. Then, using this database, a conditional-random-field-based method was developed to predict accent phrase boundaries and accent nuclei. The proposed method predicted accent nucleus positions for accent phrases with 94.66% accuracy, clearly surpassing the 87.48% accuracy obtained using our rule-based method. A listening experiment was also conducted comparing synthetic speech obtained using the proposed method with that obtained using the rule-based method. The results show that our method significantly improved the naturalness of the synthetic speech.
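A minimal sketch of such a CRF-based predictor using the sklearn-crfsuite package is shown below; the feature set, data layout, and label names are illustrative assumptions, not the paper's actual configuration.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def mora_features(phrase, i):
    """Illustrative per-mora features (the paper's feature set differs)."""
    m = phrase[i]
    return {
        "surface": m["surface"],
        "lexical_accent": m["lexical_accent"],  # accent nucleus in isolation
        "pos": m["pos"],
        "phrase_initial": i == 0,
        "phrase_final": i == len(phrase) - 1,
    }

# train_phrases: assumed list of phrases, each a list of mora dicts with a
# gold "sandhi_label" marking the post-sandhi accent nucleus position.
X = [[mora_features(ph, i) for i in range(len(ph))] for ph in train_phrases]
y = [[m["sandhi_label"] for m in ph] for ph in train_phrases]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, y)
predicted = crf.predict(X)
```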
Xin WANG Shinji TAKAKI Junichi YAMAGISHI
Building high-quality text-to-speech (TTS) systems without expert knowledge of the target language and/or time-consuming manual annotation of speech and text data is an important yet challenging research topic. In this kind of TTS system, it is vital to find a representation of the input text that is both effective and easy to acquire. Recently, the continuous representation of raw word inputs, called "word embedding", has been successfully used in various natural language processing tasks. It has also been used as additional or alternative linguistic input features to a neural-network-based acoustic model for TTS systems. In this paper, we further investigate the use of this embedding technique to represent phonemes, syllables and phrases for acoustic models based on recurrent and feed-forward neural networks. Results of the experiments show that most of these continuous representations cannot significantly improve the system's performance when they are fed into the acoustic model either as an additional component or as a replacement of the conventional prosodic context. However, subjective evaluation shows that the continuous representation of phrases can achieve a significant improvement when it is combined with the prosodic context as input to the acoustic model based on the feed-forward neural network.
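The input configurations compared in the paper can be pictured with a small sketch: the learned continuous representation either augments or replaces the conventional prosodic context before being fed to the acoustic network. Function and argument names are illustrative.

```python
import numpy as np

def acoustic_model_input(prosodic_context, phrase_embedding, mode="additional"):
    """Assemble the acoustic-model input vector (illustrative sketch)."""
    if mode == "additional":       # embedding combined with prosodic context
        return np.concatenate([prosodic_context, phrase_embedding])
    if mode == "replacement":      # embedding instead of prosodic context
        return phrase_embedding
    return prosodic_context        # conventional baseline
```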
Break prediction is an important step in text-to-speech systems, as break indices (BIs) have a great influence on how prosodic phrase boundaries are represented. However, accurate prediction is difficult since BIs are often chosen according to the meaning of a sentence or the reading style of the speaker. In Japanese, the prediction of an accentual phrase boundary (APB) and a major phrase boundary (MPB) is particularly difficult. Thus, this paper presents a method to compensate for the prediction errors of APBs and MPBs. First, we define a subtle BI, for which it is difficult to clearly decide between an APB and an MPB, as a variable break (VB), and an explicit BI as a fixed break (FB). A VB is identified using a classification and regression tree, and multiple prosodic targets relating to pitch and duration are then generated. Finally, unit selection is conducted using the multiple prosodic targets. The experimental results show that the proposed method improves the naturalness of synthesized speech.
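The selection step can be sketched as follows: for a VB, a candidate unit only needs to match one of the multiple generated prosodic targets well, whereas an FB yields a single target. The cost function and data layout are illustrative assumptions.

```python
def prosodic_cost(unit, target):
    """Illustrative distance over pitch and duration."""
    return abs(unit["f0"] - target["f0"]) + abs(unit["dur"] - target["dur"])

def select_unit(candidates, targets):
    """Pick the candidate closest to ANY of the generated prosodic targets,
    so a variable break is not penalized for realizing either APB- or
    MPB-like prosody (sketch)."""
    return min(candidates,
               key=lambda u: min(prosodic_cost(u, t) for t in targets))
```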
Ryuki TACHIBANA Tohru NAGANO Gakuto KURATA Masafumi NISHIMURA Noboru BABAGUCHI
Automatic prosody labeling is the task of automatically annotating prosodic labels, such as syllable stresses or break indices, in speech corpora. Prosody-labeled corpora are important for speech synthesis and automatic speech understanding. However, the subtleness of the physical features makes accurate labeling difficult. Since errors in the prosodic labels can lead to incorrect prosody estimation and unnatural synthetic sound, the accuracy of the labels is a key factor for text-to-speech (TTS) systems. In particular, mora accent labels relevant to pitch are very important for Japanese, since Japanese is a pitch-accent language and Japanese listeners have a particularly keen sense of pitch accents. However, determining the mora accents of Japanese is in some respects more difficult than English stress detection: unlike English, where stress normally falls on the lexical primary stress of a word, in Japanese the surrounding word context can change the mora accents within a word. In this paper, we propose a method that can accurately determine the prosodic labels of Japanese using both acoustic and linguistic models. A speaker-independent linguistic model provides mora-level knowledge about the possible correct accentuations in Japanese, and contributes to reducing the required size of the speaker-dependent speech corpus for training the other stochastic models. Our experiments show the effectiveness of the combination of models.
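One simple way to picture the combination is a weighted log-linear fusion of the two models' scores over candidate accent-label sequences; the paper's actual combination scheme may differ, and the names below are placeholders.

```python
def fuse_models(acoustic_logp, linguistic_logp, alpha=0.5):
    """Pick the accent-label sequence maximizing a weighted sum of acoustic
    and linguistic log-probabilities (illustrative sketch; both dicts are
    assumed to score the same candidate sequences)."""
    return max(acoustic_logp,
               key=lambda seq: alpha * acoustic_logp[seq]
                               + (1 - alpha) * linguistic_logp[seq])
```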
Yousif A. EL-IMAM Zuraidah Mohd DON
Phonetic transcription of text is an indispensable component of text-to-speech (TTS) systems and is used in acoustic modeling for speech recognition and other natural language processing applications. One approach to the transcription of written text into phonetic entities or sounds is to use a set of well-defined context- and language-dependent rules. The process of transcribing text into sounds starts by preprocessing the text and representing it by lexical items to which the rules are applicable. The rules can be segregated into phonemic and phonetic rules. Phonemic rules operate on graphemes to convert them into phonemes. Phonetic rules operate on phonemes and convert them into context-dependent phonetic entities with actual sounds. Converting written text into actual sounds, developing a comprehensive set of rules, and transforming the rules into implementable algorithms for any language raise several problems that have their origins in the relative lack of correspondence between the spelling of lexical items and their sound content. For Standard Malay (SM), these problems are not as severe as those for languages with complex spelling systems, such as English and French, but they do exist. In this paper, the development of a comprehensive computerized system for processing SM text and transcribing it into phonetic entities, and the evaluation of the performance of this system irrespective of the application, are discussed. In particular, the following issues are dealt with: (1) the spelling and other problems of SM writing and their impact on converting graphemes into phonemes, (2) the development of a comprehensive set of grapheme-to-phoneme rules for SM, (3) a description of the phonetic variation of SM, i.e., how the phonemes of SM vary in context, and the development of a set of phoneme-to-phonetic-transcription rules, (4) the formulation of the phonemic and phonetic rules into algorithms applicable to the computer-based processing of input SM text, and (5) the evaluation of the performance of the process of converting SM text into actual sounds by the above-mentioned methods.
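The flavor of such rule-based transcription can be conveyed with a toy Python sketch using ordered rewrite rules; the three rules below are a tiny illustrative subset, not the paper's SM rule set, and the phoneme symbols are ad hoc.

```python
import re

# Toy grapheme-to-phoneme rewrite rules, applied in order (illustrative only).
PHONEMIC_RULES = [
    (r"ng", "N"),   # digraph 'ng' -> velar nasal
    (r"sy", "S"),   # digraph 'sy' -> palatal fricative
    (r"c", "tS"),   # 'c' -> palatal affricate
]

def graphemes_to_phonemes(word):
    """Apply ordered context-dependent rewrite rules to a word (sketch)."""
    for pattern, replacement in PHONEMIC_RULES:
        word = re.sub(pattern, replacement, word)
    return word

print(graphemes_to_phonemes("bunga"))  # -> 'buNa' under these toy rules
```

Rule ordering matters in such systems: 'ng' must be rewritten before any rule that would consume the bare 'n' or 'g'.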
Gerasimos XYDAS Dimitris SPILIOTOPOULOS Georgios KOUROUPETROGLOU
Synthetic speech usually suffers from a poor F0 contour surface. The prediction of the underlying pitch targets relies heavily on the quality of the predicted prosodic structures, i.e., the corresponding sequences of tones and breaks. In the present work, we have utilized a linguistically enriched annotated corpus to build data-driven models for predicting prosodic structures with increased accuracy. We have then used a linear regression approach for the F0 modeling. An appropriate XML annotation scheme has been introduced to encode syntax, grammar, new or already given information, phrase subject/object information, as well as rhetorical elements in the corpus, by exploiting a Natural Language Generator (NLG) system. To demonstrate the benefits of the enriched input meta-information, we first show that while the tone and break CART predictors have high accuracy when standing alone (92.35% for breaks, 87.76% for accents and 99.03% for endtones), their application in the TtS chain degrades the linear regression pitch target model. On the other hand, the enriched linguistic meta-information minimizes the models' errors, leading to a more natural F0 surface. Both objective and subjective evaluations of the intonation contours were conducted, taking into account the propagated errors introduced by each model in the synthesis chain.
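The pitch-target step can be sketched as an ordinary least-squares fit over per-syllable feature vectors; the variable names are placeholders and the actual feature encoding in the paper is richer.

```python
from sklearn.linear_model import LinearRegression

# X_syllables: one encoded feature row per syllable (predicted tone, break
# index, plus enriched meta-information such as given/new status);
# y_targets: observed pitch-target values. Both are assumed data arrays.
f0_model = LinearRegression().fit(X_syllables, y_targets)
pitch_targets = f0_model.predict(X_syllables)
```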
Nobuaki MINEMATSU Ryuji KITA Keikichi HIROSE
Accurate estimation of the accentual attribute values of words, which is required to apply rules of Japanese word accent sandhi in prosody generation, is an important factor in realizing high-quality text-to-speech (TTS) conversion. The rules were formulated by Sagisaka et al. and are widely used in Japanese TTS conversion systems. Application of these rules, however, requires the values of a few accentual attributes of each constituent word of the input text. These attribute values cannot be found in any public database or accent dictionary of Japanese. Further, they are difficult even for native speakers of Japanese to estimate through introspection about their mother tongue alone. In this paper, an algorithm is proposed that automatically estimates these values from a large amount of data on the accent types of accentual phrases, collected through a long series of listening experiments. The proposed algorithm explicitly accounts for inter-speaker differences in knowledge of accent sandhi. To improve the coverage of the estimated values over the obtained data, the rules were tentatively modified. Evaluation experiments using two-mora accentual phrases showed the high validity of the estimated values and the modified rules, and also revealed some defects caused by the variety of linguistic expressions in Japanese.
Keiichi TOKUDA Takashi MASUKO Noboru MIYAZAKI Takao KOBAYASHI
This paper proposes a new kind of hidden Markov model (HMM) based on multi-space probability distributions, and derives a parameter estimation algorithm for the extended HMM. HMMs are widely used statistical models for characterizing sequences of speech spectra and have been successfully applied to speech recognition systems. HMMs are categorized into discrete HMMs and continuous HMMs, which can model sequences of discrete symbols and continuous vectors, respectively. However, neither conventional discrete nor continuous HMMs can be applied to observation sequences that consist of both continuous values and discrete symbols. F0 pattern modeling of speech is a good illustration: an F0 contour takes continuous values in voiced regions but only the discrete symbol "unvoiced" elsewhere. The proposed HMM includes the discrete HMM and the continuous HMM as special cases and, furthermore, can model sequences consisting of observation vectors with variable dimensionality and discrete symbols.
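For reference, the multi-space observation probability takes roughly the following form (a sketch following the standard MSD-HMM formulation; the paper's notation may differ). Each observation o consists of a set of space indices S(o) and a vector V(o):

```latex
b_i(o) = \sum_{g \in S(o)} w_{ig}\, \mathcal{N}_{ig}\bigl(V(o)\bigr),
\qquad \sum_{g=1}^{G} w_{ig} = 1 .
```

When a space g has dimensionality zero, \mathcal{N}_{ig} is defined to be 1 and the term reduces to the discrete weight w_{ig} (the discrete-HMM case); a single space of fixed nonzero dimensionality recovers the continuous HMM. For F0 modeling, voiced frames fall in a one-dimensional space and unvoiced frames in a zero-dimensional one.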
This paper describes a complete Mandarin text-to-speech system in the time domain. We take advantage of advances in memory technology, which offers ever-increasing capacity at ever-lower prices, and collect as many synthesis units as possible for a Mandarin text-to-speech system. With such an effort, we developed simpler speech processing techniques and achieved faster processing speed using only an ordinary personal computer. We also developed careful methods to measure the intelligibility, comprehensibility, and naturalness of a Mandarin text-to-speech system. Our system performs very well compared with existing systems. We first develop a set of useful algorithms and methods to deal with features of the syllables, such as duration, amplitude, fundamental frequency, and pause. Based on these algorithms and methods, we then build a Mandarin text-to-speech system. Given any Chinese text in some computerized form, e.g., in BIG-5 code representation, our system can pronounce the text in real time. Our text-to-speech system runs on an IBM 80486-compatible PC with no special hardware for signal processing. The evaluation of our text-to-speech system is based on a proposed subjective evaluation method. An evaluation was made by 51 undergraduate students. The intelligibility of our text-to-speech system is 99.5%, its comprehensibility is 92.6%, and its naturalness is 81.512 points on a percentile grading scale (the highest score is 100 points and the lowest is 0 points). Another 40 Ph.D. students also performed the same evaluation of naturalness; the result shows that the naturalness of our text-to-speech system is 82.8 points on a percentile grading scale.
Tomohisa HIROKAWA Kenzo ITOH Hirokazu SATO
A new system for speech synthesis by concatenating waveforms selected from a dictionary is described. The dictionary is constructed from two hours of speech, comprising isolated words and sentences uttered by one male speaker, and contains over 45,000 entries, each identified by its average pitch, a dynamic pitch parameter representing the micro pitch structure within a segment, duration, and average amplitude. Phoneme duration is set according to the phoneme environment, and phoneme power is controlled by both pitch frequency and phoneme environment. Tests show that the average errors in vowel duration and consonant duration are 28.8 ms and 16.8 ms respectively, and the average vowel power error is 2.9 dB. The pitch frequency patterns are calculated according to a conventional model in which the accent component is added to a gross phrase component. Given a phoneme string and prosody information, the optimum waveforms are selected from the dictionary by matching their attributes with the given phonetic and prosodic information. A waveform selection function, which has two terms corresponding to prosodic and phonological coincidence between the rule-set values and the waveform values from the dictionary, is proposed. The weight coefficients used in the selection function are determined through subjective hearing tests. The selected waveform segments are then modified in the waveform domain to further adjust them to the desired prosody. A pitch frequency modification method based on the pitch-synchronous overlap-add technique is introduced into the system. Lastly, the waveforms are interpolated between voiced waveforms to avoid abrupt changes in voice spectrum and waveform shape. A five-grade absolute evaluation test of the synthesized voice yielded a mean score of 3.1, which is above "good," while the original speaker quality is retained.
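The two-term selection function can be sketched as below; the attribute names, distance measure, and weight values are illustrative assumptions, with the weights standing in for the coefficients tuned through the subjective hearing tests.

```python
def selection_score(unit, target, w_prosody=1.0, w_phoneme=5.0):
    """Two-term waveform selection function: prosodic distance between the
    rule-set values and a dictionary entry, plus a phonological-coincidence
    term (illustrative sketch)."""
    prosody = (abs(unit["avg_pitch"] - target["avg_pitch"])
               + abs(unit["duration"] - target["duration"])
               + abs(unit["amplitude"] - target["amplitude"]))
    phoneme = 0.0 if unit["context"] == target["context"] else 1.0
    return w_prosody * prosody + w_phoneme * phoneme

# dictionary_entries and target are assumed inputs; lower score is better.
best = min(dictionary_entries, key=lambda u: selection_score(u, target))
```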
Yoshiyuki HARA Tsuneo NITTA Hiroyoshi SAITO Ken'ichiro KOBAYASHI
Text-to-speech synthesis (TTS) is currently one of the most important media conversion techniques. In this paper, we describe a Japanese TTS card developed for constructing a personal-computer-based multimedia platform, and a TTS software package developed for a workstation-based multimedia platform. Some applications of this hardware and software are also discussed. The TTS consists of a linguistic processing stage for converting text into phonetic and prosodic information, and a speech processing stage for producing speech from the phonetic and prosodic symbols. The linguistic processing stage uses morphological analysis, rewriting rules for accent movement and pause insertion, and other techniques to impart correct accentuation and natural-sounding intonation to the synthesized speech. The speech processing stage employs the cepstrum method with consonant-vowel (CV) syllables as the synthesis unit to achieve clear and smooth synthesized speech. All of the processing for converting Japanese text (consisting of mixed Kanji and Kana characters) to synthesized speech is done internally on the TTS card. This allows the card to be used widely in various applications, including electronic mail and telephone service systems, without placing any processing burden on the personal computer. The TTS software was used for an E-mail reading tool on a workstation.
Norio HIGUCHI Tohru SHIMIZU Hisashi KAWAI Seiichi YAMAMOTO
The authors developed a portable Japanese text-to-speech system using a pocket-sized formant speech synthesizer. It consists of a linguistic processor and an acoustic processor. The linguistic processor runs on an MS-DOS personal computer and determines readings and prosodic information for input sentences written in kana-kanji-mixed style. New techniques, such as minimization of a phrase cost function, a rare-compound flag, semantic information, reading-selection information, and restrictions imposed by associated particles, are used to increase the accuracy of readings and accent positions. The accuracy of determining readings and accent positions is 98.6% for sentences in newspaper articles. The linguistic processor can be used through an interface library, also developed by the authors. Consequently, it is possible not only to convert whole texts stored in text files but also to convert parts of sentences sent through the interface library sequentially, with the readings and prosodic information optimized over the whole sentence at one time. The acoustic processor is custom-made hardware that adopts new techniques for improving the rules for vowel devoicing, the control of phoneme durations, the control of the phrase components of voice fundamental frequency, and the construction of the acoustic parameter database. Owing to these modifications, the naturalness of synthetic speech generated by a Klatt-type formant speech synthesizer was improved: on a naturalness test it was rated 3.61 on a five-point scale from 0 to 4.
Kenzo ITOH Tomohisa HIROKAWA Hirokazu SATO
This paper proposes a new method of phoneme power control for speech synthesis by rule. The innovation of this method lies in its use of the phoneme environment and the relationship between speech power and pitch frequency. First, the permissible threshold (PT) for power modification is measured by subjective experiments using power manipulated speech material. As a result, it is concluded that the PT of power modification is 4.1 dB. This experimental result is significant when discussing power control and gives a criterion for power control accuracy. Next, the relationship between speech power and pitch frequency is analyzed using a very large speech data base. The results show that the relationship between phoneme power and pitch frequency is affected by the kind of phoneme, the adjoining phonemes, rising or falling pitch, and initial or final position in the sentence. Finally, we propose that the phoneme power should be controlled by pitch frequency and phoneme environment. This proposal is implemented in a waveform concatenation type text-to-speech synthesizer. This new method yields an averaged root mean square error between real and estimated speech power of 2.17 dB. This value indicates that 94% of the estimated power values are within the permissible threshold of human perception.
Keikichi HIROSE Hiroya FUJISAKI
A text-to-speech conversion system for Japanese has been developed for the purpose of producing high-quality speech output. This system consists of four processing stages: 1) linguistic processing, 2) phonological processing, 3) control parameter generation, and 4) speech waveform generation. Although the processing at the first stage is restricted to texts on general weather conditions, the other three stages can also cope with texts of news and narrations on other topics. Since the prosodic features of speech are largely related to linguistic information such as word accent, syntactic structure and discourse structure, linguistic processing over a wider range than before, at least a sentence, is indispensable for obtaining good speech quality with respect to prosody. From this point of view, the input text was restricted to weather forecast sentences, and a method for linguistic processing was developed that conducts morphological, syntactic and semantic analyses simultaneously. A quantitative model for generating fundamental frequency contours was adopted so that the linguistic information is properly reflected in the prosody of the synthetic speech. A set of prosodic rules was constructed to generate prosodic symbols representing the prosodic structures of the text from the linguistic information obtained at the first stage. A new speech synthesizer based on the terminal analog method was also developed to improve the segmental quality of synthetic speech. It consists of four paths of cascaded pole/zero filters and three waveform generators. The four paths are respectively used for the synthesis of vowels and vowel-like sounds, nasal murmur and buzz bar, frication, and plosion, while the three generators produce a voicing source waveform approximated by polynomials, a white Gaussian noise source for fricatives, and an impulse source for plosives. The validity of the above approach has been confirmed by listening tests using speech synthesized by the developed system: improvements were realized both in the quality of the prosodic features and in the quality of the segmental features of the synthetic speech.
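The quantitative F0 model referred to is presumably the superpositional command-response model associated with the second author; in its standard form the log F0 contour is the sum of a baseline, phrase components, and accent components:

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{pi}\, G_p(t - T_{0i})
  + \sum_{j=1}^{J} A_{aj} \bigl[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \bigr],
```

where, for t >= 0, G_p(t) = \alpha^2 t e^{-\alpha t} and G_a(t) = \min\{1 - (1 + \beta t) e^{-\beta t}, \gamma\}, and both are zero for t < 0; A_{pi} and A_{aj} are the magnitudes, and T_{0i}, T_{1j}, T_{2j} the timings, of the phrase and accent commands.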