Kenzo ITOH Tomohisa HIROKAWA Hirokazu SATO
This paper proposes a new method of phoneme power control for speech synthesis by rule. The innovation of this method lies in its use of the phoneme environment and the relationship between speech power and pitch frequency. First, the permissible threshold (PT) for power modification is measured in subjective experiments using power-manipulated speech material; the PT is found to be 4.1 dB. This result gives a criterion for the accuracy required of power control. Next, the relationship between speech power and pitch frequency is analyzed using a very large speech database. The results show that the relationship between phoneme power and pitch frequency is affected by the kind of phoneme, the adjoining phonemes, rising or falling pitch, and initial or final position in the sentence. Finally, we propose that phoneme power be controlled by pitch frequency and phoneme environment. This proposal is implemented in a waveform-concatenation text-to-speech synthesizer. The new method yields an average root-mean-square error of 2.17 dB between real and estimated speech power, which indicates that 94% of the estimated power values fall within the permissible threshold of human perception.
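The step from a 2.17 dB RMS error to "94% within the 4.1 dB threshold" can be checked with a short calculation. The sketch below assumes the estimation errors are zero-mean and Gaussian with standard deviation equal to the RMS error; the abstract does not state how the 94% figure was obtained, so this is only one plausible reading, and the function name is hypothetical.

```python
import math


def fraction_within_threshold(rmse_db: float, threshold_db: float) -> float:
    """Fraction of errors with |error| <= threshold_db.

    Assumption (not from the abstract): errors are zero-mean Gaussian
    with standard deviation equal to rmse_db.
    """
    return math.erf(threshold_db / (rmse_db * math.sqrt(2.0)))


# Values quoted in the abstract: RMS error 2.17 dB, permissible threshold 4.1 dB.
fraction = fraction_within_threshold(2.17, 4.1)
print(f"fraction within threshold: {fraction:.3f}")
```

Under that assumption the fraction comes out at roughly 0.94, which is consistent with the figure reported in the abstract.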
Hiroshi IRII Kenzo ITOH Nobuhiko KITAWAKI
This paper proposes a multilingual set of speech samples to serve as the database for standardizing an artificial voice, which will be used to evaluate the talker dependency of digital coding algorithms. To investigate the impartiality of this database, the fundamental statistical speech characteristics of the samples (long-term average spectrum, instantaneous amplitude distribution, segmental power, fundamental frequency, and voiced/unvoiced ratio) are analyzed. Detailed dispersions in the samples as well as average values are derived. Taking the dispersions into account, the results agree with previous research. The dependency of speech quality on talker and language is investigated by coding this set of speech samples with three typical digital coding algorithms: PCM, ADPCM, and APC-AB. This set of speech samples reduces the evaluation bias caused by small speech-sample sizes. The multilingual speech samples are stored on CD-ROM and are publicly available.
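As an illustration of one of the statistics listed above, the following sketch computes segmental power (mean-square power per fixed-length frame, in dB) together with its average and dispersion. The frame length, the sampling rate, the handling of silent frames, and the test tone are assumptions made for this example; the abstract does not give the analysis conditions actually used.

```python
import math
import statistics


def segmental_power_db(samples, frame_len):
    """Return the mean-square power of each full frame in dB.

    Frames with zero power are skipped to avoid log10(0); this is a
    simplification chosen for the sketch, not a documented procedure.
    """
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        if mean_square > 0.0:
            powers.append(10.0 * math.log10(mean_square))
    return powers


# Hypothetical input: one second of a 100 Hz unit-amplitude tone at 8 kHz.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 100.0 * n / sample_rate)
        for n in range(sample_rate)]

powers = segmental_power_db(tone, frame_len=160)  # 20 ms frames
mean_db = statistics.mean(powers)                 # average value
spread_db = statistics.pstdev(powers)             # dispersion across frames
print(f"{len(powers)} frames, mean {mean_db:.2f} dB, spread {spread_db:.2e} dB")
```

For real speech the dispersion would be far from zero, and comparing it across talkers and languages is the kind of check the abstract describes.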
Tomohisa HIROKAWA Kenzo ITOH Hirokazu SATO
A new system for speech synthesis by concatenating waveforms selected from a dictionary is described. The dictionary is constructed from two hours of speech, comprising isolated words and sentences uttered by one male speaker, and contains over 45,000 entries, each identified by its average pitch, a dynamic pitch parameter representing the micro pitch structure within a segment, duration, and average amplitude. Phoneme duration is set according to phoneme environment, and phoneme power is controlled by both pitch frequency and phoneme environment. Tests show that the average errors in vowel duration and consonant duration are 28.8 ms and 16.8 ms respectively, and that the average error in vowel power is 2.9 dB. The pitch frequency patterns are calculated according to a conventional model in which the accent component is added to a gross phrase component. Given a phoneme string and prosody information, the optimum waveforms are selected from the dictionary by matching their attributes with the given phonetic and prosodic information. A waveform selection function is proposed, with two terms corresponding to prosodic and phonological coincidence between the rule-set values and the waveform values from the dictionary. The weight coefficients used in the selection function are determined through subjective hearing tests. The selected waveform segments are then modified in the waveform domain to further adjust for the desired prosody. A pitch frequency modification method based on the pitch-synchronous overlap-add technique is introduced into the system. Lastly, the waveforms are interpolated between voiced waveforms to avoid abrupt changes in voice spectrum and waveform shape. A five-grade absolute evaluation test of the synthesized voice gives a mean score of 3.1, which is above "good," while the original speaker quality is retained.
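The two-term selection function described above can be sketched as a weighted sum of a prosody mismatch and a phonological (context) mismatch. Everything concrete here is an assumption for illustration: the `Entry` fields, the normalization of each prosodic difference, the context-mismatch count, and the default weights are all hypothetical, since the abstract gives neither the actual form of the function nor the weights found in the hearing tests.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical dictionary entry; field names are illustrative only."""
    phoneme: str
    left_context: str
    right_context: str
    avg_pitch_hz: float
    duration_ms: float
    avg_amplitude_db: float


def prosody_cost(target: Entry, cand: Entry) -> float:
    # Relative differences for pitch and duration; the 10 dB amplitude
    # scale is an arbitrary normalization chosen for this sketch.
    return (abs(cand.avg_pitch_hz - target.avg_pitch_hz) / target.avg_pitch_hz
            + abs(cand.duration_ms - target.duration_ms) / target.duration_ms
            + abs(cand.avg_amplitude_db - target.avg_amplitude_db) / 10.0)


def phonological_cost(target: Entry, cand: Entry) -> float:
    # Count how many neighbouring phonemes differ from the target context.
    return (float(cand.left_context != target.left_context)
            + float(cand.right_context != target.right_context))


def select_waveform(target, dictionary, w_prosody=1.0, w_phon=1.0):
    """Return the lowest-cost entry for the target phoneme, or None."""
    candidates = [e for e in dictionary if e.phoneme == target.phoneme]
    return min(
        candidates,
        key=lambda e: (w_prosody * prosody_cost(target, e)
                       + w_phon * phonological_cost(target, e)),
        default=None,
    )


# Toy dictionary and target with made-up values.
dictionary = [
    Entry("a", "k", "t", 120.0, 80.0, 60.0),
    Entry("a", "k", "s", 100.0, 90.0, 62.0),
    Entry("i", "k", "t", 120.0, 80.0, 60.0),
]
target = Entry("a", "k", "t", 118.0, 82.0, 61.0)
best = select_waveform(target, dictionary)
print(best)
```

In the described system the weights would be tuned through subjective hearing tests rather than fixed at 1.0, and the selected segment would still be adjusted in the waveform domain afterwards.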