1-1hit |
Tomohisa HIROKAWA Kenzo ITOH Hirokazu SATO
A new system for speech synthesis by concatenating waveforms selected from a dictionary is described. The dictionary is constructed from a two-hour speech that includes isolated words and sentences uttered by one male speaker, and contains over 45,000 entries which are identified by their average pitch, dynamic pitch parameter which represents micro pitch structure in a segment, duration and average amplitude. Phoneme duration is set according to phoneme environment, and phoneme power is controlled, by both pitch frequency and phoneme environment. Tests show the average errors in vowel duration and consonant duration are 28.8 ms and 16.8 ms respectively, and the vowel power average error is 2.9 dB. The pitch frequency patterns are calculated according to a conventional model in which the accent component is abbed to a gross phrase component. Set a phoneme string and prosody information, the optimum waveforms are selected from the dictionary by matching their attributes with the given phonetic and prosodic information. A waveform selection function, which has two terms corresponding to prosody and phonological coincidence between rule-set values and waveform values from the dictionary, is proposed. The weight coefficients used in the selection function are determined through subjective hearing tests. The selected waveform segments are then modified in waveform domain to further adjust for the desired prosody. A pitch frequency modification method based on pitch synchronous overlap-add technique is introduced into the system. Lastly, the waveforms are interpolated between voiced waveforms to avoid abrupt changes in voice spectrum and waveform shape. An absolute evaluation test of five grades is performed to the synthesized voice and the mean of the score is 3.1, which is over "good," and while the original speaker quality is retained.