Tomohisa HIROKAWA Kenzo ITOH Hirokazu SATO
A new system for speech synthesis by concatenating waveforms selected from a dictionary is described. The dictionary is constructed from two hours of speech, comprising isolated words and sentences uttered by one male speaker, and contains over 45,000 entries, each identified by its average pitch, a dynamic pitch parameter that represents the micro pitch structure within a segment, duration, and average amplitude. Phoneme duration is set according to phoneme environment, and phoneme power is controlled by both pitch frequency and phoneme environment. Tests show that the average errors in vowel duration and consonant duration are 28.8 ms and 16.8 ms respectively, and that the average error in vowel power is 2.9 dB. Pitch frequency patterns are calculated according to a conventional model in which an accent component is added to a gross phrase component. Given a phoneme string and prosody information, the optimum waveforms are selected from the dictionary by matching their attributes against the given phonetic and prosodic information. A waveform selection function is proposed that has two terms corresponding to prosodic and phonological coincidence between rule-set values and waveform values from the dictionary. The weight coefficients used in the selection function are determined through subjective hearing tests. The selected waveform segments are then modified in the waveform domain to further adjust them to the desired prosody; a pitch frequency modification method based on the pitch-synchronous overlap-add technique is introduced into the system. Lastly, interpolation is performed between voiced waveforms to avoid abrupt changes in voice spectrum and waveform shape. A five-grade absolute evaluation test is performed on the synthesized voice; the mean score is 3.1, which is over "good," while the original speaker quality is retained.
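The abstract does not give the exact form of the two-term selection function, but its description suggests a weighted cost of the form cost(u) = w_prosody * D_prosody(u, t) + w_phon * D_phon(u, t), minimized over candidate units u for a target t. The following is a minimal sketch under that assumption; all names, distance measures, and default weights are illustrative placeholders (the paper determines the actual weights through subjective hearing tests).

    # Hedged sketch of dictionary-based waveform selection.
    # Unit/Target fields mirror the attributes named in the abstract
    # (average pitch, dynamic pitch parameter, duration, amplitude);
    # the distance formulas and weights are assumptions.
    from dataclasses import dataclass

    @dataclass
    class Unit:                  # one dictionary entry (waveform segment)
        phoneme: str
        left_ctx: str            # preceding phoneme
        right_ctx: str           # following phoneme
        avg_pitch: float         # Hz
        dyn_pitch: float         # micro pitch-structure parameter
        duration: float          # ms
        avg_amp: float           # dB

    @dataclass
    class Target:                # rule-set phonetic and prosodic values
        phoneme: str
        left_ctx: str
        right_ctx: str
        pitch: float
        dyn_pitch: float
        duration: float
        amp: float

    def prosodic_distance(u: Unit, t: Target) -> float:
        """Distance between dictionary attributes and rule-set values."""
        return (abs(u.avg_pitch - t.pitch) / t.pitch
                + abs(u.dyn_pitch - t.dyn_pitch)
                + abs(u.duration - t.duration) / t.duration
                + abs(u.avg_amp - t.amp) / max(abs(t.amp), 1.0))

    def phonological_mismatch(u: Unit, t: Target) -> float:
        """Penalty for phoneme-environment disagreement (0 = full match)."""
        return (u.left_ctx != t.left_ctx) + (u.right_ctx != t.right_ctx)

    def select_unit(candidates: list[Unit], t: Target,
                    w_prosody: float = 1.0, w_phon: float = 1.0) -> Unit:
        """Pick the candidate minimizing the two-term selection cost."""
        same = [u for u in candidates if u.phoneme == t.phoneme]
        return min(same, key=lambda u: w_prosody * prosodic_distance(u, t)
                                       + w_phon * phonological_mismatch(u, t))

The selected unit would then be handed to the waveform-domain prosody modification stage (the pitch-synchronous overlap-add step described above), which corrects any residual mismatch the selection could not eliminate.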
The development of computers capable of handling complex objects requires nonverbal interfaces that can bidirectionally mediate nonverbal communication including the gestures of both people and computers. Nonverbal expressions are poweful media for enriching and facilitating humancomputer interaction when used as interface languages. Four gestural modes are appropriate for human-computer interaction: the sign, indication, illustration and manipulation modes. All these modes can be conveyed by a generalized gesture interface that has specific processors for each mode. The basic component of the generalized gesture interface, a gesture dictionary, is proposed. The dictionary can accept sign and indicating gestures in which postures or body shapes are significant, pass their meaning to a computer and display gestures from the computer. For this purpose it converts body shapes into gestural codes by means of two code systems and, moreover, it performs bidirectional conversions of several gesture representations. This dictionary is applied to the translation of Japanese into sign language; it displays an actor who speaks the given Japanese sentences by gesture of sign words and finger alphabets. The performance of this application confirms the adequacy and usefulness of the gesture dictionary.
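The core of such a dictionary is a bidirectional mapping between body-shape codes and word meanings: recognition looks up a posture code to pass its meaning to the computer, and display looks up a word to fetch the posture to render. The sketch below illustrates that structure only; the class, the code format, and the example entries are hypothetical, since the abstract does not specify the two gestural code systems.

    # Hedged sketch of a bidirectional gesture dictionary.
    # Posture-code strings and entries are invented for illustration.
    class GestureDictionary:
        """Maps body-shape (posture) codes to word meanings and back."""

        def __init__(self) -> None:
            self._code_to_word: dict[str, str] = {}
            self._word_to_code: dict[str, str] = {}

        def register(self, posture_code: str, word: str) -> None:
            """Add one entry to both directions of the dictionary."""
            self._code_to_word[posture_code] = word
            self._word_to_code[word] = posture_code

        def recognize(self, posture_code: str) -> str | None:
            """Gesture -> meaning: pass a recognized posture to the computer."""
            return self._code_to_word.get(posture_code)

        def generate(self, word: str) -> str | None:
            """Meaning -> gesture: fetch the posture code to display."""
            return self._word_to_code.get(word)

    # Usage: translate a toy word sequence into posture codes that an
    # on-screen actor could render as sign gestures (hypothetical entries).
    d = GestureDictionary()
    d.register("R-HAND:FLAT;PALM:DOWN", "mountain")
    d.register("R-HAND:FIST;MOVE:UP", "morning")
    codes = [d.generate(w) for w in ["morning", "mountain"]]

In the Japanese-to-sign-language application described above, the generate direction would drive the displayed actor, falling back to finger alphabets for words with no sign-word entry.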