Yoshinori KITAHARA, Yoh'ichi TOHKURA
Speech output is expected to serve as an ideal man-machine interface, and producing emotion in it is an important issue, both for improving naturalness and for achieving more sophisticated spoken interaction between man and machine. Speech carries two kinds of information: prosodic information and phonetic features. The role of prosody in speech perception has been studied with a view to natural, high-quality speech synthesis. In this paper, the prosodic components that contribute to the expression of emotions and to their intensity are clarified by analyzing emotional speech and by conducting listening tests with synthetic speech. The analysis substitutes components of neutral speech (i.e., speech with no particular emotion) with those of emotional speech, preserving temporal correspondence by means of dynamic time warping (DTW). It is confirmed that the prosodic components, namely pitch structure, temporal structure, and amplitude structure, contribute more to the expression of emotions than the spectral structure of speech does. Listening tests using prosody-substituted speech show that temporal structure is the most important component for the expression of anger, while all three components are much more important for the intensity of anger. Pitch structure also plays a significant role in the expression of joy and sadness and in their intensity. These results make it possible to convert neutral utterances into utterances expressing various emotions, and they can also be applied to controlling the emotional characteristics of speech in synthesis by rule.
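As an illustration of the substitution idea, the following minimal Python sketch aligns a neutral utterance with an emotional one by DTW and then transplants one prosodic component (here a pitch contour) onto the neutral time axis. It is not the authors' implementation: the scalar per-frame alignment feature, the absolute-difference local distance, the unconstrained step pattern, the averaging of frames that map to the same neutral frame, and the names `dtw_path` and `substitute_contour` are all assumptions made for this example.

```python
def dtw_path(x, y):
    """Return the optimal DTW alignment between 1-D sequences x and y
    as a list of (i, j) frame-index pairs (hypothetical helper)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])  # assumed local distance
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    # Backtrack from the last frame pair to the first.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def substitute_contour(neutral_len, emotional_contour, path):
    """Map an emotional prosodic contour onto the neutral time axis.
    Emotional frames aligned to the same neutral frame are averaged
    (an assumption of this sketch)."""
    buckets = [[] for _ in range(neutral_len)]
    for i, j in path:
        buckets[i].append(emotional_contour[j])
    return [sum(b) / len(b) for b in buckets]


# Toy data: one scalar alignment feature per frame (assumed), and an
# emotional F0 contour in Hz. The emotional utterance is slower.
neutral_feat = [0.0, 1.0, 2.0, 1.0, 0.0]
emotional_feat = [0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 0.0]
emotional_f0 = [200.0, 220.0, 230.0, 260.0, 270.0, 240.0, 210.0]

path = dtw_path(neutral_feat, emotional_feat)
new_f0 = substitute_contour(len(neutral_feat), emotional_f0, path)
print(new_f0)  # [200.0, 225.0, 265.0, 240.0, 210.0]
```

In this sketch the neutral timing is kept and only the pitch contour is replaced. Substituting temporal structure instead would mean resampling the neutral frames along the same path onto the emotional time axis, and amplitude structure could be handled like the pitch contour.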