Junya KOGUCHI Shinnosuke TAKAMICHI Masanori MORISE Hiroshi SARUWATARI Shigeki SAGAYAMA
We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis. Each Gaussian function of a GMM fits a local resonance of the spectrum. The GMM retains the fine spectral envelope and achieves high controllability of the structure. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech. Moreover, a DNN-based TTS synthesis method using GMM-based approximation has not been formulated in spite of its excellent expressive ability. Therefore, we employ peak-picking-based initialization for full-band speech analysis to provide better initialization for iterative estimation of the GMM parameters. We introduce not only the prediction error of the GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN. Furthermore, we propose a method for multi-task learning based on minimizing these errors simultaneously. We also propose a post-filter based on variance scaling of the GMM for our framework to enhance synthetic speech. Experimental results from evaluating our framework indicated that 1) the initialization method of our framework outperformed the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improved the synthetic speech; and 3) our variance-scaling-based post-filter further improved the synthetic speech.
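The idea of initializing a GMM from spectral peaks and then refining it iteratively can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_ratio` threshold, the fixed initial variance, the linear-amplitude toy envelope, and the plain EM update that treats the envelope as a histogram over frequency are all assumptions made for illustration.

```python
import math

def gaussian(x, mean, var):
    """Normal density with the given mean and variance, evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def pick_peaks(envelope, min_ratio=0.01):
    """Indices of local maxima at least min_ratio times the global maximum."""
    floor = min_ratio * max(envelope)
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i - 1] < envelope[i] >= envelope[i + 1]
            and envelope[i] >= floor]

def init_from_peaks(freqs, envelope, init_var):
    """Initial (weight, mean, variance) triples, one per picked peak."""
    peaks = pick_peaks(envelope)
    total = sum(envelope[i] for i in peaks)
    return [(envelope[i] / total, freqs[i], init_var) for i in peaks]

def em_step(freqs, envelope, gmm):
    """One EM update, treating the envelope as a histogram over frequency."""
    resp = []
    for f, e in zip(freqs, envelope):
        dens = [w * gaussian(f, m, v) for w, m, v in gmm]
        norm = sum(dens) or 1e-300  # guard against underflow far from all means
        resp.append([e * d / norm for d in dens])
    total = sum(envelope)
    updated = []
    for k in range(len(gmm)):
        mass = sum(r[k] for r in resp)
        mean = sum(r[k] * f for r, f in zip(resp, freqs)) / mass
        var = sum(r[k] * (f - mean) ** 2 for r, f in zip(resp, freqs)) / mass
        updated.append((mass / total, mean, max(var, 1e-6)))
    return updated

# Toy envelope with two resonances (hypothetical data, not from the paper).
freqs = [10.0 * i for i in range(201)]  # 0 to 2000 Hz in 10 Hz steps
envelope = [math.exp(-0.5 * ((f - 500.0) / 100.0) ** 2)
            + 0.5 * math.exp(-0.5 * ((f - 1500.0) / 100.0) ** 2) for f in freqs]

gmm = init_from_peaks(freqs, envelope, init_var=200.0 ** 2)
for _ in range(20):
    gmm = em_step(freqs, envelope, gmm)
```

Because each component starts on a picked peak, the iterative estimation begins close to the resonances it is meant to model, which is the motivation the abstract gives for peak-picking-based initialization.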
Takahiro MIYAZAKI Masanori MORISE
This work introduces a measurement model to estimate the naturalness of vibrato. We carried out a subjective evaluation using a mean opinion score (MOS). We then built a measurement model by using two-dimensional Gaussian functions. We found that three Gaussian functions can measure naturalness with an error of 4.0%.
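A model built from two-dimensional Gaussian functions can be written as a weighted sum of Gaussian bumps over two input parameters. The sketch below only shows that functional form. The axes `x` and `y`, the independent (axis-aligned) spread, and every numeric value in `COMPONENTS` are placeholders, since the abstract does not state the vibrato parameters or the fitted values.

```python
import math

def gaussian_2d(x, y, amp, mean_x, mean_y, sd_x, sd_y):
    """Axis-aligned 2-D Gaussian bump with peak height `amp`."""
    zx = (x - mean_x) / sd_x
    zy = (y - mean_y) / sd_y
    return amp * math.exp(-0.5 * (zx * zx + zy * zy))

def naturalness(x, y, components):
    """Predicted score as the sum of all Gaussian components at (x, y)."""
    return sum(gaussian_2d(x, y, *c) for c in components)

# Placeholder (amp, mean_x, mean_y, sd_x, sd_y) triples: NOT values from the paper.
COMPONENTS = [
    (3.0, 5.5, 50.0, 1.0, 20.0),
    (1.0, 4.0, 100.0, 0.8, 30.0),
    (0.5, 7.0, 30.0, 1.2, 15.0),
]

score = naturalness(5.5, 50.0, COMPONENTS)
```

In practice the component parameters would be fitted to the MOS data, and the reported error would be measured between the fitted surface and the subjective scores.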
Kenji OZAWA Shota TSUKAHARA Yuichiro KINOSHITA Masanori MORISE
The sense of presence is often used to evaluate the performance of audio-visual (AV) content and systems. However, a presence meter has yet to be realized. We consider that the sense of presence can be divided into two aspects: system presence and content presence. In this study, we focused on content presence. To estimate the overall presence of a content item, we have developed estimation models for the sense of presence in audio-only and audio-visual content. In this study, the audio-visual model is expanded to estimate the instantaneous presence in an AV content item. Initially, we conducted an evaluation experiment of the presence with 40 content items to investigate the relationship between the features of the AV content and the instantaneous presence. Based on the experimental data, a neural-network-based model was developed by expanding the previous model. To express the variation in instantaneous presence, 6 audio-related features and 14 visual-related features, which are extracted from the content items in 500-ms intervals, are used as inputs for the model. The audio-related features are loudness, sharpness, roughness, dynamic range and standard deviation in sound pressure levels, and movement of sound images. The visual-related features involve hue, lightness, saturation, and movement of visual images. After constructing the model, a generalization test confirmed that the model is sufficiently accurate to estimate the instantaneous presence. Hence, the model should contribute to the development of a presence meter.
Masanori MORISE Satoshi TSUZUKI Hideki BANNO Kenji OZAWA
This research deals with muffled speech as the evaluation target and introduces a criterion for evaluating the auditory impression in muffled speech. It focuses on the vocal tract area function (VTAF) to evaluate the auditory impression, and the criterion uses temporal differentiation of this function to track the temporal variation of the shape of the mouth. The experimental results indicate that the proposed criterion can be used to evaluate the auditory impression as well as the subjective impression.
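Temporal differentiation of a frame-wise area function can be approximated by a finite difference between consecutive frames. The following sketch assumes a hypothetical layout `vtaf_frames[t][k]` (area of tube section `k` at frame `t`) and a hypothetical summary statistic, the mean absolute variation. The actual criterion in the paper is not specified in the abstract.

```python
def temporal_difference(vtaf_frames, frame_period):
    """Finite-difference derivative of each tube section between frames."""
    return [[(cur - prev) / frame_period for prev, cur in zip(a, b)]
            for a, b in zip(vtaf_frames, vtaf_frames[1:])]

def mean_absolute_variation(vtaf_frames, frame_period):
    """Average magnitude of the temporal derivative (hypothetical summary)."""
    diffs = temporal_difference(vtaf_frames, frame_period)
    values = [abs(d) for frame in diffs for d in frame]
    return sum(values) / len(values)

# Three frames, two tube sections, frame period of 0.5 (arbitrary units).
frames = [[1.0, 2.0], [1.5, 2.0], [1.5, 1.0]]
variation = mean_absolute_variation(frames, frame_period=0.5)
```

A small value indicates that the vocal tract shape barely changes over time, which is the kind of reduced articulation movement the criterion is meant to track.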
Masanori MORISE Fumiya YOKOMORI Kenji OZAWA
A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. This new speech synthesis system offers not only high sound quality but also quick processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output against natural speech including consonants. Its processing speed was also compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real time factor (RTF) indicated that it was fast enough for real-time processing.
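The real time factor is commonly defined as the processing time divided by the duration of the processed audio, so a value below 1 means the system keeps up with real time. The sketch below measures it for an arbitrary callable. The helper names are hypothetical and the workload is a stand-in, not a call into WORLD.

```python
import time

def rtf_from_times(processing_seconds, audio_seconds):
    """Real time factor: processing time per second of audio."""
    return processing_seconds / audio_seconds

def real_time_factor(process, audio_seconds):
    """Time one call of `process` and relate it to the audio duration."""
    start = time.perf_counter()
    process()
    elapsed = time.perf_counter() - start
    return rtf_from_times(elapsed, audio_seconds)

# Stand-in workload for analysing and synthesizing 1 s of speech.
rtf = real_time_factor(lambda: sum(range(1000)), audio_seconds=1.0)
```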
Shinya HORIIKE Masanori MORISE
To improve the likability of speech, we propose a voice conversion algorithm that controls the fundamental frequency (F0) and the spectral envelope, and we carry out a subjective evaluation in which the subjects can manipulate these two speech parameters. The results show that the subjects preferred speech with a parameter related to higher brightness.
Kenji OZAWA Shota TSUKAHARA Yuichiro KINOSHITA Masanori MORISE
The sense of presence is crucial to evaluate the performance of audio-visual (AV) equipment and content. Previously, the overall presence was evaluated for a set of AV content items by asking subjects to judge the presence of the entire content item. In this study, the sense of presence is evaluated for a time-series using the method of continuous judgment by category. Specifically, the audio signals of 40 content items with durations of approximately 30 s each were recorded with a dummy head, and then presented as stimuli to subjects via headphones. The corresponding visual signals were recorded using a video camera in the full-HD format, and reproduced on a 65-inch display. In the experiments, 20 subjects evaluated the instantaneous sense of presence of each item on a seven-point scale under two conditions: audio-only or audio-visual. At the end of the time-series, the subjects also evaluated the overall presence of the item on a seven-category scale. Based on these results, the effects of visual information on the sense of presence were examined. The overall presence is highly correlated with the ten-percentile exceeded presence score, S10, which is the score that is exceeded for 10% of the time during the responses. Based on the instantaneous presence data in this study, we are one step closer to our ultimate goal of developing a real-time operational presence meter.
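The definition of S10 can be made concrete with a small sketch. It assumes uniformly sampled instantaneous scores and takes the smallest observed score that is exceeded by at most the given percentage of samples. The function name and the handling of ties are assumptions, and the paper may compute the statistic differently.

```python
def exceeded_score(scores, percent=10):
    """Smallest observed score exceeded by at most `percent`% of the samples."""
    if not scores or not 0 <= percent < 100:
        raise ValueError("need a non-empty series and 0 <= percent < 100")
    ordered = sorted(scores)
    allowed = len(ordered) * percent // 100  # samples permitted above the score
    return ordered[len(ordered) - allowed - 1]

# Hypothetical instantaneous presence judgments on a seven-point scale.
series = [4] * 18 + [7, 7]
s10 = exceeded_score(series)  # two of the twenty samples (10%) lie above this score
```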
This paper describes an evaluation of a temporally stable spectral envelope estimator proposed in our past research. The past research demonstrated that the proposed algorithm can synthesize speech that is as natural as the input speech. This paper focuses on an objective comparison, in which the proposed algorithm is compared with two modern estimation algorithms in terms of estimation performance and temporal stability. The results show that the proposed algorithm is superior to the others in both aspects.
This paper introduces a new noise generation algorithm for vocoder-based speech waveform generation. White noise is generally used for generating an aperiodic component. Since short-term white noise includes a zero-frequency component (ZFC) and inaudible components below 20 Hz, they are reduced in advance when synthesizing. We propose a new noise generation algorithm based on that for velvet noise to overcome the problem. The objective evaluation demonstrated that the proposed algorithm can reduce the unwanted components.
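For context, classic velvet noise is a sparse sequence with one randomly placed pulse of random sign in each fixed-length interval. The sketch below generates such a sequence and measures its zero-frequency component as the sum of the samples. It illustrates ordinary velvet noise only, not the modified algorithm proposed in the paper, and the function names and parameters are hypothetical.

```python
import random

def velvet_noise(n_samples, pulse_interval, seed=0):
    """Sparse noise: one +1/-1 pulse at a random position in each interval."""
    rng = random.Random(seed)
    signal = [0.0] * n_samples
    for start in range(0, n_samples, pulse_interval):
        width = min(pulse_interval, n_samples - start)
        position = start + rng.randrange(width)
        signal[position] = rng.choice((-1.0, 1.0))
    return signal

def zero_frequency_component(signal):
    """DC term of the discrete Fourier transform, i.e. the sum of the samples."""
    return sum(signal)

noise = velvet_noise(n_samples=1000, pulse_interval=20)
zfc = zero_frequency_component(noise)
```

With random signs the sum is generally nonzero for a short segment, which is the kind of unwanted zero-frequency component the abstract refers to.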