Kazuhiro NAKAMURA Kei HASHIMOTO Yoshihiko NANKAKU Keiichi TOKUDA
This paper proposes a novel approach for integrating spectral feature extraction and acoustic modeling in hidden Markov model (HMM) based speech synthesis. The statistical modeling of speech waveforms is typically divided into two component modules: a frame-by-frame feature extraction module and an acoustic modeling module. In the feature extraction module, statistical mel-cepstral analysis is used, and the objective function is the likelihood of the mel-cepstral coefficients for given speech waveforms. In the acoustic modeling module, the objective function is the likelihood of the model parameters for given mel-cepstral coefficients. Improving each component module is important for achieving higher-quality synthesized speech. However, the final objective of a speech synthesis system is to generate natural speech waveforms from given texts, and improving each module separately does not always improve the quality of the synthesized speech. Ideally, therefore, all objective functions should be optimized under an integrated criterion that represents perceived speech quality well. In this paper, we propose an approach that models speech waveforms directly and optimizes the final objective function. Experimental results show that the proposed method outperformed the conventional methods in both objective and subjective measures.
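The contrast between the two-stage and integrated criteria described in this abstract can be sketched in symbols. The notation is an assumption introduced here for illustration and does not come from the abstract: x is the speech waveform, c the mel-cepstral coefficients, and λ the acoustic model parameters. Writing the integrated likelihood as a marginal over c is one plausible reading, not necessarily the authors' exact formulation.

```latex
\begin{align}
\hat{c} &= \arg\max_{c} \; p(x \mid c)
  && \text{(feature extraction)} \\
\hat{\lambda} &= \arg\max_{\lambda} \; p(\hat{c} \mid \lambda)
  && \text{(acoustic modeling)} \\
\hat{\lambda} &= \arg\max_{\lambda} \; p(x \mid \lambda)
  = \arg\max_{\lambda} \int p(x \mid c)\, p(c \mid \lambda)\, dc
  && \text{(integrated, assumed form)}
\end{align}
```

In the first two lines the estimate of c is fixed before λ is trained, so errors in feature extraction cannot be corrected by the acoustic model. In the third line both factors contribute to a single objective defined on the waveform.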
Hongwu YANG Dezhi HUANG Lianhong CAI
This letter proposes a novel approach for mel-cepstral analysis based on the psychoacoustic model of MPEG. A perceptual weighting function is developed by applying cubic spline interpolation to the signal-to-mask ratios (SMRs) obtained from the psychoacoustic model. Experiments on speaker identification and speech re-synthesis showed that the proposed method improved both the speaker recognition performance and the quality of the re-synthesized speech.
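The interpolation step mentioned in this abstract can be illustrated with a minimal sketch. It fits a natural cubic spline through a handful of per-band SMR values and evaluates it on a dense frequency grid. Everything concrete here is an assumption for illustration: the band centers, the SMR values, the natural boundary condition, and the function name `natural_cubic_spline` are hypothetical, and the mapping from the smoothed SMR curve to the letter's actual weighting function is not shown because the abstract does not specify it.

```python
import numpy as np

def natural_cubic_spline(x, y, xq):
    """Evaluate the natural cubic spline through (x, y) at query points xq."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xq = np.asarray(xq, dtype=float)
    n = len(x)
    h = np.diff(x)  # interval widths

    # Linear system for the second derivatives m at the knots.
    A = np.zeros((n, n))
    b = np.zeros(n)
    A[0, 0] = A[-1, -1] = 1.0  # natural boundary: zero curvature at both ends
    for i in range(1, n - 1):
        A[i, i - 1] = h[i - 1]
        A[i, i] = 2.0 * (h[i - 1] + h[i])
        A[i, i + 1] = h[i]
        b[i] = 6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
    m = np.linalg.solve(A, b)

    # Locate the interval of each query point and evaluate the cubic piece.
    i = np.clip(np.searchsorted(x, xq, side="right") - 1, 0, n - 2)
    dl = xq - x[i]        # distance to the left knot
    dr = x[i + 1] - xq    # distance to the right knot
    return ((m[i] * dr**3 + m[i + 1] * dl**3) / (6.0 * h[i])
            + (y[i] / h[i] - m[i] * h[i] / 6.0) * dr
            + (y[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * dl)

# Hypothetical band centers (Hz) and SMR values (dB); not data from the letter.
band_centers_hz = np.array([250.0, 750.0, 1500.0, 2500.0, 4000.0, 6000.0, 8000.0])
smr_db = np.array([12.0, 18.0, 15.0, 9.0, 4.0, -2.0, -6.0])

# Dense grid, e.g. the frequencies of FFT bins inside the covered range.
grid_hz = np.linspace(band_centers_hz[0], band_centers_hz[-1], 257)
smr_smooth_db = natural_cubic_spline(band_centers_hz, smr_db, grid_hz)
```

The spline passes exactly through the per-band values and gives a smooth curve between them, which is the property that makes it usable as the basis for a frequency-dependent weight.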
In this paper, we propose a peak-weighted cepstral lifter (PWL) for enhancing the spectral peaks of an all-pole model spectrum in the cepstral domain. The design parameter of the PWL is the degree of pole enhancement, that is, of pole shifting toward the unit circle. The optimal pole shifting factor is chosen by considering the sensitivity to spectral resonance peaks, the variability of cepstral variances, and the recognition accuracy. Next, we generalize the PWL so that the optimal shifting factor is determined adaptively on a frame-by-frame basis. Compared with other cepstral lifters, a speech recognizer employing the frame-adaptive PWL provides better recognition performance.
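The link between pole shifting and cepstral weighting can be checked numerically. For a minimum-phase all-pole model with poles p_i, the n-th cepstral coefficient is c_n = (1/n) Σ p_i^n, so multiplying c_n by γ^n is equivalent to scaling every pole radius by γ. The sketch below verifies this identity. It is not the PWL defined in the paper, whose exact form the abstract does not give; the function names, the example pole, and the value of γ are hypothetical.

```python
import numpy as np

def allpole_cepstrum(poles, n_coef):
    """Cepstral coefficients c_1..c_n of the all-pole model 1/A(z) with the given poles."""
    n = np.arange(1, n_coef + 1)
    # c_n = (1/n) * sum_i p_i^n; imaginary parts cancel for conjugate pole pairs.
    return np.real(np.sum(poles[:, None] ** n[None, :], axis=0)) / n

def exponential_lifter(cep, gamma):
    """Weight the n-th cepstral coefficient by gamma**n (n starts at 1)."""
    n = np.arange(1, len(cep) + 1)
    return cep * gamma ** n

# One resonance as a conjugate pole pair (hypothetical values).
pole = 0.9 * np.exp(1j * 0.3 * np.pi)
poles = np.array([pole, np.conj(pole)])

gamma = 1.05  # > 1 moves poles toward the unit circle; 0.9 * 1.05 stays inside it

cep = allpole_cepstrum(poles, 12)
liftered = exponential_lifter(cep, gamma)
shifted = allpole_cepstrum(gamma * poles, 12)

print(np.allclose(liftered, shifted))
```

A shifting factor above 1 sharpens the resonance peak, while the product of γ and the largest pole radius must stay below 1 for the shifted model to remain stable.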
Kazuhito KOISHIDA Gou HIRABAYASHI Keiichi TOKUDA Takao KOBAYASHI
We propose a wideband CELP-type speech coder at 16 kb/s based on a mel-generalized cepstral (MGC) analysis technique. Compared to linear predictive (LP) analysis, MGC analysis gives a more accurate representation of spectral zeros and takes a perceptual frequency scale into account. A major advantage of the proposed coder is that the benefits of the MGC representation of speech spectra can be incorporated into the CELP coding process. Subjective tests show that the proposed coder at 16 kb/s achieves a significant improvement in performance over a conventional 16 kb/s CELP coder under the same coding framework and bit allocation. Moreover, the proposed coder is found to outperform the ITU-T G.722 standard at 64 kb/s.
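The perceptual frequency scale in mel-cepstral and mel-generalized cepstral analysis is commonly realized by a first-order all-pass substitution, whose phase response defines the warped frequency axis. The sketch below evaluates that warping function. The value α = 0.42 is an assumption, often quoted as approximating the mel scale at 16 kHz sampling, and is not stated in the abstract; the function name `warped_frequency` is hypothetical.

```python
import numpy as np

def warped_frequency(omega, alpha):
    """Warped frequency given by the phase of the all-pass (z^-1 - alpha) / (1 - alpha z^-1)."""
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega), 1.0 - alpha * np.cos(omega))

alpha = 0.42  # assumed warping coefficient, not taken from the abstract
omega = np.linspace(0.0, np.pi, 9)  # normalized angular frequency, 0..pi
warped = warped_frequency(omega, alpha)

for w, b in zip(omega, warped):
    print(f"omega = {w:.3f} rad  ->  warped = {b:.3f} rad")
```

With a positive α the mapping keeps the endpoints 0 and π fixed and stretches the low-frequency region, so a fixed number of coefficients spends more resolution where hearing is most sensitive.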