
Author Search Result

[Author] Takashi MASUKO (12 hits)

Results 1-12 of 12
  • Acoustic Modeling of Speaking Styles and Emotional Expressions in HMM-Based Speech Synthesis

    Junichi YAMAGISHI  Koji ONISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech Synthesis and Prosody
    Vol: E88-D No:3, Page(s): 502-509

    This paper describes the modeling of various emotional expressions and speaking styles in synthetic speech using HMM-based speech synthesis. We present two methods for modeling speaking styles and emotional expressions. In the first method, called style-dependent modeling, each speaking style and emotional expression is modeled individually. In the second, called style-mixed modeling, each speaking style and emotional expression is treated as one of the contexts, alongside phonetic, prosodic, and linguistic features, and all speaking styles and emotional expressions are modeled simultaneously by a single acoustic model. We chose four styles of read speech -- neutral, rough, joyful, and sad -- and compared the two modeling methods on these styles. The results of subjective evaluation tests show that both modeling methods achieve almost the same accuracy, and that it is possible to synthesize speech with a speaking style and emotional expression similar to those of the target speech. In a style classification test, more than 80% of the speech samples generated with either model were judged to be similar to the target styles. We also show that the style-mixed modeling method requires fewer output and duration distributions than the style-dependent modeling method.
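
    As a minimal illustration of the two schemes (hypothetical labels and helper names, not the paper's code), the sketch below splits the corpus per style for style-dependent modeling and folds the style label into each context label for style-mixed modeling:

```python
# Style-dependent vs. style-mixed modeling: a toy contrast.
from collections import defaultdict

def style_dependent_corpora(utterances):
    """Split the corpus into one training set per style (one model each)."""
    corpora = defaultdict(list)
    for contexts, style in utterances:
        corpora[style].append(contexts)
    return corpora  # train one HMM set per key

def style_mixed_corpus(utterances):
    """Fold the style label into the context label of every unit, so a
    single model covers all styles and style becomes a clustering context."""
    return [[f"{c}/style:{style}" for c in contexts]
            for contexts, style in utterances]

utts = [(["a-k+i", "k-i+o"], "joyful"), (["a-k+i", "k-i+o"], "sad")]
print(style_dependent_corpora(utts).keys())
print(style_mixed_corpus(utts))
```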

  • A Context Clustering Technique for Average Voice Models

    Junichi YAMAGISHI  Masatsune TAMURA  Takashi MASUKO  Keiichi TOKUDA  Takao KOBAYASHI  

     
    PAPER-Speech Synthesis and Prosody
    Vol: E86-D No:3, Page(s): 534-542

    This paper describes a new context clustering technique for average voice models, where an average voice model is a set of speaker-independent speech synthesis units. In this technique, we first train speaker-dependent models using a multi-speaker speech database, and then construct a decision tree common to these speaker-dependent models for context clustering. When a node of the decision tree is split, only context-related questions that are applicable to all speaker-dependent models are adopted. As a result, every node of the decision tree always has training data from all speakers. After construction of the decision tree, all speaker-dependent models are clustered using the common decision tree, and a speaker-independent model, i.e., an average voice model, is obtained by combining the speaker-dependent models. The results of subjective tests show that average voice models trained using the proposed technique generate more natural-sounding speech than conventional average voice models.
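
    A minimal sketch of the key constraint in the proposed clustering (hypothetical data layout, not the paper's implementation): a question is admissible at a node only if every speaker's data would appear on both sides of the split:

```python
# Shared-decision-tree splitting rule: every speaker must be on both sides.
def question_is_shared(node_units, question, speakers):
    """node_units: list of (speaker, context) pairs in the node.
    question: boolean predicate on a context string."""
    for spk in speakers:
        answers = {question(ctx) for s, ctx in node_units if s == spk}
        if answers != {True, False}:  # a speaker falls entirely on one side
            return False
    return True

units = [("spk1", "a-k+i"), ("spk1", "a-p+i"),
         ("spk2", "a-k+o"), ("spk2", "a-t+i")]
is_voiceless_center = lambda ctx: ctx.split("-")[1].split("+")[0] in "ptk"
print(question_is_shared(units, is_voiceless_center, ["spk1", "spk2"]))
```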

  • A Style Adaptation Technique for Speech Synthesis Using HSMM and Suprasegmental Features

    Makoto TACHIBANA  Junichi YAMAGISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech Synthesis
    Vol: E89-D No:3, Page(s): 1092-1099

    This paper proposes a technique for synthesizing speech with a desired speaking style and/or emotional expression, based on model adaptation in an HMM-based speech synthesis framework. Speaking styles and emotional expressions are characterized by many segmental and suprasegmental properties of both the spectral and prosodic features, so it is essential to take these properties into account in model adaptation. The proposed technique, called style adaptation, deals with this issue. First, a maximum likelihood linear regression (MLLR) algorithm based on the hidden semi-Markov model (HSMM) framework is presented; it provides mathematically rigorous and robust adaptation of state durations and adapts both the spectral and prosodic features. Then, a novel tying method for the regression matrices of the MLLR algorithm is presented that allows both segmental and suprasegmental speech features to be incorporated into the style adaptation. The proposed tying method uses regression class trees with contextual information. The results of several subjective tests show that these techniques can perform style adaptation while maintaining the naturalness of the synthetic speech.
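
    A hedged sketch of the MLLR mean update at the core of such adaptation, adapted mean = W[1; mu], applied here to both an output mean and a duration mean; the matrices below are toy values, not estimated transforms:

```python
import numpy as np

def mllr_adapt_mean(W, mu):
    """W: (d, d+1) regression matrix; mu: (d,) source-style mean."""
    xi = np.concatenate(([1.0], mu))  # extended mean vector [1, mu]
    return W @ xi

d = 3
mu_out = np.zeros(d)                                    # neutral output mean
W_out = np.hstack([np.ones((d, 1)) * 0.5, np.eye(d)])   # bias + identity
print(mllr_adapt_mean(W_out, mu_out))                   # -> [0.5, 0.5, 0.5]

mu_dur = np.array([10.0])              # neutral state duration (frames)
W_dur = np.array([[2.0, 1.2]])         # lengthens durations for the style
print(mllr_adapt_mean(W_dur, mu_dur))  # -> [14.0]
```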

  • Speech Synthesis with Various Emotional Expressions and Speaking Styles by Style Interpolation and Morphing

    Makoto TACHIBANA  Junichi YAMAGISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER
    Vol: E88-D No:11, Page(s): 2484-2491

    This paper describes an approach to generating speech with emotional expressivity and speaking style variability. The approach is based on a speaking style and emotional expression modeling technique for HMM-based speech synthesis. We first model several representative styles, each of which is a speaking style and/or an emotional expression, in an HMM-based speech synthesis framework. Then, to generate synthetic speech with an intermediate style, we synthesize speech from a model obtained by interpolating the representative style models using a model interpolation technique. We assess the style interpolation technique with subjective evaluation tests using four representative styles of read speech, i.e., neutral, joyful, sad, and rough, synthesizing speech from models obtained by interpolating every pairwise combination of the representative style models. The results show that speech synthesized from an interpolated model has a style in between the two representative ones. Moreover, we can control the degree of expressivity of speaking styles or emotions in synthesized speech by changing the interpolation ratio between the neutral model and another representative style model. We also show that we can achieve style morphing in speech synthesis, namely, changing the style smoothly from one representative style to another by gradually changing the interpolation ratio.
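
    A minimal sketch of the interpolation step (this simple parameter-level form is an assumption, not the paper's exact equations): corresponding Gaussian parameters of two style models are mixed with ratio a, and morphing sweeps a from 0 to 1:

```python
import numpy as np

def interpolate_gaussian(mu_a, var_a, mu_b, var_b, a):
    """a = 0 reproduces style A; a = 1 reproduces style B; intermediate
    values yield an intermediate style."""
    mu = (1.0 - a) * mu_a + a * mu_b
    var = (1.0 - a) * var_a + a * var_b
    return mu, var

mu_neutral, var_neutral = np.array([0.0, 5.0]), np.array([1.0, 2.0])
mu_joyful,  var_joyful  = np.array([2.0, 9.0]), np.array([1.5, 3.0])
for a in (0.0, 0.25, 0.5, 0.75, 1.0):  # gradually change style (morphing)
    print(a, interpolate_gaussian(mu_neutral, var_neutral,
                                  mu_joyful, var_joyful, a))
```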

  • A Style Control Technique for HMM-Based Expressive Speech Synthesis

    Takashi NOSE  Junichi YAMAGISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing
    Vol: E90-D No:9, Page(s): 1406-1413

    This paper describes a technique for controlling the degree of expressivity of a desired emotional expression and/or speaking style of synthesized speech in an HMM-based speech synthesis framework. With this technique, multiple emotional expressions and speaking styles are modeled in a single model by using a multiple-regression hidden semi-Markov model (MRHSMM). A set of control parameters, called the style vector, is defined, and each speech synthesis unit is modeled with the MRHSMM, in which the mean parameters of the state output and duration distributions are expressed as multiple regressions on the style vector. In the synthesis stage, the mean parameters of the synthesis units are modified according to an arbitrarily given style vector, which corresponds to a point in a low-dimensional space, called the style space, each of whose coordinates represents a specific speaking style or emotion of speech. The results of subjective evaluation tests show that a style and its intensity can be controlled by changing the style vector.
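
    A small sketch of the multiple-regression form used by the MRHSMM, mu(v) = H[1; v]; the regression matrix and style axes below are hypothetical toy values:

```python
import numpy as np

def mrhsmm_mean(H, v):
    """H: (d, 1+k) regression matrix for one distribution;
    v: (k,) style vector (e.g., axes for 'joyful' and 'sad')."""
    b = np.concatenate(([1.0], v))
    return H @ b

H = np.array([[0.0, 1.0, -0.5],   # d = 2 mean dims, k = 2 style axes
              [5.0, 0.3,  0.8]])
print(mrhsmm_mean(H, np.array([0.0, 0.0])))  # neutral point of style space
print(mrhsmm_mean(H, np.array([1.0, 0.0])))  # full 'joyful'
print(mrhsmm_mean(H, np.array([0.5, 0.0])))  # weaker 'joyful' (intensity 0.5)
```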

  • A Hidden Semi-Markov Model-Based Speech Synthesis System

    Heiga ZEN  Keiichi TOKUDA  Takashi MASUKO  Takao KOBAYASHI  Tadashi KITAMURA  

     
    PAPER-Speech and Hearing
    Vol: E90-D No:5, Page(s): 825-834

    A statistical speech synthesis system based on the hidden Markov model (HMM) was recently proposed. In this system, the spectrum, excitation, and duration of speech are modeled simultaneously by context-dependent HMMs, and speech parameter vector sequences are generated from the HMMs themselves. This system defines the speech synthesis problem in a generative model framework and solves it based on the maximum likelihood (ML) criterion. However, there is an inconsistency: although state duration probability density functions (PDFs) are explicitly used in the synthesis part of the system, they have not been incorporated into its training part. This inconsistency can make the synthesized speech sound less natural. In this paper, we propose a statistical speech synthesis system based on a hidden semi-Markov model (HSMM), which can be viewed as an HMM with explicit state duration PDFs. The use of HSMMs resolves the above inconsistency because the state duration PDFs can be incorporated explicitly into both the synthesis and the training parts of the system. Subjective listening test results show that the use of HSMMs improves the naturalness of the synthesized speech.
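
    As background, a toy comparison (not the system's code) of the implicit geometric duration distribution of a standard HMM state and an explicit Gaussian duration PDF of an HSMM state:

```python
import math

def hmm_duration_pmf(a_ii, d):
    """Implicit duration of a standard HMM: geometric in the
    self-transition probability a_ii."""
    return (a_ii ** (d - 1)) * (1.0 - a_ii)

def hsmm_duration_pdf(mean, var, d):
    """Explicit (Gaussian) duration PDF attached to an HSMM state."""
    return math.exp(-0.5 * (d - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

for d in range(1, 9):
    print(d, round(hmm_duration_pmf(0.8, d), 3),
             round(hsmm_duration_pdf(5.0, 2.0, d), 3))
```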

  • Vector Quantization of Speech Spectral Parameters Using Statistics of Static and Dynamic Features

    Kazuhito KOISHIDA  Keiichi TOKUDA  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing
    Vol: E84-D No:10, Page(s): 1427-1434

    This paper proposes a vector quantization scheme that makes it possible to take the dynamics of the input vectors into account. In the proposed scheme, a linear transformation is applied to consecutive input vectors, and the resulting vector is quantized with a distortion measure defined by the statistics of the static and dynamic features. At the decoder side, the output vector sequence is determined from the statistics associated with the transmitted indices in such a way that the likelihood is maximized. To solve this maximization problem, a computationally efficient algorithm is derived. The performance of the proposed method is evaluated on LSP parameter quantization. We find that the LSP trajectories and the corresponding spectra change quite smoothly under the proposed method, and that its use results in a significant improvement in subjective quality.
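
    A hedged sketch of the decoder-side maximization, using the standard ML parameter-generation solution (W' U^-1 W) c = W' U^-1 mu, where W stacks identity and delta windows; the 1-D parameters and the particular delta window here are simplifying assumptions:

```python
import numpy as np

T = 6
# Per-frame means/variances of static and delta features (toy numbers).
mu_static  = np.array([0., 0., 2., 2., 0., 0.])
mu_delta   = np.array([0., 1., 1., -1., -1., 0.])
var_static = np.full(T, 1.0)
var_delta  = np.full(T, 0.5)

# W maps statics c to [statics; deltas]; delta_t = (c_{t+1} - c_{t-1}) / 2.
W = np.zeros((2 * T, T))
W[:T] = np.eye(T)
for t in range(T):
    if t > 0:     W[T + t, t - 1] = -0.5
    if t < T - 1: W[T + t, t + 1] = 0.5

mu = np.concatenate([mu_static, mu_delta])
Uinv = np.diag(1.0 / np.concatenate([var_static, var_delta]))
c = np.linalg.solve(W.T @ Uinv @ W, W.T @ Uinv @ mu)
print(np.round(c, 2))  # smooth trajectory consistent with both statistics
```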

  • State Duration Modeling for HMM-Based Speech Synthesis

    Heiga ZEN  Takashi MASUKO  Keiichi TOKUDA  Takayoshi YOSHIMURA  Takao KOBAYASHI  Tadashi KITAMURA  

     
    LETTER-Speech and Hearing
    Vol: E90-D No:3, Page(s): 692-693

    This paper describes explicit modeling of the state duration probability density function in HMM-based speech synthesis. We redefine, in a statistically correct manner, the probability of staying in a state for a given time interval, which is used to obtain the state duration PDF, and demonstrate improvements in the durations of synthesized speech.
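
    As background only (the letter's exact reformulation is not reproduced here), a quick numerical check that a properly defined duration PMF, here the geometric one implied by a self-transition probability, sums to one and has mean 1/(1 - a_ii):

```python
# Normalization and mean of the geometric state-duration PMF
# p(d) = a_ii^(d-1) * (1 - a_ii), d = 1, 2, ...
a_ii = 0.9
pmf = [(a_ii ** (d - 1)) * (1 - a_ii) for d in range(1, 2000)]
print(sum(pmf))                                        # ~1.0 (well-formed PMF)
print(sum(d * p for d, p in enumerate(pmf, start=1)))  # ~1/(1-a_ii) = 10
```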

  • Robust F0 Estimation of Speech Signal Using Harmonicity Measure Based on Instantaneous Frequency

    Dhany ARIFIANTO  Tomohiro TANAKA  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing
    Vol: E87-D No:12, Page(s): 2812-2820

    Borrowing the notion of instantaneous frequency developed in the context of time-frequency signal analysis, an instantaneous frequency amplitude spectrum (IFAS) is introduced for estimating the fundamental frequency of speech signals in both noiseless and adverse environments. We define a harmonicity measure as a quantity that indicates the degree of periodic regularity in the IFAS and shows a substantial difference between a periodic signal and a noise-like waveform. The harmonicity measure is applied to determine the existence of a fundamental frequency. We provide experimental examples to demonstrate the general applicability of the harmonicity measure and apply the proposed procedure to Japanese continuous speech signals. The results show that the proposed method outperforms conventional methods both with and without the presence of noise.
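
    A simplified illustration (not the paper's IFAS algorithm) of the underlying idea: the instantaneous frequency of the analytic signal concentrates sharply for a periodic signal but not for noise, which is the kind of regularity a harmonicity measure quantifies:

```python
import numpy as np
from scipy.signal import hilbert

fs = 8000
t = np.arange(0, 0.1, 1 / fs)
tone = np.sin(2 * np.pi * 200 * t)  # periodic: F0 = 200 Hz
noise = np.random.default_rng(0).standard_normal(t.size)

def instantaneous_frequency(x, fs):
    """IF = time derivative of the analytic signal's phase."""
    phase = np.unwrap(np.angle(hilbert(x)))
    return np.diff(phase) * fs / (2 * np.pi)

for name, x in [("tone", tone), ("noise", noise)]:
    f = instantaneous_frequency(x, fs)
    # Crude regularity score: how tightly the IF clusters around its median.
    spread = np.median(np.abs(f - np.median(f)))
    print(name, round(float(np.median(f)), 1), "Hz, spread",
          round(float(spread), 1))
```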

  • Multi-Space Probability Distribution HMM

    Keiichi TOKUDA  Takashi MASUKO  Noboru MIYAZAKI  Takao KOBAYASHI  

     
    INVITED PAPER-Pattern Recognition
    Vol: E85-D No:3, Page(s): 455-464

    This paper proposes a new kind of hidden Markov model (HMM) based on multi-space probability distributions, and derives a parameter estimation algorithm for the extended HMM. HMMs are widely used statistical models for characterizing sequences of speech spectra and have been successfully applied to speech recognition systems. HMMs are categorized into discrete HMMs and continuous HMMs, which can model sequences of discrete symbols and of continuous vectors, respectively. However, neither conventional discrete nor continuous HMMs can be applied to observation sequences that consist of both continuous values and discrete symbols; F0 pattern modeling of speech is a good illustration. The proposed HMM includes the discrete HMM and the continuous HMM as special cases and, furthermore, can model sequences consisting of observation vectors with variable dimensionality and discrete symbols.
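
    A minimal sketch (hypothetical parameters) of a multi-space distribution for F0, with a zero-dimensional 'unvoiced' space and a one-dimensional Gaussian 'voiced' space over log F0; each space carries a weight, and an observation names the spaces it may belong to plus a vector if any:

```python
import math

w = {0: 0.3, 1: 0.7}   # space weights, sum to 1
mu, var = 5.0, 0.04    # Gaussian for the voiced space (log F0)

def msd_prob(spaces, x=None):
    """b(o) = sum over the observation's spaces of w_g * N_g(x)."""
    total = 0.0
    for g in spaces:
        if g == 0:      # zero-dimensional space: density is defined as 1
            total += w[0]
        else:           # one-dimensional Gaussian space
            total += w[1] * math.exp(-0.5 * (x - mu) ** 2 / var) \
                     / math.sqrt(2 * math.pi * var)
    return total

print(msd_prob([0]))         # unvoiced frame
print(msd_prob([1], x=5.1))  # voiced frame with log F0 = 5.1
```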

  • Text-Independent Speaker Identification Using Gaussian Mixture Models Based on Multi-Space Probability Distribution

    Chiyomi MIYAJIMA  Yosuke HATTORI  Keiichi TOKUDA  Takashi MASUKO  Takao KOBAYASHI  Tadashi KITAMURA  

     
    PAPER
    Vol: E84-D No:7, Page(s): 847-855

    This paper presents a new approach to modeling speech spectra and pitch for text-independent speaker identification using Gaussian mixture models based on multi-space probability distributions (MSD-GMM). The MSD-GMM allows us to model continuous pitch values of voiced frames and discrete symbols of unvoiced frames in a unified framework. Spectral and pitch features are jointly modeled by a two-stream MSD-GMM. We derive maximum likelihood (ML) estimation formulae and a minimum classification error (MCE) training procedure for the MSD-GMM parameters. The MSD-GMM speaker models are evaluated on text-independent speaker identification tasks. The experimental results show that the MSD-GMM can efficiently model the spectral and pitch features of each speaker and outperforms conventional speaker models. The results also demonstrate the utility of MCE training of the MSD-GMM parameters and its robustness to inter-session variability.
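
    A hedged sketch of two-stream scoring for identification; single Gaussians stand in for the GMM/MSD-GMM streams, and all parameters are toy values:

```python
import math

def log_gauss(x, mu, var):
    return -0.5 * ((x - mu) ** 2 / var + math.log(2 * math.pi * var))

speakers = {                 # (spectral mu/var, pitch mu/var) per speaker
    "spkA": ((0.0, 1.0), (5.0, 0.05)),
    "spkB": ((1.5, 1.0), (5.4, 0.05)),
}

def identify(frames, pitch_weight=1.0):
    """Sum frame log-likelihoods of both streams; pick the best speaker."""
    scores = {}
    for name, ((ms, vs), (mp, vp)) in speakers.items():
        scores[name] = sum(log_gauss(s, ms, vs)
                           + pitch_weight * log_gauss(p, mp, vp)
                           for s, p in frames)
    return max(scores, key=scores.get), scores

test = [(1.4, 5.38), (1.6, 5.42), (1.3, 5.35)]  # (spectral, log F0) frames
print(identify(test))        # -> spkB
```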

  • A Training Method of Average Voice Model for HMM-Based Speech Synthesis

    Junichi YAMAGISHI  Masatsune TAMURA  Takashi MASUKO  Keiichi TOKUDA  Takao KOBAYASHI  

     
    PAPER
    Vol: E86-A No:8, Page(s): 1956-1963

    This paper describes a new training method for the average voice model used in speech synthesis, in which an arbitrary speaker's voice is generated based on speaker adaptation. When the amount of training data is limited, the distributions of the average voice model often exhibit speaker- and/or gender-dependent bias, which degrades the quality of the synthetic speech. In the proposed method, to reduce the influence of speaker dependence, we incorporate a context clustering technique called shared decision tree context clustering, together with speaker adaptive training, into the training procedure of the average voice model. The results of subjective tests show that the average voice model trained using the proposed method generates more natural-sounding speech than the conventional average voice model. Moreover, the voice characteristics and prosodic features of synthetic speech generated from the adapted model are closer to those of the target speaker than with the conventional method.
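
    A toy sketch of the effect speaker adaptive training aims for (bias-only transforms and 1-D data are simplifying assumptions, not the paper's formulation): removing per-speaker bias before estimating the canonical model yields a much tighter average voice distribution:

```python
import numpy as np

data = {"spk1": np.array([4.8, 5.2, 5.0]),   # speakers differ mainly by bias
        "spk2": np.array([7.1, 6.9, 7.0])}

pooled = np.concatenate(list(data.values()))
mu = pooled.mean()
print("pooled variance:", round(float(pooled.var()), 3))  # inflated by bias

offsets = {s: x.mean() - mu for s, x in data.items()}     # per-speaker transform
compensated = np.concatenate([x - offsets[s] for s, x in data.items()])
print("canonical mean:", round(float(compensated.mean()), 2),
      "variance after compensation:", round(float(compensated.var()), 3))
```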