Author Search Result

[Author] Takao KOBAYASHI (23 hits)

Showing hits 1-20 of 23

  • A Training Method of Average Voice Model for HMM-Based Speech Synthesis

    Junichi YAMAGISHI, Masatsune TAMURA, Takashi MASUKO, Keiichi TOKUDA, Takao KOBAYASHI
    PAPER, Vol. E86-A, No. 8, pp. 1956-1963

    This paper describes a new training method for the average voice model used in HMM-based speech synthesis, in which an arbitrary speaker's voice is generated through speaker adaptation. When the amount of training data is limited, the distributions of the average voice model are often biased toward particular speakers and/or genders, which degrades the quality of the synthetic speech. To reduce this speaker dependence, the proposed method incorporates a context clustering technique called shared-decision-tree context clustering, together with speaker adaptive training, into the training procedure of the average voice model. Results of subjective tests show that the average voice model trained with the proposed method generates more natural-sounding speech than the conventional average voice model. Moreover, the voice characteristics and prosodic features of synthetic speech generated from the adapted model are closer to those of the target speaker than with the conventional method.

  • A Technique for Estimating Intensity of Emotional Expressions and Speaking Styles in Speech Based on Multiple-Regression HSMM

    Takashi NOSE, Takao KOBAYASHI
    PAPER-Speech and Hearing, Vol. E93-D, No. 1, pp. 116-124

    In this paper, we propose a technique for estimating the degree, or intensity, of emotional expressions and speaking styles appearing in speech. The key idea is based on a style control technique for speech synthesis using a multiple-regression hidden semi-Markov model (MRHSMM), and the proposed technique can be viewed as the inverse of that style control. In the proposed technique, the acoustic features of spectrum, power, fundamental frequency, and duration are simultaneously modeled using the MRHSMM. We derive an algorithm, based on a maximum-likelihood criterion, for estimating the explanatory variables of the MRHSMM, each of which represents the intensity of an emotional expression or speaking style appearing in the acoustic features of speech. We demonstrate the ability of the proposed technique experimentally using two types of speech data, simulated emotional speech and spontaneous speech with different speaking styles, and find that the estimated values correlate with human perception.
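
    A minimal numerical sketch of the estimation step may help. Assuming each state's output mean is a multiple regression of the style vector, mu_i = A_i [1, s']', the ML estimate of s given per-state statistics reduces to solving a set of normal equations; all function and variable names below are illustrative assumptions, not the authors' implementation.

      import numpy as np

      def estimate_style(A, Sigma_inv, obs_mean, counts):
          """ML estimate of the style vector s from per-state statistics.
          A[i]: D x (1+K) regression matrix of state i; Sigma_inv[i]: D x D
          inverse covariance; obs_mean[i]: mean of frames assigned to state i;
          counts[i]: number of such frames."""
          K = A[0].shape[1] - 1
          lhs, rhs = np.zeros((K, K)), np.zeros(K)
          for Ai, Pi, oi, ni in zip(A, Sigma_inv, obs_mean, counts):
              b, H = Ai[:, 0], Ai[:, 1:]     # bias column and style part
              lhs += ni * H.T @ Pi @ H       # accumulate normal equations
              rhs += ni * H.T @ Pi @ (oi - b)
          return np.linalg.solve(lhs, rhs)   # estimated style vector s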

  • Acoustic Modeling of Speaking Styles and Emotional Expressions in HMM-Based Speech Synthesis

    Junichi YAMAGISHI, Koji ONISHI, Takashi MASUKO, Takao KOBAYASHI
    PAPER-Speech Synthesis and Prosody, Vol. E88-D, No. 3, pp. 502-509

    This paper describes the modeling of various emotional expressions and speaking styles in synthetic speech using HMM-based speech synthesis. We show two methods for modeling speaking styles and emotional expressions. In the first method, called style-dependent modeling, each speaking style and emotional expression is modeled individually. In the second, called style-mixed modeling, each speaking style and emotional expression is treated as a context, in the same way as phonetic, prosodic, and linguistic features, and all speaking styles and emotional expressions are modeled simultaneously in a single acoustic model. We chose four styles of read speech (neutral, rough, joyful, and sad) and compared the two modeling methods using these styles. The results of subjective evaluation tests show that both modeling methods achieve almost the same accuracy and that it is possible to synthesize speech with a speaking style and emotional expression similar to those of the target speech. In a style-classification test on the synthesized speech, more than 80% of the samples generated with either model were judged to be similar to the target styles. We also show that style-mixed modeling yields fewer output and duration distributions than style-dependent modeling.
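
    As a toy illustration of the difference between the two methods: in style-mixed modeling the style tag is appended to the context label, so one model set covers all styles, while in style-dependent modeling it is omitted and each style gets its own model set. The label format below is invented for illustration only.

      # Hypothetical full-context label construction.
      def make_label(phone_context, prosodic_context, style, style_mixed=True):
          base = f"{phone_context}/{prosodic_context}"
          # style-mixed: the style is just one more context in the label
          return f"{base}/style:{style}" if style_mixed else base

      print(make_label("a^k-i+t=a", "pos:3_7", "joyful"))
      # -> a^k-i+t=a/pos:3_7/style:joyful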

  • A Rapid Model Adaptation Technique for Emotional Speech Recognition with Style Estimation Based on Multiple-Regression HMM

    Yusuke IJIMA, Takashi NOSE, Makoto TACHIBANA, Takao KOBAYASHI
    PAPER-Speech and Hearing, Vol. E93-D, No. 1, pp. 107-115

    In this paper, we propose a rapid model adaptation technique for emotional speech recognition that enables us to extract paralinguistic information as well as linguistic information from speech signals. The technique is based on style estimation and style adaptation using a multiple-regression HMM (MRHMM). In the MRHMM, the mean parameters of the output probability density functions are controlled by a low-dimensional parameter vector, called a style vector, which corresponds to a set of explanatory variables of the multiple regression. Recognition proceeds in two stages. In the first stage, the style vector representing the emotional expression category and the intensity of its expressiveness in the input speech is estimated on a sentence-by-sentence basis. In the second stage, the acoustic models are adapted using the estimated style vector, and standard HMM-based speech recognition is performed. We assess the performance of the proposed technique on simulated emotional speech uttered by both professional narrators and non-professional speakers.
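
    The two-stage flow can be sketched as follows; estimate_style and adapt_means stand for the MRHMM regression steps, and decoder stands for a standard HMM recognizer. All names are hypothetical placeholders, not the authors' code.

      # Stage 1: estimate the style vector per sentence; Stage 2: adapt the
      # acoustic model means with it, then run standard HMM recognition.
      def recognize_emotional(utterance, mrhmm, decoder):
          s = mrhmm.estimate_style(utterance)                 # stage 1
          adapted_model = mrhmm.adapt_means(s)                # mu_i = A_i @ [1, s]
          return decoder.recognize(utterance, adapted_model)  # stage 2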

  • A Context Clustering Technique for Average Voice Models

    Junichi YAMAGISHI, Masatsune TAMURA, Takashi MASUKO, Keiichi TOKUDA, Takao KOBAYASHI
    PAPER-Speech Synthesis and Prosody, Vol. E86-D, No. 3, pp. 534-542

    This paper describes a new context clustering technique for the average voice model, a set of speaker-independent speech synthesis units. In this technique, we first train speaker-dependent models using a multi-speaker speech database and then construct a decision tree common to all the speaker-dependent models for context clustering. When a node of the decision tree is split, only context-related questions applicable to every speaker-dependent model are adopted. As a result, every node of the decision tree always has training data from all speakers. After construction of the decision tree, all speaker-dependent models are clustered with the common tree, and a speaker-independent model, i.e., an average voice model, is obtained by combining the speaker-dependent models. Results of subjective tests show that average voice models trained using the proposed technique generate more natural-sounding speech than conventional average voice models.
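
    The constraint on question selection can be sketched as follows: a question is adopted at a node only if it splits off data from every speaker-dependent model, so both child nodes keep training data from all speakers. Names below are illustrative assumptions.

      def question_is_shared(question, node_samples_by_speaker):
          """True if the split leaves data from every speaker in both children."""
          for samples in node_samples_by_speaker.values():
              n_yes = sum(1 for x in samples if question(x))
              if n_yes == 0 or n_yes == len(samples):
                  return False   # some speaker would have an empty child node
          return True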

  • A Small-Chip-Area Transceiver IC for Bluetooth Featuring a Digital Channel-Selection Filter

    Masaru KOKUBO, Masaaki SHIDA, Takashi OSHIMA, Yoshiyuki SHIBAHARA, Tatsuji MATSUURA, Kazuhiko KAWAI, Takefumi ENDO, Katsumi OSAKI, Hiroki SONODA, Katsumi YAMAMOTO, Masaharu MATSUOKA, Takao KOBAYASHI, Takaaki HEMMI, Junya KUDOH, Hirokazu MIYAGAWA, Hiroto UTSUNOMIYA, Yoshiyuki EZUMI, Kunio TAKAYASU, Jun SUZUKI, Shinya AIZAWA, Mikihiko MOTOKI, Yoshiyuki ABE, Takao KUROSAWA, Satoru OOKAWARA
    PAPER, Vol. E87-C, No. 6, pp. 878-887

    We propose a new low-IF transceiver architecture that simultaneously achieves a small chip area and good minimum input sensitivity. The distinctive point of the receiver architecture is that the complicated high-order analog filter for channel selection is replaced with the combination of a simple low-order analog filter and a sharp digital band-pass filter. We also propose a fast-converging AGC (automatic gain control) circuit and a demodulation block to realize the digital architecture. On the transmitter side, we further reduce the chip area by applying a new form of direct VCO modulation. Since conventional direct VCO modulation tends to suffer from variation of the modulation index with frequency, we developed a compensation technique that minimizes this variation, and designed a low-phase-noise VCO with a new biasing method that achieves a high PSRR (power-supply rejection ratio) for the oscillation frequency. The test chip was fabricated in a 0.35-µm BiCMOS process. The chip size was 3 × 3 mm²; this very small area was made possible by the proposed transceiver architecture. The transceiver also achieved a good minimum input sensitivity of -85 dBm and showed interference performance satisfying the requirements of the Bluetooth standard.
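
    The channel-selection idea can be illustrated with a simple digital band-pass filter; the sample rate, IF, and bandwidth below are assumptions chosen only to make the sketch concrete (Bluetooth channels are 1 MHz wide), not the values used on the chip.

      import numpy as np
      from scipy import signal

      fs = 16e6    # assumed ADC sample rate
      f_if = 2e6   # assumed low-IF center frequency
      bw = 1e6     # Bluetooth channel bandwidth

      # A sharp linear-phase FIR band-pass performs the channel selection
      # that would otherwise require a high-order analog filter.
      taps = signal.firwin(129, [f_if - bw / 2, f_if + bw / 2],
                           pass_zero=False, fs=fs)

      def select_channel(x):
          """Apply the digital channel-selection filter to the IF samples x."""
          return signal.lfilter(taps, 1.0, x)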

  • Mixture Density Models Based on Mel-Cepstral Representation of Gaussian Process

    Toru TAKAHASHI, Keiichi TOKUDA, Takao KOBAYASHI, Tadashi KITAMURA
    PAPER, Vol. E86-A, No. 8, pp. 1971-1978

    This paper defines a new kind of mixture density model for modeling a quasi-stationary Gaussian process based on a mel-cepstral representation. The conventional AR mixture density model can model a quasi-stationary Gaussian AR process, but it cannot represent spectral zeros. In contrast, the proposed model is based on a frequency-warped exponential (EX) model. Accordingly, it represents spectral poles and zeros with equal weight and, furthermore, its model spectrum has high resolution at low frequencies. A parameter estimation algorithm for the proposed model is also derived, based on the EM algorithm. Experimental results show that the proposed model outperforms the AR mixture density model in modeling a frequency-warped EX process.
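
    A small sketch of the frequency-warped EX model spectrum may clarify the idea: the log magnitude spectrum is linear in the cepstral coefficients, evaluated on a mel-warped frequency axis given by a first-order all-pass. The coefficients and warping factor below are assumed inputs.

      import numpy as np

      def ex_model_spectrum(c, alpha=0.42, n_freq=256):
          """|H| = exp( sum_m c[m] * cos(m * w_warped) ) on a warped axis."""
          w = np.linspace(0.0, np.pi, n_freq)
          # first-order all-pass warping of the frequency axis
          w_warp = w + 2 * np.arctan(alpha * np.sin(w) / (1 - alpha * np.cos(w)))
          m = np.arange(len(c))
          log_h = np.cos(np.outer(w_warp, m)) @ c   # linear in c
          return np.exp(log_h)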

  • A Style Adaptation Technique for Speech Synthesis Using HSMM and Suprasegmental Features

    Makoto TACHIBANA, Junichi YAMAGISHI, Takashi MASUKO, Takao KOBAYASHI
    PAPER-Speech Synthesis, Vol. E89-D, No. 3, pp. 1092-1099

    This paper proposes a technique for synthesizing speech with a desired speaking style and/or emotional expression, based on model adaptation in an HMM-based speech synthesis framework. Speaking styles and emotional expressions are characterized by many segmental and suprasegmental features in both the spectral and prosodic domains, so it is essential to take these features into account in model adaptation. The proposed technique, called style adaptation, deals with this issue. First, a maximum likelihood linear regression (MLLR) algorithm based on the hidden semi-Markov model (HSMM) framework is presented; it provides mathematically rigorous and robust adaptation of state durations and adapts both spectral and prosodic features. Then, a novel tying method for the regression matrices of the MLLR algorithm is presented to allow both segmental and suprasegmental speech features to be incorporated into the style adaptation. The proposed tying method uses regression class trees with contextual information. Results of several subjective tests show that these techniques can perform style adaptation while maintaining the naturalness of the synthetic speech.

  • Speech Synthesis with Various Emotional Expressions and Speaking Styles by Style Interpolation and Morphing

    Makoto TACHIBANA, Junichi YAMAGISHI, Takashi MASUKO, Takao KOBAYASHI
    PAPER, Vol. E88-D, No. 11, pp. 2484-2491

    This paper describes an approach to generating speech with emotional expressivity and speaking-style variability. The approach is based on a speaking style and emotional expression modeling technique for HMM-based speech synthesis. We first model several representative styles, each of which is a speaking style and/or an emotional expression, in an HMM-based speech synthesis framework. Then, to generate synthetic speech with an intermediate style, we synthesize speech from a model obtained by interpolating the representative style models with a model interpolation technique. We assess the style interpolation technique through subjective evaluation tests using four representative styles of read speech (neutral, joyful, sad, and rough) and speech synthesized from models obtained by interpolating every pair of style models. The results show that speech synthesized from an interpolated model has a style in between the two representative ones. Moreover, we can control the degree of expressivity of speaking styles or emotions in synthesized speech by changing the interpolation ratio between the neutral model and another representative style model. We also show that style morphing, i.e., changing the style smoothly from one representative style to another by gradually changing the interpolation ratio, can be achieved in speech synthesis.
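
    A minimal sketch of the interpolation step, assuming Gaussian output distributions: the parameters of two representative style models are mixed with a ratio a, and sweeping a from 0 to 1 gives the style morphing described above. The linear combination of variances is one simple choice among possible interpolation methods.

      import numpy as np

      def interpolate_gaussians(mu1, var1, mu2, var2, a):
          """Interpolate two style models' Gaussian parameters with ratio a."""
          mu = (1.0 - a) * mu1 + a * mu2
          var = (1.0 - a) * var1 + a * var2
          return mu, var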

  • A Style Control Technique for HMM-Based Expressive Speech Synthesis

    Takashi NOSE, Junichi YAMAGISHI, Takashi MASUKO, Takao KOBAYASHI
    PAPER-Speech and Hearing, Vol. E90-D, No. 9, pp. 1406-1413

    This paper describes a technique for controlling the intensity of a desired emotional expression and/or speaking style in synthesized speech within an HMM-based speech synthesis framework. With this technique, multiple emotional expressions and speaking styles are modeled in a single model using a multiple-regression hidden semi-Markov model (MRHSMM). A set of control parameters, called the style vector, is defined, and each speech synthesis unit is modeled with the MRHSMM, in which the mean parameters of the state output and duration distributions are expressed as multiple regressions of the style vector. In the synthesis stage, the mean parameters of the synthesis units are computed from an arbitrarily given style vector, which corresponds to a point in a low-dimensional space, called the style space, each of whose coordinates represents a specific speaking style or emotion. The results of subjective evaluation tests show that the style and its intensity can be controlled by changing the style vector.
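
    The forward control can be sketched in a few lines: each mean is a multiple regression of the style vector, so moving the style vector within the style space changes the synthesized style continuously. The toy regression matrix below is an illustrative stand-in.

      import numpy as np

      def styled_mean(A, s):
          """Mean parameter as a multiple regression of the style vector s."""
          return A @ np.concatenate(([1.0], s))     # A @ [1, s']'

      A = np.array([[0.2, 1.5, -0.3],
                    [0.0, 0.4,  0.9]])              # toy 2-D mean, 2-D style
      print(styled_mean(A, np.array([0.0, 0.0])))   # neutral point of style space
      print(styled_mean(A, np.array([1.0, 0.0])))   # full intensity of style 1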

  • Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training

    Junichi YAMAGISHI, Takao KOBAYASHI
    PAPER-Speech and Hearing, Vol. E90-D, No. 2, pp. 533-543

    In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. Simultaneous adaptation of spectrum, F0, and phone duration within the HMM framework requires transforming not only the state output distributions, which correspond to spectrum and F0, but also the duration distributions, which correspond to phone duration. However, adapting state durations is not straightforward because the original HMM has no explicit duration distributions. We therefore utilize the hidden semi-Markov model (HSMM), an HMM with explicit state duration distributions, and apply an HSMM-based model adaptation algorithm that simultaneously transforms both the state output and state duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm that simultaneously normalizes the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system and demonstrate their effectiveness through subjective and objective evaluation tests.
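
    A minimal sketch of the adaptation step, assuming the linear-regression transforms have already been estimated from adaptation data: one affine transform maps the average-voice output means to the target speaker, and a second maps the state-duration means. Shapes and names are illustrative.

      import numpy as np

      def adapt_state(mu, d, W, X):
          """mu: state output mean (length D); d: state duration mean (scalar).
          W: D x (D+1) output transform; X: length-2 duration transform."""
          mu_adapted = W @ np.concatenate(([1.0], mu))  # spectrum/F0 mean
          d_adapted = X @ np.array([1.0, d])            # duration mean
          return mu_adapted, d_adapted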

  • Vector Quantization of Speech Spectral Parameters Using Statistics of Static and Dynamic Features

    Kazuhito KOISHIDA, Keiichi TOKUDA, Takashi MASUKO, Takao KOBAYASHI
    PAPER-Speech and Hearing, Vol. E84-D, No. 10, pp. 1427-1434

    This paper proposes a vector quantization scheme that takes the dynamics of the input vectors into account. In the proposed scheme, a linear transformation is applied to consecutive input vectors and the resulting vector is quantized with a distortion measure defined by the statistics. At the decoder side, the output vector sequence is determined from the statistics associated with the transmitted indices so as to maximize a likelihood, and a computationally efficient algorithm is derived to solve this maximization problem. The performance of the proposed method is evaluated on LSP parameter quantization. The LSP trajectories and the corresponding spectra are found to change quite smoothly under the proposed method, and its use yields a significant improvement in subjective quality.
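
    The decoder-side maximization has the same form as ML parameter generation with dynamic features: with stacked static/dynamic means m, diagonal precisions p, and a matrix W that appends dynamic rows to the statics, the most likely static trajectory c solves (W'PW)c = W'Pm. The sketch below uses a simple first-order difference as the dynamic feature; all names are illustrative.

      import numpy as np

      def ml_trajectory(m, p, T):
          """m, p: stacked [static, delta] means/precisions, length 2T."""
          W = np.zeros((2 * T, T))
          for t in range(T):
              W[2 * t, t] = 1.0               # static row
              W[2 * t + 1, t] = 1.0           # delta row: c[t] - c[t-1]
              if t > 0:
                  W[2 * t + 1, t - 1] = -1.0
          P = np.diag(p)
          return np.linalg.solve(W.T @ P @ W, W.T @ P @ m)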

  • Robust F0 Estimation of Speech Signal Using Harmonicity Measure Based on Instantaneous Frequency

    Dhany ARIFIANTO, Tomohiro TANAKA, Takashi MASUKO, Takao KOBAYASHI
    PAPER-Speech and Hearing, Vol. E87-D, No. 12, pp. 2812-2820

    Borrowing the notion of instantaneous frequency developed in the context of time-frequency signal analysis, an instantaneous frequency amplitude spectrum (IFAS) is introduced for estimating the fundamental frequency of speech signals in both noiseless and adverse environments. We define a harmonicity measure as a quantity that indicates the degree of periodic regularity in the IFAS and that differs substantially between a periodic signal and a noise-like waveform. The harmonicity measure is applied to detecting the presence of a fundamental frequency. We provide experimental examples to demonstrate the general applicability of the harmonicity measure and apply the proposed procedure to Japanese continuous speech. The results show that the proposed method outperforms conventional methods both with and without the presence of noise.
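
    The underlying quantity is easy to compute: the instantaneous frequency of a band-limited signal is the time derivative of the phase of its analytic signal. For a clean harmonic the IF stays flat at the harmonic frequency, while for noise it fluctuates, which is the kind of regularity a harmonicity measure can quantify. A minimal sketch:

      import numpy as np
      from scipy.signal import hilbert

      def instantaneous_frequency(x, fs):
          """IF (Hz) of signal x sampled at fs, via the analytic signal."""
          analytic = hilbert(x)                    # x + j * Hilbert{x}
          phase = np.unwrap(np.angle(analytic))
          return np.diff(phase) * fs / (2 * np.pi)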

  • Harmonics Estimation Based on Instantaneous Frequency and Its Application to Pitch Determination of Speech

    Toshihiko ABE, Takao KOBAYASHI, Satoshi IMAI
    PAPER-Speech Processing and Acoustics, Vol. E78-D, No. 9, pp. 1188-1194

    This paper proposes a technique for estimating the harmonic frequencies of speech signals based on instantaneous frequency (IF). The main problem is how to decompose the speech signal into its harmonic components. For this purpose, we use a set of band-pass filters, each of whose center frequencies changes with time so as to track the instantaneous frequency of its own output. As a result, the outputs of the band-pass filters become the harmonic components, and the instantaneous frequencies of the harmonics are estimated accurately. To evaluate the effectiveness of the approach, we apply it to pitch determination of speech: pitch determination is accomplished simply by selecting the fundamental frequency from among the harmonic components. Pitch extraction using the proposed pitch determination algorithm (PDA) is confirmed to be stable and accurate. The most significant feature of the PDA is that the extracted pitch contour is smooth and requires no post-processing such as nonlinear filtering or other smoothing. Several examples are presented to demonstrate the capability of the harmonics estimation technique and the PDA.

  • 2-D LMA Filters: Design of Stable Two-Dimensional Digital Filters with Arbitrary Magnitude Function

    Takao KOBAYASHI, Kazuyoshi FUKUSHI, Keiichi TOKUDA, Satoshi IMAI
    PAPER-Digital Image Processing, Vol. E75-A, No. 2, pp. 240-246

    This paper proposes a technique for designing two-dimensional (2-D) digital filters that approximate an arbitrary magnitude function. The technique is based on 2-D spectral factorization and rational approximation of the complex exponential function. A 2-D spectral factorization technique is used to obtain, from the desired 2-D magnitude function, a recursively computable and stable system with nonsymmetric half-plane support. Since the obtained system has an exponential-type transfer function and cannot be realized directly in rational form, a class of realizable 2-D digital filters is introduced to approximate it. This class of filters, referred to as two-dimensional log magnitude approximation (2-D LMA) filters, can be viewed as an extension of the class of 1-D LMA filters to the 2-D case. The filter coefficients are given by the 2-D complex cepstrum coefficients, i.e., the inverse Fourier transform of the logarithm of the given magnitude function, which can be computed efficiently with a 2-D FFT algorithm; computation of the filter coefficients is therefore straightforward and efficient. A simple stability condition for 2-D LMA filters is given, under which the stability of the designed filter is guaranteed. Parallel implementation of 2-D LMA filters is also discussed. Several examples are presented to demonstrate the design capability.
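
    The coefficient computation can be sketched directly from the description above: the 2-D LMA filter coefficients are the 2-D cepstrum of the desired magnitude response, i.e., the inverse 2-D FFT of its logarithm. The magnitude grid H below is an assumed design target.

      import numpy as np

      def lma_coefficients(H):
          """2-D cepstrum of the desired magnitude response H (N x N grid)."""
          log_mag = np.log(np.maximum(H, 1e-12))   # guard against log(0)
          return np.fft.ifft2(log_mag).real        # filter coefficients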

  • HMM-Based Style Control for Expressive Speech Synthesis with Arbitrary Speaker's Voice Using Model Adaptation

    Takashi NOSE, Makoto TACHIBANA, Takao KOBAYASHI
    PAPER-Speech and Hearing, Vol. E92-D, No. 3, pp. 489-497

    This paper presents methods for controlling the intensity of emotional expressions and speaking styles in an arbitrary speaker's synthetic speech using only a small amount of that speaker's speech data in HMM-based speech synthesis. Model adaptation approaches are introduced into the style control technique based on the multiple-regression hidden semi-Markov model (MRHSMM). Two approaches are proposed for training a target speaker's MRHSMMs. The first is MRHSMM-based model adaptation, in which a pretrained MRHSMM is adapted to the target speaker; for this purpose, we formulate the MLLR adaptation algorithm for the MRHSMM. The second utilizes simultaneous adaptation of speaker and style from an average voice model to obtain the target speaker's style-dependent HSMMs, which are then used to initialize the MRHSMM. Subjective evaluation using 50 adaptation sentences per style shows that the proposed methods outperform conventional speaker-dependent model training when the same amount of the target speaker's speech data is used.

  • FOREWORD (Open Access)

    Takao KOBAYASHI
    FOREWORD, Vol. E93-D, No. 9, p. 2347

  • Multi-Space Probability Distribution HMM

    Keiichi TOKUDA, Takashi MASUKO, Noboru MIYAZAKI, Takao KOBAYASHI
    INVITED PAPER-Pattern Recognition, Vol. E85-D, No. 3, pp. 455-464

    This paper proposes a new kind of hidden Markov model (HMM) based on multi-space probability distributions and derives a parameter estimation algorithm for the extended HMM. HMMs are widely used statistical models for characterizing sequences of speech spectra and have been applied successfully to speech recognition systems. HMMs are categorized into discrete HMMs and continuous HMMs, which model sequences of discrete symbols and of continuous vectors, respectively. Neither, however, can model observation sequences that consist of both continuous values and discrete symbols; F0 pattern modeling of speech is a good illustration, since F0 is a continuous value in voiced regions but undefined in unvoiced regions. The proposed HMM includes the discrete HMM and the continuous HMM as special cases and, furthermore, can model sequences consisting of observation vectors with variable dimensionality together with discrete symbols.
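
    The observation probability can be sketched for the F0 case: each observation belongs to one subspace (a continuous F0 value when voiced, a discrete symbol when unvoiced), and the state likelihood weights the corresponding density by its space weight. The parameters below are illustrative.

      import numpy as np

      def msd_likelihood(obs, w, mu, var):
          """obs: ('voiced', f0_value) or ('unvoiced', None)."""
          space, x = obs
          if space == 'unvoiced':
              return w['unvoiced']   # zero-dimensional space: weight only
          gauss = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
          return w['voiced'] * gauss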

  • Text-Independent Speaker Identification Using Gaussian Mixture Models Based on Multi-Space Probability Distribution

    Chiyomi MIYAJIMA, Yosuke HATTORI, Keiichi TOKUDA, Takashi MASUKO, Takao KOBAYASHI, Tadashi KITAMURA
    PAPER, Vol. E84-D, No. 7, pp. 847-855

    This paper presents a new approach to modeling speech spectra and pitch for text-independent speaker identification using Gaussian mixture models based on multi-space probability distributions (MSD-GMM). The MSD-GMM allows us to model continuous pitch values in voiced frames and discrete symbols in unvoiced frames within a unified framework. Spectral and pitch features are modeled jointly by a two-stream MSD-GMM. We derive maximum likelihood (ML) estimation formulae and a minimum classification error (MCE) training procedure for the MSD-GMM parameters. The MSD-GMM speaker models are evaluated on text-independent speaker identification tasks. The experimental results show that the MSD-GMM efficiently models the spectral and pitch features of each speaker and outperforms conventional speaker models. The results also demonstrate the utility of MCE training of the MSD-GMM parameters and its robustness against inter-session variability.

  • A 16 kb/s Wideband CELP-Based Speech Coder Using Mel-Generalized Cepstral Analysis

    Kazuhito KOISHIDA, Gou HIRABAYASHI, Keiichi TOKUDA, Takao KOBAYASHI
    PAPER-Speech and Hearing, Vol. E83-D, No. 4, pp. 876-883

    We propose a 16 kb/s wideband CELP-type speech coder based on mel-generalized cepstral (MGC) analysis. Compared with linear predictive (LP) analysis, MGC analysis yields a more accurate representation of spectral zeros and takes a perceptual frequency scale into account. A major advantage of the proposed coder is that these benefits of the MGC representation of speech spectra are incorporated into the CELP coding process. Subjective tests show that the proposed coder at 16 kb/s achieves a significant improvement in performance over a conventional 16 kb/s CELP coder under the same coding framework and bit allocation. Moreover, the proposed coder is found to outperform the ITU-T G.722 standard at 64 kb/s.
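
    The key device in MGC analysis can be shown in one line: the generalized logarithm, which bridges cepstral (log) and all-pole (LP-like) spectral representations as its parameter gamma moves from 0 to -1. A minimal sketch:

      import numpy as np

      def generalized_log(x, gamma):
          """s_gamma(x) = (x**gamma - 1)/gamma; equals log(x) as gamma -> 0."""
          return np.log(x) if gamma == 0 else (np.power(x, gamma) - 1.0) / gamma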

Showing hits 1-20 of 23