Author Search Result

[Author] Shinsuke SAKAI (3 hits)

  • Probabilistic Concatenation Modeling for Corpus-Based Speech Synthesis

    Shinsuke SAKAI  Tatsuya KAWAHARA  Hisashi KAWAI  

     
    PAPER-Speech and Hearing

    Vol: E94-D  No: 10
    Page(s): 2006-2014

    The measure of the goodness, or inversely the cost, of concatenating synthesis units plays an important role in concatenative speech synthesis. In this paper, we present a probabilistic approach to concatenation modeling in which the goodness of concatenation is measured by the conditional probability of observing the spectral shape of the current candidate unit given the previous unit and the current phonetic context. This conditional probability is modeled by a conditional Gaussian density whose mean vector takes the form of a linear transform of the past spectral shape. Decision tree-based parameter tying is performed to achieve robust training that balances model complexity against the amount of training data available. The concatenation models were implemented for a corpus-based speech synthesizer, and the effectiveness of the proposed method was confirmed by an objective evaluation as well as a subjective listening test. We also demonstrate that the proposed method generalizes some popular conventional methods, in that those methods can be derived as special cases of the proposed method.
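
The abstract does not give the formula, so the following is only a minimal sketch of what a conditional Gaussian concatenation cost could look like. It assumes the cost is the negative log-density of the current spectral vector given the previous one, with mean `A @ x_prev + b` and a diagonal covariance. The names `concat_cost`, `A`, `b`, and `var` are hypothetical, and the decision-tree tying that would select the parameters by phonetic context is omitted.

```python
import math

def concat_cost(x_prev, x_cur, A, b, var):
    """Negative log of a Gaussian density N(x_cur; A x_prev + b, diag(var)).

    A is a list of rows, b and var are lists; all names are illustrative
    assumptions, not taken from the paper.
    """
    # Mean of the current spectral shape: a linear transform of the past one.
    mean = [sum(a * xp for a, xp in zip(row, x_prev)) + bi
            for row, bi in zip(A, b)]
    cost = 0.0
    for xc, m, v in zip(x_cur, mean, var):
        cost += 0.5 * (math.log(2 * math.pi * v) + (xc - m) ** 2 / v)
    return cost

identity = [[1.0, 0.0], [0.0, 1.0]]
same = concat_cost([1.0, 2.0], [1.0, 2.0], identity, [0.0, 0.0], [1.0, 1.0])
far = concat_cost([1.0, 2.0], [2.0, 4.0], identity, [0.0, 0.0], [1.0, 1.0])
```

Under this sketch, setting `A` to the identity, `b` to zero, and `var` to one makes the cost equal to half the squared Euclidean distance plus a constant, which is one way a distance-based join cost could arise as a special case. Whether this matches the paper's own derivation is an assumption.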

  • Fundamental Frequency Modeling for Speech Synthesis Based on a Statistical Learning Technique

    Shinsuke SAKAI  

     
    PAPER-Speech Synthesis and Prosody

    Vol: E88-D  No: 3
    Page(s): 489-495

    This paper proposes a novel multi-layer approach to fundamental frequency modeling for concatenative speech synthesis based on a statistical learning technique called additive models. We define an additive F0 contour model consisting of a long-term, intonational phrase-level component and a short-term, accentual phrase-level component, along with a least-squares error criterion that includes a regularization term. A backfitting algorithm derived from this error criterion estimates both components simultaneously by iteratively applying cubic spline smoothers. When this method is applied to a 7,000-utterance Japanese speech corpus, it achieves F0 RMS errors of 28.9 and 29.8 Hz on the training and test data, respectively, with corresponding correlation coefficients of 0.806 and 0.777. The automatically determined intonational and accentual phrase components turn out to behave smoothly, systematically, and intuitively under a variety of prosodic conditions.
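
The backfitting loop for a two-component additive model can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's implementation: the cubic spline smoothers and the regularization term are replaced by a plain bin-mean smoother, and the names `backfit`, `x_long`, and `x_short` (standing for positions within the intonational and accentual phrase) are hypothetical.

```python
def bin_mean_smoother(x, r):
    """Replace each residual by the mean residual of samples sharing its x value.

    A stand-in for the cubic spline smoother mentioned in the abstract.
    """
    sums, counts = {}, {}
    for xi, ri in zip(x, r):
        sums[xi] = sums.get(xi, 0.0) + ri
        counts[xi] = counts.get(xi, 0) + 1
    return [sums[xi] / counts[xi] for xi in x]

def backfit(y, x_long, x_short, n_iter=20):
    """Estimate y ~ f_long(x_long) + f_short(x_short) by backfitting."""
    n = len(y)
    f_long = [0.0] * n
    f_short = [0.0] * n
    for _ in range(n_iter):
        # Fit each component to the residual left by the other one.
        f_long = bin_mean_smoother(x_long, [y[i] - f_short[i] for i in range(n)])
        f_short = bin_mean_smoother(x_short, [y[i] - f_long[i] for i in range(n)])
        # Centre the short-term component so the overall level stays in f_long.
        m = sum(f_short) / n
        f_short = [v - m for v in f_short]
    return f_long, f_short

# Toy data: a phrase-level level a[i] plus a zero-mean accent-level offset b[j].
a, b = [100.0, 120.0, 110.0], [-5.0, 5.0]
x_long = [i for i in range(3) for _ in range(2)]
x_short = [j for _ in range(3) for j in range(2)]
y = [a[i] + b[j] for i, j in zip(x_long, x_short)]
f_long, f_short = backfit(y, x_long, x_short)
```

On this balanced toy grid the two components separate exactly; real F0 data would need a proper smoother and the regularized criterion described in the abstract.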

  • Admissible Stopping in Viterbi Beam Search for Unit Selection Speech Synthesis

    Shinsuke SAKAI  Tatsuya KAWAHARA  

     
    PAPER-Speech and Hearing

    Vol: E96-D  No: 6
    Page(s): 1359-1367

    Corpus-based concatenative speech synthesis has been widely investigated and deployed in recent years because it provides highly natural synthesized speech. The amount of computation required at run time, however, can often be quite large. In this paper, we propose early stopping schemes for Viterbi beam search in unit selection, with which we can stop early in the local Viterbi minimization for each unit as well as in the exploration of candidate units for a given target. The schemes take advantage of the fact that the space of the acoustic parameters of the database units is fixed, so that certain lower bounds of the concatenation costs can be precomputed. The proposed method for early stopping is admissible in that it does not change the result of the Viterbi beam search. Experiments using probability-based concatenation costs as well as distance-based costs show that the proposed methods of admissible stopping effectively reduce the amount of computation required in the Viterbi beam search while keeping its result unchanged. Furthermore, the reduction in computation turned out to be much larger when the available lower bound for concatenation costs is tighter.
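
One way early stopping in the local Viterbi minimization could work is sketched below. It assumes the previous candidates are sorted by accumulated cost and that a single precomputed lower bound on the concatenation cost into the current unit is available. The function and parameter names are hypothetical, and the abstract does not confirm that the paper's scheme has exactly this form.

```python
def local_min_with_stopping(prev, concat_cost, lower_bound):
    """Find the best predecessor for one candidate unit.

    prev: list of (accumulated_cost, unit_id), sorted ascending by cost.
    concat_cost: function unit_id -> concatenation cost into the current unit.
    lower_bound: value known to be <= concat_cost(p) for every p.
    Returns (best_cost, best_prev, number_of_cost_evaluations).
    """
    best, best_prev, n_eval = float("inf"), None, 0
    for acc, p in prev:
        # Every remaining predecessor costs at least acc + lower_bound,
        # so none of them can beat the current best: stopping is safe.
        if acc + lower_bound >= best:
            break
        n_eval += 1
        c = acc + concat_cost(p)
        if c < best:
            best, best_prev = c, p
    return best, best_prev, n_eval

prev = [(1.0, "a"), (2.0, "b"), (5.0, "c"), (9.0, "d")]
costs = {"a": 3.0, "b": 1.5, "c": 2.0, "d": 1.0}
result = local_min_with_stopping(prev, costs.get, lower_bound=1.0)
exhaustive = min(acc + costs[p] for acc, p in prev)
```

Here the loop stops after evaluating two of the four predecessors and still returns the same minimum as the exhaustive search. A tighter lower bound triggers the break earlier, consistent with the abstract's last observation.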