
Author Search Result

[Author] Shigeki SAGAYAMA (15 hits)

Results 1-15 of 15
  • LR Parsing with a Category Reachability Test Applied to Speech Recognition

    Kenji KITA  Tsuyoshi MORIMOTO  Shigeki SAGAYAMA  

     
    PAPER
    Vol: E76-D No:1  Page(s): 23-28

    In this paper, we propose an extended LR parsing algorithm, called LR parsing with a category reachability test (the LR-CRT algorithm). The LR-CRT algorithm enables a parser to efficiently recognize those sentences that belong to a specified grammatical category. The key point of the algorithm is to use an augmented LR parsing table in which each action entry contains a set of reachable categories. When executing a shift or reduce action, the parser checks whether the action can reach a given category using the augmented table. We apply the LR-CRT algorithm to improve a speech recognition system based on two-level LR parsing. This system uses two kinds of grammars, inter- and intra-phrase grammars, to recognize Japanese sentential speech. Two-level LR parsing guides the search of speech recognition through two-level symbol prediction, phrase category prediction and phone prediction, based on these grammars. The LR-CRT algorithm makes efficient phone prediction based on phrase category prediction possible. The system was evaluated using sentential speech data uttered phrase by phrase, and attained a word accuracy of 97.5% and a sentence accuracy of 91.2%.
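
The augmented-table idea above can be sketched in a few lines. This is a toy illustration with an invented grammar and invented category names, not the paper's actual table: each action entry carries the set of categories reachable through it, and prediction discards actions that cannot reach the requested category.

```python
# Toy augmented LR action table (hypothetical grammar, not the paper's):
# (state, lookahead symbol) -> (action, set of reachable categories).
AUGMENTED_TABLE = {
    (0, "noun"): ("shift 1", {"NP", "S"}),
    (0, "verb"): ("shift 2", {"VP"}),
    (1, "verb"): ("shift 3", {"S"}),
    (1, "$"):    ("reduce NP->noun", {"NP"}),
}

def predict(state, target_category):
    """Lookahead symbols whose action can still reach target_category."""
    return sorted(
        symbol
        for (s, symbol), (_action, reachable) in AUGMENTED_TABLE.items()
        if s == state and target_category in reachable
    )

# Asking for category "NP" in state 0 prunes the verb prediction:
print(predict(0, "NP"))
print(predict(0, "VP"))
```

In a phone-prediction setting, the same check lets the parser expand only those phone hypotheses that can lead to the phrase category predicted by the higher level.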

  • DNN-Based Full-Band Speech Synthesis Using GMM Approximation of Spectral Envelope

    Junya KOGUCHI  Shinnosuke TAKAMICHI  Masanori MORISE  Hiroshi SARUWATARI  Shigeki SAGAYAMA  

     
    PAPER-Speech and Hearing
    Publicized: 2020/09/03
    Vol: E103-D No:12  Page(s): 2673-2681

    We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis. Each Gaussian function of a GMM fits a local resonance of the spectrum. The GMM retains the fine spectral envelope and achieves high controllability of the structure. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech. Moreover, a DNN-based TTS synthesis method using GMM-based approximation has not been formulated, in spite of its excellent expressive ability. Therefore, we employ peak-picking-based initialization for full-band speech analysis to provide better initialization for the iterative estimation of the GMM parameters. We introduce not only the prediction error of the GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN. Furthermore, we propose a method of multi-task learning based on minimizing these errors simultaneously. We also propose a post-filter based on variance scaling of the GMM for our framework to enhance the synthetic speech. Experimental results from evaluating our framework indicated that 1) the initialization method of our framework outperformed the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improved the synthetic speech; and 3) our variance-scaling-based post-filter further improved the synthetic speech.
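
The peak-picking initialization can be sketched roughly as follows. The toy envelope and the exact seeding rule here are ours, not the paper's: local maxima of the sampled envelope seed the Gaussian means, and their amplitudes seed the (normalized) mixture weights, giving the iterative GMM estimation a start near the spectral resonances.

```python
def pick_peaks(envelope):
    """Indices of strict local maxima of a sampled spectral envelope."""
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i - 1] < envelope[i] > envelope[i + 1]]

def init_gmm(envelope):
    """Initial (mean index, weight) pairs for iterative GMM estimation."""
    peaks = pick_peaks(envelope)
    total = sum(envelope[i] for i in peaks)
    return [(i, envelope[i] / total) for i in peaks]

# A toy envelope with two "resonances" around bins 2 and 6:
env = [0.1, 0.5, 1.0, 0.4, 0.2, 0.6, 0.9, 0.3]
print(init_gmm(env))
```

Starting each Gaussian on a resonance is what makes this initialization behave better on full-band envelopes, where the iterative estimation otherwise converges poorly.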

  • Three Different LR Parsing Algorithms for Phoneme-Context-Dependent HMM-Based Continuous Speech Recognition

    Akito NAGAI  Shigeki SAGAYAMA  Kenji KITA  Hideaki KIKUCHI  

     
    PAPER
    Vol: E76-D No:1  Page(s): 29-37

    This paper discusses three approaches for combining an efficient LR parser and phoneme-context-dependent HMMs and compares them through continuous speech recognition experiments. In continuous speech recognition, phoneme-context-dependent allophonic models are considered very helpful for enhancing the recognition accuracy. They precisely represent allophonic variations caused by the difference in phoneme-contexts. With grammatical constraints based on a context free grammar (CFG), a generalized LR parser is one of the most efficient parsing algorithms for speech recognition. Therefore, the combination of allophonic models and a generalized LR parser is a powerful scheme enabling accurate and efficient speech recognition. In this paper, three phoneme-context-dependent LR parsing algorithms are proposed, which make it possible to drive allophonic HMMs. The algorithms are outlined as follows: (1) Algorithm for predicting the phonemic context dynamically in the LR parser using a phoneme-context-independent LR table. (2) Algorithm for converting an LR table into a phoneme-context-dependent LR table. (3) Algorithm for converting a CFG into a phoneme-context-dependent CFG. This paper also includes discussion of the results of recognition experiments, and a comparison of performance and efficiency of these three algorithms.

  • Speaker Weighted Training of HMM Using Multiple Reference Speakers

    Hiroaki HATTORI  Satoshi NAKAMURA  Kiyohiro SHIKANO  Shigeki SAGAYAMA  

     
    PAPER-Speech Processing
    Vol: E76-D No:2  Page(s): 219-226

    This paper proposes a new speaker adaptation method using a speaker weighting technique for multiple reference speaker training of a hidden Markov model (HMM). The proposed method considers the similarities between an input speaker and multiple reference speakers, and uses these similarities to control the influence of the reference speakers upon the HMM. The evaluation experiments were carried out on a /b, d, g, m, n, N/ phoneme recognition task using 8 speakers. Average recognition rates were 68.0%, 66.4%, and 65.6%, respectively, for three test sets with different speech styles. These were 4.8%, 8.8%, and 10.5% higher than the rates of the spectrum mapping method, and 1.6%, 6.7%, and 8.2% higher than the rates of multiple reference speaker training (the supplemented HMM). The evaluation experiments clarified the effectiveness of the proposed method.
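
The weighting step can be sketched in one dimension. The distances and the exponential weighting rule are our own illustrative choices: each reference speaker's model parameter is weighted by that speaker's similarity to the input speaker before the models are combined.

```python
import math

def speaker_weights(distances, sharpness=1.0):
    """Turn input-to-reference distances into normalized weights (closer = heavier)."""
    sims = [math.exp(-sharpness * d) for d in distances]
    total = sum(sims)
    return [s / total for s in sims]

def combine_means(reference_means, weights):
    """Weighted combination of the reference speakers' model means."""
    return sum(w * m for w, m in zip(weights, reference_means))

dists = [0.2, 1.0, 3.0]          # input speaker vs. three reference speakers
w = speaker_weights(dists)
print(combine_means([10.0, 20.0, 30.0], w))
```

The combined mean lands closest to the most similar reference speaker, which is the intended effect: dissimilar speakers still contribute, but weakly.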

  • Discriminative Training Based on Minimum Classification Error for a Small Amount of Data Enhanced by Vector-Field-Smoothed Bayesian Learning

    Jun-ichi TAKAHASHI  Shigeki SAGAYAMA  

     
    PAPER-Speech Processing and Acoustics
    Vol: E79-D No:12  Page(s): 1700-1707

    This paper describes how to effectively use discriminative training based on the Minimum Classification Error (MCE) criterion for a small amount of data in order to attain the highest level of recognition performance. The method is a combination of MCE training and Vector-Field-Smoothed Bayesian learning, called MAP/VFS, which combines maximum a posteriori (MAP) estimation with Vector Field Smoothing (VFS). In the proposed method, MAP/VFS significantly enhances the robustness of the acoustic modeling in MCE training. In model training, MCE training is performed using the MAP/VFS-trained model as an initial model. The same data are used in both trainings. For speaker adaptation using several dozen training words, the proposed method has been experimentally proven to be very effective. For 50-word training data, recognition errors are reduced by 47%, compared with 16.5% when using MCE alone. This high rate, of which 39% is due to MAP, an additional 4% to VFS, and a further 4% to MCE, can be attained by enhancing the MCE training capability with MAP/VFS.
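
The MAP half of MAP/VFS can be illustrated with the standard conjugate-prior update for a Gaussian mean (the prior weight tau and the data here are illustrative, not the paper's settings): with little adaptation data the estimate stays near the prior, speaker-independent mean, and it moves toward the sample mean as data accumulates.

```python
def map_mean(prior_mean, samples, tau=5.0):
    """MAP estimate of a Gaussian mean: (tau * prior + sum(x)) / (tau + n)."""
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)

prior = 0.0
few = [2.0, 2.0]                 # 2 samples: estimate stays near the prior
many = [2.0] * 100               # 100 samples: estimate approaches 2.0
print(map_mean(prior, few), map_mean(prior, many))
```

This robustness under small data is why the MAP/VFS-trained model makes a safe starting point for the subsequent discriminative (MCE) pass.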

  • Unsupervised Speaker Adaptation Using All-Phoneme Ergodic Hidden Markov Network

    Yasunage MIYAZAWA  Jun-ichi TAKAMI  Shigeki SAGAYAMA  Shoichi MATSUNAGA  

     
    PAPER-Speech Processing and Acoustics
    Vol: E78-D No:8  Page(s): 1044-1050

    This paper proposes an unsupervised speaker adaptation method using an "all-phoneme ergodic Hidden Markov Network" that combines allophonic (context-dependent phone) acoustic models with stochastic language constraints. A Hidden Markov Network (HMnet) for allophone modeling and allophonic bigram probabilities derived from a large text database are combined to yield a single large ergodic HMM which represents arbitrary speech signals in a particular language, so that the model parameters can be re-estimated from text-unknown speech samples with the Baum-Welch algorithm. When combined with the Vector Field Smoothing (VFS) technique, unsupervised speaker adaptation can be performed effectively. This method experimentally gave better performance than our previous unsupervised adaptation method, which used conventional phonetic HMMs and phoneme bigram probabilities, especially when the amount of training data was small.

  • Speaker Adaptation Based on Vector Field Smoothing

    Hiroaki HATTORI  Shigeki SAGAYAMA  

     
    PAPER-Speech Processing
    Vol: E76-D No:2  Page(s): 227-234

    This paper describes a new supervised speaker adaptation method based on vector field smoothing for small-size adaptation data. This method assumes that the correspondence of feature vectors between speakers can be viewed as a kind of smooth vector field, and interpolation and smoothing of the correspondence are introduced into the adaptation process for higher adaptation performance with small-size data. The proposed adaptation method was applied to discrete-HMM-based speech recognition and evaluated in Japanese phoneme and phrase recognition experiments. Using 10 words as the adaptation data, the proposed method produced almost the same results as the conventional codebook mapping method with 25 words. These experiments clearly confirmed the effectiveness of the proposed method.
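
The smoothing intuition can be sketched in one dimension. The Gaussian neighbor weighting and the toy values are our own choices, not the paper's formulation: the speaker-to-speaker transfer vectors of nearby points in feature space are assumed to vary smoothly, so each vector is replaced by a weighted average over its neighbors, which both suppresses outliers and lets untrained points be filled in by interpolation.

```python
import math

def smooth_vectors(positions, vectors, bandwidth=1.0):
    """Smooth each transfer vector by Gaussian-weighted neighbor averaging."""
    smoothed = []
    for p in positions:
        weights = [math.exp(-((p - q) / bandwidth) ** 2) for q in positions]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, vectors)) / total)
    return smoothed

pos = [0.0, 1.0, 2.0]
vec = [1.0, 5.0, 1.0]            # the middle transfer vector is an outlier
print(smooth_vectors(pos, vec))  # the outlier is pulled toward its neighbors
```

With only a few adaptation words, many transfer vectors are unreliable or missing entirely; this neighborhood averaging is what makes small-data adaptation viable.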

  • Automatic Determination of the Number of Mixture Components for Continuous HMMs Based on a Uniform Variance Criterion

    Tetsuo KOSAKA  Shigeki SAGAYAMA  

     
    PAPER
    Vol: E78-D No:6  Page(s): 642-647

    We discuss how to automatically determine the number of mixture components in continuous mixture density HMMs (CHMMs). A notable trend in recent years has been the use of CHMMs. One of the major problems with a CHMM is how to determine its structure, that is, how many mixture components and states it has and its optimal topology. The number of mixture components has so far been determined heuristically. To solve this problem, we first investigate the influence of the number of mixture components on the model parameters and the output log-likelihood value. As a result, in contrast to the "mixture number uniformity" applied in conventional approaches to determine the number of mixture components, we propose the principle of "distribution size uniformity". An algorithm is introduced for automatically determining the number of mixture components. The performance of this algorithm is shown through recognition experiments involving all Japanese phonemes. Two types of experiments are carried out. One assumes that the number of mixture components for each state is the same within a phonetic model but may vary between states belonging to different phonemes. The other assumes that each state has a variable number of mixture components. These two experiments give better results than the conventional method.
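
The "distribution size uniformity" principle can be rendered as a simple allocation rule. The thresholding below is our own simplification of the idea, not the paper's algorithm: instead of giving every state the same number of components, each state gets enough components that the average spread covered per component falls below a common target, so states with broader distributions receive more components.

```python
import math

def num_components(state_variance, target_size):
    """Components needed so each covers roughly target_size of the variance."""
    return max(1, math.ceil(state_variance / target_size))

# Toy per-state variances for three hypothetical phoneme states:
variances = {"a": 0.5, "i": 2.0, "u": 4.5}
plan = {state: num_components(v, target_size=1.0) for state, v in variances.items()}
print(plan)
```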

  • Speaker-Consistent Parsing for Speaker-Independent Continuous Speech Recognition

    Kouichi YAMAGUCHI  Harald SINGER  Shoichi MATSUNAGA  Shigeki SAGAYAMA  

     
    PAPER
    Vol: E78-D No:6  Page(s): 719-724

    This paper describes a novel speaker-independent speech recognition method, called "speaker-consistent parsing", which is based on an intra-speaker correlation called the speaker-consistency principle. We focus on the fact that a sentence or a string of words is uttered by an individual speaker even in a speaker-independent task. Thus, the proposed method searches through speaker variations in addition to the contents of utterances. As a result of the recognition process, an appropriate standard speaker is selected for speaker adaptation. This new method is experimentally compared with a conventional speaker-independent speech recognition method. Since the speaker-consistency principle best demonstrates its effect with a large number of training and test speakers, a small-scale experiment may not fully exploit this principle. Nevertheless, even the results of our small-scale experiment show that the new method significantly outperforms the conventional method. In addition, this framework's speaker selection mechanism can drastically reduce the likelihood map computation.
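
A toy rendering of the consistency constraint (the scores are invented): rather than letting each word pick its own best reference speaker, the search commits to a single speaker for the whole utterance and maximizes the total score under that constraint, which also yields the selected standard speaker for later adaptation.

```python
def consistent_speaker(scores_per_word):
    """scores_per_word: list of {speaker: acoustic score}, one dict per word.
    Pick the single reference speaker that maximizes the sentence-level score."""
    speakers = scores_per_word[0].keys()
    totals = {s: sum(word[s] for word in scores_per_word) for s in speakers}
    best = max(totals, key=totals.get)
    return best, totals[best]

# Word 2 alone would prefer speaker "B", but "A" wins sentence-wide:
words = [{"A": 0.9, "B": 0.2}, {"A": 0.5, "B": 0.6}, {"A": 0.8, "B": 0.3}]
print(consistent_speaker(words))
```

Committing to one speaker is also what shrinks the search: per-speaker likelihoods need only be maintained for hypotheses consistent with the surviving speaker candidates.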

  • Speech Recognition Using Function-Word N-Grams and Content-Word N-Grams

    Ryosuke ISOTANI  Shoichi MATSUNAGA  Shigeki SAGAYAMA  

     
    PAPER
    Vol: E78-D No:6  Page(s): 692-697

    This paper proposes a new stochastic language model for speech recognition based on function-word N-grams and content-word N-grams. The conventional word N-gram models are effective for speech recognition, but they represent only local constraints within a few successive words and lack the ability to capture global syntactic or semantic relationships between words. To represent more global constraints, the proposed language model gives the N-gram probabilities of word sequences, with attention given only to function words or to content words. The sequences of function words and of content words are expected to represent syntactic and semantic constraints, respectively. Probabilities of function-word bigrams and content-word bigrams were estimated from a 10,000-sentence text database, and analysis using an information-theoretic measure showed that the expected constraints were extracted appropriately. As an application of this model to speech recognition, a post-processor was constructed to select the optimum sentence candidate from a phrase lattice obtained by a phrase recognition system. The phrase candidate sequence with the highest total acoustic and linguistic score was sought by dynamic programming. The results of experiments carried out on the utterances of 12 speakers showed that the proposed method is more accurate than a CFG-based method, thus demonstrating its effectiveness in improving speech recognition performance.
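
The two-stream counting can be sketched as follows. The function-word list and the toy sentence are our own English stand-ins (the paper works on Japanese): bigram counts are collected separately over the subsequence of function words and the subsequence of content words, so each stream sees "long-distance" neighbors that plain word bigrams would miss.

```python
from collections import Counter

# Illustrative stand-in for a function-word list (hypothetical, not the paper's):
FUNCTION_WORDS = {"the", "a", "of", "is", "to"}

def stream_bigrams(sentence, keep_function):
    """Bigram counts over only the function-word (or only the content-word) stream."""
    stream = [w for w in sentence
              if (w in FUNCTION_WORDS) == keep_function]
    return Counter(zip(stream, stream[1:]))

sent = "the cat is on the mat".split()
print(stream_bigrams(sent, keep_function=True))    # function-word stream
print(stream_bigrams(sent, keep_function=False))   # content-word stream
```

Note how "the ... is" and "cat ... on" become adjacent within their streams even though other words intervene in the surface sentence.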

  • Isolated Word Recognition Using Pitch Pattern Information

    Satoshi TAKAHASHI  Sho-ichi MATSUNAGA  Shigeki SAGAYAMA  

     
    PAPER-Speech
    Vol: E76-A No:2  Page(s): 231-236

    This paper describes a new technique for isolated word recognition that uses both pitch information and spectral information. In conventional methods, words with similar phoneme features tend to be misrecognized even if their phonemes are accented differently, because these methods use only spectral information. It is possible to improve recognition accuracy by considering the pitch patterns of words; many phonetically similar Japanese words are distinguished by their pitch patterns. In this technique, a pitch pattern template is produced by averaging the pitch patterns obtained from a set of words which have the same accent pattern. A measure for word recognition is proposed that is based on a combination of the phoneme likelihood and the pitch pattern distance, i.e., the distance between the pitch pattern of the input speech and the pitch pattern templates. Speaker-dependent word recognition experiments were carried out using 216 Japanese words uttered by five male and five female speakers. The proposed technique reduces the recognition error rate by 40% compared with the conventional method using only the phoneme likelihood.
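
The combined measure can be sketched with invented weights and pitch values (the paper's exact distance and weighting are not reproduced here): the recognition score mixes the phoneme likelihood with the negated distance to each word's pitch-pattern template, so words that match in phonemes but differ in accent are pulled apart.

```python
def pitch_distance(pattern, template):
    """Mean squared distance between an input pitch pattern and a template."""
    return sum((p - t) ** 2 for p, t in zip(pattern, template)) / len(template)

def combined_score(phoneme_loglik, pattern, template, weight=1.0):
    """Higher is better: phoneme likelihood penalized by pitch-pattern distance."""
    return phoneme_loglik - weight * pitch_distance(pattern, template)

rising, falling = [1.0, 2.0, 3.0], [3.0, 2.0, 1.0]   # toy accent templates
input_pitch = [1.1, 2.0, 2.9]                        # clearly a rising accent
# Same phoneme likelihood, different accent templates:
print(combined_score(-10.0, input_pitch, rising))
print(combined_score(-10.0, input_pitch, falling))
```

Two homophone-like candidates with identical phoneme scores now separate cleanly on the pitch term alone.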

  • Bayesian Nonparametric Approach to Blind Separation of Infinitely Many Sparse Sources

    Hirokazu KAMEOKA  Misa SATO  Takuma ONO  Nobutaka ONO  Shigeki SAGAYAMA  

     
    PAPER
    Vol: E96-A No:10  Page(s): 1928-1937

    This paper deals with the problem of underdetermined blind source separation (BSS) where the number of sources is unknown. We propose a BSS approach that simultaneously estimates the number of sources, separates the sources based on the sparseness of speech, estimates the direction of arrival of each source, and performs permutation alignment. We confirmed experimentally that reasonably good separation was obtained with the present method without specifying the number of sources.

  • Spoken Sentence Recognition Based on HMM-LR with Hybrid Language Modeling

    Kenji KITA  Tsuyoshi MORIMOTO  Kazumi OHKURA  Shigeki SAGAYAMA  Yaneo YANO  

     
    PAPER
    Vol: E77-D No:2  Page(s): 258-265

    This paper describes Japanese spoken sentence recognition using hybrid language modeling, which combines the advantages of both syntactic and stochastic language models. As the baseline system, we adopted the HMM-LR speech recognition system, with which we have already achieved good performance for Japanese phrase recognition tasks. Several improvements have been made to this system aimed at handling continuously spoken sentences. The first improvement is HMM training with continuous utterances as well as word utterances. In previous implementations, HMMs were trained with only word utterances. Continuous utterances are included in the HMM training data because coarticulation effects are much stronger in continuous utterances. The second improvement is the development of a sentential grammar for Japanese. The sentential grammar was created by combining inter- and intra-phrase CFG grammars, which were developed separately. The third improvement is the incorporation of stochastic linguistic knowledge, which includes stochastic CFG and a bigram model of production rules. The system was evaluated using continuously spoken sentences from a conference registration task that included approximately 750 words. We attained a sentence accuracy of 83.9% in the speaker-dependent condition.

  • Continuous Speech Recognition Using a Dependency Grammar and Phoneme-Based HMMs

    Sho-ichi MATSUNAGA  Shigeru HOMMA  Shigeki SAGAYAMA  Sadaoki FURUI  

     
    PAPER-Continuous Speech Recognition
    Vol: E74-A No:7  Page(s): 1826-1833

    This paper describes two Japanese continuous speech recognition systems (system-1 and system-2) based on phoneme-based HMMs and a two-level grammar approach. The two grammars are an intra-phrase transition network grammar for phrase recognition and an inter-phrase dependency grammar for sentence recognition. A joint score, combining acoustic likelihood and linguistic certainty factors derived from phoneme-based HMMs and dependency rules, is maximized to obtain the best sentence recognition results. System-1 is tuned for sentences uttered phrase-by-phrase and system-2 is tuned for sentence utterances, to keep the amount of computation practical. In system-1, two efficient parsing algorithms are used, one for each grammar: a bi-directional network parser and a breadth-first dependency parser. With the phrase-network parser, input phrase utterances are parsed bi-directionally, both left-to-right and right-to-left, and optimal Viterbi paths are found along which the accumulated phonetic likelihood is maximized. The dependency parser utilizes efficient breadth-first search and beam search algorithms. For system-2, we have extended the dependency analysis algorithm to sentence utterances, using a technique for detecting the most likely multi-phrase candidates based on the Viterbi phrase alignment. Where the perplexity of the phrase syntax is 40, system-1 and system-2 increase phrase recognition performance in the sentence by approximately 6% and 14%, respectively, showing the effectiveness of semantic dependency analysis.

  • A Study of Line Spectrum Pair Frequency Representation for Speech Recognition

    Fikret S. GURGEN  Shigeki SAGAYAMA  Sadaoki FURUI  

     
    PAPER-Speech
    Vol: E75-A No:1  Page(s): 98-102

    This paper investigates the performance of the line spectrum pair (LSP) frequency parameter representation for speech recognition. Transitional parameters of LSP frequencies are defined using first-order regression coefficients. The transitional and the instantaneous frequency parameters are linearly combined to generate a single feature vector used for recognition. The performance of the single vector is compared with that of the cepstral coefficients (CC) representation using a minimum-distance classifier in speaker-independent isolated word recognition experiments. In the speech recognition experiments, the transitional and the instantaneous coefficients are also combined in the distance domain. In addition, inverse-variance-weighted Euclidean measures are defined using LSP frequencies to achieve Mel-scale-like warping, and the warped frequencies are used in recognition experiments. The performance of the single feature vector defined with transitional and instantaneous LSP frequencies is found to be the best among the measures used in the experiments.
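
The first-order regression (delta) coefficients referred to above have a standard form, sketched here for a single parameter track (the window half-length K and the edge-clamping convention are common choices, not necessarily the paper's): delta_t = sum_k k * x[t+k] / sum_k k^2 over a symmetric window.

```python
def delta_coeffs(frames, K=2):
    """First-order regression coefficients of a 1-D parameter track."""
    norm = sum(k * k for k in range(-K, K + 1))
    deltas = []
    for t in range(len(frames)):
        num = 0.0
        for k in range(-K, K + 1):
            # Clamp indices at the track edges, a common convention.
            idx = min(max(t + k, 0), len(frames) - 1)
            num += k * frames[idx]
        deltas.append(num / norm)
    return deltas

track = [0.0, 1.0, 2.0, 3.0, 4.0]    # a linear ramp has constant slope 1
print(delta_coeffs(track))
```

On the interior of a linear ramp the regression recovers the slope exactly; the clamped edges underestimate it, which is the usual trade-off of this convention.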