Kenji KITA Tsuyoshi MORIMOTO Shigeki SAGAYAMA
In this paper, we propose an extended LR parsing algorithm, called LR parsing with a category reachability test (the LR-CRT algorithm). The LR-CRT algorithm enables a parser to efficiently recognize those sentences that belong to a specified grammatical category. The key point of the algorithm is to use an augmented LR parsing table in which each action entry contains a set of reachable categories. When executing a shift or reduce action, the parser uses the augmented table to check whether the action can reach a given category. We apply the LR-CRT algorithm to improve a speech recognition system based on two-level LR parsing. This system uses two kinds of grammars, inter- and intra-phrase grammars, to recognize Japanese sentential speech. Two-level LR parsing guides the speech recognition search through two levels of symbol prediction, phrase category prediction and phone prediction, based on these grammars. The LR-CRT algorithm enables efficient phone prediction based on the phrase category prediction. The system was evaluated using sentential speech data uttered phrase by phrase, and attained a word accuracy of 97.5% and a sentence accuracy of 91.2%.
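To illustrate the mechanism, here is a minimal Python sketch of a category reachability check against an augmented LR table; the table contents and function names are hypothetical placeholders, not taken from the paper.

```python
# Sketch of an LR-CRT action lookup: each action entry in the augmented
# LR table carries the set of grammatical categories reachable through it.
# Table contents here are illustrative placeholders.
AUGMENTED_TABLE = {
    # (state, lookahead) -> list of (action, reachable_categories)
    (0, "noun"): [(("shift", 3), {"NP", "S"})],
    (3, "verb"): [(("reduce", "NP -> noun"), {"S"})],
}

def applicable_actions(state, lookahead, goal_category):
    """Return only those actions that can still reach the goal category."""
    actions = AUGMENTED_TABLE.get((state, lookahead), [])
    return [action for action, reachable in actions
            if goal_category in reachable]

# A parser driving phone prediction would call this with the phrase
# category predicted by the higher-level grammar, pruning every shift
# or reduce action that cannot lead to that category.
print(applicable_actions(0, "noun", "NP"))  # [('shift', 3)]
```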
Junya KOGUCHI Shinnosuke TAKAMICHI Masanori MORISE Hiroshi SARUWATARI Shigeki SAGAYAMA
We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis. Each Gaussian function of a GMM fits a local resonance of the spectrum. The GMM retains the fine spectral envelope and achieves high controllability of its structure. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech. Moreover, a DNN-based TTS synthesis method using GMM-based approximation has not been formulated in spite of its excellent expressive ability. Therefore, we employ peak picking to initialize the iterative estimation of the GMM parameters in full-band speech analysis. We introduce not only the prediction error of the GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN, and we propose a multi-task learning method that minimizes these errors simultaneously. We also propose a post-filter based on variance scaling of the GMM to enhance synthetic speech. Experimental results indicated that 1) the initialization method of our framework outperformed the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improved the synthetic speech; and 3) our variance-scaling-based post-filter further improved the synthetic speech.
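As a rough illustration of the two training criteria, the sketch below approximates a spectral envelope as a sum of Gaussians over frequency and combines the GMM-parameter prediction error with the envelope reconstruction error into one multi-task objective; the exact parameterization and error weighting are assumptions, not the paper's formulas.

```python
import numpy as np

def gmm_envelope(weights, means, variances, freqs):
    """Approximate a spectral envelope as a mixture of Gaussians over
    frequency. The exact parameterization is an assumption here."""
    comps = weights[:, None] * np.exp(
        -0.5 * (freqs[None, :] - means[:, None]) ** 2 / variances[:, None])
    return comps.sum(axis=0)

def multitask_loss(pred, target, freqs, lam=1.0):
    """Parameter error plus envelope reconstruction error, minimized
    jointly; `lam` trades off the two terms (illustrative)."""
    w_p, m_p, v_p = pred
    w_t, m_t, v_t = target
    param_err = sum(np.mean((a - b) ** 2)
                    for a, b in [(w_p, w_t), (m_p, m_t), (v_p, v_t)])
    recon_err = np.mean((gmm_envelope(w_p, m_p, v_p, freqs)
                         - gmm_envelope(w_t, m_t, v_t, freqs)) ** 2)
    return param_err + lam * recon_err
```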
Akito NAGAI Shigeki SAGAYAMA Kenji KITA Hideaki KIKUCHI
This paper discusses three approaches for combining an efficient LR parser with phoneme-context-dependent HMMs and compares them through continuous speech recognition experiments. In continuous speech recognition, phoneme-context-dependent allophonic models are very helpful for enhancing recognition accuracy because they precisely represent allophonic variations caused by differences in phoneme context. With grammatical constraints based on a context-free grammar (CFG), a generalized LR parser is one of the most efficient parsing algorithms for speech recognition. Therefore, the combination of allophonic models and a generalized LR parser is a powerful scheme enabling accurate and efficient speech recognition. In this paper, three phoneme-context-dependent LR parsing algorithms are proposed that make it possible to drive allophonic HMMs: (1) an algorithm for predicting the phonemic context dynamically in the LR parser using a phoneme-context-independent LR table; (2) an algorithm for converting an LR table into a phoneme-context-dependent LR table; and (3) an algorithm for converting a CFG into a phoneme-context-dependent CFG. This paper also discusses the results of recognition experiments and compares the performance and efficiency of these three algorithms.
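A minimal sketch of the idea behind algorithm (1), dynamic phoneme-context prediction: the context-independent LR table proposes successor phonemes, and the allophonic HMM is chosen using the left context already fixed by the parse. All names are hypothetical and the fallback policy is an assumption.

```python
def predict_allophones(lr_predict, allophone_models, state, prev_phone):
    """Yield the allophonic HMMs to try next.

    `lr_predict(state)` -> set of phonemes the grammar allows next;
    `allophone_models` maps (left_context, phone) or phone to an HMM.
    """
    for phone in lr_predict(state):
        # Choose a left-context-dependent model; fall back to the
        # context-independent model when that allophone is not trained.
        key = (prev_phone, phone)
        yield allophone_models.get(key, allophone_models.get(phone))
```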
Hiroaki HATTORI Satoshi NAKAMURA Kiyohiro SHIKANO Shigeki SAGAYAMA
This paper proposes a new speaker adaptation method using a speaker weighting technique for multiple reference speaker training of a hidden Markov model (HMM). The proposed method considers the similarities between an input speaker and multiple reference speakers, and uses these similarities to control the influence of the reference speakers upon the HMM. Evaluation experiments were carried out on a /b, d, g, m, n, N/ phoneme recognition task using 8 speakers. Average recognition rates were 68.0%, 66.4%, and 65.6%, respectively, for three test sets with different speech styles. These rates were 4.8%, 8.8%, and 10.5% higher than those of the spectrum mapping method, and 1.6%, 6.7%, and 8.2% higher than those of multiple reference speaker training (the supplemented HMM). The experiments clarified the effectiveness of the proposed method.
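One plausible reading of the weighting scheme, sketched below: distances between the input speaker and the reference speakers are turned into normalized weights, which then blend the reference speakers' HMM parameters. The softmax-style weight form is an assumption, not the paper's formula.

```python
import numpy as np

def speaker_weights(distances, tau=1.0):
    """Turn input-to-reference-speaker distances into weights: closer
    reference speakers get more influence (softmax form is assumed)."""
    d = np.asarray(distances, dtype=float)
    w = np.exp(-d / tau)
    return w / w.sum()

def weighted_mean(reference_means, weights):
    """Blend each reference speaker's HMM output means by its weight."""
    return np.tensordot(weights, np.asarray(reference_means), axes=1)

mus = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([2.0, 2.0])]
print(weighted_mean(mus, speaker_weights([0.5, 2.0, 1.0])))
```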
Jun-ichi TAKAHASHI Shigeki SAGAYAMA
This paper describes how to effectively use discriminative training based on the Minimum Classification Error (MCE) criterion with a small amount of data in order to attain the highest level of recognition performance. The method combines MCE training with Vector-Field-Smoothed Bayesian learning, called MAP/VFS, which combines maximum a posteriori (MAP) estimation with Vector Field Smoothing (VFS). In the proposed method, MAP/VFS significantly enhances the robustness of MCE acoustic model training: MCE training is performed using the MAP/VFS-trained model as the initial model, with the same data used in both training stages. For speaker adaptation using several dozen training words, the proposed method has been experimentally proven to be very effective. For 50-word training data, recognition errors are reduced by 47%, compared with 16.5% when using MCE alone. Of this reduction, 39% is due to MAP, an additional 4% to VFS, and a further 4% to MCE, whose training capability is enhanced by MAP/VFS.
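For reference, a common smoothed form of the MCE loss for a single token is sketched below; the paper's exact misclassification measure may differ, so treat the log-sum-exp competitor term and the sigmoid smoothing as assumptions.

```python
import numpy as np

def mce_loss(scores, correct, gamma=1.0):
    """Smoothed MCE loss for one token: a sigmoid of the misclassification
    measure (competitor score minus correct-class score). In the proposed
    scheme, the initial model for this training would come from MAP/VFS
    estimation on the same adaptation data."""
    g = np.asarray(scores, dtype=float)
    competitors = np.delete(g, correct)
    # log-sum-exp style approximation of the best competing class score
    d = np.log(np.exp(competitors).mean()) - g[correct]
    return 1.0 / (1.0 + np.exp(-gamma * d))

print(mce_loss([2.0, 0.5, 0.1], correct=0))  # small loss: correct class wins
```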
Yasunaga MIYAZAWA Jun-ichi TAKAMI Shigeki SAGAYAMA Shoichi MATSUNAGA
This paper proposes an unsupervised speaker adaptation method using an "all-phoneme ergodic Hidden Markov Network" that combines allophonic (context-dependent phone) acoustic models with stochastic language constraints. A Hidden Markov Network (HMnet) for allophone modeling and allophonic bigram probabilities derived from a large text database are combined to yield a single large ergodic HMM that represents arbitrary speech signals in a particular language, so that the model parameters can be re-estimated from text-unknown speech samples with the Baum-Welch algorithm. When combined with the Vector Field Smoothing (VFS) technique, unsupervised speaker adaptation can be performed effectively. This method experimentally gave better performance than our previous unsupervised adaptation method, which used conventional phonetic HMMs and phoneme bigram probabilities, especially when the amount of training data was small.
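The construction can be pictured as follows: every allophone model's exit connects to every allophone model's entry with the allophonic bigram probability, yielding one large ergodic HMM. A minimal sketch of that transition structure; the renormalization over covered pairs is an added assumption.

```python
import numpy as np

def ergodic_transitions(units, bigram):
    """Build the model-to-model transition matrix of the ergodic HMM.
    `bigram[(a, b)]` is P(b | a) estimated from a large text database."""
    n = len(units)
    trans = np.zeros((n, n))
    for i, a in enumerate(units):
        for j, b in enumerate(units):
            trans[i, j] = bigram.get((a, b), 0.0)
        s = trans[i].sum()
        if s > 0:
            trans[i] /= s  # renormalize over the covered pairs
    return trans

# The resulting HMM accepts arbitrary speech, so Baum-Welch can
# re-estimate its parameters from text-unknown utterances.
print(ergodic_transitions(["a", "k"], {("a", "k"): 0.7, ("a", "a"): 0.3,
                                       ("k", "a"): 1.0}))
```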
Hiroaki HATTORI Shigeki SAGAYAMA
This paper describes a new supervised speaker adaptation method based on vector field smoothing, designed for small amounts of adaptation data. The method assumes that the correspondence of feature vectors between speakers can be viewed as a kind of smooth vector field, and introduces interpolation and smoothing of this correspondence into the adaptation process to achieve higher adaptation performance with small data sets. The proposed method was applied to discrete-HMM-based speech recognition and evaluated in Japanese phoneme and phrase recognition experiments. Using 10 words as the adaptation data, the proposed method produced almost the same results as the conventional codebook mapping method using 25 words. These experiments clearly confirmed the effectiveness of the proposed method.
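A minimal sketch of the smoothing step under these assumptions: each codeword's speaker-to-speaker transfer vector is replaced by a distance-weighted average over neighboring transfer vectors, making the correspondence behave like a smooth vector field. The Gaussian weighting is illustrative, not the paper's exact kernel; untrained codewords could be filled in by the same weighted interpolation.

```python
import numpy as np

def smooth_transfer_vectors(points, vectors, sigma=1.0):
    """Smooth per-codeword transfer vectors: `points` are codeword
    locations in feature space, `vectors` the estimated input-to-reference
    speaker correspondences at those points."""
    pts = np.asarray(points, dtype=float)
    vecs = np.asarray(vectors, dtype=float)
    diffs = pts[:, None, :] - pts[None, :, :]
    w = np.exp(-0.5 * (diffs ** 2).sum(-1) / sigma ** 2)  # neighbor weights
    return (w @ vecs) / w.sum(axis=1, keepdims=True)
```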
Tetsuo KOSAKA Shigeki SAGAYAMA
We discuss how to automatically determine the number of mixture components in continuous mixture density HMMs (CHMMs), whose use has been a notable trend in recent years. One of the major problems with a CHMM is how to determine its structure, that is, how many mixture components and states it has and its optimal topology. The number of mixture components has so far been determined heuristically. To solve this problem, we first investigate the influence of the number of mixture components on the model parameters and the output log-likelihood value. As a result, in contrast to the "mixture number uniformity" applied in conventional approaches to determine the number of mixture components, we propose the principle of "distribution size uniformity", and introduce an algorithm for automatically determining the number of mixture components. The performance of this algorithm is shown through recognition experiments involving all Japanese phonemes. Two types of experiments were carried out: one assumes that the number of mixture components for each state is the same within a phonetic model but may vary between states belonging to different phonemes; the other assumes that each state has a variable number of mixture components. Both experiments give better results than the conventional method.
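A toy sketch of "distribution size uniformity" as a greedy allocation: mixture components are repeatedly granted to whichever state currently has the largest per-component distribution size, rather than fixing a uniform mixture count per state. The size measure and the model of how a split shrinks per-component size are assumptions.

```python
import heapq

def allocate_mixtures(state_sizes, total_components):
    """Distribute a component budget so per-component distribution sizes
    become roughly uniform across states (greedy illustration)."""
    counts = {s: 1 for s in state_sizes}
    heap = [(-size, s) for s, size in state_sizes.items()]
    heapq.heapify(heap)
    for _ in range(total_components - len(counts)):
        neg, s = heapq.heappop(heap)   # state with largest per-component size
        counts[s] += 1
        # adding a component shrinks the per-component size proportionally
        heapq.heappush(heap, (neg * (counts[s] - 1) / counts[s], s))
    return counts

print(allocate_mixtures({"a1": 4.0, "a2": 1.0, "a3": 2.0}, 8))
```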
Kouichi YAMAGUCHI Harald SINGER Shoichi MATSUNAGA Shigeki SAGAYAMA
This paper describes a novel speaker-independent speech recognition method, called "speaker-consistent parsing", which is based on an intra-speaker correlation called the speaker-consistency principle. We focus on the fact that a sentence or a string of words is uttered by a single speaker even in a speaker-independent task. Thus, the proposed method searches through speaker variations in addition to the contents of utterances, and as a result of the recognition process an appropriate standard speaker is selected for speaker adaptation. This new method is experimentally compared with a conventional speaker-independent speech recognition method. Since the speaker-consistency principle best demonstrates its effect with a large number of training and test speakers, a small-scale experiment may not fully exploit it; nevertheless, even the results of our small-scale experiment show that the new method significantly outperforms the conventional one. In addition, this framework's speaker selection mechanism can drastically reduce the likelihood map computation.
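The principle can be sketched as decoding the whole sentence under each standard speaker's models and keeping the single best-scoring speaker, instead of letting the speaker change phrase by phrase. The interface below is hypothetical.

```python
def speaker_consistent_decode(phrases, speakers, score):
    """Pick one standard speaker that explains the whole utterance best.

    `score(phrase, speaker)` returns the log-likelihood of a phrase under
    that speaker's models; consistency means all phrases of a sentence
    share the same speaker hypothesis."""
    best = max(speakers, key=lambda s: sum(score(p, s) for p in phrases))
    return best, sum(score(p, best) for p in phrases)
```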
Ryosuke ISOTANI Shoichi MATSUNAGA Shigeki SAGAYAMA
This paper proposes a new stochastic language model for speech recognition based on function-word N-grams and content-word N-grams. The conventional word N-gram models are effective for speech recognition, but they represent only local constraints within a few successive words and lack the ability to capture global syntactic or semantic relationships between words. To represent more global constraints, the proposed language model gives the N-gram probabilities of word sequences, with attention given only to function words or to content words. The sequences of function words and of content words are expected to represent syntactic and semantic constraints, respectively. Probabilities of function-word bigrams and content-word bigrams were estimated from a 10,000-sentence text database, and analysis using an information-theoretic measure showed that the expected constraints were extracted appropriately. As an application of this model to speech recognition, a post-processor was constructed to select the optimum sentence candidate from a phrase lattice obtained by a phrase recognition system. The phrase candidate sequence with the highest total acoustic and linguistic score was sought by dynamic programming. The results of experiments carried out on the utterances of 12 speakers showed that the proposed method is more accurate than a CFG-based method, thus demonstrating its effectiveness in improving speech recognition performance.
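Estimating such class-restricted bigrams is straightforward to sketch: filter each sentence down to the words of one class (function or content), then count bigrams over the remaining subsequence. The toy function-word set below is purely illustrative; word classification is assumed to happen elsewhere.

```python
from collections import Counter

def class_bigrams(sentences, keep):
    """Count bigrams over the subsequence of words passing the `keep`
    predicate (e.g. only function words, or only content words),
    skipping the words in between."""
    counts = Counter()
    for words in sentences:
        sub = [w for w in words if keep(w)]
        counts.update(zip(sub, sub[1:]))
    return counts

# Toy usage: treat a small closed set as function words.
FUNC = {"wa", "ga", "o", "ni", "no"}
sents = [["kare", "wa", "hon", "o", "yomu"]]
print(class_bigrams(sents, FUNC.__contains__))          # {('wa', 'o'): 1}
print(class_bigrams(sents, lambda w: w not in FUNC))    # content-word bigrams
```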
Satoshi TAKAHASHI Sho-ichi MATSUNAGA Shigeki SAGAYAMA
This paper describes a new technique for isolated word recognition that uses both pitch and spectral information. In conventional methods, which use only spectral information, words with similar phoneme features tend to be misrecognized even if they are accented differently. Recognition accuracy can be improved by considering the pitch patterns of words, since many phonetically similar Japanese words are distinguished by their pitch patterns. In this technique, a pitch pattern template is produced by averaging the pitch patterns obtained from a set of words sharing the same accent pattern. A measure for word recognition is proposed that combines the phoneme likelihood with the pitch pattern distance, i.e., the distance between the pitch pattern of the input speech and the pitch pattern templates. Speaker-dependent word recognition experiments were carried out using 216 Japanese words uttered by five male and five female speakers. The proposed technique reduces the recognition error rate by 40% compared with the conventional method using only phoneme likelihood.
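A minimal sketch of the combined measure, assuming an additive trade-off between the phoneme log-likelihood and a mean-squared pitch pattern distance; the paper's exact combination weight and distance metric are not specified here.

```python
import numpy as np

def combined_score(phoneme_loglik, pitch, templates, alpha=1.0):
    """Combine spectral and prosodic evidence per candidate word: the
    phoneme log-likelihood minus a weighted distance between the input
    pitch pattern and each word's accent-pattern template (templates are
    averages of time-normalized pitch patterns of same-accent words)."""
    dists = np.array([np.mean((pitch - t) ** 2) for t in templates])
    return np.asarray(phoneme_loglik) - alpha * dists

# The recognized word is the candidate maximizing this combined measure.
```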
Hirokazu KAMEOKA Misa SATO Takuma ONO Nobutaka ONO Shigeki SAGAYAMA
This paper deals with the problem of underdetermined blind source separation (BSS) where the number of sources is unknown. We propose a BSS approach that simultaneously estimates the number of sources, separates the sources based on the sparseness of speech, estimates the direction of arrival of each source, and performs permutation alignment. We confirmed experimentally that reasonably good separation was obtained with the present method without specifying the number of sources.
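As a simplified illustration of sparseness-based separation for a two-microphone mixture, the sketch below clusters time-frequency points by inter-channel phase and returns per-point source labels for binary masking; unlike the proposed method, it fixes the number of sources in advance and omits the joint estimation of source number, DOA, and permutation alignment.

```python
import numpy as np

def sparseness_masks(X1, X2, n_sources):
    """Assign each time-frequency point of a two-channel STFT pair
    (complex arrays X1, X2) to one source via 1-D k-means on the
    inter-channel phase difference; exploits the sparseness of speech."""
    feat = np.angle(X2 / (X1 + 1e-12)).ravel()    # inter-channel phase
    centers = np.linspace(feat.min(), feat.max(), n_sources)
    for _ in range(20):                           # plain k-means iterations
        labels = np.argmin(np.abs(feat[:, None] - centers[None, :]), axis=1)
        centers = np.array([feat[labels == k].mean() if (labels == k).any()
                            else centers[k] for k in range(n_sources)])
    return labels.reshape(X1.shape)               # binary-mask source indices
```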
Kenji KITA Tsuyoshi MORIMOTO Kazumi OHKURA Shigeki SAGAYAMA Yaneo YANO
This paper describes Japanese spoken sentence recognition using hybrid language modeling, which combines the advantages of both syntactic and stochastic language models. As the baseline system, we adopted the HMM-LR speech recognition system, with which we have already achieved good performance on Japanese phrase recognition tasks. Several improvements were made to this system to handle continuously spoken sentences. The first is HMM training with continuous utterances as well as word utterances; in previous implementations, HMMs were trained with word utterances only, but continuous utterances are now included in the training data because coarticulation effects are much stronger in them. The second is the development of a sentential grammar for Japanese, created by combining separately developed inter- and intra-phrase CFG grammars. The third is the incorporation of stochastic linguistic knowledge, namely a stochastic CFG and a bigram model of production rules. The system was evaluated using continuously spoken sentences from a conference registration task of approximately 750 words, attaining a sentence accuracy of 83.9% in the speaker-dependent condition.
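The rule-level stochastic knowledge can be sketched as an additive log-probability score over a derivation's production rules and their bigrams; the probability tables here are assumed inputs, and how the score interacts with the LR search is left out.

```python
import math

def derivation_score(rules, p_rule, p_rule_bigram):
    """Score an LR derivation using stochastic-CFG rule probabilities
    plus a bigram model over successive production rules."""
    score = sum(math.log(p_rule[r]) for r in rules)
    score += sum(math.log(p_rule_bigram[(a, b)])
                 for a, b in zip(rules, rules[1:]))
    return score
```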
Sho-ichi MATSUNAGA Shigeru HOMMA Shigeki SAGAYAMA Sadaoki FURUI
This paper describes two Japanese continuous speech recognition systems (system-1 and system-2) based on phoneme-based HMMs and a two-level grammar approach. The two grammars are an intra-phrase transition network grammar for phrase recognition and an inter-phrase dependency grammar for sentence recognition. A joint score combining acoustic likelihood and linguistic certainty factors, derived from the phoneme-based HMMs and the dependency rules, is maximized to obtain the best sentence recognition results. To keep the amount of computation practical, system-1 is tuned for sentences uttered phrase by phrase and system-2 for whole-sentence utterances. In system-1, an efficient parsing algorithm is used for each grammar: a bi-directional network parser and a breadth-first dependency parser. With the phrase-network parser, input phrase utterances are parsed bi-directionally, both left-to-right and right-to-left, and the optimal Viterbi paths maximizing the accumulated phonetic likelihood are found. The dependency parser uses efficient breadth-first search and beam search algorithms. For system-2, we extended the dependency analysis algorithm to sentence utterances, using a technique for detecting the most likely multi-phrase candidates based on Viterbi phrase alignment. Where the perplexity of the phrase syntax is 40, system-1 and system-2 increase phrase recognition performance within sentences by approximately 6% and 14%, respectively, showing the effectiveness of semantic dependency analysis.
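A minimal sketch of the joint score, assuming an additive combination of acoustic log-likelihood and weighted dependency certainty factors; the weight and the additive form are assumptions, and the dependency pairs are taken as given by the parser.

```python
def joint_score(phrases, deps, acoustic_ll, dependency_cf, w=1.0):
    """Joint score maximized during sentence recognition.

    `phrases` is the hypothesized phrase sequence, `deps` lists
    (modifier, head) phrase pairs found by the dependency parser,
    `acoustic_ll(p)` is the HMM log-likelihood of a phrase, and
    `dependency_cf(m, h)` the certainty factor of a dependency rule."""
    return (sum(acoustic_ll(p) for p in phrases)
            + w * sum(dependency_cf(m, h) for m, h in deps))
```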
Fikret S. GURGEN Shigeki SAGAYAMA Sadaoki FURUI
This paper investigates the performance of the line spectrum pair (LSP) frequency parameter representation for speech recognition. Transitional parameters of LSP frequencies are defined using first-order regression coefficients. The transitional and instantaneous frequency parameters are linearly combined into a single feature vector used for recognition. The performance of this vector is compared with that of the cepstral coefficient (CC) representation using a minimum-distance classifier in speaker-independent isolated word recognition experiments. In the experiments, the transitional and instantaneous coefficients are also combined in the distance domain, and inverse-variance-weighted Euclidean measures are defined on the LSP frequencies to achieve Mel-scale-like warping, with the warped frequencies used in recognition. The single feature vector defined with transitional and instantaneous LSP frequencies is found to perform best among the measures used in the experiments.
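The transitional parameters admit a standard sketch: a least-squares slope of each LSP trajectory over a small window of frames, i.e., first-order regression coefficients, which can then be concatenated with the instantaneous vector. The window length is a free parameter, not the paper's setting.

```python
import numpy as np

def delta_coefficients(frames, K=2):
    """First-order regression (delta) coefficients of parameter
    trajectories: the least-squares slope over a +/-K frame window,
    computed per dimension with edge padding at the boundaries."""
    T, _ = frames.shape
    ks = np.arange(-K, K + 1)
    padded = np.pad(frames, ((K, K), (0, 0)), mode="edge")
    num = sum(k * padded[K + k : K + k + T] for k in ks)
    return num / np.sum(ks ** 2)

lsp = np.random.rand(10, 12)   # 10 frames of 12 LSP frequencies (toy data)
feature = np.hstack([lsp, delta_coefficients(lsp)])  # combined vector
```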