Takatoshi JITSUHIRO Satoshi NAKAMURA
We propose a new method for both automatically creating non-uniform, context-dependent HMM topologies and selecting the number of mixture components, based on the Variational Bayesian (VB) approach. Although the Maximum Likelihood (ML) criterion is generally used to create HMM topologies, it suffers from over-fitting. Recently, to avoid this problem, the VB approach has been applied to create acoustic models for speech recognition. We introduce the VB approach to the Successive State Splitting (SSS) algorithm, which can create both contextual and temporal variations for HMMs. Experimental results indicate that the proposed method can automatically create a more efficient model than the original method. We also evaluated a method that increases the number of mixture components by using the VB approach and considering temporal structures. The VB approach obtained almost the same performance as ML-based methods while using a smaller number of mixture components.
Takatoshi JITSUHIRO Tomoko MATSUI Satoshi NAKAMURA
We propose a new method that introduces the Minimum Description Length (MDL) criterion to the automatic generation of non-uniform, context-dependent HMM topologies. Phonetic decision tree clustering, which is widely used, is based on the Maximum Likelihood (ML) criterion and creates only contextual variations. However, the ML criterion requires control parameters, such as the total number of states, to be predetermined empirically for use as stop criteria. Information criteria have been applied to solve this problem for decision tree clustering. However, decision tree clustering cannot automatically create topologies with various state lengths. Therefore, we propose a method that applies the MDL criterion as the split and stop criteria of the Successive State Splitting (SSS) algorithm as a means of generating contextual and temporal variations. This proposed method, the MDL-SSS algorithm, can automatically create adequate topologies without such predetermined parameters. Experimental results for travel arrangement dialogs and lecture speech show that MDL-SSS can automatically stop splitting and obtain more appropriate HMM topologies than the original SSS algorithm.
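As a rough illustration of how an MDL-style criterion can serve as both a split and a stop criterion, the sketch below compares the description length before and after a candidate split and accepts the split only if the description length decreases. It uses the generic form of MDL (negative log-likelihood plus a parameter penalty); the function names, the `weight` factor, and the example numbers are assumptions for illustration and are not taken from the paper.

```python
import math


def description_length(log_likelihood, num_params, num_frames, weight=1.0):
    """Generic MDL score: data cost plus a penalty for model complexity.

    ASSUMPTION: this is the textbook two-part form, not necessarily the
    exact formulation used by MDL-SSS. `weight` is a hypothetical tuning
    factor on the penalty term.
    """
    return -log_likelihood + weight * 0.5 * num_params * math.log(num_frames)


def should_split(ll_before, ll_after, params_before, params_after,
                 num_frames, weight=1.0):
    """Accept a candidate state split only if it lowers the description length.

    When no candidate split satisfies this test, splitting stops, so no
    target number of states has to be fixed in advance.
    """
    dl_before = description_length(ll_before, params_before, num_frames, weight)
    dl_after = description_length(ll_after, params_after, num_frames, weight)
    return dl_after < dl_before


if __name__ == "__main__":
    # Illustrative numbers only: a large likelihood gain justifies the split,
    # a small gain does not outweigh the extra parameters.
    print(should_split(-1000.0, -900.0, 10, 20, num_frames=1000))
    print(should_split(-1000.0, -990.0, 10, 20, num_frames=1000))
```

The likelihood gain from a split is weighed against the cost of the added parameters, which is what lets the procedure terminate on its own.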
Shigeki MATSUDA Takatoshi JITSUHIRO Konstantin MARKOV Satoshi NAKAMURA
In this paper, we describe a parallel decoding-based ASR system developed at ATR that is robust to noise type, SNR, and speaking style. It is difficult to recognize speech affected by various factors, especially when an ASR system contains only a single acoustic model. One solution is to employ multiple acoustic models, one model for each condition. Even though the robustness of each acoustic model is limited, the whole ASR system can handle various conditions appropriately. Our system has two recognition sub-systems that use different features, MFCC and Differential MFCC (DMFCC). Each sub-system has several acoustic models that depend on SNR, speaker gender, and speaking style, and during recognition each acoustic model is adapted by fast noise adaptation. From each sub-system, one hypothesis is selected based on posterior probability. The final recognition result is obtained by combining the best hypotheses from the two sub-systems. On the AURORA-2J task, which is widely used for evaluating noise robustness, our system achieved higher recognition performance than a system that contains only a single model. Our system was also tested using normal and hyper-articulated speech contaminated by several background noises, and exhibited high robustness to noise and speaking styles.
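The selection step described above can be sketched as follows: each sub-system produces one hypothesis per acoustic model, and the hypothesis with the highest posterior probability is kept. The `Hypothesis` type, the function names, and the example values are hypothetical. The abstract does not say how the two best hypotheses are combined, so the `combine` function below is only a placeholder assumption that keeps the one with the higher posterior.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    model: str        # label of the acoustic model that produced it (hypothetical)
    text: str         # recognized word sequence
    posterior: float  # posterior probability assigned to the hypothesis


def select_best(hypotheses: List[Hypothesis]) -> Hypothesis:
    """Pick the hypothesis with the highest posterior within one sub-system."""
    return max(hypotheses, key=lambda h: h.posterior)


def combine(best_a: Hypothesis, best_b: Hypothesis) -> Hypothesis:
    """ASSUMPTION: placeholder combination that keeps the higher posterior.

    The actual combination method of the described system is not stated
    in the abstract.
    """
    return max((best_a, best_b), key=lambda h: h.posterior)


if __name__ == "__main__":
    # Made-up example values for illustration only.
    mfcc = [Hypothesis("mfcc-clean", "three five", 0.61),
            Hypothesis("mfcc-noisy", "three nine", 0.72)]
    dmfcc = [Hypothesis("dmfcc-clean", "three nine", 0.80),
             Hypothesis("dmfcc-noisy", "tree nine", 0.55)]
    print(combine(select_best(mfcc), select_best(dmfcc)).text)
```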
Takatoshi JITSUHIRO Tomoji TORIYAMA Kiyoshi KOGURE
We propose a noise suppression method based on multi-model compositions and multi-pass search. In real environments, input speech for speech recognition includes many kinds of noise signals. To obtain good recognition candidates, it is important to suppress many kinds of noise signals at once and to find the target speech. Before noise suppression, to find speech and noise label sequences, we introduce a multi-pass search with acoustic models that include many kinds of noise models and their compositions, their n-gram models, and their lexicon. Noise suppression is performed frame-synchronously using the multiple models selected by the recognized label sequences with time alignments. We evaluated this method using the E-Nightingale task, which contains voice memoranda spoken by nurses during actual work at hospitals. The proposed method obtained higher performance than the conventional method.
Takatoshi JITSUHIRO Hirofumi YAMAMOTO Setsuo YAMADA Genichiro KIKUI Yoshinori SAGISAKA
We propose new language models that represent phrasal structures by patterns extracted from parse trees. First, modified word trigram models are proposed. They are extracted from sentences analyzed by the preprocessing stage of the parser with knowledge. Since sentences are analyzed to create sub-trees of a few words, these trigram models can represent relations among a few neighboring words more strongly than conventional word trigram models. Second, word pattern models are used on top of these modified word trigram models. The word patterns are extracted from parse trees and can represent phrasal structures and much longer word dependencies than trigram models. Experimental results show that modified trigram models are more effective than traditional trigram models and that pattern models attain slight improvements over modified trigram models. Furthermore, additional experiments show that pattern models are more effective for long sentences.
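For reference, the conventional word trigram model that serves as the baseline here can be sketched as a maximum-likelihood estimate from counts. This is only the standard baseline, not the modified trigram or pattern models proposed in the paper; the function names and the sentence-boundary symbols `<s>` and `</s>` are assumptions, and no smoothing is applied.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    trigrams = Counter()
    contexts = Counter()
    for words in sentences:
        # Pad so that the first real word also has a two-word history.
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            trigrams[tuple(padded[i - 2:i + 1])] += 1
            contexts[tuple(padded[i - 2:i])] += 1
    return trigrams, contexts


def trigram_prob(trigrams, contexts, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); returns 0.0 for unseen contexts."""
    count = contexts[(w1, w2)]
    return trigrams[(w1, w2, w3)] / count if count else 0.0


if __name__ == "__main__":
    tri, ctx = train_trigram([["a", "b", "c"], ["a", "b", "d"]])
    print(trigram_prob(tri, ctx, "a", "b", "c"))
```

Because each prediction sees only the two preceding words, such a model cannot capture dependencies that span a longer phrase, which is the limitation the pattern models are meant to address.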