1-7hit |
Akira TAMAMORI Yoshihiko NANKAKU Keiichi TOKUDA
In this paper, a novel statistical model based on 2-D HMMs for image recognition is proposed. Recently, separable lattice 2-D HMMs (SL2D-HMMs) were proposed to model invariance to size and location deformation. However, their modeling accuracy is still insufficient because of the following two assumptions, which are inherited from 1-D HMMs: i) the stationary statistics within each state and ii) the conditional independent assumption of state output probabilities. To overcome these shortcomings in 1-D HMMs, trajectory HMMs were proposed and successfully applied to speech recognition and speech synthesis. This paper derives 2-D trajectory HMMs by reformulating the likelihood of SL2D-HMMs through the imposition of explicit relationships between static and dynamic features. The proposed model can efficiently capture dependencies between adjacent observations without increasing the number of model parameters. The effectiveness of the proposed model was evaluated in face recognition experiments on the XM2VTS database.
Taichi ASAMI Koji IWANO Sadaoki FURUI
We have previously proposed a noise-robust speaker verification method using fundamental frequency (F0) extracted using the Hough transform. The method also incorporates an automatic stream-weight and decision threshold estimation technique. It has been confirmed that the proposed method is effective for white noise at various SNR conditions. This paper evaluates the proposed method in more practical in-car and elevator-hall noise conditions. The paper first describes the noise-robust F0 extraction method and details of our robust speaker verification method using multi-stream HMMs for integrating the extracted F0 and cepstral features. Details of the automatic stream-weight and threshold estimation method for multi-stream speaker verification framework are also explained. This method simultaneously optimizes stream-weights and a decision threshold by combining the linear discriminant analysis (LDA) and the Adaboost technique. Experiments were conducted using Japanese connected digit speech contaminated by white, in-car, or elevator-hall noise at various SNRs. Experimental results show that the F0 features improve the verification performance in various noisy environments, and that our stream-weight and threshold optimization method effectively estimates control parameters so that FARs and FRRs are adjusted to achieve equal error rates (EERs) under various noisy conditions.
Since their inception almost fifty years ago, hidden Markov models (HMMs) have have become the predominant methodology for automatic speech recognition (ASR) systems--today, most state-of-the-art speech systems are HMM-based. There have been a number of ways to explain HMMs and to list their capabilities, each of these ways having both advantages and disadvantages. In an effort to better understand what HMMs can do, this tutorial article analyzes HMMs by exploring a definition of HMMs in terms of random variables and conditional independence assumptions. We prefer this definition as it allows us to reason more throughly about the capabilities of HMMs. In particular, it is possible to deduce that there are, in theory at least, no limitations to the class of probability distributions representable by HMMs. This paper concludes that, in search of a model to supersede the HMM (say for ASR), rather than trying to correct for HMM limitations in the general case, new models should be found based on their potential for better parsimony, computational requirements, and noise insensitivity.
Yohei ITAYA Heiga ZEN Yoshihiko NANKAKU Chiyomi MIYAJIMA Keiichi TOKUDA Tadashi KITAMURA
This paper investigates the effectiveness of the DAEM (Deterministic Annealing EM) algorithm in acoustic modeling for speaker and speech recognition. Although the EM algorithm has been widely used to approximate the ML estimates, it has the problem of initialization dependence. To relax this problem, the DAEM algorithm has been proposed and confirmed the effectiveness in artificial small tasks. In this paper, we applied the DAEM algorithm to practical speech recognition tasks: speaker recognition based on GMMs and continuous speech recognition based on HMMs. Experimental results show that the DAEM algorithm can improve the recognition performance as compared to the standard EM algorithm with conventional initialization algorithms, especially in the flat start training for continuous speech recognition.
Hiroyuki SUZUKI Heiga ZEN Yoshihiko NANKAKU Chiyomi MIYAJIMA Keiichi TOKUDA Tadashi KITAMURA
This paper describes continuous speech recognition incorporating the additional complement information, e.g., voice characteristics, speaking styles, linguistic information and noise environment, into HMM-based acoustic modeling. In speech recognition systems, context-dependent HMMs, i.e., triphone, and the tree-based context clustering have commonly been used. Several attempts to utilize not only phonetic contexts, but additional complement information based on context (factor) dependent HMMs have been made in recent years. However, when the additional factors for testing data are unobserved, methods for obtaining factor labels is required before decoding. In this paper, we propose a model integration technique based on general factor dependent HMMs for decoding. The integrated HMMs can be used by a conventional decoder as standard triphone HMMs with Gaussian mixture densities. Moreover, by using the results of context clustering, the proposed method can determine an optimal number of mixture components for each state dependently of the degree of influence from additional factors. Phoneme recognition experiments using voice characteristic labels show significant improvements with a small number of model parameters, and a 19.3% error reduction was obtained in noise environment experiments.
Jin-Song ZHANG Konstantin MARKOV Tomoko MATSUI Satoshi NAKAMURA
This paper presents a study on modeling inter-word pauses to improve the robustness of acoustic models for recognizing noisy conversational speech. When precise contextual modeling is used for pauses, the frequent appearances and varying acoustics of pauses in noisy conversational speech make it a problem to automatically generate an accurate phonetic transcription of the training data for developing robust acoustic models. This paper presents a proposal to exploit the reliable phonetic heuristics of pauses in speech to aid the detection of varying pauses. Based on it, a stepwise approach to optimize pause HMMs was applied to the data of the DARPA SPINE2 project, and more correct phonetic transcription was achieved. The cross-word triphone HMMs developed using this method got an absolute 9.2% word error reduction when compared to the conventional method with only context free modeling of pauses. For the same pause modeling method, the use of the optimized phonetic segmentation brought about an absolute 5.2% improvements.
Kenji KITA Tsuyoshi MORIMOTO Shigeki SAGAYAMA
In this paper, we propose an extended LR parsing algorithm, called LR parsing with a category reachability test (the LR-CRT algorithm). The LR-CRT algorithm enables a parser to efficiently recognize those sentences that belong to a specified grammatical category. The key point of the algorithm is to use an augmented LR parsing table in which each action entry contains a set of reachable categories. When executing a shift or reduce action, the parser checks whether the action can reach a given category using the augmented table. We apply the LR-CRT algorithm to improve a speech recognition system based on two-level LR parsing. This system uses two kinds of grammars, inter- and intra-phrase grammars, to recognize Japanese sentential speech. Two-level LR parsing guides the search of speech recognition through two-level symbol prediction, phrase category prediction and phone prediction, based on these grammars. The LR-CRT algorithm makes possible the efficient phone prediction based on the phrase category prediction. The system was evaluated using sentential speech data uttered phrase by phrase, and attained a word accuracy of 97.5% and a sentence accuracy of 91.2%