Hiroshi SEKI Kazumasa YAMAMOTO Tomoyosi AKIBA Seiichi NAKAGAWA
Deep neural networks (DNNs) have achieved significant success in the field of automatic speech recognition. One main advantage of DNNs is automatic feature extraction without human intervention. However, adaptation under limited available data remains a major challenge for DNN-based systems because of their enormous number of free parameters. In this paper, we propose a filterbank-incorporated DNN, which combines a filterbank layer representing the filter shapes and center frequencies with a DNN-based acoustic model. Whereas most systems feed pre-defined mel-scale filterbank features to DNNs, the filterbank layer and the following networks of the proposed model are trained jointly, exploiting the advantages of hierarchical feature extraction. Filters in the filterbank layer are parameterized to represent speaker characteristics while minimizing the number of parameters. The optimization of one type of parameter corresponds to Vocal Tract Length Normalization (VTLN), and another type corresponds to feature-space Maximum Likelihood Linear Regression (fMLLR) and feature-space Discriminative Linear Regression (fDLR). Since the filterbank layer consists of just a few parameters, it is advantageous for adaptation under limited available data. In experiments, filterbank-incorporated DNNs proved effective for speaker/gender adaptation with limited adaptation data. Experimental results on the CSJ task demonstrate that adaptation of the proposed model yields a 5.8% relative word error reduction with 10 utterances over the unadapted model.
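As an illustration of the idea, the following minimal sketch builds a trainable filterbank layer with Gaussian-shaped filters parameterized by center frequency and width; the Gaussian parameterization, dimensions, and initialization are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def gaussian_filterbank(n_fft_bins, centers, widths):
    """Build Gaussian filters over linear frequency bins.

    centers, widths: per-filter parameters (in bin units) that the
    network can learn jointly with the acoustic model. Warping all
    centers by one shared factor mimics VTLN; adjusting each filter
    independently resembles fMLLR/fDLR-style speaker transforms.
    """
    bins = np.arange(n_fft_bins)[None, :]           # (1, n_fft_bins)
    c = centers[:, None]                            # (n_filters, 1)
    w = widths[:, None]
    filters = np.exp(-0.5 * ((bins - c) / w) ** 2)  # (n_filters, n_fft_bins)
    return filters / filters.sum(axis=1, keepdims=True)

# Initialize centers roughly on a mel-like spacing, then apply to one frame.
n_filters, n_bins = 24, 257
centers = np.linspace(5, n_bins - 5, n_filters)     # initial center bins
widths = np.full(n_filters, 4.0)                    # initial filter widths
fb = gaussian_filterbank(n_bins, centers, widths)

power_spectrum = np.abs(np.random.randn(n_bins)) ** 2  # stand-in frame
features = np.log(fb @ power_spectrum + 1e-10)         # log filterbank output
```

Because adaptation only updates the handful of center/width parameters, the per-speaker footprint stays small, which is what makes such a layer attractive under limited data.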
Tsubasa OCHIAI Shigeki MATSUDA Hideyuki WATANABE Xugang LU Chiori HORI Hisashi KAWAI Shigeru KATAGIRI
Among various training concepts for speaker adaptation, Speaker Adaptive Training (SAT) has been successfully applied to standard Hidden Markov Model (HMM) speech recognizers, whose states are associated with Gaussian Mixture Models (GMMs). On the other hand, focusing on the high discriminative power of Deep Neural Networks (DNNs), a new type of speech recognizer structure, which combines DNNs and HMMs, has been vigorously investigated in the speaker adaptation research field. Along these two lines, it is natural to conceive of further improving a DNN-HMM recognizer by employing the training concept of SAT. In this paper, we propose a novel speaker adaptation scheme that applies SAT to a DNN-HMM recognizer. Our SAT scheme allocates a Speaker Dependent (SD) module to one of the intermediate layers of the DNN, treats its remaining layers as a Speaker Independent (SI) module, and jointly trains the SD and SI modules while switching the SD module speaker by speaker. We implement the scheme using a DNN-HMM recognizer whose DNN has seven layers, and evaluate its utility on TED Talks corpus data. Our experimental results show that in the supervised adaptation scenario, our Speaker-Adapted (SA) SAT-based recognizer reduces the word error rate of the baseline SI recognizer and the lowest word error rate of the SA SI recognizer by 8.4% and 0.7%, respectively, and by 6.4% and 0.6% in the unsupervised adaptation scenario. The error reductions gained by our SA SAT-based recognizers proved significant by statistical testing. The results also show that our SAT-based adaptation outperforms its counterpart SI-based adaptation regardless of the SD module layer selection, and that the inner layers of the DNN appear more suitable for SD module allocation than the outer layers.
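The SD/SI split can be pictured with the following toy sketch in plain numpy; the layer sizes, ReLU nonlinearity, and two-speaker setup are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Speaker-Independent (SI) layers, shared across all speakers.
W_in  = rng.standard_normal((256, 40)) * 0.01    # lower SI layer
W_out = rng.standard_normal((1000, 256)) * 0.01  # upper SI layer

# One Speaker-Dependent (SD) module per training speaker, allocated
# at an intermediate layer and switched speaker by speaker during SAT.
sd_modules = {spk: rng.standard_normal((256, 256)) * 0.01
              for spk in ["spk_a", "spk_b"]}

def forward(x, speaker):
    h = relu(W_in @ x)                 # SI module (lower layers)
    h = relu(sd_modules[speaker] @ h)  # SD module for this speaker
    return W_out @ h                   # SI module (upper layers)

# During SAT, each utterance updates the shared SI weights and only
# the SD module belonging to the utterance's speaker.
logits = forward(rng.standard_normal(40), "spk_a")
```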
We propose a speaker adaptation method based on probabilistic principal component analysis (PPCA) of acoustic models. We define a training matrix represented as a two-way array and decompose the training models by PPCA to construct bases. In the two-way array representation, each training model is represented as a matrix, and the columns of each training matrix are treated as training vectors. We formulate the adaptation equation in the maximum a posteriori (MAP) framework using the bases and the prior.
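A minimal sketch of the basis construction, using the closed-form maximum-likelihood PPCA solution on vectorized training models; the matrix layout and dimensions are assumptions for illustration.

```python
import numpy as np

def ppca_bases(X, q):
    """Closed-form PPCA of X, an (n_models, dim) matrix whose rows are
    vectorized training models. Returns the mean, q bases, and the
    estimated noise variance."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s ** 2 / X.shape[0]       # eigenvalues of the sample covariance
    sigma2 = eigvals[q:].mean()         # ML estimate of the noise variance
    W = Vt[:q].T * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return mu, W, sigma2

# 50 training models, each flattened to 200 parameters; keep 10 bases.
X = np.random.randn(50, 200)
mu, W, sigma2 = ppca_bases(X, q=10)
# MAP adaptation would then estimate weights over W from adaptation
# data, with the PPCA model supplying the prior.
```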
We present the adaptation of the acoustic models of hidden Markov models (HMMs) to the target speaker and noise environment using bilinear models. Acoustic models trained from various speakers and noise conditions are decomposed to build the bases that capture the interaction between the two factors. The model for the target speaker and noise is represented as a product of bases and two weight vectors. In experiments using the AURORA4 corpus, the bilinear model outperforms the linear model.
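The bilinear combination can be sketched as a core tensor of bases contracted with a speaker weight vector and a noise weight vector; the dimensions below are illustrative.

```python
import numpy as np

d, I, J = 200, 5, 4              # model dim, speaker bases, noise bases
B = np.random.randn(I, J, d)     # bases capturing speaker-noise interaction
a = np.random.randn(I)           # speaker weight vector (estimated from data)
b = np.random.randn(J)           # noise weight vector (estimated from data)

# Target-condition model = bases contracted with the two weight vectors.
model = np.einsum("i,j,ijd->d", a, b, B)
```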
Yaohui QI Fuping PAN Fengpei GE Qingwei ZHAO Yonghong YAN
A smoothing method for minimum phone error linear regression (MPELR) is proposed in this paper. We show that the objective function for minimum phone error (MPE) can be combined with a prior mean distribution. When the prior mean distribution is based on maximum likelihood (ML) estimates, the proposed method is identical to the previous smoothing technique for MPELR. Instead of ML estimates, the maximum a posteriori (MAP) parameter estimate is used to define the mode of the prior mean distribution to improve the performance of MPELR. Experiments on a large-vocabulary speech recognition task show that the proposed method obtains an 8.4% relative reduction in word error rate when the amount of data is limited, while retaining the same asymptotic performance as conventional MPELR. Compared with discriminative maximum a posteriori linear regression (DMAPLR), the proposed method shows improvement except in the case of limited adaptation data for supervised adaptation.
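The flavor of the smoothing can be sketched as a count-dependent interpolation between discriminative statistics and a prior (ML or MAP) mean, in the spirit of I-smoothing; the scalar form and symbols below are illustrative, since the actual MPELR update operates on regression matrices.

```python
import numpy as np

def smoothed_mean(num_stat, den_stat, gamma, prior_mean, tau):
    """MAP-style smoothing of a discriminative mean estimate.

    num_stat, den_stat: MPE numerator/denominator first-order stats,
    gamma: discriminative occupancy, prior_mean: ML or MAP estimate,
    tau: prior weight; with little data the estimate backs off to the
    prior, while with much data the discriminative stats dominate.
    """
    return (num_stat - den_stat + tau * prior_mean) / (gamma + tau)

mu = smoothed_mean(num_stat=np.array([4.1]), den_stat=np.array([1.2]),
                   gamma=2.5, prior_mean=np.array([1.0]), tau=10.0)
```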
Yongwon JEONG Sangjun LIM Young Kuk KIM Hyung Soon KIM
We present an acoustic model adaptation method where the transformation matrix for a new speaker is given by the product of bases and a weight matrix. The bases are built from the parallel factor analysis 2 (PARAFAC2) of training speakers' transformation matrices. We perform continuous speech recognition experiments using the WSJ0 corpus.
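The adaptation step can be sketched as combining basis matrices with speaker-specific weights; the random bases below are stand-ins for the PARAFAC2 decomposition of the training speakers' transforms, and the weight shape is simplified to a vector.

```python
import numpy as np

K, rows, cols = 8, 40, 41                # an MLLR-style transform is (d, d+1)
bases = np.random.randn(K, rows, cols)   # stand-in for PARAFAC2 bases
weights = np.random.randn(K)             # estimated from the new speaker's data

# Transformation matrix for the new speaker: weighted sum of bases.
W_new = np.einsum("k,kij->ij", weights, bases)
```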
I propose an acoustic model adaptation method using bases constructed through the sparse principal component analysis (SPCA) of acoustic models trained in a clean environment. I perform experiments on adaptation to a new speaker and noise. The SPCA-based method outperforms the PCA-based method in the presence of babble noise.
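A sketch of the basis construction using scikit-learn's SparsePCA as a stand-in for the paper's SPCA formulation; the dimensions and the weighted-combination adaptation step are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

# Vectorized acoustic models trained in a clean environment.
X = np.random.randn(50, 200)             # (n_models, n_params)

# Sparse bases: each basis touches only a few model parameters,
# which is the point of SPCA over plain PCA here.
spca = SparsePCA(n_components=10, alpha=1.0, random_state=0)
spca.fit(X)
bases = spca.components_                 # (10, 200)

# Adapted model: mean model plus a weighted combination of bases,
# with weights estimated from the new speaker/noise adaptation data.
weights = np.random.randn(10)
adapted = X.mean(axis=0) + weights @ bases
```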
Hiroko MURAKAMI Koichi SHINODA Sadaoki FURUI
We propose an active learning framework for speech recognition that reduces the amount of data required for acoustic modeling. This framework consists of two steps. We first obtain a phone-error distribution using an acoustic model estimated from transcribed speech data. Then, from a text corpus we select a sentence whose phone-occurrence distribution is close to the phone-error distribution and collect its speech data. We repeat this process to increase the amount of transcribed speech data. We applied this framework to speaker adaptation and acoustic model training. Our evaluation results showed that it significantly reduced the amount of transcribed data while maintaining the same level of accuracy.
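A sketch of the selection step: from a text corpus, pick the sentence whose phone-occurrence distribution is closest to the current phone-error distribution. KL divergence is used here as one plausible closeness measure; the paper's exact criterion may differ.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def select_sentence(phone_error_dist, sentence_phone_dists):
    """Return the index of the sentence whose phone-occurrence
    distribution best matches the phone-error distribution."""
    divs = [kl(phone_error_dist, d) for d in sentence_phone_dists]
    return int(np.argmin(divs))

n_phones = 40
error_dist = np.random.rand(n_phones)          # from recognition errors
corpus = [np.random.rand(n_phones) for _ in range(1000)]
best = select_sentence(error_dist, corpus)     # record this sentence's speech
```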
This study develops a fuzzy logic control mechanism for eigenspace-based MLLR speaker adaptation. Specifically, this mechanism can determine hidden Markov model parameters that enhance overall recognition performance under both ordinary and adverse conditions in the training and operating stages. The proposed mechanism regulates the influence of eigenspace-based MLLR adaptation when training data from a new speaker is insufficient. It accounts for the amount of available adaptation data in smoothing the transformation matrix parameters, and thus ensures the robustness of eigenspace-based MLLR adaptation against data scarcity. The proposed adaptive learning mechanism is computationally inexpensive. Experimental results show that eigenspace-based MLLR adaptation with fuzzy control outperforms conventional eigenspace-based MLLR, especially when the adaptation data acquired from a new speaker is insufficient.
Dean LUO Yu QIAO Nobuaki MINEMATSU Keikichi HIROSE
This study focuses on speaker adaptation techniques for Computer-Assisted Language Learning (CALL). We first investigate the effects and problems of Maximum Likelihood Linear Regression (MLLR) speaker adaptation when used in pronunciation evaluation. Automatic scoring and error detection experiments are conducted on two publicly available databases of Japanese learners' English pronunciation. As expected, over-adaptation causes misjudgment of pronunciation accuracy. Following this analysis, we propose a novel method, Regularized Maximum Likelihood Linear Regression (Regularized-MLLR) adaptation, to counter the adverse effects of MLLR adaptation. This method uses a group of teachers' data to regularize learners' transformation matrices so that erroneous pronunciations are not mistakenly transformed into correct ones. We implement this idea in two ways: one uses the average of the teachers' transformation matrices as a constraint on MLLR, and the other uses linear combinations of the teachers' matrices to represent learners' transformations. Experimental results show that the proposed methods make better use of MLLR adaptation and avoid over-adaptation.
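Both variants can be sketched directly on transformation matrices; the global-transform setting, the interpolation weight, and the combination weights below are illustrative assumptions.

```python
import numpy as np

teachers = [np.random.randn(40, 41) for _ in range(10)]  # teachers' MLLR matrices
W_learner = np.random.randn(40, 41)                      # learner's raw MLLR estimate

# Variant 1: constrain the learner's transform toward the teachers' average.
W_avg = np.mean(teachers, axis=0)
lam = 0.5                                 # regularization strength (tuned)
W_reg = (1 - lam) * W_learner + lam * W_avg

# Variant 2: represent the learner's transform as a linear combination
# of the teachers' matrices, with weights estimated from adaptation data.
c = np.random.dirichlet(np.ones(len(teachers)))
W_comb = sum(ci * Wi for ci, Wi in zip(c, teachers))
```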
Tetsuo KOSAKA Yuui TAKEDA Takashi ITO Masaharu KATO Masaki KOHDA
In this paper, we propose a new speaker-class modeling method and its adaptation for LVCSR systems and evaluate it on the Corpus of Spontaneous Japanese (CSJ). In this method, speakers close to each evaluation speaker are selected from the training speakers, and the acoustic models are trained using their utterances. One of the major issues with speaker-class models is determining the selection range of speakers. To solve this problem, several models covering a variety of speaker ranges are prepared for each evaluation speaker in advance, and the most appropriate model is selected on a likelihood basis in the recognition step. In addition, we improve recognition performance using unsupervised speaker adaptation with the speaker-class models. In the recognition experiments, a significant improvement was obtained by the proposed speaker adaptation based on speaker-class models compared with the conventional adaptation method.
Seong-Jun HAHM Yuichi OHKAWA Masashi ITO Motoyuki SUZUKI Akinori ITO Shozo MAKINO
We propose an improved reference speaker weighting (RSW) and speaker cluster weighting (SCW) approach that uses an aspect model. The idea is that the adapted model is a linear combination of a few latent reference models obtained from a set of reference speakers. The aspect model has latent-space characteristics that differ from the orthogonal basis vectors of eigenvoice: it is a "mixture-of-mixtures" model. We first calculate a small number of latent reference models as mixtures of the reference speakers' model distributions, and then mix the latent reference models to obtain the adapted distribution. The mixture weights are calculated with the expectation-maximization (EM) algorithm and used to interpolate the mean parameters of the distributions. Both training and adaptation are performed by maximizing the likelihood of the training and adaptation data, respectively. We conduct a continuous speech recognition experiment using a Korean database (KAIST-TRADE) and compare the results to those of conventional MAP, MLLR, RSW, eigenvoice, and SCW. An absolute word accuracy improvement of 2.06 points was achieved using the proposed method, even with only 0.3 s of adaptation data.
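The mixture-of-mixtures structure can be sketched on mean vectors alone; the dimensions and the random weights standing in for EM estimates are illustrative assumptions.

```python
import numpy as np

S, T, d = 20, 4, 39                  # reference speakers, latent models, dim
spk_means = np.random.randn(S, d)    # one mean per reference speaker's model

# Latent reference models: mixtures of the reference speakers' models.
p_spk_given_latent = np.random.dirichlet(np.ones(S), size=T)   # (T, S)
latent_means = p_spk_given_latent @ spk_means                  # (T, d)

# Adaptation: mix the few latent models with weights that EM would
# estimate from the adaptation data, interpolating mean parameters.
latent_weights = np.random.dirichlet(np.ones(T))
mu_adapted = latent_weights @ latent_means
```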
Yusuke IJIMA Takashi NOSE Makoto TACHIBANA Takao KOBAYASHI
In this paper, we propose a rapid model adaptation technique for emotional speech recognition which enables us to extract paralinguistic information as well as linguistic information contained in speech signals. This technique is based on style estimation and style adaptation using a multiple-regression HMM (MRHMM). In the MRHMM, the mean parameters of the output probability density function are controlled by a low-dimensional parameter vector, called a style vector, which corresponds to a set of the explanatory variables of the multiple regression. The recognition process consists of two stages. In the first stage, the style vector that represents the emotional expression category and the intensity of its expressiveness for the input speech is estimated on a sentence-by-sentence basis. Next, the acoustic models are adapted using the estimated style vector, and then standard HMM-based speech recognition is performed in the second stage. We assess the performance of the proposed technique in the recognition of simulated emotional speech uttered by both professional narrators and non-professional speakers.
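The multiple-regression control of mean parameters can be sketched as an affine function of the style vector; the dimensions below are illustrative.

```python
import numpy as np

d, L = 39, 3                      # feature dim, style-vector dim
H = np.random.randn(d, L + 1)     # regression matrix for one distribution

def mrhmm_mean(H, style):
    """Mean of an output pdf as multiple regression on the style vector
    (emotion category and intensity of expressiveness)."""
    xi = np.concatenate(([1.0], style))   # augmented style vector [1; s]
    return H @ xi

# Stage 1 estimates the style vector per sentence; stage 2 plugs it in
# to adapt the acoustic model means before standard HMM decoding.
style = np.array([0.8, -0.2, 0.1])
mu = mrhmm_mean(H, style)
```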
Junichi YAMAGISHI Takao KOBAYASHI
In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. For simultaneous adaptation of spectrum, F0 and phone duration within the HMM framework, we need to transform not only the state output distributions corresponding to spectrum and F0 but also the duration distributions corresponding to phone duration. However, it is not straightforward to adapt the state duration because the original HMM does not have explicit duration distributions. Therefore, we utilize the framework of the hidden semi-Markov model (HSMM), which is an HMM having explicit state duration distributions, and we apply an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system, and show their effectiveness from the results of subjective and objective evaluation tests.
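The simultaneous transformation can be sketched as paired affine transforms, one for the state output mean and one for the state duration mean; the scalar duration transform and dimensions are illustrative assumptions.

```python
import numpy as np

# MLLR-style affine transforms estimated from adaptation data.
A_out, b_out = np.random.randn(39, 39), np.random.randn(39)  # output transform
a_dur, b_dur = 1.1, 0.5                                      # duration transform

mu_out = np.random.randn(39)      # average-voice state output mean
mu_dur = 8.0                      # average-voice state duration mean (frames)

# HSMM adaptation transforms both distributions at once.
mu_out_adapted = A_out @ mu_out + b_out
mu_dur_adapted = a_dur * mu_dur + b_dur
```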
Randy GOMEZ Akinobu LEE Tomoki TODA Hiroshi SARUWATARI Kiyohiro SHIKANO
This paper describes a multi-template unsupervised speaker adaptation method based on HMM-Sufficient Statistics that improves adaptation performance while keeping adaptation time within a few seconds using just one arbitrary utterance. The adaptation scheme is composed of two processes. The first is performed offline and involves training multiple class-dependent acoustic models and creating speakers' HMM-Sufficient Statistics based on gender and age. The second is performed online, where adaptation begins from a single utterance of a test speaker. From this utterance, the system classifies the speaker's class and selects the N-best neighbor speakers closest to the utterance using Gaussian Mixture Models (GMMs). The classified class's template model is then adopted as a base model, from which the adapted model is rapidly constructed using the N-best neighbor speakers' HMM-Sufficient Statistics. Experiments are performed in noisy conditions with office, crowd, booth, and car noise at 20 dB, 15 dB, and 10 dB SNR. The proposed multi-template method achieved an 89.5% word accuracy rate compared with 88.1% for the conventional single-template method, while the baseline recognition rate without adaptation is 86.4%. Moreover, comparisons with Vocal Tract Length Normalization (VTLN) and supervised Maximum Likelihood Linear Regression (MLLR) are also reported.
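The online construction step can be sketched for a single Gaussian mean, pooling precomputed statistics over the selected neighbors; the statistic layout below is an illustrative assumption.

```python
import numpy as np

def adapt_mean(neighbor_stats):
    """Adapted Gaussian mean from pooled HMM-Sufficient Statistics.

    neighbor_stats: list of (occupancy, first_order_sum) pairs, one per
    N-best neighbor speaker, precomputed offline per training speaker.
    """
    gamma_total = sum(g for g, _ in neighbor_stats)
    sum_total = sum(s for _, s in neighbor_stats)
    return sum_total / gamma_total

# Neighbors are chosen online from one utterance via GMM likelihoods.
stats = [(12.0, np.array([3.0, 1.5])), (8.0, np.array([2.2, 0.9]))]
mu_adapted = adapt_mean(stats)
```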
Appropriate language modeling is one of the major issues in automatic transcription of spontaneous speech. We propose an adaptation method for statistical language models based on both topic and speaker characteristics. This approach is applied to automatic transcription of meetings and panel discussions, in which multiple participants speak on a given topic in their own speaking styles. The baseline language model is a mixture of two models trained with different corpora covering various topics and speakers, respectively. Probabilistic latent semantic analysis (PLSA) is then performed on the same respective corpora and the initial ASR result to provide two sets of unigram probabilities conditioned on the input speech, with regard to topics and speaker characteristics, respectively. Finally, the baseline model is adapted by scaling N-gram probabilities with these unigram probabilities. For speaker adaptation, we make use of a portion of the Corpus of Spontaneous Japanese (CSJ) in which a large number of speakers gave talks on given topics. Experimental evaluation with real discussions showed that both topic and speaker adaptation reduced test-set perplexity, with an average total reduction rate of 8.5%. Furthermore, the proposed adaptation method also improved word accuracy.
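The final scaling step can be sketched as unigram rescaling of the baseline N-gram probabilities followed by renormalization; the exponent beta is an assumed tuning knob, and the tiny vocabulary is illustrative.

```python
import numpy as np

def adapt_ngram(p_base, p_adapt_unigram, p_base_unigram, beta=0.7):
    """Scale baseline P(w|h) over the vocabulary for one history h.

    p_adapt_unigram: PLSA unigram conditioned on the input speech
    (topic- or speaker-based); p_base_unigram: baseline unigram.
    """
    scaled = p_base * (p_adapt_unigram / p_base_unigram) ** beta
    return scaled / scaled.sum()          # renormalize over the vocabulary

V = 5
p_base = np.random.dirichlet(np.ones(V))     # baseline P(w|h)
p_topic = np.random.dirichlet(np.ones(V))    # PLSA topic unigram
p_uni = np.random.dirichlet(np.ones(V))      # baseline unigram
p_adapted = adapt_ngram(p_base, p_topic, p_uni)
```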
Seiichi NAKAGAWA Tomohiro WATANABE Hiromitsu NISHIZAKI Takehito UTSURO
This paper describes an accurate unsupervised speaker adaptation method for lecture-style spontaneous speech recognition using multiple LVCSR systems. In an unsupervised speaker adaptation framework, the improvement in recognition performance obtained by adapting acoustic models depends strongly on the accuracy of labels such as phonemes and syllables. Therefore, extracting adaptation data guided by a confidence measure is effective for unsupervised adaptation. In this paper, we identified high-confidence portions based on the agreement between two LVCSR systems, adapted the acoustic models using those portions with their highly accurate labels, and thereby improved recognition accuracy. We applied our method to the Corpus of Spontaneous Japanese (CSJ), and it improved the recognition rate by about 2.1% in comparison with a traditional method.
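The agreement-based selection can be sketched at the word level; a real system would first align the two hypotheses with dynamic programming, so the positional comparison below is a simplifying assumption.

```python
def agreed_portions(hyp_a, hyp_b):
    """Indices where two LVCSR hypotheses agree; these portions carry
    high-confidence labels for unsupervised acoustic model adaptation."""
    return [i for i, (wa, wb) in enumerate(zip(hyp_a, hyp_b)) if wa == wb]

hyp1 = "the lecture covers speaker adaptation methods".split()
hyp2 = "a lecture covers speaker adaptation method".split()
selected = agreed_portions(hyp1, hyp2)   # adapt only on agreed words
```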
We present a speaker adaptation method that makes it possible to determine articulatory parameters from an unknown speaker's speech spectrum using an HMM (Hidden Markov Model)-based speech production model. The model consists of HMMs of articulatory parameters for each phoneme and an articulatory-to-acoustic mapping that transforms the articulatory parameters into a speech spectrum for each HMM state. The model is statistically constructed from actual articulatory-acoustic data. In the adaptation method, geometric differences in the vocal tract as well as the articulatory behavior of the reference model are statistically adjusted to an unknown speaker. First, the articulatory parameters are estimated from the unknown speaker's speech spectrum using the reference model. Second, the articulatory-to-acoustic mapping is adjusted by maximizing the output probability of the acoustic parameters given the estimated articulatory parameters of the unknown speaker. With the adaptation method, the RMS error between the estimated articulatory parameters and the observed ones is 1.65 mm, an improvement of 56.1% over the speaker-independent model.
Junichi YAMAGISHI Masatsune TAMURA Takashi MASUKO Keiichi TOKUDA Takao KOBAYASHI
This paper describes a new training method for the average voice model used in speech synthesis, in which an arbitrary speaker's voice is generated through speaker adaptation. When the amount of training data is limited, the distributions of the average voice model often exhibit bias depending on speaker and/or gender, which degrades the quality of synthetic speech. In the proposed method, to reduce the influence of speaker dependence, we incorporate a context clustering technique called shared decision tree context clustering, together with speaker adaptive training, into the training procedure of the average voice model. The results of subjective tests show that the average voice model trained with the proposed method generates more natural-sounding speech than the conventional average voice model. Moreover, the voice characteristics and prosodic features of synthetic speech generated from the adapted model are closer to those of the target speaker than with the conventional method.
This paper evaluates an online incremental speaker adaptation method for co-channel conversation involving multiple speakers, under the assumption that the speaker is unknown and changes frequently. After speaker clustering based on Vector Quantization (VQ) distortion is performed for every utterance, the acoustic models for each cluster are adapted by Maximum Likelihood Linear Regression (MLLR) or Maximum A Posteriori (MAP) estimation, improving continuous speech recognition performance. To demonstrate the effectiveness of the speaker clustering method for improving continuous speech recognition, experiments with supervised and unsupervised cluster adaptation were conducted. Finally, evaluation experiments on separately prepared test data were performed for continuous syllable recognition and large-vocabulary continuous speech recognition (LVCSR). The experimental results strongly support the effectiveness of the speaker adaptation and clustering methods presented in this paper.
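The clustering step can be sketched as a minimum-distortion assignment against per-cluster VQ codebooks; the codebook sizes and the new-cluster threshold are illustrative assumptions.

```python
import numpy as np

def vq_distortion(frames, codebook):
    """Mean distance from each frame to its nearest codeword."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def assign_cluster(frames, codebooks, new_cluster_threshold=5.0):
    """Pick the minimum-distortion cluster for an utterance, or open a
    new cluster when even the best match is poor (new speaker)."""
    dists = [vq_distortion(frames, cb) for cb in codebooks]
    best = int(np.argmin(dists))
    return best if dists[best] < new_cluster_threshold else len(codebooks)

codebooks = [np.random.randn(64, 13) for _ in range(3)]  # one per cluster
utt = np.random.randn(200, 13)                           # MFCC frames
cluster = assign_cluster(utt, codebooks)  # then MLLR/MAP-adapt that cluster's AM
```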