Statistical speech recognition using continuous-density hidden Markov models (CDHMMs) has yielded many practical applications. However, in general, mismatches between the training data and input data significantly degrade recognition accuracy. Various acoustic model adaptation techniques using a few input utterances have been employed to overcome this problem. In this article, we survey these adaptation techniques, including maximum a posteriori (MAP) estimation, maximum likelihood linear regression (MLLR), and eigenvoice. We also present a schematic view called the adaptation pyramid to illustrate how these methods relate to each other.
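As a rough illustration of the MAP estimation mentioned above, the following sketch adapts a single Gaussian mean toward a few adaptation frames. It assumes scalar features, a single Gaussian with every frame assigned to it, and a hypothetical prior weight `tau`; the function name and default value are illustrative and are not taken from the article.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean (scalar case, illustrative only).

    prior_mean: mean of the speaker-independent model (the prior).
    frames:     adaptation observations assigned to this Gaussian.
    tau:        assumed prior weight; larger values trust the prior more.
    """
    n = len(frames)
    if n == 0:
        # No adaptation data: the estimate falls back to the prior mean.
        return prior_mean
    sample_mean = sum(frames) / n
    # Interpolate between the prior mean and the sample mean.
    return (tau * prior_mean + n * sample_mean) / (tau + n)
```

With few frames the estimate stays close to the prior mean, and as the number of frames grows it approaches the sample mean, which is the behaviour that makes MAP estimation usable with small amounts of adaptation data.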
Nguyen Huu BACH Koichi SHINODA Sadaoki FURUI
In this paper, we propose a robust statistical framework for extracting scenes from a baseball broadcast video. We apply multi-stream hidden Markov models (HMMs) to control the weights among different features. To achieve high robustness against new scenes, we used a common simple structure for all the HMMs. In addition, scene segmentation and unsupervised adaptation were applied to achieve greater robustness against differences in environmental conditions among games. The F-measure of scene-extraction experiments for eight types of scene from 4.5 hours of digest data was 77.4% and increased to 78.7% when scene segmentation was applied. Furthermore, the unsupervised adaptation method improved precision by 2.7 points to 81.4%. These results confirm the effectiveness of our framework.
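The role of stream weights in a multi-stream HMM can be sketched as follows. This is the commonly used formulation in which the state emission log-likelihood is a weighted sum of per-stream log-likelihoods; the one-dimensional Gaussian streams and the function names are assumptions for illustration, not the configuration used in the paper.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian (illustrative per-stream model)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def multistream_loglik(stream_logliks, weights):
    """Weighted sum of per-stream log-likelihoods for one HMM state.

    stream_logliks: log b_s(o_s) for each feature stream s.
    weights:        stream weights w_s controlling each stream's influence.
    """
    return sum(w * ll for w, ll in zip(weights, stream_logliks))
```

Raising the weight of one stream makes the state score follow that feature more closely, which is how the relative influence of different features is controlled.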
Yuan LIANG Koji IWANO Koichi SHINODA
Most error correction interfaces for speech recognition applications on smartphones require the user to first mark an error region and choose the correct word from a candidate list. We propose a simple multimodal interface to make the process more efficient. We develop Long Context Match (LCM) to get candidates that complement the conventional word confusion network (WCN). Assuming that not only the preceding words but also the succeeding words of the error region are validated by users, we use such contexts to search higher-order n-gram corpora for matching word sequences. For this purpose, we also utilize Web text data. Furthermore, we propose a combination of LCM and WCN (“LCM + WCN”) to provide users with candidate lists that are more relevant than those yielded by WCN alone. We compare our interface with the WCN-based interface on the Corpus of Spontaneous Japanese (CSJ). Our proposed “LCM + WCN” method improved the 1-best accuracy by 23% and the Mean Reciprocal Rank (MRR) by 28%, and our interface reduced the user's load by 12%.
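The two candidate-list metrics reported above can be computed as in the following sketch. It assumes each error region has one ranked candidate list and one reference word, and that a reference missing from the list contributes zero to the MRR; the function names are illustrative.

```python
def one_best_accuracy(candidate_lists, references):
    """Fraction of error regions whose top candidate is the reference word."""
    hits = sum(1 for cands, ref in zip(candidate_lists, references)
               if cands and cands[0] == ref)
    return hits / len(references)

def mean_reciprocal_rank(candidate_lists, references):
    """Average of 1/rank of the reference word in each candidate list.

    A reference that does not appear in its list contributes 0 (assumption).
    """
    total = 0.0
    for cands, ref in zip(candidate_lists, references):
        if ref in cands:
            total += 1.0 / (cands.index(ref) + 1)
    return total / len(references)
```

For example, with candidate lists `[["a", "b"], ["x", "y", "z"], ["q"]]` and references `["a", "z", "m"]`, the reciprocal ranks are 1, 1/3, and 0, so only the first region counts toward the 1-best accuracy.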
Mariana RODRIGUES MAKIUCHI Tifani WARNITA Nakamasa INOUE Koichi SHINODA Michitaka YOSHIMURA Momoko KITAZAWA Kei FUNAKI Yoko EGUCHI Taishiro KISHIMOTO
We propose a non-invasive and cost-effective method to automatically detect dementia using only speech audio data. We extract paralinguistic features from a short speech segment and use Gated Convolutional Neural Networks (GCNN) to classify it as dementia or healthy. We evaluate our method on the Pitt Corpus and on our own dataset, the PROMPT Database. Our method yields an accuracy of 73.1% on the Pitt Corpus using an average of 114 seconds of speech data. On the PROMPT Database, our method yields an accuracy of 74.7% using 4 seconds of speech data, and this improves to 80.8% when we use all of a patient's speech data. Furthermore, we evaluate our method on a three-class classification problem that includes the Mild Cognitive Impairment (MCI) class and achieve an accuracy of 60.6% with 40 seconds of speech data.
Hilman PARDEDE Koji IWANO Koichi SHINODA
Spectral subtraction (SS) is an additive noise removal method that is derived in an extensive framework. In spectral subtraction, it is assumed that the speech and noise spectra follow Gaussian distributions and are independent of each other. Hence, noisy speech also follows a Gaussian distribution. The spectral subtraction formula is obtained by maximizing the likelihood of the noisy speech distribution with respect to its variance. However, it is well known that noisy speech observed in real situations often follows a heavy-tailed distribution, not a Gaussian distribution. In this paper, we introduce a q-Gaussian distribution from non-extensive statistics to represent the distribution of noisy speech and derive a new spectral subtraction method based on it. We found that the q-Gaussian distribution fits the noisy speech distribution better than the Gaussian distribution does. Our speech recognition experiments using the Aurora-2 database showed that the proposed method, q-spectral subtraction (q-SS), outperformed the conventional SS method.
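For reference, conventional power spectral subtraction (the baseline, not the proposed q-SS) can be sketched as below. The spectral floor and its default value are assumptions added here to keep the estimate non-negative, a common practical safeguard; the function name is illustrative.

```python
def spectral_subtraction(noisy_power, noise_power, floor=0.01):
    """Conventional power spectral subtraction for one frame.

    noisy_power: power spectrum |Y(k)|^2 of the noisy speech, per frequency bin.
    noise_power: estimated noise power |N(k)|^2, per frequency bin.
    floor:       assumed flooring factor; bins where the subtraction would go
                 below floor * |Y(k)|^2 are clamped to that value.
    """
    return [max(y - n, floor * y) for y, n in zip(noisy_power, noise_power)]
```

A bin with noisy power 4.0 and noise power 1.0 yields 3.0, while a bin whose noise estimate exceeds the observed power is clamped to the floor rather than becoming negative.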
Tomohiro MASHITA Koichi SHINTANI Kiyoshi KIYOKAWA
This paper presents a user study on the effects of hand and ocular dominance on pointing gestures. The results are applicable to designing new gesture interfaces that are close to a user's cognition, intuitive, and easy to use. The user study investigates the relationship between a participant's dominances and pointing gestures. Four participant groups (right-handed right-eye dominant, right-handed left-eye dominant, left-handed right-eye dominant, and left-handed left-eye dominant) were prepared, and participants were asked to point at targets on a screen with their left and right hands. The pointing errors of the different participant groups were calculated and compared. The results show that using the dominant eye produces better results than using the non-dominant eye, and that accuracy increases when the targets are located on the same side as the dominant eye. Based on these properties, a method to find the dominant eye for pointing gestures is proposed. This method can find the dominant eye of an individual with more than 90% accuracy.
Masahiro NISHI Koichi SHIN Teruaki YOSHIDA
In the digital terrestrial TV broadcasting system, it is important to evaluate both the quantitative level and the sources of overreach interference, because it can degrade TV service quality. This paper proposes a new overreach measurement method that simultaneously monitors the RSSI (Received Signal Strength Indicator) and CNR (Carrier to Noise power Ratio) of TV waves and the RSSI of FM waves. The results of measurements conducted in Hiroshima prefecture show that the proposed method can evaluate the level of overreach interference in the TV waves and also identify the source of the interference. A total of 43 overreach interference events were detected by the proposed method during one year of measurement in 2012. Based on M profile data, this paper also shows that the main cause of the overreach interference in this measurement is duct propagation due to meteorological conditions.
Muhammad Rasyid AQMAR Koichi SHINODA Sadaoki FURUI
Variations in walking speed have a strong impact on gait-based person identification. We propose a method that is robust against walking-speed variations. It is based on a combination of cubic higher-order local auto-correlation (CHLAC), gait silhouette-based principal component analysis (GSP), and a statistical framework using hidden Markov models (HMMs). The CHLAC features capture the within-phase spatio-temporal characteristics of each individual, the GSP features retain more shape/phase information for better gait sequence alignment, and the HMMs classify the ID of each gait even when walking speed changes nonlinearly. We compared the performance of our method with other conventional methods using five different databases, SOTON, USF-NIST, CMU-MoBo, TokyoTech A and TokyoTech B. The proposed method was equal to or better than the others when the speed did not change greatly, and it was significantly better when the speed varied across and within a gait sequence.
Takafumi KOSHINAKA Kentaro NAGATOMO Koichi SHINODA
A novel online speaker clustering method based on a generative model is proposed. It employs an incremental variant of variational Bayesian learning and provides probabilistic (non-deterministic) decisions for each input utterance, on the basis of the history of preceding utterances. It can be expected to be robust against errors in cluster estimation and the classification of utterances, and hence to be applicable to many real-time applications. Experimental results show that it produces 50% fewer classification errors than does a conventional online method. They also show that it is possible to reduce the number of speech recognition errors by combining the method with unsupervised speaker adaptation.
Hiroko MURAKAMI Koichi SHINODA Sadaoki FURUI
We propose an active learning framework for speech recognition that reduces the amount of data required for acoustic modeling. This framework consists of two steps. We first obtain a phone-error distribution using an acoustic model estimated from transcribed speech data. Then, from a text corpus we select a sentence whose phone-occurrence distribution is close to the phone-error distribution and collect its speech data. We repeat this process to increase the amount of transcribed speech data. We applied this framework to speaker adaptation and acoustic model training. Our evaluation results showed that it significantly reduced the amount of transcribed data while maintaining the same level of accuracy.
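The sentence-selection step can be sketched as follows. The abstract does not state which closeness measure is used, so this sketch assumes a smoothed Kullback-Leibler divergence between the phone-error distribution and each candidate sentence's phone-occurrence distribution; the function names, the smoothing constant, and the toy phone inventory are all hypothetical.

```python
import math
from collections import Counter

def phone_distribution(phones, inventory, eps=1e-6):
    """Smoothed relative frequency of each phone in a phone sequence."""
    counts = Counter(phones)
    total = len(phones) + eps * len(inventory)
    return {p: (counts[p] + eps) / total for p in inventory}

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same phone inventory."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

def select_sentence(error_dist, candidates, inventory):
    """Pick the candidate sentence whose phone distribution is closest
    (under the assumed KL measure) to the phone-error distribution.

    candidates: dict mapping a sentence ID to its phone sequence.
    """
    return min(
        candidates,
        key=lambda s: kl_divergence(
            error_dist, phone_distribution(candidates[s], inventory)),
    )
```

In the framework described above, the selected sentence would then be recorded, added to the transcribed data, and the phone-error distribution re-estimated before the next selection.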
Koji TANIGUCHI Masaru NAKAKITA Yoshihiro UENO Kaoru MATSUOKA Koichi SHINOHARA
A method of evaluating the gas viscous friction force acting on the head/disk interface has been developed. In the past, the effect of the gas viscous friction force has been almost negligible because of its small value compared with the contact friction force. Recently, the gas viscous friction force has tended to increase with the decrease in spacing and the increase in relative speed between the slider and the disk; it is therefore becoming necessary to consider its effect on motor load and slider posture. Few experimental studies of the gas viscous friction force, however, have been performed. In this study, the gas viscous friction force was measured by developing a sensitive friction force sensor. Furthermore, a method of evaluating the gas viscous and contact friction forces separately has also been established.
Yuzo HAMANAKA Koichi SHINODA Takuya TSUTAOKA Sadaoki FURUI Tadashi EMORI Takafumi KOSHINAKA
We propose a committee-based method of active learning for large vocabulary continuous speech recognition. Multiple recognizers are trained in this approach, and the recognition results obtained from these are used for selecting utterances. Those utterances whose recognition results differ the most among recognizers are selected and transcribed. Progressive alignment and voting entropy are used to measure the degree of disagreement among recognizers on the recognition result. Our method was evaluated by using 191-hour speech data in the Corpus of Spontaneous Japanese. It proved to be significantly better than random selection. It only required 63 h of data to achieve a word accuracy of 74%, while standard training (i.e., random selection) required 103 h of data. It also proved to be significantly better than conventional uncertainty sampling using word posterior probabilities.
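The disagreement measure can be illustrated with the following sketch of vote entropy. It assumes the recognizers' hypotheses have already been aligned into equal-length word sequences (the progressive alignment step is not shown) and that the utterance-level score is the average entropy over aligned positions; both are simplifying assumptions, and the function names are illustrative.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's votes at one aligned word position."""
    k = len(votes)
    counts = Counter(votes)
    return -sum((c / k) * math.log(c / k) for c in counts.values())

def utterance_disagreement(aligned_hyps):
    """Average vote entropy over all aligned positions of an utterance.

    aligned_hyps: one word sequence per recognizer, all of equal length.
    """
    slots = list(zip(*aligned_hyps))
    return sum(vote_entropy(slot) for slot in slots) / len(slots)
```

Utterances with the highest disagreement score would be the ones sent for transcription; a position where every recognizer outputs the same word contributes zero.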
Saburo TANAKA Ryouji SHIMIZU Yusuke SAITO Koichi SHIN
A portable cryo-system using a high-Tc SQUID for measuring the remanent magnetic field of a rock specimen was designed and fabricated. The sensing surface of the SQUID faces upward in our system, whereas in systems for biomagnetic measurement it faces downward. The SQUID is cooled by liquid nitrogen via a sapphire heat transfer rod. The total heat transfer of the system was measured by means of a boil-off method and was found to be 1.65 W. It was demonstrated that the system can be operated for more than 17 hours without any maintenance such as refilling with liquid nitrogen. The system was applied to the measurement of the remanent magnetic field distributions of rock samples cored from deep underground. We have successfully measured the distributions.