Existing vision-substitute systems have insufficient spatial resolution to convey detailed environmental information. To present detailed spatial information, we propose two stimulation methods that increase the transferred information using a 2-D tactile stimulator array. First, the stimulators are divided into several groups. Because the groups are activated alternately in time, the spacing between stimulators can be made smaller than the two-point discrimination threshold. When the stimulators are divided into two and four groups, the number of stimulators becomes, respectively, two and four times that possible when all stimulators must be spaced at the two-point discrimination threshold. Second, the user selects a measurement range and the system presents only the targets within that range; by changing the measurement range, the user acquires spatial information over the entire measurement area. This method can present target range accurately. We confirm both methods experimentally.
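The abstract gives no implementation details, but the time-division idea is straightforward to illustrate. In the following Python sketch, the function names and the diagonal interleaving scheme are our own illustrative choices: adjacent pins are assigned to different groups, so pins active in the same frame are farther apart than the physical pin pitch.

```python
import numpy as np

def interleaved_groups(rows, cols, n_groups):
    """Partition a 2-D stimulator array into n_groups interleaved groups.
    Adjacent stimulators fall into different groups, so when the groups are
    driven alternately, simultaneously active pins are farther apart than
    the physical pin pitch."""
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    return (r + c) % n_groups  # diagonal interleaving

def drive_sequence(pattern, group_map, n_groups):
    """Yield one activation frame per group: only that group's active pins."""
    for g in range(n_groups):
        yield pattern & (group_map == g)

# Example: an 8x8 pattern driven as two alternating groups.
pattern = np.zeros((8, 8), dtype=bool)
pattern[3, :] = True                      # a horizontal-line target
groups = interleaved_groups(8, 8, 2)
frames = list(drive_sequence(pattern, groups, 2))
```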
Koichiro MISU Koji IBATA Shusou WADAKA Takao CHIBA Minoru K. KUROSAWA
Acoustic field analysis results for surface acoustic wave (SAW) dispersive delay lines using inclined chirp IDTs on a Y-Z LiNbO3 substrate are described. The calculated results are compared with optical measurements. The angular spectrum of plane waves method is applied to the calculation of the acoustic fields, with the anisotropy of the SAW velocity taken into account through a polynomial approximation. The acoustic field propagating along the Z-axis of the substrate, which is the main beam excited by the inclined chirp IDT, shows an asymmetric distribution between the +Z and -Z directions. Furthermore, a SAW beam propagating in a slanted direction, at an angle of +18° from the Z-axis toward the X-axis, is observed. We show that this slanted beam is the first side lobe excited by the inclined chirp IDT. The acoustic field shows an asymmetric distribution along the X-axis because of the asymmetric structure of the inclined chirp IDT. Finally, the acoustic field of a two-IDT structure, which consists of two identical IDTs electrically connected in series, is presented. This field is calculated by superposing the field excited by a single IDT onto the same calculated field shifted along the X-axis. Two SAW beams excited by the IDTs are observed, and their distributions are not parallel. The calculated results show good agreement with the optical measurement results.
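As a rough illustration of the calculation approach named in the abstract, the sketch below implements a minimal angular-spectrum-of-plane-waves propagation step in Python, with the SAW velocity anisotropy modeled by a polynomial v(theta) evaluated at the isotropic plane-wave angle as a first-order simplification. The function names, sampling choices, and nominal velocity are assumptions, not taken from the paper.

```python
import numpy as np

def asm_propagate(u0, dx, z, freq, v_poly):
    """Propagate a sampled 1-D aperture field u0(x) a distance z along Z
    using the angular spectrum of plane waves. Substrate anisotropy enters
    through v_poly: polynomial coefficients (np.polyval order) giving the
    SAW velocity as a function of the propagation angle from the Z-axis."""
    n = len(u0)
    kx = 2 * np.pi * np.fft.fftfreq(n, d=dx)        # transverse wavenumbers
    U = np.fft.fft(u0)                              # plane-wave decomposition
    k_iso = 2 * np.pi * freq / np.polyval(v_poly, 0.0)
    theta = np.arcsin(np.clip(kx / k_iso, -1.0, 1.0))   # approximate angle
    k = 2 * np.pi * freq / np.polyval(v_poly, theta)    # angle-dependent |k|
    kz = np.sqrt(np.maximum(k**2 - kx**2, 0.0))     # evanescent parts dropped
    return np.fft.ifft(U * np.exp(-1j * kz * z))

# Toy usage: v(theta) ~ v0 * (1 - 0.05 * theta**2) near the Z-axis.
v0 = 3488.0   # approximate SAW velocity on Y-Z LiNbO3, m/s
field = asm_propagate(np.ones(256, dtype=complex), dx=2e-6, z=1e-3,
                      freq=100e6, v_poly=[-0.05 * v0, 0.0, v0])
```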
Suehiro SHIMAUCHI Yoichi HANEDA Akitoshi KATAOKA Akinori NISHIHARA
We propose a gradient-limited affine projection algorithm (GL-APA), which can achieve fast and double-talk-robust convergence in acoustic echo cancellation. GL-APA is derived from an M-estimation-based nonlinear cost function extended to evaluate the multiple error signals dealt with in the affine projection algorithm (APA). By considering the nonlinearity of the gradient, we carefully formulate an update equation consistent with the multiple input-output relationships that the conventional APA inherently satisfies to achieve fast convergence. We also introduce a scaling rule for the nonlinearity, so that GL-APA can easily be implemented with any projection order by using a predetermined primary function as the basis of scaling. This guarantees a linkage between GL-APA and the gradient-limited normalized least-mean-squares algorithm (GL-NLMS), a conventional algorithm that corresponds to GL-APA of the first order. The performance of GL-APA is demonstrated with simulation results.
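The first-order special case, GL-NLMS, is simple enough to sketch. In the Python sketch below, the gradient limiter is a hard clip, a Huber-style choice made for illustration (the paper's predetermined primary function may differ), which keeps double-talk bursts from producing large, damaging filter updates.

```python
import numpy as np

def gl_nlms(x, d, taps, mu=0.5, clip=1.0, eps=1e-8):
    """Gradient-limited NLMS: clipping the a-priori error limits the
    gradient magnitude, so double-talk bursts cannot blow up the update.
    x: far-end signal, d: microphone signal, taps: filter length."""
    w = np.zeros(taps)
    e_out = np.zeros(len(x))
    for n in range(taps, len(x)):
        u = x[n - taps:n][::-1]               # regression vector
        e = d[n] - w @ u                      # a-priori error
        g = np.clip(e, -clip, clip)           # limit the gradient
        w += mu * g * u / (u @ u + eps)       # normalized update
        e_out[n] = e
    return w, e_out
```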
Tetsuya TAKIGUCHI Masafumi NISHIMURA Yasuo ARIKI
This paper describes a hands-free speech recognition technique based on acoustic model adaptation to reverberant speech. In hands-free speech recognition, recognition accuracy is degraded by reverberation, since each segment of speech is affected by the reflection energy of the preceding segment. To compensate for the reflection signal, we introduce a frame-by-frame adaptation method that adds the reflection signal to the means of the acoustic model. The reflection signal is approximated by a first-order linear prediction from the observation signal at the preceding frame, and the linear prediction coefficient is estimated with a maximum likelihood method using the EM algorithm, which maximizes the likelihood of the adaptation data. The method's effectiveness is confirmed by word recognition experiments on reverberant speech.
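A minimal sketch of the adaptation step described above, assuming for simplicity that the features live in a domain where the reflection adds linearly; the function names and toy sizes are our own.

```python
import numpy as np

def adapt_means(means, prev_obs, alpha):
    """Frame-by-frame adaptation: model the reverberant observation as
    o[t] ~ s[t] + alpha * o[t-1] (first-order linear prediction of the
    reflection) and shift the clean-speech Gaussian means by the predicted
    reflection before evaluating frame t. alpha is the ML-estimated
    prediction coefficient."""
    return means + alpha * prev_obs

# Toy usage: 16 mixture means of 13-dimensional features.
mu = np.zeros((16, 13))
o_prev = np.random.randn(13)
mu_t = adapt_means(mu, o_prev, alpha=0.3)
```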
Shigeki MATSUDA Takatoshi JITSUHIRO Konstantin MARKOV Satoshi NAKAMURA
In this paper, we describe a parallel decoding-based ASR system developed at ATR that is robust to noise type, SNR, and speaking style. It is difficult to recognize speech affected by various factors, especially when the ASR system contains only a single acoustic model. One solution is to employ multiple acoustic models, one for each condition. Even though the robustness of each individual acoustic model is limited, the whole ASR system can then handle various conditions appropriately. Our system has two recognition sub-systems that use different features, MFCC and Differential MFCC (DMFCC). Each sub-system has several acoustic models depending on SNR, speaker gender, and speaking style, and during recognition each acoustic model is adapted by fast noise adaptation. From each sub-system, one hypothesis is selected based on posterior probability. The final recognition result is obtained by combining the best hypotheses from the two sub-systems. On the AURORA-2J task, widely used for the evaluation of noise robustness, our system achieved higher recognition performance than a system containing only a single model. Our system was also tested on normal and hyper-articulated speech contaminated by several background noises, and exhibited high robustness to both noise and speaking style.
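The selection-and-combination logic can be sketched in a few lines of Python; the simple posterior comparison below is an illustrative stand-in, since the abstract does not specify the exact combination rule.

```python
def select_hypothesis(hyps):
    """Within one sub-system, pick the hypothesis with the highest
    posterior probability among its condition-specific acoustic models.
    hyps: list of (text, posterior) pairs."""
    return max(hyps, key=lambda h: h[1])

def combine(best_mfcc, best_dmfcc):
    """Combine the two sub-systems' best hypotheses; here the one with
    the larger posterior simply wins."""
    return max([best_mfcc, best_dmfcc], key=lambda h: h[1])
```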
Sakriani SAKTI Konstantin MARKOV Satoshi NAKAMURA
The most widely used acoustic unit in current automatic speech recognition systems is the triphone, which includes the immediately preceding and following phonetic contexts. Although triphones have proved to be an efficient choice, it is believed that they are insufficient for capturing all of the coarticulation effects. A wider phonetic context seems more appropriate, but it often suffers from the data sparsity problem and memory constraints. Therefore, efficient modeling of wider contexts needs to be addressed to achieve a realistic application of an automatic speech recognition system. This paper presents a new method of modeling pentaphone-context units using the hybrid HMM/BN acoustic modeling framework. Rather than modeling pentaphones explicitly, this approach incorporates the probabilistic dependencies between the triphone context unit and the second preceding/following contexts into the triphone state output distributions by means of the BN. The advantages of this approach are that we can extend the modeled phonetic context within the triphone framework, and that we can use a standard decoding system by treating the second preceding/following context variables as hidden during recognition. To handle the increased number of parameters, tying based on knowledge-based phoneme classes and a data-driven clustering method is applied. The evaluation experiments indicate that the proposed model outperforms the standard HMM-based triphone model, achieving a 9-10% relative word error rate (WER) reduction.
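In equation form, the incorporation described above amounts to marginalizing the extra context variables inside each triphone state's output distribution. Writing $q$ for a triphone HMM state and $c_L, c_R$ for the second preceding and following contexts (our notation, consistent with the abstract):

$$ p(x \mid q) \;=\; \sum_{c_L} \sum_{c_R} P(c_L, c_R \mid q)\, p(x \mid q, c_L, c_R). $$

With Gaussian component densities this is exactly a per-state Gaussian mixture, which is why a standard decoder can be used when $c_L$ and $c_R$ are hidden.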
Minimum Bayes risk estimation and decoding strategies based on lattice segmentation techniques can be used to refine large vocabulary continuous speech recognition (LVCSR) systems, both through the estimation of the parameters of the underlying hidden Markov models and through the identification of smaller recognition tasks, which provides the opportunity to incorporate novel modeling and decoding procedures in LVCSR. These techniques are discussed in the context of going 'beyond HMMs'; in particular, this process of subproblem identification makes it possible to train and apply small-domain binary pattern classifiers, such as Support Vector Machines, to large vocabulary continuous speech recognition.
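As a concrete toy instance of minimum Bayes risk decoding (over an N-best list rather than the paper's segmented lattices), the following Python sketch picks the hypothesis with minimum expected word error under the posterior distribution.

```python
def edit_distance(a, b):
    """Levenshtein distance between two word sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[m][n]

def mbr_decode(nbest):
    """Minimum Bayes risk over an N-best list: return the hypothesis whose
    expected word error against the posterior-weighted competitors is
    smallest. nbest: list of (word_tuple, posterior) pairs."""
    return min(nbest, key=lambda h: sum(p * edit_distance(h[0], w)
                                        for w, p in nbest))

# Toy usage.
nbest = [(("a", "b"), 0.5), (("a", "c"), 0.3), (("a",), 0.2)]
best_words, _ = mbr_decode(nbest)
```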
Tetsuji OGAWA Tetsunori KOBAYASHI
Discriminative modeling is applied to optimize the structure of a Partly-Hidden Markov Model (PHMM). The PHMM was proposed in our previous work to deal with the complicated temporal changes of acoustic features; it can represent observation-dependent behaviors in both observations and state transitions. In the formulation of the previous PHMM, we used a common structure for all models. However, the optimal structure, the one giving the best performance, can be expected to differ from category to category. In this paper, we design a new structure optimization method in which the dependence of the states and the observations of the PHMM is optimally defined for each model using the weighted likelihood-ratio maximization (WLRM) criterion. The WLRM criterion gives high discriminability between the correct category and the incorrect categories, and therefore yields model structures with good discriminative performance. We define as optimal the combination of model structures that satisfies the WLRM criterion over all possible structure combinations. A genetic algorithm is also applied to adequately approximate a full search. Results on continuous lecture speech recognition show the effectiveness of the proposed structure optimization: it reduced word errors compared to an HMM and to a PHMM with a common structure for all models.
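A toy version of the genetic search over per-category structure assignments might look as follows in Python; score() stands in for the WLRM criterion (evaluating a full set of PHMMs built with the assigned structures), and the chromosome encoding, operators, and hyperparameters are all illustrative assumptions.

```python
import random

def ga_structure_search(score, n_models, n_structs, pop=20, gens=50, pmut=0.1):
    """Toy GA: a chromosome assigns one of n_structs structure types to
    each of the n_models categories; score(chromosome) is the (higher is
    better) structure-selection criterion supplied by the caller."""
    def mutate(c):
        return tuple(random.randrange(n_structs) if random.random() < pmut
                     else g for g in c)
    def crossover(a, b):
        cut = random.randrange(1, n_models)   # single-point crossover
        return a[:cut] + b[cut:]
    popn = [tuple(random.randrange(n_structs) for _ in range(n_models))
            for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=score, reverse=True)    # keep the fitter half
        elite = popn[:pop // 2]
        popn = elite + [mutate(crossover(random.choice(elite),
                                         random.choice(elite)))
                        for _ in range(pop - len(elite))]
    return max(popn, key=score)
```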
Sakriani SAKTI Satoshi NAKAMURA Konstantin MARKOV
Over the last decade, the Bayesian approach has increased in popularity in many application areas. It uses a probabilistic framework that encodes our beliefs or actions in situations of uncertainty. Information from several models can also be combined within the Bayesian framework to achieve better inference and to better account for modeling uncertainty. The approach we adopt here is to utilize the benefits of the Bayesian framework to improve acoustic model precision in speech recognition systems, modeling a wider-than-triphone context by approximating it with several less context-dependent models. This composition was developed to avoid the crucial problem of limited training data and to reduce model complexity. To enhance model reliability in the face of unseen contexts and limited training data, flooring and smoothing techniques are applied. Experimental results show that the proposed Bayesian pentaphone model improves word accuracy in comparison with the standard triphone model.
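One way such a composition can be realized, writing the pentaphone as $(c_{-2}, c_{-1}, c, c_{+1}, c_{+2})$ and assuming the second preceding and following contexts are conditionally independent of the rest given the observation and the center phone, is

$$ p(x \mid c_{-2}, c_{-1}, c, c_{+1}, c_{+2}) \;\approx\; C \cdot p(x \mid c_{-1}, c, c_{+1})\, \frac{p(x \mid c_{-2}, c)\, p(x \mid c, c_{+2})}{p(x \mid c)^{2}}, $$

where $C$ collects the context priors. This factorization, obtained by applying Bayes' rule twice, is our reconstruction of the general idea; the paper's exact composition may differ, and the flooring and smoothing mentioned above guard the ratio against poorly trained denominator models.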
Konstantin MARKOV Satoshi NAKAMURA
In recent years, the number of studies investigating new directions in speech modeling that go beyond the conventional HMM has increased considerably. One promising approach is to use Bayesian Networks (BNs) as speech models. Full recognition systems based on Dynamic BNs, as well as acoustic models using BNs, have been proposed lately. Our group at ATR has been developing a hybrid HMM/BN model, which is an HMM whose state probability distribution is modeled by a BN instead of the commonly used mixture of Gaussian functions. In this paper, we describe how to use hybrid HMM/BN acoustic models, emphasizing some design and implementation issues. The most essential part of HMM/BN model building is the choice of the state BN topology. As it is chosen manually, several factors should be considered in this process, including, but not limited to, the type of data, the task, and the available additional information. When context-dependent models are used, the state-level structure can be obtained by traditional methods. HMM/BN parameter learning is based on the Viterbi training paradigm and consists of two alternating steps: BN training and HMM transition updates. For recognition, in some cases, BN inference is computationally equivalent to a mixture of Gaussians, which allows the HMM/BN model to be used in existing decoders without any modification. We present two examples of HMM/BN model applications in speech recognition systems. Evaluations under various conditions and on different tasks showed that the HMM/BN model gives consistently better performance than the conventional HMM.
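The practical consequence noted above, that BN inference at a state can reduce to a mixture of Gaussians, is easy to make concrete. A minimal Python sketch of evaluating such a state output density (diagonal covariances; names our own):

```python
import numpy as np

def state_log_likelihood(x, weights, means, variances):
    """Log output density of one HMM/BN state when the BN's auxiliary
    variables are hidden: marginalization collapses to a weighted sum of
    Gaussians, evaluable by any standard GMM decoder.
    x: (D,), weights: (M,), means/variances: (M, D)."""
    diff = x - means                                  # (M, D) broadcast
    log_comp = -0.5 * np.sum(diff**2 / variances
                             + np.log(2 * np.pi * variances), axis=1)
    m = log_comp.max()                                # numerical stabilization
    return m + np.log(np.sum(weights * np.exp(log_comp - m)))
```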
Tobias CINCAREK Tomoki TODA Hiroshi SARUWATARI Kiyohiro SHIKANO
To obtain a robust acoustic model for a certain speech recognition task, a large amount of speech data is necessary. However, the preparation of speech data, including recording and transcription, is very costly and time-consuming. Although there are attempts to build generic acoustic models that are portable among different applications, speech recognition performance is typically task-dependent. This paper introduces a method for automatically building task-dependent acoustic models based on selective training. Instead of setting up a new database, only a small amount of task-specific development data needs to be collected. Based on the likelihood of the target model parameters given this development data, utterances that are acoustically close to the development data are selected from existing speech data resources. Since there are in general too many possibilities for selecting a data subset from a larger database, a heuristic has to be employed. The proposed algorithm deletes single utterances temporarily, or alternates between successive deletion and addition of multiple utterances. To make selective training computationally practical, model retraining and likelihood calculation need to be fast. It is shown that the model likelihood can be calculated quickly and easily from sufficient statistics, without the need for explicit reconstruction of the model parameters. The algorithm is applied to obtain infant- and elderly-dependent acoustic models with only very little development data available. There is an improvement in word accuracy of up to 9% in comparison with conventional EM training without selection. Furthermore, the approach also outperformed MLLR and MAP adaptation with the development data.
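The sufficient-statistics idea that makes the selection loop fast can be sketched for a single diagonal Gaussian: keep the frame count, sum, and sum of squares, and add or remove an utterance's contribution as a cheap statistics update, with no retraining pass over the whole database. A minimal Python sketch (class and method names our own):

```python
import numpy as np

class GaussStats:
    """Sufficient statistics of one diagonal Gaussian: frame count, sum,
    and sum of squares. Deleting or re-adding an utterance is O(frames)."""
    def __init__(self, dim):
        self.n, self.s1, self.s2 = 0.0, np.zeros(dim), np.zeros(dim)
    def add(self, frames, sign=1.0):
        self.n += sign * len(frames)
        self.s1 += sign * frames.sum(axis=0)
        self.s2 += sign * (frames**2).sum(axis=0)
    def remove(self, frames):
        self.add(frames, sign=-1.0)
    def loglik(self, frames):
        """Log-likelihood of frames under the ML Gaussian implied by the
        current statistics; mean/variance fall out of the stats directly."""
        mu = self.s1 / self.n
        var = self.s2 / self.n - mu**2
        d = frames - mu
        return -0.5 * np.sum(d**2 / var + np.log(2 * np.pi * var))
```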
Shoei SATO Kazuo ONOE Akio KOBAYASHI Toru IMAI
This paper proposes a new method for compensating acoustic scores in the Viterbi search for robust speech recognition. The method introduces noise models to represent a wide variety of noises and realizes robust decoding together with the conventional techniques of subtraction and adaptation. The likelihoods of the noise models are used in two ways. One is to calculate a confidence factor for each input frame by comparing the likelihoods of the speech models and the noise models; the weight of the acoustic score for a noisy frame is then reduced according to the value of the confidence factor. The other is to use the likelihood of a noise model as an alternative to that of a silence model when the input is noisy. Since a lower confidence factor compresses acoustic scores, the decoder relies more on language scores and keeps more hypotheses within a fixed search depth for a noisy frame. An experiment using commentary transcriptions of a broadcast sports program (MLB: Major League Baseball) showed that the proposed method obtained a 6.7% relative word error reduction. The method also reduced the relative error rate of keywords by 17.9%, which is expected to lead to an improvement in metadata extraction accuracy.
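A minimal sketch of the confidence-weighting step; the sigmoid of the log-likelihood ratio below is a plausible illustrative choice, since the abstract does not give the exact definition of the confidence factor.

```python
import numpy as np

def confidence_factor(ll_speech, ll_noise, beta=1.0):
    """Per-frame confidence from the speech- vs. noise-model likelihood
    ratio; close to 1 when the frame looks like speech, close to 0 when
    it looks like noise."""
    return 1.0 / (1.0 + np.exp(-beta * (ll_speech - ll_noise)))

def compensated_score(acoustic_ll, conf):
    """Down-weight the acoustic log score on low-confidence (noisy)
    frames, so the decoder leans relatively more on the language model
    and keeps more hypotheses alive there."""
    return conf * acoustic_ll
```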
Takuya YOSHIOKA Takafumi HIKICHI Masato MIYOSHI Hiroshi G. OKUNO
This paper describes a method for estimating the amplitude characteristics of poles common to multiple room transfer functions from musical audio signals received by multiple microphones. Knowledge of these pole characteristics would make it easier to adjust audio equalizers, since the poles correspond to the room resonances. It has been proven that an estimate of the poles can be calculated precisely when the source signal is white. However, if the source signal is colored, as in the case of a musical audio signal, the estimate is degraded by the frequency characteristics originally contained in the source signal. In this paper, we consider the amplitude spectrum of a musical audio signal to consist of its envelope and its fine structure. We assume that musical pieces can be classified into several categories according to their average amplitude spectral envelopes. Based on this assumption, the amplitude spectral envelope of the musical audio signal can be obtained from prior knowledge of the average amplitude spectral envelope of the category into which the target piece is classified. The fine structure, on the other hand, is identified based on its time variance. By removing both the spectral envelope and the fine structure from the amplitude spectrum estimated with the conventional method, the amplitude characteristics of the acoustical poles can be extracted. Simulation results for 20 popular songs revealed that our method was capable of estimating the amplitude characteristics of the acoustical poles with a spectral distortion of 3.11 dB. In particular, most of the spectral peaks, corresponding to the room resonance modes, were successfully detected.
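A heavily simplified Python sketch of the removal step: time-averaging the frame-wise log spectra suppresses the time-varying fine structure, and subtracting the category's average envelope (the prior knowledge mentioned above) leaves an estimate of the common poles. The names and the exact averaging scheme are our own simplifications, not the paper's procedure.

```python
import numpy as np

def pole_amplitude_estimate(log_frames, category_envelope):
    """log_frames: (T, F) frame-wise log-amplitude spectra estimated by
    the white-source method; category_envelope: (F,) average log envelope
    of the piece's category. Returns the residual log spectrum attributed
    to the common acoustical poles."""
    return log_frames.mean(axis=0) - category_envelope
```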
Ming WU Zhibin LIN Xiaojun QIU
This letter proposes a novel nonlinear distortion for the unique identification of receiving-room impulse responses in stereo acoustic echo cancellation with the frequency-domain adaptive filtering technique. The proposed distortion is effective in reducing the coherence between the two incoming audio channels, and its influence on audio quality is inaudible.
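The letter's novel distortion is not described in the abstract; for orientation, the classic half-wave-rectifier preprocessing widely used for this purpose in the stereo AEC literature looks as follows (illustrative of the technique class only, not the proposed method).

```python
import numpy as np

def halfwave_preprocess(x, alpha=0.5, positive=True):
    """Add a small half-wave-rectified component to one channel; the other
    channel typically gets the opposite-sign rectifier. This breaks the
    exact linear relation between the channels while keeping the added
    distortion small enough to be barely audible for moderate alpha."""
    rect = (x + np.abs(x)) / 2.0 if positive else (x - np.abs(x)) / 2.0
    return x + alpha * rect

left = halfwave_preprocess(np.random.randn(1000), positive=True)
right = halfwave_preprocess(np.random.randn(1000), positive=False)
```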
Ikumi ENOMORI Kunimasa SAITOH Masanori KOSHIBA
The propagation characteristics of acoustic waves in photonic crystal fibers (PCFs) have been theoretically investigated in detail. In order to evaluate acoustic band structures and guided modes for out-of-plane propagation in PCFs, analysis methods based on the finite element method are newly formulated. Numerical results show that complete acoustic band-gaps (ABGs) exist in the cladding region of PCFs and that acoustic guided modes can be localized in the defect region of PCFs by the ABG effect. Furthermore, it is shown that acoustic guided modes can also be localized in the defect region of PCFs by total internal reflection. These confinement mechanisms for acoustic waves propagating along the fiber length are completely different from those of lightwaves.
Yohei ITAYA Heiga ZEN Yoshihiko NANKAKU Chiyomi MIYAJIMA Keiichi TOKUDA Tadashi KITAMURA
This paper investigates the effectiveness of the DAEM (Deterministic Annealing EM) algorithm in acoustic modeling for speaker and speech recognition. Although the EM algorithm has been widely used to approximate ML estimates, it suffers from initialization dependence. To relax this problem, the DAEM algorithm was proposed, and its effectiveness was confirmed on small artificial tasks. In this paper, we apply the DAEM algorithm to practical speech recognition tasks: speaker recognition based on GMMs and continuous speech recognition based on HMMs. Experimental results show that the DAEM algorithm can improve recognition performance compared to the standard EM algorithm with conventional initialization, especially in flat-start training for continuous speech recognition.
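For reference, the core of DAEM is a tempered E-step: posteriors are raised to a power beta < 1 and renormalized, with beta annealed toward 1 to flatten the likelihood surface early on. A self-contained Python sketch for a 1-D GMM (toy setting; the schedule and hyperparameters are our own):

```python
import numpy as np

def daem_gmm(X, K, betas=(0.2, 0.4, 0.6, 0.8, 1.0), iters=20, seed=0):
    """Deterministic annealing EM for a 1-D Gaussian mixture."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(X, K)
    var = np.full(K, X.var())
    w = np.full(K, 1.0 / K)
    for beta in betas:                       # annealing schedule
        for _ in range(iters):
            # E-step with tempered posteriors: p_k^beta / sum_j p_j^beta
            logp = (np.log(w) - 0.5 * np.log(2 * np.pi * var)
                    - 0.5 * (X[:, None] - mu)**2 / var)
            post = np.exp(beta * (logp - logp.max(axis=1, keepdims=True)))
            post /= post.sum(axis=1, keepdims=True)
            # M-step (standard)
            nk = post.sum(axis=0)
            w = nk / len(X)
            mu = (post * X[:, None]).sum(axis=0) / nk
            var = (post * (X[:, None] - mu)**2).sum(axis=0) / nk + 1e-6
    return w, mu, var

# Toy usage: two well-separated clusters.
X = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(2, 1, 200)])
w, mu, var = daem_gmm(X, K=2)
```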
Junichi YAMAGISHI Koji ONISHI Takashi MASUKO Takao KOBAYASHI
This paper describes the modeling of various emotional expressions and speaking styles in synthetic speech using HMM-based speech synthesis. We show two methods for modeling speaking styles and emotional expressions. In the first method, called style-dependent modeling, each speaking style and emotional expression is modeled individually. In the second, called style-mixed modeling, each speaking style and emotional expression is treated as one of the contexts, alongside phonetic, prosodic, and linguistic features, and all speaking styles and emotional expressions are modeled simultaneously by a single acoustic model. We chose four styles of read speech (neutral, rough, joyful, and sad) and compared the two modeling methods using these styles. The results of subjective evaluation tests show that both modeling methods have almost the same accuracy, and that it is possible to synthesize speech with a speaking style and emotional expression similar to those of the target speech. In a style classification test of the synthesized speech, more than 80% of the speech samples generated by either model were judged to be similar to the target styles. We also show that the style-mixed modeling method yields fewer output and duration distributions than the style-dependent modeling method.
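The difference between the two methods is essentially where the style enters the context label. A one-line illustration in Python (the label format is hypothetical, not the paper's actual scheme): in style-mixed modeling the style becomes one more factor available to decision-tree clustering.

```python
def full_context_label(phone, prosodic_context, style):
    """Style-mixed modeling: the style tag joins the phonetic and prosodic
    factors in a single full-context label, so one acoustic model covers
    all styles; style-dependent modeling would instead train a separate
    model per style and omit the style factor."""
    return f"{phone}|{prosodic_context}|style={style}"
```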
Takatoshi JITSUHIRO Satoshi NAKAMURA
We propose a new method for automatically creating non-uniform, context-dependent HMM topologies and for selecting the number of mixture components, based on the Variational Bayesian (VB) approach. Although the Maximum Likelihood (ML) criterion is generally used to create HMM topologies, it has an over-fitting problem. Recently, to avoid this problem, the VB approach has been applied to creating acoustic models for speech recognition. We introduce the VB approach into the Successive State Splitting (SSS) algorithm, which can create both contextual and temporal variations for HMMs. Experimental results indicate that the proposed method can automatically create a more efficient model than the original method. We also evaluated a method for increasing the number of mixture components using the VB approach while considering temporal structures; it obtained almost the same performance as ML-based methods with a smaller number of mixture components.
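The selection logic shared by the topology and mixture-size decisions can be sketched generically in Python: accept a candidate refinement (contextual split, temporal split, or mixture increase) only if it raises the variational lower bound. The function names and candidate representation are our own placeholders.

```python
def vb_sss_step(model, candidates, free_energy):
    """One greedy step of VB-based successive state splitting: evaluate the
    variational lower bound (free energy) of every candidate refined model
    and accept the best only if it beats the current model; this objective,
    unlike ML, penalizes complexity and so resists over-fitting."""
    base = free_energy(model)
    best = max(candidates, key=free_energy)
    return best if free_energy(best) > base else model
```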
Masaki OKAMOTO Yoshihiro INOUE Koichi YOSHIHARA Toshio KAWAHARA Jun MORIMOTO
Photoacoustic (PA) spectra of 3,4,9,10-perylenetetracarboxylic dianhydride (PTCDA) films deposited by vacuum evaporation were measured. The films have layered structures constructed from planar perylene molecules. The crystal quality depended on the deposition substrate, and photoacoustic spectroscopy (PAS) appears to be a very useful tool for evaluating these properties through their non-radiative features. The films deposited on the three different substrates had almost the same PL spectra, but the films deposited on the glass substrate had large non-radiative peaks in the PA spectra, in contrast to the films deposited on alumina or crystalline Si(100), whose non-radiative peaks were observed only in the short-wavelength region.
Takahiro SHINOZAKI Sadaoki FURUI
One of the most important issues in spontaneous speech recognition is how to cope with the degradation of recognition accuracy caused by speaking rate fluctuation within an utterance. This paper proposes an acoustic model that adjusts the mixture weights and transition probabilities of the HMM at each frame according to the local speaking rate. The proposed model and its variants, together with conventional models, are implemented within the Bayesian network framework. The proposed model has a hidden variable representing variation in the "mode" of the speaking rate, and its value controls the parameters of the underlying HMM. Model training and maximum probability assignment of the variables are conducted using the EM/GEM and inference algorithms for Bayesian networks. Utterances from meetings and lectures are used for evaluation, with the Bayesian network-based acoustic models rescoring the likelihoods of N-best lists. In the experiments, the proposed model showed consistently higher performance than conventional HMMs and regression HMMs using the same speaking rate information.
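A minimal Python sketch of the frame-level idea: a hidden speaking-rate mode selects the mixture weights, and the frame likelihood marginalizes over the mode posterior. Only the mixture-weight part is shown (the model also conditions transition probabilities on the mode), and the parameterization is our illustrative reading of the abstract, not its exact form.

```python
import numpy as np

def rate_conditioned_log_likelihood(x, rate_post, weights_by_mode,
                                    means, variances):
    """Frame log-likelihood with a hidden speaking-rate mode: the mode
    posterior rate_post (list of floats) mixes mode-specific mixture
    weights (each an (M,) array) over shared diagonal Gaussians.
    x: (D,), means/variances: (M, D)."""
    diff = x - means                                      # (M, D)
    log_comp = -0.5 * np.sum(diff**2 / variances
                             + np.log(2 * np.pi * variances), axis=1)
    m = log_comp.max()                                    # stabilization
    comp = np.exp(log_comp - m)
    lik = sum(p * np.sum(w * comp)
              for p, w in zip(rate_post, weights_by_mode))
    return m + np.log(lik)
```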