
Author Search Result

[Author] Kazuya TAKEDA (29 hits)

Showing 1-20 of 29 hits

  • Speech Enhancement Using Nonlinear Microphone Array Based on Noise Adaptive Complementary Beamforming

    Hiroshi SARUWATARI  Shoji KAJITA  Kazuya TAKEDA  Fumitada ITAKURA
    PAPER-Engineering Acoustics
    Vol: E83-A No:5  Page(s): 866-876

    This paper describes an improved complementary beamforming microphone array based on a new noise adaptation algorithm. Complementary beamforming uses two beamformers designed to have directivity patterns complementary to each other. In this system, during a pause in the target speech, the two directivity patterns are adapted to the noise directions of arrival so that the expected power spectrum of each noise is minimized in the array output. Using this technique, we can realize directional nulls for each noise source even when the number of sound sources exceeds the number of microphones. To evaluate the effectiveness of the approach, speech enhancement and speech recognition experiments are performed in computer simulations with a two-element array and three sound sources under various noise conditions. In comparison with a conventional adaptive beamformer and a conventional spectral subtraction method cascaded with the adaptive beamformer, it is shown that (1) the proposed array improves the signal-to-noise ratio (SNR) of degraded speech by more than 6 dB when the interference is two speakers at input SNRs below 0 dB, (2) the proposed array improves the SNR by about 2 dB when the interference is babble noise, and (3) the recognition rate improves by more than 18% when the interference is two speakers or two overlapped multi-speaker signals at an input SNR of 10 dB.
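
    As a rough illustration of the complementary-beamforming idea (a toy sketch, not the paper's noise adaptation algorithm; spacing, frequency, and look direction below are arbitrary assumptions), a two-element delay-and-sum beamformer and its subtractive counterpart produce directivity patterns whose mainlobes and nulls are interchanged:

        import numpy as np

        c, d, f = 340.0, 0.05, 2000.0            # speed of sound [m/s], mic spacing [m], frequency [Hz]
        omega = 2 * np.pi * f
        theta0 = 0.0                             # look direction [rad]

        def steering(theta):
            """Steering vector of a 2-element array for a plane wave from angle theta."""
            tau = d * np.sin(theta) / c          # inter-element delay
            return np.array([1.0, np.exp(-1j * omega * tau)])

        w_sum = steering(theta0) / 2                          # delay-and-sum ("primary") beamformer
        w_diff = steering(theta0) * np.array([1, -1]) / 2     # complementary (subtractive) beamformer

        angles = np.deg2rad(np.arange(-90, 91))
        g_sum = np.array([abs(np.conj(w_sum) @ steering(t)) for t in angles])
        g_diff = np.array([abs(np.conj(w_diff) @ steering(t)) for t in angles])

        print(g_sum[90], g_diff[90])   # gain 1 vs. a null (0) at the look direction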

  • Single-Channel Multiple Regression for In-Car Speech Enhancement

    Weifeng LI  Katsunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA
    PAPER-Speech Enhancement
    Vol: E89-D No:3  Page(s): 1032-1039

    We address issues in improving hands-free speech enhancement and speech recognition performance in different car environments using a single distant microphone. This paper describes a new single-channel in-car speech enhancement method that estimates the log spectra of speech at a close-talking microphone by nonlinear regression on the log spectra of the noisy signal captured by a distant microphone and of the estimated noise. The proposed method provides significant overall quality improvements in our subjective evaluation of the regression-enhanced speech, and performs best on most objective measures. In isolated word recognition experiments conducted in 15 real car environments, the proposed adaptive nonlinear regression approach achieves average relative word error rate (WER) reductions of 50.8% and 13.1% over the original noisy speech and the ETSI advanced front-end (ETSI ES 202 050), respectively.
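
    The regression step might be sketched as follows (a minimal stand-in with synthetic data and an off-the-shelf MLP; the paper's actual regression model and features are not reproduced here):

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)
        n_frames, n_bins = 2000, 32
        clean = rng.normal(0.0, 1.0, (n_frames, n_bins))    # stand-in for close-talking log spectra
        noise = rng.normal(-1.0, 0.5, (n_frames, n_bins))   # stand-in for estimated noise log spectra
        noisy = np.logaddexp(clean, noise)                  # log of summed powers (additive-noise model)

        X = np.hstack([noisy, noise])                       # regression inputs
        model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
        model.fit(X, clean)                                 # targets: close-talking log spectra
        enhanced = model.predict(X[:5])                     # enhanced log spectra for new frames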

  • CENSREC-3: An Evaluation Framework for Japanese Speech Recognition in Real Car-Driving Environments

    Masakiyo FUJIMOTO  Kazuya TAKEDA  Satoshi NAKAMURA
    PAPER-Speech and Hearing
    Vol: E89-D No:11  Page(s): 2783-2793

    This paper introduces CENSREC-3, a common database, evaluation framework, and set of baseline recognition results for in-car speech recognition, developed as an outcome of the IPSJ-SIG SLP Noisy Speech Recognition Evaluation Working Group. CENSREC-3, a sequel to AURORA-2J, is designed as an evaluation framework for isolated word recognition in real car-driving environments. Speech data were collected using two microphones, a close-talking microphone and a hands-free microphone, under 16 carefully controlled driving conditions, i.e., combinations of three car speeds and six car conditions. CENSREC-3 provides six evaluation environments designed using the speech data collected under these conditions.

  • Acoustic Model Training Using Pseudo-Speaker Features Generated by MLLR Transformations for Robust Speaker-Independent Speech Recognition

    Arata ITOH  Sunao HARA  Norihide KITAOKA  Kazuya TAKEDA
    PAPER-Speech and Hearing
    Vol: E95-D No:10  Page(s): 2479-2485

    A novel acoustic model training method based on speech feature generation is proposed for robust speaker-independent speech recognition. Speaker adaptation methods have been widely used for decades, but all of them require adaptation data. Our proposed method instead aims to create speaker-independent acoustic models that cover not only known but also unknown speakers. We achieve this by adopting inverse maximum likelihood linear regression (MLLR) transformation-based feature generation and then training our models on these features. First, we obtain MLLR transformation matrices from a limited number of existing speakers. We then extract the bases of the MLLR transformation matrices using PCA. The distribution of the weight parameters that express the transformation matrices for the existing speakers is estimated. Next, we construct pseudo-speaker transformations by sampling weight parameters from this distribution, and apply the transformations to the normalized features of the existing speakers to generate features of the pseudo-speakers. Finally, we train the acoustic models on these features. Evaluation results show that acoustic models trained using our proposed method are robust to unknown speakers.
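
    A hedged sketch of the generation pipeline (synthetic stand-ins for the MLLR transforms, which in the paper come from adapting models to each existing speaker; dimensions are illustrative):

        import numpy as np

        rng = np.random.default_rng(0)
        n_spk, dim = 50, 13
        # Stand-in transforms [A | b], one flattened row per existing speaker.
        W = rng.normal(0.0, 0.1, (n_spk, dim * (dim + 1))) \
            + np.tile(np.hstack([np.eye(dim), np.zeros((dim, 1))]).ravel(), (n_spk, 1))

        mean = W.mean(axis=0)
        U, s, Vt = np.linalg.svd(W - mean, full_matrices=False)   # PCA via SVD
        k = 10
        weights = U[:, :k] * s[:k]                                # existing speakers' PCA weights
        mu, sigma = weights.mean(axis=0), weights.std(axis=0)     # per-axis Gaussian fit

        w_new = rng.normal(mu, sigma)                             # sample a pseudo-speaker
        T = (mean + w_new @ Vt[:k]).reshape(dim, dim + 1)         # rebuild [A | b]
        A, b = T[:, :dim], T[:, dim]

        x = rng.normal(size=dim)                                  # a normalized feature vector
        x_pseudo = A @ x + b                                      # pseudo-speaker feature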

  • Driver Identification Using Driving Behavior Signals

    Toshihiro WAKITA  Koji OZAWA  Chiyomi MIYAJIMA  Kei IGARASHI  Katunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA
    PAPER-Human-computer Interaction
    Vol: E89-D No:3  Page(s): 1188-1194

    In this paper, we propose a driver identification method based on driving behavior signals observed while the driver is following another vehicle. Driving behavior signals, such as accelerator pedal operation, brake pedal operation, vehicle velocity, and distance from the vehicle in front, were measured using a driving simulator. We compared the identification rates obtained with different identification models and found the Gaussian mixture model (GMM) to be superior to the Helly model and the optimal velocity model. For the GMM, the driver's operation signals also proved more informative than road environment signals and car behavior signals. The identification rate for thirty drivers in actual vehicle driving in a city area was 73%.
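
    A minimal sketch of the GMM-based identification step (synthetic stand-ins for the driving-behavior features): fit one GMM per driver, then identify by maximum average log-likelihood.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        n_drivers, dim = 3, 4        # e.g. accel, brake, velocity, following distance
        train = [rng.normal(loc=i, size=(500, dim)) for i in range(n_drivers)]

        models = [GaussianMixture(n_components=8, random_state=0).fit(x) for x in train]

        test = rng.normal(loc=1, size=(200, dim))            # unknown driver's signals
        scores = [m.score(test) for m in models]             # mean log-likelihood per model
        print("identified driver:", int(np.argmax(scores)))  # -> 1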

  • Acoustic Feature Transformation Based on Discriminant Analysis Preserving Local Structure for Speech Recognition

    Makoto SAKAI  Norihide KITAOKA  Kazuya TAKEDA
    PAPER-Speech and Hearing
    Vol: E93-D No:5  Page(s): 1244-1252

    To improve speech recognition performance, feature transformation based on discriminant analysis has been widely used to reduce the redundant dimensions of acoustic features. Linear discriminant analysis (LDA) and heteroscedastic discriminant analysis (HDA) are often used for this purpose, and a generalization of LDA and HDA, called power LDA (PLDA), has been proposed. However, these methods may result in an unexpected dimensionality reduction for multimodal data; it is important to preserve the local structure of such data when reducing its dimensionality. In this paper we introduce two methods, locality-preserving HDA and locality-preserving PLDA, to reduce the dimensionality of multimodal data appropriately. We also propose an approximation scheme that computes sub-optimal projections rapidly. Experimental results show that the locality-preserving methods yield better speech recognition performance than the traditional ones.
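
    For orientation, plain LDA can be written as a generalized eigenproblem on between- and within-class scatter; the paper's locality-preserving variants replace these scatter matrices with locally weighted versions (not reproduced in this sketch):

        import numpy as np
        from scipy.linalg import eigh

        def lda_projection(X, y, n_dims):
            mean = X.mean(axis=0)
            Sw = np.zeros((X.shape[1],) * 2)     # within-class scatter
            Sb = np.zeros_like(Sw)               # between-class scatter
            for c in np.unique(y):
                Xc = X[y == c]
                mc = Xc.mean(axis=0)
                Sw += (Xc - mc).T @ (Xc - mc)
                Sb += len(Xc) * np.outer(mc - mean, mc - mean)
            # generalized eigenvectors Sb v = lambda Sw v, largest eigenvalues first
            vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(len(Sw)))
            return vecs[:, np.argsort(vals)[::-1][:n_dims]]

        rng = np.random.default_rng(0)
        X = rng.normal(size=(300, 20))
        y = rng.integers(0, 3, 300)
        X_reduced = X @ lda_projection(X, y, n_dims=2)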

  • FOREWORD Open Access

    Kazuya TAKEDA
    FOREWORD
    Vol: E91-D No:3  Page(s): 391-392

  • An Acoustically Oriented Vocal-Tract Model

    Hani C. YEHIA  Kazuya TAKEDA  Fumitada ITAKURA
    PAPER-Speech Processing and Acoustics
    Vol: E79-D No:8  Page(s): 1198-1208

    The objective of this paper is to find a parametric representation of the vocal-tract log-area function that is directly and simply related to basic acoustic characteristics of the human vocal tract. The importance of this representation lies in the solution of the articulatory-to-acoustic inverse problem, where a simple mapping from the articulatory space onto the acoustic space can be very useful. The method is as follows. First, given a corpus of log-area functions, a parametric model is derived using a factor analysis technique. The articulatory space defined by the parametric model is then filled with approximately uniformly distributed points, and the corresponding first three formant frequencies are calculated. These formants define the acoustic space onto which the articulatory space maps. Next, an independent component analysis technique is used to determine acoustic and articulatory coordinate systems whose components are as independent as possible. Finally, using singular value decomposition, the acoustic and articulatory coordinate systems are rotated so that each of the first three components of the articulatory space has a major influence on one, and only one, component of the acoustic space. An example showing how the proposed model can be applied to the solution of the articulatory-to-acoustic inverse problem is given at the end of the paper.
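
    A loose sketch of the final rotation step (synthetic stand-ins for the articulatory and acoustic parameters): the SVD of their cross-covariance supplies rotations under which each articulatory component chiefly drives a single acoustic component.

        import numpy as np

        rng = np.random.default_rng(0)
        art = rng.normal(size=(1000, 3))                  # articulatory parameters (model weights)
        acu = art @ rng.normal(size=(3, 3)) + 0.05 * rng.normal(size=(1000, 3))  # formant-like params

        C = (art - art.mean(0)).T @ (acu - acu.mean(0)) / len(art)   # cross-covariance
        U, s, Vt = np.linalg.svd(C)

        art_rot = art @ U                                 # rotated articulatory coordinates
        acu_rot = acu @ Vt.T                              # rotated acoustic coordinates
        # The rotated cross-covariance is approximately diagonal:
        print(np.round((art_rot - art_rot.mean(0)).T @ (acu_rot - acu_rot.mean(0)) / len(art), 2))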

  • Gamma Modeling of Speech Power and Its On-Line Estimation for Statistical Speech Enhancement

    Tran Huy DAT  Kazuya TAKEDA  Fumitada ITAKURA
    PAPER-Speech Enhancement
    Vol: E89-D No:3  Page(s): 1040-1049

    This study shows the effectiveness of using the gamma distribution in the speech power domain as a more general prior distribution for model-based speech enhancement approaches. This model is a superset of the conventional Gaussian model of the complex spectrum and provides more accurate prior modeling when the optimal parameters are estimated. We develop a method to adapt the modeled distribution parameters from the actual noisy speech in a frame-by-frame manner. We then derive and investigate minimum mean square error (MMSE) and maximum a posteriori (MAP) estimators in different domains of the speech spectrum, i.e., magnitude, generalized power, and its logarithm, using the proposed gamma modeling. Finally, a comparative evaluation of the MAP and MMSE filters is conducted. Since the MMSE estimators tend to become more complicated under more general prior distributions, while the MAP estimators can be given in closed form, the latter are better suited to implementation. The adaptive estimation of the modeled distribution parameters provides more accurate prior modeling; this is the principal merit of the proposed method and the reason for its better performance. From the experiments, the MAP estimation is recommended due to its high efficiency and low complexity. Among the MAP-based systems, estimation in the log-magnitude domain is shown to be best for speech recognition, while estimation in the power domain is superior for noise reduction.
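
    One simple ingredient of such a scheme, sketched below with synthetic data, is fitting the gamma parameters by moment matching; the paper's on-line frame-by-frame estimator is more elaborate.

        import numpy as np

        def gamma_moment_fit(power):
            """Method-of-moments gamma fit: mean = k*theta, var = k*theta**2."""
            m, v = power.mean(), power.var()
            return m * m / v, v / m          # shape k, scale theta

        rng = np.random.default_rng(0)
        power = rng.gamma(shape=0.6, scale=2.0, size=4096)   # synthetic speech power samples
        k, theta = gamma_moment_fit(power)
        print(f"estimated shape={k:.2f}, scale={theta:.2f}")  # close to (0.6, 2.0)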

  • Multiple Regression of Log Spectra for In-Car Speech Recognition Using Multiple Distributed Microphones

    Weifeng LI  Tetsuya SHINDE  Hiroshi FUJIMURA  Chiyomi MIYAJIMA  Takanori NISHINO  Katunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA
    PAPER-Feature Extraction and Acoustic Modeling
    Vol: E88-D No:3  Page(s): 384-390

    This paper describes a new multi-channel method for noisy speech recognition that estimates the log spectrum of speech at a close-talking microphone based on multiple regression of the log spectra (MRLS) of noisy signals captured by distributed microphones. The advantages of the proposed method are as follows: 1) it requires neither a sensitive geometric layout, calibration of the sensors, nor additional pre-processing for tracking the speech source; 2) it requires very little computation; and 3) the regression weights can be statistically optimized over the given training data. Once the optimal regression weights are obtained by regression learning, they can be used to generate the estimated log spectrum in the recognition phase, where the close-talking speech is no longer required. The performance of the proposed method is illustrated by speech recognition of real in-car dialogue data. In comparison with the nearest distant microphone and a multi-microphone adaptive beamformer, the proposed approach obtains relative word error rate (WER) reductions of 9.8% and 3.6%, respectively.
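
    The regression idea might be sketched per frequency bin as ordinary least squares over the distributed channels (synthetic data; shapes are assumptions):

        import numpy as np

        rng = np.random.default_rng(0)
        n_frames, n_mics, n_bins = 1000, 5, 32
        distant = rng.normal(size=(n_frames, n_mics, n_bins))   # distant-mic log spectra
        close = distant.mean(axis=1) + rng.normal(scale=0.1, size=(n_frames, n_bins))

        weights = np.empty((n_bins, n_mics + 1))
        for b in range(n_bins):
            X = np.hstack([distant[:, :, b], np.ones((n_frames, 1))])  # add bias term
            weights[b], *_ = np.linalg.lstsq(X, close[:, b], rcond=None)

        # Recognition phase: estimate the close-talking log spectrum without it.
        frame = np.hstack([distant[0].T, np.ones((n_bins, 1))])   # (n_bins, n_mics+1)
        estimated = np.sum(frame * weights, axis=1)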

  • AURORA-2J: An Evaluation Framework for Japanese Noisy Speech Recognition

    Satoshi NAKAMURA  Kazuya TAKEDA  Kazumasa YAMAMOTO  Takeshi YAMADA  Shingo KUROIWA  Norihide KITAOKA  Takanobu NISHIURA  Akira SASOU  Mitsunori MIZUMACHI  Chiyomi MIYAJIMA  Masakiyo FUJIMOTO  Toshiki ENDO
    PAPER-Speech Corpora and Related Topics
    Vol: E88-D No:3  Page(s): 535-544

    This paper introduces AURORA-2J, an evaluation framework for Japanese noisy speech recognition. Speech recognition systems must still be improved to be robust to noisy environments, and this improvement requires standard evaluation corpora and assessment technologies. Recently, the Aurora 2, 3 and 4 corpora and their evaluation scenarios have had a significant impact on noisy speech recognition research. AURORA-2J is a Japanese connected-digit corpus whose evaluation scripts are designed in the same way as those of Aurora 2, with the help of the European Telecommunications Standards Institute (ETSI) AURORA group. This paper describes the data collection, the baseline scripts, and baseline performance. We also propose a new performance analysis method that considers differences in recognition performance among speakers. This method is based on word accuracy per speaker and reveals the degree of individual difference in recognition performance. We further propose a categorization of the modifications applied to the original HTK baseline system, which helps in comparing systems and in identifying the technologies that best improve performance within each category.
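
    The proposed per-speaker analysis reduces to computing word accuracy speaker by speaker; a trivial sketch with a hypothetical result format:

        # (speaker_id, n_ref_words, n_substitutions, n_deletions, n_insertions)
        results = [("spk01", 200, 10, 4, 2), ("spk02", 180, 30, 9, 6)]

        for spk, n, s, d, i in results:
            acc = 100.0 * (n - s - d - i) / n     # standard word accuracy
            print(f"{spk}: {acc:.1f}%")
        # Pooled accuracy can mask a weak speaker; the per-speaker view reveals it.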

  • Speech Recognition Using Finger Tapping Timings

    Hiromitsu BAN  Chiyomi MIYAJIMA  Katsunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA
    LETTER-Speech and Hearing
    Vol: E88-D No:3  Page(s): 667-670

    Behavioral synchronization between speech and finger tapping provides a novel approach to improving speech recognition accuracy. We combine a sequence of finger tapping timings recorded alongside an utterance using two distinct methods: in the first method, HMM state transition probabilities at word boundaries are controlled by the timing of the finger tapping; in the second, the probability (relative frequency) of the finger tapping is used as a feature and combined with MFCCs in an HMM recognition system. We evaluate these methods through connected digit recognition under different noise conditions (AURORA-2J). Leveraging the synchrony between speech and finger tapping provides a 46% relative improvement in connected digit recognition experiments.
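
    The second method might be sketched as follows (the representations are assumptions): convert tap timings into a per-frame tapping-probability stream and append it to the MFCC vectors as an extra feature dimension.

        import numpy as np

        frame_shift = 0.010                          # 10 ms frames
        n_frames = 300
        mfcc = np.random.default_rng(0).normal(size=(n_frames, 13))

        taps = np.array([0.52, 1.10, 1.71, 2.33])    # tap times in seconds
        centers = np.arange(n_frames) * frame_shift
        prob = np.zeros(n_frames)
        for t in taps:                               # Gaussian window around each tap
            prob += np.exp(-0.5 * ((centers - t) / 0.05) ** 2)
        prob /= prob.max()                           # normalize to [0, 1]

        features = np.hstack([mfcc, prob[:, None]])  # MFCC + tapping feature, (n_frames, 14)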

  • Speech Enhancement Using Nonlinear Microphone Array Based on Complementary Beamforming

    Hiroshi SARUWATARI  Shoji KAJITA  Kazuya TAKEDA  Fumitada ITAKURA
    PAPER
    Vol: E82-A No:8  Page(s): 1501-1510

    This paper describes a spatial spectral subtraction method that uses a complementary beamforming microphone array to enhance noisy speech signals for speech recognition. Complementary beamforming uses two beamformers designed to have directivity patterns complementary to each other. It is shown that nonlinear subtraction processing with complementary beamforming amounts to a kind of spectral subtraction that requires no speech pause detection. In addition, an optimization algorithm for the directivity pattern is described. To evaluate the effectiveness of the approach, speech enhancement and speech recognition experiments are performed in computer simulations under both stationary and nonstationary noise conditions. In comparison with an optimized conventional delay-and-sum (DS) array, it is shown that: (1) the proposed array improves the signal-to-noise ratio (SNR) of degraded speech by about 2 dB and achieves word recognition rates more than 20% higher when white Gaussian noise at an input SNR of -5 or -10 dB is present, and (2) the proposed array achieves word recognition rates more than 5% higher under nonstationary noise conditions. These improvements are also shown to be equal or superior to those of the conventional spectral subtraction method cascaded with the DS array.

  • CIAIR In-Car Speech Corpus--Influence of Driving Status--

    Nobuo KAWAGUCHI  Shigeki MATSUBARA  Kazuya TAKEDA  Fumitada ITAKURA
    LETTER
    Vol: E88-D No:3  Page(s): 578-582

    CIAIR, Nagoya University, has been compiling an in-car speech database since 1999. This paper discusses the basic information contained in this database and presents an analysis of the effects of driving status based on it. We have developed a system called the Data Collection Vehicle (DCV), which supports synchronous recording of multi-channel audio from 12 microphones placed throughout the vehicle, multi-channel video from three cameras, and vehicle-related data. In the compilation process, each subject had conversations with three types of dialog partner: a human, a "Wizard of Oz" system, and a spoken dialog system. Vehicle information such as speed, engine RPM, accelerator/brake-pedal pressure, and steering-wheel motion was also recorded. In this paper, we report on the effect that driving status has on phenomena specific to spoken language.

  • Investigation of DNN-Based Audio-Visual Speech Recognition

    Satoshi TAMURA  Hiroshi NINOMIYA  Norihide KITAOKA  Shin OSUGA  Yurie IRIBE  Kazuya TAKEDA  Satoru HAYAMIZU
    PAPER-Acoustic modeling
    Publicized: 2016/07/19  Vol: E99-D No:10  Page(s): 2444-2451

    Audio-Visual Speech Recognition (AVSR) is one technique for enhancing the robustness of speech recognizers in noisy or real environments. Meanwhile, Deep Neural Networks (DNNs) have recently attracted much attention from researchers in the speech recognition field because they can drastically improve recognition performance. There are two ways to employ DNN techniques for speech recognition: in the hybrid approach, the emission probability of each Hidden Markov Model (HMM) state is computed using a DNN, while in the tandem approach, a DNN is incorporated into the feature extraction scheme. In this paper, we investigate and compare several DNN-based AVSR methods, mainly to clarify how audio and visual modalities should be incorporated using DNNs. We carried out recognition experiments on the CENSREC-1-AV corpus and discuss the results to identify the best DNN-based AVSR modeling. We find that a tandem-based method using audio and visual Deep Bottle-Neck Features (DBNFs) with multi-stream HMMs is the most suitable, followed by a hybrid approach and another tandem scheme using audio-visual DBNFs.
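
    A toy tandem-style bottleneck extractor in PyTorch (the dimensions and state targets are illustrative assumptions, not the paper's configuration); audio and visual streams would each get such a network:

        import torch
        import torch.nn as nn

        # Assumed: 39-dim audio features, a 40-dim bottleneck, 100 HMM-state targets.
        encoder = nn.Sequential(
            nn.Linear(39, 256), nn.ReLU(),
            nn.Linear(256, 40), nn.ReLU(),    # narrow bottleneck layer
        )
        classifier = nn.Sequential(encoder, nn.Linear(40, 100))

        x = torch.randn(8, 39)                # a batch of feature frames
        logits = classifier(x)                # used when training against state labels
        dbnf = encoder(x)                     # bottleneck features fed to the HMM system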

  • Multichannel Speech Enhancement Based on Generalized Gamma Prior Distribution with Its Online Adaptive Estimation

    Tran Huy DAT  Kazuya TAKEDA  Fumitada ITAKURA
    PAPER-Speech Enhancement
    Vol: E91-D No:3  Page(s): 439-447

    We present a multichannel speech enhancement method based on MAP speech spectral magnitude estimation using a generalized gamma model of the speech prior distribution, where the model parameters are adapted from the actual noisy speech in a frame-by-frame manner. Using a more general prior distribution with online adaptive estimation is shown to be effective for speech spectral estimation in noisy environments. Furthermore, multi-channel information in the form of cross-channel statistics is shown to be useful for better adapting the prior distribution parameters to the actual observation, yielding better performance of the speech enhancement algorithm. We tested the proposed algorithm on an in-car speech database and obtained significant improvements in speech recognition performance, particularly under nonstationary noise conditions such as music, air-conditioner noise, and an open window.

  • Selective Listening Point Audio Based on Blind Signal Separation and Stereophonic Technology

    Kenta NIWA  Takanori NISHINO  Kazuya TAKEDA
    PAPER-Speech and Hearing
    Vol: E92-D No:3  Page(s): 469-476

    A sound field reproduction method is proposed that uses blind source separation and head-related transfer functions. In the proposed system, multichannel acoustic signals captured at distant microphones are decomposed into a set of location/signal pairs of virtual sound sources based on frequency-domain independent component analysis. After the locations and signals of the virtual sources are estimated, the spatial sound at a selected listening point is constructed by convolving controlled acoustic transfer functions with each signal. In experiments, a sound field produced by six sound sources is captured using 48 distant microphones and decomposed into sets of virtual sound sources. Subjective evaluation shows no significant difference between natural and reconstructed sound when six virtual sources are used, confirming the effectiveness of the decomposition algorithm and the virtual source representation.
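
    As a simplified stand-in for the separation stage (instantaneous ICA on a synthetic two-channel mixture; the paper applies ICA per frequency bin to STFT frames and must also resolve permutations), the recovered signals are what would then be re-spatialized through head-related transfer functions:

        import numpy as np
        from sklearn.decomposition import FastICA

        rng = np.random.default_rng(0)
        t = np.linspace(0, 1, 8000)
        sources = np.c_[np.sin(2 * np.pi * 440 * t),          # tone
                        np.sign(np.sin(2 * np.pi * 3 * t))]   # square wave
        mixing = rng.normal(size=(2, 2))
        mics = sources @ mixing.T                             # observed mixtures

        ica = FastICA(n_components=2, random_state=0)
        estimated = ica.fit_transform(mics)                   # estimated virtual-source signals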

  • Noise Robust Speech Recognition Using Subband-Crosscorrelation Analysis

    Shoji KAJITA  Kazuya TAKEDA  Fumitada ITAKURA
    PAPER-Speech Processing and Acoustics
    Vol: E81-D No:10  Page(s): 1079-1086

    This paper describes subband-crosscorrelation analysis (SBXCOR), which uses two input channel signals. SBXCOR extends subband-autocorrelation analysis (SBCOR), which extracts periodicities associated with the inverse of the center frequencies present in speech signals. In addition, to extract more periodicity information, multi-delay weighting (MDW) processing is applied to SBXCOR. In experiments, the noise robustness of SBXCOR is evaluated using a DTW word recognizer under (1) a simulated acoustic condition with white noise and (2) a real acoustic condition in a soundproof room with human speech-like noise. Under the simulated acoustic condition, SBXCOR is shown to be more robust than conventional one-channel SBCOR, but less robust than SBCOR extracted from the two-channel-summed signal. Furthermore, applying MDW processing improved the performance of SBXCOR by about 2% at an SNR of 0 dB. The resulting performance of SBXCOR with MDW processing was much better than that of the smoothed group delay spectrum (SGDS) and mel-filterbank cepstral coefficients (MFCC) below an SNR of 10 dB. The results under the real acoustic condition were almost the same as those under the simulated condition.
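
    A rough SBXCOR-flavored sketch (a simplified band-pass filterbank; the actual analysis windows and normalization differ): band-pass both channels around each center frequency and take the cross-correlation at a lag of one period, 1/fc.

        import numpy as np
        from scipy.signal import butter, lfilter

        fs = 16000
        rng = np.random.default_rng(0)
        x1 = rng.normal(size=fs)                          # channel 1 (stand-in signal)
        x2 = np.roll(x1, 3) + 0.1 * rng.normal(size=fs)   # channel 2: delayed + noisy copy

        def sbxcor(x1, x2, centers):
            out = []
            for fc in centers:
                b, a = butter(2, [0.8 * fc / (fs / 2), 1.2 * fc / (fs / 2)], "bandpass")
                y1, y2 = lfilter(b, a, x1), lfilter(b, a, x2)
                lag = int(round(fs / fc))                 # lag = one period of the center frequency
                r = np.sum(y1[:-lag] * y2[lag:]) / (np.std(y1) * np.std(y2) * len(y1))
                out.append(r)
            return np.array(out)

        print(sbxcor(x1, x2, centers=[500, 1000, 2000]))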

  • Direction of Arrival Estimation Using Nonlinear Microphone Array

    Hidekazu KAMIYANAGIDA  Hiroshi SARUWATARI  Kazuya TAKEDA  Fumitada ITAKURA  Kiyohiro SHIKANO
    PAPER
    Vol: E84-A No:4  Page(s): 999-1010

    This paper describes a new method for estimating the direction of arrival (DOA) using a nonlinear microphone array system based on complementary beamforming. Complementary beamforming uses two beamformers designed to have directivity patterns complementary to each other. Since the resultant directivity pattern of this system is proportional to the product of the two directivity patterns, the proposed method can estimate the DOAs of 2(K-1) sound sources with a K-element microphone array. DOA-estimation experiments are performed using both computer simulations and actual devices in real acoustic environments. The results clarify that DOA estimation for two sound sources can be accomplished by the proposed method with only two microphones. Moreover, comparing the resolution of the proposed method with that of the conventional minimum variance method shows that the proposed method is superior under all reverberant conditions.

  • Adaptive Nonlinear Regression Using Multiple Distributed Microphones for In-Car Speech Recognition

    Weifeng LI  Chiyomi MIYAJIMA  Takanori NISHINO  Katsunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA
    PAPER-Speech Enhancement
    Vol: E88-A No:7  Page(s): 1716-1723

    In this paper, we address issues in improving hands-free speech recognition performance in different car environments using multiple spatially distributed microphones. In previous work, we proposed multiple linear regression of the log spectra (MRLS) for estimating the log spectra of speech at a close-talking microphone. In this paper, the concept is extended to nonlinear regressions, and regressions in the cepstrum domain are also investigated. An effective algorithm is developed to automatically adapt the regression weights to different noise environments. Compared to the nearest distant microphone and an adaptive beamformer (generalized sidelobe canceller), the proposed adaptive nonlinear regression approach achieves average relative word error rate (WER) reductions of 58.5% and 10.3%, respectively, for isolated word recognition in 15 real car environments.
