Author Search Results

[Author] Kazuya TAKEDA (29 hits)

Showing 1-20 of 29 hits

  • Multichannel Speech Enhancement Based on Generalized Gamma Prior Distribution with Its Online Adaptive Estimation

    Tran HUY DAT  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Speech Enhancement

    Vol: E91-D No:3  Page(s): 439-447

    We present a multichannel speech enhancement method based on MAP speech spectral magnitude estimation using a generalized gamma model of the speech prior distribution, where the model parameters are adapted from the actual noisy speech in a frame-by-frame manner. The use of a more general prior distribution with online adaptive estimation is shown to be effective for speech spectral estimation in noisy environments. Furthermore, multi-channel information in the form of cross-channel statistics is shown to be useful for adapting the prior distribution parameters to the actual observation, resulting in better performance of the speech enhancement algorithm. We tested the proposed algorithm on an in-car speech database and obtained significant improvements in speech recognition performance, particularly under non-stationary noise conditions such as music, air-conditioner noise, and an open window.
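
    As a rough illustration of the frame-by-frame prior adaptation described above, the following sketch tracks running moments of the noisy spectral magnitudes and moment-matches a plain gamma prior (a special case of the generalized gamma). The smoothing constant and function name are illustrative assumptions, not the authors' estimator.

    ```python
    import numpy as np

    def adapt_gamma_prior(mag_frames, alpha=0.9):
        """Frame-by-frame moment matching of a gamma prior per frequency bin.

        mag_frames: (T, F) spectral magnitudes of noisy speech.
        Returns per-bin (shape, scale) after the last frame.
        """
        T, F = mag_frames.shape
        m1 = np.full(F, 1e-3)  # running first moment E[x]
        m2 = np.full(F, 1e-3)  # running second moment E[x^2]
        for t in range(T):
            x = mag_frames[t]
            m1 = alpha * m1 + (1 - alpha) * x
            m2 = alpha * m2 + (1 - alpha) * x ** 2
        var = np.maximum(m2 - m1 ** 2, 1e-12)
        return m1 ** 2 / var, var / m1  # gamma shape k, scale theta
    ```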

  • Selective Listening Point Audio Based on Blind Signal Separation and Stereophonic Technology

    Kenta NIWA  Takanori NISHINO  Kazuya TAKEDA  

     
    PAPER-Speech and Hearing

    Vol: E92-D No:3  Page(s): 469-476

    A sound field reproduction method is proposed that uses blind source separation and head-related transfer functions. In the proposed system, multichannel acoustic signals captured at distant microphones are decomposed into a set of location/signal pairs of virtual sound sources based on frequency-domain independent component analysis. After the locations and signals of the virtual sources are estimated, the spatial sound at the selected listening point is constructed by convolving the corresponding acoustic transfer functions with each signal. In experiments, a sound field produced by six sound sources was captured using 48 distant microphones and decomposed into sets of virtual sound sources. Since subjective evaluation shows no significant difference between natural and reconstructed sound when six virtual sources are used, the effectiveness of the decomposition algorithm and of the virtual source representation is confirmed.
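
    The rendering step above can be sketched as follows: each virtual source signal recovered by BSS is convolved with a pair of head-related impulse responses chosen for the selected listening point, and the results are summed into a binaural output. The function name and array shapes are assumptions.

    ```python
    import numpy as np
    from scipy.signal import fftconvolve

    def render_listening_point(sources, hrirs):
        """sources: list of 1-D virtual-source signals (from BSS);
        hrirs: list of (2, L) head-related impulse responses, one pair
        per virtual source, selected for the listening point."""
        n = max(len(s) + h.shape[1] - 1 for s, h in zip(sources, hrirs))
        out = np.zeros((2, n))
        for s, h in zip(sources, hrirs):
            for ch in range(2):          # left/right ears
                y = fftconvolve(s, h[ch])
                out[ch, :len(y)] += y
        return out
    ```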

  • Noise Robust Speech Recognition Using Subband-Crosscorrelation Analysis

    Shoji KAJITA  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Speech Processing and Acoustics

    Vol: E81-D No:10  Page(s): 1079-1086

    This paper describes subband-crosscorrelation analysis (SBXCOR), a signal processing technique that uses two input channel signals. SBXCOR extends subband-autocorrelation analysis (SBCOR), which extracts periodicities associated with the inverse of the center frequencies present in speech signals. In addition, to extract more periodicity information, multi-delay weighting (MDW) processing is applied to SBXCOR. In experiments, the noise robustness of SBXCOR is evaluated using a DTW word recognizer under (1) a simulated acoustic condition with white noise and (2) a real acoustic condition in a soundproof room with human speech-like noise. Under the simulated acoustic condition, SBXCOR is shown to be more robust than conventional one-channel SBCOR, but less robust than SBCOR extracted from the two-channel-summed signal. Furthermore, applying MDW processing improved the performance of SBXCOR by about 2% at an SNR of 0 dB. The resulting performance of SBXCOR with MDW processing was much better than that of the smoothed group delay spectrum (SGDS) and mel-filterbank cepstral coefficients (MFCC) below an SNR of 10 dB. The results under the real acoustic condition were almost the same as those under the simulated condition.
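
    A minimal sketch of the subband cross-correlation idea, assuming a Butterworth filter bank and a single lag of one period per center frequency; the paper's multi-delay weighting, which combines several lags, is omitted here.

    ```python
    import numpy as np
    from scipy.signal import butter, lfilter

    def sbxcor(x1, x2, fs, center_freqs, bw_ratio=0.2):
        """Normalized cross-correlation of two channels per subband,
        evaluated at a lag of one period (1/fc) of the center frequency."""
        feats = []
        for fc in center_freqs:
            lo, hi = fc * (1 - bw_ratio), fc * (1 + bw_ratio)
            b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            s1, s2 = lfilter(b, a, x1), lfilter(b, a, x2)
            lag = int(round(fs / fc))            # one period in samples
            num = np.dot(s1[:-lag], s2[lag:])
            den = np.sqrt(np.dot(s1, s1) * np.dot(s2, s2)) + 1e-12
            feats.append(num / den)
        return np.array(feats)
    ```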

  • Direction of Arrival Estimation Using Nonlinear Microphone Array

    Hidekazu KAMIYANAGIDA  Hiroshi SARUWATARI  Kazuya TAKEDA  Fumitada ITAKURA  Kiyohiro SHIKANO  

     
    PAPER

    Vol: E84-A No:4  Page(s): 999-1010

    This paper describes a new method for estimating the direction of arrival (DOA) using a nonlinear microphone array system based on complementary beamforming. Complementary beamforming uses two beamformers designed to have directivity patterns that are complementary with respect to each other. Since the resultant directivity pattern of the system is proportional to the product of these two patterns, the proposed method can estimate the DOAs of 2(K-1) sound sources with a K-element microphone array. DOA-estimation experiments are performed using both computer simulations and actual devices in real acoustic environments. The results show that the proposed method can estimate the DOAs of two sound sources with only two microphones. Comparing the resolution of the proposed method with that of the conventional minimum variance method also shows that the proposed method performs better under all reverberant conditions tested.
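
    For reference, the conventional minimum variance (Capon) method used as the comparison baseline can be sketched as follows; the linear array geometry, diagonal loading, and function signature are assumptions, and this is not the proposed nonlinear array itself.

    ```python
    import numpy as np

    def capon_spectrum(X, mic_pos, f_bin, angles_deg, c=343.0):
        """Minimum-variance (Capon) spatial spectrum for one STFT bin.

        X: (mics, frames) complex STFT values at frequency f_bin [Hz];
        mic_pos: 1-D array of microphone x-positions [m] (linear array).
        """
        M, T = X.shape
        R = X @ X.conj().T / T
        R += 1e-3 * np.trace(R).real / M * np.eye(M)   # diagonal loading
        Rinv = np.linalg.inv(R)
        p = []
        for th in np.deg2rad(angles_deg):
            delays = mic_pos * np.sin(th) / c           # plane-wave delays
            a = np.exp(-2j * np.pi * f_bin * delays)    # steering vector
            p.append(1.0 / np.real(a.conj() @ Rinv @ a))
        return np.array(p)   # peaks indicate candidate DOAs
    ```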

  • Adaptive Nonlinear Regression Using Multiple Distributed Microphones for In-Car Speech Recognition

    Weifeng LI  Chiyomi MIYAJIMA  Takanori NISHINO  Katsunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Speech Enhancement

    Vol: E88-A No:7  Page(s): 1716-1723

    In this paper, we address issues in improving hands-free speech recognition performance in different car environments using multiple spatially distributed microphones. In previous work, we proposed multiple linear regression of the log spectra (MRLS) for estimating the log spectra of speech at a close-talking microphone. In this paper, the concept is extended to nonlinear regression, and regression in the cepstrum domain is also investigated. An effective algorithm is developed to adapt the regression weights automatically to different noise environments. Compared to the nearest distant microphone and an adaptive beamformer (generalized sidelobe canceller), the proposed adaptive nonlinear regression approach achieves average relative word error rate (WER) reductions of 58.5% and 10.3%, respectively, for isolated word recognition in 15 real car environments.
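
    For context, a minimal sketch of the earlier linear MRLS step on which the nonlinear extension builds: per-bin least squares from distant-microphone log spectra to the close-talking microphone. Names and array shapes are assumptions.

    ```python
    import numpy as np

    def fit_mrls(distant_logspec, close_logspec):
        """distant_logspec: (T, M) log spectra of one frequency bin from
        M distant mics; close_logspec: (T,) target log spectrum at the
        close-talking mic. Returns (M+1,) weights including a bias."""
        T, M = distant_logspec.shape
        A = np.hstack([distant_logspec, np.ones((T, 1))])
        w, *_ = np.linalg.lstsq(A, close_logspec, rcond=None)
        return w

    def apply_mrls(distant_logspec, w):
        """Estimate the close-talking log spectrum from distant mics."""
        A = np.hstack([distant_logspec, np.ones((len(distant_logspec), 1))])
        return A @ w
    ```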

  • A Single-Dimensional Interface for Arranging Multiple Audio Sources in Three-Dimensional Space

    Kento OHTANI  Kenta NIWA  Kazuya TAKEDA  

     
    PAPER-Music Information Processing

    Publicized: 2017/06/26
    Vol: E100-D No:10  Page(s): 2635-2643

    A single-dimensional interface that enables users to obtain diverse localizations of audio sources is proposed. Many conventional interfaces for arranging audio sources expose multiple arrangement parameters that control the positions of the sources, but it is difficult for users who are unfamiliar with these systems to optimize the parameters, since the number of possible settings is huge. We propose a simple, single-dimensional interface for adjusting arrangement parameters, allowing users to sample several diverse audio source arrangements and easily find their preferred auditory localizations. To select subsets of arrangement parameters from all of the possible choices, auditory-localization space vectors (ASVs) are defined to represent the auditory localization produced by each arrangement parameter. By selecting subsets of ASVs that are approximately orthogonal, we can choose arrangement parameters that produce diverse auditory localizations. Experimental evaluations were conducted using music composed of three audio sources, and subjective evaluations confirmed that novice users can obtain diverse localizations using the proposed interface.
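
    The "approximately orthogonal" subset selection can be illustrated with a simple greedy max-min rule that repeatedly adds the ASV least similar to those already chosen; this is a plausible sketch, not necessarily the paper's exact procedure.

    ```python
    import numpy as np

    def select_diverse(asvs, k):
        """Greedily pick k approximately orthogonal rows of `asvs`.

        asvs: (N, D) auditory-localization space vectors.
        Returns the indices of the k selected vectors.
        """
        V = asvs / np.linalg.norm(asvs, axis=1, keepdims=True)
        chosen = [0]                                  # arbitrary seed
        while len(chosen) < k:
            # worst-case (largest) similarity to anything already chosen
            sim = np.abs(V @ V[chosen].T).max(axis=1)
            sim[chosen] = np.inf                      # exclude picked ones
            chosen.append(int(np.argmin(sim)))
        return chosen
    ```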

  • Blind Source Separation Using Dodecahedral Microphone Array under Reverberant Conditions

    Motoki OGASAWARA  Takanori NISHINO  Kazuya TAKEDA  

     
    PAPER-Engineering Acoustics

    Vol: E94-A No:3  Page(s): 897-906

    The separation and localization of sound source signals are important techniques for many applications, such as highly realistic communication and speech recognition systems. Such systems are expected to work without prior information such as the number of sound sources or the environmental conditions. In this paper, we develop a dodecahedral microphone array and propose a novel separation method for the device. The method draws on human sound localization cues and exploits the acoustical characteristics produced by the shape of the dodecahedral array. Moreover, it includes a method for estimating the number of sound sources that operates without prior information. Sound source separation performance was evaluated under simulated and actual reverberant conditions, and the results were compared with a conventional method. The experimental results show that our method outperforms the conventional one.

  • Effective Frame Selection for Blind Source Separation Based on Frequency Domain Independent Component Analysis

    Yusuke MIZUNO  Kazunobu KONDO  Takanori NISHINO  Norihide KITAOKA  Kazuya TAKEDA  

     
    PAPER-Engineering Acoustics

    Vol: E97-A No:3  Page(s): 784-791

    Blind source separation is a technique that can separate sound sources without information such as the source locations, the number of sources, or the utterance content. Multi-channel source separation using many microphones separates signals with high accuracy even when there are many sources, but such methods have extremely high computational complexity, which must be reduced. In this paper, we propose a method for reducing the computational complexity of blind source separation based on frequency-domain independent component analysis (FDICA) and examine which temporal data are effective for source separation. A frame containing many sound sources is effective for FDICA source separation; we assume that a frame with low kurtosis contains many sound sources and preferentially select such frames. In the proposed method, the log power spectrum and the kurtosis of the magnitude distribution of the observed data are used as selection criteria, and source separation experiments are conducted using speech signals from twelve speakers. We evaluated separation performance by the signal-to-interference ratio (SIR) improvement score, which was 24.3 dB when all frames were used and 23.3 dB when the 300 frames selected by our criteria were used. These results confirm that the proposed selection criteria based on kurtosis and magnitude are effective. Furthermore, the computational complexity is reduced significantly because it is proportional to the number of selected frames.
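
    The kurtosis-based criterion reduces to a few lines; a sketch assuming frame-wise kurtosis of the magnitude distribution (the log-power-spectrum criterion would be analogous).

    ```python
    import numpy as np
    from scipy.stats import kurtosis

    def select_frames(spec_mag, n_frames=300):
        """Pick the frames whose magnitude distributions have the lowest
        kurtosis (assumed to contain the most simultaneous sources).

        spec_mag: (T, F) magnitude spectrogram of the observed mixture.
        """
        k = kurtosis(spec_mag, axis=1)       # kurtosis across bins per frame
        return np.argsort(k)[:n_frames]      # indices of selected frames
    ```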

  • Evaluation of Combinational Use of Discriminant Analysis-Based Acoustic Feature Transformation and Discriminative Training

    Makoto SAKAI  Norihide KITAOKA  Yuya HATTORI  Seiichi NAKAGAWA  Kazuya TAKEDA  

     
    LETTER-Speech and Hearing

    Vol: E93-D No:2  Page(s): 395-398

    To improve speech recognition performance, acoustic feature transformation based on discriminant analysis has been widely used. Discriminative training of HMMs has also been used for the same purpose. In this letter we investigate the effectiveness of these two techniques and of their combination. We also investigate robustness under matched and mismatched noise conditions between the training and evaluation environments.

  • Acoustic Feature Transformation Combining Average and Maximum Classification Error Minimization Criteria

    Makoto SAKAI  Norihide KITAOKA  Kazuya TAKEDA  

     
    LETTER-Speech and Hearing

    Vol: E93-D No:7  Page(s): 2005-2008

    Acoustic feature transformation is widely used to reduce dimensionality and improve speech recognition performance. In this letter we focus on dimensionality reduction methods that minimize the average classification error. Unfortunately, minimizing the average classification error may cause considerable overlap between the distributions of some classes. To mitigate the risk of such overlaps, we propose a dimensionality reduction method that minimizes the maximum classification error. We also propose two interpolated methods that combine the average and maximum classification errors. Experimental results show that the proposed methods improve speech recognition performance.
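
    One plausible form of such an interpolated criterion, assuming pairwise classification-error estimates are available; the exact weighting in the letter may differ.

    ```python
    import numpy as np

    def interpolated_criterion(pair_errors, alpha):
        """Blend of average and maximum pairwise classification error:
        J = (1 - alpha) * mean + alpha * max. alpha = 0 recovers the
        average-error criterion, alpha = 1 the maximum-error one."""
        e = np.asarray(pair_errors)
        return (1 - alpha) * e.mean() + alpha * e.max()
    ```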

  • Error Analysis of Field Trial Results of a Spoken Dialogue System for Telecommunications Applications

    Shingo KUROIWA  Kazuya TAKEDA  Masaki NAITO  Naomi INOUE  Seiichi YAMAMOTO  

     
    PAPER

    Vol: E78-D No:6  Page(s): 636-641

    We carried out a one-year field trial of a voice-activated automatic telephone exchange service at KDD Laboratories, which has about 200 branch phones. The system has DSP-based continuous speech recognition hardware that can process incoming calls in real time with a vocabulary of 300 words. Speaker-independent recognition accuracy was found to be 92.5% for speech read from written text under laboratory conditions. In this paper, we describe the performance of the system during the field trial. Apart from recognition errors, there was about 20% error due to out-of-vocabulary input and incorrect detection of speech endpoints, neither of which had been allowed for in the laboratory experiments. We also found that recognition accuracy for actual speech was about 18% lower than for speech read from text, even when there were no out-of-vocabulary words. We examine error variations across individual data to pinpoint the causes of incorrect recognition. Experiments on the collected data showed that the pause model, the filled-pause grammar, and differences in channel frequency response seriously affected recognition accuracy. With the help of simple techniques to overcome these problems, we finally obtained a recognition accuracy of 88.7% on real data.

  • Construction and Evaluation of a Large In-Car Speech Corpus

    Kazuya TAKEDA  Hiroshi FUJIMURA  Katsunobu ITOU  Nobuo KAWAGUCHI  Shigeki MATSUBARA  Fumitada ITAKURA  

     
    PAPER-Speech Corpora and Related Topics

    Vol: E88-D No:3  Page(s): 553-561

    In this paper, we discuss the construction of a large in-car spoken dialogue corpus and the results of its analysis. We have developed a system specially built into a Data Collection Vehicle (DCV) that supports synchronous recording of multichannel audio data from 16 microphones placed in flexible positions, multichannel video data from 3 cameras, and vehicle-related data. Multimedia data were collected during three sessions of spoken dialogue with different modes of navigation, over an approximately 60-minute drive by each of 800 subjects. We characterize the collected dialogues across the three sessions; some characteristics, such as sentence complexity and SNR, differ significantly among the sessions. Linear regression analysis also clarifies the relative importance of various corpus characteristics.

  • Daily Activity Recognition with Large-Scaled Real-Life Recording Datasets Based on Deep Neural Network Using Multi-Modal Signals

    Tomoki HAYASHI  Masafumi NISHIDA  Norihide KITAOKA  Tomoki TODA  Kazuya TAKEDA  

     
    PAPER-Engineering Acoustics

    Vol: E101-A No:1  Page(s): 199-210

    In this study, toward the development of a smartphone-based monitoring system for life-logging, we collected over 1,400 hours of recordings covering both the outdoor and indoor daily activities of 19 subjects under practical conditions, using a smartphone and a small camera. We then constructed a large human activity database consisting of environmental sound signals, triaxial acceleration signals, and manually annotated activity tags. Using this database, we evaluate the activity recognition performance of deep neural networks (DNNs), which have achieved strong performance in various fields, and apply DNN-based adaptation techniques to improve performance with only a small amount of subject-specific training data. We experimentally demonstrate that: 1) using multi-modal signals, i.e., environmental sound and triaxial acceleration signals, with a DNN improves activity recognition performance; 2) the DNN can discriminate specified activities from a mixture of ambiguous activities; and 3) DNN-based adaptation methods are effective even when only a small amount of subject-specific training data is available.
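
    A hedged sketch of a multi-modal front end in the spirit of the paper: segment-level fusion of log-mel sound features with simple acceleration statistics before a DNN classifier. The feature choices, segment length, and names are assumptions.

    ```python
    import numpy as np

    def fuse_features(logmel, accel, seg_len):
        """Build one fused feature vector per segment.

        logmel: (T, B) log-mel frames of the environmental sound;
        accel:  (T, 3) time-aligned triaxial acceleration frames.
        """
        feats = []
        T = min(len(logmel), len(accel))
        for t in range(0, T - seg_len + 1, seg_len):
            a = accel[t:t + seg_len]
            motion = np.hstack([a.mean(0), a.std(0)])      # simple stats
            sound = logmel[t:t + seg_len].mean(0)          # average spectrum
            feats.append(np.hstack([sound, motion]))
        return np.array(feats)   # (segments, B + 6), fed to the DNN
    ```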

  • Stereophonic Music Separation Based on Non-Negative Tensor Factorization with Cepstral Distance Regularization

    Shogo SEKI  Tomoki TODA  Kazuya TAKEDA  

     
    PAPER-Engineering Acoustics

    Vol: E101-A No:7  Page(s): 1057-1064

    This paper proposes a semi-supervised source separation method for stereophonic music signals containing multiple recorded or processed signals, focusing on synthesized music. Since synthesized music signals are often generated as linear combinations of many individual source signals and their respective mixing gains, phase or inter-channel phase difference information, which represents the spatial characteristics of recording environments, cannot be used as an acoustic clue for source separation. Non-negative Tensor Factorization (NTF) is an effective technique for this problem: it decomposes the amplitude spectrograms of the stereo channels into basis vectors and activations of the individual music source signals, along with their corresponding mixing gains. However, it is difficult to achieve sufficient separation performance with this method alone, as the acoustic clues available for separation are limited. To address this issue, this paper proposes a Cepstral Distance Regularization (CDR) method for NTF-based stereo channel separation, which makes the cepstra of the separated source signals follow Gaussian Mixture Models (GMMs) of the corresponding music source signals, trained in advance on available samples. Experimental evaluations separating three and four sound sources are conducted to investigate the effectiveness of the proposed method in both supervised and semi-supervised separation frameworks, and its performance is compared with that of a conventional NTF method. The results demonstrate that the proposed method yields significant improvements in both frameworks and that cepstral distance regularization provides better separation parameters.
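
    A minimal sketch of the underlying NTF model X[c,f,t] ≈ Σ_k Q[c,k] W[f,k] H[k,t] with Euclidean multiplicative updates, leaving out the proposed cepstral-distance regularizer (which would add a penalty term to these updates).

    ```python
    import numpy as np

    def ntf(X, K, n_iter=100, eps=1e-9):
        """Decompose a stereo amplitude spectrogram X (2, F, T) into
        channel gains Q (2, K), bases W (F, K), activations H (K, T)."""
        C, F, T = X.shape
        rng = np.random.default_rng(0)
        Q, W, H = rng.random((C, K)), rng.random((F, K)), rng.random((K, T))
        def model():
            return np.einsum('ck,fk,kt->cft', Q, W, H)
        for _ in range(n_iter):
            V = model()
            Q *= np.einsum('cft,fk,kt->ck', X, W, H) / \
                 (np.einsum('cft,fk,kt->ck', V, W, H) + eps)
            V = model()
            W *= np.einsum('cft,ck,kt->fk', X, Q, H) / \
                 (np.einsum('cft,ck,kt->fk', V, Q, H) + eps)
            V = model()
            H *= np.einsum('cft,ck,fk->kt', X, Q, W) / \
                 (np.einsum('cft,ck,fk->kt', V, Q, W) + eps)
        return Q, W, H
    ```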

  • Speech Enhancement Using Nonlinear Microphone Array Based on Noise Adaptive Complementary Beamforming

    Hiroshi SARUWATARI  Shoji KAJITA  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Engineering Acoustics

    Vol: E83-A No:5  Page(s): 866-876

    This paper describes an improved complementary beamforming microphone array based on a new noise adaptation algorithm. Complementary beamforming uses two beamformers designed to have directivity patterns that are complementary with respect to each other. In this system, during pauses in the target speech, the two directivity patterns are adapted to the noise directions of arrival so that the expected value of each noise power spectrum in the array output is minimized. Using this technique, we can form directional nulls for each noise source even when the number of sound sources exceeds the number of microphones. To evaluate its effectiveness, speech enhancement and speech recognition experiments are performed in computer simulations with a two-element array and three sound sources under various noise conditions. In comparison with a conventional adaptive beamformer, and with conventional spectral subtraction cascaded with the adaptive beamformer, it is shown that (1) the proposed array improves the signal-to-noise ratio (SNR) of degraded speech by more than 6 dB when the interfering noise is two speakers at input SNRs below 0 dB, (2) the proposed array improves the SNR by about 2 dB when the interfering noise is babble noise, and (3) the recognition rate improves by more than 18% when the interfering noise is two speakers or two overlapped multi-speaker signals at an input SNR of 10 dB.

  • Single-Channel Multiple Regression for In-Car Speech Enhancement

    Weifeng LI  Katsunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Speech Enhancement

    Vol: E89-D No:3  Page(s): 1032-1039

    We address issues in improving hands-free speech enhancement and speech recognition performance in different car environments using a single distant microphone. This paper describes a new single-channel in-car speech enhancement method that estimates the log spectrum of speech at a close-talking microphone by nonlinear regression on the log spectra of the noisy signal captured by a distant microphone and of the estimated noise. The proposed method provides significant overall quality improvements in our subjective evaluation of the regression-enhanced speech and performs best on most objective measures. In isolated word recognition experiments conducted in 15 real car environments, the proposed adaptive nonlinear regression approach achieves average relative word error rate (WER) reductions of 50.8% and 13.1% compared to the original noisy speech and the ETSI advanced front-end (ETSI ES 202 050), respectively.

  • CENSREC-3: An Evaluation Framework for Japanese Speech Recognition in Real Car-Driving Environments

    Masakiyo FUJIMOTO  Kazuya TAKEDA  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

    Vol: E89-D No:11  Page(s): 2783-2793

    This paper introduces a common database, an evaluation framework, and baseline recognition results for in-car speech recognition, CENSREC-3, an outcome of the IPSJ-SIG SLP Noisy Speech Recognition Evaluation Working Group. CENSREC-3, a sequel to AURORA-2J, is designed as an evaluation framework for isolated word recognition in real car-driving environments. Speech data were collected using two microphones, a close-talking microphone and a hands-free microphone, under 16 carefully controlled driving conditions, i.e., combinations of three car speeds and six car conditions. CENSREC-3 provides six evaluation environments designed using the speech data collected under these conditions.

  • Acoustic Model Training Using Pseudo-Speaker Features Generated by MLLR Transformations for Robust Speaker-Independent Speech Recognition

    Arata ITOH  Sunao HARA  Norihide KITAOKA  Kazuya TAKEDA  

     
    PAPER-Speech and Hearing

    Vol: E95-D No:10  Page(s): 2479-2485

    A novel acoustic model training method based on speech feature generation is proposed for robust speaker-independent speech recognition. Speaker adaptation methods have been widely used for decades, but all of them require adaptation data. Our proposed method instead aims to create speaker-independent acoustic models that cover not only known but also unknown speakers. We achieve this by adopting feature generation based on inverse maximum likelihood linear regression (MLLR) transformations and then training our models on the generated features. First, we obtain MLLR transformation matrices from a limited number of existing speakers. We then extract bases of the MLLR transformation matrices using PCA and estimate the distribution of the weight parameters that express the transformation matrices of the existing speakers. Next, we construct pseudo-speaker transformations by sampling weight parameters from this distribution and apply the transformations to the normalized features of the existing speakers to generate features of pseudo-speakers. Finally, we train the acoustic models on these features. Evaluation results show that acoustic models trained using the proposed method are robust to unknown speakers.
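
    The pseudo-speaker generation pipeline can be sketched as follows, assuming the MLLR matrices are simply vectorized for PCA and the number of bases is smaller than the number of existing speakers; details differ from the paper's estimator.

    ```python
    import numpy as np

    def sample_pseudo_speakers(mllr_mats, n_pseudo, n_bases=10, rng=None):
        """PCA over vectorized MLLR matrices, then sample basis weights
        from a Gaussian fit to the real speakers' weights.

        mllr_mats: (S, d1, d2) MLLR matrices of S existing speakers;
        n_bases should be well below S for a meaningful weight Gaussian.
        """
        rng = rng or np.random.default_rng(0)
        S, d1, d2 = mllr_mats.shape
        Xv = mllr_mats.reshape(S, -1)
        mean = Xv.mean(0)
        _, _, Vt = np.linalg.svd(Xv - mean, full_matrices=False)
        B = Vt[:n_bases]                      # PCA bases of the matrices
        w = (Xv - mean) @ B.T                 # real speakers' weights
        mu, cov = w.mean(0), np.cov(w.T)
        w_new = rng.multivariate_normal(mu, cov, size=n_pseudo)
        return (mean + w_new @ B).reshape(n_pseudo, d1, d2)
    ```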

  • Driver Identification Using Driving Behavior Signals

    Toshihiro WAKITA  Koji OZAWA  Chiyomi MIYAJIMA  Kei IGARASHI  Katunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Human-computer Interaction

    Vol: E89-D No:3  Page(s): 1188-1194

    In this paper, we propose a driver identification method based on the driving behavior signals observed while the driver is following another vehicle. Driving behavior signals, such as accelerator pedal use, brake pedal use, vehicle velocity, and distance from the vehicle in front, were measured using a driving simulator. We compared the identification rates obtained with different identification models and found the Gaussian mixture model to be superior to the Helly model and the optimal velocity model. For the Gaussian mixture model, the driver's operation signals also proved more informative than road environment signals and car behavior signals. The identification rate for thirty drivers in actual vehicle driving in a city area was 73%.
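
    A minimal sketch of GMM-based identification using scikit-learn; the feature layout and component count are assumptions.

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_driver_gmms(features_per_driver, n_comp=8):
        """Fit one GMM per driver on frame-level driving-behavior features
        (e.g. pedal positions, velocity, following distance)."""
        return [GaussianMixture(n_comp, covariance_type='diag',
                                random_state=0).fit(f)
                for f in features_per_driver]

    def identify(gmms, feats):
        """Return the driver whose GMM gives the highest average
        log-likelihood for the observed feature frames."""
        scores = [g.score(feats) for g in gmms]
        return int(np.argmax(scores))
    ```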

  • Acoustic Feature Transformation Based on Discriminant Analysis Preserving Local Structure for Speech Recognition

    Makoto SAKAI  Norihide KITAOKA  Kazuya TAKEDA  

     
    PAPER-Speech and Hearing

    Vol: E93-D No:5  Page(s): 1244-1252

    To improve speech recognition performance, feature transformation based on discriminant analysis has been widely used to reduce the redundant dimensions of acoustic features. Linear discriminant analysis (LDA) and heteroscedastic discriminant analysis (HDA) are often used for this purpose, and a generalization of LDA and HDA, called power LDA (PLDA), has been proposed. However, these methods may produce an unexpected dimensionality reduction for multimodal data; it is important to preserve the local structure of the data when reducing its dimensionality. In this paper we introduce two methods, locality-preserving HDA and locality-preserving PLDA, that reduce the dimensionality of multimodal data appropriately. We also propose an approximate calculation scheme that computes sub-optimal projections rapidly. Experimental results show that the locality-preserving methods yield better speech recognition performance than the traditional ones.
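
    For context, the standard LDA baseline that the locality-preserving variants modify: solve the generalized eigenproblem S_b v = λ S_w v and keep the leading eigenvectors; the locality-preserving methods replace these scatter matrices with locally weighted versions.

    ```python
    import numpy as np
    from scipy.linalg import eigh

    def lda_projection(X, y, p):
        """Return the (D, p) LDA projection from features X (N, D)
        with class labels y, via the generalized eigenproblem."""
        mean = X.mean(0)
        D = X.shape[1]
        Sw, Sb = np.zeros((D, D)), np.zeros((D, D))
        for c in np.unique(y):
            Xc = X[y == c]
            mc = Xc.mean(0)
            Sw += (Xc - mc).T @ (Xc - mc)               # within-class scatter
            Sb += len(Xc) * np.outer(mc - mean, mc - mean)  # between-class
        w, V = eigh(Sb, Sw + 1e-6 * np.eye(D))          # regularize S_w
        return V[:, np.argsort(w)[::-1][:p]]            # top-p directions
    ```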
