Author Search Results

[Author] Kazuya TAKEDA (29 hits)

Showing 1-20 of 29 hits

  • Multichannel Speech Enhancement Based on Generalized Gamma Prior Distribution with Its Online Adaptive Estimation

    Tran HUY DAT  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Speech Enhancement

    Vol: E91-D No:3  Page(s): 439-447

    We present a multichannel speech enhancement method based on MAP speech spectral magnitude estimation using a generalized gamma model of the speech prior distribution, where the model parameters are adapted from the actual noisy speech in a frame-by-frame manner. The use of a more general prior distribution with online adaptive estimation is shown to be effective for speech spectral estimation in noisy environments. Furthermore, multi-channel information in the form of cross-channel statistics is shown to be useful for adapting the prior distribution parameters to the actual observation, resulting in better performance of the speech enhancement algorithm. We tested the proposed algorithm on an in-car speech database and obtained significant improvements in speech recognition performance, particularly under non-stationary noise conditions such as music, air-conditioner noise, and an open window.
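
    As a rough illustration of the frame-by-frame prior adaptation described above, the following sketch tracks running moments of the noisy spectral magnitudes and moment-matches a plain gamma prior (a special case of the generalized gamma). The smoothing constant and function name are illustrative assumptions, not the authors' estimator.

    ```python
    import numpy as np

    def adapt_gamma_prior(mag_frames, alpha=0.9):
        """Frame-by-frame moment matching of a gamma prior per frequency bin.

        mag_frames: (T, F) spectral magnitudes of noisy speech.
        Returns per-bin (shape, scale) after the last frame.
        """
        T, F = mag_frames.shape
        m1 = np.full(F, 1e-3)  # running first moment E[x]
        m2 = np.full(F, 1e-3)  # running second moment E[x^2]
        for t in range(T):
            x = mag_frames[t]
            m1 = alpha * m1 + (1 - alpha) * x
            m2 = alpha * m2 + (1 - alpha) * x ** 2
        var = np.maximum(m2 - m1 ** 2, 1e-12)
        return m1 ** 2 / var, var / m1  # gamma shape k, scale theta
    ```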

  • Selective Listening Point Audio Based on Blind Signal Separation and Stereophonic Technology

    Kenta NIWA  Takanori NISHINO  Kazuya TAKEDA  

     
    PAPER-Speech and Hearing

    Vol: E92-D No:3  Page(s): 469-476

    A sound field reproduction method is proposed that uses blind source separation and head-related transfer functions. In the proposed system, multichannel acoustic signals captured at distant microphones are decomposed into a set of location/signal pairs of virtual sound sources based on frequency-domain independent component analysis. After the locations and signals of the virtual sources are estimated, the spatial sound at the selected listening point is constructed by convolving the corresponding acoustic transfer functions with each signal. In experiments, a sound field produced by six sound sources was captured using 48 distant microphones and decomposed into sets of virtual sound sources. Since subjective evaluation shows no significant difference between natural and reconstructed sound when six virtual sources are used, the effectiveness of the decomposition algorithm and of the virtual source representation is confirmed.
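
    The rendering step above can be sketched as follows: each virtual source signal recovered by BSS is convolved with a pair of head-related impulse responses chosen for the selected listening point, and the results are summed into a binaural output. The function name and array shapes are assumptions.

    ```python
    import numpy as np
    from scipy.signal import fftconvolve

    def render_listening_point(sources, hrirs):
        """sources: list of 1-D virtual-source signals (from BSS);
        hrirs: list of (2, L) head-related impulse responses, one pair
        per virtual source, selected for the listening point."""
        n = max(len(s) + h.shape[1] - 1 for s, h in zip(sources, hrirs))
        out = np.zeros((2, n))
        for s, h in zip(sources, hrirs):
            for ch in range(2):          # left/right ears
                y = fftconvolve(s, h[ch])
                out[ch, :len(y)] += y
        return out
    ```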

  • Noise Robust Speech Recognition Using Subband-Crosscorrelation Analysis

    Shoji KAJITA  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Speech Processing and Acoustics

    Vol: E81-D No:10  Page(s): 1079-1086

    This paper describes subband-crosscorrelation analysis (SBXCOR), a signal processing technique that uses two input channel signals. SBXCOR extends subband-autocorrelation analysis (SBCOR), which extracts periodicities associated with the inverse of the center frequencies present in speech signals. In addition, to extract more periodicity information, multi-delay weighting (MDW) processing is applied to SBXCOR. In experiments, the noise robustness of SBXCOR is evaluated using a DTW word recognizer under (1) a simulated acoustic condition with white noise and (2) a real acoustic condition in a soundproof room with human speech-like noise. Under the simulated acoustic condition, SBXCOR is shown to be more robust than conventional one-channel SBCOR, but less robust than SBCOR extracted from the two-channel-summed signal. Furthermore, applying MDW processing improved the performance of SBXCOR by about 2% at an SNR of 0 dB. The resulting performance of SBXCOR with MDW processing was much better than that of the smoothed group delay spectrum (SGDS) and mel-filterbank cepstral coefficients (MFCC) below an SNR of 10 dB. The results under the real acoustic condition were almost the same as those under the simulated condition.
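
    A minimal sketch of the subband cross-correlation idea, assuming a Butterworth filter bank and a single lag of one period per center frequency; the paper's multi-delay weighting, which combines several lags, is omitted here.

    ```python
    import numpy as np
    from scipy.signal import butter, lfilter

    def sbxcor(x1, x2, fs, center_freqs, bw_ratio=0.2):
        """Normalized cross-correlation of two channels per subband,
        evaluated at a lag of one period (1/fc) of the center frequency."""
        feats = []
        for fc in center_freqs:
            lo, hi = fc * (1 - bw_ratio), fc * (1 + bw_ratio)
            b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            s1, s2 = lfilter(b, a, x1), lfilter(b, a, x2)
            lag = int(round(fs / fc))            # one period in samples
            num = np.dot(s1[:-lag], s2[lag:])
            den = np.sqrt(np.dot(s1, s1) * np.dot(s2, s2)) + 1e-12
            feats.append(num / den)
        return np.array(feats)
    ```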

  • Direction of Arrival Estimation Using Nonlinear Microphone Array

    Hidekazu KAMIYANAGIDA  Hiroshi SARUWATARI  Kazuya TAKEDA  Fumitada ITAKURA  Kiyohiro SHIKANO  

     
    PAPER

    Vol: E84-A No:4  Page(s): 999-1010

    This paper describes a new method for estimating the direction of arrival (DOA) using a nonlinear microphone array system based on complementary beamforming. Complementary beamforming uses two beamformers designed to have directivity patterns that are complementary with respect to each other. Since the resultant directivity pattern of the system is proportional to the product of these two patterns, the proposed method can estimate the DOAs of 2(K-1) sound sources with a K-element microphone array. DOA-estimation experiments are performed using both computer simulations and actual devices in real acoustic environments. The results show that the proposed method can estimate the DOAs of two sound sources with only two microphones. Comparing the resolution of the proposed method with that of the conventional minimum variance method also shows that the proposed method performs better under all reverberant conditions tested.
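
    For reference, the conventional minimum variance (Capon) method used as the comparison baseline can be sketched as follows; the linear array geometry, diagonal loading, and function signature are assumptions, and this is not the proposed nonlinear array itself.

    ```python
    import numpy as np

    def capon_spectrum(X, mic_pos, f_bin, angles_deg, c=343.0):
        """Minimum-variance (Capon) spatial spectrum for one STFT bin.

        X: (mics, frames) complex STFT values at frequency f_bin [Hz];
        mic_pos: 1-D array of microphone x-positions [m] (linear array).
        """
        M, T = X.shape
        R = X @ X.conj().T / T
        R += 1e-3 * np.trace(R).real / M * np.eye(M)   # diagonal loading
        Rinv = np.linalg.inv(R)
        p = []
        for th in np.deg2rad(angles_deg):
            delays = mic_pos * np.sin(th) / c           # plane-wave delays
            a = np.exp(-2j * np.pi * f_bin * delays)    # steering vector
            p.append(1.0 / np.real(a.conj() @ Rinv @ a))
        return np.array(p)   # peaks indicate candidate DOAs
    ```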

  • Adaptive Nonlinear Regression Using Multiple Distributed Microphones for In-Car Speech Recognition

    Weifeng LI  Chiyomi MIYAJIMA  Takanori NISHINO  Katsunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Speech Enhancement

    Vol: E88-A No:7  Page(s): 1716-1723

    In this paper, we address issues in improving hands-free speech recognition performance in different car environments using multiple spatially distributed microphones. In previous work, we proposed multiple linear regression of the log spectra (MRLS) for estimating the log spectra of speech at a close-talking microphone. In this paper, the concept is extended to nonlinear regression, and regression in the cepstrum domain is also investigated. An effective algorithm is developed to adapt the regression weights automatically to different noise environments. Compared to the nearest distant microphone and an adaptive beamformer (generalized sidelobe canceller), the proposed adaptive nonlinear regression approach achieves average relative word error rate (WER) reductions of 58.5% and 10.3%, respectively, for isolated word recognition in 15 real car environments.
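
    For context, a minimal sketch of the earlier linear MRLS step on which the nonlinear extension builds: per-bin least squares from distant-microphone log spectra to the close-talking microphone. Names and array shapes are assumptions.

    ```python
    import numpy as np

    def fit_mrls(distant_logspec, close_logspec):
        """distant_logspec: (T, M) log spectra of one frequency bin from
        M distant mics; close_logspec: (T,) target log spectrum at the
        close-talking mic. Returns (M+1,) weights including a bias."""
        T, M = distant_logspec.shape
        A = np.hstack([distant_logspec, np.ones((T, 1))])
        w, *_ = np.linalg.lstsq(A, close_logspec, rcond=None)
        return w

    def apply_mrls(distant_logspec, w):
        """Estimate the close-talking log spectrum from distant mics."""
        A = np.hstack([distant_logspec, np.ones((len(distant_logspec), 1))])
        return A @ w
    ```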

  • A Single-Dimensional Interface for Arranging Multiple Audio Sources in Three-Dimensional Space

    Kento OHTANI  Kenta NIWA  Kazuya TAKEDA  

     
    PAPER-Music Information Processing

    Publicized: 2017/06/26
    Vol: E100-D No:10  Page(s): 2635-2643

    A single-dimensional interface that enables users to obtain diverse localizations of audio sources is proposed. Many conventional interfaces for arranging audio sources expose multiple arrangement parameters that control the positions of the sources, but it is difficult for users who are unfamiliar with these systems to optimize the parameters, since the number of possible settings is huge. We propose a simple, single-dimensional interface for adjusting arrangement parameters, allowing users to sample several diverse audio source arrangements and easily find their preferred auditory localizations. To select subsets of arrangement parameters from all of the possible choices, auditory-localization space vectors (ASVs) are defined to represent the auditory localization produced by each arrangement parameter. By selecting subsets of ASVs that are approximately orthogonal, we can choose arrangement parameters that produce diverse auditory localizations. Experimental evaluations were conducted using music composed of three audio sources, and subjective evaluations confirmed that novice users can obtain diverse localizations using the proposed interface.
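
    The "approximately orthogonal" subset selection can be illustrated with a simple greedy max-min rule that repeatedly adds the ASV least similar to those already chosen; this is a plausible sketch, not necessarily the paper's exact procedure.

    ```python
    import numpy as np

    def select_diverse(asvs, k):
        """Greedily pick k approximately orthogonal rows of `asvs`.

        asvs: (N, D) auditory-localization space vectors.
        Returns the indices of the k selected vectors.
        """
        V = asvs / np.linalg.norm(asvs, axis=1, keepdims=True)
        chosen = [0]                                  # arbitrary seed
        while len(chosen) < k:
            # worst-case (largest) similarity to anything already chosen
            sim = np.abs(V @ V[chosen].T).max(axis=1)
            sim[chosen] = np.inf                      # exclude picked ones
            chosen.append(int(np.argmin(sim)))
        return chosen
    ```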

  • Blind Source Separation Using Dodecahedral Microphone Array under Reverberant Conditions

    Motoki OGASAWARA  Takanori NISHINO  Kazuya TAKEDA  

     
    PAPER-Engineering Acoustics

    Vol: E94-A No:3  Page(s): 897-906

    The separation and localization of sound source signals are important techniques for many applications, such as highly realistic communication and speech recognition systems. Such systems are expected to work without prior information such as the number of sound sources or the environmental conditions. In this paper, we develop a dodecahedral microphone array and propose a novel separation method for the device. The method draws on human sound localization cues and exploits the acoustical characteristics produced by the shape of the dodecahedral array. Moreover, it includes a method for estimating the number of sound sources that operates without prior information. Sound source separation performance was evaluated under simulated and actual reverberant conditions, and the results were compared with a conventional method. The experimental results show that our method outperforms the conventional one.

  • Effective Frame Selection for Blind Source Separation Based on Frequency Domain Independent Component Analysis

    Yusuke MIZUNO  Kazunobu KONDO  Takanori NISHINO  Norihide KITAOKA  Kazuya TAKEDA  

     
    PAPER-Engineering Acoustics

    Vol: E97-A No:3  Page(s): 784-791

    Blind source separation is a technique that can separate sound sources without information such as the source locations, the number of sources, or the utterance content. Multi-channel source separation using many microphones separates signals with high accuracy even when there are many sources, but such methods have extremely high computational complexity, which must be reduced. In this paper, we propose a method for reducing the computational complexity of blind source separation based on frequency-domain independent component analysis (FDICA) and examine which temporal data are effective for source separation. A frame containing many sound sources is effective for FDICA source separation; we assume that a frame with low kurtosis contains many sound sources and preferentially select such frames. In the proposed method, the log power spectrum and the kurtosis of the magnitude distribution of the observed data are used as selection criteria, and source separation experiments are conducted using speech signals from twelve speakers. We evaluated separation performance by the signal-to-interference ratio (SIR) improvement score, which was 24.3 dB when all frames were used and 23.3 dB when the 300 frames selected by our criteria were used. These results confirm that the proposed selection criteria based on kurtosis and magnitude are effective. Furthermore, the computational complexity is reduced significantly because it is proportional to the number of selected frames.
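
    The kurtosis-based criterion reduces to a few lines; a sketch assuming frame-wise kurtosis of the magnitude distribution (the log-power-spectrum criterion would be analogous).

    ```python
    import numpy as np
    from scipy.stats import kurtosis

    def select_frames(spec_mag, n_frames=300):
        """Pick the frames whose magnitude distributions have the lowest
        kurtosis (assumed to contain the most simultaneous sources).

        spec_mag: (T, F) magnitude spectrogram of the observed mixture.
        """
        k = kurtosis(spec_mag, axis=1)       # kurtosis across bins per frame
        return np.argsort(k)[:n_frames]      # indices of selected frames
    ```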

  • Evaluation of Combinational Use of Discriminant Analysis-Based Acoustic Feature Transformation and Discriminative Training

    Makoto SAKAI  Norihide KITAOKA  Yuya HATTORI  Seiichi NAKAGAWA  Kazuya TAKEDA  

     
    LETTER-Speech and Hearing

    Vol: E93-D No:2  Page(s): 395-398

    To improve speech recognition performance, acoustic feature transformation based on discriminant analysis has been widely used. Discriminative training of HMMs has also been used for the same purpose. In this letter we investigate the effectiveness of these two techniques and of their combination. We also investigate robustness under matched and mismatched noise conditions between the training and evaluation environments.

  • Acoustic Feature Transformation Combining Average and Maximum Classification Error Minimization Criteria

    Makoto SAKAI  Norihide KITAOKA  Kazuya TAKEDA  

     
    LETTER-Speech and Hearing

    Vol: E93-D No:7  Page(s): 2005-2008

    Acoustic feature transformation is widely used to reduce dimensionality and improve speech recognition performance. In this letter we focus on dimensionality reduction methods that minimize the average classification error. Unfortunately, minimizing the average classification error may cause considerable overlap between the distributions of some classes. To mitigate the risk of such overlaps, we propose a dimensionality reduction method that minimizes the maximum classification error. We also propose two interpolated methods that combine the average and maximum classification errors. Experimental results show that the proposed methods improve speech recognition performance.
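
    One plausible form of such an interpolated criterion, assuming pairwise classification-error estimates are available; the exact weighting in the letter may differ.

    ```python
    import numpy as np

    def interpolated_criterion(pair_errors, alpha):
        """Blend of average and maximum pairwise classification error:
        J = (1 - alpha) * mean + alpha * max. alpha = 0 recovers the
        average-error criterion, alpha = 1 the maximum-error one."""
        e = np.asarray(pair_errors)
        return (1 - alpha) * e.mean() + alpha * e.max()
    ```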

  • Error Analysis of Field Trial Results of a Spoken Dialogue System for Telecommunications Applications

    Shingo KUROIWA  Kazuya TAKEDA  Masaki NAITO  Naomi INOUE  Seiichi YAMAMOTO  

     
    PAPER

    Vol: E78-D No:6  Page(s): 636-641

    We carried out a one-year field trial of a voice-activated automatic telephone exchange service at KDD Laboratories, which has about 200 branch phones. The system has DSP-based continuous speech recognition hardware that can process incoming calls in real time with a vocabulary of 300 words. Speaker-independent recognition accuracy was found to be 92.5% for speech read from written text under laboratory conditions. In this paper, we describe the performance of the system during the field trial. Apart from recognition errors, there was about 20% error due to out-of-vocabulary input and incorrect detection of speech endpoints, neither of which had been allowed for in the laboratory experiments. We also found that recognition accuracy for actual speech was about 18% lower than for speech read from text, even when there were no out-of-vocabulary words. We examine error variations across individual data to pinpoint the causes of incorrect recognition. Experiments on the collected data showed that the pause model, the filled-pause grammar, and differences in channel frequency response seriously affected recognition accuracy. With the help of simple techniques to overcome these problems, we finally obtained a recognition accuracy of 88.7% on real data.

  • Construction and Evaluation of a Large In-Car Speech Corpus

    Kazuya TAKEDA  Hiroshi FUJIMURA  Katsunobu ITOU  Nobuo KAWAGUCHI  Shigeki MATSUBARA  Fumitada ITAKURA  

     
    PAPER-Speech Corpora and Related Topics

    Vol: E88-D No:3  Page(s): 553-561

    In this paper, we discuss the construction of a large in-car spoken dialogue corpus and the results of its analysis. We have developed a system specially built into a Data Collection Vehicle (DCV) that supports synchronous recording of multichannel audio data from 16 microphones placed in flexible positions, multichannel video data from 3 cameras, and vehicle-related data. Multimedia data were collected during three sessions of spoken dialogue with different modes of navigation, over an approximately 60-minute drive by each of 800 subjects. We characterize the collected dialogues across the three sessions; some characteristics, such as sentence complexity and SNR, differ significantly among the sessions. Linear regression analysis also clarifies the relative importance of various corpus characteristics.

  • Daily Activity Recognition with Large-Scaled Real-Life Recording Datasets Based on Deep Neural Network Using Multi-Modal Signals

    Tomoki HAYASHI  Masafumi NISHIDA  Norihide KITAOKA  Tomoki TODA  Kazuya TAKEDA  

     
    PAPER-Engineering Acoustics

    Vol: E101-A No:1  Page(s): 199-210

    In this study, toward the development of a smartphone-based monitoring system for life-logging, we collected over 1,400 hours of recordings covering both the outdoor and indoor daily activities of 19 subjects under practical conditions, using a smartphone and a small camera. We then constructed a large human activity database consisting of environmental sound signals, triaxial acceleration signals, and manually annotated activity tags. Using this database, we evaluate the activity recognition performance of deep neural networks (DNNs), which have achieved strong performance in various fields, and apply DNN-based adaptation techniques to improve performance with only a small amount of subject-specific training data. We experimentally demonstrate that: 1) using multi-modal signals, i.e., environmental sound and triaxial acceleration signals, with a DNN improves activity recognition performance; 2) the DNN can discriminate specified activities from a mixture of ambiguous activities; and 3) DNN-based adaptation methods are effective even when only a small amount of subject-specific training data is available.
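
    A hedged sketch of a multi-modal front end in the spirit of the paper: segment-level fusion of log-mel sound features with simple acceleration statistics before a DNN classifier. The feature choices, segment length, and names are assumptions.

    ```python
    import numpy as np

    def fuse_features(logmel, accel, seg_len):
        """Build one fused feature vector per segment.

        logmel: (T, B) log-mel frames of the environmental sound;
        accel:  (T, 3) time-aligned triaxial acceleration frames.
        """
        feats = []
        T = min(len(logmel), len(accel))
        for t in range(0, T - seg_len + 1, seg_len):
            a = accel[t:t + seg_len]
            motion = np.hstack([a.mean(0), a.std(0)])      # simple stats
            sound = logmel[t:t + seg_len].mean(0)          # average spectrum
            feats.append(np.hstack([sound, motion]))
        return np.array(feats)   # (segments, B + 6), fed to the DNN
    ```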

  • Stereophonic Music Separation Based on Non-Negative Tensor Factorization with Cepstral Distance Regularization

    Shogo SEKI  Tomoki TODA  Kazuya TAKEDA  

     
    PAPER-Engineering Acoustics

    Vol: E101-A No:7  Page(s): 1057-1064

    This paper proposes a semi-supervised source separation method for stereophonic music signals containing multiple recorded or processed signals, focusing on synthesized music. Since synthesized music signals are often generated as linear combinations of many individual source signals and their respective mixing gains, phase or inter-channel phase difference information, which represents the spatial characteristics of recording environments, cannot be used as an acoustic clue for source separation. Non-negative Tensor Factorization (NTF) is an effective technique for this problem: it decomposes the amplitude spectrograms of the stereo channels into basis vectors and activations of the individual music source signals, along with their corresponding mixing gains. However, it is difficult to achieve sufficient separation performance with this method alone, as the acoustic clues available for separation are limited. To address this issue, this paper proposes a Cepstral Distance Regularization (CDR) method for NTF-based stereo channel separation, which makes the cepstra of the separated source signals follow Gaussian Mixture Models (GMMs) of the corresponding music source signals, trained in advance on available samples. Experimental evaluations separating three and four sound sources are conducted to investigate the effectiveness of the proposed method in both supervised and semi-supervised separation frameworks, and its performance is compared with that of a conventional NTF method. The results demonstrate that the proposed method yields significant improvements in both frameworks and that cepstral distance regularization provides better separation parameters.
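
    A minimal sketch of the underlying NTF model X[c,f,t] ≈ Σ_k Q[c,k] W[f,k] H[k,t] with Euclidean multiplicative updates, leaving out the proposed cepstral-distance regularizer (which would add a penalty term to these updates).

    ```python
    import numpy as np

    def ntf(X, K, n_iter=100, eps=1e-9):
        """Decompose a stereo amplitude spectrogram X (2, F, T) into
        channel gains Q (2, K), bases W (F, K), activations H (K, T)."""
        C, F, T = X.shape
        rng = np.random.default_rng(0)
        Q, W, H = rng.random((C, K)), rng.random((F, K)), rng.random((K, T))
        def model():
            return np.einsum('ck,fk,kt->cft', Q, W, H)
        for _ in range(n_iter):
            V = model()
            Q *= np.einsum('cft,fk,kt->ck', X, W, H) / \
                 (np.einsum('cft,fk,kt->ck', V, W, H) + eps)
            V = model()
            W *= np.einsum('cft,ck,kt->fk', X, Q, H) / \
                 (np.einsum('cft,ck,kt->fk', V, Q, H) + eps)
            V = model()
            H *= np.einsum('cft,ck,fk->kt', X, Q, W) / \
                 (np.einsum('cft,ck,fk->kt', V, Q, W) + eps)
        return Q, W, H
    ```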

  • Speech Enhancement Using Nonlinear Microphone Array Based on Noise Adaptive Complementary Beamforming

    Hiroshi SARUWATARI  Shoji KAJITA  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Engineering Acoustics

    Vol: E83-A No:5  Page(s): 866-876

    This paper describes an improved complementary beamforming microphone array based on a new noise adaptation algorithm. Complementary beamforming uses two beamformers designed to have directivity patterns that are complementary with respect to each other. In this system, during pauses in the target speech, the two directivity patterns are adapted to the noise directions of arrival so that the expected value of each noise power spectrum in the array output is minimized. Using this technique, we can form directional nulls for each noise source even when the number of sound sources exceeds the number of microphones. To evaluate its effectiveness, speech enhancement and speech recognition experiments are performed in computer simulations with a two-element array and three sound sources under various noise conditions. In comparison with a conventional adaptive beamformer, and with conventional spectral subtraction cascaded with the adaptive beamformer, it is shown that (1) the proposed array improves the signal-to-noise ratio (SNR) of degraded speech by more than 6 dB when the interfering noise is two speakers at input SNRs below 0 dB, (2) the proposed array improves the SNR by about 2 dB when the interfering noise is babble noise, and (3) the recognition rate improves by more than 18% when the interfering noise is two speakers or two overlapped multi-speaker signals at an input SNR of 10 dB.

  • Single-Channel Multiple Regression for In-Car Speech Enhancement

    Weifeng LI  Katsunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Speech Enhancement

    Vol: E89-D No:3  Page(s): 1032-1039

    We address issues in improving hands-free speech enhancement and speech recognition performance in different car environments using a single distant microphone. This paper describes a new single-channel in-car speech enhancement method that estimates the log spectrum of speech at a close-talking microphone by nonlinear regression on the log spectra of the noisy signal captured by a distant microphone and of the estimated noise. The proposed method provides significant overall quality improvements in our subjective evaluation of the regression-enhanced speech and performs best on most objective measures. In isolated word recognition experiments conducted in 15 real car environments, the proposed adaptive nonlinear regression approach achieves average relative word error rate (WER) reductions of 50.8% and 13.1% compared to the original noisy speech and the ETSI advanced front-end (ETSI ES 202 050), respectively.

  • CENSREC-3: An Evaluation Framework for Japanese Speech Recognition in Real Car-Driving Environments

    Masakiyo FUJIMOTO  Kazuya TAKEDA  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

    Vol: E89-D No:11  Page(s): 2783-2793

    This paper introduces a common database, an evaluation framework, and baseline recognition results for in-car speech recognition, CENSREC-3, an outcome of the IPSJ-SIG SLP Noisy Speech Recognition Evaluation Working Group. CENSREC-3, a sequel to AURORA-2J, is designed as an evaluation framework for isolated word recognition in real car-driving environments. Speech data were collected using two microphones, a close-talking microphone and a hands-free microphone, under 16 carefully controlled driving conditions, i.e., combinations of three car speeds and six car conditions. CENSREC-3 provides six evaluation environments designed using the speech data collected under these conditions.

  • Acoustic Model Training Using Pseudo-Speaker Features Generated by MLLR Transformations for Robust Speaker-Independent Speech Recognition

    Arata ITOH  Sunao HARA  Norihide KITAOKA  Kazuya TAKEDA  

     
    PAPER-Speech and Hearing

    Vol: E95-D No:10  Page(s): 2479-2485

    A novel acoustic model training method based on speech feature generation is proposed for robust speaker-independent speech recognition. Speaker adaptation methods have been widely used for decades, but all of them require adaptation data. Our proposed method instead aims to create speaker-independent acoustic models that cover not only known but also unknown speakers. We achieve this by adopting feature generation based on inverse maximum likelihood linear regression (MLLR) transformations and then training our models on the generated features. First, we obtain MLLR transformation matrices from a limited number of existing speakers. We then extract bases of the MLLR transformation matrices using PCA and estimate the distribution of the weight parameters that express the transformation matrices of the existing speakers. Next, we construct pseudo-speaker transformations by sampling weight parameters from this distribution and apply the transformations to the normalized features of the existing speakers to generate features of pseudo-speakers. Finally, we train the acoustic models on these features. Evaluation results show that acoustic models trained using the proposed method are robust to unknown speakers.
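
    The pseudo-speaker generation pipeline can be sketched as follows, assuming the MLLR matrices are simply vectorized for PCA and the number of bases is smaller than the number of existing speakers; details differ from the paper's estimator.

    ```python
    import numpy as np

    def sample_pseudo_speakers(mllr_mats, n_pseudo, n_bases=10, rng=None):
        """PCA over vectorized MLLR matrices, then sample basis weights
        from a Gaussian fit to the real speakers' weights.

        mllr_mats: (S, d1, d2) MLLR matrices of S existing speakers;
        n_bases should be well below S for a meaningful weight Gaussian.
        """
        rng = rng or np.random.default_rng(0)
        S, d1, d2 = mllr_mats.shape
        Xv = mllr_mats.reshape(S, -1)
        mean = Xv.mean(0)
        _, _, Vt = np.linalg.svd(Xv - mean, full_matrices=False)
        B = Vt[:n_bases]                      # PCA bases of the matrices
        w = (Xv - mean) @ B.T                 # real speakers' weights
        mu, cov = w.mean(0), np.cov(w.T)
        w_new = rng.multivariate_normal(mu, cov, size=n_pseudo)
        return (mean + w_new @ B).reshape(n_pseudo, d1, d2)
    ```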

  • Driver Identification Using Driving Behavior Signals

    Toshihiro WAKITA  Koji OZAWA  Chiyomi MIYAJIMA  Kei IGARASHI  Katunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Human-computer Interaction

    Vol: E89-D No:3  Page(s): 1188-1194

    In this paper, we propose a driver identification method based on the driving behavior signals observed while the driver is following another vehicle. Driving behavior signals, such as accelerator pedal use, brake pedal use, vehicle velocity, and distance from the vehicle in front, were measured using a driving simulator. We compared the identification rates obtained with different identification models and found the Gaussian mixture model to be superior to the Helly model and the optimal velocity model. For the Gaussian mixture model, the driver's operation signals also proved more informative than road environment signals and car behavior signals. The identification rate for thirty drivers in actual vehicle driving in a city area was 73%.
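
    A minimal sketch of GMM-based identification using scikit-learn; the feature layout and component count are assumptions.

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_driver_gmms(features_per_driver, n_comp=8):
        """Fit one GMM per driver on frame-level driving-behavior features
        (e.g. pedal positions, velocity, following distance)."""
        return [GaussianMixture(n_comp, covariance_type='diag',
                                random_state=0).fit(f)
                for f in features_per_driver]

    def identify(gmms, feats):
        """Return the driver whose GMM gives the highest average
        log-likelihood for the observed feature frames."""
        scores = [g.score(feats) for g in gmms]
        return int(np.argmax(scores))
    ```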

  • Acoustic Feature Transformation Based on Discriminant Analysis Preserving Local Structure for Speech Recognition

    Makoto SAKAI  Norihide KITAOKA  Kazuya TAKEDA  

     
    PAPER-Speech and Hearing

    Vol: E93-D No:5  Page(s): 1244-1252

    To improve speech recognition performance, feature transformation based on discriminant analysis has been widely used to reduce the redundant dimensions of acoustic features. Linear discriminant analysis (LDA) and heteroscedastic discriminant analysis (HDA) are often used for this purpose, and a generalization of LDA and HDA, called power LDA (PLDA), has been proposed. However, these methods may produce an unexpected dimensionality reduction for multimodal data; it is important to preserve the local structure of the data when reducing its dimensionality. In this paper we introduce two methods, locality-preserving HDA and locality-preserving PLDA, that reduce the dimensionality of multimodal data appropriately. We also propose an approximate calculation scheme that computes sub-optimal projections rapidly. Experimental results show that the locality-preserving methods yield better speech recognition performance than the traditional ones.
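
    For context, the standard LDA baseline that the locality-preserving variants modify: solve the generalized eigenproblem S_b v = λ S_w v and keep the leading eigenvectors; the locality-preserving methods replace these scatter matrices with locally weighted versions.

    ```python
    import numpy as np
    from scipy.linalg import eigh

    def lda_projection(X, y, p):
        """Return the (D, p) LDA projection from features X (N, D)
        with class labels y, via the generalized eigenproblem."""
        mean = X.mean(0)
        D = X.shape[1]
        Sw, Sb = np.zeros((D, D)), np.zeros((D, D))
        for c in np.unique(y):
            Xc = X[y == c]
            mc = Xc.mean(0)
            Sw += (Xc - mc).T @ (Xc - mc)               # within-class scatter
            Sb += len(Xc) * np.outer(mc - mean, mc - mean)  # between-class
        w, V = eigh(Sb, Sw + 1e-6 * np.eye(D))          # regularize S_w
        return V[:, np.argsort(w)[::-1][:p]]            # top-p directions
    ```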
