
Author Search Result

[Author] Kiyohiro SHIKANO (45 hits)

Showing results 1-20 of 45

  • Task Adaptation in Syllable Trigram Models for Continuous Speech Recognition

    Sho-ichi MATSUNAGA  Tomokazu YAMADA  Kiyohiro SHIKANO  

     
    PAPER

      Vol: E76-D No:1    Page(s): 38-43

    In speech recognition systems dealing with unlimited vocabulary and based on stochastic language models, recognition performance decreases when the target recognition task is changed, because the language model is no longer appropriate. This paper describes two approaches for adapting a specific/general syllable trigram model to a new task. One uses a small amount of text data similar to the target task; the other uses supervised learning based on the most recent input phrases and similar text. In this paper, these adaptation methods are called "preliminary learning" and "successive learning", respectively. The adaptations are evaluated using syllable perplexity and phrase recognition rates. Perplexity was reduced from 24.5 to 14.3 by preliminary learning with 1000 phrases of similar text, and to 12.1 by successive learning with 1000 phrases including the 100 most recent phrases. The recognition rates improved from 42.3% to 51.3% and 52.9%, respectively. Text similarity for the two approaches is also studied.
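
    As a rough illustration of the kind of model involved, the sketch below builds syllable trigram counts and scores phrases with a perplexity computed from a linear interpolation of a general model and a small task-specific model. The interpolation weight, the probability floor, and the data layout are illustrative assumptions, not the paper's actual adaptation procedure.

```python
import math
from collections import defaultdict

def trigram_counts(corpus):
    """Count syllable trigrams and their contexts in a corpus,
    where each phrase is a list of syllable strings."""
    counts, context = defaultdict(int), defaultdict(int)
    for phrase in corpus:
        syl = ["<s>", "<s>"] + phrase + ["</s>"]
        for i in range(2, len(syl)):
            counts[(syl[i - 2], syl[i - 1], syl[i])] += 1
            context[(syl[i - 2], syl[i - 1])] += 1
    return counts, context

def interp_prob(tri, general, adapted, lam=0.7, floor=1e-6):
    """P(s3 | s1 s2) as a linear interpolation of the general model
    and the small task-specific model (lam is an assumed weight)."""
    def ml(model):
        counts, context = model
        c = context.get(tri[:2], 0)
        return counts.get(tri, 0) / c if c else 0.0
    return max(lam * ml(general) + (1.0 - lam) * ml(adapted), floor)

def perplexity(corpus, general, adapted, lam=0.7):
    """Syllable perplexity of a corpus under the interpolated model."""
    logp, n = 0.0, 0
    for phrase in corpus:
        syl = ["<s>", "<s>"] + phrase + ["</s>"]
        for i in range(2, len(syl)):
            logp += math.log(interp_prob(tuple(syl[i - 2:i + 1]), general, adapted, lam))
            n += 1
    return math.exp(-logp / n)
```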

  • On-Line Relaxation Algorithm Applicable to Acoustic Fluctuation for Inverse Filter in Multichannel Sound Reproduction System

    Yosuke TATEKURA  Shigefumi URATA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Sound Field Reproduction

      Vol: E88-A No:7    Page(s): 1747-1756

    In this paper, we propose a new on-line adaptive relaxation algorithm for an inverse filter in a multichannel sound reproduction system. The fluctuation of room transfer functions degrades the reproduced sound in conventional sound reproduction systems, in which the coefficients of the inverse filter are fixed. To resolve this problem, an iterative relaxation algorithm for an inverse filter based on truncated singular value decomposition (adaptive TSVD) has been proposed. However, it is difficult to apply this method in real time while speech or music is present in the original signals. Therefore, we extend adaptive TSVD to an on-line algorithm based on the signal observed at only one control point, normalizing the observed signal by the original sound. Simulation results using real environmental data reveal that the proposed method can carry out the relaxation process against acoustic fluctuation at all times, for any signal duration. Subjective evaluation in a real acoustic environment also indicates that the sound quality improves without degrading the localization.
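
    For reference, below is a minimal sketch of the per-frequency-bin truncated-SVD inversion that the adaptive method builds on; the truncation threshold and matrix shapes are assumed, and the on-line relaxation itself is not reproduced.

```python
import numpy as np

def tsvd_inverse(G, tol=0.1):
    """Truncated-SVD pseudo-inverse of a transfer-function matrix G
    (control points x loudspeakers) at one frequency bin. Singular
    values below tol * sigma_max are dropped so the inverse filter
    does not amplify ill-conditioned directions (tol is assumed)."""
    U, s, Vh = np.linalg.svd(G, full_matrices=False)
    s_inv = np.where(s >= tol * s.max(), 1.0 / s, 0.0)
    return (Vh.conj().T * s_inv) @ U.conj().T

# Toy example: inverse filters for 257 bins, 4 control points, 2 loudspeakers
rng = np.random.default_rng(0)
G = rng.standard_normal((257, 4, 2)) + 1j * rng.standard_normal((257, 4, 2))
H = np.stack([tsvd_inverse(Gk) for Gk in G])   # shape (257, 2, 4)
```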

  • Maximum Likelihood Successive State Splitting Algorithm for Tied-Mixture HMnet

    Alexandre GIRARDI  Harald SINGER  Kiyohiro SHIKANO  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

      Vol: E83-D No:10    Page(s): 1890-1897

    This paper shows how a divisive state clustering algorithm that generates acoustic hidden Markov models (HMMs) can benefit from a tied-mixture representation of the probability density function (pdf) of a state, increasing recognition performance. Popular decision-tree-based clustering algorithms, such as the Successive State Splitting (SSS) algorithm, make use of a simplification when clustering data: they represent a state using a single Gaussian pdf. We show that this approximation of the true pdf by a single Gaussian is too coarse; for example, a single Gaussian cannot represent the differences in the symmetric parts of the pdfs of the new hypothetical states generated when evaluating the state split gain (which determines the state split). More sophisticated representations would lead to intractable computational problems, which we solve by using a tied-mixture pdf representation. Additionally, we constrain the codebook to be immutable during a split. Between state splits, this constraint is relaxed and the codebook is updated. We thus propose an extension to the SSS algorithm, the Tied-mixture Successive State Splitting (TM-SSS) algorithm. TM-SSS achieves up to about 31% error reduction in comparison with the Maximum Likelihood Successive State Splitting (ML-SSS) algorithm in a word recognition experiment.

  • Adaptive Training for Voice Conversion Based on Eigenvoices

    Yamato OHTANI  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Speech and Hearing

      Vol: E93-D No:6    Page(s): 1589-1598

    In this paper, we describe a novel model training method for one-to-many eigenvoice conversion (EVC). One-to-many EVC is a technique for converting a specific source speaker's voice into an arbitrary target speaker's voice. An eigenvoice Gaussian mixture model (EV-GMM) is trained in advance using multiple parallel data sets consisting of utterance pairs of the source speaker and many pre-stored target speakers. The EV-GMM can be adapted to new target speakers using only a few of their arbitrary utterances by estimating a small number of adaptive parameters. In the adaptation process, several parameters of the EV-GMM that are kept fixed across different target speakers strongly affect the conversion performance of the adapted model. In order to improve the conversion performance in one-to-many EVC, we propose an adaptive training method for the EV-GMM in which both the fixed parameters and the adaptive parameters are optimized by maximizing a total likelihood function of the EV-GMMs adapted to the individual pre-stored target speakers. We conducted objective and subjective evaluations to demonstrate the effectiveness of the proposed training method. The experimental results show that the proposed adaptive training yields significant quality improvements in the converted speech.
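
    To make the eigenvoice idea concrete: an adapted mean supervector is modeled as a bias vector plus a weighted sum of eigenvectors, so adaptation reduces to estimating a few weights. The least-squares fit below is a simplified stand-in for the EV-GMM's maximum-likelihood weight estimation; all shapes and names are illustrative.

```python
import numpy as np

def adapt_eigenvoice(bias, eigvecs, target_mean):
    """Fit eigenvoice weights w minimizing
    || target_mean - (bias + eigvecs @ w) ||^2 by least squares, and
    return the adapted mean supervector together with w.
    bias: (D,), eigvecs: (D, J), target_mean: (D,) rough mean
    statistics from a few adaptation utterances (assumed given)."""
    w, *_ = np.linalg.lstsq(eigvecs, target_mean - bias, rcond=None)
    return bias + eigvecs @ w, w
```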

  • Building an Effective Speech Corpus by Utilizing Statistical Multidimensional Scaling Method

    Goshu NAGINO  Makoto SHOZAKAI  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Corpus

      Vol: E91-D No:3    Page(s): 607-614

    This paper proposes a technique for building an effective speech corpus at lower cost by utilizing a statistical multidimensional scaling method, which visualizes multiple HMM acoustic models in a two-dimensional space. First, a small number of voice samples per speaker is collected; speaker-adapted acoustic models trained with the collected utterances are then mapped into the two-dimensional space by the statistical multidimensional scaling method. Next, speakers located in the periphery of the distribution in the plotted map are selected, and a speech corpus is built by collecting enough voice samples from the selected speakers. In an experiment on building an isolated-word speech corpus, the performance of an acoustic model trained with 200 selected speakers was equivalent to that of an acoustic model trained with 533 non-selected speakers, a cost reduction of more than 62%. In an experiment on building a continuous word speech corpus, the performance of an acoustic model trained with 500 selected speakers was equivalent to that of an acoustic model trained with 1179 non-selected speakers, a cost reduction of more than 57%.
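
    A minimal sketch of the mapping and selection steps, assuming a matrix of inter-model distances is already available: classical (Torgerson) MDS stands in here for the paper's statistical multidimensional scaling, and peripheral speakers are picked by their distance from the centroid of the map.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed an (S x S) inter-model distance matrix into dims dimensions
    by classical (Torgerson) MDS."""
    S = D.shape[0]
    J = np.eye(S) - np.ones((S, S)) / S          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dims]          # keep the largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def peripheral_speakers(coords, n_select):
    """Pick the n_select speakers farthest from the centroid of the map,
    i.e. those in the periphery of the plotted distribution."""
    dist = np.linalg.norm(coords - coords.mean(axis=0), axis=1)
    return np.argsort(dist)[::-1][:n_select]
```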

  • Rapid Compensation of Temperature Fluctuation Effect for Multichannel Sound Field Reproduction System

    Yuki YAI  Shigeki MIYABE  Hiroshi SARUWATARI  Kiyohiro SHIKANO  Yosuke TATEKURA  

     
    PAPER

      Vol: E91-A No:6    Page(s): 1329-1336

    In this paper, we propose a computationally efficient method of compensating for temperature fluctuation in transaural stereo. The conventional method can estimate the change in impulse responses caused by temperature fluctuation with high accuracy, but the large amount of computation required makes real-time implementation difficult. Focusing on the fact that the amount of compensation depends on the length of the impulse response, we reduce the required computation by segmenting the impulse response. We segment the impulse responses in the time domain and estimate the effect of temperature fluctuation for each segment. By joining the processed segments, we obtain the compensated impulse response over its whole length. Experimental results show that the proposed method reduces the required computation by a factor of nine without degrading accuracy.
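
    The physical intuition can be sketched as follows: a temperature change rescales the speed of sound, and hence the arrival times in an impulse response, so a compensated response can be computed by resampling the response one segment at a time. The sketch below applies only a uniform speed-of-sound time scaling, which is an assumption; the paper's method estimates the per-segment effect rather than positing it.

```python
import numpy as np

def sound_speed(temp_c):
    """Approximate speed of sound in air [m/s] at temp_c degrees Celsius."""
    return 331.3 + 0.606 * temp_c

def compensate_ir(ir, temp_meas, temp_ref, seg_len=512):
    """Resample an impulse response measured at temp_ref so that it
    approximates the response at temp_meas, processing one segment at
    a time as in a segment-wise scheme (illustrative only)."""
    ratio = sound_speed(temp_meas) / sound_speed(temp_ref)
    t = np.arange(len(ir), dtype=float)
    out = np.empty(len(ir), dtype=float)
    for start in range(0, len(ir), seg_len):
        seg = slice(start, min(start + seg_len, len(ir)))
        # arrival times stretch or shrink by the ratio of sound speeds
        out[seg] = np.interp(t[seg] * ratio, t, ir)
    return out
```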

  • Fast Convergence Blind Source Separation Using Frequency Subband Interpolation by Null Beamforming

    Keiichi OSAKO  Yoshimitsu MORI  Yu TAKAHASHI  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    LETTER

      Vol: E91-A No:6    Page(s): 1357-1361

    We propose a new algorithm for the blind source separation (BSS) approach in which independent component analysis (ICA) and frequency subband beamforming interpolation are combined. The slow convergence of the optimization of the separation filters is a problem in ICA. Our approach to resolving this problem is based on the relationship between ICA and null beamforming (NBF). The proposed method consists of three parts: (I) a frequency subband selector part for ICA learning, (II) a frequency-domain ICA part with direction-of-arrival (DOA) estimation of the sound sources, and (III) an interpolation part using null beamforming constructed from the estimated DOAs. The results of signal separation experiments under a reverberant condition reveal that the convergence speed is superior to that of conventional ICA-based BSS methods.
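
    A minimal per-bin sketch of the null beamforming used for interpolation, assuming a uniform linear array and far-field sources: the weights pass the target DOA with unit gain and place a null at the interferer DOA. The array geometry, DOAs, and frequency in the example are toy values, not the paper's setup.

```python
import numpy as np

def steering_vector(theta_deg, n_mics, d, freq, c=343.0):
    """Far-field steering vector of a uniform linear array
    (d: microphone spacing in meters, theta measured from broadside)."""
    delays = np.arange(n_mics) * d * np.sin(np.deg2rad(theta_deg)) / c
    return np.exp(-2j * np.pi * freq * delays)

def null_beamformer(theta_target, theta_null, n_mics, d, freq):
    """Weights with unit gain toward theta_target and a null toward
    theta_null: solve A^H w = [1, 0]^T (minimum-norm least squares)."""
    A = np.column_stack([steering_vector(t, n_mics, d, freq)
                         for t in (theta_target, theta_null)])
    w, *_ = np.linalg.lstsq(A.conj().T, np.array([1.0, 0.0], dtype=complex),
                            rcond=None)
    return w

# Toy example: 2-mic array, 4 cm spacing, 2 kHz bin
w = null_beamformer(theta_target=0.0, theta_null=40.0, n_mics=2, d=0.04, freq=2000.0)
```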

  • Non-Audible Murmur (NAM) Recognition

    Yoshitaka NAKAJIMA  Hideki KASHIOKA  Nick CAMPBELL  Kiyohiro SHIKANO  

     
    PAPER

      Vol: E89-D No:1    Page(s): 1-8

    We propose a new practical input interface for the recognition of Non-Audible Murmur (NAM), which is defined as articulated respiratory sound without vocal-fold vibration, transmitted through the soft tissues of the head. We developed a microphone attachment that adheres to the skin by applying the principle of a medical stethoscope, found the ideal position for sampling flesh-conducted NAM sound vibration, and retrained an acoustic model with NAM samples. Then, using the Julius Japanese Dictation Toolkit, we tested the feasibility of using this method in place of an external microphone for analyzing air-conducted voice sound.

  • Improving Rapid Unsupervised Speaker Adaptation Based on HMM-Sufficient Statistics in Noisy Environments Using Multi-Template Models

    Randy GOMEZ  Akinobu LEE  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Speech Recognition

      Vol: E89-D No:3    Page(s): 998-1005

    This paper describes a multi-template unsupervised speaker adaptation method based on HMM-Sufficient Statistics that improves adaptation performance while keeping adaptation time within a few seconds, using just one arbitrary utterance. The adaptation scheme is composed of two processes. The first is performed offline and involves the training of multiple class-dependent acoustic models and the creation of speakers' HMM-Sufficient Statistics based on gender and age. The second is performed online, where adaptation begins with a single utterance of a test speaker. From this utterance, the system classifies the speaker's class and selects the N-best neighbor speakers closest to the utterance using Gaussian Mixture Models (GMM). The classified speaker class's template model is then adopted as a base model, from which the adapted model is rapidly constructed using the N-best neighbor speakers' HMM-Sufficient Statistics. Experiments are performed in office, crowd, booth, and car noise at 20 dB, 15 dB, and 10 dB SNR. The proposed multi-template method achieved an 89.5% word accuracy rate, compared with 88.1% for the conventional single-template method and a baseline recognition rate of 86.4% without adaptation. Comparisons with Vocal Tract Length Normalization (VTLN) and supervised Maximum Likelihood Linear Regression (MLLR) are also reported.
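
    A minimal sketch of the online part, under stated assumptions: pre-stored speakers are ranked by the GMM log-likelihood of the single adaptation utterance, and Gaussian means are rebuilt by pooling the selected speakers' zeroth- and first-order statistics. The data structures (speaker_gmms, stats) are hypothetical, and the actual system adapts full HMMs rather than just the means.

```python
import numpy as np

def select_neighbors(feats, speaker_gmms, n_best=40):
    """Rank pre-stored speakers by the GMM log-likelihood of the single
    adaptation utterance and keep the N best. speaker_gmms is a
    hypothetical dict {speaker_id: scorer}, where scorer(feats) returns
    the utterance log-likelihood under that speaker's GMM."""
    scores = {spk: score(feats) for spk, score in speaker_gmms.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_best]

def adapted_means(stats, neighbors):
    """Rebuild Gaussian mean vectors from pooled HMM-Sufficient
    Statistics: stats[spk] = (gamma, gamma_x), with per-Gaussian
    occupancies gamma (K,) and weighted feature sums gamma_x (K, D)."""
    gamma = sum(stats[s][0] for s in neighbors)
    gamma_x = sum(stats[s][1] for s in neighbors)
    return gamma_x / gamma[:, None]
```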

  • Reducing Computation Time of the Rapid Unsupervised Speaker Adaptation Based on HMM-Sufficient Statistics

    Randy GOMEZ  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Speech and Hearing

      Vol: E90-D No:2    Page(s): 554-561

    In real-time speech recognition applications, there is a need for a fast and reliable adaptation algorithm. We propose a method to reduce the adaptation time of rapid unsupervised speaker adaptation based on HMM-Sufficient Statistics. We use only a single arbitrary utterance, without transcriptions, to select the N-best speakers' Sufficient Statistics, created offline, to provide data for adaptation to a target speaker. Reducing N further shortens adaptation time, but degrades recognition performance because the data become insufficient to robustly adapt the model. Linear interpolation with the global HMM-Sufficient Statistics offsets this negative effect and achieves a 50% reduction in adaptation time without compromising recognition performance. Furthermore, we compared our method with Vocal Tract Length Normalization (VTLN), Maximum A Posteriori (MAP) and Maximum Likelihood Linear Regression (MLLR) adaptation, and tested in office, car, crowd and booth noise environments at 10 dB, 15 dB, 20 dB and 25 dB SNRs.

  • Blind Source Separation of Acoustic Signals Based on Multistage ICA Combining Frequency-Domain ICA and Time-Domain ICA

    Tsuyoki NISHIKAWA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Digital Signal Processing

      Vol: E86-A No:4    Page(s): 846-858

    We propose a new algorithm for blind source separation (BSS), in which frequency-domain independent component analysis (FDICA) and time-domain ICA (TDICA) are combined to achieve superior source-separation performance under reverberant conditions. Generally speaking, conventional TDICA fails to separate source signals under heavily reverberant conditions because of the slow convergence of the iterative learning of the inverse of the mixing system. On the other hand, the separation performance of conventional FDICA also degrades significantly because the independence assumption for narrow-band signals collapses when the number of subbands increases. In the proposed method, the separated signals of FDICA are regarded as the input signals for TDICA, and TDICA then removes the residual crosstalk components left by FDICA. The experimental results obtained under a reverberant condition reveal that the separation performance of the proposed method is superior to those of TDICA- and FDICA-based BSS methods.

  • Japanese Phonetic Typewriter Using HMM Phone Recognition and Stochastic Phone-Sequence Modeling

    Takeshi KAWABATA  Toshiyuki HANAZAWA  Katsunobu ITOH  Kiyohiro SHIKANO  

     
    PAPER-Dictation Systems

      Vol: E74-A No:7    Page(s): 1783-1787

    A phonetic typewriter is an unlimited-vocabulary continuous speech recognition system that recognizes each phone in speech without the need for lexical information. This paper describes a Japanese phonetic typewriter system based on HMM phone recognition and syllable-based stochastic phone-sequence modeling. Even though HMM methods have considerable capacity for recognizing speech, it is difficult to recognize individual phones in continuous speech without lexical information. HMM phone recognition is improved by incorporating syllable trigrams for phone-sequence modeling. HMM phone units are trained using an isolated-word database, and their duration parameters are modified according to speaking rate. Syllable trigram tables are built from a text database of over 300,000 syllables, and phone-sequence probabilities calculated from the trigrams are combined with the HMM probabilities. Using these probabilities to limit the number of intermediate candidates leads to an accurate phonetic typewriter system without requiring excessive computation time. An interpolated n-gram approach to phone-sequence modeling is shown to be more effective than a simple trigram method.

  • Connectionist Approaches to Large Vocabulary Continuous Speech Recognition

    Hidefumi SAWAI  Yasuhiro MINAMI  Masanori MIYATAKE  Alex WAIBEL  Kiyohiro SHIKANO  

     
    PAPER-Continuous Speech Recognition

      Vol: E74-A No:7    Page(s): 1834-1844

    This paper describes recent progress in a connectionist large-vocabulary continuous speech recognition system integrating speech recognition and language processing. The speech recognition part consists of Large Phonemic Time-Delay Neural Networks (TDNNs), which can automatically spot all 24 Japanese phonemes (i.e., 18 consonants /b/, /d/, /g/, /p/, /t/, /k/, /m/, /n/, /N/, /s/, /sh/ ([ʃ]), /h/, /z/, /ch/ ([tʃ]), /ts/, /r/, /w/, /y/ ([j]), 5 vowels /a/, /i/, /u/, /e/, /o/, and a double consonant /Q/ or silence) by simply scanning the input speech without any specific segmentation techniques. The language processing part is made up of a predictive LR parser, in which the LR parser is guided by an LR parsing table automatically generated from context-free grammar rules and proceeds left-to-right without backtracking. Time alignment between the predicted phonemes and the sequence of TDNN phoneme outputs is carried out by DTW matching. We call this 'hybrid' integrated recognition system the 'TDNN-LR' method. We report that large-vocabulary isolated-word and continuous speech recognition using the TDNN-LR method provided excellent speaker-dependent recognition performance, where incremental training using a small number of training tokens was found to be very effective for adapting to speaking rate. Furthermore, we report some new achievements as extensions of the TDNN-LR method: (1) two proposed NN architectures provide robust phoneme recognition performance under variations in speaking manner, (2) a speaker-adaptation technique can be realized using an NN mapping function between input and standard speakers, and (3) new architectures proposed for speaker-independent recognition provide performance that nearly matches speaker-dependent recognition performance.

  • Overdetermined Blind Separation for Real Convolutive Mixtures of Speech Based on Multistage ICA Using Subarray Processing

    Tsuyoki NISHIKAWA  Hiroshi ABE  Hiroshi SARUWATARI  Kiyohiro SHIKANO  Atsunobu KAMINUMA  

     
    PAPER-Speech/Acoustic Signal Processing

      Vol: E87-A No:8    Page(s): 1924-1932

    We propose a new algorithm for overdetermined blind source separation (BSS) based on multistage independent component analysis (MSICA). To improve separation performance, we previously proposed MSICA, in which frequency-domain ICA and time-domain ICA are cascaded. The original MSICA assumed a specific mixing model in which the number of microphones equals the number of sources. However, additional microphones are required to achieve improved separation performance under reverberant environments, and they introduce new problems, e.g., a more complicated permutation problem. To solve these, we propose an extended MSICA using subarray processing, where the number of microphones and the number of sources are set to be the same in every subarray. The experimental results obtained in a real environment reveal that the separation performance of the proposed MSICA improves as the number of microphones increases.

  • Utterance-Based Selective Training for the Automatic Creation of Task-Dependent Acoustic Models

    Tobias CINCAREK  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Speech Recognition

      Vol: E89-D No:3    Page(s): 962-969

    To obtain a robust acoustic model for a certain speech recognition task, a large amount of speech data is necessary. However, the preparation of speech data, including recording and transcription, is very costly and time-consuming. Although there are attempts to build generic acoustic models that are portable among different applications, speech recognition performance is typically task-dependent. This paper introduces a method for automatically building task-dependent acoustic models based on selective training. Instead of setting up a new database, only a small amount of task-specific development data needs to be collected. Based on the likelihood of the target model parameters given this development data, utterances which are acoustically close to the development data are selected from existing speech data resources. Since there are in general too many possibilities for selecting a data subset from a larger database, a heuristic has to be employed. The proposed algorithm deletes single utterances temporarily or alternates between successive deletion and addition of multiple utterances. In order to make selective training computationally practical, model retraining and likelihood calculation need to be fast. It is shown that the model likelihood can be calculated quickly and easily from sufficient statistics, without the need for explicit reconstruction of model parameters. The algorithm is applied to obtain infant- and elderly-dependent acoustic models with only very little development data available. There is an improvement in word accuracy of up to 9% in comparison to conventional EM training without selection. Furthermore, the approach also outperformed MLLR and MAP adaptation with the development data.
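
    A minimal sketch of the single-deletion heuristic, under stated assumptions: model_from_stats (rebuilds a model from pooled sufficient statistics) and dev_llh (scores the development data) are hypothetical helpers, and the actual algorithm also alternates deletion and addition of multiple utterances.

```python
def selective_training(utt_stats, dev_llh, model_from_stats):
    """Greedily drop an utterance's sufficient statistics whenever
    removing it raises the development-set likelihood; stop when no
    single deletion helps. utt_stats maps utterance ids to their
    sufficient statistics (layout assumed)."""
    selected = dict(utt_stats)
    improved = True
    while improved:
        improved = False
        base = dev_llh(model_from_stats(selected.values()))
        for uid in list(selected):
            trial = {k: v for k, v in selected.items() if k != uid}
            if dev_llh(model_from_stats(trial.values())) > base:
                del selected[uid]       # keep the deletion permanently
                improved = True
                break
    return selected
```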

  • Speaker Weighted Training of HMM Using Multiple Reference Speakers

    Hiroaki HATTORI  Satoshi NAKAMURA  Kiyohiro SHIKANO  Shigeki SAGAYAMA  

     
    PAPER-Speech Processing

      Vol: E76-D No:2    Page(s): 219-226

    This paper proposes a new speaker adaptation method using a speaker weighting technique for multiple-reference-speaker training of a hidden Markov model (HMM). The proposed method considers the similarities between an input speaker and multiple reference speakers, and uses these similarities to control the influence of the reference speakers upon the HMM. Evaluation experiments were carried out on a /b, d, g, m, n, N/ phoneme recognition task using 8 speakers. Average recognition rates were 68.0%, 66.4%, and 65.6% for three test sets with different speech styles. These were 4.8%, 8.8%, and 10.5% higher than the rates of the spectrum mapping method, and 1.6%, 6.7%, and 8.2% higher than the rates of multiple-reference-speaker training (the supplemented HMM). The experiments clarified the effectiveness of the proposed method.
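
    The weighting idea can be sketched as follows, assuming per-speaker Gaussian occupancy statistics are available: each reference speaker's statistics are scaled by a similarity weight before pooling, so closer speakers dominate the retrained means. The statistics layout and normalization are illustrative; the paper defines its own similarity measure.

```python
import numpy as np

def speaker_weighted_mean(ref_stats, similarities):
    """Pool per-speaker Gaussian statistics with similarity weights.
    ref_stats: list of (gamma (K,), gamma_x (K, D)) per reference
    speaker; similarities: nonnegative weights, one per speaker."""
    w = np.asarray(similarities, dtype=float)
    w /= w.sum()                              # normalize the weights
    gamma = sum(wi * g for wi, (g, _) in zip(w, ref_stats))
    gamma_x = sum(wi * gx for wi, (_, gx) in zip(w, ref_stats))
    return gamma_x / gamma[:, None]           # weighted Gaussian means
```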

  • Multistage SIMO-Model-Based Blind Source Separation Combining Frequency-Domain ICA and Time-Domain ICA

    Satoshi UKAI  Tomoya TAKATANI  Hiroshi SARUWATARI  Kiyohiro SHIKANO  Ryo MUKAI  Hiroshi SAWADA  

     
    PAPER

      Vol: E88-A No:3    Page(s): 642-650

    In this paper, single-input multiple-output (SIMO)-model-based blind source separation (BSS) is addressed, where unknown mixed source signals are detected at the microphones and can be separated not into monaural source signals but into SIMO-model-based signals from independent sources as they appear at the microphones. This technique is highly applicable to high-fidelity signal processing such as binaural signal processing. First, we provide an experimental comparison between two kinds of SIMO-model-based BSS methods, namely, conventional frequency-domain ICA with projection-back processing (FDICA-PB) and SIMO-ICA, which was recently proposed by the authors. Secondly, we propose a new technique combining FDICA-PB and SIMO-ICA, which can achieve a higher separation performance than either method alone. The experimental results reveal that the accuracy of the separated SIMO signals of simple SIMO-ICA is inferior to that of the signals obtained by FDICA-PB under low-quality initial value conditions, but the proposed combination technique outperforms both simple FDICA-PB and SIMO-ICA.
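
    A minimal per-bin sketch of the projection-back step, assuming the frequency-domain unmixing matrix W has already been estimated by ICA: the SIMO image of each separated source at every microphone is obtained by scaling the separated signal with the corresponding column of the inverse of W.

```python
import numpy as np

def projection_back(Y, W):
    """Project separated signals back to the microphones for one
    frequency bin (FDICA-PB): with Y = W @ X (sources x frames) and
    A = W^{-1}, the SIMO image of source k at microphone m is
    A[m, k] * Y[k].  Returns images[k, m, t]."""
    A = np.linalg.inv(W)                  # estimated mixing matrix
    return np.einsum('mk,kt->kmt', A, Y)
```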

  • Esophageal Speech Enhancement Based on Statistical Voice Conversion with Gaussian Mixture Models

    Hironori DOI  Keigo NAKAMURA  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Voice Conversion

      Vol: E93-D No:9    Page(s): 2472-2482

    This paper presents a novel method of enhancing esophageal speech using statistical voice conversion. Esophageal speech is one of the alternative speaking methods for laryngectomees. Although it does not require any external devices, the generated voices usually sound unnatural compared with normal speech. To improve the intelligibility and naturalness of esophageal speech, we propose a voice conversion method from esophageal speech into normal speech. A spectral parameter and excitation parameters of the target normal speech are separately estimated from a spectral parameter of the esophageal speech based on Gaussian mixture models. The experimental results demonstrate that the proposed method yields significant improvements in intelligibility and naturalness. We also apply one-to-many eigenvoice conversion to esophageal speech enhancement to make it possible to flexibly control the voice quality of the enhanced speech.
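
    The core mapping in GMM-based statistical voice conversion can be sketched as the minimum mean-square error estimate E[y|x] under a joint source-target GMM, as below. This frame-by-frame mapping is a simplification (the paper's system estimates spectral and excitation parameters separately and adds further machinery); all shapes are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def convert_frame(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """MMSE mapping E[y|x] under a joint GMM over stacked [x; y].
    Shapes: weights (K,), mu_x/mu_y (K, D), cov_xx/cov_yx (K, D, D)."""
    K = len(weights)
    # posterior component responsibilities P(k | x)
    resp = np.array([weights[k] * multivariate_normal.pdf(x, mu_x[k], cov_xx[k])
                     for k in range(K)])
    resp /= resp.sum()
    # mixture of per-component conditional means
    y = np.zeros_like(mu_y[0])
    for k in range(K):
        y += resp[k] * (mu_y[k] + cov_yx[k] @ np.linalg.solve(cov_xx[k], x - mu_x[k]))
    return y
```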

  • Speech Prior Estimation for Generalized Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator

    Ryo WAKISAKA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  Tomoya TAKATANI  

     
    LETTER-Engineering Acoustics

      Vol: E95-A No:2    Page(s): 591-595

    In this paper, we introduce a generalized minimum mean-square error short-time spectral amplitude estimator with a new prior estimation of the speech probability density function based on moment-cumulant transformation. Objective and subjective evaluation experiments show the improved noise reduction performance of the proposed method.

  • Direction of Arrival Estimation Using Nonlinear Microphone Array

    Hidekazu KAMIYANAGIDA  Hiroshi SARUWATARI  Kazuya TAKEDA  Fumitada ITAKURA  Kiyohiro SHIKANO  

     
    PAPER

      Vol: E84-A No:4    Page(s): 999-1010

    This paper describes a new method for estimating the direction of arrival (DOA) using a nonlinear microphone array system based on complementary beamforming. Complementary beamforming is based on two types of beamformers designed to obtain directivity patterns complementary to each other. Since the resultant directivity pattern of the system is proportional to the product of these directivity patterns, the proposed method can estimate the DOAs of 2(K-1) sound sources with a K-element microphone array. DOA-estimation experiments were performed using both computer simulation and actual devices in real acoustic environments. The results clarify that DOA estimation for two sound sources can be accomplished by the proposed method with two microphones. Also, by comparing the resolution of DOA estimation by the proposed method with that of the conventional minimum variance method, we show that the performance of the proposed method is superior to that of the minimum variance method under all reverberant conditions.
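
    The paper's complementary-beamforming estimator is not reproduced here; as a reference point, the sketch below computes the conventional minimum variance (Capon) spatial spectrum that serves as the baseline in the comparison, assuming a uniform linear array. Peaks of the spectrum indicate the DOAs.

```python
import numpy as np

def mv_spectrum(R, thetas_deg, n_mics, d, freq, c=343.0):
    """Minimum variance (Capon) spatial spectrum for a uniform linear
    array; R is the (M, M) spatial covariance of the microphone
    signals at one frequency bin, d the mic spacing in meters."""
    Rinv = np.linalg.inv(R)
    spectrum = []
    for theta in thetas_deg:
        tau = np.arange(n_mics) * d * np.sin(np.deg2rad(theta)) / c
        a = np.exp(-2j * np.pi * freq * tau)        # steering vector
        spectrum.append(1.0 / np.real(a.conj() @ Rinv @ a))
    return np.array(spectrum)                       # peaks mark DOAs
```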
