Author Search Result

[Author] Shoei SATO (8 hits)

1-8 of 8 hits
  • Filter Bank Subtraction for Robust Speech Recognition

    Kazuo ONOE  Hiroyuki SEGI  Takeshi KOBAYAKAWA  Shoei SATO  Shinichi HOMMA  Toru IMAI  Akio ANDO  

     
    PAPER-Robust Speech Recognition and Enhancement

    Vol: E86-D  No: 3  Page(s): 483-488

    In this paper, we propose a new technique of filter bank subtraction for robust speech recognition under various acoustic conditions. Spectral subtraction is a simple and useful technique for reducing the influence of additive noise. Conventional spectral subtraction assumes accurate estimation of the noise spectrum and no correlation between speech and noise. Those assumptions, however, are rarely satisfied in reality, leading to the degradation of speech recognition accuracy. Moreover, the recognition improvement attained by conventional methods is slight when the input SNR changes sharply. We propose a new method in which the output values of filter banks are used for noise estimation and subtraction. By estimating noise at each filter bank, instead of at each frequency point, the method alleviates the necessity for precise estimation of noise. We also take into consideration expected phase differences between the spectra of speech and noise in the subtraction and control a subtraction coefficient theoretically. Recognition experiments on test sets at several SNRs showed that the filter bank subtraction technique improved the word accuracy significantly and got better results than conventional spectral subtraction on all the test sets. In other experiments, on recognizing speech from TV news field reports with environmental noise, the proposed subtraction method yielded better results than the conventional method.
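The core idea in the abstract above, estimating and subtracting noise per filter bank channel rather than per frequency point, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the function name `filter_bank_subtract`, the noise estimate taken as the mean of a few leading frames, the fixed coefficient `alpha`, and the flooring rule. The paper's phase-difference handling and its theoretically controlled subtraction coefficient are not reproduced.

```python
def filter_bank_subtract(frames, noise_frames=2, alpha=1.0, floor=0.01):
    """Subtract a per-channel noise estimate from filter bank outputs.

    frames: list of frames, each a list of non-negative filter bank energies.
    noise_frames: number of leading frames assumed to contain noise only
                  (an illustrative assumption, not the paper's estimator).
    alpha: fixed subtraction coefficient (the paper controls this differently).
    floor: fraction of the input energy kept as a lower bound.
    """
    n_channels = len(frames[0])
    # One noise value per filter bank channel, not per frequency bin.
    noise = [
        sum(f[c] for f in frames[:noise_frames]) / noise_frames
        for c in range(n_channels)
    ]
    cleaned = []
    for f in frames:
        # Floor the result so that energies never become negative.
        cleaned.append(
            [max(f[c] - alpha * noise[c], floor * f[c]) for c in range(n_channels)]
        )
    return cleaned


frames = [[1.0, 2.0], [1.0, 2.0], [5.0, 4.0]]
cleaned = filter_bank_subtract(frames)
```

In this toy input the first two frames define the noise level, so they are reduced to the floor, while the third frame keeps the energy that exceeds the noise estimate.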

  • Robust Speech Recognition by Using Compensated Acoustic Scores

    Shoei SATO  Kazuo ONOE  Akio KOBAYASHI  Toru IMAI  

     
    PAPER-Speech Recognition

    Vol: E89-D  No: 3  Page(s): 915-921

    This paper proposes a new compensation method of acoustic scores in the Viterbi search for robust speech recognition. This method introduces noise models to represent a wide variety of noises and realizes robust decoding together with conventional techniques of subtraction and adaptation. This method uses likelihoods of noise models in two ways. One is to calculate a confidence factor for each input frame by comparing likelihoods of speech models and noise models. Then the weight of the acoustic score for a noisy frame is reduced according to the value of the confidence factor for compensation. The other is to use the likelihood of a noise model as an alternative to that of a silence model when given noisy input. Since a lower confidence factor compresses acoustic scores, the decoder relies more on language scores and keeps more hypotheses within a fixed search depth for a noisy frame. An experiment using commentary transcriptions of a broadcast sports program (MLB: Major League Baseball) showed that the proposed method obtained a 6.7% relative word error reduction. The method also reduced the relative error rate of key words by 17.9%, and this is expected to lead to an improvement in metadata extraction accuracy.
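The frame-wise compensation described in the abstract above can be sketched as follows. The abstract does not give the formula for the confidence factor, so this sketch assumes a simple one: the posterior probability of speech given the speech-model and noise-model log-likelihoods. The names `confidence_factor`, `compensated_score`, and `lm_weight` are hypothetical.

```python
import math


def confidence_factor(speech_loglik, noise_loglik):
    """Assumed confidence in [0, 1]: posterior of 'speech' versus 'noise'."""
    # Clamp the difference so that math.exp cannot overflow.
    diff = max(-700.0, min(700.0, noise_loglik - speech_loglik))
    return 1.0 / (1.0 + math.exp(diff))


def compensated_score(acoustic_logscore, lm_logscore,
                      speech_loglik, noise_loglik, lm_weight=1.0):
    """Scale the acoustic log score of one frame by its confidence factor.

    A low confidence compresses the acoustic contribution, so the language
    score has relatively more influence on the hypothesis ranking.
    """
    c = confidence_factor(speech_loglik, noise_loglik)
    return c * acoustic_logscore + lm_weight * lm_logscore


clean = compensated_score(-10.0, -2.0, speech_loglik=0.0, noise_loglik=-20.0)
noisy = compensated_score(-10.0, -2.0, speech_loglik=0.0, noise_loglik=0.0)
```

With equal speech and noise likelihoods the confidence is 0.5, so the acoustic log score of the noisy frame counts only half as much as it would for a clean frame.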

  • Bi-Spectral Acoustic Features for Robust Speech Recognition

    Kazuo ONOE  Shoei SATO  Shinichi HOMMA  Akio KOBAYASHI  Toru IMAI  Tohru TAKAGI  

     
    LETTER

    Vol: E91-D  No: 3  Page(s): 631-634

    The extraction of acoustic features for robust speech recognition is very important for improving its performance in realistic environments. The bi-spectrum based on the Fourier transformation of the third-order cumulants expresses the non-Gaussianity and the phase information of the speech signal, showing the dependency between frequency components. In this letter, we propose a method of extracting short-time bi-spectral acoustic features with averaging features in a single frame. Merged with the conventional Mel frequency cepstral coefficients (MFCC) based on the power spectrum by the principal component analysis (PCA), the proposed features reduced the word error rate by 6.9% relative in Japanese broadcast news transcription experiments.
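For readers unfamiliar with the bi-spectrum mentioned in the abstract above, the following sketch computes a direct single-frame estimate, B(f1, f2) = X(f1) X(f2) X*(f1 + f2), where X is the discrete Fourier transform of the frame. This direct estimate is a standard alternative to transforming third-order cumulants and is used here only as an assumption for illustration. The letter's averaging scheme, feature extraction, and PCA merging with MFCC are not reproduced, and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Plain O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def bispectrum(frame):
    """Direct bispectrum estimate of one frame.

    Entry [f1][f2] is X(f1) * X(f2) * conj(X(f1 + f2)), with the frequency
    index f1 + f2 wrapped modulo the frame length.
    """
    n = len(frame)
    X = dft(frame)
    return [
        [X[f1] * X[f2] * X[(f1 + f2) % n].conjugate() for f2 in range(n)]
        for f1 in range(n)
    ]


B_impulse = bispectrum([1.0, 0.0, 0.0, 0.0])
B = bispectrum([1.0, 2.0, 0.5, -1.0])
```

Unlike the power spectrum, each entry depends on the relative phase of three frequency components, which is why the bi-spectrum carries phase information.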

  • Word Error Rate Minimization Using an Integrated Confidence Measure

    Akio KOBAYASHI  Kazuo ONOE  Shinichi HOMMA  Shoei SATO  Toru IMAI  

     
    PAPER-Speech and Hearing

    Vol: E90-D  No: 5  Page(s): 835-843

    This paper describes a new criterion for speech recognition using an integrated confidence measure to minimize the word error rate (WER). The conventional criteria for WER minimization obtain the expected WER of a sentence hypothesis merely by comparing it with other hypotheses in an n-best list. The proposed criterion estimates the expected WER by using an integrated confidence measure with word posterior probabilities for a given acoustic input. The integrated confidence measure, which is implemented as a classifier based on maximum entropy (ME) modeling or support vector machines (SVMs), is used to acquire probabilities reflecting whether the word hypotheses are correct. The classifier combines a variety of confidence measures and can deal with a temporal sequence of them to attain a more reliable confidence. Our proposed criterion for minimizing WER achieved a WER of 9.8%, a 3.9% relative reduction compared with conventional n-best rescoring methods, in transcribing Japanese broadcast news in various environments such as noisy field and spontaneous speech conditions.
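The conventional criterion that the abstract above uses as its baseline, computing the expected word error of each hypothesis against the other hypotheses in an n-best list, can be sketched as follows. This is the baseline only, not the proposed confidence-based criterion. The function names and the assumption that the n-best posteriors are already normalized are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution or match
        prev = cur
    return prev[-1]


def min_expected_error(nbest):
    """Pick the hypothesis with the lowest expected word error.

    nbest: list of (words, posterior) pairs, posteriors assumed to sum to 1.
    """
    def expected_errors(hyp):
        return sum(p * edit_distance(hyp, other) for other, p in nbest)

    return min((hyp for hyp, _ in nbest), key=expected_errors)


nbest = [
    (["a", "b", "c"], 0.4),
    (["a", "x", "c"], 0.3),
    (["a", "x", "d"], 0.3),
]
best = min_expected_error(nbest)
```

In this toy list the hypothesis with the highest posterior is not selected: the second hypothesis sits closer to the others on average, so its expected number of word errors is lower.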

  • Mutual Information Based Dynamic Integration of Multiple Feature Streams for Robust Real-Time LVCSR

    Shoei SATO  Akio KOBAYASHI  Kazuo ONOE  Shinichi HOMMA  Toru IMAI  Tohru TAKAGI  Tetsunori KOBAYASHI  

     
    PAPER-Speech and Hearing

    Vol: E91-D  No: 3  Page(s): 815-824

    We present a novel method of integrating the likelihoods of multiple feature streams, representing different acoustic aspects, for robust speech recognition. The integration algorithm dynamically calculates a frame-wise stream weight so that a higher weight is given to a stream that is robust to a variety of noisy environments or speaking styles. Such a robust stream is expected to show discriminative ability. A conventional method proposed for the recognition of spoken digits calculates the weights from the entropy of the whole set of HMM states. This paper extends the dynamic weighting to a real-time large-vocabulary continuous speech recognition (LVCSR) system. The proposed weight is calculated in real-time from mutual information between an input stream and active HMM states in a search space without an additional likelihood calculation. Furthermore, the mutual information takes the width of the search space into account by calculating the marginal entropy from the number of active states. In this paper, we integrate three features that are extracted through auditory filters by taking into account the human auditory system's ability to extract amplitude and frequency modulations. As a result, features representing energy, amplitude drift, and resonant frequency drift are integrated. These features are expected to provide complementary clues for speech recognition. Speech recognition experiments on field reports and spontaneous commentary from Japanese broadcast news showed that the proposed method reduced word errors by 9.2% in field reports and 4.7% in spontaneous commentaries relative to the best result obtained from a single stream.
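The weighting idea in the abstract above can be illustrated with a minimal sketch. It assumes that the marginal entropy is log N for N active states (a uniform prior), that the conditional entropy is the entropy of the state posterior computed from the already available likelihoods, and that the weights are the per-stream mutual information values normalized to sum to one. These formulas and the function names are assumptions for illustration, not the paper's exact definitions.

```python
import math


def stream_mutual_information(logliks):
    """Assumed per-frame MI between one stream and its active HMM states.

    logliks: log-likelihoods of the current frame for each active state.
    Returns log(N) minus the entropy of the state posterior, clamped at 0.
    """
    m = max(logliks)
    exps = [math.exp(l - m) for l in logliks]  # shift for numerical stability
    z = sum(exps)
    post = [e / z for e in exps]
    cond_entropy = -sum(p * math.log(p) for p in post if p > 0.0)
    return max(0.0, math.log(len(logliks)) - cond_entropy)


def stream_weights(per_stream_logliks):
    """Normalize per-stream MI values into frame-wise stream weights."""
    mi = [stream_mutual_information(l) for l in per_stream_logliks]
    total = sum(mi)
    if total <= 1e-12:
        # No stream discriminates between states: fall back to equal weights.
        return [1.0 / len(mi)] * len(mi)
    return [v / total for v in mi]


weights = stream_weights([[0.0, 0.0], [0.0, -50.0]])
flat = stream_weights([[0.0, 0.0], [1.0, 1.0]])
```

A stream whose likelihoods are nearly identical across the active states carries little information about the state and receives a low weight, while a stream that clearly separates the states receives a high weight.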

  • Online Speech Detection and Dual-Gender Speech Recognition for Captioning Broadcast News

    Toru IMAI  Shoei SATO  Shinichi HOMMA  Kazuo ONOE  Akio KOBAYASHI  

     
    PAPER-Speech and Hearing

    Vol: E90-D  No: 8  Page(s): 1286-1291

    This paper describes a new method to detect speech segments online while identifying gender attributes for efficient dual gender-dependent speech recognition and broadcast news captioning. The proposed online speech detection performs dual-gender phoneme recognition and detects a start-point and an end-point based on the ratio between the cumulative phoneme likelihood and the cumulative non-speech likelihood with a very small delay from the audio input. While obtaining the speech segments, the phoneme recognizer also identifies gender attributes with high discrimination in order to guide the subsequent dual-gender continuous speech recognizer efficiently. As soon as the start-point is detected, the continuous speech recognizer with parallel gender-dependent acoustic models starts a search and allows search transitions between male and female in a speech segment based on the gender attributes. Speech recognition experiments on conversational commentaries and field reporting from Japanese broadcast news showed that the proposed speech detection method was effective in reducing the false rejection rate from 4.6% to 0.53% and also in reducing recognition errors in comparison with a conventional method using adaptive energy thresholds. It was also effective in identifying the gender attributes, whose correct rate was 99.7% of words. With the new speech detection and the gender identification, the proposed dual-gender speech recognition significantly reduced the word error rate by 11.2% relative to a conventional gender-independent system, while keeping the computational cost feasible for real-time operation.
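The start-point decision described in the abstract above, based on the ratio between cumulative phoneme and non-speech likelihoods, can be sketched in the log domain as a running sum of per-frame log-likelihood differences. The reset to zero when the sum turns negative, the threshold value, and the name `detect_start` are assumptions for illustration. The paper's phoneme recognizer, end-point detection, and gender identification are not reproduced.

```python
def detect_start(speech_loglik, nonspeech_loglik, threshold=5.0):
    """Return the index of the frame at which speech is declared, or None.

    speech_loglik / nonspeech_loglik: per-frame log-likelihoods from a speech
    (phoneme) model and a non-speech model, assumed to be given.
    """
    cumulative = 0.0
    for t, (s, n) in enumerate(zip(speech_loglik, nonspeech_loglik)):
        # Accumulate the log-likelihood ratio; reset when non-speech dominates
        # so that a long silence does not delay later detection.
        cumulative = max(0.0, cumulative + (s - n))
        if cumulative >= threshold:
            return t
    return None


speech = [-5.0, -5.0, -1.0, -1.0, -1.0]
nonspeech = [-1.0, -1.0, -4.0, -4.0, -4.0]
start = detect_start(speech, nonspeech)
```

The returned index is the frame at which the evidence crosses the threshold, which lags the true onset by a few frames. A real system would typically back off from that frame to recover the start of the segment.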

  • Simultaneous Subtitling System for Broadcast News Programs with a Speech Recognizer

    Akio ANDO  Toru IMAI  Akio KOBAYASHI  Shinichi HOMMA  Jun GOTO  Nobumasa SEIYAMA  Takeshi MISHIMA  Takeshi KOBAYAKAWA  Shoei SATO  Kazuo ONOE  Hiroyuki SEGI  Atsushi IMAI  Atsushi MATSUI  Akira NAKAMURA  Hideki TANAKA  Tohru TAKAGI  Eiichi MIYASAKA  Haruo ISONO  

     
    INVITED PAPER

    Vol: E86-D  No: 1  Page(s): 15-25

    There is a strong demand to expand captioned broadcasting for TV news programs in Japan. However, keyboard entry of captioned manuscripts for news programs cannot keep pace with the speed of speech, because in the case of Japanese it takes time to select the correct characters from among homonyms. In order to implement simultaneous subtitled broadcasting for Japanese news programs, a simultaneous subtitling system by speech recognition has been developed. This system consists of a real-time speech recognition system to handle broadcast news transcription and a recognition-error correction system that manually corrects mistakes in the recognition result with a short delay. NHK started simultaneous subtitled broadcasting for the news program "News 7" on the evening of March 27, 2000.

  • Learning Speech Variability in Discriminative Acoustic Model Adaptation

    Shoei SATO  Takahiro OKU  Shinichi HOMMA  Akio KOBAYASHI  Toru IMAI  

     
    PAPER-Adaptation

    Vol: E93-D  No: 9  Page(s): 2370-2378

    We present a new discriminative method of acoustic model adaptation that deals with task-dependent speech variability. We have focused on differences in expressions or speaking styles between tasks and set the objective of this method as improving the recognition accuracy of indistinctly pronounced phrases dependent on a speaking style. The adaptation appends subword models for frequently observable variants of subwords in the task. To find the task-dependent variants, low-confidence words are statistically selected from words with higher frequency in the task's adaptation data by using their word lattices. HMM parameters of subword models dependent on the words are discriminatively trained by using linear transforms with a minimum phoneme error (MPE) criterion. For the MPE training, subword accuracy discriminating between the variants and the originals is also investigated. In speech recognition experiments, the proposed adaptation with the subword variants reduced the word error rate by 12.0% relative in a Japanese conversational broadcast task.