
Keyword Search Results

[Keyword] speech intelligibility (6 hits)

Results 1-6 of 6
  • A Deep Learning-Based Approach to Non-Intrusive Objective Speech Intelligibility Estimation

    Deokgyu YUN  Hannah LEE  Seung Ho CHOI  

     
    LETTER-Speech and Hearing

      Publicized:
    2018/01/09
      Vol:
    E101-D No:4
      Page(s):
    1207-1208

    This paper proposes a deep learning-based non-intrusive objective speech intelligibility estimation method built on a recurrent neural network (RNN) with a long short-term memory (LSTM) structure. Conventional non-intrusive estimation methods such as the standard P.563 show poor estimation performance and lack consistency, especially in varied noise and reverberation environments. The proposed method trains the LSTM RNN model parameters using the short-time objective intelligibility (STOI) measure, the standard intrusive intelligibility estimation method, which requires a reference speech signal. The input and output of the LSTM RNN are the MFCC vector and the frame-wise STOI value, respectively. Experimental results show that the proposed objective intelligibility estimation method outperforms the conventional standard P.563 in various noisy and reverberant environments.
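
    Below is a minimal sketch of the kind of model the abstract describes, assuming PyTorch; the 13-dimensional MFCC input, hidden size, and training details are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: a non-intrusive intelligibility estimator that maps MFCC
# frames to frame-wise STOI values with an LSTM, per the abstract above.
# All hyperparameters (13 MFCCs, hidden size 128, one layer) are assumptions.
import torch
import torch.nn as nn

class LstmStoiEstimator(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # one STOI value per frame

    def forward(self, mfcc):              # mfcc: (batch, frames, n_mfcc)
        h, _ = self.lstm(mfcc)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # STOI lies in [0, 1]

# Training targets are frame-wise STOI values computed intrusively from the
# degraded speech and its clean reference; at test time only MFCCs are needed.
model = LstmStoiEstimator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

mfcc = torch.randn(8, 200, 13)        # dummy batch: 8 utterances, 200 frames
stoi_target = torch.rand(8, 200)      # dummy frame-wise STOI labels
optimizer.zero_grad()
loss = loss_fn(model(mfcc), stoi_target)
loss.backward()
optimizer.step()
```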

  • A Speech Intelligibility Estimation Method Using a Non-reference Feature Set

    Toshihiro SAKANO  Yosuke KOBAYASHI  Kazuhiro KONDO  

     
    PAPER

      Vol:
    E98-D No:1
      Page(s):
    21-28

    We proposed and evaluated a speech intelligibility estimation method that does not require a clean speech reference signal. The proposed method uses the features defined in the ITU-T standard P.563, which estimates the overall quality of speech without a reference signal. We selected two sets of features from the P.563 features: the basic 9-feature set, which includes basic features that characterize both speech and background noise, e.g., cepstrum skewness and LPC kurtosis, and the extended 31-feature set with 22 additional features for a more accurate description of the degraded speech and noise, e.g., SNR, average pitch, and spectral clarity, among others. Four hundred noise samples were added to speech, and about 70% of these samples were used to train a support vector regression (SVR) model. The trained models were used to estimate the intelligibility of speech degraded by added noise. The proposed method showed a root mean square error (RMSE) of about 10% and a correlation with subjective intelligibility of about 0.93 for speech distorted with known noise types, and an RMSE of about 16% and a correlation of about 0.84 for speech distorted with unknown noise types, with either the 9- or the 31-dimensional feature set. These results were better than those of estimation using frequency-weighted SNR calculated in critical frequency bands, which requires the clean reference signal for its calculation. We believe this level of accuracy proves the proposed method to be applicable to real-time speech quality monitoring in the field.
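
    A minimal sketch of the regression step described above, assuming scikit-learn and that the 9- or 31-dimensional P.563-derived features have already been extracted; the kernel settings and placeholder data are assumptions, not the paper's exact setup.

```python
# Hedged sketch: support vector regression from non-reference features to
# subjective intelligibility, roughly following the 70%/30% split mentioned above.
# Feature extraction (the P.563 feature sets) is assumed to happen elsewhere;
# X and y here are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 31))        # e.g. 400 noisy samples x 31 features
y = rng.uniform(0.0, 1.0, size=400)   # subjective intelligibility scores

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
rmse = np.sqrt(np.mean((pred - y_te) ** 2))
corr = np.corrcoef(pred, y_te)[0, 1]
print(f"RMSE={rmse:.3f}  correlation={corr:.3f}")
```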

  • Comparison of Output Devices for Augmented Audio Reality

    Kazuhiro KONDO  Naoya ANAZAWA  Yosuke KOBAYASHI  

     
    PAPER-Speech and Hearing

      Vol:
    E97-D No:8
      Page(s):
    2114-2123

    We compared two audio output devices for augmented audio reality applications. In these applications, we plan to use speech annotations on top of the actual ambient environment, so it is essential that the audio output devices deliver intelligible speech annotations along with transparent delivery of the environmental auditory scene. Two candidate devices were compared. The first was the bone-conduction headphone, which delivers speech signals by vibrating the skull while leaving the ear canals open, so normal hearing of the surrounding sound remains intact. The other was the binaural microphone/earphone combo, which has a form factor similar to a regular earphone but integrates a small microphone at the ear canal entrance; the input from these microphones can be fed back to the earphones along with the annotation speech. We also compared these devices to normal hearing (i.e., without headphones or earphones) for reference. We compared speech intelligibility when competing babble noise was presented simultaneously from the surrounding environment. It was found that the binaural combo can generally deliver speech signals at comparable or higher intelligibility than the bone-conduction headphones. However, with the binaural combo, the ear canal transfer characteristics were altered significantly because the earphones occlude the ear canals. Accordingly, when we employed a compensation filter to account for this transfer function deviation, the resultant speech intelligibility was significantly higher. Nonetheless, both devices were found to be acceptable as audio output devices for augmented audio reality applications, since both can deliver speech signals at high intelligibility even when a significant amount of competing noise is present. In fact, both of these speech output methods delivered speech at higher intelligibility than natural speech, especially when the SNR was low.
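
    The compensation step mentioned above could, under simple assumptions, look like the sketch below: the occluded-ear response is measured, compared with the open-ear response, and corrected with an FIR filter. The measurement arrays, sampling rate, and filter length are placeholders, not the paper's data or actual procedure.

```python
# Hedged sketch of the ear-canal compensation idea described above: measure the
# open-ear and occluded-ear (earphone-inserted) magnitude responses, take their
# ratio as the desired correction, and fit a linear-phase FIR filter to it.
import numpy as np
from scipy.signal import firwin2, lfilter

fs = 16000                                       # assumed sampling rate
freqs = np.linspace(0, fs / 2, 257)

# Placeholder magnitude responses (linear scale) at the eardrum position.
H_open = np.ones_like(freqs)                     # open ear canal (reference)
H_occluded = np.ones_like(freqs)                 # canal closed by the earphone
H_occluded[freqs > 2000] *= 0.5                  # toy example of a lost resonance

# Desired compensation: restore the open-ear response, with a floor to avoid
# excessive gain where the occluded response is very small.
gain = H_open / np.maximum(H_occluded, 1e-3)
comp_fir = firwin2(255, freqs / (fs / 2), gain)  # linear-phase FIR fit

speech = np.random.randn(fs)                     # placeholder annotation speech
compensated = lfilter(comp_fir, [1.0], speech)   # played back through the combo
```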

  • Estimation of Speech Intelligibility Using Speech Recognition Systems

    Yusuke TAKANO  Kazuhiro KONDO  

     
    PAPER-Speech and Hearing

      Vol:
    E93-D No:12
      Page(s):
    3368-3376

    We attempted to estimate subjective scores of the Japanese Diagnostic Rhyme Test (DRT), a speech intelligibility test in which the listener makes a forced selection between two candidate words. We used automatic speech recognizers with language models that force the output to be one of the two words in each word pair, mimicking the human recognition process of the DRT. Initial testing was done using speaker-independent models, and these showed significantly lower scores than the subjective scores. The acoustic models were then adapted to each of the speakers in the corpus, and then adapted to noise at a specified SNR. Three different types of noise were tested: white noise, multi-talker (babble) noise, and pseudo-speech noise. The match between subjective and estimated scores improved significantly with the noise-adapted models compared to the speaker-independent and speaker-adapted models when the adapted and tested noise levels matched. However, when the SNR conditions did not match, the recognition scores degraded, especially when the tested SNR was higher than the adapted noise level. Accordingly, we adapted the models to mixed levels of noise, i.e., multi-condition training. The adapted models then showed relatively high intelligibility, matching subjective intelligibility performance over all levels of noise. The correlation between subjective and estimated intelligibility scores increased to 0.94 with multi-talker noise, 0.93 with white noise, and 0.89 with pseudo-speech noise, while the root mean square error (RMSE) was reduced from more than 40 to 13.10, 13.05, and 16.06, respectively.
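
    A rough sketch of the forced-choice estimation idea follows, assuming a hypothetical acoustic scoring function (acoustic_log_likelihood is a stand-in, not a real API): the recognizer is restricted to the two words of each DRT pair, and estimated intelligibility is the fraction of pairs recognized correctly.

```python
# Hedged sketch of the DRT-style estimation described above: a two-alternative
# forced choice between the words of each DRT pair, scored against a
# (noise-adapted) acoustic model, then compared with subjective scores.
import numpy as np

def acoustic_log_likelihood(utterance, word):
    """Hypothetical placeholder: log P(utterance | word, acoustic model)."""
    return float(np.random.randn())

def estimate_drt_intelligibility(trials):
    """trials: list of (utterance, (word_a, word_b), true_word)."""
    correct = 0
    for utt, (word_a, word_b), truth in trials:
        # The 'language model' simply forces one of the two words in the pair.
        chosen = max((word_a, word_b), key=lambda w: acoustic_log_likelihood(utt, w))
        correct += int(chosen == truth)
    return 100.0 * correct / len(trials)

# Comparing estimated scores against subjective DRT scores per noise condition.
subjective = np.array([95.0, 82.0, 61.0, 40.0])   # placeholder values
estimated = np.array([93.0, 80.0, 64.0, 45.0])    # placeholder values
rmse = np.sqrt(np.mean((estimated - subjective) ** 2))
corr = np.corrcoef(estimated, subjective)[0, 1]
print(f"correlation={corr:.2f}  RMSE={rmse:.2f}")
```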

  • Harmonicity Based Dereverberation for Improving Automatic Speech Recognition Performance and Speech Intelligibility

    Keisuke KINOSHITA  Tomohiro NAKATANI  Masato MIYOSHI  

     
    PAPER-Speech Enhancement

      Vol:
    E88-A No:7
      Page(s):
    1724-1731

    A speech signal captured by a distant microphone is generally smeared by reverberation, which severely degrades both speech intelligibility and Automatic Speech Recognition (ASR) performance. Previously, we proposed a single-microphone dereverberation method named "Harmonicity based dEReverBeration (HERB)." HERB estimates the inverse filter of an unknown room transfer function by utilizing an essential feature of speech, namely its harmonic structure. In previous studies, improvements in speech intelligibility were shown solely with spectrograms, and improvements in ASR performance were confirmed only with a matched-condition acoustic model. In this paper, we undertook a further investigation of HERB's potential with regard to these two factors. First, we examined speech intelligibility by means of objective indices. As a result, we found that HERB is capable of improving speech intelligibility to approximately that of clean speech. Second, since HERB alone could not improve ASR performance sufficiently, we further analyzed the HERB mechanism with a view to achieving further improvements. Taking the analysis results into account, we proposed an appropriate ASR configuration and conducted experiments. The experimental results confirmed that, if HERB is used with an ASR adaptation scheme such as MLLR and a multi-condition acoustic model, it is very effective in improving ASR performance even in unknown, severely reverberant environments.
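
    A heavily simplified, hedged sketch of the harmonicity-based idea: a dereverberation filter is estimated from the ratio between a harmonics-enhanced version of the observed spectrum and the observed spectrum itself, averaged over frames. The F0 handling and enhancement below are toy stand-ins, not the actual HERB processing.

```python
# Hedged, toy rendition of a harmonicity-based dereverberation filter estimate.
# Real HERB involves adaptive harmonic filtering and careful filter estimation;
# this only illustrates the "harmonic target / observation ratio" intuition.
import numpy as np
from scipy.signal import stft, istft

def harmonic_mask(frame_spectrum, f0_bin):
    """Toy harmonic enhancement: keep bins near multiples of the F0 bin."""
    mask = np.zeros_like(frame_spectrum, dtype=float)
    for k in range(f0_bin, len(frame_spectrum), max(f0_bin, 1)):
        mask[max(k - 1, 0):k + 2] = 1.0
    return frame_spectrum * mask

def herb_like_dereverberation(x, fs, f0_bins):
    """x: reverberant speech; f0_bins: per-frame F0 estimates (in STFT bins)."""
    f, t, X = stft(x, fs, nperseg=512)
    ratios = []
    for i, f0_bin in enumerate(f0_bins[:X.shape[1]]):
        target = harmonic_mask(X[:, i], f0_bin)      # harmonics-enhanced frame
        ratios.append(target / (X[:, i] + 1e-8))     # per-frame filter estimate
    W = np.mean(ratios, axis=0)                      # averaged inverse-filter estimate
    _, y = istft(X * W[:, None], fs, nperseg=512)    # apply filter, resynthesize
    return y
```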

  • Evaluation of a Novel Signal Processing Strategy for Cochlear Implant Speech Processors

    Erdenebat DASHTSEREN  Shigeyoshi KITAZAWA  Satoshi IWASAKI  Shinya KIRIYAMA  

     
    PAPER-Medical Engineering

      Vol:
    E87-D No:2
      Page(s):
    463-471

    Our study focuses on the evaluation of a novel speech processing strategy for multi-channel cochlear implant speech processors. Stimulation pulse trains for the Nucleus 24CI speech processor were generated in a way different from the speech processing strategies implemented in this processor. The distinctive features of the novel strategy are: 1) the electrode stimulation order is driven by the location of the maximum instantaneous frequency amplitude; 2) stimulation rates vary across electrodes; 3) the number of selected channels varies within each cycle of the signal processing scheme. Tests with a within-subject design on Japanese initial, medial, and final consonants in CV, VCV, and CV/N context tokens were carried out with cochlear implant patients using the Cochlear ACE™ strategy, and the results were compared with those of normal-hearing listeners. Results of the initial and medial consonant tests showed significantly better performance with the novel strategy than with the ACE strategy for both the cochlear implant and normal-hearing listener groups. Results of the final consonant tests showed slightly better performance with the ACE strategy for cochlear implant listeners, while showing slightly better performance with the novel strategy for normal-hearing listeners.
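
    A hedged sketch of what a per-cycle channel selection step with the listed features might look like: a variable number of channels is kept per analysis cycle, and stimulation starts at the channel with the largest instantaneous amplitude. The filterbank size, threshold, and ordering rule are illustrative assumptions, not the strategy's actual design.

```python
# Hedged sketch: per-cycle channel selection and stimulation ordering loosely
# based on the three features listed above. Envelope extraction (the filterbank)
# is assumed to happen elsewhere; values here are placeholders.
import numpy as np

def select_channels(envelopes, threshold=0.1):
    """envelopes: per-channel instantaneous amplitudes for one analysis cycle."""
    active = np.flatnonzero(envelopes > threshold)     # variable channel count
    if active.size == 0:
        return []
    # Start stimulation at the channel with maximum amplitude, then proceed in
    # descending amplitude order (one of several plausible ordering rules).
    order = active[np.argsort(envelopes[active])[::-1]]
    return list(order)

# One processing cycle over a 22-channel filterbank (Nucleus-style placeholder).
cycle_envelopes = np.abs(np.random.randn(22)) * 0.2
stimulation_order = select_channels(cycle_envelopes)
print("channels to stimulate, in order:", stimulation_order)
```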