
Author Search Result

[Author] Oh-Wook KWON (2 hits)

Hits 1-2 of 2
  • Combining Multiple Acoustic Models in GMM Spaces for Robust Speech Recognition

    Byung Ok KANG, Oh-Wook KWON

     
    PAPER-Speech and Hearing

    Publicized: 2015/11/24
    Vol: E99-D No:3
    Page(s): 724-730

    We propose a new method to combine multiple acoustic models in Gaussian mixture model (GMM) spaces for robust speech recognition. Although large vocabulary continuous speech recognition (LVCSR) systems have recently become widespread, they often make egregious recognition errors caused by an unavoidable mismatch in speaking style or environment between training and real conditions. To handle this problem, a multi-style training approach has conventionally been used to train a large acoustic model on a large speech database covering various speaking styles and environmental noise. In this work, by contrast, we combine multiple sub-models trained for different speaking styles or environmental noise into a large acoustic model by maximizing the log-likelihood of the sub-model states that share the same phonetic context and position. The combined acoustic model is then used in a new target system that is robust to variation in speaking style and to diverse environmental noise. Experimental results show that the proposed method significantly outperforms conventional methods in two tasks: non-native English speech recognition for second-language learning systems and noise-robust point-of-interest (POI) recognition for car navigation systems.
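
    The combination step described in the abstract can be illustrated with a small, hedged sketch. The Python code below is an assumption-laden illustration rather than the authors' implementation: for a single tied state (same phonetic context and position) it pools the Gaussian components of several style-dependent sub-model GMMs and picks the pooling weights that maximize log-likelihood on held-out frames. The names (StateGMM, combine_state), the diagonal-covariance form, and the coarse grid search over weights are all illustrative choices, not the paper's exact procedure.

    # Illustrative sketch only -- not the authors' implementation.
    import itertools
    from dataclasses import dataclass

    import numpy as np


    @dataclass
    class StateGMM:
        weights: np.ndarray    # (M,)   mixture weights, sum to 1
        means: np.ndarray      # (M, D) component means
        variances: np.ndarray  # (M, D) diagonal covariances


    def log_likelihood(gmm: StateGMM, frames: np.ndarray) -> float:
        """Total log-likelihood of feature frames (N, D) under a diagonal-covariance GMM."""
        diff = frames[:, None, :] - gmm.means[None, :, :]                    # (N, M, D)
        expo = -0.5 * np.sum(diff ** 2 / gmm.variances, axis=2)              # (N, M)
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * gmm.variances), axis=1)  # (M,)
        log_comp = np.log(gmm.weights) + log_norm + expo                     # (N, M)
        m = log_comp.max(axis=1, keepdims=True)                              # log-sum-exp
        return float(np.sum(m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))))


    def combine_state(sub_gmms: list, heldout: np.ndarray) -> StateGMM:
        """For one tied state, pool the components of two or more sub-model GMMs
        and pick the per-sub-model pooling weights (coarse simplex grid) that
        maximize log-likelihood on held-out frames."""
        steps = np.round(np.linspace(0.1, 0.9, 9), 1)
        best, best_ll = None, -np.inf
        for w in itertools.product(steps, repeat=len(sub_gmms)):
            if abs(sum(w) - 1.0) > 1e-6:       # keep only weights on the simplex
                continue
            pooled = StateGMM(
                weights=np.concatenate([wi * g.weights for wi, g in zip(w, sub_gmms)]),
                means=np.concatenate([g.means for g in sub_gmms]),
                variances=np.concatenate([g.variances for g in sub_gmms]),
            )
            ll = log_likelihood(pooled, heldout)
            if ll > best_ll:
                best, best_ll = pooled, ll
        return best


    # Example: combine two style-dependent GMMs for one tied state.
    rng = np.random.default_rng(0)
    g1 = StateGMM(np.array([0.5, 0.5]), rng.normal(size=(2, 13)), np.ones((2, 13)))
    g2 = StateGMM(np.array([1.0]), rng.normal(size=(1, 13)), np.ones((1, 13)))
    combined = combine_state([g1, g2], heldout=rng.normal(size=(50, 13)))
    print(len(combined.weights))  # pooled model has 3 components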

  • Automatic Construction of a Large-Scale Speech Recognition Database Using Multi-Genre Broadcast Data with Inaccurate Subtitle Timestamps

    Jeong-Uk BANG, Mu-Yeol CHOI, Sang-Hun KIM, Oh-Wook KWON

     
    PAPER-Speech and Hearing

    Publicized: 2019/11/13
    Vol: E103-D No:2
    Page(s): 406-415

    As deep learning-based speech recognition systems gain attention, the need for large-scale speech databases for acoustic model training is increasing. Broadcast data can easily be used for database construction, since it contains transcripts for the hearing impaired. However, subtitle timestamps have not been used to extract speech data because they are often inaccurate due to the inherent characteristics of closed captioning. Thus, we propose to build a large-scale speech database from multi-genre broadcast data with inaccurate subtitle timestamps. The proposed method first extracts the most likely speech intervals by removing subtitle texts with a low subtitle quality index, concatenating adjacent subtitle texts into a merged subtitle text, and adding a margin to the timestamps of the merged subtitle text. Next, a speech recognizer extracts a hypothesis text for the speech segment corresponding to the merged subtitle text, and the hypothesis text obtained from the decoder is recursively aligned with the merged subtitle text. Finally, the speech database is constructed by selecting the sub-parts of the merged subtitle text that match the hypothesis text. Our method successfully refines a large amount of broadcast data with inaccurate subtitle timestamps in about half the time of previous methods. Consequently, it is useful for broadcast data processing, where bulk speech data can be collected every hour.
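
    The refinement pipeline described in the abstract can be sketched in a few lines of Python. The code below is an illustration under stated assumptions, not the authors' system: the Subtitle fields, the quality-index threshold, the gap and margin values, and the use of difflib in place of the paper's recursive alignment are stand-ins, and the recognizer hypothesis is taken as given.

    # Illustrative sketch only -- not the authors' implementation.
    from dataclasses import dataclass
    from difflib import SequenceMatcher


    @dataclass
    class Subtitle:
        start: float    # subtitle start time (s), possibly inaccurate
        end: float      # subtitle end time (s)
        text: str
        quality: float  # assumed per-subtitle quality index in [0, 1]


    def merge_subtitles(subs, min_quality=0.5, max_gap=1.0, margin=0.5):
        """Drop low-quality subtitles, concatenate adjacent ones, and pad the
        merged timestamps with a margin (threshold values are illustrative)."""
        kept = [s for s in subs if s.quality >= min_quality]
        merged = []
        for s in kept:
            if merged and s.start - merged[-1].end <= max_gap:
                last = merged[-1]
                merged[-1] = Subtitle(last.start, s.end, last.text + " " + s.text,
                                      min(last.quality, s.quality))
            else:
                merged.append(s)
        return [Subtitle(max(0.0, s.start - margin), s.end + margin, s.text, s.quality)
                for s in merged]


    def select_matching_words(subtitle_text: str, hypothesis_text: str):
        """Keep only the subtitle words that also appear, in order, in the
        recognizer hypothesis (difflib stands in for the recursive alignment)."""
        ref, hyp = subtitle_text.split(), hypothesis_text.split()
        matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
        keep = []
        for block in matcher.get_matching_blocks():
            keep.extend(ref[block.a:block.a + block.size])
        return keep


    # Example: two adjacent subtitles are merged, padded, and matched against
    # a (hypothetical) recognizer hypothesis for the padded interval.
    subs = [Subtitle(10.2, 12.0, "the weather today", 0.9),
            Subtitle(12.3, 14.1, "will be mostly sunny", 0.8)]
    segment = merge_subtitles(subs)[0]
    print(select_matching_words(segment.text, "weather today will be sunny"))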