Comparative Evaluation of Diverse Features in Fluency Evaluation of Spontaneous Speech

Huaijin DENG, Takehito UTSURO, Akio KOBAYASHI, Hiromitsu NISHIZAKI

Summary:

There have been many previous studies on fluency evaluation of spontaneous speech. However, most of them focus on lexical cues, and little attention has been paid to how diverse acoustic features and deep end-to-end models contribute to improving performance. In this paper, we describe a multi-layer neural network that uses not only lexical features extracted from transcriptions but also utterance-level acoustic features extracted from the audio data. We also conduct experiments to investigate the performance of end-to-end approaches that take the mel-spectrogram as input. As the speech fluency evaluation task, we evaluate the proposed method on two binary classification tasks: fluent speech detection and disfluent speech detection. Speech segments of around 10 seconds each, annotated with one of the three classes “fluent,” “neutral,” and “disfluent,” are used for evaluation. According to the two-way splits of these three classes, fluent speech detection is defined as the binary classification of fluent vs. neutral and disfluent, while disfluent speech detection is defined as the binary classification of fluent and neutral vs. disfluent. We then conduct experiments for a comparative evaluation of the multi-layer neural network with diverse features as well as of the end-to-end models. For fluent speech detection, among the utterance-level disfluency-based, prosodic, and acoustic features used with the multi-layer neural network, disfluency-based and prosodic features alone perform better: performance improves considerably when all acoustic features are removed from the full feature set, whereas it degrades considerably when filler-related features are removed. Overall, however, the end-to-end Transformer+VGGNet model with the mel-spectrogram achieves the best results. For disfluent speech detection, the multi-layer neural network using disfluency-based, prosodic, and acoustic features excluding fillers achieves the best results. The end-to-end Transformer+VGGNet architecture also obtains high scores, but it is outperformed by the best multi-layer neural network results with a statistically significant difference. Thus, unlike in fluent speech detection, disfluency-based and prosodic features other than fillers remain necessary for disfluent speech detection.
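
To make the setup concrete, below is a minimal sketch (not the authors' implementation) of the two binary label splits and of a small multi-layer feed-forward classifier over a concatenated utterance-level feature vector. It is written in PyTorch; the feature dimension, layer sizes, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Three annotated classes per roughly 10-second utterance.
FLUENT, NEUTRAL, DISFLUENT = 0, 1, 2

def fluent_detection_label(c: int) -> int:
    """Fluent speech detection: fluent vs. {neutral, disfluent}."""
    return 1 if c == FLUENT else 0

def disfluent_detection_label(c: int) -> int:
    """Disfluent speech detection: {fluent, neutral} vs. disfluent."""
    return 1 if c == DISFLUENT else 0

class FluencyMLP(nn.Module):
    """Multi-layer network over one concatenated utterance-level feature
    vector (disfluency-based + prosodic + acoustic features)."""
    def __init__(self, in_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # two logits for one binary task
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# One feature vector per utterance; the 64-dim size is hypothetical.
model = FluencyMLP(in_dim=64)
logits = model(torch.randn(8, 64))  # a batch of 8 utterances
```

A companion sketch of the end-to-end direction: a VGG-style convolutional front-end over the mel-spectrogram followed by a Transformer encoder, mirroring the Transformer+VGGNet pattern named above. The depths, widths, and signal parameters here are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torchaudio

# 80-band mel-spectrogram of 16 kHz audio (parameters assumed).
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)

class VGGTransformer(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.vgg = nn.Sequential(  # two VGG-style conv blocks
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # After two 2x poolings: 64 channels x (80/4) mel bands per frame.
        self.proj = nn.Linear(64 * (80 // 4), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)  # binary fluency decision

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, frames)
        h = self.vgg(spec)                      # (batch, 64, n_mels/4, frames/4)
        h = h.permute(0, 3, 1, 2).flatten(2)    # (batch, frames/4, 64*n_mels/4)
        h = self.encoder(self.proj(h))          # per-frame encodings
        return self.head(h.mean(dim=1))         # average-pool over time

wav = torch.randn(1, 16000 * 10)   # ~10 s of 16 kHz audio
spec = mel(wav).unsqueeze(1)       # add a channel dimension
logits = VGGTransformer()(spec)
```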

Publication
IEICE TRANSACTIONS on Information Vol.E106-D No.1 pp.36-45
Publication Date
2023/01/01
Publicized
2022/10/25
Online ISSN
1745-1361
DOI
10.1587/transinf.2022EDP7047
Type of Manuscript
PAPER
Category
Speech and Hearing

Authors

Huaijin DENG
  University of Tsukuba
Takehito UTSURO
  University of Tsukuba
Akio KOBAYASHI
  Tsukuba University of Technology
Hiromitsu NISHIZAKI
  University of Yamanashi
