As one of the popular topics in human-computer interaction, Speech Emotion Recognition (SER) aims to classify the emotional tendency of speakers' utterances. Existing deep learning methods can achieve highly accurate results when a large amount of training data is available. Unfortunately, building a huge emotional speech database that is universally applicable is a time-consuming and difficult job. The Siamese Neural Network (SNN) discussed in this paper, however, can yield precise results with only a limited amount of training data, because pairwise training mitigates the impact of sample deficiency and provides enough training iterations. To obtain sufficient training for SER, this study proposes a novel method based on Siamese Attention-based Long Short-Term Memory (LSTM) networks. In this framework, we design two Attention-based LSTM networks that share the same weights, and we feed frame-level acoustic emotional features, rather than utterance-level features, into the Siamese network. The proposed solution has been evaluated on the EMODB, ABC, and UYGSEDB corpora and shows significant improvement in SER results compared to conventional deep learning methods.
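The following is a minimal, illustrative sketch (in PyTorch style) of the architecture the abstract describes: two attention-based LSTM branches that share weights, take frame-level acoustic features, and are trained pairwise. The feature dimensions, the form of the attention layer, the distance metric, and the contrastive loss below are assumptions made for illustration, not the authors' published configuration.

```python
# Sketch only: shared-weight attention-LSTM branches trained on pairs.
# Layer sizes, attention form, and loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionLSTMBranch(nn.Module):
    """One branch: LSTM over frame-level features + soft attention pooling."""

    def __init__(self, feat_dim=39, hidden_dim=128, emb_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)        # per-frame attention score
        self.proj = nn.Linear(hidden_dim, emb_dim)  # utterance-level embedding

    def forward(self, x):                           # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)                         # (batch, frames, hidden_dim)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over frames
        pooled = (w * h).sum(dim=1)                 # attention-weighted sum over time
        return self.proj(pooled)


class SiameseAttentionLSTM(nn.Module):
    """Siamese wrapper: the same branch (shared weights) encodes both inputs."""

    def __init__(self):
        super().__init__()
        self.branch = AttentionLSTMBranch()

    def forward(self, x1, x2):
        return self.branch(x1), self.branch(x2)


def contrastive_loss(e1, e2, same_label, margin=1.0):
    """Pull same-emotion pairs together, push different-emotion pairs apart."""
    d = F.pairwise_distance(e1, e2)
    return (same_label * d.pow(2) +
            (1 - same_label) * torch.clamp(margin - d, min=0).pow(2)).mean()


# Toy usage: a batch of 8 pairs, 300 frames each, 39-dim frame-level features.
if __name__ == "__main__":
    model = SiameseAttentionLSTM()
    x1, x2 = torch.randn(8, 300, 39), torch.randn(8, 300, 39)
    same = torch.randint(0, 2, (8,)).float()        # 1 = same emotion, 0 = different
    e1, e2 = model(x1, x2)
    print(contrastive_loss(e1, e2, same))
```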
Tashpolat NIZAMIDIN (Southeast University)
Li ZHAO (Southeast University)
Ruiyu LIANG (Nanjing Institute of Technology)
Yue XIE (Southeast University)
Askar HAMDULLA (Xinjiang University)
Tashpolat NIZAMIDIN, Li ZHAO, Ruiyu LIANG, Yue XIE, Askar HAMDULLA, "Siamese Attention-Based LSTM for Speech Emotion Recognition" in IEICE TRANSACTIONS on Fundamentals,
vol. E103-A, no. 7, pp. 937-941, July 2020, doi: 10.1587/transfun.2019EAL2156.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/transfun.2019EAL2156/_p
@ARTICLE{e103-a_7_937,
author={Tashpolat NIZAMIDIN and Li ZHAO and Ruiyu LIANG and Yue XIE and Askar HAMDULLA},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Siamese Attention-Based LSTM for Speech Emotion Recognition},
year={2020},
volume={E103-A},
number={7},
pages={937-941},
keywords={},
doi={10.1587/transfun.2019EAL2156},
ISSN={1745-1337},
month={July},}
TY - JOUR
TI - Siamese Attention-Based LSTM for Speech Emotion Recognition
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 937
EP - 941
AU - Tashpolat NIZAMIDIN
AU - Li ZHAO
AU - Ruiyu LIANG
AU - Yue XIE
AU - Askar HAMDULLA
PY - 2020
DO - 10.1587/transfun.2019EAL2156
JO - IEICE TRANSACTIONS on Fundamentals
SN - 1745-1337
VL - E103-A
IS - 7
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - July 2020
ER -