Speech emotion recognition (SER) has long been a challenging task due to the complexity of emotion. In this paper, we propose a multitask deep learning approach based on a cascaded attention network and a self-adaption loss for SER. First, non-personalized features are extracted to represent the process of emotional change while reducing the influence of external variables. Second, to highlight salient speech emotion features, a cascaded attention network is proposed, in which spatial-temporal attention effectively locates the regions of speech that express emotion, while self-attention reduces the dependence on external information. Finally, the influence of differences in gender and in human perception of external information is alleviated by a multitask learning strategy, in which a self-adaption loss dynamically determines the weights of the different tasks. Experimental results on the IEMOCAP dataset demonstrate that our method gains absolute improvements of 1.97% and 0.91% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
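The abstract does not give the exact form of the self-adaption loss, but the idea of dynamically weighting multiple task losses is commonly sketched with learnable per-task log-variances (homoscedastic-uncertainty weighting). The following is a minimal illustrative sketch under that assumption; the function name and formulation are hypothetical, not the paper's actual method.

```python
import math

def self_adaptive_multitask_loss(task_losses, log_vars):
    """Combine per-task losses with self-adapting weights.

    Each task t carries a learnable log-variance s_t. Its loss is
    scaled by exp(-s_t), and s_t itself is added as a regularizer so
    the learned weights cannot collapse to zero. During training the
    s_t values would be optimized jointly with the network parameters,
    letting the model rebalance tasks (e.g., emotion vs. gender
    classification) dynamically.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += math.exp(-s) * loss + s
    return total

# With all log-variances at 0, the combined loss is a plain sum.
print(self_adaptive_multitask_loss([1.0, 2.0], [0.0, 0.0]))  # 3.0
```

A larger log-variance down-weights its task's loss while paying a small additive penalty, which is how the balance between tasks adapts during training.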
Yang LIU (Qingdao University of Science and Technology)
Yuqi XIA (Qingdao University of Science and Technology)
Haoqin SUN (Qingdao University of Science and Technology)
Xiaolei MENG (Qingdao University of Science and Technology)
Jianxiong BAI (Qingdao University of Science and Technology)
Wenbo GUAN (Qingdao University of Science and Technology)
Zhen ZHAO (Qingdao University of Science and Technology)
Yongwei LI (Chinese Academy of Sciences)
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Yang LIU, Yuqi XIA, Haoqin SUN, Xiaolei MENG, Jianxiong BAI, Wenbo GUAN, Zhen ZHAO, Yongwei LI, "A Multitask Learning Approach Based on Cascaded Attention Network and Self-Adaption Loss for Speech Emotion Recognition," IEICE TRANSACTIONS on Fundamentals, vol. E106-A, no. 6, pp. 876-885, June 2023, doi: 10.1587/transfun.2022EAP1091.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/transfun.2022EAP1091/_p
@ARTICLE{e106-a_6_876,
author={Yang LIU and Yuqi XIA and Haoqin SUN and Xiaolei MENG and Jianxiong BAI and Wenbo GUAN and Zhen ZHAO and Yongwei LI},
journal={IEICE TRANSACTIONS on Fundamentals},
title={A Multitask Learning Approach Based on Cascaded Attention Network and Self-Adaption Loss for Speech Emotion Recognition},
year={2023},
volume={E106-A},
number={6},
pages={876-885},
keywords={},
doi={10.1587/transfun.2022EAP1091},
ISSN={1745-1337},
month={June},
}
TY - JOUR
TI - A Multitask Learning Approach Based on Cascaded Attention Network and Self-Adaption Loss for Speech Emotion Recognition
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 876
EP - 885
AU - Yang LIU
AU - Yuqi XIA
AU - Haoqin SUN
AU - Xiaolei MENG
AU - Jianxiong BAI
AU - Wenbo GUAN
AU - Zhen ZHAO
AU - Yongwei LI
PY - 2023
DO - 10.1587/transfun.2022EAP1091
JO - IEICE TRANSACTIONS on Fundamentals
SN - 1745-1337
VL - E106-A
IS - 6
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - 2023/06//
ER -