The search functionality is under construction.
The search functionality is under construction.

Keyword Search Result

[Keyword] emotion recognition(31hit)

1-20hit(31hit)

  • Multimodal Speech Emotion Recognition Based on Large Language Model Open Access

    Congcong FANG  Yun JIN  Guanlin CHEN  Yunfan ZHANG  Shidang LI  Yong MA  Yue XIE  

     
    LETTER-Speech and Hearing

      Pubricized:
    2024/07/22
      Vol:
    E107-D No:11
      Page(s):
    1463-1467

    Currently, an increasing number of tasks in speech emotion recognition rely on the analysis of both speech and text features. However, there remains a paucity of research exploring the potential of leveraging large language models like GPT-3 to enhance emotion recognition. In this investigation, we harness the power of the GPT-3 model to extract semantic information from transcribed texts, generating text modal features with a dimensionality of 1536. Subsequently, we perform feature fusion, combining the 1536-dimensional text features with 1188-dimensional acoustic features to yield comprehensive multi-modal recognition outcomes. Our findings reveal that the proposed method achieves a weighted accuracy of 79.62% across the four emotion categories in IEMOCAP, underscoring the considerable enhancement in emotion recognition accuracy facilitated by integrating large language models.

  • Cross-Corpus Speech Emotion Recognition Based on Causal Emotion Information Representation Open Access

    Hongliang FU  Qianqian LI  Huawei TAO  Chunhua ZHU  Yue XIE  Ruxue GUO  

     
    LETTER-Speech and Hearing

      Pubricized:
    2024/04/12
      Vol:
    E107-D No:8
      Page(s):
    1097-1100

    Speech emotion recognition (SER) is a key research technology to realize the third generation of artificial intelligence, which is widely used in human-computer interaction, emotion diagnosis, interpersonal communication and other fields. However, the aliasing of language and semantic information in speech tends to distort the alignment of emotion features, which affects the performance of cross-corpus SER system. This paper proposes a cross-corpus SER model based on causal emotion information representation (CEIR). The model uses the reconstruction loss of the deep autoencoder network and the source domain label information to realize the preliminary separation of causal features. Then, the causal correlation matrix is constructed, and the local maximum mean difference (LMMD) feature alignment technology is combined to make the causal features of different dimensions jointly distributed independent. Finally, the supervised fine-tuning of labeled data is used to achieve effective extraction of causal emotion information. The experimental results show that the average unweighted average recall (UAR) of the proposed algorithm is increased by 3.4% to 7.01% compared with the latest partial algorithms in the field.

  • A Multitask Learning Approach Based on Cascaded Attention Network and Self-Adaption Loss for Speech Emotion Recognition

    Yang LIU  Yuqi XIA  Haoqin SUN  Xiaolei MENG  Jianxiong BAI  Wenbo GUAN  Zhen ZHAO  Yongwei LI  

     
    PAPER-Speech and Hearing

      Pubricized:
    2022/12/08
      Vol:
    E106-A No:6
      Page(s):
    876-885

    Speech emotion recognition (SER) has been a complex and difficult task for a long time due to emotional complexity. In this paper, we propose a multitask deep learning approach based on cascaded attention network and self-adaption loss for SER. First, non-personalized features are extracted to represent the process of emotion change while reducing external variables' influence. Second, to highlight salient speech emotion features, a cascade attention network is proposed, where spatial temporal attention can effectively locate the regions of speech that express emotion, while self-attention reduces the dependence on external information. Finally, the influence brought by the differences in gender and human perception of external information is alleviated by using a multitask learning strategy, where a self-adaption loss is introduced to determine the weights of different tasks dynamically. Experimental results on IEMOCAP dataset demonstrate that our method gains an absolute improvement of 1.97% and 0.91% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.

  • Speech Emotion Recognition Using Multihead Attention in Both Time and Feature Dimensions

    Yue XIE  Ruiyu LIANG  Zhenlin LIANG  Xiaoyan ZHAO  Wenhao ZENG  

     
    LETTER-Speech and Hearing

      Pubricized:
    2023/02/21
      Vol:
    E106-D No:5
      Page(s):
    1098-1101

    To enhance the emotion feature and improve the performance of speech emotion recognition, an attention mechanism is employed to recognize the important information in both time and feature dimensions. In the time dimension, multi-heads attention is modified with the last state of the long short-term memory (LSTM)'s output to match the time accumulation characteristic of LSTM. In the feature dimension, scaled dot-product attention is replaced with additive attention that refers to the method of the state update of LSTM to construct multi-heads attention. This means that a nonlinear change replaces the linear mapping in classical multi-heads attention. Experiments on IEMOCAP datasets demonstrate that the attention mechanism could enhance emotional information and improve the performance of speech emotion recognition.

  • Convolutional Auto-Encoder and Adversarial Domain Adaptation for Cross-Corpus Speech Emotion Recognition

    Yang WANG  Hongliang FU  Huawei TAO  Jing YANG  Hongyi GE  Yue XIE  

     
    LETTER-Artificial Intelligence, Data Mining

      Pubricized:
    2022/07/12
      Vol:
    E105-D No:10
      Page(s):
    1803-1806

    This letter focuses on the cross-corpus speech emotion recognition (SER) task, in which the training and testing speech signals in cross-corpus SER belong to different speech corpora. Existing algorithms are incapable of effectively extracting common sentiment information between different corpora to facilitate knowledge transfer. To address this challenging problem, a novel convolutional auto-encoder and adversarial domain adaptation (CAEADA) framework for cross-corpus SER is proposed. The framework first constructs a one-dimensional convolutional auto-encoder (1D-CAE) for feature processing, which can explore the correlation among adjacent one-dimensional statistic features and the feature representation can be enhanced by the architecture based on encoder-decoder-style. Subsequently the adversarial domain adaptation (ADA) module alleviates the feature distributions discrepancy between the source and target domains by confusing domain discriminator, and specifically employs maximum mean discrepancy (MMD) to better accomplish feature transformation. To evaluate the proposed CAEADA, extensive experiments were conducted on EmoDB, eNTERFACE, and CASIA speech corpora, and the results show that the proposed method outperformed other approaches.

  • A novel Adaptive Weighted Transfer Subspace Learning Method for Cross-Database Speech Emotion Recognition

    Keke ZHAO  Peng SONG  Shaokai LI  Wenjing ZHANG  Wenming ZHENG  

     
    LETTER-Speech and Hearing

      Pubricized:
    2022/06/09
      Vol:
    E105-D No:9
      Page(s):
    1643-1646

    In this letter, we present an adaptive weighted transfer subspace learning (AWTSL) method for cross-database speech emotion recognition (SER), which can efficiently eliminate the discrepancy between source and target databases. Specifically, on one hand, a subspace projection matrix is first learned to project the cross-database features into a common subspace. At the same time, each target sample can be represented by the source samples by using a sparse reconstruction matrix. On the other hand, we design an adaptive weighted matrix learning strategy, which can improve the reconstruction contribution of important features and eliminate the negative influence of redundant features. Finally, we conduct extensive experiments on four benchmark databases, and the experimental results demonstrate the efficacy of the proposed method.

  • Weighted Gradient Pretrain for Low-Resource Speech Emotion Recognition

    Yue XIE  Ruiyu LIANG  Xiaoyan ZHAO  Zhenlin LIANG  Jing DU  

     
    LETTER-Speech and Hearing

      Pubricized:
    2022/04/04
      Vol:
    E105-D No:7
      Page(s):
    1352-1355

    To alleviate the problem of the dependency on the quantity of the training sample data in speech emotion recognition, a weighted gradient pre-train algorithm for low-resource speech emotion recognition is proposed. Multiple public emotion corpora are used for pre-training to generate shared hidden layer (SHL) parameters with the generalization ability. The parameters are used to initialize the downsteam network of the recognition task for the low-resource dataset, thereby improving the recognition performance on low-resource emotion corpora. However, the emotion categories are different among the public corpora, and the number of samples varies greatly, which will increase the difficulty of joint training on multiple emotion datasets. To this end, a weighted gradient (WG) algorithm is proposed to enable the shared layer to learn the generalized representation of different datasets without affecting the priority of the emotion recognition on each corpus. Experiments show that the accuracy is improved by using CASIA, IEMOCAP, and eNTERFACE as the known datasets to pre-train the emotion models of GEMEP, and the performance could be improved further by combining WG with gradient reversal layer.

  • Image Emotion Recognition Using Visual and Semantic Features Reflecting Emotional and Similar Objects

    Takahisa YAMAMOTO  Shiki TAKEUCHI  Atsushi NAKAZAWA  

     
    PAPER-Image Recognition, Computer Vision

      Pubricized:
    2021/06/24
      Vol:
    E104-D No:10
      Page(s):
    1691-1701

    Visual sentiment analysis has a lot of applications, including image captioning, opinion mining, and advertisement; however, it is still a difficult problem and existing algorithms cannot produce satisfactory results. One of the difficulties in classifying images into emotions is that visual sentiments are evoked by different types of information - visual and semantic information where visual information includes colors or textures, and semantic information includes types of objects evoking emotions and/or their combinations. In contrast to the existing methods that use only visual information, this paper shows a novel algorithm for image emotion recognition that uses both information simultaneously. For semantic features, we introduce an object vector and a word vector. The object vector is created by an object detection method and reflects existing objects in an image. The word vector is created by transforming the names of detected objects through a word embedding model. This vector will be similar among objects that are semantically similar. These semantic features and a visual feature made by a fine-tuned convolutional neural network (CNN) are concatenated. We perform the classification by the concatenated feature vector. Extensive evaluation experiments using emotional image datasets show that our method achieves the best accuracy except for one dataset against other existing methods. The improvement in accuracy of our method from existing methods is 4.54% at the highest.

  • Multi-Task Learning for Improved Recognition of Multiple Types of Acoustic Information

    Jae-Won KIM  Hochong PARK  

     
    LETTER-Speech and Hearing

      Pubricized:
    2021/07/14
      Vol:
    E104-D No:10
      Page(s):
    1762-1765

    We propose a new method for improving the recognition performance of phonemes, speech emotions, and music genres using multi-task learning. When tasks are closely related, multi-task learning can improve the performance of each task by learning common feature representation for all the tasks. However, the recognition tasks considered in this study demand different input signals of speech and music at different time scales, resulting in input features with different characteristics. In addition, a training dataset with multiple labels for all information sources is not available. Considering these issues, we conduct multi-task learning in a sequential training process using input features with a single label for one information source. A comparative evaluation confirms that the proposed method for multi-task learning provides higher performance for all recognition tasks than individual learning for each task as in conventional methods.

  • A Two-Stage Attention Based Modality Fusion Framework for Multi-Modal Speech Emotion Recognition

    Dongni HU  Chengxin CHEN  Pengyuan ZHANG  Junfeng LI  Yonghong YAN  Qingwei ZHAO  

     
    LETTER-Human-computer Interaction

      Pubricized:
    2021/04/30
      Vol:
    E104-D No:8
      Page(s):
    1391-1394

    Recently, automated recognition and analysis of human emotion has attracted increasing attention from multidisciplinary communities. However, it is challenging to utilize the emotional information simultaneously from multiple modalities. Previous studies have explored different fusion methods, but they mainly focused on either inter-modality interaction or intra-modality interaction. In this letter, we propose a novel two-stage fusion strategy named modality attention flow (MAF) to model the intra- and inter-modality interactions simultaneously in a unified end-to-end framework. Experimental results show that the proposed approach outperforms the widely used late fusion methods, and achieves even better performance when the number of stacked MAF blocks increases.

  • Siamese Attention-Based LSTM for Speech Emotion Recognition

    Tashpolat NIZAMIDIN  Li ZHAO  Ruiyu LIANG  Yue XIE  Askar HAMDULLA  

     
    LETTER-Engineering Acoustics

      Vol:
    E103-A No:7
      Page(s):
    937-941

    As one of the popular topics in the field of human-computer interaction, the Speech Emotion Recognition (SER) aims to classify the emotional tendency from the speakers' utterances. Using the existing deep learning methods, and with a large amount of training data, we can achieve a highly accurate performance result. Unfortunately, it's time consuming and difficult job to build such a huge emotional speech database that can be applicable universally. However, the Siamese Neural Network (SNN), which we discuss in this paper, can yield extremely precise results with just a limited amount of training data through pairwise training which mitigates the impacts of sample deficiency and provides enough iterations. To obtain enough SER training, this study proposes a novel method which uses Siamese Attention-based Long Short-Term Memory Networks. In this framework, we designed two Attention-based Long Short-Term Memory Networks which shares the same weights, and we input frame level acoustic emotional features to the Siamese network rather than utterance level emotional features. The proposed solution has been evaluated on EMODB, ABC and UYGSEDB corpora, and showed significant improvement on SER results, compared to conventional deep learning methods.

  • Cross-Corpus Speech Emotion Recognition Based on Deep Domain-Adaptive Convolutional Neural Network

    Jiateng LIU  Wenming ZHENG  Yuan ZONG  Cheng LU  Chuangao TANG  

     
    LETTER-Pattern Recognition

      Pubricized:
    2019/11/07
      Vol:
    E103-D No:2
      Page(s):
    459-463

    In this letter, we propose a novel deep domain-adaptive convolutional neural network (DDACNN) model to handle the challenging cross-corpus speech emotion recognition (SER) problem. The framework of the DDACNN model consists of two components: a feature extraction model based on a deep convolutional neural network (DCNN) and a domain-adaptive (DA) layer added in the DCNN utilizing the maximum mean discrepancy (MMD) criterion. We use labeled spectrograms from source speech corpus combined with unlabeled spectrograms from target speech corpus as the input of two classic DCNNs to extract the emotional features of speech, and train the model with a special mixed loss combined with a cross-entrophy loss and an MMD loss. Compared to other classic cross-corpus SER methods, the major advantage of the DDACNN model is that it can extract robust speech features which are time-frequency related by spectrograms and narrow the discrepancies between feature distribution of source corpus and target corpus to get better cross-corpus performance. Through several cross-corpus SER experiments, our DDACNN achieved the state-of-the-art performance on three public emotion speech corpora and is proved to handle the cross-corpus SER problem efficiently.

  • Target-Adapted Subspace Learning for Cross-Corpus Speech Emotion Recognition

    Xiuzhen CHEN  Xiaoyan ZHOU  Cheng LU  Yuan ZONG  Wenming ZHENG  Chuangao TANG  

     
    LETTER-Speech and Hearing

      Pubricized:
    2019/08/26
      Vol:
    E102-D No:12
      Page(s):
    2632-2636

    For cross-corpus speech emotion recognition (SER), how to obtain effective feature representation for the discrepancy elimination of feature distributions between source and target domains is a crucial issue. In this paper, we propose a Target-adapted Subspace Learning (TaSL) method for cross-corpus SER. The TaSL method trys to find a projection subspace, where the feature regress the label more accurately and the gap of feature distributions in target and source domains is bridged effectively. Then, in order to obtain more optimal projection matrix, ℓ1 norm and ℓ2,1 norm penalty terms are added to different regularization terms, respectively. Finally, we conduct extensive experiments on three public corpuses, EmoDB, eNTERFACE and AFEW 4.0. The experimental results show that our proposed method can achieve better performance compared with the state-of-the-art methods in the cross-corpus SER tasks.

  • Attention-Based Dense LSTM for Speech Emotion Recognition Open Access

    Yue XIE  Ruiyu LIANG  Zhenlin LIANG  Li ZHAO  

     
    LETTER-Pattern Recognition

      Pubricized:
    2019/04/17
      Vol:
    E102-D No:7
      Page(s):
    1426-1429

    Despite the widespread use of deep learning for speech emotion recognition, they are severely restricted due to the information loss in the high layer of deep neural networks, as well as the degradation problem. In order to efficiently utilize information and solve degradation, attention-based dense long short-term memory (LSTM) is proposed for speech emotion recognition. LSTM networks with the ability to process time series such as speech are constructed into which attention-based dense connections are introduced. That means the weight coefficients are added to skip-connections of each layer to distinguish the difference of the emotional information between layers and avoid the interference of redundant information from the bottom layer to the effective information from the top layer. The experiments demonstrate that proposed method improves the recognition performance by 12% and 7% on eNTERFACE and IEMOCAP corpus respectively.

  • Locality Preserved Joint Nonnegative Matrix Factorization for Speech Emotion Recognition

    Seksan MATHULAPRANGSAN  Yuan-Shan LEE  Jia-Ching WANG  

     
    LETTER

      Pubricized:
    2019/01/28
      Vol:
    E102-D No:4
      Page(s):
    821-825

    This study presents a joint dictionary learning approach for speech emotion recognition named locality preserved joint nonnegative matrix factorization (LP-JNMF). The learned representations are shared between the learned dictionaries and annotation matrix. Moreover, a locality penalty term is incorporated into the objective function. Thus, the system's discriminability is further improved.

  • A Novel Supervised Bimodal Emotion Recognition Approach Based on Facial Expression and Body Gesture

    Jingjie YAN  Guanming LU  Xiaodong BAI  Haibo LI  Ning SUN  Ruiyu LIANG  

     
    LETTER-Image

      Vol:
    E101-A No:11
      Page(s):
    2003-2006

    In this letter, we propose a supervised bimodal emotion recognition approach based on two important human emotion modalities including facial expression and body gesture. A effectively supervised feature fusion algorithms named supervised multiset canonical correlation analysis (SMCCA) is presented to established the linear connection between three sets of matrices, which contain the feature matrix of two modalities and their concurrent category matrix. The test results in the bimodal emotion recognition of the FABO database show that the SMCCA algorithm can get better or considerable efficiency than unsupervised feature fusion algorithm covering canonical correlation analysis (CCA), sparse canonical correlation analysis (SCCA), multiset canonical correlation analysis (MCCA) and so on.

  • Construction of Spontaneous Emotion Corpus from Indonesian TV Talk Shows and Its Application on Multimodal Emotion Recognition

    Nurul LUBIS  Dessi LESTARI  Sakriani SAKTI  Ayu PURWARIANTI  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

      Pubricized:
    2018/05/10
      Vol:
    E101-D No:8
      Page(s):
    2092-2100

    As interaction between human and computer continues to develop to the most natural form possible, it becomes increasingly urgent to incorporate emotion in the equation. This paper describes a step toward extending the research on emotion recognition to Indonesian. The field continues to develop, yet exploration of the subject in Indonesian is still lacking. In particular, this paper highlights two contributions: (1) the construction of the first emotional audio-visual database in Indonesian, and (2) the first multimodal emotion recognizer in Indonesian, built from the aforementioned corpus. In constructing the corpus, we aim at natural emotions that are corresponding to real life occurrences. However, the collection of emotional corpora is notably labor intensive and expensive. To diminish the cost, we collect the emotional data from television programs recordings, eliminating the need of an elaborate recording set up and experienced participants. In particular, we choose television talk shows due to its natural conversational content, yielding spontaneous emotion occurrences. To cover a broad range of emotions, we collected three episodes in different genres: politics, humanity, and entertainment. In this paper, we report points of analysis of the data and annotations. The acquisition of the emotion corpus serves as a foundation in further research on emotion. Subsequently, in the experiment, we employ the support vector machine (SVM) algorithm to model the emotions in the collected data. We perform multimodal emotion recognition utilizing the predictions of three modalities: acoustic, semantic, and visual. When compared to the unimodal result, in the multimodal feature combination, we attain identical accuracy for the arousal at 92.6%, and a significant improvement for the valence classification task at 93.8%. We hope to continue this work and move towards a finer-grain, more precise quantification of emotion.

  • Learning Corpus-Invariant Discriminant Feature Representations for Speech Emotion Recognition

    Peng SONG  Shifeng OU  Zhenbin DU  Yanyan GUO  Wenming MA  Jinglei LIU  Wenming ZHENG  

     
    LETTER-Speech and Hearing

      Pubricized:
    2017/02/02
      Vol:
    E100-D No:5
      Page(s):
    1136-1139

    As a hot topic of speech signal processing, speech emotion recognition methods have been developed rapidly in recent years. Some satisfactory results have been achieved. However, it should be noted that most of these methods are trained and evaluated on the same corpus. In reality, the training data and testing data are often collected from different corpora, and the feature distributions of different datasets often follow different distributions. These discrepancies will greatly affect the recognition performance. To tackle this problem, a novel corpus-invariant discriminant feature representation algorithm, called transfer discriminant analysis (TDA), is presented for speech emotion recognition. The basic idea of TDA is to integrate the kernel LDA algorithm and the similarity measurement of distributions into one objective function. Experimental results under the cross-corpus conditions show that our proposed method can significantly improve the recognition rates.

  • Spectral Features Based on Local Normalized Center Moments for Speech Emotion Recognition

    Huawei TAO  Ruiyu LIANG  Xinran ZHANG  Li ZHAO  

     
    LETTER-Speech and Hearing

      Vol:
    E99-A No:10
      Page(s):
    1863-1866

    To discuss whether rotational invariance is the main role in spectrogram features, new spectral features based on local normalized center moments, denoted by LNCMSF, are proposed. The proposed LNCMSF firstly adopts 2nd order normalized center moments to describe local energy distribution of the logarithmic energy spectrum, then normalized center moment spectrograms NC1 and NC2 are gained. Secondly, DCT (Discrete Cosine Transform) is used to eliminate the correlation of NC1 and NC2, then high order cepstral coefficients TNC1 and TNC2 are obtained. Finally, LNCMSF is generated by combining NC1, NC2, TNC1 and TNC2. The rotational invariance test experiment shows that the rotational invariance is not a necessary property in partial spectrogram features. The recognition experiment shows that the maximum UA (Unweighted Average of Class-Wise Recall Rate) of LNCMSF are improved by at least 10.7% and 1.2% respectively, compared to that of MFCC (Mel Frequency Cepstrum Coefficient) and HuWSF (Weighted Spectral Features Based on Local Hu Moments).

  • Transfer Semi-Supervised Non-Negative Matrix Factorization for Speech Emotion Recognition

    Peng SONG  Shifeng OU  Xinran ZHANG  Yun JIN  Wenming ZHENG  Jinglei LIU  Yanwei YU  

     
    LETTER-Speech and Hearing

      Pubricized:
    2016/07/01
      Vol:
    E99-D No:10
      Page(s):
    2647-2650

    In practice, emotional speech utterances are often collected from different devices or conditions, which will lead to discrepancy between the training and testing data, resulting in sharp decrease of recognition rates. To solve this problem, in this letter, a novel transfer semi-supervised non-negative matrix factorization (TSNMF) method is presented. A semi-supervised negative matrix factorization algorithm, utilizing both labeled source and unlabeled target data, is adopted to learn common feature representations. Meanwhile, the maximum mean discrepancy (MMD) as a similarity measurement is employed to reduce the distance between the feature distributions of two databases. Finally, the TSNMF algorithm, which optimizes the SNMF and MMD functions together, is proposed to obtain robust feature representations across databases. Extensive experiments demonstrate that in comparison to the state-of-the-art approaches, our proposed method can significantly improve the cross-corpus recognition rates.

1-20hit(31hit)