
Author Search Result

[Author] Ruiyu LIANG (21 hits)

1-20 of 21 hits

  • A Novel Supervised Bimodal Emotion Recognition Approach Based on Facial Expression and Body Gesture

    Jingjie YAN  Guanming LU  Xiaodong BAI  Haibo LI  Ning SUN  Ruiyu LIANG  

     
    LETTER-Image

      Vol:
    E101-A No:11
      Page(s):
    2003-2006

    In this letter, we propose a supervised bimodal emotion recognition approach based on two important human emotion modalities: facial expression and body gesture. An effective supervised feature fusion algorithm named supervised multiset canonical correlation analysis (SMCCA) is presented to establish the linear connection between three sets of matrices, namely the feature matrices of the two modalities and their concurrent category matrix. Test results on bimodal emotion recognition with the FABO database show that the SMCCA algorithm achieves better or comparable performance to unsupervised feature fusion algorithms such as canonical correlation analysis (CCA), sparse canonical correlation analysis (SCCA), and multiset canonical correlation analysis (MCCA).
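
    The exact SMCCA objective is not given in this abstract; as a hedged sketch, a multiset CCA criterion over the two modality feature matrices X_1, X_2 and a class-indicator matrix X_3 (the supervision term) could take the form:

```latex
\max_{w_1,w_2,w_3}\ \sum_{i<j} w_i^{\top} X_i X_j^{\top} w_j
\qquad \text{s.t.}\qquad w_i^{\top} X_i X_i^{\top} w_i = 1,\quad i = 1,2,3,
```

    where maximizing the pairwise correlations with the label matrix injects class information into the fused projections.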

  • A Salient Feature Extraction Algorithm for Speech Emotion Recognition

    Ruiyu LIANG  Huawei TAO  Guichen TANG  Qingyun WANG  Li ZHAO  

     
    LETTER-Speech and Hearing

      Publicized:
    2015/05/29
      Vol:
    E98-D No:9
      Page(s):
    1715-1718

    A salient feature extraction algorithm is proposed to improve the recognition rate of speech emotion. First, the spectrogram of the emotional speech is calculated. Second, imitating the selective attention mechanism, the color, direction, and brightness maps of the spectrogram are computed. Each map is normalized and down-sampled to form a low-resolution feature matrix. Then each feature matrix is converted to a row vector, and principal component analysis (PCA) is used to reduce feature redundancy and make the subsequent classification more practical. Finally, the speech emotion is classified with a support vector machine. Compared with traditional features, the recognition rate improves by up to 15%.
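
    A minimal sketch of this pipeline (spectrogram, saliency-style maps, downsampling, PCA, SVM). The window sizes, map definitions, and classifier settings are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def salient_features(signal, fs=16000, size=(16, 16)):
    """Compute a spectrogram, derive simple saliency-style maps,
    and downsample each map into a low-resolution feature vector."""
    _, _, spec = spectrogram(signal, fs=fs, nperseg=512, noverlap=256)
    log_spec = np.log(spec + 1e-10)
    # Brightness map: the log-energy itself; the direction/contrast maps are
    # approximated here with gradients (the paper's exact maps differ).
    maps = [log_spec, *np.gradient(log_spec)]
    feats = []
    for m in maps:
        rows = np.linspace(0, m.shape[0] - 1, size[0]).astype(int)
        cols = np.linspace(0, m.shape[1] - 1, size[1]).astype(int)
        small = m[np.ix_(rows, cols)]
        small = (small - small.mean()) / (small.std() + 1e-10)  # normalize
        feats.append(small.ravel())
    return np.concatenate(feats)

# Assumed usage: X_train is a list of waveforms, y_train their emotion labels.
# F_train = np.array([salient_features(x) for x in X_train])
# pca = PCA(n_components=50).fit(F_train)
# clf = SVC(kernel="rbf").fit(pca.transform(F_train), y_train)
```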

  • Combining Siamese Network and Regression Network for Visual Tracking

    Yao GE  Rui CHEN  Ying TONG  Xuehong CAO  Ruiyu LIANG  

     
    LETTER-Image Recognition, Computer Vision

      Publicized:
    2020/05/13
      Vol:
    E103-D No:8
      Page(s):
    1924-1927

    We combine the Siamese network and the recurrent regression network into a two-stage tracking framework termed SiamReg. Our method addresses the problem that the classic Siamese network cannot judge the target size precisely, and it simplifies the regression procedures in training and testing. We perform experiments on three challenging tracking datasets: VOT2016, OTB100, and VOT2018. The results indicate that, after offline training, SiamReg obtains a higher expected average overlap measure.

  • A Novel Bimodal Emotion Database from Physiological Signals and Facial Expression

    Jingjie YAN  Bei WANG  Ruiyu LIANG  

     
    LETTER-Multimedia Pattern Processing

      Publicized:
    2018/04/17
      Vol:
    E101-D No:7
      Page(s):
    1976-1979

    In this paper, we establish a novel bimodal emotion database of physiological signals and facial expression, named PSFE. The physiological signals and facial expression are recorded simultaneously by a BIOPAC MP 150 and a Kinect for Windows, respectively. The PSFE database comprises 32 subjects (11 women and 21 men) aged 20 to 25. It covers three basic emotion classes, calmness, happiness, and sadness, which correspond to the neutral, positive, and negative emotion states, respectively. The database contains 288 samples in total, with 96 samples per emotion class.

  • Speech Emotion Recognition Using Multihead Attention in Both Time and Feature Dimensions

    Yue XIE  Ruiyu LIANG  Zhenlin LIANG  Xiaoyan ZHAO  Wenhao ZENG  

     
    LETTER-Speech and Hearing

      Publicized:
    2023/02/21
      Vol:
    E106-D No:5
      Page(s):
    1098-1101

    To enhance the emotion features and improve the performance of speech emotion recognition, an attention mechanism is employed to identify the important information in both the time and feature dimensions. In the time dimension, multi-head attention is modified with the last state of the long short-term memory (LSTM) output to match the time-accumulation characteristic of the LSTM. In the feature dimension, scaled dot-product attention is replaced with additive attention that follows the LSTM state-update method to construct multi-head attention; that is, a nonlinear transformation replaces the linear mapping in classical multi-head attention. Experiments on the IEMOCAP dataset demonstrate that the attention mechanism enhances emotional information and improves the performance of speech emotion recognition.
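
    A hedged PyTorch-style sketch of the idea: multi-head attention whose scores come from an additive (tanh) scoring function rather than a scaled dot product, with the LSTM's last state used as the query. Layer sizes and the exact gating form are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class AdditiveMultiHeadAttention(nn.Module):
    """Multi-head attention with an additive (tanh) scoring function
    instead of the scaled dot product of classical multi-head attention."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim // heads, 1)

    def forward(self, query, keys):
        # query: (batch, dim), e.g. the LSTM's last state; keys: (batch, seq, dim)
        b, t, d = keys.shape
        h, dh = self.heads, d // self.heads
        q = self.w_q(query).view(b, 1, h, dh)
        k = self.w_k(keys).view(b, t, h, dh)
        scores = self.v(torch.tanh(q + k)).squeeze(-1)      # (b, t, h)
        weights = torch.softmax(scores, dim=1)              # attend over time
        out = (weights.unsqueeze(-1) * keys.view(b, t, h, dh)).sum(dim=1)
        return out.reshape(b, d)                            # fused vector
```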

  • Unconstrained Facial Expression Recognition Based on Feature Enhanced CNN and Cross-Layer LSTM

    Ying TONG  Rui CHEN  Ruiyu LIANG  

     
    LETTER-Image Recognition, Computer Vision

      Publicized:
    2020/07/30
      Vol:
    E103-D No:11
      Page(s):
    2403-2406

    LSTM networks have been shown to perform well in facial expression recognition on video sequences. In view of the limited representation ability of a single-layer LSTM, a hierarchical attention model with an enhanced feature branch is proposed. The new network architecture consists of a traditional VGG-16-FACE with an enhanced feature branch, followed by a cross-layer LSTM. The VGG-16-FACE with the enhanced branch extracts spatial features, while the cross-layer LSTM extracts the temporal relations between frames in the video. The proposed method is evaluated on public emotion databases in subject-independent and cross-database tasks and outperforms state-of-the-art methods.

  • Attention-Based Dense LSTM for Speech Emotion Recognition (Open Access)

    Yue XIE  Ruiyu LIANG  Zhenlin LIANG  Li ZHAO  

     
    LETTER-Pattern Recognition

      Publicized:
    2019/04/17
      Vol:
    E102-D No:7
      Page(s):
    1426-1429

    Despite the widespread use of deep learning for speech emotion recognition, such models are severely restricted by the information loss in the higher layers of deep neural networks, as well as by the degradation problem. To utilize information efficiently and counter degradation, an attention-based dense long short-term memory (LSTM) network is proposed for speech emotion recognition. LSTM networks, which are suited to time series such as speech, are constructed with attention-based dense connections: weight coefficients are added to the skip connections of each layer to distinguish the emotional information carried by different layers and to prevent redundant information from the lower layers from interfering with the effective information from the upper layers. The experiments demonstrate that the proposed method improves recognition performance by 12% and 7% on the eNTERFACE and IEMOCAP corpora, respectively.
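
    A minimal sketch of dense skip connections with learned weights between stacked LSTM layers; the layer count, weighting scheme, and classifier head are hypothetical, not the authors' implementation:

```python
import torch
import torch.nn as nn

class AttentiveDenseLSTM(nn.Module):
    """Stacked LSTMs in which each layer receives a weighted sum of all
    earlier layers' outputs (dense skips with learnable coefficients)."""
    def __init__(self, input_dim, hidden_dim, num_layers=3, num_classes=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.LSTM(input_dim if i == 0 else hidden_dim, hidden_dim,
                     batch_first=True) for i in range(num_layers)]
        )
        # One learnable weight per skip connection feeding each layer.
        self.skip_weights = nn.ParameterList(
            [nn.Parameter(torch.ones(i + 1)) for i in range(num_layers)]
        )
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                       # x: (batch, time, input_dim)
        outputs, h = [], x
        for i, lstm in enumerate(self.layers):
            h, _ = lstm(h)
            outputs.append(h)
            w = torch.softmax(self.skip_weights[i], dim=0)
            h = sum(w[j] * outputs[j] for j in range(i + 1))
        return self.fc(h[:, -1, :])             # classify from the last step
```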

  • Speaker-Independent Speech Emotion Recognition Based Multiple Kernel Learning of Collaborative Representation

    Cheng ZHA  Xinrang ZHANG  Li ZHAO  Ruiyu LIANG  

     
    LETTER-Engineering Acoustics

      Vol:
    E99-A No:3
      Page(s):
    756-759

    We propose a novel multiple kernel learning (MKL) method with a collaborative representation constraint, called CR-MKL, for fusing emotion information from multi-level features. To this end, the similarity and distinctiveness of multi-level features are learned in the kernel-induced space using a weighted distance measure. Our method achieves better performance than existing methods by using voiced-level and unvoiced-level features.

  • Weighted Gradient Pretrain for Low-Resource Speech Emotion Recognition

    Yue XIE  Ruiyu LIANG  Xiaoyan ZHAO  Zhenlin LIANG  Jing DU  

     
    LETTER-Speech and Hearing

      Publicized:
    2022/04/04
      Vol:
    E105-D No:7
      Page(s):
    1352-1355

    To alleviate the dependency of speech emotion recognition on the quantity of training data, a weighted-gradient pre-training algorithm for low-resource speech emotion recognition is proposed. Multiple public emotion corpora are used for pre-training to generate shared hidden layer (SHL) parameters with generalization ability. These parameters are used to initialize the downstream network of the recognition task on the low-resource dataset, thereby improving recognition performance on low-resource emotion corpora. However, the emotion categories differ among the public corpora and the numbers of samples vary greatly, which increases the difficulty of joint training on multiple emotion datasets. To this end, a weighted gradient (WG) algorithm is proposed to enable the shared layer to learn a generalized representation of different datasets without affecting the priority of emotion recognition on each corpus. Experiments show that accuracy is improved by using CASIA, IEMOCAP, and eNTERFACE as the known datasets to pre-train emotion models for GEMEP, and that performance can be improved further by combining WG with a gradient reversal layer.
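
    A rough sketch of joint pre-training of a shared layer over several corpora with per-corpus gradient weighting. The weighting rule (inverse corpus size) and the per-corpus heads are assumptions for illustration; the paper's WG rule may differ:

```python
import torch

def weighted_gradient_step(shared, heads, batches, optimizer):
    """One joint pre-training step: each corpus has its own classification
    head, and its gradient into the shared layer is scaled by a weight."""
    sizes = torch.tensor([float(len(x)) for x, _ in batches])
    weights = (1.0 / sizes) / (1.0 / sizes).sum()   # assumed: inverse batch size
    optimizer.zero_grad()
    total = 0.0
    for (x, y), head, w in zip(batches, heads, weights):
        logits = head(shared(x))
        loss = torch.nn.functional.cross_entropy(logits, y)
        (w * loss).backward()            # scales this corpus's gradients
        total += float(loss)
    optimizer.step()
    return total

# Assumed usage: `shared` and each element of `heads` are nn.Modules, and
# `batches` is a list of (features, labels) pairs, one per corpus.
```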

  • Facial Expression Recognition via Regression-Based Robust Locality Preserving Projections

    Jingjie YAN  Bojie YAN  Ruiyu LIANG  Guanming LU  Haibo LI  Shipeng XIE  

     
    LETTER-Image Recognition, Computer Vision

      Publicized:
    2017/11/06
      Vol:
    E101-D No:2
      Page(s):
    564-567

    In this paper, we present a novel regression-based robust locality preserving projections (RRLPP) method to deal effectively with noise and occlusion in facial expression recognition. Similar to the robust principal component analysis (RPCA) and robust regression (RR) approaches, the basic idea of RRLPP is to introduce a low-rank term and a sparse term for the facial expression sample matrix, which simultaneously overcomes the shortcoming of locality preserving projections (LPP) and enhances the robustness of facial expression recognition. In contrast to these methods, RRLPP is a nonlinear robust subspace method that can effectively describe the local structure of facial expression images. Test results on the Multi-PIE facial expression database indicate that RRLPP can effectively handle noise and occlusion in facial expression images, while achieving a better or comparable recognition rate than both non-robust and robust subspace methods.
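
    The abstract does not state the RRLPP objective; for orientation, the low-rank plus sparse decomposition it borrows from RPCA is commonly written as:

```latex
\min_{L,\,S}\ \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{s.t.}\quad X = L + S,
```

    with the locality-preserving projection then learned on the clean low-rank part L rather than on the noisy sample matrix X.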

  • A Novel Hybrid Network Model Based on Attentional Multi-Feature Fusion for Deception Detection

    Yuanbo FANG  Hongliang FU  Huawei TAO  Ruiyu LIANG  Li ZHAO  

     
    LETTER-Speech and Hearing

      Publicized:
    2020/09/24
      Vol:
    E104-A No:3
      Page(s):
    622-626

    Speech-based deception detection using deep learning is one technology for realizing a deception detection system with a high recognition rate. Multi-network feature extraction can effectively improve the recognition performance of the system, but limited labeled data and the lack of effective feature fusion methods restrict network performance. To address this, a novel hybrid network model based on attentional multi-feature fusion (HN-AMFF) is proposed. First, static features of large amounts of unlabeled speech data are fed into the DAE for unsupervised training. Second, frame-level features and static features of a small amount of labeled speech data are fed simultaneously into the LSTM network and the encoder output of the DAE for joint supervised training. Finally, a feature fusion algorithm based on an attention mechanism is proposed, which obtains the optimal feature set during training. Simulation results show that the proposed feature fusion method is significantly better than traditional feature fusion methods and that the model achieves advanced performance with only a small amount of labeled data.

  • An Effective Acoustic Feedback Cancellation Algorithm Based on the Normalized Sub-Band Adaptive Filter

    Xia WANG  Ruiyu LIANG  Qingyun WANG  Li ZHAO  Cairong ZOU  

     
    LETTER-Speech and Hearing

      Publicized:
    2015/10/20
      Vol:
    E99-D No:1
      Page(s):
    288-291

    In this letter, an effective acoustic feedback cancellation algorithm based on the normalized sub-band adaptive filter (NSAF) is proposed. To ease the conflict between fast convergence and low misalignment in the NSAF algorithm, a variable step size is designed that automatically varies according to the update state of the filter. The update state is detected adaptively via the normalized distance between the long-term average and the short-term average of the tap-weight vector. Simulation results demonstrate that the proposed algorithm has superior performance in terms of convergence rate and misalignment.
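
    A minimal sketch of the update-state detector described here, i.e. the normalized distance between long- and short-term averages of the tap-weight vector used to scale the step size. The smoothing constants and the mapping to the step size are assumptions, not the paper's values:

```python
import numpy as np

class StepSizeController:
    """Variable step size for an adaptive filter, driven by how far the
    short-term average of the weights has drifted from the long-term one."""
    def __init__(self, taps, mu_min=0.05, mu_max=1.0,
                 alpha_long=0.999, alpha_short=0.9):
        self.w_long = np.zeros(taps)
        self.w_short = np.zeros(taps)
        self.mu_min, self.mu_max = mu_min, mu_max
        self.a_long, self.a_short = alpha_long, alpha_short

    def update(self, w):
        """Given the current tap-weight vector w, return the step size."""
        self.w_long = self.a_long * self.w_long + (1 - self.a_long) * w
        self.w_short = self.a_short * self.w_short + (1 - self.a_short) * w
        dist = np.linalg.norm(self.w_short - self.w_long)
        dist /= np.linalg.norm(self.w_long) + 1e-10   # normalized distance
        # Large distance -> filter still converging -> use a large step size.
        gate = min(dist, 1.0)
        return self.mu_min + (self.mu_max - self.mu_min) * gate
```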

  • Spectral Features Based on Local Normalized Center Moments for Speech Emotion Recognition

    Huawei TAO  Ruiyu LIANG  Xinran ZHANG  Li ZHAO  

     
    LETTER-Speech and Hearing

      Vol:
    E99-A No:10
      Page(s):
    1863-1866

    To investigate whether rotational invariance plays the main role in spectrogram features, new spectral features based on local normalized center moments, denoted LNCMSF, are proposed. LNCMSF first adopts 2nd-order normalized center moments to describe the local energy distribution of the logarithmic energy spectrum, yielding the normalized center moment spectrograms NC1 and NC2. Second, the discrete cosine transform (DCT) is used to decorrelate NC1 and NC2, yielding the high-order cepstral coefficients TNC1 and TNC2. Finally, LNCMSF is generated by combining NC1, NC2, TNC1, and TNC2. A rotational invariance test shows that rotational invariance is not a necessary property of partial spectrogram features. The recognition experiment shows that the maximum UA (unweighted average of class-wise recall rate) of LNCMSF is improved by at least 10.7% and 1.2% compared with MFCC (Mel frequency cepstral coefficients) and HuWSF (weighted spectral features based on local Hu moments), respectively.
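
    For reference, 2nd-order normalized central moments computed over a local block B of the logarithmic energy spectrum (the building blocks behind NC1 and NC2) have the standard form:

```latex
\mu_{pq} = \sum_{(x,y)\in B} (x-\bar{x})^{p}\,(y-\bar{y})^{q}\, E(x,y),
\qquad
\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,1+(p+q)/2}},\qquad p+q = 2,
```

    where E(x, y) is the block's log-energy and (x̄, ȳ) its energy centroid; which combinations of η_20, η_02, and η_11 form NC1 and NC2 is not specified in the abstract.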

  • A Non-Intrusive Speech Quality Evaluation Method Based on the Audiogram and Weighted Frequency Information for Hearing Aid

    Ruxue GUO  Pengxu JIANG  Ruiyu LIANG  Yue XIE  Cairong ZOU  

     
    LETTER-Speech and Hearing

      Publicized:
    2022/07/25
      Vol:
    E106-A No:1
      Page(s):
    64-68

    For a long time, the compensation effect of hearing aids has mainly been evaluated subjectively, and there are few studies on objective evaluation. Furthermore, a clean speech signal is generally required as a reference in existing objective evaluation methods, which restricts their practicality in real-world environments. Therefore, this paper presents a non-intrusive speech quality evaluation method for hearing aids that combines the audiogram with weighted frequency information. The proposed model mainly comprises an audiogram information extraction network, a frequency information extraction network, and a quality score mapping network. The audiogram is the input of the audiogram information extraction network, which helps the system capture information related to hearing loss. In addition, the low-frequency bands of speech contain loudness information, while the medium- and high-frequency components contribute to semantic comprehension. The information from the two frequency bands is input to the frequency information extraction network to obtain time-frequency information. After the high-level features of the different frequency bands and the audiogram are obtained, they are fused into two groups of tensors that distinguish the information of the different frequency bands and are used as the input of an attention layer to compute the corresponding weight distribution. Finally, a dense layer is employed to predict the speech quality score. Experimental results show that combining the audiogram with the weighted information from the two frequency bands is reasonable and can effectively evaluate the speech quality of hearing aids.

  • Real-Time Generic Object Tracking via Recurrent Regression Network

    Rui CHEN  Ying TONG  Ruiyu LIANG  

     
    PAPER-Artificial Intelligence, Data Mining

      Publicized:
    2019/12/20
      Vol:
    E103-D No:3
      Page(s):
    602-611

    Deep neural networks have achieved great success in visual tracking by learning a generic representation and leveraging large amounts of training data to improve performance. However, most generic object trackers are trained from scratch online and do not benefit from the large number of videos available for offline training. We present a real-time generic object tracker that incorporates temporal information into its model, learns from many examples offline, and updates quickly online. During training, the pre-trained convolution-layer weights are updated with a lag, and the input video sequence length is gradually increased for fast convergence. Furthermore, only the hidden states of the recurrent network are updated online to guarantee real-time tracking speed. Experimental results show that the proposed tracker runs at 150 fps with a higher predicted overlap rate and is more robust on multiple benchmarks than state-of-the-art methods.

  • Speech Emotion Recognition Based on Sparse Transfer Learning Method

    Peng SONG  Wenming ZHENG  Ruiyu LIANG  

     
    LETTER-Speech and Hearing

      Publicized:
    2015/04/10
      Vol:
    E98-D No:7
      Page(s):
    1409-1412

    In traditional speech emotion recognition systems, when the training and testing utterances come from different corpora, recognition rates decrease dramatically. To tackle this problem, inspired by recent developments in sparse coding and transfer learning, a novel sparse transfer learning method is presented for speech emotion recognition. First, a sparse coding algorithm is employed to learn a robust sparse representation of emotional features. Then, a sparse transfer learning approach is presented in which the distance between the feature distributions of the source and target datasets is used to regularize the sparse coding objective. Experimental results demonstrate that, compared with the automatic recognition approach, the proposed method achieves promising improvements in recognition rates and significantly outperforms the classic dimension-reduction-based transfer learning approach.
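
    The abstract describes regularizing sparse coding with a distance between source and target feature distributions; a hedged, generic form of such an objective is:

```latex
\min_{D,\,A}\ \|X - D A\|_{F}^{2} + \lambda \|A\|_{1}
  + \beta\, d\!\left(P_{s}(A_{s}),\, P_{t}(A_{t})\right),
```

    where D is the dictionary, A the sparse codes, and d(·,·) a distance between the source and target code distributions (e.g. maximum mean discrepancy); the paper's exact distance measure and weighting are not given here.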

  • Siamese Attention-Based LSTM for Speech Emotion Recognition

    Tashpolat NIZAMIDIN  Li ZHAO  Ruiyu LIANG  Yue XIE  Askar HAMDULLA  

     
    LETTER-Engineering Acoustics

      Vol:
    E103-A No:7
      Page(s):
    937-941

    As a popular topic in the field of human-computer interaction, speech emotion recognition (SER) aims to classify the emotional tendency of speakers' utterances. With existing deep learning methods and a large amount of training data, highly accurate results can be achieved. Unfortunately, building such a huge, universally applicable emotional speech database is a time-consuming and difficult job. However, the Siamese neural network (SNN) discussed in this paper can yield precise results with only a limited amount of training data through pairwise training, which mitigates the impact of sample deficiency and provides enough training iterations. To obtain sufficient SER training, this study proposes a novel method using Siamese attention-based long short-term memory networks. In this framework, we design two attention-based long short-term memory networks that share the same weights, and we feed frame-level acoustic emotional features, rather than utterance-level features, into the Siamese network. The proposed solution is evaluated on the EMODB, ABC, and UYGSEDB corpora and shows significant improvements in SER results compared with conventional deep learning methods.
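
    A compact sketch of pairwise training with two weight-sharing attention-LSTM branches over frame-level features. The feature sizes, attention pooling, and contrastive loss (with its margin) are illustrative assumptions rather than the authors' exact setup:

```python
import torch
import torch.nn as nn

class AttentiveLSTMBranch(nn.Module):
    """One branch of the Siamese network: an LSTM over frame-level features
    followed by simple attention pooling into an utterance embedding."""
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.att = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)
        w = torch.softmax(self.att(h), dim=1)  # attention weights over frames
        return (w * h).sum(dim=1)              # (batch, hidden)

def contrastive_loss(e1, e2, same_label, margin=1.0):
    """Pull embeddings of same-emotion pairs together, push others apart.
    same_label is a float tensor of 1s (same class) and 0s (different)."""
    d = torch.norm(e1 - e2, dim=1)
    return (same_label * d.pow(2) +
            (1 - same_label) * torch.clamp(margin - d, min=0).pow(2)).mean()

# Both inputs of a pair pass through the *same* branch (shared weights):
# branch = AttentiveLSTMBranch()
# loss = contrastive_loss(branch(x_a), branch(x_b), y_same)
```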

  • Spectral Features Based on Local Hu Moments of Gabor Spectrograms for Speech Emotion Recognition

    Huawei TAO  Ruiyu LIANG  Cheng ZHA  Xinran ZHANG  Li ZHAO  

     
    LETTER-Pattern Recognition

      Publicized:
    2016/05/06
      Vol:
    E99-D No:8
      Page(s):
    2186-2189

    To improve the recognition rate of speech emotion, new spectral features based on local Hu moments of Gabor spectrograms, denoted GSLHu-PCA, are proposed. First, the logarithmic energy spectrum of the emotional speech is computed. Second, Gabor spectrograms are obtained by convolving the logarithmic energy spectrum with Gabor wavelets. Third, Gabor local Hu moment (GLHu) spectrograms are obtained through a block Hu strategy, and the discrete cosine transform (DCT) is then used to eliminate correlation among the components of the GLHu spectrograms. Fourth, statistical features are extracted from the cepstral coefficients of the GLHu spectrograms and combined into a feature vector. Finally, principal component analysis (PCA) is used to reduce feature redundancy. Experimental results on the EmoDB and ABC databases validate the effectiveness of GSLHu-PCA.

  • Compressed Sampling and Source Localization of Miniature Microphone Array

    Qingyun WANG  Xinchun JI  Ruiyu LIANG  Li ZHAO  

     
    LETTER

      Vol:
    E97-A No:9
      Page(s):
    1902-1906

    In traditional microphone array signal processing, performance degrades rapidly as the array aperture decreases, which has been a barrier to implementation in small-scale acoustic systems such as digital hearing aids. In this work, a new compressed sampling method for a miniature microphone array is proposed, which compresses information inside the ADC by means of a mixed hardware and software system in order to remove the redundancy among the different array element signals. The architecture is implemented in Verilog and has been tested on an FPGA chip. Compressed sampling and reconstruction experiments demonstrate successful sparse representation and reconstruction of speech sources. Because it avoids the singularity problem of the correlation matrix of the miniature microphone array, the proposed method offers higher resolution than the traditional GCC and MUSIC algorithms when used for direction-of-arrival (DOA) estimation in digital hearing aids.
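
    The compressed-sampling stage described above can be summarized in standard compressive-sensing notation (not the paper's specific circuit) as:

```latex
y = \Phi\, x = \Phi\, \Psi\, s,
\qquad
\hat{s} = \arg\min_{s} \|s\|_{1}
\quad \text{s.t.}\quad \|y - \Phi \Psi s\|_{2} \le \varepsilon,
```

    where x is a microphone signal, Ψ a basis in which speech is sparse, Φ the mixing implemented inside the ADC, and the l1 recovery reconstructs the sources before DOA estimation.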

  • Detecting Depression from Speech through an Attentive LSTM Network

    Yan ZHAO  Yue XIE  Ruiyu LIANG  Li ZHANG  Li ZHAO  Chengyu LIU  

     
    LETTER-Speech and Hearing

      Publicized:
    2021/08/24
      Vol:
    E104-D No:11
      Page(s):
    2019-2023

    As a mental disorder, depression endangers people's health and affects the social order. As an efficient aid to diagnosis, automatic depression detection has attracted much research interest. This study presents an attention-based long short-term memory (LSTM) model for depression detection that makes full use of the differences between depressed and non-depressed speech across time frames. The proposed model uses frame-level features, which capture the temporal information of depressive speech, in place of traditional statistical features as the input to the LSTM layers. To obtain richer multi-dimensional deep feature representations, the LSTM output is then passed to attention layers in both the time and feature dimensions. The outputs of the attention layers are concatenated, and the fused feature representation is fed into a fully connected layer whose output is passed to a softmax layer. Experiments on the DAIC-WOZ database demonstrate that the proposed attentive LSTM model achieves an average accuracy of 90.2% and outperforms the traditional LSTM network and LSTM with local attention by 0.7% and 2.3%, respectively, which indicates its feasibility.

1-20 of 21 hits