IEICE global.ieice.org Site

Author Search Result

[Author] Li ZHAO(20hit)

1-20hit

An Integrated Convolutional Neural Network with a Fusion Attention Mechanism for Acoustic Scene Classification
Pengxu JIANG Yue XIE Cairong ZOU Li ZHAO Qingyun WANG

LETTER-Engineering Acoustics

Publicized:
2023/02/06
Vol:
E106-A No:8
Page(s):
1057-1061
In human-computer interaction, acoustic scene classification (ASC) is one of the relevant research domains. In real life, the recorded audio may include a lot of noise and quiet clips, making it hard for earlier ASC-based research to isolate the crucial scene information in sound. Furthermore, scene information may be scattered across numerous audio frames; hence, selecting scene-related frames is crucial for ASC. In this context, an integrated convolutional neural network with a fusion attention mechanism (ICNN-FA) is proposed for ASC. Firstly, segmented mel-spectrograms as the input of ICNN can assist the model in learning the short-term time-frequency correlation information. Then, the designed ICNN model is employed to learn these segment-level features. In addition, the proposed global attention layer may gather global information by integrating these segment features. Finally, the developed fusion attention layer is utilized to fuse all segment-level features while the classifier classifies various situations. Experimental findings using ASC datasets from DCASE 2018 and 2019 indicate the efficacy of the suggested method.
Sparse FIR Filter Design Using Binary Particle Swarm Optimization
Chen WU Yifeng ZHANG Yuhui SHI Li ZHAO Minghai XIN

LETTER-Digital Signal Processing

Vol:
E97-A No:12
Page(s):
2653-2657
Recently, design of sparse finite impulse response (FIR) digital filters has attracted much attention due to its ability to reduce the implementation cost. However, finding a filter with the fewest number of nonzero coefficients subject to prescribed frequency domain constraints is a rather difficult problem because of its non-convexity. In this paper, an algorithm based on binary particle swarm optimization (BPSO) is proposed, which successively thins the filter coefficients until no sparser solution can be obtained. The proposed algorithm is evaluated on a set of examples, and better results can be achieved than other existing algorithms.
Attention-Based Dense LSTM for Speech Emotion Recognition (Open Access)
Yue XIE Ruiyu LIANG Zhenlin LIANG Li ZHAO

LETTER-Pattern Recognition

Publicized:
2019/04/17
Vol:
E102-D No:7
Page(s):
1426-1429
Despite the widespread use of deep learning for speech emotion recognition, deep models are severely restricted by the information loss in the high layers of deep neural networks, as well as by the degradation problem. In order to efficiently utilize information and solve degradation, attention-based dense long short-term memory (LSTM) is proposed for speech emotion recognition. LSTM networks, which are able to process time series such as speech, are constructed, and attention-based dense connections are introduced into them. That is, weight coefficients are added to the skip-connections of each layer to distinguish the difference in emotional information between layers and to avoid interference of redundant information from the bottom layer with the effective information from the top layer. The experiments demonstrate that the proposed method improves the recognition performance by 12% and 7% on the eNTERFACE and IEMOCAP corpora, respectively.
Speaker-Independent Speech Emotion Recognition Based Multiple Kernel Learning of Collaborative Representation
Cheng ZHA Xinrang ZHANG Li ZHAO Ruiyu LIANG

LETTER-Engineering Acoustics

Vol:
E99-A No:3
Page(s):
756-759
We propose a novel multiple kernel learning (MKL) method using a collaborative representation constraint, called CR-MKL, for fusing the emotion information from multi-level features. To this end, the similarity and distinctiveness of multi-level features are learned in the kernels-induced space using the weighting distance measure. Our method achieves better performance than existing methods by using the voiced-level and unvoiced-level features.
A Comparative Study of Output Probability Functions in HMMs
Seiichi NAKAGAWA Li ZHAO Hideyuki SUZUKI

PAPER

Vol:
E78-D No:6
Page(s):
669-675
One of the most effective methods in speech recognition is the HMM, which has been used to model speech statistically. The discrete distribution and the continuous distribution HMMs have been widely used in various applications. However, in recent years, HMMs with various output probability functions have been proposed to further improve recognition performance, e.g. the Gaussian mixture continuous and the semi-continuous distributed HMMs. We have also recently proposed the RBF (radial basis function)-based HMM and the VQ-distortion based HMM, which use an RBF function and a VQ-distortion measure at each state instead of the output probability density function used by traditional HMMs. In this paper, we describe the RBF-based HMM and the VQ-distortion based HMM and compare their performance with the discrete distributed, the Gaussian mixture distributed and the semi-continuous distributed HMMs based on their speech recognition performance rates through experiments on speaker-independent spoken digit recognition. Our results confirmed that the RBF-based and VQ-distortion based HMMs are more robust and superior to traditional HMMs.
A Novel Hybrid Network Model Based on Attentional Multi-Feature Fusion for Deception Detection
Yuanbo FANG Hongliang FU Huawei TAO Ruiyu LIANG Li ZHAO

LETTER-Speech and Hearing

Publicized:
2020/09/24
Vol:
E104-A No:3
Page(s):
622-626
Speech based deception detection using deep learning is one of the technologies to realize a deception detection system with high recognition rate in the future. Multi-network feature extraction technology can effectively improve the recognition performance of the system, but due to the limited labeled data and the lack of effective feature fusion methods, the performance of the network is limited. Based on this, a novel hybrid network model based on attentional multi-feature fusion (HN-AMFF) is proposed. Firstly, the static features of large amounts of unlabeled speech data are input into DAE for unsupervised training. Secondly, the frame-level features and static features of a small amount of labeled speech data are simultaneously input into the LSTM network and the encoded output part of DAE for joint supervised training. Finally, a feature fusion algorithm based on attention mechanism is proposed, which can get the optimal feature set in the training process. Simulation results show that the proposed feature fusion method is significantly better than traditional feature fusion methods, and the model can achieve advanced performance with only a small amount of labeled data.
An Effective Acoustic Feedback Cancellation Algorithm Based on the Normalized Sub-Band Adaptive Filter
Xia WANG Ruiyu LIANG Qingyun WANG Li ZHAO Cairong ZOU

LETTER-Speech and Hearing

Publicized:
2015/10/20
Vol:
E99-D No:1
Page(s):
288-291
In this letter, an effective acoustic feedback cancellation algorithm is proposed based on the normalized sub-band adaptive filter (NSAF). To resolve the conflict between fast convergence rate and low misalignment in the NSAF algorithm, a variable step size is designed that varies automatically according to the update state of the filter. The update state of the filter is adaptively detected via the normalized distance between the long-term average and the short-term average of the tap-weight vector. Simulation results demonstrate that the proposed algorithm has superior performance in terms of convergence rate and misalignment.
Joint Patch Weighting and Moment Matching for Unsupervised Domain Adaptation in Micro-Expression Recognition
Jie ZHU Yuan ZONG Hongli CHANG Li ZHAO Chuangao TANG

LETTER-Image Recognition, Computer Vision

Publicized:
2021/11/17
Vol:
E105-D No:2
Page(s):
441-445
Unsupervised domain adaptation (DA) is a challenging machine learning problem, since the labeled training (source) and unlabeled testing (target) sets belong to different domains and therefore have different feature distributions. It has recently attracted wide attention in micro-expression recognition (MER). Although some well-performing unsupervised DA methods have been proposed, these methods cannot adequately solve the problem of unsupervised DA in MER, a.k.a. cross-domain MER. To deal with such a challenging problem, in this letter we propose a novel unsupervised DA method called Joint Patch weighting and Moment Matching (JPMM). JPMM bridges the source and target micro-expression feature sets by minimizing their probability distribution divergence with a multi-order moment matching operation. Meanwhile, it takes advantage of the contributive facial patches through weight learning, such that a domain-invariant feature representation involving micro-expression distinguishable information can be learned. Finally, we carry out extensive experiments, which show that the proposed JPMM method is superior to recent state-of-the-art unsupervised DA methods in dealing with cross-domain MER.
Spectral Features Based on Local Normalized Center Moments for Speech Emotion Recognition
Huawei TAO Ruiyu LIANG Xinran ZHANG Li ZHAO

LETTER-Speech and Hearing

Vol:
E99-A No:10
Page(s):
1863-1866
To discuss whether rotational invariance plays the main role in spectrogram features, new spectral features based on local normalized center moments, denoted by LNCMSF, are proposed. The proposed LNCMSF first adopts 2nd order normalized center moments to describe the local energy distribution of the logarithmic energy spectrum, from which the normalized center moment spectrograms NC1 and NC2 are obtained. Secondly, the DCT (Discrete Cosine Transform) is used to eliminate the correlation of NC1 and NC2, yielding the high order cepstral coefficients TNC1 and TNC2. Finally, LNCMSF is generated by combining NC1, NC2, TNC1 and TNC2. The rotational invariance test experiment shows that rotational invariance is not a necessary property in partial spectrogram features. The recognition experiment shows that the maximum UA (Unweighted Average of Class-Wise Recall Rate) of LNCMSF is improved by at least 10.7% and 1.2%, respectively, compared to that of MFCC (Mel Frequency Cepstrum Coefficient) and HuWSF (Weighted Spectral Features Based on Local Hu Moments).
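The first step of the abstract above relies on normalized center (central) moments of a local spectrogram block. As a rough illustration only, the sketch below uses the standard image-moment definition, eta_pq = mu_pq / mu_00^(1 + (p+q)/2). The function name, the block layout and the assumption of non-negative block values are hypothetical; the abstract does not state which second-order moments form NC1 and NC2, so none are singled out here.

```python
def normalized_central_moment(block, p, q):
    """Normalized central moment eta_pq of a 2-D block (list of rows).

    Assumes non-negative values; this is the textbook image-moment
    definition, not necessarily the exact formula used in the letter.
    """
    m00 = sum(sum(row) for row in block)
    if m00 == 0:
        return 0.0  # empty block: define the moment as zero
    # Centroid of the block (x = column index, y = row index).
    x_bar = sum(x * v for row in block for x, v in enumerate(row)) / m00
    y_bar = sum(y * v for y, row in enumerate(block) for v in row) / m00
    # Central moment mu_pq around the centroid.
    mu = sum(((x - x_bar) ** p) * ((y - y_bar) ** q) * v
             for y, row in enumerate(block) for x, v in enumerate(row))
    # Scale normalization by mu_00 ** (1 + (p + q) / 2).
    return mu / m00 ** (1 + (p + q) / 2)


# Example: second-order moments of a uniform 2x2 block.
eta20 = normalized_central_moment([[1, 1], [1, 1]], 2, 0)
eta11 = normalized_central_moment([[1, 1], [1, 1]], 1, 1)
```

Sliding such a function over blocks of a log-energy spectrogram would give one moment value per block, which is the kind of "moment spectrogram" the abstract refers to.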
Low-Complexity Resource Allocation Algorithm for Multicell OFDMA System
Qingli ZHAO Fangjiong CHEN Sujuan XIONG Gang WEI

LETTER-Wireless Communication Technologies

Vol:
E96-B No:5
Page(s):
1218-1221
Low-complexity joint subcarrier and power allocation is considered. The applied criterion is to minimize the transmission power while satisfying the users' rate requirements. Subcarrier and power allocation are separately applied. Fixed spectrum efficiency is assumed to simplify the subcarrier allocation. We show that under fixed spectrum efficiency, power allocation can be obtained by solving some sets of linear equations. Simulation results show the effectiveness of the proposed algorithm.
Siamese Attention-Based LSTM for Speech Emotion Recognition
Tashpolat NIZAMIDIN Li ZHAO Ruiyu LIANG Yue XIE Askar HAMDULLA

LETTER-Engineering Acoustics

Vol:
E103-A No:7
Page(s):
937-941
As one of the popular topics in the field of human-computer interaction, Speech Emotion Recognition (SER) aims to classify the emotional tendency from the speakers' utterances. Using the existing deep learning methods, and with a large amount of training data, we can achieve highly accurate results. Unfortunately, it is a time-consuming and difficult job to build such a huge emotional speech database that is universally applicable. However, the Siamese Neural Network (SNN), which we discuss in this paper, can yield extremely precise results with just a limited amount of training data through pairwise training, which mitigates the impact of sample deficiency and provides enough iterations. To obtain enough SER training, this study proposes a novel method which uses Siamese Attention-based Long Short-Term Memory Networks. In this framework, we designed two Attention-based Long Short-Term Memory Networks which share the same weights, and we input frame-level acoustic emotional features to the Siamese network rather than utterance-level emotional features. The proposed solution has been evaluated on the EMODB, ABC and UYGSEDB corpora, and showed significant improvement in SER results compared to conventional deep learning methods.
Spectral Features Based on Local Hu Moments of Gabor Spectrograms for Speech Emotion Recognition
Huawei TAO Ruiyu LIANG Cheng ZHA Xinran ZHANG Li ZHAO

LETTER-Pattern Recognition

Publicized:
2016/05/06
Vol:
E99-D No:8
Page(s):
2186-2189
To improve the recognition rate of speech emotion, new spectral features based on local Hu moments of Gabor spectrograms are proposed, denoted by GSLHu-PCA. Firstly, the logarithmic energy spectrum of the emotional speech is computed. Secondly, the Gabor spectrograms are obtained by convolving the logarithmic energy spectrum with Gabor wavelets. Thirdly, Gabor local Hu moments (GLHu) spectrograms are obtained through a block Hu strategy, then the discrete cosine transform (DCT) is used to eliminate correlation among components of the GLHu spectrograms. Fourthly, statistical features are extracted from cepstral coefficients of the GLHu spectrograms, then all the statistical features form a feature vector. Finally, principal component analysis (PCA) is used to reduce redundancy of features. The experimental results on the EmoDB and ABC databases validate the effectiveness of GSLHu-PCA.
Compressed Sampling and Source Localization of Miniature Microphone Array
Qingyun WANG Xinchun JI Ruiyu LIANG Li ZHAO

LETTER

Vol:
E97-A No:9
Page(s):
1902-1906
In traditional microphone array signal processing, the performance degrades rapidly when the array aperture decreases, which has been a barrier restricting its implementation in small-scale acoustic systems such as digital hearing aids. In this work a new compressed sampling method for miniature microphone arrays is proposed, which compresses information inside the ADC by means of a mixed system of hardware circuit and software program in order to remove the redundancy among the different array element signals. The architecture of the method is developed using the Verilog language and has already been tested on an FPGA chip. Experiments on compressed sampling and reconstruction show successful sparse representation and reconstruction of speech sources. Because it avoids the singularity problem of the correlation matrix of the miniature microphone array, the proposed method, when used for direction of arrival (DOA) estimation in digital hearing aids, has the advantage of higher resolution compared with the traditional GCC and MUSIC algorithms.
Speech Emotion Recognition Using Transfer Learning
Peng SONG Yun JIN Li ZHAO Minghai XIN

LETTER-Speech and Hearing

Vol:
E97-D No:9
Page(s):
2530-2532
A major challenge for speech emotion recognition is that when the training and deployment conditions do not use the same speech corpus, the recognition rates will obviously drop. Transfer learning, which has successfully addressed the cross-domain classification or recognition problem, is presented for cross-corpus speech emotion recognition. First, by using the maximum mean discrepancy embedding (MMDE) optimization and dimension reduction algorithms, two close low-dimensional feature spaces are obtained for source and target speech corpora, respectively. Then, a classifier function is trained using the learned low-dimensional features in the labeled source corpus, and directly applied to the unlabeled target corpus for emotion label recognition. Experimental results demonstrate that the transfer learning method can significantly outperform the traditional automatic recognition technique for cross-corpus speech emotion recognition.
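The abstract above builds on the maximum mean discrepancy (MMD) as a measure of how far apart the source and target feature distributions are. The sketch below is not the MMDE optimization from the letter; it only illustrates the simplest empirical MMD, with a linear kernel, where the squared MMD reduces to the squared distance between the two feature means. The function name and the toy feature vectors are hypothetical.

```python
def linear_mmd2(source, target):
    """Squared empirical MMD with a linear kernel.

    `source` and `target` are lists of equal-length feature vectors.
    With a linear kernel, MMD^2 is the squared Euclidean distance
    between the mean feature vectors of the two sets.
    """
    dim = len(source[0])
    mean_s = [sum(x[d] for x in source) / len(source) for d in range(dim)]
    mean_t = [sum(x[d] for x in target) / len(target) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_t))


# Toy example: two small "corpora" whose feature means differ by (1, 1).
src = [[0.0, 0.0], [2.0, 2.0]]
tgt = [[1.0, 1.0], [3.0, 3.0]]
gap = linear_mmd2(src, tgt)
```

A cross-corpus method in this spirit would look for a low-dimensional projection of both corpora that makes such a discrepancy small before training the classifier on the labeled source features.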
Detecting Depression from Speech through an Attentive LSTM Network
Yan ZHAO Yue XIE Ruiyu LIANG Li ZHANG Li ZHAO Chengyu LIU

LETTER-Speech and Hearing

Publicized:
2021/08/24
Vol:
E104-D No:11
Page(s):
2019-2023
Depression is a mental disorder that endangers people's health and affects the social order. As an efficient means of diagnosing depression, automatic depression detection has attracted considerable research interest. This study presents an attention-based Long Short-Term Memory (LSTM) model for depression detection to make full use of the difference between depression and non-depression across timeframes. The proposed model uses frame-level features, which capture the temporal information of depressive speech, to replace traditional statistical features as the input of the LSTM layers. To achieve more multi-dimensional deep feature representations, the LSTM output is then passed to attention layers on both the time and feature dimensions. Then, we concatenate the outputs of the attention layers and feed the fused feature representation into the fully connected layer. Finally, the fully connected layer's output is passed on to a softmax layer. Experiments conducted on the DAIC-WOZ database demonstrate that the proposed attentive LSTM model achieves an average accuracy rate of 90.2% and outperforms the traditional LSTM network and LSTM with local attention by 0.7% and 2.3%, respectively, which indicates its feasibility.
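To make the idea of attention over the time dimension concrete, here is a minimal sketch of softmax-weighted pooling of frame-level vectors. It is an illustration under stated assumptions, not the model from the letter: the scoring vector `w` stands in for learned parameters, the frames stand in for LSTM outputs, and the feature-dimension attention and the concatenation step are omitted. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_attention_pool(frames, w):
    """Pool frame vectors into one vector with attention over time.

    frames: list of T feature vectors (e.g. LSTM outputs, assumed).
    w: scoring vector of the same length (learned in practice, assumed).
    Returns (context vector, attention weights).
    """
    # One scalar relevance score per frame.
    scores = [sum(h_i * w_i for h_i, w_i in zip(h, w)) for h in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    # Weighted sum of frames; weights sum to 1.
    context = [sum(a * h[d] for a, h in zip(alphas, frames))
               for d in range(dim)]
    return context, alphas


# With a zero scoring vector every frame gets equal weight,
# so the context is the plain average of the frames.
ctx, att = time_attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

With non-zero scores, frames that align more strongly with `w` receive larger weights, which is how such a layer can emphasize the frames that matter for the decision.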
Speaker-Independent Speech Emotion Recognition Based on Two-Layer Multiple Kernel Learning
Yun JIN Peng SONG Wenming ZHENG Li ZHAO Minghai XIN

LETTER-Speech and Hearing

Vol:
E96-D No:10
Page(s):
2286-2289
In this paper, a two-layer Multiple Kernel Learning (MKL) scheme for speaker-independent speech emotion recognition is presented. In the first layer, MKL is used for feature selection. The training samples are separated into n groups according to some rules. All groups are used for feature selection to obtain n sparse feature subsets. The intersection and the union of all feature subsets are the result of our feature selection methods. In the second layer, MKL is used again for speech emotion classification with the selected features. In order to evaluate the effectiveness of our proposed two-layer MKL scheme, we compare it with state-of-the-art results. It is shown that our scheme results in large gain in performance. Furthermore, another experiment is carried out to compare our feature selection method with other popular ones. And the result proves the effectiveness of our feature selection method.
Sub-Band Noise Reduction in Multi-Channel Digital Hearing Aid
Qingyun WANG Ruiyu LIANG Li JING Cairong ZOU Li ZHAO

LETTER-Speech and Hearing

Publicized:
2015/10/14
Vol:
E99-D No:1
Page(s):
292-295
Since digital hearing aids are sensitive to time delay and power consumption, the computational complexity of noise reduction must be reduced as much as possible. Therefore, some complicated algorithms based on the analysis of the time-frequency domain are very difficult to implement in digital hearing aids. This paper presents a new approach that yields an improved noise reduction algorithm with greatly reduced computational complexity for multi-channel digital hearing aids. First, the sub-band sound pressure level (SPL) is calculated in real time. Then, based on the calculated sub-band SPL, the noise in the sub-band is estimated and the possibility of speech is computed. Finally, a posteriori and a priori signal-to-noise ratios are estimated and the gain function is acquired to reduce the noise adaptively. By replacing the FFT and IFFT transforms with the known SPL, the proposed algorithm greatly reduces the computation load. Experiments on a prototype digital hearing aid show that the time delay is decreased to nearly half that of the traditional adaptive Wiener filtering and spectral subtraction algorithms, while the SNR improvement and PESQ score remain satisfactory. Compared with the modulation frequency-based noise reduction algorithm, which is used in many commercial digital hearing aids, the proposed algorithm achieves not only more than 5 dB SNR improvement but also less time delay and power consumption.
Estimation of Multi-Layer Tissue Conductivities from Non-invasively Measured Bioresistances Using Divided Electrodes
Xueli ZHAO Yohsuke KINOUCHI Tadamitsu IRITANI Tadaoki MORIMOTO Mieko TAKEUCHI

PAPER-Medical Engineering

Vol:
E85-D No:6
Page(s):
1031-1038
To estimate the inner multi-layer tissue conductivity distribution in a cross section of the local tissue by using bioresistance data measured noninvasively on the surface of the tissue, a measurement method using divided electrodes is proposed, where a current electrode is divided into several parts. The method is evaluated by computer simulations using a three-dimensional (3D) model and two two-dimensional (2D) models. In this paper, conductivity distributions of the simplified (2D) model are analyzed based on a combination of a finite difference method (FDM) and a steepest descent method (SDM). Simulation results show that conductivity values for skin, fat and muscle layers can be estimated with an error of less than 0.1%. Even when random noise of different strengths is added to the measured resistance values, the conductivities are estimated with reasonable precision, e.g., the average error is about 4.25% for 10% noise. The configuration of the divided electrodes is examined in terms of the dividing pattern and the size of the surrounding guard electrodes to confine and control the input currents from the divided electrodes within a cross sectional area in the tissue.
A Salient Feature Extraction Algorithm for Speech Emotion Recognition
Ruiyu LIANG Huawei TAO Guichen TANG Qingyun WANG Li ZHAO

LETTER-Speech and Hearing

Publicized:
2015/05/29
Vol:
E98-D No:9
Page(s):
1715-1718
A salient feature extraction algorithm is proposed to improve the recognition rate of speech emotion. Firstly, the spectrogram of the emotional speech is calculated. Secondly, imitating the selective attention mechanism, the color, direction and brightness maps of the spectrogram are computed. Each map is normalized and down-sampled to form a low resolution feature matrix. Then, each feature matrix is converted to a row vector and principal component analysis (PCA) is used to reduce feature redundancy and make the subsequent classification algorithm more practical. Finally, the speech emotion is classified with the support vector machine. Compared with the traditional features, the improvement in recognition rate reaches 15%.
An Iterative Technique for Optimally Designing Extrapolated Impulse Response Filter in the Mini-Max Sense
Hao WANG Li ZHAO Wenjiang PEI Jiakuo ZUO Qingyun WANG Minghai XIN

LETTER-Systems and Control

Vol:
E96-A No:10
Page(s):
2029-2033
The optimal design of an extrapolated impulse response (EIR) filter (in the mini-max sense) is a non-linear programming problem. In this paper, the optimal design of the EIR filter by the semi-infinite programming (SIP) is investigated and an iterative technique for optimally designing the EIR filter is proposed. The simulation experiment validates the effectiveness of the SIP technique and the proposed iterative technique in the optimal design of the EIR filter.
