The search functionality is under construction.

Author Search Result

[Author] Li ZHAO(20hit)

1-20hit
  • An Integrated Convolutional Neural Network with a Fusion Attention Mechanism for Acoustic Scene Classification

    Pengxu JIANG  Yue XIE  Cairong ZOU  Li ZHAO  Qingyun WANG  

     
    LETTER-Engineering Acoustics

      Pubricized:
    2023/02/06
      Vol:
    E106-A No:8
      Page(s):
    1057-1061

    In human-computer interaction, acoustic scene classification (ASC) is one of the relevant research domains. In real life, the recorded audio may include a lot of noise and quiet clips, making it hard for earlier ASC-based research to isolate the crucial scene information in sound. Furthermore, scene information may be scattered across numerous audio frames; hence, selecting scene-related frames is crucial for ASC. In this context, an integrated convolutional neural network with a fusion attention mechanism (ICNN-FA) is proposed for ASC. Firstly, segmented mel-spectrograms as the input of ICNN can assist the model in learning the short-term time-frequency correlation information. Then, the designed ICNN model is employed to learn these segment-level features. In addition, the proposed global attention layer may gather global information by integrating these segment features. Finally, the developed fusion attention layer is utilized to fuse all segment-level features while the classifier classifies various situations. Experimental findings using ASC datasets from DCASE 2018 and 2019 indicate the efficacy of the suggested method.

  • Sparse FIR Filter Design Using Binary Particle Swarm Optimization

    Chen WU  Yifeng ZHANG  Yuhui SHI  Li ZHAO  Minghai XIN  

     
    LETTER-Digital Signal Processing

      Vol:
    E97-A No:12
      Page(s):
    2653-2657

    Recently, design of sparse finite impulse response (FIR) digital filters has attracted much attention due to its ability to reduce the implementation cost. However, finding a filter with the fewest number of nonzero coefficients subject to prescribed frequency domain constraints is a rather difficult problem because of its non-convexity. In this paper, an algorithm based on binary particle swarm optimization (BPSO) is proposed, which successively thins the filter coefficients until no sparser solution can be obtained. The proposed algorithm is evaluated on a set of examples, and better results can be achieved than other existing algorithms.

  • Attention-Based Dense LSTM for Speech Emotion Recognition Open Access

    Yue XIE  Ruiyu LIANG  Zhenlin LIANG  Li ZHAO  

     
    LETTER-Pattern Recognition

      Pubricized:
    2019/04/17
      Vol:
    E102-D No:7
      Page(s):
    1426-1429

    Despite the widespread use of deep learning for speech emotion recognition, they are severely restricted due to the information loss in the high layer of deep neural networks, as well as the degradation problem. In order to efficiently utilize information and solve degradation, attention-based dense long short-term memory (LSTM) is proposed for speech emotion recognition. LSTM networks with the ability to process time series such as speech are constructed into which attention-based dense connections are introduced. That means the weight coefficients are added to skip-connections of each layer to distinguish the difference of the emotional information between layers and avoid the interference of redundant information from the bottom layer to the effective information from the top layer. The experiments demonstrate that proposed method improves the recognition performance by 12% and 7% on eNTERFACE and IEMOCAP corpus respectively.

  • Speaker-Independent Speech Emotion Recognition Based Multiple Kernel Learning of Collaborative Representation

    Cheng ZHA  Xinrang ZHANG  Li ZHAO  Ruiyu LIANG  

     
    LETTER-Engineering Acoustics

      Vol:
    E99-A No:3
      Page(s):
    756-759

    We propose a novel multiple kernel learning (MKL) method using a collaborative representation constraint, called CR-MKL, for fusing the emotion information from multi-level features. To this end, the similarity and distinctiveness of multi-level features are learned in the kernels-induced space using the weighting distance measure. Our method achieves better performance than existing methods by using the voiced-level and unvoiced-level features.

  • A Comparative Study of Output Probability Functions in HMMs

    Seiichi NAKAGAWA  Li ZHAO  Hideyuki SUZUKI  

     
    PAPER

      Vol:
    E78-D No:6
      Page(s):
    669-675

    One of the most effective methods in speech recognition is the HMM which has been used to model speech statistically. The discrete distribution and the continuos distribution HMMs have been widely used in various applications. However, in recent years, HMMs with various output probability functions have been proposed to further improve recognition performance, e.g. the Gaussian mixture continuous and the semi-continuous distributed HMMs. We recently have also proposed the RBF (radial basis function)-based HMM and the VQ-distortion based HMM which use a RBF function and VQ-distortion measure at each state instead of an output probability density function used by traditional HMMs. In this paper, we describe the RBF-based HMM and the VQ-distortion based HMM and compare their performance with the discrete distributed, the Gaussian mixture distributed and the semi-continuous distributed HMMs based on their speech recognition performance rates through experiments on speaker-independent spoken digit recognition. Our results confirmed that the RBF-based and VQ-distortion based HMMs are more robust and superior to traditional HMMs.

  • A Novel Hybrid Network Model Based on Attentional Multi-Feature Fusion for Deception Detection

    Yuanbo FANG  Hongliang FU  Huawei TAO  Ruiyu LIANG  Li ZHAO  

     
    LETTER-Speech and Hearing

      Pubricized:
    2020/09/24
      Vol:
    E104-A No:3
      Page(s):
    622-626

    Speech based deception detection using deep learning is one of the technologies to realize a deception detection system with high recognition rate in the future. Multi-network feature extraction technology can effectively improve the recognition performance of the system, but due to the limited labeled data and the lack of effective feature fusion methods, the performance of the network is limited. Based on this, a novel hybrid network model based on attentional multi-feature fusion (HN-AMFF) is proposed. Firstly, the static features of large amounts of unlabeled speech data are input into DAE for unsupervised training. Secondly, the frame-level features and static features of a small amount of labeled speech data are simultaneously input into the LSTM network and the encoded output part of DAE for joint supervised training. Finally, a feature fusion algorithm based on attention mechanism is proposed, which can get the optimal feature set in the training process. Simulation results show that the proposed feature fusion method is significantly better than traditional feature fusion methods, and the model can achieve advanced performance with only a small amount of labeled data.

  • An Effective Acoustic Feedback Cancellation Algorithm Based on the Normalized Sub-Band Adaptive Filter

    Xia WANG  Ruiyu LIANG  Qingyun WANG  Li ZHAO  Cairong ZOU  

     
    LETTER-Speech and Hearing

      Pubricized:
    2015/10/20
      Vol:
    E99-D No:1
      Page(s):
    288-291

    In this letter, an effective acoustic feedback cancellation algorithm is proposed based on the normalized sub-band adaptive filter (NSAF). To improve the confliction between fast convergence rate and low misalignment in the NSAF algorithm, a variable step size is designed to automatically vary according to the update state of the filter. The update state of the filter is adaptively detected via the normalized distance between the long term average and the short term average of the tap-weight vector. Simulation results demonstrate that the proposed algorithm has superior performance in terms of convergence rate and misalignment.

  • Joint Patch Weighting and Moment Matching for Unsupervised Domain Adaptation in Micro-Expression Recognition

    Jie ZHU  Yuan ZONG  Hongli CHANG  Li ZHAO  Chuangao TANG  

     
    LETTER-Image Recognition, Computer Vision

      Pubricized:
    2021/11/17
      Vol:
    E105-D No:2
      Page(s):
    441-445

    Unsupervised domain adaptation (DA) is a challenging machine learning problem since the labeled training (source) and unlabeled testing (target) sets belong to different domains and then have different feature distributions, which has recently attracted wide attention in micro-expression recognition (MER). Although some well-performing unsupervised DA methods have been proposed, these methods cannot well solve the problem of unsupervised DA in MER, a. k. a., cross-domain MER. To deal with such a challenging problem, in this letter we propose a novel unsupervised DA method called Joint Patch weighting and Moment Matching (JPMM). JPMM bridges the source and target micro-expression feature sets by minimizing their probability distribution divergence with a multi-order moment matching operation. Meanwhile, it takes advantage of the contributive facial patches by the weight learning such that a domain-invariant feature representation involving micro-expression distinguishable information can be learned. Finally, we carry out extensive experiments to evaluate the proposed JPMM method is superior to recent state-of-the-art unsupervised DA methods in dealing with cross-domain MER.

  • Spectral Features Based on Local Normalized Center Moments for Speech Emotion Recognition

    Huawei TAO  Ruiyu LIANG  Xinran ZHANG  Li ZHAO  

     
    LETTER-Speech and Hearing

      Vol:
    E99-A No:10
      Page(s):
    1863-1866

    To discuss whether rotational invariance is the main role in spectrogram features, new spectral features based on local normalized center moments, denoted by LNCMSF, are proposed. The proposed LNCMSF firstly adopts 2nd order normalized center moments to describe local energy distribution of the logarithmic energy spectrum, then normalized center moment spectrograms NC1 and NC2 are gained. Secondly, DCT (Discrete Cosine Transform) is used to eliminate the correlation of NC1 and NC2, then high order cepstral coefficients TNC1 and TNC2 are obtained. Finally, LNCMSF is generated by combining NC1, NC2, TNC1 and TNC2. The rotational invariance test experiment shows that the rotational invariance is not a necessary property in partial spectrogram features. The recognition experiment shows that the maximum UA (Unweighted Average of Class-Wise Recall Rate) of LNCMSF are improved by at least 10.7% and 1.2% respectively, compared to that of MFCC (Mel Frequency Cepstrum Coefficient) and HuWSF (Weighted Spectral Features Based on Local Hu Moments).

  • Low-Complexity Resource Allocation Algorithm for Multicell OFDMA System

    Qingli ZHAO  Fangjiong CHEN  Sujuan XIONG  Gang WEI  

     
    LETTER-Wireless Communication Technologies

      Vol:
    E96-B No:5
      Page(s):
    1218-1221

    Low-complexity joint subcarrier and power allocation is considered. The applied criterion is to minimize the transmission power while satisfying the users' rate requirements. Subcarrier and power allocation are separately applied. Fixed spectrum efficiency is assumed to simplify the subcarrier allocation. We show that under fixed spectrum efficiency, power allocation can be obtained by solving some sets of linear equations. Simulation result shows the effectiveness of the proposed algorithm.

  • Siamese Attention-Based LSTM for Speech Emotion Recognition

    Tashpolat NIZAMIDIN  Li ZHAO  Ruiyu LIANG  Yue XIE  Askar HAMDULLA  

     
    LETTER-Engineering Acoustics

      Vol:
    E103-A No:7
      Page(s):
    937-941

    As one of the popular topics in the field of human-computer interaction, the Speech Emotion Recognition (SER) aims to classify the emotional tendency from the speakers' utterances. Using the existing deep learning methods, and with a large amount of training data, we can achieve a highly accurate performance result. Unfortunately, it's time consuming and difficult job to build such a huge emotional speech database that can be applicable universally. However, the Siamese Neural Network (SNN), which we discuss in this paper, can yield extremely precise results with just a limited amount of training data through pairwise training which mitigates the impacts of sample deficiency and provides enough iterations. To obtain enough SER training, this study proposes a novel method which uses Siamese Attention-based Long Short-Term Memory Networks. In this framework, we designed two Attention-based Long Short-Term Memory Networks which shares the same weights, and we input frame level acoustic emotional features to the Siamese network rather than utterance level emotional features. The proposed solution has been evaluated on EMODB, ABC and UYGSEDB corpora, and showed significant improvement on SER results, compared to conventional deep learning methods.

  • Spectral Features Based on Local Hu Moments of Gabor Spectrograms for Speech Emotion Recognition

    Huawei TAO  Ruiyu LIANG  Cheng ZHA  Xinran ZHANG  Li ZHAO  

     
    LETTER-Pattern Recognition

      Pubricized:
    2016/05/06
      Vol:
    E99-D No:8
      Page(s):
    2186-2189

    To improve the recognition rate of the speech emotion, new spectral features based on local Hu moments of Gabor spectrograms are proposed, denoted by GSLHu-PCA. Firstly, the logarithmic energy spectrum of the emotional speech is computed. Secondly, the Gabor spectrograms are obtained by convoluting logarithmic energy spectrum with Gabor wavelet. Thirdly, Gabor local Hu moments(GLHu) spectrograms are obtained through block Hu strategy, then discrete cosine transform (DCT) is used to eliminate correlation among components of GLHu spectrograms. Fourthly, statistical features are extracted from cepstral coefficients of GLHu spectrograms, then all the statistical features form a feature vector. Finally, principal component analysis (PCA) is used to reduce redundancy of features. The experimental results on EmoDB and ABC databases validate the effectiveness of GSLHu-PCA.

  • Compressed Sampling and Source Localization of Miniature Microphone Array

    Qingyun WANG  Xinchun JI  Ruiyu LIANG  Li ZHAO  

     
    LETTER

      Vol:
    E97-A No:9
      Page(s):
    1902-1906

    In the traditional microphone array signal processing, the performance degrades rapidly when the array aperture decreases, which has been a barrier restricting its implementation in the small-scale acoustic system such as digital hearing aids. In this work a new compressed sampling method of miniature microphone array is proposed, which compresses information in the internal of ADC by means of mixture system of hardware circuit and software program in order to remove the redundancy of the different array element signals. The architecture of the method is developed using the Verilog language and has already been tested in the FPGA chip. Experiments of compressed sampling and reconstruction show the successful sparseness and reconstruction for speech sources. Owing to having avoided singularity problem of the correlation matrix of the miniature microphone array, when used in the direction of arrival (DOA) estimation in digital hearing aids, the proposed method has the advantage of higher resolution compared with the traditional GCC and MUSIC algorithms.

  • Speech Emotion Recognition Using Transfer Learning

    Peng SONG  Yun JIN  Li ZHAO  Minghai XIN  

     
    LETTER-Speech and Hearing

      Vol:
    E97-D No:9
      Page(s):
    2530-2532

    A major challenge for speech emotion recognition is that when the training and deployment conditions do not use the same speech corpus, the recognition rates will obviously drop. Transfer learning, which has successfully addressed the cross-domain classification or recognition problem, is presented for cross-corpus speech emotion recognition. First, by using the maximum mean discrepancy embedding (MMDE) optimization and dimension reduction algorithms, two close low-dimensional feature spaces are obtained for source and target speech corpora, respectively. Then, a classifier function is trained using the learned low-dimensional features in the labeled source corpus, and directly applied to the unlabeled target corpus for emotion label recognition. Experimental results demonstrate that the transfer learning method can significantly outperform the traditional automatic recognition technique for cross-corpus speech emotion recognition.

  • Detecting Depression from Speech through an Attentive LSTM Network

    Yan ZHAO  Yue XIE  Ruiyu LIANG  Li ZHANG  Li ZHAO  Chengyu LIU  

     
    LETTER-Speech and Hearing

      Pubricized:
    2021/08/24
      Vol:
    E104-D No:11
      Page(s):
    2019-2023

    Depression endangers people's health conditions and affects the social order as a mental disorder. As an efficient diagnosis of depression, automatic depression detection has attracted lots of researcher's interest. This study presents an attention-based Long Short-Term Memory (LSTM) model for depression detection to make full use of the difference between depression and non-depression between timeframes. The proposed model uses frame-level features, which capture the temporal information of depressive speech, to replace traditional statistical features as an input of the LSTM layers. To achieve more multi-dimensional deep feature representations, the LSTM output is then passed on attention layers on both time and feature dimensions. Then, we concat the output of the attention layers and put the fused feature representation into the fully connected layer. At last, the fully connected layer's output is passed on to softmax layer. Experiments conducted on the DAIC-WOZ database demonstrate that the proposed attentive LSTM model achieves an average accuracy rate of 90.2% and outperforms the traditional LSTM network and LSTM with local attention by 0.7% and 2.3%, respectively, which indicates its feasibility.

  • Speaker-Independent Speech Emotion Recognition Based on Two-Layer Multiple Kernel Learning

    Yun JIN  Peng SONG  Wenming ZHENG  Li ZHAO  Minghai XIN  

     
    LETTER-Speech and Hearing

      Vol:
    E96-D No:10
      Page(s):
    2286-2289

    In this paper, a two-layer Multiple Kernel Learning (MKL) scheme for speaker-independent speech emotion recognition is presented. In the first layer, MKL is used for feature selection. The training samples are separated into n groups according to some rules. All groups are used for feature selection to obtain n sparse feature subsets. The intersection and the union of all feature subsets are the result of our feature selection methods. In the second layer, MKL is used again for speech emotion classification with the selected features. In order to evaluate the effectiveness of our proposed two-layer MKL scheme, we compare it with state-of-the-art results. It is shown that our scheme results in large gain in performance. Furthermore, another experiment is carried out to compare our feature selection method with other popular ones. And the result proves the effectiveness of our feature selection method.

  • Sub-Band Noise Reduction in Multi-Channel Digital Hearing Aid

    Qingyun WANG  Ruiyu LIANG  Li JING  Cairong ZOU  Li ZHAO  

     
    LETTER-Speech and Hearing

      Pubricized:
    2015/10/14
      Vol:
    E99-D No:1
      Page(s):
    292-295

    Since digital hearing aids are sensitive to time delay and power consumption, the computational complexity of noise reduction must be reduced as much as possible. Therefore, some complicated algorithms based on the analysis of the time-frequency domain are very difficult to implement in digital hearing aids. This paper presents a new approach that yields an improved noise reduction algorithm with greatly reduce computational complexity for multi-channel digital hearing aids. First, the sub-band sound pressure level (SPL) is calculated in real time. Then, based on the calculated sub-band SPL, the noise in the sub-band is estimated and the possibility of speech is computed. Finally, a posteriori and a priori signal-to-noise ratios are estimated and the gain function is acquired to reduce the noise adaptively. By replacing the FFT and IFFT transforms by the known SPL, the proposed algorithm greatly reduces the computation loads. Experiments on a prototype digital hearing aid show that the time delay is decreased to nearly half that of the traditional adaptive Wiener filtering and spectral subtraction algorithms, but the SNR improvement and PESQ score are rather satisfied. Compared with modulation frequency-based noise reduction algorithm, which is used in many commercial digital hearing aids, the proposed algorithm achieves not only more than 5dB SNR improvement but also less time delay and power consumption.

  • Estimation of Multi-Layer Tissue Conductivities from Non-invasively Measured Bioresistances Using Divided Electrodes

    Xueli ZHAO  Yohsuke KINOUCHI  Tadamitsu IRITANI  Tadaoki MORIMOTO  Mieko TAKEUCHI  

     
    PAPER-Medical Engineering

      Vol:
    E85-D No:6
      Page(s):
    1031-1038

    To estimate inner multi-layer tissue conductivity distribution in a cross section of the local tissue by using bioresistance data measured noninvasively on the surface of the tissue, a measurement method using divided electrodes is proposed, where a current electrode is divided into several parts. The method is evaluated by computer simulations using a three-dimension (3D) model and two two-dimension (2D) models. In this paper, conductivity distributions of the simplified (2D) model are analyzed based on a combination of a finite difference method (FDM) and a steepest descent method (SDM). Simulation results show that conductivity values for skin, fat and muscle layers can be estimated with an error less than 0.1%. Even though different strength random noise is added to measured resistance values, the conductivities are estimated with reasonable precise, e.g., the average error is about 4.25% for 10% noise. The configuration of the divided electrodes are examined in terms of dividing pattern and the size of surrounding guard electrodes to confine and control the input currents from the divided electrodes within a cross sectional area in the tissue.

  • A Salient Feature Extraction Algorithm for Speech Emotion Recognition

    Ruiyu LIANG  Huawei TAO  Guichen TANG  Qingyun WANG  Li ZHAO  

     
    LETTER-Speech and Hearing

      Pubricized:
    2015/05/29
      Vol:
    E98-D No:9
      Page(s):
    1715-1718

    A salient feature extraction algorithm is proposed to improve the recognition rate of the speech emotion. Firstly, the spectrogram of the emotional speech is calculated. Secondly, imitating the selective attention mechanism, the color, direction and brightness map of the spectrogram is computed. Each map is normalized and down-sampled to form the low resolution feature matrix. Then, each feature matrix is converted to the row vector and the principal component analysis (PCA) is used to reduce features redundancy to make the subsequent classification algorithm more practical. Finally, the speech emotion is classified with the support vector machine. Compared with the tradition features, the improved recognition rate reaches 15%.

  • An Iterative Technique for Optimally Designing Extrapolated Impulse Response Filter in the Mini-Max Sense

    Hao WANG  Li ZHAO  Wenjiang PEI  Jiakuo ZUO  Qingyun WANG  Minghai XIN  

     
    LETTER-Systems and Control

      Vol:
    E96-A No:10
      Page(s):
    2029-2033

    The optimal design of an extrapolated impulse response (EIR) filter (in the mini-max sense) is a non-linear programming problem. In this paper, the optimal design of the EIR filter by the semi-infinite programming (SIP) is investigated and an iterative technique for optimally designing the EIR filter is proposed. The simulation experiment validates the effectiveness of the SIP technique and the proposed iterative technique in the optimal design of the EIR filter.