1-7hit |
Pengxu JIANG Yue XIE Cairong ZOU Li ZHAO Qingyun WANG
In human-computer interaction, acoustic scene classification (ASC) is one of the relevant research domains. In real life, the recorded audio may include a lot of noise and quiet clips, making it hard for earlier ASC-based research to isolate the crucial scene information in sound. Furthermore, scene information may be scattered across numerous audio frames; hence, selecting scene-related frames is crucial for ASC. In this context, an integrated convolutional neural network with a fusion attention mechanism (ICNN-FA) is proposed for ASC. Firstly, segmented mel-spectrograms as the input of ICNN can assist the model in learning the short-term time-frequency correlation information. Then, the designed ICNN model is employed to learn these segment-level features. In addition, the proposed global attention layer may gather global information by integrating these segment features. Finally, the developed fusion attention layer is utilized to fuse all segment-level features while the classifier classifies various situations. Experimental findings using ASC datasets from DCASE 2018 and 2019 indicate the efficacy of the suggested method.
Pengxu JIANG Yang YANG Yue XIE Cairong ZOU Qingyun WANG
Convolutional neural network (CNN) is widely used in acoustic scene classification (ASC) tasks. In most cases, local convolution is utilized to gather time-frequency information between spectrum nodes. It is challenging to adequately express the non-local link between frequency domains in a finite convolution region. In this paper, we propose a dual-path convolutional neural network based on band interaction block (DCNN-bi) for ASC, with mel-spectrogram as the model’s input. We build two parallel CNN paths to learn the high-frequency and low-frequency components of the input feature. Additionally, we have created three band interaction blocks (bi-blocks) to explore the pertinent nodes between various frequency bands, which are connected between two paths. Combining the time-frequency information from two paths, the bi-blocks with three distinct designs acquire non-local information and send it back to the respective paths. The experimental results indicate that the utilization of the bi-block has the potential to improve the initial performance of the CNN substantially. Specifically, when applied to the DCASE 2018 and DCASE 2020 datasets, the CNN exhibited performance improvements of 1.79% and 3.06%, respectively.
Xia WANG Ruiyu LIANG Qingyun WANG Li ZHAO Cairong ZOU
In this letter, an effective acoustic feedback cancellation algorithm is proposed based on the normalized sub-band adaptive filter (NSAF). To improve the confliction between fast convergence rate and low misalignment in the NSAF algorithm, a variable step size is designed to automatically vary according to the update state of the filter. The update state of the filter is adaptively detected via the normalized distance between the long term average and the short term average of the tap-weight vector. Simulation results demonstrate that the proposed algorithm has superior performance in terms of convergence rate and misalignment.
Qingyun WANG Xinchun JI Ruiyu LIANG Li ZHAO
In the traditional microphone array signal processing, the performance degrades rapidly when the array aperture decreases, which has been a barrier restricting its implementation in the small-scale acoustic system such as digital hearing aids. In this work a new compressed sampling method of miniature microphone array is proposed, which compresses information in the internal of ADC by means of mixture system of hardware circuit and software program in order to remove the redundancy of the different array element signals. The architecture of the method is developed using the Verilog language and has already been tested in the FPGA chip. Experiments of compressed sampling and reconstruction show the successful sparseness and reconstruction for speech sources. Owing to having avoided singularity problem of the correlation matrix of the miniature microphone array, when used in the direction of arrival (DOA) estimation in digital hearing aids, the proposed method has the advantage of higher resolution compared with the traditional GCC and MUSIC algorithms.
Qingyun WANG Ruiyu LIANG Li JING Cairong ZOU Li ZHAO
Since digital hearing aids are sensitive to time delay and power consumption, the computational complexity of noise reduction must be reduced as much as possible. Therefore, some complicated algorithms based on the analysis of the time-frequency domain are very difficult to implement in digital hearing aids. This paper presents a new approach that yields an improved noise reduction algorithm with greatly reduce computational complexity for multi-channel digital hearing aids. First, the sub-band sound pressure level (SPL) is calculated in real time. Then, based on the calculated sub-band SPL, the noise in the sub-band is estimated and the possibility of speech is computed. Finally, a posteriori and a priori signal-to-noise ratios are estimated and the gain function is acquired to reduce the noise adaptively. By replacing the FFT and IFFT transforms by the known SPL, the proposed algorithm greatly reduces the computation loads. Experiments on a prototype digital hearing aid show that the time delay is decreased to nearly half that of the traditional adaptive Wiener filtering and spectral subtraction algorithms, but the SNR improvement and PESQ score are rather satisfied. Compared with modulation frequency-based noise reduction algorithm, which is used in many commercial digital hearing aids, the proposed algorithm achieves not only more than 5dB SNR improvement but also less time delay and power consumption.
Ruiyu LIANG Huawei TAO Guichen TANG Qingyun WANG Li ZHAO
A salient feature extraction algorithm is proposed to improve the recognition rate of the speech emotion. Firstly, the spectrogram of the emotional speech is calculated. Secondly, imitating the selective attention mechanism, the color, direction and brightness map of the spectrogram is computed. Each map is normalized and down-sampled to form the low resolution feature matrix. Then, each feature matrix is converted to the row vector and the principal component analysis (PCA) is used to reduce features redundancy to make the subsequent classification algorithm more practical. Finally, the speech emotion is classified with the support vector machine. Compared with the tradition features, the improved recognition rate reaches 15%.
Hao WANG Li ZHAO Wenjiang PEI Jiakuo ZUO Qingyun WANG Minghai XIN
The optimal design of an extrapolated impulse response (EIR) filter (in the mini-max sense) is a non-linear programming problem. In this paper, the optimal design of the EIR filter by the semi-infinite programming (SIP) is investigated and an iterative technique for optimally designing the EIR filter is proposed. The simulation experiment validates the effectiveness of the SIP technique and the proposed iterative technique in the optimal design of the EIR filter.