Hong Kook KIM Seung Ho CHOI Hwang Soo LEE
In this paper, we propose dynamic cepstral representations that effectively capture the temporal information of cepstral coefficients. The number of speech frames used in the regression analysis that extracts a dynamic cepstral coefficient is made inversely proportional to the cepstral order, since higher-order cepstral coefficients fluctuate more than lower-order ones. By exploiting the relationship between the window length for extracting a dynamic cepstral coefficient and the statistical variance of that coefficient, we propose three windowing methods: an utterance-specific variance-ratio windowing method, a statistical variance-ratio windowing method, and an inverse-lifter windowing method. Intra-speaker, inter-speaker, and speaker-independent recognition tests on 100 phonetically balanced words are carried out to evaluate the performance of the proposed order-dependent windowing methods.
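As an illustration of the idea of an order-dependent regression window, the following Python sketch computes delta cepstra with a half-window length that shrinks as the cepstral order grows. The window rule and the parameter base_half_window are assumptions for illustration only, not the specific variance-ratio or inverse-lifter rules proposed in the paper.

```python
import numpy as np

def order_dependent_delta(cepstra, base_half_window=4):
    """Illustrative sketch: regression-based delta cepstra whose half-window
    length decreases with the cepstral order, reflecting that higher-order
    coefficients fluctuate more over time.

    cepstra          : (num_frames, num_orders) array of static cepstral coefficients
    base_half_window : half-window used for the lowest cepstral order (assumed value)
    """
    num_frames, num_orders = cepstra.shape
    deltas = np.zeros_like(cepstra)
    for k in range(num_orders):
        # Half-window roughly inversely proportional to the order index (assumption).
        n_k = max(1, int(np.ceil(base_half_window / (k + 1))))
        denom = 2.0 * sum(n * n for n in range(1, n_k + 1))
        for t in range(num_frames):
            acc = 0.0
            for n in range(1, n_k + 1):
                right = cepstra[min(t + n, num_frames - 1), k]
                left = cepstra[max(t - n, 0), k]
                acc += n * (right - left)
            deltas[t, k] = acc / denom
    return deltas
```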
Chenlin HU Jin Young KIM Seung Ho CHOI Chang Joo KIM
Tonal signals appear as spectral peaks in the frequency domain. When the number of spectral peaks is small and the spectrum is sparse, Compressive Sensing (CS) can be adopted to locate the peaks with a low-cost sensing system. In the CS scheme, a time-domain signal is modelled as $\boldsymbol{y}=\Phi F^{-1}\boldsymbol{s}$, where $\boldsymbol{y}$ and $\boldsymbol{s}$ are signal vectors in the time and frequency domains, and $F^{-1}$ and $\Phi$ are an inverse DFT matrix and a random-sampling matrix, respectively. For a given $\boldsymbol{y}$ and $\Phi$, the CS method attempts to estimate $\boldsymbol{s}$ with $l_0$ or $l_1$ optimization. To generate the peak candidates, we adopt the frequency-domain information of $\tilde{\boldsymbol{s}} = \boldsymbol{F}\tilde{\boldsymbol{y}}$, where $\tilde{\boldsymbol{y}}$ is the extended version of $\boldsymbol{y}$ and $\tilde{\boldsymbol{y}}(n)$ is zero when $n$ is not one of the CS time instances. In this paper, we develop Gaussian statistics of $\tilde{\boldsymbol{s}}$; that is, the mean and variance of $\tilde{\boldsymbol{s}}(k)$ are examined.
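A minimal numerical sketch of this signal model is given below, assuming an $N$-point DFT, $M$ randomly chosen time instances, and a handful of tonal peaks; all sizes and values are illustrative, not taken from the paper. It forms $\boldsymbol{y}=\Phi F^{-1}\boldsymbol{s}$, builds the zero-filled extension $\tilde{\boldsymbol{y}}$, and inspects $\tilde{\boldsymbol{s}} = \boldsymbol{F}\tilde{\boldsymbol{y}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 256, 64                      # DFT length and number of CS samples (assumed values)

# Sparse frequency-domain vector s with a few tonal peaks.
s = np.zeros(N, dtype=complex)
peak_bins = rng.choice(N, size=3, replace=False)
s[peak_bins] = 10.0

# Inverse DFT matrix F^{-1} and random-sampling matrix Phi (row selection of M time instances).
F = np.fft.fft(np.eye(N)) / np.sqrt(N)        # unitary DFT matrix
F_inv = np.conj(F).T                          # its inverse
time_instances = np.sort(rng.choice(N, size=M, replace=False))
Phi = np.eye(N)[time_instances]

# Time-domain CS measurements y = Phi F^{-1} s.
y = Phi @ (F_inv @ s)

# Zero-filled extension y_tilde: equal to y at the CS time instances, zero elsewhere.
y_tilde = np.zeros(N, dtype=complex)
y_tilde[time_instances] = y

# Frequency-domain information s_tilde = F y_tilde; the paper characterizes the
# Gaussian statistics of these values, here we simply inspect one realization.
s_tilde = F @ y_tilde
print(np.abs(s_tilde[peak_bins]).round(2))    # the true peak bins stand out
print(s_tilde.mean(), s_tilde.var())
```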
Yoonhee KIM Deokgyu YUN Hannah LEE Seung Ho CHOI
This paper presents a deep learning-based non-intrusive speech intelligibility estimation method that uses bottleneck features of an autoencoder. The conventional standard non-intrusive speech intelligibility estimation method, P.563, lacks intelligibility estimation performance in various noise environments. We propose a more accurate speech intelligibility estimation method based on a long short-term memory (LSTM) neural network whose input and output are autoencoder bottleneck features and a short-time objective intelligibility (STOI) score, respectively, where STOI is a standard tool for measuring intrusive speech intelligibility with reference speech signals. We show that the proposed method achieves superior performance compared with the conventional standard P.563 and mel-frequency cepstral coefficient (MFCC) feature-based intelligibility estimation methods for speech signals in various noise environments.
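The following PyTorch sketch illustrates the two components described above under assumed layer sizes: an autoencoder whose bottleneck activations serve as features, and an LSTM regressor that maps the bottleneck sequence to a single STOI score. It is a sketch of the architecture, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    """Frame-wise autoencoder; the bottleneck activations are used as features."""
    def __init__(self, feat_dim=257, bottleneck_dim=64):   # dimensions are assumptions
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))

    def forward(self, x):
        z = self.encoder(x)                  # bottleneck features
        return self.decoder(z), z

class LstmStoiEstimator(nn.Module):
    """Maps a sequence of bottleneck features to one utterance-level STOI score."""
    def __init__(self, bottleneck_dim=64, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(bottleneck_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, z_seq):                # z_seq: (batch, frames, bottleneck_dim)
        _, (h_n, _) = self.lstm(z_seq)
        return torch.sigmoid(self.out(h_n[-1]))   # STOI lies in [0, 1]

# Usage sketch: spectra -> bottleneck features -> estimated STOI (dummy data).
ae, est = BottleneckAutoencoder(), LstmStoiEstimator()
spectra = torch.randn(2, 100, 257)           # (batch, frames, feat_dim)
_, z = ae(spectra)
stoi_hat = est(z)                            # shape (2, 1)
```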
In this paper, we propose a statistical approach to improving the performance of spectral quantization in speech coders. The proposed techniques compensate for the distortion in a decoded line spectrum pair (LSP) vector based on a statistical mapping function between a decoded LSP vector and its corresponding original LSP vector. We first develop two codebook-based probabilistic matching (CBPM) methods by investigating the distribution of LSP vectors, and then propose an iterative procedure for the two CBPMs. Next, the proposed techniques are applied to the predictive vector quantizer (PVQ) used in the IS-641 speech coder. The experimental results show that the proposed techniques reduce the average spectral distortion by around 0.064 dB and lower the percentage of outliers compared with the PVQ without any compensation, resulting in transparent quality of spectral quantization. Finally, speech quality is compared using the perceptual evaluation of speech quality (PESQ) measure, and it is shown that the IS-641 speech coder employing the proposed techniques delivers better decoded speech quality than the standard IS-641 speech coder.
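A simplified numpy sketch of the codebook-based compensation idea follows, assuming a hard nearest-codeword assignment and a per-codeword mean offset learned from training pairs of decoded and original LSP vectors. The paper's CBPM is a probabilistic mapping, so this is only a rough illustration of the mapping-based compensation step, with all names chosen for the example.

```python
import numpy as np

def train_compensation(decoded_lsp, original_lsp, codebook):
    """Learn a per-codeword mean offset between decoded and original LSP vectors.

    decoded_lsp, original_lsp : (num_vectors, lsp_order) training pairs
    codebook                  : (num_codewords, lsp_order) codebook over decoded LSPs
    """
    # Assign each decoded vector to its nearest codeword.
    dists = np.linalg.norm(decoded_lsp[:, None, :] - codebook[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)
    offsets = np.zeros_like(codebook)
    for c in range(len(codebook)):
        mask = idx == c
        if mask.any():
            offsets[c] = (original_lsp[mask] - decoded_lsp[mask]).mean(axis=0)
    return offsets

def compensate(decoded_vec, codebook, offsets):
    """Shift a decoded LSP vector toward its statistically expected original."""
    c = np.linalg.norm(codebook - decoded_vec, axis=1).argmin()
    return np.sort(decoded_vec + offsets[c])     # keep LSP frequencies ordered
```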
Deokgyu YUN Hannah LEE Seung Ho CHOI
This paper proposes a deep learning-based non-intrusive objective speech intelligibility estimation method based on a recurrent neural network (RNN) with a long short-term memory (LSTM) structure. Conventional non-intrusive estimation methods such as the standard P.563 have poor estimation performance and lack consistency, especially in various noise and reverberation environments. The proposed method trains the LSTM RNN model parameters by utilizing the short-time objective intelligibility (STOI) measure, a standard intrusive intelligibility estimation method that requires a reference speech signal. The input and output of the LSTM RNN are the mel-frequency cepstral coefficient (MFCC) vector and the frame-wise STOI value, respectively. Experimental results show that the proposed objective intelligibility estimation method outperforms the conventional standard P.563 in various noisy and reverberant environments.
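A minimal PyTorch sketch of the described mapping is given below, with assumed dimensions and dummy data: an LSTM takes a sequence of MFCC vectors and emits a frame-wise STOI estimate, trained against STOI targets that would normally be computed intrusively from the reference signal (replaced here by random placeholders).

```python
import torch
import torch.nn as nn

class FrameWiseStoiLstm(nn.Module):
    def __init__(self, mfcc_dim=13, hidden_dim=128):      # dimensions are assumptions
        super().__init__()
        self.lstm = nn.LSTM(mfcc_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, mfcc_seq):                 # (batch, frames, mfcc_dim)
        h, _ = self.lstm(mfcc_seq)
        return torch.sigmoid(self.out(h)).squeeze(-1)   # (batch, frames), STOI in [0, 1]

model = FrameWiseStoiLstm()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Dummy batch: MFCC sequences and frame-wise STOI targets from the intrusive measure.
mfcc = torch.randn(4, 200, 13)
stoi_target = torch.rand(4, 200)

pred = model(mfcc)
loss = criterion(pred, stoi_target)
loss.backward()
optimizer.step()
```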