
Keyword Search Result

[Keyword] vocoder (10 hits)

1-10 of 10 hits
  • Vector Quantization of Speech Spectrum Based on the VQ-VAE Embedding Space Learning by GAN Technique

    Tanasan SRIKOTR  Kazunori MANO  

     
    PAPER-Speech and Hearing, Digital Signal Processing

  Publicized:
    2021/09/30
      Vol:
    E105-A No:4
      Page(s):
    647-654

    The spectral envelope is a parameter critical to a vocoder's output quality. The Vector Quantized Variational AutoEncoder (VQ-VAE) is a state-of-the-art end-to-end quantization method based on a deep learning model. This paper proposes a new technique, called VQ-VAE-EMGAN, that improves the embedding space learning of the VQ-VAE with a Generative Adversarial Network for quantizing the spectral envelope parameter. In experiments, we designed a quantizer for the spectral envelope parameters of the WORLD vocoder extracted from 16 kHz speech waveforms. The results show that the proposed technique reduced the Log Spectral Distortion (LSD) by around 0.5 dB and increased the PESQ score by around 0.17 on average over four target bit operations, compared to the conventional VQ-VAE.
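    The vector quantization step at the heart of a VQ-VAE can be illustrated in a few lines: an encoder output vector is snapped to its nearest codeword in the learned embedding space. This is a minimal sketch, not the authors' implementation; the codebook contents and vector sizes are hypothetical.

    ```python
    def quantize(z, codebook):
        """Return (index, codeword) of the codebook entry closest to z."""
        best_i, best_d = 0, float("inf")
        for i, c in enumerate(codebook):
            d = sum((zj - cj) ** 2 for zj, cj in zip(z, c))  # squared L2 distance
            if d < best_d:
                best_i, best_d = i, d
        return best_i, codebook[best_i]

    # Toy 2-D codebook with three entries
    codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
    idx, cw = quantize([0.9, 1.2], codebook)  # nearest entry is [1.0, 1.0]
    ```

    In a full VQ-VAE the codebook itself is trained (here, per the paper, with an additional GAN loss on the embedding space), but the decoding-time lookup is exactly this nearest-neighbor search.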

  • DNN-Based Full-Band Speech Synthesis Using GMM Approximation of Spectral Envelope

    Junya KOGUCHI  Shinnosuke TAKAMICHI  Masanori MORISE  Hiroshi SARUWATARI  Shigeki SAGAYAMA  

     
    PAPER-Speech and Hearing

  Publicized:
    2020/09/03
      Vol:
    E103-D No:12
      Page(s):
    2673-2681

    We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis. Each Gaussian function of a GMM fits a local resonance of the spectrum, so the GMM retains the fine structure of the spectral envelope and achieves high controllability. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech. Moreover, a DNN-based TTS synthesis method using GMM-based approximation has not been formulated despite its excellent expressive ability. Therefore, we employ peak-picking-based initialization for full-band speech analysis to provide a better starting point for iterative estimation of the GMM parameters. We introduce not only the prediction error of the GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN, and we propose a multi-task learning method that minimizes both errors simultaneously. We also propose a post-filter based on variance scaling of the GMM to enhance the synthetic speech. Experimental results indicated that 1) our initialization method outperformed the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improved the synthetic speech; and 3) our variance-scaling-based post-filter further improved the synthetic speech.
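    The peak-picking initialization the abstract describes amounts to locating the local maxima of the spectral envelope and using them as initial Gaussian means. A minimal sketch, not the authors' code; the envelope values are hypothetical:

    ```python
    def pick_peaks(envelope):
        """Indices of local maxima of a spectral envelope, usable as
        initial means for iterative GMM parameter estimation."""
        return [i for i in range(1, len(envelope) - 1)
                if envelope[i] > envelope[i - 1] and envelope[i] >= envelope[i + 1]]

    # Two local resonances: one at bin 1, one at bin 4
    env = [0.1, 0.5, 0.3, 0.2, 0.8, 0.4, 0.4]
    peaks = pick_peaks(env)
    ```

    Starting the iterative fit from the resonance peaks, rather than from arbitrary positions, is what lets each Gaussian lock onto one local resonance of the full-band spectrum.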

  • Continuous Noise Masking Based Vocoder for Statistical Parametric Speech Synthesis

    Mohammed Salah AL-RADHI  Tamás Gábor CSAPÓ  Géza NÉMETH  

     
    PAPER-Speech and Hearing

  Publicized:
    2020/02/10
      Vol:
    E103-D No:5
      Page(s):
    1099-1107

    In this article, we propose a method called “continuous noise masking (cNM)” that eliminates residual buzziness in a continuous vocoder, i.e., one in which all parameters are continuous, and that offers a simple and flexible speech analysis and synthesis system. Traditional parametric vocoders generally show a perceptible deterioration in the quality of the synthesized speech due to their various processing algorithms. Furthermore, inaccurate noise resynthesis (e.g., of breathiness or hoarseness) is considered one of the main underlying causes of performance degradation, leading to noisy transients and temporal discontinuity in the synthesized speech. To overcome these issues, the new cNM is developed based on the phase distortion deviation in order to reduce the perceptual effect of residual noise, allow proper reconstruction of noise characteristics, and better model the creaky voice segments that may occur in natural speech. To this end, the cNM is designed to keep only voiced components that fall under the cNM threshold while discarding the rest. We evaluate the proposed approach and compare it with state-of-the-art vocoders using objective measures and subjective listening tests. Experimental results show that the proposed method reduces the effect of residual noise and can reach the quality of sophisticated approaches such as STRAIGHT and the log domain pulse model (PML).
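    The masking rule itself is a simple threshold test: a component is kept as voiced only while its phase distortion deviation stays below the cNM threshold. The sketch below is an illustrative simplification, not the paper's implementation; the component values, deviation values, and threshold are hypothetical.

    ```python
    def continuous_noise_mask(components, pdd, threshold):
        """Keep components whose phase distortion deviation (pdd) is under
        the cNM threshold (treated as voiced); discard the rest."""
        return [c if d < threshold else 0.0 for c, d in zip(components, pdd)]

    # Middle component exceeds the threshold, so it is masked out
    masked = continuous_noise_mask([1.0, 0.8, 0.6], [0.2, 0.9, 0.4], 0.5)
    ```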

  • WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications Open Access

    Masanori MORISE  Fumiya YOKOMORI  Kenji OZAWA  

     
    PAPER-Speech and Hearing

  Publicized:
    2016/04/05
      Vol:
    E99-D No:7
      Page(s):
    1877-1884

    A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis based on vocoders are used in many kinds of speech research. Although several high-quality speech synthesis systems have been developed, their high computational costs have made real-time processing difficult. The new system offers not only high sound quality but also fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output against natural speech including consonants, and its processing speed was compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and its real time factor (RTF) indicated that it was fast enough for real-time processing.

  • Speech/Music Classification Enhancement for 3GPP2 SMV Codec Based on Deep Belief Networks

    Ji-Hyun SONG  Hong-Sub AN  Sangmin LEE  

     
    LETTER-Speech and Hearing

      Vol:
    E97-A No:2
      Page(s):
    661-664

    In this paper, we propose a robust speech/music classification algorithm to improve speech/music classification performance in the selectable mode vocoder (SMV) of 3GPP2 using deep belief networks (DBNs), which are powerful hierarchical generative models for feature extraction that can determine the underlying discriminative characteristics of the extracted features. Six feature vectors selected from the relevant SMV parameters are applied to the visible layer of the proposed DBN-based method. The performance of the proposed algorithm is evaluated using the detection accuracy and error probability of speech and music for various music genres. The proposed algorithm yields better results than both the original SMV method and a support vector machine (SVM)-based method.

  • Efficient Implementation of Voiced/Unvoiced Sounds Classification Based on GMM for SMV Codec

    Ji-Hyun SONG  Joon-Hyuk CHANG  

     
    LETTER-Speech and Hearing

      Vol:
    E92-A No:8
      Page(s):
    2120-2123

    In this letter, we propose an efficient method to improve voiced/unvoiced (V/UV) sound decisions for the selectable mode vocoder (SMV) of 3GPP2 using a Gaussian mixture model (GMM). We first present an analysis of the features and the classification method adopted in the SMV. Feature vectors applied to the GMM are then selected from the relevant SMV parameters for efficient V/UV classification. The performance of the proposed algorithm is evaluated under various conditions and yields better results than the conventional SMV method.
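    A GMM-based V/UV decision of the kind described here reduces to comparing the likelihood of a feature vector under a voiced model against an unvoiced model. This is a toy sketch under simplifying assumptions (diagonal covariances, and a max-component approximation of the mixture likelihood), not the letter's actual classifier; all model parameters below are hypothetical.

    ```python
    import math

    def log_gauss(x, mean, var):
        """Log density of a diagonal-covariance Gaussian component."""
        return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                   for xi, m, v in zip(x, mean, var))

    def classify_vuv(features, voiced_gmm, unvoiced_gmm):
        """Pick the class whose GMM scores the features higher.
        Each GMM is a list of (weight, mean, var) components."""
        def score(gmm):
            return max(math.log(w) + log_gauss(features, m, v) for w, m, v in gmm)
        return "voiced" if score(voiced_gmm) > score(unvoiced_gmm) else "unvoiced"

    # Toy single-component models over a 1-D feature
    voiced = [(1.0, [1.0], [0.1])]
    unvoiced = [(1.0, [0.0], [0.1])]
    label = classify_vuv([0.9], voiced, unvoiced)
    ```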

  • Acoustic Environment Classification Based on SMV Speech Codec Parameters for Context-Aware Mobile Phone

    Kye-Hwan LEE  Joon-Hyuk CHANG  

     
    LETTER-Speech and Hearing

      Vol:
    E92-D No:7
      Page(s):
    1491-1495

    In this letter, an acoustic environment classification algorithm based on the 3GPP2 selectable mode vocoder (SMV) is proposed for context-aware mobile phones. Classification of the acoustic environment is performed based on a Gaussian mixture model (GMM) using coding parameters of the SMV extracted directly from the encoding process of the acoustic input data in the mobile phone. Experimental results show that the proposed environment classification algorithm provides superior performance over a conventional method in various acoustic environments.

  • Transform-Based CELP Vocoders with Low-Delay Low-Complexity and Variable-Rate Features

    Jar-Ferr YANG  Rong-San LIN  Chung-Rong HU  

     
    PAPER-Speech and Hearing

      Vol:
    E85-D No:6
      Page(s):
    1003-1014

    In this paper, we propose a simplified transform-based, variable-rate vocoder evolved from the code-excited linear prediction (CELP) coding structure. With pre-emphasis and de-emphasis filters, the transform-based CELP vocoder incorporates a long-term predictor, a discrete cosine transform (DCT), and pre-filters and post-filters to achieve perceptually weighted quantization. The proposed transform-based vocoder requires lower computational complexity at the cost of slightly lower quality than CELP coders. Furthermore, the proposed DCT-based coding structure, easily configured with additional DCT coefficients, can simultaneously offer low, middle, and high bit rates to adapt to the bandwidth variation of modern Internet or wireless communications.
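    The variable-rate property of a DCT-based structure follows from the transform itself: coding more or fewer coefficients directly trades rate for fidelity. Below is a naive DCT-II plus a coefficient-truncation sketch; this is illustrative only, not the paper's coder, and the frame values are hypothetical.

    ```python
    import math

    def dct2(x):
        """Naive (un-normalized) DCT-II of a frame of N samples."""
        N = len(x)
        return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                    for n in range(N)) for k in range(N)]

    def truncate_rate(coeffs, m):
        """Variable-rate sketch: keep the first m coefficients, zero the rest.
        Larger m means a higher bit rate and higher fidelity."""
        return coeffs[:m] + [0.0] * (len(coeffs) - m)

    # A constant frame has all its energy in the DC coefficient
    c = dct2([1.0, 1.0, 1.0, 1.0])
    ```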

  • A Pseudo Glottal Excitation Model for the Linear Prediction Vocoder with Speech Signals Coded at 1.6 kbps

    Hwai-Tsu HU  Fang-Jang KUO  Hsin-Jen WANG  

     
    PAPER-Speech and Hearing

      Vol:
    E83-D No:8
      Page(s):
    1654-1661

    This paper presents a pseudo glottal excitation model for linear prediction vocoders coding speech at 1.6 kbps. While unvoiced speech and silence intervals are processed with a stochastic codebook of 512 entries, a glottal codebook with 32 entries is used to describe the glottal phase characteristics of voiced excitation. The steps for formulating the pseudo glottal excitation over one pitch period are 1) applying a polynomial model to simulate the low-frequency constituent of the residual, 2) inserting a magnitude-adjustable pulse sequence to characterize the main excitation, and 3) introducing turbulent noise in series with the resulting excitation. Procedures are described for codebook construction as well as for analysis and synthesis of the pseudo glottal excitation. Results of a mean opinion score (MOS) test show that the quality produced by the proposed coder is almost as good as that of a 4.8 kbps CELP coder for male utterances, but the quality for female utterances remains somewhat inferior.
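    The three formulation steps above can be sketched directly: a polynomial low-frequency component, plus a magnitude-adjustable pulse, plus turbulent noise, summed over one pitch period. This is a minimal illustration under stated assumptions (a single pulse rather than a pulse sequence, uniform noise), not the paper's model; every parameter name below is hypothetical.

    ```python
    import random

    def pseudo_glottal_excitation(period, poly_coeffs, pulse_pos, pulse_amp,
                                  noise_gain, rng=random):
        """Build one pitch period of excitation from the three steps:
        polynomial low-frequency part, main pulse, turbulent noise."""
        exc = []
        for n in range(period):
            t = n / period
            low = sum(a * t ** i for i, a in enumerate(poly_coeffs))  # step 1
            pulse = pulse_amp if n == pulse_pos else 0.0              # step 2
            noise = noise_gain * (rng.random() - 0.5)                 # step 3
            exc.append(low + pulse + noise)
        return exc

    # With the noise gain set to zero the result is deterministic
    e = pseudo_glottal_excitation(4, [1.0], 2, 2.0, 0.0)
    ```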

  • The Skipping Technique: A Simple and Fast Algorithm to Find the Pitch in CELP Vocoder

    JooHun LEE  MyungJin BAE  Souguil ANN  

     
    PAPER-Digital Signal Processing

      Vol:
    E78-A No:11
      Page(s):
    1571-1575

    A fast pitch search algorithm using the skipping technique is proposed to reduce the computation time of the CELP vocoder. Based on the characteristics of the correlation function of the speech signal, the proposed algorithm skips over certain ranges of the full pitch search range in a simple way. Although the search range is reduced, high speech quality is maintained, since lags with high correlation values are not skipped and are still searched by closed-loop analysis. To improve the efficiency of the proposed method, we develop three variants of the skipping technique. The experimental results show that the proposed and modified algorithms reduce the pitch-search computation time considerably, achieving over a 60% reduction compared with the traditional full-search method.
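    The idea of skipping low-correlation lags can be sketched as a two-stage search: a cheap open-loop correlation over the full lag range, then retaining only the top-correlated lags for the expensive closed-loop analysis. This is an illustrative simplification of the general idea, not the paper's algorithm or any of its three variants; the signal and parameters are hypothetical.

    ```python
    def skipping_pitch_search(x, lag_min, lag_max, keep_ratio=0.25):
        """Return the lags kept for closed-loop search: open-loop
        correlation over all lags, then keep only the best-scoring ones."""
        def corr(lag):
            return sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        scores = {lag: corr(lag) for lag in range(lag_min, lag_max + 1)}
        n_keep = max(1, int(len(scores) * keep_ratio))
        kept = sorted(scores, key=scores.get, reverse=True)[:n_keep]
        return sorted(kept)

    # A pulse train with period 8: only lag 8 survives a 10% keep ratio
    x = [1.0, 0, 0, 0, 0, 0, 0, 0] * 4
    candidates = skipping_pitch_search(x, 2, 12, keep_ratio=0.1)
    ```

    The saving comes from running the costly closed-loop analysis only on `candidates` instead of on every lag in the full range.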