
Keyword Search Results

[Keyword] voice (140 hits)

Showing 21-40 of 140 hits

  • A Statistical Sample-Based Approach to GMM-Based Voice Conversion Using Tied-Covariance Acoustic Models

    Shinnosuke TAKAMICHI  Tomoki TODA  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Voice conversion

    Publicized: 2016/07/19
    Vol: E99-D No:10
    Page(s): 2490-2498

    This paper presents a novel statistical sample-based approach for Gaussian Mixture Model (GMM)-based Voice Conversion (VC). Although GMM-based VC offers the promising flexibility of model adaptation, the quality of the converted speech is significantly worse than that of natural speech. This paper addresses the problem of inaccurate modeling, one of the main causes of this quality degradation. Recently, we proposed statistical sample-based speech synthesis using rich context models for high-quality and flexible Hidden Markov Model (HMM)-based Text-To-Speech (TTS) synthesis. This method makes it possible not only to produce high-quality speech by introducing ideas from unit selection synthesis, but also to preserve the flexibility of the original HMM-based TTS. In this paper, we apply this idea to GMM-based VC. The rich context models are first trained for individual joint speech feature vectors, and then gathered mixture by mixture to form a Rich context-GMM (R-GMM). In conversion, an iterative generation algorithm using R-GMMs converts the speech parameters after initialization with over-trained probability distributions. Because the proposed method utilizes individual speech features and its formulation is the same as that of conventional GMM-based VC, it can produce high-quality speech while keeping the flexibility of the original GMM-based VC. The experimental results demonstrate that the proposed method yields significant improvements in terms of speech quality and speaker individuality of the converted speech.
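
    For reference, the conventional GMM-based mapping that the R-GMM approach refines is the minimum mean-square-error estimate E[y|x] under a joint-density GMM. The following numpy sketch shows that baseline conversion step only; shapes and names are assumptions, not the authors' code:

    ```python
    import numpy as np

    def jdgmm_convert(x, weights, mu_x, mu_y, S_xx, S_yx):
        """MMSE conversion E[y|x] under a joint-density GMM.

        x      : (Dx,)       source feature vector
        weights: (M,)        mixture weights
        mu_x   : (M, Dx)     source-side mixture means
        mu_y   : (M, Dy)     target-side mixture means
        S_xx   : (M, Dx, Dx) source-side covariances
        S_yx   : (M, Dy, Dx) cross covariances
        """
        M = len(weights)
        # Posterior P(m | x) from the source-side marginal of each mixture.
        log_post = np.empty(M)
        for m in range(M):
            d = x - mu_x[m]
            _, logdet = np.linalg.slogdet(S_xx[m])
            log_post[m] = (np.log(weights[m]) - 0.5 * logdet
                           - 0.5 * d @ np.linalg.solve(S_xx[m], d))
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        # Per-mixture conditional mean, combined by the posteriors.
        y = np.zeros(mu_y.shape[1])
        for m in range(M):
            cond = mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m])
            y += post[m] * cond
        return y
    ```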

  • DNN-Based Voice Activity Detection with Multi-Task Learning

    Tae Gyoon KANG  Nam Soo KIM  

     
    LETTER-Speech and Hearing

    Publicized: 2015/10/26
    Vol: E99-D No:2
    Page(s): 550-553

    Recently, notable improvements in voice activity detection (VAD) have been achieved by adopting several machine learning techniques. Among them, the deep neural network (DNN), which learns the mapping between noisy speech features and the corresponding voice activity status with its deep hidden structure, has been one of the most popular. In this letter, we propose a novel approach that enhances the robustness of the DNN in mismatched noise conditions through a multi-task learning (MTL) framework. In the proposed algorithm, a feature enhancement task for speech features is jointly trained with the conventional VAD task. The experimental results show that the DNN with the proposed framework outperforms the conventional DNN-based VAD algorithm.
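
    The joint training described here can be illustrated with a small multi-task network: a shared trunk feeding a VAD head and a feature-enhancement head, optimized with a combined loss. This PyTorch sketch assumes illustrative layer sizes and a task-weighting factor that the abstract does not specify:

    ```python
    import torch
    import torch.nn as nn

    class MtlVadNet(nn.Module):
        """Shared trunk with two heads: VAD decision and feature enhancement."""
        def __init__(self, dim_in=120, dim_hidden=512):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(dim_in, dim_hidden), nn.ReLU(),
                nn.Linear(dim_hidden, dim_hidden), nn.ReLU(),
            )
            self.vad_head = nn.Linear(dim_hidden, 1)       # speech/non-speech logit
            self.enh_head = nn.Linear(dim_hidden, dim_in)  # clean-feature estimate

        def forward(self, noisy):
            h = self.trunk(noisy)
            return self.vad_head(h).squeeze(-1), self.enh_head(h)

    net = MtlVadNet()
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()

    noisy = torch.randn(32, 120)                 # dummy noisy-feature batch
    label = torch.randint(0, 2, (32,)).float()   # dummy VAD labels
    clean = torch.randn(32, 120)                 # dummy clean-feature targets

    logit, enhanced = net(noisy)
    # 0.5 is an assumed weight on the auxiliary enhancement task.
    loss = bce(logit, label) + 0.5 * mse(enhanced, clean)
    opt.zero_grad(); loss.backward(); opt.step()
    ```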

  • An Effective Carrier Frequency and Phase Offset Tracking Scheme in the Case of Symbol Rate Sampling

    Yunhua LI  Bin TIAN  Ke-Chu YI  Quan YU  

     
    PAPER-Fundamental Theories for Communications

    Vol: E99-B No:2
    Page(s): 337-346

    In modern communication systems, it is a critical and challenging issue for existing carrier tracking techniques to achieve near-ideal carrier synchronization without the help of pilot signals in the case of symbol rate sampling and low signal-to-noise ratio (SNR). To overcome this issue, this paper proposes an effective carrier frequency and phase offset tracking scheme with a robust confluent synchronization architecture whose main components are a digital frequency-locked loop (FLL), a digital phase-locked loop (PLL), a modified symbol hard decision block, and several sampling rate conversion blocks. Even though the received signals are sampled at the symbol baud rate, this carrier tracking scheme is still able to obtain precise estimates of the carrier synchronization parameters at very low SNRs. The performance of the proposed scheme is evaluated by the Monte-Carlo method. Simulation results confirm the feasibility of this carrier tracking scheme and demonstrate that it enables both a rate-3/4 irregular low-density parity-check (LDPC) coded system and a military voice transmission system utilizing the direct sequence spread spectrum (DSSS) technique to achieve satisfactory bit-error rate (BER) performance at correspondingly low SNRs.
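
    The paper's confluent FLL+PLL architecture is more elaborate than can be shown here, but the core idea of tracking carrier offsets from symbol-rate samples can be illustrated with a decision-directed second-order PLL. A minimal numpy sketch, assuming QPSK symbols and loop gains derived from an assumed normalized bandwidth:

    ```python
    import numpy as np

    def dd_pll_qpsk(symbols, bn_ts=0.01, zeta=0.707):
        """Decision-directed second-order PLL over symbol-rate QPSK samples.

        symbols: complex ndarray sampled at one sample per symbol.
        Gains follow the standard loop-bandwidth approximation (unit detector
        gain); bandwidth and damping values here are illustrative.
        """
        wn = 2.0 * bn_ts / (zeta + 1.0 / (4.0 * zeta))
        kp = 2.0 * zeta * wn     # proportional gain
        ki = wn * wn             # integral gain
        phase, integ = 0.0, 0.0
        out = np.empty_like(symbols)
        for n, r in enumerate(symbols):
            y = r * np.exp(-1j * phase)    # de-rotate by current estimate
            # Hard decision on the nearest QPSK point.
            d = (np.sign(y.real) + 1j * np.sign(y.imag)) / np.sqrt(2)
            err = np.angle(y * np.conj(d))  # decision-directed phase error
            integ += ki * err               # integral path tracks frequency offset
            phase += kp * err + integ       # proportional + integral update
            out[n] = y
        return out
    ```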

  • Robust Voice Activity Detection Algorithm Based on Feature of Frequency Modulation of Harmonics and Its DSP Implementation

    Chung-Chien HSU  Kah-Meng CHEONG  Tai-Shih CHI  Yu TSAO  

     
    PAPER-Speech and Hearing

    Publicized: 2015/07/10
    Vol: E98-D No:10
    Page(s): 1808-1817

    This paper proposes a voice activity detection (VAD) algorithm based on an energy-related feature of the frequency modulation of harmonics. A multi-resolution spectro-temporal analysis framework, originally developed to extract texture features of an audio signal from its Fourier spectrogram, is used to extract frequency modulation features of the speech signal. The proposed algorithm labels the voice-active segments of the speech signal by comparing the energy-related feature of the frequency modulation of harmonics with a threshold. The proposed VAD is then implemented on a Texas Instruments (TI) digital signal processor (DSP) platform for real-time operation. Simulations conducted on the DSP platform demonstrate that the proposed VAD performs significantly better than three standard VADs, ITU-T G.729B, ETSI AMR1, and AMR2, in non-stationary noise in terms of receiver operating characteristic (ROC) curves and the recognition rates of a practical distributed speech recognition (DSR) system.

  • A Novel Iterative Speaker Model Alignment Method from Non-Parallel Speech for Voice Conversion

    Peng SONG  Wenming ZHENG  Xinran ZHANG  Yun JIN  Cheng ZHA  Minghai XIN  

     
    LETTER-Speech and Hearing

    Vol: E98-A No:10
    Page(s): 2178-2181

    Most current voice conversion methods rely on parallel speech, which is not easily obtained in practice. In this letter, a novel iterative speaker model alignment (ISMA) method is proposed to address this problem. First, the source and target speaker models are each trained from a background model by adopting the maximum a posteriori (MAP) algorithm. Then, a novel ISMA method is presented for the alignment and transformation of spectral features. Finally, the proposed ISMA approach is combined with a Gaussian mixture model (GMM) to improve the conversion performance. A series of objective and subjective experiments carried out on the CMU ARCTIC dataset demonstrate that the proposed method significantly outperforms the state-of-the-art approach.
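
    The MAP step that seeds the source and target speaker models from the background model typically follows the standard relevance-MAP update for GMM means. A minimal numpy sketch (the relevance factor and shapes are assumptions, not the letter's settings):

    ```python
    import numpy as np

    def map_adapt_means(ubm_means, posteriors, feats, r=16.0):
        """Relevance-MAP adaptation of GMM means from a background model (UBM).

        ubm_means : (M, D)  background-model mixture means
        posteriors: (T, M)  P(m | x_t) computed under the UBM
        feats     : (T, D)  adaptation frames of one speaker
        r         : relevance factor (16 is a common choice, assumed here)
        """
        n = posteriors.sum(axis=0)                                # soft counts
        ex = posteriors.T @ feats / np.maximum(n, 1e-8)[:, None]  # data means
        alpha = (n / (n + r))[:, None]                            # data/prior mix
        return alpha * ex + (1.0 - alpha) * ubm_means
    ```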

  • Similar Speaker Selection Technique Based on Distance Metric Learning Using Highly Correlated Acoustic Features with Perceptual Voice Quality Similarity

    Yusuke IJIMA  Hideyuki MIZUNO  

     
    PAPER-Speech and Hearing

    Publicized: 2014/10/15
    Vol: E98-D No:1
    Page(s): 157-165

    This paper analyzes the correlation between various acoustic features and perceptual voice quality similarity, and proposes a perceptually similar speaker selection technique based on distance metric learning. To analyze the relationship between acoustic features and voice quality similarity, we first conduct a large-scale subjective experiment using the voices of 62 female speakers, acquiring perceptual voice quality similarity scores between all pairs of speakers. Next, multiple linear regression analysis shows that four acoustic features are highly correlated with voice quality similarity. The proposed speaker selection technique first trains a transform matrix by distance metric learning using the perceptual voice quality similarity acquired in the subjective experiment. Given an input utterance, its acoustic features are transformed with the trained matrix, after which speaker selection is performed based on the Euclidean distance in the transformed acoustic feature space. We perform speaker selection experiments and evaluate the proposed technique against speaker selection without feature space transformation. The results indicate that transformation based on distance metric learning reduces the error rate by 53.9%.
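
    Once the transform matrix is trained, the selection step reduces to a nearest-neighbor search in the transformed space. A minimal sketch, assuming the transform matrix W is already given:

    ```python
    import numpy as np

    def select_similar_speaker(x, speaker_feats, W):
        """Pick the speaker whose transformed features are nearest to the input.

        x            : (D,)   acoustic feature vector of the input speech
        speaker_feats: (N, D) one feature vector per candidate speaker
        W            : (K, D) transform learned by distance metric learning
        """
        tx = W @ x
        dists = np.linalg.norm(speaker_feats @ W.T - tx, axis=1)
        return int(np.argmin(dists)), dists
    ```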

  • Cross-Dialectal Voice Conversion with Neural Networks

    Weixun GAO  Qiying CAO  Yao QIAN  

     
    PAPER-Speech and Hearing

    Vol: E97-D No:11
    Page(s): 2872-2880

    In this paper, we use neural networks (NNs) for cross-dialectal (Mandarin-Shanghainese) voice conversion using a bi-dialectal speaker's recordings. The system employs a nonlinear mapping function, trained on parallel Mandarin features of the source and target speakers, to convert the source speaker's Shanghainese features to those of the target speaker. This study investigates three training aspects: a) frequency warping, which is supposed to be language independent; b) pre-training, which drives the weights to a better starting point than random initialization and can be regarded as unsupervised feature learning; and c) sequence training, which minimizes sequence-level errors and matches the objectives used in training and conversion. Experimental results show that the performance of cross-dialectal voice conversion is close to that of intra-dialectal conversion. This benefit likely comes from the strong learning capabilities of NNs, e.g., exploiting feature correlations between the fundamental frequency (F0) and the spectrum. The objective measures, log spectral distortion (LSD) and the root mean squared error (RMSE) of F0, both show that pre-training and sequence training outperform frame-level mean square error (MSE) training. The naturalness of the converted Shanghainese speech and the similarity between the converted Shanghainese speech and the target Mandarin speech are significantly improved.

  • Adaptation of Acoustic Models in Joint Speaker and Noise Space Using Bilinear Models

    Yongwon JEONG  Hyung Soon KIM  

     
    LETTER-Speech and Hearing

    Vol: E97-D No:8
    Page(s): 2195-2199

    We present the adaptation of the acoustic models of hidden Markov models (HMMs) to the target speaker and noise environment using bilinear models. Acoustic models trained from various speakers and noise conditions are decomposed to build the bases that capture the interaction between the two factors. The model for the target speaker and noise is represented as a product of bases and two weight vectors. In experiments using the AURORA4 corpus, the bilinear model outperforms the linear model.
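
    The bilinear synthesis of an adapted model from a basis tensor and two weight vectors can be written as a single contraction. A sketch with illustrative dimensions (how the bases and weights are actually estimated is beyond this snippet):

    ```python
    import numpy as np

    # Basis tensor B (I speaker factors x J noise factors x D model dimensions),
    # built offline from models trained under various speakers and noises.
    # Sizes and the random data are illustrative assumptions.
    I, J, D = 10, 4, 39
    B = np.random.randn(I, J, D)

    a = np.random.randn(I)   # weight vector for the target speaker
    b = np.random.randn(J)   # weight vector for the target noise environment

    # Bilinear synthesis of adapted mean parameters: mu_d = sum_ij a_i b_j B_ijd
    mu = np.einsum('i,j,ijd->d', a, b, B)
    ```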

  • Voice Timbre Control Based on Perceived Age in Singing Voice Conversion

    Kazuhiro KOBAYASHI  Tomoki TODA  Hironori DOI  Tomoyasu NAKANO  Masataka GOTO  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Voice Conversion and Speech Enhancement

    Vol: E97-D No:6
    Page(s): 1419-1428

    The perceived age of a singing voice is the age of the singer as perceived by the listener, and is one of the notable characteristics that determine the perception of a song. In this paper, we describe an investigation of acoustic features that affect the perceived age, and a novel voice timbre control technique based on the perceived age for singing voice conversion (SVC). Singers can sing expressively by controlling prosody and voice timbre, but the variety of voices a singer can produce is limited by physical constraints. Previous work has attempted to overcome this limitation through statistical voice conversion, which makes it possible to convert the singing voice timbre of an arbitrary source singer into that of an arbitrary target singer. However, it is still difficult to intuitively control singing voice characteristics by manipulating parameters corresponding to specific physical traits, such as gender and age. In this paper, we first investigate the factors that play a part in the listener's perception of the singer's age. Then, we apply a multiple-regression Gaussian mixture model (MR-GMM) to SVC to control voice timbre based on the perceived age, and we propose SVC based on a modified MR-GMM for manipulating the perceived age while maintaining the singer's individuality. The experimental results show that 1) the perceived age of singing voices corresponds relatively well to the actual age of the singer, 2) prosodic features have a larger effect on the perceived age than spectral features, 3) the individuality of a singer is influenced more heavily by segmental features than by prosodic features, and 4) the proposed voice timbre control method makes it possible to change the singer's perceived age without adversely affecting the perceived individuality.

  • Voice Conversion Based on Speaker-Dependent Restricted Boltzmann Machines

    Toru NAKASHIKA  Tetsuya TAKIGUCHI  Yasuo ARIKI  

     
    PAPER-Voice Conversion and Speech Enhancement

    Vol: E97-D No:6
    Page(s): 1403-1410

    This paper presents a voice conversion technique using speaker-dependent Restricted Boltzmann Machines (RBMs) to build high-order eigenspaces of the source/target speakers, in which it is easier to convert the source speech to the target speech than in the traditional cepstrum space. We build a deep conversion architecture that concatenates the two speaker-dependent RBMs with neural networks, expecting them to automatically discover abstractions that express the original input features. Under this concept, if we train an RBM using only the speech of an individual speaker, which includes various phonemes while the speaker individuality stays unchanged, the output features of the hidden layer can be considered to carry fewer phonemes and relatively more speaker individuality than the original acoustic features. Training the RBMs for a source speaker and a target speaker, we can then connect and convert the speaker individuality abstractions using neural networks (NNs). The converted abstraction of the source speaker is then back-propagated into the acoustic space (e.g., MFCC) using the RBM of the target speaker. We conducted speaker voice conversion experiments and confirmed the efficacy of our method with respect to subjective and objective criteria, comparing it with the conventional Gaussian Mixture Model-based method and an ordinary NN.

  • Noise-Robust Voice Conversion Based on Sparse Spectral Mapping Using Non-negative Matrix Factorization

    Ryo AIHARA  Ryoichi TAKASHIMA  Tetsuya TAKIGUCHI  Yasuo ARIKI  

     
    PAPER-Voice Conversion and Speech Enhancement

    Vol: E97-D No:6
    Page(s): 1411-1418

    This paper presents a voice conversion (VC) technique for noisy environments based on a sparse representation of speech. Sparse representation-based VC using non-negative matrix factorization (NMF) is employed for noise-added spectral conversion between different speakers. In our previous exemplar-based VC method, source and target exemplars are extracted from parallel training data consisting of the same texts uttered by the source and target speakers. The input source signal is represented using the source exemplars and their weights, and the converted speech is then constructed from the target exemplars and the weights related to the source exemplars. However, this exemplar-based approach must hold all training exemplars (frames) and requires long computation times to obtain the weights of the source exemplars. In this paper, we propose a framework to train basis matrices for the source and target exemplars such that they share a common weight matrix. By using the basis matrices instead of the exemplars, the VC is performed with lower computation times than the exemplar-based method. The effectiveness of this method was confirmed in speaker conversion experiments using noise-added speech data, comparing it with an exemplar-based method and a conventional Gaussian mixture model (GMM)-based method.
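
    The conversion step with pre-trained coupled bases can be sketched as follows: activations are estimated against the source basis with NMF multiplicative updates, and the converted magnitudes are rebuilt from the target basis using the shared activations. The Euclidean-distance update rule and iteration count here are assumptions, not necessarily the paper's choices:

    ```python
    import numpy as np

    def nmf_activations(X, W, n_iter=100, eps=1e-9):
        """Estimate activations H >= 0 with basis W fixed
        (Euclidean multiplicative updates)."""
        H = np.random.rand(W.shape[1], X.shape[1])
        for _ in range(n_iter):
            H *= (W.T @ X) / (W.T @ W @ H + eps)
        return H

    def convert(X_src, W_src, W_tgt):
        """Spectral conversion: activations from the source basis, magnitudes
        rebuilt from the coupled target basis (shared weight matrix)."""
        H = nmf_activations(X_src, W_src)
        return W_tgt @ H
    ```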

  • A Hybrid Approach to Electrolaryngeal Speech Enhancement Based on Noise Reduction and Statistical Excitation Generation

    Kou TANAKA  Tomoki TODA  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Voice Conversion and Speech Enhancement

    Vol: E97-D No:6
    Page(s): 1429-1437

    This paper presents an electrolaryngeal (EL) speech enhancement method capable of significantly improving the naturalness of EL speech without degrading its intelligibility. An electrolarynx is an external device that artificially generates excitation sounds to enable laryngectomees to produce EL speech. Although proficient laryngectomees can produce quite intelligible EL speech, it sounds very unnatural due to the mechanical excitation produced by the device. Moreover, the excitation sounds produced by the device often leak outside, adding noise to the EL speech. To address these issues, there are two main conventional approaches to EL speech enhancement: noise reduction and statistical voice conversion (VC). The former usually causes no degradation in intelligibility but yields only small improvements in naturalness, as the mechanical excitation sounds remain essentially unchanged. The latter significantly improves the naturalness of EL speech using spectral and excitation parameters of natural voices converted from the acoustic parameters of EL speech, but usually degrades intelligibility owing to conversion errors. We propose a hybrid approach that uses noise reduction to enhance the spectral parameters and statistical voice conversion to predict the excitation parameters. Moreover, we modify the prediction process of the excitation parameters to improve its accuracy and reduce the adverse effects of unvoiced/voiced prediction errors. The experimental results demonstrate that the proposed method yields significant improvements in naturalness over EL speech while keeping intelligibility high.

  • Mapping Articulatory-Features to Vocal-Tract Parameters for Voice Conversion

    Narpendyah Wisjnu ARIWARDHANI  Masashi KIMURA  Yurie IRIBE  Kouichi KATSURADA  Tsuneo NITTA  

     
    PAPER-Speech and Hearing

    Vol: E97-D No:4
    Page(s): 911-918

    In this paper, we propose voice conversion (VC) based on mapping articulatory features (AF) to vocal-tract parameters (VTP). An artificial neural network (ANN) is applied to map AF to VTP and to convert a speaker's voice to a target speaker's voice. The proposed system is not only text-independent, requiring no parallel utterances between the source and target speakers, but can also be used with an arbitrary source speaker; that is, our approach does not require source-speaker data to build the VC model. We also focus on the case of a small amount of target-speaker training data. For comparison, a baseline system based on the Gaussian mixture model (GMM) approach is evaluated. The experimental results with a small amount of training data show that the converted voice of our approach is intelligible and carries the speaker individuality of the target speaker.

  • Efficient Implementation of Statistical Model-Based Voice Activity Detection Using Taylor Series Approximation

    Chungsoo LIM  Soojeong LEE  Jae-Hun CHOI  Joon-Hyuk CHANG  

     
    LETTER-Digital Signal Processing

    Vol: E97-A No:3
    Page(s): 865-868

    In this letter, we propose a simple but effective technique that improves statistical model-based voice activity detection (VAD) by both reducing computational complexity and increasing detection accuracy. The improvements are made by applying Taylor series approximations to the exponential and logarithmic functions in the VAD algorithm based on an in-depth analysis of the algorithm. Experiments performed on a smartphone as well as on a desktop computer with various background noises confirm the effectiveness of the proposed technique.
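
    The core of the technique is replacing calls to exp and log with low-order truncated Taylor expansions over the ranges in which the VAD evaluates them. A minimal numpy sketch; the expansion points and orders here are illustrative choices, not the letter's:

    ```python
    import numpy as np

    def exp_taylor(x, order=4):
        """Truncated Maclaurin series for exp(x); accurate for small |x|."""
        term = np.ones_like(x, dtype=float)
        acc = np.ones_like(x, dtype=float)
        for k in range(1, order + 1):
            term = term * x / k
            acc += term
        return acc

    def log_taylor(x, order=4):
        """Truncated series for log(x) about x = 1:
        sum_k (-1)^(k+1) (x-1)^k / k."""
        u = x - 1.0
        acc = np.zeros_like(u, dtype=float)
        term = np.ones_like(u, dtype=float)
        for k in range(1, order + 1):
            term = term * u
            acc += ((-1.0) ** (k + 1)) * term / k
        return acc

    x = np.linspace(0.5, 1.5, 5)
    print(np.max(np.abs(exp_taylor(x - 1.0) - np.exp(x - 1.0))))  # small near 0
    print(np.max(np.abs(log_taylor(x) - np.log(x))))              # small near 1
    ```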

  • Voice Activity Detection Based on Generalized Normal-Laplace Distribution Incorporating Conditional MAP

    Ji-Hyun SONG  Sangmin LEE  

     
    LETTER-Speech and Hearing

    Vol: E96-D No:12
    Page(s): 2888-2891

    In this paper, we propose a novel voice activity detection (VAD) algorithm based on the generalized normal-Laplace (GNL) distribution to provide enhanced performance in adverse noise environments. Specifically, the probability density function (PDF) of a noisy speech signal is represented by the GNL distribution, and the variances of the speech and noise components of the GNL distribution are estimated using higher-order moments. After an in-depth analysis of the estimated variances, a feature that is useful for discrimination between speech and noise at low SNRs is derived and compared with a threshold to detect speech activity. To exploit the inter-frame correlation of speech activity, the result from the previous frame is employed in the decision rule of the proposed VAD algorithm. The performance of the proposed algorithm is evaluated in terms of receiver operating characteristics (ROC) and detection accuracy. The results show that the proposed method yields better results than conventional VAD algorithms.
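
    The inter-frame correlation can be exploited by letting the current decision depend on the previous frame's result, e.g. through a state-dependent threshold. A deliberately simplified sketch; the letter's conditional-MAP rule is probabilistic, and the threshold handling here is an assumption:

    ```python
    def vad_conditional(features, thr_after_speech, thr_after_noise):
        """Frame-wise VAD where the threshold depends on the previous decision.

        A lower threshold after a speech frame mimics the conditional-MAP idea
        of exploiting inter-frame correlation of speech activity.
        """
        decisions, prev = [], 0
        for f in features:
            thr = thr_after_speech if prev == 1 else thr_after_noise
            prev = 1 if f > thr else 0
            decisions.append(prev)
        return decisions
    ```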

  • A Robust Speech Communication into Smart Info-Media System

    Yoshikazu MIYANAGA  Wataru TAKAHASHI  Shingo YOSHIZAWA  

     
    INVITED PAPER

    Vol: E96-A No:11
    Page(s): 2074-2080

    This paper introduces our noise-robust speech communication techniques and describes their implementation in a smart info-media system, i.e., a small robot. The speech communication system consists of automatic speech detection, recognition, and rejection. Using automatic speech detection and recognition, an observed speech waveform can be recognized without a manual trigger, and with speech rejection the system accepts only registered speech phrases and rejects all other words. In other words, although an arbitrary input speech waveform can be fed into the system and recognized, the system responds only to the registered phrases. The developed noise-robust speech processing can reduce various noises in many environments. In addition to the design of noise-robust speech recognition, the LSI design of the system is introduced: by designing an application-specific IC (ASIC) for speech recognition, we simultaneously realize low power consumption and real-time processing. This paper describes the LSI architecture of the system and its performance in field experiments. The system achieves speech recognition accuracy of 85-99% under 0-20 dB SNR and echo environments.

  • Exemplar-Based Voice Conversion Using Sparse Representation in Noisy Environments

    Ryoichi TAKASHIMA  Tetsuya TAKIGUCHI  Yasuo ARIKI  

     
    PAPER

    Vol: E96-A No:10
    Page(s): 1946-1953

    This paper presents a voice conversion (VC) technique for noisy environments in which parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal. The parallel exemplars (dictionary) consist of source and target exemplars having the same texts uttered by the source and target speakers. The input source signal is decomposed into the source exemplars, noise exemplars, and their weights (activities). Then, using the weights of the source exemplars, the converted signal is constructed from the target exemplars. We carried out speaker conversion tasks using clean speech data and noise-added speech data. The effectiveness of this method was confirmed by comparing it with a conventional Gaussian Mixture Model (GMM)-based method.

  • Horizontal Spectral Entropy with Long-Span of Time for Robust Voice Activity Detection

    Kun-Ching WANG  

     
    LETTER-Speech and Hearing

    Vol: E96-D No:9
    Page(s): 2156-2161

    This letter introduces an innovative VAD algorithm based on horizontal spectral entropy with long-span of time (HSELT) feature sets to improve mobile ASR performance in low signal-to-noise ratio (SNR) conditions. Since the characteristics of nonstationary noise change with time, long-term information about the noisy speech signal is needed to define a more robust decision rule that yields high accuracy. We find that HSELT measures can horizontally enhance the transition between speech and non-speech segments. Based on this finding, we use the HSELT measures to achieve high accuracy in detecting the speech signal in various stationary and nonstationary noises.
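
    A horizontal (along-time) entropy over a long-term window can be computed per frequency band as below. This is an illustrative variant only; the exact HSELT definition and window length are those of the letter:

    ```python
    import numpy as np

    def horizontal_entropy(spec, span=20, eps=1e-12):
        """Entropy of each frequency band along time, over a sliding span.

        spec: (F, T) magnitude spectrogram. Returns (F, T - span + 1).
        """
        F, T = spec.shape
        out = np.empty((F, T - span + 1))
        for t in range(T - span + 1):
            win = spec[:, t:t + span]
            p = win / (win.sum(axis=1, keepdims=True) + eps)  # per-band probs
            out[:, t] = -(p * np.log(p + eps)).sum(axis=1)    # horizontal entropy
        return out

    # A simple detector could average the band entropies per frame and compare
    # the result with an adaptive threshold.
    ```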

  • Speaker Adaptation in Sparse Subspace of Acoustic Models

    Yongwon JEONG  

     
    LETTER-Speech and Hearing

    Vol: E96-D No:6
    Page(s): 1402-1405

    I propose an acoustic model adaptation method using bases constructed through the sparse principal component analysis (SPCA) of acoustic models trained in a clean environment. I perform experiments on adaptation to a new speaker and noise. The SPCA-based method outperforms the PCA-based method in the presence of babble noise.
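
    A sketch of the basis construction with scikit-learn's SparsePCA, followed by a least-squares adaptation in the sparse subspace. The data, dimensions, and adaptation statistics below are stand-ins, not the letter's setup:

    ```python
    import numpy as np
    from sklearn.decomposition import SparsePCA

    # Rows: mean supervectors of acoustic models trained from many clean
    # speakers (dummy data; dimensions are illustrative assumptions).
    models = np.random.randn(50, 200)
    mean = models.mean(axis=0)

    spca = SparsePCA(n_components=10, alpha=1.0)
    spca.fit(models - mean)
    bases = spca.components_          # sparse basis vectors, shape (10, 200)

    # Adapt: least-squares weights from the new condition's statistics, then
    # rebuild the adapted model as mean + weighted sum of the sparse bases.
    target = np.random.randn(200)     # stand-in for accumulated adaptation stats
    w, *_ = np.linalg.lstsq(bases.T, target - mean, rcond=None)
    adapted = mean + bases.T @ w
    ```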

  • A Novel Approach Based on Adaptive Long-Term Sub-Band Entropy and Multi-Thresholding Scheme for Detecting Speech Signal

    Kun-Ching WANG  

     
    LETTER-Speech and Hearing

    Vol: E95-D No:11
    Page(s): 2732-2736

    The conventional entropy measure is derived from the full band (0 Hz to 4 kHz); however, it cannot clearly describe the spectral variability during voice activity. Here we propose a novel adaptive long-term sub-band entropy (ALT-SubEnpy) measure and combine it with a multi-thresholding scheme for voice activity detection. In detail, the ALT-SubEnpy measure is composed of four sub-entropy parts, each computed with a different long-term spectral window length. Accordingly, the proposed ALT-SubEnpy-based algorithm recursively updates four adaptive thresholds, one for each part. The proposed ALT-SubEnpy-based VAD method is shown to be effective under variable noise-level conditions.
