IEICE global.ieice.org Site

Keyword Search Result

[Keyword] voice(140hit)

41-60hit(140hit)

Voice Activity Detection Using Global Speech Absence Probability Based on Teager Energy for Speech Enhancement
Yun-Sik PARK Sangmin LEE

LETTER-Speech and Hearing

Vol: E95-D No: 10 Page(s): 2568-2571
In this paper, we propose a novel voice activity detection (VAD) algorithm using global speech absence probability (GSAP) based on Teager energy (TE) for speech enhancement. The proposed method provides a better representation of GSAP, resulting in improved decision performance for speech and noise segments by the use of a TE operator which is employed to suppress the influence of noise signals. The performance of our approach is evaluated by objective tests under various environments, and it is found that the suggested method yields better results than conventional schemes.
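For reference, the Teager energy (TE) operator named in the abstract above has a standard discrete form, psi[n] = x[n]^2 - x[n-1] * x[n+1]. The sketch below shows only that operator. It is not the authors' GSAP-based VAD, and the function name teager_energy is an illustrative choice.

```python
import math

def teager_energy(x):
    """Discrete Teager energy for the interior samples of x.

    psi[n] = x[n]^2 - x[n-1] * x[n+1], for n = 1 .. len(x) - 2.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# For a pure tone A*cos(omega*n), the operator returns the constant
# A^2 * sin(omega)^2, so it tracks both amplitude and frequency.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(64)]
te = teager_energy(tone)
expected = (amplitude * math.sin(omega)) ** 2
```

Because the operator uses only three neighboring samples, it responds quickly to changes in the signal, which is one reason it is often used as a front end for energy-based detectors.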
Voice-Activity Detection Using Long-Term Sub-Band Entropy Measure
Kun-Ching WANG

LETTER-Engineering Acoustics

Vol: E95-A No: 9 Page(s): 1606-1609
A novel long-term sub-band entropy (LT-SubEntropy) measure, which uses improved long-term spectral analysis and sub-band entropy, is proposed for voice activity detection (VAD). Based on this measure, we can accurately exploit the inherent nature of the formant structure on the speech spectrogram (well known as the voiceprint). Results show that the proposed VAD is superior to existing standard VAD methods at low SNR levels, especially under variable-level noise.
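As a rough illustration of the sub-band entropy idea mentioned above, the following sketch computes the Shannon entropy of a frame's normalized sub-band energies. It is a generic spectral entropy, not the paper's LT-SubEntropy measure (the long-term analysis is not described in this abstract), and the name subband_entropy and the example frames are assumptions.

```python
import math

def subband_entropy(band_energies):
    """Shannon entropy (in nats) of the normalized sub-band energy distribution."""
    total = sum(band_energies)
    if total <= 0.0:
        return 0.0  # silent frame: no distribution to measure
    probs = [e / total for e in band_energies if e > 0.0]
    return -sum(p * math.log(p) for p in probs)

flat_frame = [1.0, 1.0, 1.0, 1.0]    # noise-like: energy spread evenly
peaky_frame = [0.0, 8.0, 0.0, 0.0]   # energy concentrated in one band

h_flat = subband_entropy(flat_frame)    # maximal for 4 bands: log(4)
h_peaky = subband_entropy(peaky_frame)  # minimal: 0
```

Frames with pronounced spectral structure, such as formants, give lower entropy than flat, noise-like frames, and that contrast is what an entropy-based detector thresholds.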
Spectral Features for Perceptually Natural Phoneme Replacement by Another Speaker's Speech
Reiko TAKOU Hiroyuki SEGI Tohru TAKAGI Nobumasa SEIYAMA

PAPER-Speech and Hearing

Vol: E95-A No: 4 Page(s): 751-759
The frequency regions and spectral features that can be used to measure the perceived similarity and continuity of voice quality are reported here. A perceptual evaluation test was conducted to assess the naturalness of spoken sentences in which either a vowel or a long vowel of the original speaker was replaced by that of another. Correlation analysis between the evaluation score and the spectral feature distance was conducted to select the spectral features that were expected to be effective in measuring the voice quality and to identify the appropriate speech segment of another speaker. The mel-frequency cepstrum coefficient (MFCC) and the spectral center of gravity (COG) in the low-, middle-, and high-frequency regions were selected. A perceptual paired comparison test was carried out to confirm the effectiveness of the spectral features. The results showed that the MFCC was effective for spectra across a wide range of frequency regions, the COG was effective in the low- and high-frequency regions, and the effective spectral features differed among the original speakers.
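The spectral center of gravity (COG) used above is commonly defined as the power-weighted mean frequency within a band. The sketch below assumes that definition. The band edges and the name spectral_cog are hypothetical, since the abstract does not state the boundaries of the low-, middle-, and high-frequency regions.

```python
def spectral_cog(freqs_hz, power, f_lo, f_hi):
    """Power-weighted mean frequency over bins with f_lo <= f < f_hi.

    Returns None when the band contains no power.
    """
    num = 0.0
    den = 0.0
    for f, p in zip(freqs_hz, power):
        if f_lo <= f < f_hi:
            num += f * p
            den += p
    if den == 0.0:
        return None
    return num / den

# Toy spectrum: four bins with their power values.
freqs = [100.0, 200.0, 300.0, 400.0]
power = [1.0, 1.0, 2.0, 0.0]
cog_low = spectral_cog(freqs, power, 0.0, 350.0)  # (100 + 200 + 600) / 4
```

A distance between two speakers' COG values in the same band is then a single number, which makes it straightforward to correlate with perceptual scores.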
Tense-Lax Vowel Classification with Energy Trajectory and Voice Quality Measurements
Suk-Myung LEE Jeung-Yoon CHOI

LETTER-Speech and Hearing

Vol: E95-D No: 3 Page(s): 884-887
This work examines energy trajectory and voice quality measurements, in addition to conventional formant and duration properties, to classify tense and lax vowels in English. Tense and lax vowels are produced with differing articulatory configurations which can be identified by measuring acoustic cues such as energy peak location, energy convexity, open quotient and spectral tilt. An analysis of variance (ANOVA) is conducted, and dialect effects are observed. An overall 85.2% classification rate is obtained using the proposed features on the TIMIT database, resulting in improvement over using only conventional acoustic features. Adding the proposed features to widely used cepstral features also results in improved classification.
GTS Allocation Scheme for Bidirectional Voice Traffic in IEEE 802.15.4 Multihop Networks
Junwoo JUNG Hoki BAEK Jaesung LIM

PAPER-Wireless Communication Technologies

Vol: E95-B No: 2 Page(s): 493-508
The IEEE 802.15.4 protocol is considered a promising technology for low-cost low-power wireless personal area networks. Researchers have discussed the feasibility of voice communications over IEEE 802.15.4 networks. To this end, the personal area network (PAN) coordinator allocates guaranteed time slots (GTSs) for voice communications in the beacon-enabled mode of IEEE 802.15.4. Although IEEE 802.15.4 is capable of supporting voice communications by GTS allocation, it is impossible to accommodate voice transmission beyond two hops due to the excessive transmission delay. In this paper, we propose a GTS allocation scheme for bidirectional voice traffic in IEEE 802.15.4 multihop networks. The goal of our proposed scheme is to achieve low end-to-end delay and packet drop ratio without a complex allocation algorithm. Thus, the proposed scheme allocates GTSs to devices for successful completion of voice transmission in a superframe duration. The proposed scheme also considers transceiver switching delay. This is relatively large compared to a time slot due to the low-cost and low-gain antenna designs. We analyze and validate the proposed scheme in terms of average end-to-end delay and packet drop ratio. Our scheme has lower end-to-end delay and packet drop ratio than the basic IEEE 802.15.4 GTS allocation scheme.
Numerical Simulation of Air Flow through Glottis during Very Weak Whisper Sound Production
Makoto OTANI Tatsuya HIRAHARA

PAPER-Speech and Hearing

Vol: E94-A No: 9 Page(s): 1779-1785
A non-audible murmur (NAM), a very weak whisper sound produced without vocal fold vibration, has been researched in the development of a silent-speech communication tool for functional speech disorders as well as human-to-human/machine interfaces with inaudible voice input. The NAM can be detected using a specially designed microphone, called a NAM microphone, attached to the neck. However, the detected NAM signal has a low signal-to-noise ratio and a severely suppressed high-frequency component. To improve NAM clarity, the mechanism of NAM production must be clarified. In this work, air flow through the glottis in the vocal tract was numerically simulated using computational fluid dynamics and vocal tract shape models obtained by magnetic resonance imaging (MRI) scans of whispered voice production at various strengths, i.e. strong, weak, and very weak. For very weak whispering during the MRI scan, subjects were trained, just before the scanning, to produce the very weak whispered voice, or the NAM. The numerical results show that a weak vorticity flow occurs in the supraglottal region even during very weak whisper production; such vorticity flow provides aeroacoustic sources for very weak whispering, i.e. NAM, as in ordinary whispering.
Optimal Selection Criterion of the Modulation and Coding Scheme in Consideration of the Signaling Overhead of Mobile WiMAX Systems
Jaewoo SO

LETTER-Wireless Communication Technologies

Vol: E94-B No: 7 Page(s): 2153-2157
An optimal selection criterion of the modulation and coding scheme (MCS) for maximizing spectral efficiency is proposed in consideration of the signaling overhead of mobile WiMAX systems with a hybrid automatic repeat request mechanism. A base station informs users about the resource assignments in each frame, and the allocation process generates a substantial signaling overhead, which influences the system throughput. However, the signaling overhead was ignored in previous MCS selection criteria. In this letter, the spectral efficiency is estimated on the basis of the signaling overhead and the number of transmissions. The performance of the proposed MCS selection criterion is evaluated in terms of the spectral efficiency in the mobile WiMAX system, with and without persistent allocation.
Perceptual-Based Playout Mechanisms for Multi-Stream Voice over IP Networks
Chun-Feng WU Wen-Whei CHANG Yuan-Chuan CHIANG

PAPER-Information Network

Vol: E94-D No: 5 Page(s): 1018-1025
Packet loss and delay are the major network impairments for transporting real-time voice over IP networks. In the proposed system, multiple descriptions of the speech are used to take advantage of the packet path diversity. A new objective method is presented for predicting the perceived quality of multi-stream voice transmission. Also proposed is a multi-stream playout buffer algorithm, together with an adaptive parameter adjustment scheme, that maximizes the perceived speech quality via delay-loss trading. Experimental results showed that, compared to FEC-protected single-path transmission, the proposed multi-stream transmission scheme achieves significant reductions in delay and packet loss rates as well as improved speech quality.
HMM-Based Voice Conversion Using Quantized F0 Context
Takashi NOSE Yuhei OTA Takao KOBAYASHI

PAPER-Voice Conversion

Vol: E93-D No: 9 Page(s): 2483-2490
We propose a segment-based voice conversion technique using hidden Markov model (HMM)-based speech synthesis with nonparallel training data. In the proposed technique, the phoneme information with durations and a quantized F0 contour are extracted from the input speech of a source speaker, and are transmitted to a synthesis part. In the synthesis part, the quantized F0 symbols are used as prosodic context. A phonetically and prosodically context-dependent label sequence is generated from the transmitted phoneme and the F0 symbols. Then, converted speech is generated from the label sequence with durations using the target speaker's pre-trained context-dependent HMMs. In the model training, the models of the source and target speakers can be trained separately, hence there is no need to prepare parallel speech data of the source and target speakers. Objective and subjective experimental results show that the segment-based voice conversion with phonetic and prosodic contexts works effectively even if the parallel speech data is not available.
Improvements of the One-to-Many Eigenvoice Conversion System
Yamato OHTANI Tomoki TODA Hiroshi SARUWATARI Kiyohiro SHIKANO

PAPER-Voice Conversion

Vol: E93-D No: 9 Page(s): 2491-2499
We have developed a one-to-many eigenvoice conversion (EVC) system that allows us to convert a single source speaker's voice into an arbitrary target speaker's voice using an eigenvoice Gaussian mixture model (EV-GMM). This system is capable of effectively building a conversion model for an arbitrary target speaker by adapting the EV-GMM using only a small amount of speech data uttered by the target speaker in a text-independent manner. However, the conversion performance is still insufficient for the following reasons: 1) the excitation signal is not precisely modeled; 2) the oversmoothing of the converted spectrum causes muffled sounds in converted speech; and 3) the conversion model is affected by redundant acoustic variations among a lot of pre-stored target speakers used for building the EV-GMM. In order to address these problems, we apply the following promising techniques to one-to-many EVC: 1) mixed excitation; 2) a conversion algorithm considering global variance; and 3) adaptive training of the EV-GMM. The experimental results demonstrate that the conversion performance of one-to-many EVC is significantly improved by integrating all of these techniques into the one-to-many EVC system.
Esophageal Speech Enhancement Based on Statistical Voice Conversion with Gaussian Mixture Models
Hironori DOI Keigo NAKAMURA Tomoki TODA Hiroshi SARUWATARI Kiyohiro SHIKANO

PAPER-Voice Conversion

Vol: E93-D No: 9 Page(s): 2472-2482
This paper presents a novel method of enhancing esophageal speech using statistical voice conversion. Esophageal speech is one of the alternative speaking methods for laryngectomees. Although it doesn't require any external devices, generated voices usually sound unnatural compared with normal speech. To improve the intelligibility and naturalness of esophageal speech, we propose a voice conversion method from esophageal speech into normal speech. A spectral parameter and excitation parameters of target normal speech are separately estimated from a spectral parameter of the esophageal speech based on Gaussian mixture models. The experimental results demonstrate that the proposed method yields significant improvements in intelligibility and naturalness. We also apply one-to-many eigenvoice conversion to esophageal speech enhancement to make it possible to flexibly control the voice quality of enhanced speech.
Evaluation of Extremely Small Sound Source Signals Used in Speaking-Aid System with Statistical Voice Conversion
Keigo NAKAMURA Tomoki TODA Hiroshi SARUWATARI Kiyohiro SHIKANO

PAPER-Rehabilitation Engineering and Assistive Technology

Vol: E93-D No: 7 Page(s): 1909-1917
We have so far proposed a speaking-aid system for laryngectomees using a statistical voice conversion technique. In the proposed system, artificial speech articulated with extremely small sound source signals is detected with a Non-Audible Murmur (NAM) microphone, and then, the detected artificial speech is converted into more natural voice in a probabilistic manner. Although this system basically allows laryngectomees to speak while keeping the external source signals silent, it is still questionable how much these new sound source signals affect the converted speech quality. In this paper, we investigate the impact of various sound source signals on voice conversion accuracy. Various small sound source signals are designed by changing the spectral envelope and the waveform power independently. We conduct objective and subjective evaluations. The results of these experimental evaluations demonstrate that voice conversion accepts 1) various sound source signals with different spectral envelopes and 2) large degree of power of the sound source signals unless the power of speaking parts is almost equal to that of silence parts. Moreover, we also investigate the effectiveness of enhancing auditory feedback during speaking with the extremely small sound source signals.
Adaptive Training for Voice Conversion Based on Eigenvoices
Yamato OHTANI Tomoki TODA Hiroshi SARUWATARI Kiyohiro SHIKANO

PAPER-Speech and Hearing

Vol: E93-D No: 6 Page(s): 1589-1598
In this paper, we describe a novel model training method for one-to-many eigenvoice conversion (EVC). One-to-many EVC is a technique for converting a specific source speaker's voice into an arbitrary target speaker's voice. An eigenvoice Gaussian mixture model (EV-GMM) is trained in advance using multiple parallel data sets consisting of utterance-pairs of the source speaker and many pre-stored target speakers. The EV-GMM can be adapted to new target speakers using only a few of their arbitrary utterances by estimating a small number of adaptive parameters. In the adaptation process, several parameters of the EV-GMM to be fixed for different target speakers strongly affect the conversion performance of the adapted model. In order to improve the conversion performance in one-to-many EVC, we propose an adaptive training method of the EV-GMM. In the proposed training method, both the fixed parameters and the adaptive parameters are optimized by maximizing a total likelihood function of the EV-GMMs adapted to individual pre-stored target speakers. We conducted objective and subjective evaluations to demonstrate the effectiveness of the proposed training method. The experimental results show that the proposed adaptive training yields significant quality improvements in the converted speech.
A Robust Room Inverse Filtering Algorithm for Speech Dereverberation Based on a Kurtosis Maximization
Jae-woong JEONG Young-cheol PARK Dae-hee YOUN Seok-Pil LEE

LETTER-Speech and Hearing

Vol: E93-D No: 5 Page(s): 1309-1312
In this paper, we propose a robust room inverse filtering algorithm for speech dereverberation based on a kurtosis maximization. The proposed algorithm utilizes a new normalized kurtosis function that nonlinearly maps the input kurtosis onto a finite range from zero to one, which results in a kurtosis warping. Due to the kurtosis warping, the proposed algorithm provides more stable convergence and, in turn, better performance than the conventional algorithm. Experimental results are presented to confirm the robustness of the proposed algorithm.
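To make the kurtosis warping idea concrete, the sketch below computes the sample kurtosis m4 / m2^2 and squashes it into the range from zero to one. The paper's normalized kurtosis function is not given in this abstract, so the mapping 1 - 1/k used here is a hypothetical stand-in chosen only because it is bounded and monotonic. The function names are likewise assumptions.

```python
def kurtosis(x):
    """Sample kurtosis m4 / m2^2. Assumes x is not constant (m2 > 0)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 * m2)

def warped_kurtosis(x):
    """Hypothetical bounded warping: maps kurtosis k >= 1 into [0, 1).

    This is an illustrative stand-in, not the function from the paper.
    """
    return 1.0 - 1.0 / kurtosis(x)

flat = [-1.0, 1.0, -1.0, 1.0]                       # kurtosis 1
spiky = [0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, -2.0]   # kurtosis 4
w_flat = warped_kurtosis(flat)
w_spiky = warped_kurtosis(spiky)
```

A bounded objective of this kind keeps very large kurtosis values from dominating an adaptive update, which is consistent with the more stable convergence the abstract reports for its own warping.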
An Adaptive Wavelet-Based Denoising Algorithm for Enhancing Speech in Non-stationary Noise Environment
Kun-Ching WANG

PAPER-Speech and Hearing

Vol: E93-D No: 2 Page(s): 341-349
Traditional wavelet-based speech enhancement algorithms are ineffective in the presence of highly non-stationary noise because it is difficult to estimate the local noise spectrum accurately. In this paper, a simple noise estimation method that employs a voice activity detector (VAD) is proposed. The output of a wavelet-based speech enhancement algorithm in the presence of random noise bursts can be improved according to the VAD decision. The noisy speech is first preprocessed using bark-scale wavelet packet decomposition (BSWPD) to convert the noisy signal into wavelet coefficients (WCs). It is found that the VAD using the bark-scale spectral entropy parameter, called BS-Entropy, is superior to other energy-based approaches, especially under variable noise levels. The wavelet coefficient threshold (WCT) of each subband is then temporally adjusted according to the VAD result. In a speech-dominated frame, the speech is categorized as either a voiced frame or an unvoiced frame. A voiced frame possesses a strong tone-like spectrum in the lower subbands, so the WCs of the lower bands must be preserved. In contrast, the WCT tends to increase in the lower bands if the speech is categorized as unvoiced. In a noise-dominated frame, the background noise can be almost completely removed by increasing the WCT. Objective and subjective experimental results are then used to evaluate the proposed system. The experiments show that the algorithm is valid under various noise conditions, especially colored noise and non-stationary noise.
Voice Communications over 802.11 Ad Hoc Networks: Modeling, Optimization and Call Admission Control
Changchun XU Yanyi XU Gan LIU Kezhong LIU

PAPER-Networks

Vol: E93-D No: 1 Page(s): 50-58
Supporting quality-of-service (QoS) for multimedia communications over IEEE 802.11-based ad hoc networks is a challenging task. This paper develops a simple 3-D Markov chain model for queuing analysis of the IEEE 802.11 MAC layer. The model is applied to the performance analysis of voice communications over IEEE 802.11 single-hop ad hoc networks. Using the model, we optimize the performance of the IEEE 802.11 MAC layer and obtain the maximum number of voice calls in IEEE 802.11 ad hoc networks as well as the statistical performance bounds. Furthermore, we design a fully distributed call admission control (CAC) algorithm that can provide a strict statistical QoS guarantee for voice communications over IEEE 802.11 ad hoc networks. Extensive simulations confirm the accuracy of the analytical model and the CAC scheme.
Analysis and Experiments of Maximum Throughput in Wireless Multi-Hop Networks for VoIP Application
Masahiko INABA Yoshihiro TSUCHIYA Hiroo SEKIYA Shiro SAKATA Kengo YAGYU

PAPER-Network

Vol: E92-B No: 11 Page(s): 3422-3431
This paper quantitatively analyzes the maximum UDP (User Datagram Protocol) throughput for two-way flows in wireless string multi-hop networks. The validity of the analysis is shown by comparison with simulation and experimental results. The authors also clarify in detail, based on the simulation results, how the fundamental characteristics of a one-way flow and a two-way flow differ. The results show that collisions at the nodes at both ends are decisive in determining the throughput for two-way flows. The analyses are applicable to estimating the VoIP (Voice over Internet Protocol) capacity of string multi-hop networks, as represented by WLAN (Wireless Local Area Network) mesh networks.
Efficient Implementation of Voiced/Unvoiced Sounds Classification Based on GMM for SMV Codec
Ji-Hyun SONG Joon-Hyuk CHANG

LETTER-Speech and Hearing

Vol: E92-A No: 8 Page(s): 2120-2123
In this letter, we propose an efficient method to improve the performance of the voiced/unvoiced (V/UV) sound decision for the selectable mode vocoder (SMV) of 3GPP2 using the Gaussian mixture model (GMM). We first present an effective analysis of the features and the classification method adopted in the SMV. Feature vectors to be applied to the GMM are then selected from the relevant parameters of the SMV for efficient V/UV classification. The performance of the proposed algorithm is evaluated under various conditions and yields better results than the conventional method of the SMV.
Performance Analysis of the ertPS Algorithm and Enhanced ertPS Algorithm for VoIP Services in IEEE 802.16e Systems
Bong Joo KIM Gang Uk HWANG

PAPER-Network

Vol: E92-B No: 6 Page(s): 2000-2007
In this paper, we analyze the extended real-time Polling Service (ertPS) algorithm in IEEE 802.16e systems, which is designed to support Voice-over-Internet-Protocol (VoIP) services with data packets of various sizes and silence suppression. The analysis uses a two-dimensional Markov Chain, where the grant size and the voice packet state are considered, and an approximation formula for the total throughput in the ertPS algorithm is derived. Next, to improve the performance of the ertPS algorithm, we propose an enhanced uplink resource allocation algorithm, called the e²rtPS algorithm, for VoIP services in IEEE 802.16e systems. The e²rtPS algorithm considers the queue status information and tries to alleviate the queue congestion as soon as possible by using remaining network resources. Numerical results are provided to show the accuracy of the approximation analysis for the ertPS algorithm and to verify the effectiveness of the e²rtPS algorithm.
HMM-Based Style Control for Expressive Speech Synthesis with Arbitrary Speaker's Voice Using Model Adaptation
Takashi NOSE Makoto TACHIBANA Takao KOBAYASHI

PAPER-Speech and Hearing

Vol: E92-D No: 3 Page(s): 489-497
This paper presents methods for controlling the intensity of emotional expressions and speaking styles of an arbitrary speaker's synthetic speech by using a small amount of his/her speech data in HMM-based speech synthesis. Model adaptation approaches are introduced into the style control technique based on the multiple-regression hidden semi-Markov model (MRHSMM). Two different approaches are proposed for training a target speaker's MRHSMMs. The first one is MRHSMM-based model adaptation in which the pretrained MRHSMM is adapted to the target speaker's model. For this purpose, we formulate the MLLR adaptation algorithm for the MRHSMM. The second method utilizes simultaneous adaptation of speaker and style from an average voice model to obtain the target speaker's style-dependent HSMMs which are used for the initialization of the MRHSMM. From the result of subjective evaluation using adaptation data of 50 sentences of each style, we show that the proposed methods outperform the conventional speaker-dependent model training when using the same size of speech data of the target speaker.
