This study presents a fast adaptive algorithm for noise estimation in non-stationary environments. For the noise estimate to adapt quickly to non-stationary noise, a robust entropy-based voice activity detection (VAD) is required. It is well known that an entropy-based measure defined in the spectral domain is very insensitive to changes in the noise level. To exploit the straight-line structures that appear in a speech-only spectrogram, a spectrum entropy measure improved from the one proposed by Shen et al. is presented and named band-splitting spectrum entropy (BSE). Consequently, the proposed recursive noise estimator, which includes the BSE-based VAD, can update the noise power spectrum accurately even when the noise level changes quickly.
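The level insensitivity of spectral entropy mentioned above can be illustrated with a minimal sketch. It computes plain spectral entropy over a power spectrum and a band-grouped variant. The function names, the equal-width band grouping, and the toy spectra are assumptions made for illustration; this is not the BSE definition from the paper, whose details are not given in the abstract.

```python
import math


def spectral_entropy(power):
    """Entropy of a power spectrum normalized to a probability distribution."""
    total = sum(power)
    if total <= 0:
        return 0.0
    probs = [p / total for p in power if p > 0]
    return -sum(q * math.log(q) for q in probs)


def band_energies(power, n_bands):
    """Hypothetical band grouping: sum power over equal-width bands."""
    size = len(power) // n_bands
    return [sum(power[i * size:(i + 1) * size]) for i in range(n_bands)]


# Toy spectra (assumed values): flat noise versus a spectrum with two strong peaks.
flat = [1.0] * 32
peaked = [0.01] * 32
peaked[4] = 10.0
peaked[8] = 6.0

# Scaling the whole spectrum leaves the entropy unchanged,
# because the normalization removes the overall level.
louder_flat = [100.0 * p for p in flat]

entropy_flat = spectral_entropy(flat)
entropy_peaked = spectral_entropy(peaked)
entropy_louder = spectral_entropy(louder_flat)
entropy_banded = spectral_entropy(band_energies(peaked, 8))
```

A noise-like flat spectrum gives the maximum entropy, while a spectrum dominated by a few peaks gives a lower value, so a threshold on the entropy can separate the two regardless of the absolute level.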
An efficient and simple approach to consonant/vowel (C/V) segmentation by incorporating the SNR improvement of a speech enhancement system with the energy variation of two adjacent frames is proposed. Experimental results show that the proposed scheme performs well in segmenting C/V for a spontaneously spoken utterance.
Tatsuya MIZUTANI Takehiko KAGOSHIMA
This paper proposes a novel speech synthesis method to generate human-like natural speech. The conventional unit-selection-based synthesis method selects speech units from a large database, and concatenates them with or without modifying the prosody to generate synthetic speech. This method features highly human-like voice quality. The method, however, has a problem that a suitable speech unit is not necessarily selected. Since the unsuitable speech unit selection causes discontinuity between the consecutive speech units, the synthesized speech quality deteriorates. It might be considered that the conventional method can attain higher speech quality if the database size increases. However, preparation of a larger database requires a longer recording time. The narrator's voice quality does not remain constant throughout the recording period. This fact deteriorates the database quality, and still leaves the problem of unsuitable selection. We propose the plural unit selection and fusion method which avoids this problem. This method integrates the unit fusion used in the unit-training-based method with the conventional unit-selection-based method. The proposed method selects plural speech units for each segment, fuses the selected speech units for each segment, modifies the prosody of the fused speech units, and concatenates them to generate synthetic speech. This unit fusion creates speech units which are connected to one another with much less voice discontinuity, and realizes high quality speech. A subjective evaluation test showed that the proposed method greatly improves the speech quality compared with the conventional method. Also, it showed that the speech quality of the proposed method is kept high regardless of the database size, from small (10 minutes) to large (40 minutes). The proposed method is a new framework in the sense that it is a hybrid method between the unit-selection-based method and the unit-training-based method. 
In the framework, the algorithms of the unit selection and the unit fusion are exchangeable for more efficient techniques. Thus, the framework is expected to lead to new synthesis methods.
Masahiro ARAKI Akiko KOUZAWA Kenji TACHIBANA
In this paper, we propose a new multimodal interaction description language, MIML (Multimodal Interaction Markup Language), which defines dialogue patterns between human and various types of interactive agents. The feature of this language is three-layered description of agent-based interactive systems. The high-level description is a task definition that can easily construct typical agent-based interactive task control information. The middle-level description is an interaction description that defines agent's behavior and user's input at the granularity of dialogue segment. The low-level description is a platform dependent description that can override the pre-defined function in the interaction description. The connection between task-level and interaction-level is realized by generation of interaction description templates from the task level description. The connection between interaction-level and platform-level is realized by a binding mechanism of XML. As a result of the comparison with other languages, MIML has advantages in high-level interaction description, modality extensibility and compatibility with standardized technologies.
Sang-Kil LEE Tae-Kyung CHO Seong-Ho KIM Myung-Ryul CHOI
This letter mathematically proves that the performance of the new protocol in Ref. [1] is better than that of the existing protocol. In the proposed protocol, a frame from an access device is delivered over the access link, multiplexed and packed into an ATM cell at an access node, and then carried toward a voice gateway, by splitting the two sublayers of AAL2. That is, one sublayer is implemented at the subscriber access device and the other at the access node. Access devices using the protocol achieve higher CID utilization and waste fewer ATM resources per access device. Mathematical analysis is performed on the proposed and existing protocols, and both the upstream cell rate and the padding probability are calculated. The proposed protocol shows a lower upstream traffic rate and padding cell probability than the existing protocol.
This paper proposes a Voice Activity Detection (VAD) algorithm using a Radial Basis Function (RBF) network. The k-means clustering and Least Mean Square (LMS) algorithms are used to adapt the RBF network to the underlying speech condition. The inputs to the RBF network are three parameters of a Code Excited Linear Prediction (CELP) coder, which works stably under various background noise levels. An adaptive hangover threshold is applied in the RBF-VAD to reduce errors, because the threshold value involves a trade-off in the VAD decision. The experimental results show that the proposed VAD algorithm achieves better performance than G.729 Annex B at any noise level.
Hiroyuki SUZUKI Heiga ZEN Yoshihiko NANKAKU Chiyomi MIYAJIMA Keiichi TOKUDA Tadashi KITAMURA
This paper describes continuous speech recognition incorporating additional complement information, e.g., voice characteristics, speaking styles, linguistic information and noise environment, into HMM-based acoustic modeling. In speech recognition systems, context-dependent HMMs, i.e., triphones, and tree-based context clustering have commonly been used. Several attempts to utilize not only phonetic contexts but also additional complement information, based on context (factor) dependent HMMs, have been made in recent years. However, when the additional factors for the test data are unobserved, a method for obtaining factor labels is required before decoding. In this paper, we propose a model integration technique based on general factor dependent HMMs for decoding. The integrated HMMs can be used by a conventional decoder as standard triphone HMMs with Gaussian mixture densities. Moreover, by using the results of context clustering, the proposed method can determine an optimal number of mixture components for each state depending on the degree of influence of the additional factors. Phoneme recognition experiments using voice characteristic labels show significant improvements with a small number of model parameters, and a 19.3% error reduction was obtained in noise environment experiments.
Takashi SAITO Masaharu SAKAMOTO
This paper presents a new framework for effectively creating VoiceFonts for speech synthesis. A VoiceFont in this paper represents a voice inventory aimed at generating personalized voices. Creating well-formed voice inventories is a time-consuming and laborious task. This has become a critical issue for speech synthesis systems that make an attempt to synthesize many high quality voice personalities. The framework we propose here aims to drastically reduce the burden with a twofold approach. First, in order to substantially enhance the accuracy and robustness of automatic speech segmentation, we introduce a multi-layered speech segmentation algorithm with a new measure of segmental reliability. Secondly, to minimize the amount of human intervention in the process of VoiceFont creation, we provide easy-to-use functions in a data viewer and compiler to facilitate checking and validation of the automatically extracted data. We conducted experiments to investigate the accuracy of the automatic speech segmentation, and its robustness to speaker and style variations. The results of the experiments on six speech corpora with a fairly large variation of speaking styles show that the speech segmentation algorithm is quite accurate and robust in extracting segments of both phonemes and accentual phrases. In addition, to subjectively evaluate VoiceFonts created by using the framework, we conducted a listening test for speaker recognizability. The results show that the voice personalities of synthesized speech generated by the VoiceFont-based speech synthesizer are fairly close to those of the donor speakers.
Nobuhiko KITAWAKI Kou NAGAI Takeshi YAMADA
Recently, wideband speech communication using 7 kHz-wideband speech coding, as described in ITU-T Recommendations G.722, G.722.1, and G.722.2, has become increasingly necessary for use in advanced IP telephony using PCs, since, for this application, hands-free communication using separate microphones and loudspeakers is indispensable, and in this situation wideband speech is particularly helpful in enhancing the naturalness of communication. An objective quality measurement methodology for wideband-speech coding has been studied, its essential components being an objective quality measure and an input test signal. This paper describes Wideband-PESQ conforming to the draft Annex to ITU-T Recommendation P.862, "Perceptual Evaluation of Speech Quality (PESQ)," as the objective quality measure, by evaluating the consistency between the subjectively evaluated MOS (Mean Opinion Score) and objectively estimated MOS. This paper also describes the verification of artificial voice conforming to Recommendation P.50 "Artificial Voices," as the input test signal for such measurements, by evaluating the consistency between the objectively estimated MOS using a real voice and that obtained using an artificial voice.
Abbas ASOSHEH Mohammad SHIKH-BAHAEI Jonathon A. CHAMBERS
This paper proposes a new FEC scheme that uses a backup channel to send redundant information instead of piggybacking it on the main packet. This is particularly applicable to modern IP networks, which are distributed all over the world. In this method, only one source coder is used for both the main and the redundant payload, to reduce the overall computational complexity. The Gilbert loss model (GLM) is employed to verify the improvement in packet loss probability of the new method compared with that of a single-path FEC scheme. Simulation results show that the proposed backup channel can considerably improve the packet loss and delay performance of VoIP networks.
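As a rough illustration of why a separate path can help under bursty loss, the sketch below evaluates a two-state Gilbert model analytically. It rests on several assumptions that are not taken from the paper: every packet sent in the bad state is lost, the single-path scheme piggybacks the redundancy for packet n on packet n+1, and the backup channel is an independent Gilbert channel with the same parameters. The function names and parameter values are hypothetical.

```python
def stationary_loss(p_good_to_bad, p_bad_to_good):
    """Long-run fraction of time the Gilbert chain spends in the bad (loss) state."""
    return p_good_to_bad / (p_good_to_bad + p_bad_to_good)


def residual_loss_single_path(p_good_to_bad, p_bad_to_good):
    """Assumed piggyback FEC: packet n is unrecoverable when packet n and
    packet n+1 are both lost, i.e. the chain is bad and stays bad."""
    loss = stationary_loss(p_good_to_bad, p_bad_to_good)
    return loss * (1.0 - p_bad_to_good)


def residual_loss_backup_channel(p_good_to_bad, p_bad_to_good):
    """Assumed independent backup path: packet n is unrecoverable only when
    both the main copy and the redundant copy are lost."""
    loss = stationary_loss(p_good_to_bad, p_bad_to_good)
    return loss * loss


# Hypothetical bursty channel: rare entry into the bad state, slow recovery.
P_GB, P_BG = 0.05, 0.5
raw = stationary_loss(P_GB, P_BG)
single = residual_loss_single_path(P_GB, P_BG)
backup = residual_loss_backup_channel(P_GB, P_BG)
```

Under these assumptions the independent path gives the lower residual loss whenever the probability of staying in the bad state exceeds the stationary loss rate, which is the bursty case the Gilbert model is meant to capture.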
Yoshihiro ISHIKAWA Kazuhiko FUKAWA Hiroshi SUZUKI
In communication systems such as mobile telecommunication systems and the Internet, resource sharing among coexisting real-time and non-real-time services is extremely important to provide multimedia services. This paper analytically investigates the performance of the packet data control algorithm proposed in. This algorithm efficiently uses radio resources by utilizing the remaining capacity that is not used by real-time services. The state probability vectors and transition probability matrices of both the real-time and non-real-time services are first derived and then the delay characteristics, the outage probability of voice users, and the outage probability of data users are evaluated. A performance analysis with high bit-rate non-real-time services is also presented.
Hiroki MORI Wakana ODAGIRI Hideki KASUYA
Transitional fundamental frequency (F0) characteristics comprise a crucial part of F0 dynamics in singing. This paper examines the F0 characteristics during the note transition period. An analysis of the singing voice of a professional baritone strongly suggests that asymmetries exist in the mechanisms used for controlling rising and falling. Specifically, the F0 contour in rising transitions can be modeled as a step response from a critically-damped second-order linear system with fixed average/maximum speed of change, whereas that in falling transitions can be modeled as a step response from an underdamped second-order linear system with fixed transition time. The validity of the model is examined through auditory experiments using synthesized singing voice.
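The two step responses named above have standard closed forms, which the following sketch evaluates for a toy note transition. The natural frequency, damping ratio, and note frequencies are assumed values chosen for illustration, and the sketch does not implement the fixed-speed or fixed-transition-time constraints described in the paper.

```python
import math


def critically_damped_step(t, omega):
    """Unit step response of a critically damped second-order system."""
    return 1.0 - (1.0 + omega * t) * math.exp(-omega * t)


def underdamped_step(t, omega, zeta):
    """Unit step response of an underdamped second-order system (0 < zeta < 1)."""
    root = math.sqrt(1.0 - zeta * zeta)
    omega_d = omega * root  # damped oscillation frequency
    decay = math.exp(-zeta * omega * t)
    return 1.0 - decay * (math.cos(omega_d * t) + (zeta / root) * math.sin(omega_d * t))


def f0_contour(step, f_start, f_end, times):
    """Map a unit step response onto a transition between two F0 values (Hz)."""
    return [f_start + (f_end - f_start) * step(t) for t in times]


# Assumed parameters: 5 ms sampling over 0.4 s, omega = 40 rad/s, zeta = 0.5.
times = [i * 0.005 for i in range(81)]
rising = f0_contour(lambda t: critically_damped_step(t, 40.0), 220.0, 247.0, times)
falling = f0_contour(lambda t: underdamped_step(t, 40.0, 0.5), 247.0, 220.0, times)
```

In this sketch the rising contour approaches the target note without overshoot, while the falling contour dips below the target before settling, which is the qualitative asymmetry the abstract describes.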
We propose Optimal Temporal Decomposition (OTD) of speech for voice morphing preserving Δ cepstrum. OTD is an optimal modification of the original Temporal Decomposition (TD) by B. Atal. It is theoretically shown that OTD can achieve minimal spectral distortion for the TD-based approximation of time-varying LPC parameters. Moreover, by applying OTD to preserving Δ cepstrum, it is also theoretically shown that Δ cepstrum of a target speaker can be reflected to that of a source speaker. In frequency domain interpolation, the Laplacian Spectral Distortion (LSD) measure is introduced to improve the Inverse Function of Integrated Spectrum (IFIS) based non-uniform frequency warping. Experimental results indicate that Δ cepstrum of the OTD-based morphing spectra of a source speaker is mostly equal to that of a target speaker except for a piecewise constant factor and subjective listening tests show that the speech intelligibility of the proposed morphing method is superior to the conventional method.
In speech enhancement with an adaptive microphone array, voice activity detection (VAD) is indispensable for adaptation control. Even though many VAD methods have been proposed as pre-processors for speech recognition and compression, they can hardly discriminate the nonstationary interferences that frequently exist in real environments. In this research, we propose a novel VAD method with array signal processing in the wavelet domain. In that domain we can integrate temporal, spectral and spatial information to achieve robust voice activity discriminability for a nonstationary interference arriving from a direction close to that of the speech. The signals acquired by the microphone array are first decomposed into appropriate subbands using wavelet packets to extract their temporal and spectral features. Then a directionality check and direction estimation are executed on each subband to perform VAD with respect to the spatial information. Computer simulation results for sound data demonstrate that the proposed method keeps its discriminability even for an interference arriving from a direction close to that of the speech.
A special group of voice application services (VASs) is promising content for wireless as well as wireline networks. Without a designated call admission policy, VAS calls are expected to suffer from a relatively high blocking probability, since they normally require better signal quality than ordinary voice calls. In this letter, we consider a prioritized call admission design to reduce the blocking probability of VAS calls, so that users can rely on the newly provided VAS. The VAS calls are given priority by reserving a number of channel-processing hardware units. With the reservation, the blocking probability of prioritized VAS calls can be clearly reduced. That of ordinary calls, however, increases instead. This letter provides a system model that accounts for the blocking probabilities of VAS and ordinary calls simultaneously, and numerically examines an adequate level of prioritization for VAS calls.
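The trade-off between the two blocking probabilities can be sketched with a classic guard-channel birth-death model. This is an illustrative stand-in, not the system model of the letter: it assumes Poisson arrivals, exponential holding times with a common rate, and a pool in which ordinary calls are refused once only the reserved units remain. All names and parameter values are hypothetical.

```python
def blocking_probabilities(channels, reserved, lam_vas, lam_ord, mu):
    """Return (VAS blocking, ordinary blocking) for a guard-channel model.

    channels: total channel-processing units
    reserved: units that only VAS calls may occupy
    lam_vas, lam_ord: arrival rates of VAS and ordinary calls
    mu: service rate per call (assumed identical for both classes)
    """
    threshold = channels - reserved  # ordinary calls admitted only below this
    weights = [1.0]  # unnormalized stationary weights for 0..channels busy units
    for busy in range(channels):
        birth = lam_vas + lam_ord if busy < threshold else lam_vas
        death = (busy + 1) * mu
        weights.append(weights[-1] * birth / death)
    total = sum(weights)
    probs = [w / total for w in weights]
    block_vas = probs[channels]          # VAS blocked only when all units are busy
    block_ord = sum(probs[threshold:])   # ordinary blocked at or above the threshold
    return block_vas, block_ord


# Hypothetical load: 10 units, compare no reservation with 2 reserved units.
no_reservation = blocking_probabilities(10, 0, 2.0, 5.0, 1.0)
with_reservation = blocking_probabilities(10, 2, 2.0, 5.0, 1.0)
```

Sweeping `reserved` from zero upward shows VAS blocking falling while ordinary blocking rises, which is the kind of numerical examination needed to choose an adequate level of prioritization.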
This letter proposes that by separating the two sublayers within AAL2 for VoDSL, a non-ATM-based IAD at customer premises can support AAL2 service through a DSLAM including new functions at the central office. To achieve this goal, AAL2 SSCS for bearer channels is located at the CPE. Also, AAL2 CPS and AAL2 SSCS for frame mode service including SSSAR and SSTED are located at the DSLAM. By doing so, one endpoint of an ATM connection at the customer side moves to the DSLAM. All bearer channels, CAS or CCS signaling and DSS1 relay messages from the customer side are transmitted to voice gateway transparently. As a result, the ATM connection using AAL2 can multiplex CPS packets from more AAL2 users, which improves multiplexing gain, minimizes waiting probability, and significantly decreases the number of cells into ATM networks. The simulation shows that the proposed method results in less ATM traffic and padded cell ratio, compared with the existing method.
Junichi YAMAGISHI Masatsune TAMURA Takashi MASUKO Keiichi TOKUDA Takao KOBAYASHI
This paper describes a new training method of average voice model for speech synthesis in which arbitrary speaker's voice is generated based on speaker adaptation. When the amount of training data is limited, the distributions of average voice model often have bias depending on speaker and/or gender and this will degrade the quality of synthetic speech. In the proposed method, to reduce the influence of speaker dependence, we incorporate a context clustering technique called shared decision tree context clustering and speaker adaptive training into the training procedure of average voice model. From the results of subjective tests, we show that the average voice model trained using the proposed method generates more natural sounding speech than the conventional average voice model. Moreover, it is shown that voice characteristics and prosodic features of synthetic speech generated from the adapted model using the proposed method are closer to the target speaker than the conventional method.
Baek-Hyun KIM Seung-Hoon SHIN Kyung-Sup KWAK
Different quality of service (QoS) requirements must be guaranteed in multimedia code division multiple access (CDMA) mobile communication system to support various applications such as voice, video, file transfer, e-mail, and Internet access. In this paper, we analyze the system in a mixed-traffic environment consisting of voice, stream data, and packet data where preemptive priority is granted to delay-intolerant voice service and a buffer is offered to delay-tolerant stream data services. In the case of best-effort packet data service, the probability of access control by transmission permission is applied to obtain throughput improvement. To analyze the multimedia CDMA mobile communication system, we built a two-dimensional Markov chain model on high-priority voice and stream data services, and then performed numerical analyses in conjunction with packet data services based on a residual capacity equation. We investigate the performance of an integrated voice/stream-data/packet-data CDMA mobile system with the finite buffer size of stream data in terms of voice service blocking probability, average stream data service delay, average packet data service delay, and throughput.
Salvatore M. CARTA Luigi RAFFO
A reconfigurable coprocessor for the ETSI-GSM voice coding application domain is presented, synthesized and tested. An average overall reduction of more than 55% in cycles with respect to standard RISC processors with DSP features is obtained. This improvement, together with locality and temporal correlation, allows a reduction in power consumption, while a standard interfacing technique ensures maximum flexibility.
Junichi YAMAGISHI Masatsune TAMURA Takashi MASUKO Keiichi TOKUDA Takao KOBAYASHI
This paper describes a new context clustering technique for average voice model, which is a set of speaker independent speech synthesis units. In the technique, we first train speaker dependent models using multi-speaker speech database, and then construct a decision tree common to these speaker dependent models for context clustering. When a node of the decision tree is split, only the context related questions which are applicable to all speaker dependent models are adopted. As a result, every node of the decision tree always has training data of all speakers. After construction of the decision tree, all speaker dependent models are clustered using the common decision tree and a speaker independent model, i.e., an average voice model is obtained by combining speaker dependent models. From the results of subjective tests, we show that the average voice models trained using the proposed technique can generate more natural sounding speech than the conventional average voice models.