1-7hit |
Kazunori KOMATANI Naoki HOTTA Satoshi SATO Mikio NAKANO
Appropriate turn-taking is important in spoken dialogue systems as well as generating correct responses. Especially if the dialogue features quick responses, a user utterance is often incorrectly segmented due to short pauses within it by voice activity detection (VAD). Incorrectly segmented utterances cause problems both in the automatic speech recognition (ASR) results and turn-taking: i.e., an incorrect VAD result leads to ASR errors and causes the system to start responding though the user is still speaking. We develop a method that performs a posteriori restoration for incorrectly segmented utterances and implement it as a plug-in for the MMDAgent open-source software. A crucial part of the method is to classify whether the restoration is required or not. We cast it as a binary classification problem of detecting originally single utterances from pairs of utterance fragments. Various features are used representing timing, prosody, and ASR result information. Experiments show that the proposed method outperformed a baseline with manually-selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant and domain-independent features were utterance intervals and results from the Gaussian mixture model (GMM).
Xiaoming WANG Xiaohong JIANG Tao YANG Qiaoliang LI Yingshu LI
Routing is still a challenging issue for wireless sensor networks (WSNs), in particular for WSNs with a non-uniform deployment of nodes. This paper introduces a Node Aggregation Degree-aware Random Routing (NADRR) algorithm for non-uniform WSNs with the help of two new concepts, namely the Local Vertical Aggregation Degree (LVAD) and Local Horizontal Aggregation Degree (LHAD). Our basic idea is to first apply the LVAD and LHAD to determine one size-proper forwarding region (rather than a fixed-size one as in uniform node deployment case) for each node participating in routing, then select the next hop node from the size-proper forwarding region in a probabilistic way, considering both the residual energy and distribution of nodes. In this way, a good adaptability to the non-uniform deployment of nodes can be guaranteed by the new routing algorithm. Extensive simulation results show that in comparison with other classical geographic position based routing algorithms, such as GPSR, TPGF and CR, the proposed NADRR algorithm can result in lower node energy consumption, better balance of node energy consumption, higher routing success rate and longer network lifetime.
Yasunari OBUCHI Takashi SUMIYOSHI
In this paper we introduce a new framework of audio processing, which is essential to achieve a trigger-free speech interface for home appliances. If the speech interface works continually in real environments, it must extract occasional voice commands and reject everything else. It is extremely important to reduce the number of false alarms because the number of irrelevant inputs is much larger than the number of voice commands even for heavy users of appliances. The framework, called Intentional Voice Command Detection, is based on voice activity detection, but enhanced by various speech/audio processing techniques such as emotion recognition. The effectiveness of the proposed framework is evaluated using a newly-collected large-scale corpus. The advantages of combining various features were tested and confirmed, and the simple LDA-based classifier demonstrated acceptable performance. The effectiveness of various methods of user adaptation is also discussed.
Yusuke HIWASAKI Toru MORINAGA Jotaro IKEDO Akitoshi KATAOKA
This paper presents a way of using a linear regression model to produce a single-valued criterion that indicates the perceived importance of each block in a stream of speech blocks. This method is superior to the conventional approach, voice activity detection (VAD), in that it provides a dynamically changing priority value for speech segments with finer granularity. The approach can be used in conjunction with scalable speech coding techniques in the context of IP QoS services to achieve a flexible form of quality control for speech transmission. A simple linear regression model is used to estimate a mean opinion score (MOS) of the various cases of missing speech segments. The estimated MOS is a continuous value that can be mapped to priority levels with arbitrary granularity. Through subjective evaluation, we show the validity of the calculated priority values.
Masahiro SERIZAWA Hironori ITO Toshiyuki NOMURA
This paper proposes a silence compression algorithm operating at multi-rates (MR) and with dual-bandwidths (DB), a narrowband and a wideband, for the MPEG (Moving Picture Experts Group)-4 CELP (Code Excited Linear Prediction) standard. The MR/DB operations are implemented by a Variable-Frame-size/Dual-Bandwidth Voice Activity Detection (VF/DB-VAD) module with bandwidth conversions of the input signal, and a Variable-Frame-size Comfort Noise Generator (VF-CNG) module. The CNG module adaptively smoothes the Root Mean Square (RMS) value of the input signal to improve the coding quality during transition periods. The algorithm also employs a Dual-Rate Discontinuous Transmission (DR-DTX) module to reduce an average transmission bitrate during silence periods. Subjective test results show that the proposed silence compression algorithm gives no degradation in coding quality for clean and noisy speech signals. These signals include about 20 to 30% non-speech frames and the average transmission bitrates are reduced by 20 to 40%. The proposed algorithm has been adopted as a part of the ISO/IEC MPEG-4 CELP version 2 standard.
In 1995, 8 kb/s CS-ACELP coder of G.729 is standardized by ITU-T SG15 and it has been reported that the speech quality of G.729 is better than or equal to that of 32 kb/s ADPCM (G.726). However G.729 is the fixed rate speech coder, and it does not consider the property of voice activity in mutual conversation. If we use the voice activity, we can reduce the average bit rate in half without any degradations of the speech quality. In this paper, we propose an efficient variable rate algorithm for G.729. The variable rate algorithm consists of two main subjects, the rate determination algorithm and the design of sub rate coders. For the robust VAD algorithm, we combine the energy-thresholding method, the phonetic segmentation method by integration of various feature parameters obtained through the analysis procedure, and the variable hangover period method. Through the analysis of noise features, the 1 kb/s sub rate coder is designed for coding the background noise signal. Also, we design the 4 kb/s sub rate coder for the unvoiced parts. The performance of the variable rate algorithm is evaluated by the comparison of speech quality and average bit rate with G.729. Subjective quality test is also done by MOS test. Conclusively, it is verified that the proposed variable rate CS-ACELP coder produces the same speech quality as G.729, at the average bit rate of 4.4 kb/s.
Daoud BERKANI Hisham HASSANEIN Jean-Pierre ADOUL
The development of saturation diving in civil and defense applications has enabled man to work in the sea at great depths and for long periods of time. This advance has resulted, in part, as a consequence of the substitution of helium for nitrogen in breathing gas mixtures. However, utilization of HeO2 breathing mixture at high ambient pressures has caused problems in speech communication; in turn, helium speech enhancement systems have been developed to improve diver communication. These speech unscramblers attempt to process variously the grossly unintelligible speech resulting from the effect of breathing mixtures and ambient pressure, and to reconstruct such signals in order to provide adequate voice communication. It is known that the glottal excitation is quasi-periodic and the vocal tract filter is quasi-stationary. Hence, it is possible to use an auto regressive modelisation to restore speech intelligibility in hyperbaric conditions. Corrections are made on the vocal tract transfer function, either in the frequency domain, or directly on the autocorrelation function. A spectral subtraction or noise reduction may be added to improve speech quality. A new VAD enhanced helium speech unscrambler is proposed for use in adverse conditions or in speech recognition. This system, implementable on single chip DSP of current technology, is capable to work in real time.