The search functionality is under construction.

Keyword Search Result

[Keyword] VAD (7 hits)

Hits 1-7
  • Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances

    Kazunori KOMATANI  Naoki HOTTA  Satoshi SATO  Mikio NAKANO  

     
    PAPER-Speech and Hearing

      Publicized:
    2015/07/24
      Vol:
    E98-D No:11
      Page(s):
    1923-1931

    In spoken dialogue systems, appropriate turn-taking is as important as generating correct responses. Especially when the dialogue features quick responses, voice activity detection (VAD) often segments a user utterance incorrectly because of short pauses within it. Incorrectly segmented utterances cause problems in both the automatic speech recognition (ASR) results and turn-taking: an incorrect VAD result leads to ASR errors and causes the system to start responding even though the user is still speaking. We develop a method that performs a posteriori restoration for incorrectly segmented utterances and implement it as a plug-in for the MMDAgent open-source software. A crucial part of the method is to classify whether the restoration is required or not. We cast it as a binary classification problem of detecting originally single utterances from pairs of utterance fragments. Various features representing timing, prosody, and ASR result information are used. Experiments show that the proposed method outperformed a baseline with manually-selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant and domain-independent features were utterance intervals and results from the Gaussian mixture model (GMM).
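
The binary decision described above (should two fragments be restored into one utterance?) can be illustrated with a minimal sketch. The logistic model, the single interval feature, and the weight and bias values are all hypothetical and chosen only for illustration; the paper's actual classifier and feature set are not reproduced here.

```python
import math

def restoration_probability(features, weights, bias):
    """Logistic score that two fragments were originally one utterance."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def needs_restoration(features, weights, bias, threshold=0.5):
    """Binary decision: restore the pair if the score reaches the threshold."""
    return restoration_probability(features, weights, bias) >= threshold

# Hypothetical one-feature model: the feature is the interval (in seconds)
# between the two fragments; a shorter interval makes restoration more likely.
WEIGHTS = [-4.0]
BIAS = 2.0

print(needs_restoration([0.2], WEIGHTS, BIAS))  # True: short pause
print(needs_restoration([1.0], WEIGHTS, BIAS))  # False: long pause
```

In practice the weights would be learned from labeled fragment pairs, and prosodic and ASR-based features would be appended to the feature vector.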

  • Node Aggregation Degree-Aware Random Routing for Non-uniform Wireless Sensor Networks

    Xiaoming WANG  Xiaohong JIANG  Tao YANG  Qiaoliang LI  Yingshu LI  

     
    PAPER-Network

      Vol:
    E94-B No:1
      Page(s):
    97-108

    Routing is still a challenging issue for wireless sensor networks (WSNs), in particular for WSNs with a non-uniform deployment of nodes. This paper introduces a Node Aggregation Degree-aware Random Routing (NADRR) algorithm for non-uniform WSNs with the help of two new concepts, namely the Local Vertical Aggregation Degree (LVAD) and Local Horizontal Aggregation Degree (LHAD). Our basic idea is to first apply the LVAD and LHAD to determine one size-proper forwarding region (rather than a fixed-size one as in uniform node deployment case) for each node participating in routing, then select the next hop node from the size-proper forwarding region in a probabilistic way, considering both the residual energy and distribution of nodes. In this way, a good adaptability to the non-uniform deployment of nodes can be guaranteed by the new routing algorithm. Extensive simulation results show that in comparison with other classical geographic position based routing algorithms, such as GPSR, TPGF and CR, the proposed NADRR algorithm can result in lower node energy consumption, better balance of node energy consumption, higher routing success rate and longer network lifetime.
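
The probabilistic next-hop choice can be sketched as follows. This is a simplified illustration, not the NADRR algorithm itself: it assumes the forwarding region has already been determined, and it weights candidates by residual energy only, whereas the abstract says the node distribution is also taken into account. The function names and the dictionary input are hypothetical.

```python
import random

def selection_probabilities(residual_energy):
    """Map each candidate node to a probability proportional to its residual energy."""
    total = sum(residual_energy.values())
    if total <= 0:
        raise ValueError("no candidate has residual energy left")
    return {node: energy / total for node, energy in residual_energy.items()}

def pick_next_hop(residual_energy, rng=random):
    """Draw one next-hop node at random, weighted by residual energy."""
    probs = selection_probabilities(residual_energy)
    nodes = list(probs)
    return rng.choices(nodes, weights=[probs[n] for n in nodes], k=1)[0]

# Hypothetical candidates inside one node's forwarding region (energy in joules).
candidates = {"n1": 3.0, "n2": 1.0}
print(selection_probabilities(candidates))  # {'n1': 0.75, 'n2': 0.25}
print(pick_next_hop(candidates, random.Random(42)))
```

Randomizing the choice instead of always taking the highest-energy neighbor spreads the forwarding load, which is one plausible way to balance node energy consumption.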

  • Intentional Voice Command Detection for Trigger-Free Speech Interface

    Yasunari OBUCHI  Takashi SUMIYOSHI  

     
    PAPER-Robust Speech Recognition

      Vol:
    E93-D No:9
      Page(s):
    2440-2450

    In this paper we introduce a new framework of audio processing, which is essential to achieve a trigger-free speech interface for home appliances. If the speech interface works continually in real environments, it must extract occasional voice commands and reject everything else. It is extremely important to reduce the number of false alarms because the number of irrelevant inputs is much larger than the number of voice commands even for heavy users of appliances. The framework, called Intentional Voice Command Detection, is based on voice activity detection, but enhanced by various speech/audio processing techniques such as emotion recognition. The effectiveness of the proposed framework is evaluated using a newly-collected large-scale corpus. The advantages of combining various features were tested and confirmed, and the simple LDA-based classifier demonstrated acceptable performance. The effectiveness of various methods of user adaptation is also discussed.

  • Measuring the Perceived Importance of Speech Segments for Transmission over IP Networks (Open Access)

    Yusuke HIWASAKI  Toru MORINAGA  Jotaro IKEDO  Akitoshi KATAOKA  

     
    PAPER

      Vol:
    E89-B No:2
      Page(s):
    326-333

    This paper presents a way of using a linear regression model to produce a single-valued criterion that indicates the perceived importance of each block in a stream of speech blocks. This method is superior to the conventional approach, voice activity detection (VAD), in that it provides a dynamically changing priority value for speech segments with finer granularity. The approach can be used in conjunction with scalable speech coding techniques in the context of IP QoS services to achieve a flexible form of quality control for speech transmission. A simple linear regression model is used to estimate a mean opinion score (MOS) of the various cases of missing speech segments. The estimated MOS is a continuous value that can be mapped to priority levels with arbitrary granularity. Through subjective evaluation, we show the validity of the calculated priority values.
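
A rough sketch of the two steps described above: a linear model estimates a MOS, and the continuous MOS is quantized into priority levels. The feature values, weights, and the convention that a lower estimated MOS (measured with the block missing) means a higher priority are assumptions made for illustration, not details taken from the paper.

```python
def estimate_mos(features, weights, bias):
    """Linear regression prediction: bias plus the weighted sum of block features."""
    return bias + sum(w * x for w, x in zip(weights, features))

def mos_to_priority(mos, levels=4, mos_min=1.0, mos_max=5.0):
    """Quantize an estimated MOS into one of `levels` priority classes.

    Assumed convention: level 0 is the highest priority, i.e. losing the
    block would lower the MOS the most.
    """
    clipped = min(max(mos, mos_min), mos_max)
    fraction = (clipped - mos_min) / (mos_max - mos_min)
    return min(int(fraction * levels), levels - 1)

# Hypothetical block features and regression coefficients.
mos = estimate_mos([1.0, 2.0], [0.5, -0.25], 3.0)
print(mos)                             # 3.0
print(mos_to_priority(mos, levels=4))  # 2
```

Because the estimated MOS is continuous, `levels` can be set to whatever granularity the network's QoS classes support.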

  • A Silence Compression Algorithm for the Multi-Rate Dual-Bandwidth MPEG-4 CELP Standard

    Masahiro SERIZAWA  Hironori ITO  Toshiyuki NOMURA  

     
    PAPER-Speech and Audio Coding

      Vol:
    E86-D No:3
      Page(s):
    412-417

    This paper proposes a silence compression algorithm operating at multiple rates (MR) and with dual bandwidths (DB), narrowband and wideband, for the MPEG (Moving Picture Experts Group)-4 CELP (Code Excited Linear Prediction) standard. The MR/DB operations are implemented by a Variable-Frame-size/Dual-Bandwidth Voice Activity Detection (VF/DB-VAD) module with bandwidth conversions of the input signal, and a Variable-Frame-size Comfort Noise Generator (VF-CNG) module. The CNG module adaptively smooths the Root Mean Square (RMS) value of the input signal to improve the coding quality during transition periods. The algorithm also employs a Dual-Rate Discontinuous Transmission (DR-DTX) module to reduce the average transmission bitrate during silence periods. Subjective test results show that the proposed silence compression algorithm gives no degradation in coding quality for clean and noisy speech signals. These signals include about 20 to 30% non-speech frames and the average transmission bitrates are reduced by 20 to 40%. The proposed algorithm has been adopted as a part of the ISO/IEC MPEG-4 CELP version 2 standard.
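
To illustrate the idea of smoothing the RMS value that drives comfort noise generation, here is a minimal sketch using first-order recursive (exponential) smoothing. The fixed coefficient `alpha` is an assumption; the abstract describes the smoothing as adaptive, and the standard's actual update rule is not reproduced here.

```python
import math

def frame_rms(frame):
    """Root mean square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def smooth_rms(rms_values, alpha=0.9):
    """Exponentially smooth a sequence of per-frame RMS values.

    alpha close to 1 tracks slowly (smoother comfort noise level);
    alpha close to 0 follows the input almost immediately.
    """
    smoothed = []
    state = None
    for rms in rms_values:
        # The first frame initializes the state; later frames blend in.
        state = rms if state is None else alpha * state + (1.0 - alpha) * rms
        smoothed.append(state)
    return smoothed

print(frame_rms([1.0, -1.0, 1.0, -1.0]))   # 1.0
print(smooth_rms([1.0, 0.0], alpha=0.5))   # [1.0, 0.5]
```

An adaptive variant would change `alpha` from frame to frame, for example tracking faster right after a speech-to-silence transition.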

  • Design of a Variable Rate Algorithm for the CS-ACELP Coder

    Woosung CHUNG  Sangwon KANG  

     
    PAPER-Speech Processing and Acoustics

      Vol:
    E82-D No:10
      Page(s):
    1364-1371

    In 1995, the 8 kb/s CS-ACELP coder of G.729 was standardized by ITU-T SG15, and it has been reported that the speech quality of G.729 is better than or equal to that of 32 kb/s ADPCM (G.726). However, G.729 is a fixed-rate speech coder and does not exploit the voice activity patterns of two-way conversation. By exploiting voice activity, the average bit rate can be reduced by half without any degradation of speech quality. In this paper, we propose an efficient variable rate algorithm for G.729. The variable rate algorithm consists of two main parts: the rate determination algorithm and the design of sub-rate coders. For a robust VAD algorithm, we combine the energy-thresholding method, the phonetic segmentation method, which integrates various feature parameters obtained through the analysis procedure, and the variable hangover period method. Through the analysis of noise features, a 1 kb/s sub-rate coder is designed for coding the background noise signal. We also design a 4 kb/s sub-rate coder for the unvoiced parts. The performance of the variable rate algorithm is evaluated by comparing its speech quality and average bit rate with those of G.729. A subjective quality test is also carried out using the MOS test. The results verify that the proposed variable rate CS-ACELP coder produces the same speech quality as G.729 at an average bit rate of 4.4 kb/s.
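
The combination of energy thresholding and a hangover period can be sketched as below. This is a simplified illustration under stated assumptions: the threshold value is arbitrary, the hangover length is fixed (the paper uses a variable hangover period), and the phonetic segmentation step is omitted entirely.

```python
import math

def frame_energy_db(frame):
    """Mean-square energy of one frame in dB (a small floor avoids log of zero)."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(mean_sq, 1e-12))

def vad_with_hangover(frames, threshold_db=-40.0, hangover_frames=3):
    """Return one speech/non-speech flag per frame.

    A frame is active if its energy exceeds the threshold. After the last
    active frame, the decision stays active for `hangover_frames` more
    frames so that weak word endings are not clipped.
    """
    decisions = []
    hang = 0
    for frame in frames:
        if frame_energy_db(frame) > threshold_db:
            hang = hangover_frames      # re-arm the hangover counter
            decisions.append(True)
        elif hang > 0:
            hang -= 1                   # still inside the hangover period
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions

loud = [0.5] * 80      # hypothetical speech frame
quiet = [0.0001] * 80  # hypothetical background frame
print(vad_with_hangover([quiet, loud, quiet, quiet, quiet], hangover_frames=2))
# [False, True, True, True, False]
```

In a variable rate coder, these flags would select between the full-rate coder and the lower-rate sub-rate coders on a frame-by-frame basis.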

  • A Single DSP System for High Quality Enhancement of Diver's Speech

    Daoud BERKANI  Hisham HASSANEIN  Jean-Pierre ADOUL  

     
    PAPER-Neural Networks/Signal Processing/Information Storage

      Vol:
    E81-A No:10
      Page(s):
    2151-2158

    The development of saturation diving in civil and defense applications has enabled man to work in the sea at great depths and for long periods of time. This advance has resulted, in part, from the substitution of helium for nitrogen in breathing gas mixtures. However, the use of a HeO2 breathing mixture at high ambient pressures has caused problems in speech communication; in turn, helium speech enhancement systems have been developed to improve diver communication. These speech unscramblers attempt to process, in various ways, the grossly unintelligible speech resulting from the effect of breathing mixtures and ambient pressure, and to reconstruct such signals in order to provide adequate voice communication. It is known that the glottal excitation is quasi-periodic and the vocal tract filter is quasi-stationary. Hence, it is possible to use autoregressive modeling to restore speech intelligibility in hyperbaric conditions. Corrections are made to the vocal tract transfer function, either in the frequency domain or directly on the autocorrelation function. Spectral subtraction or noise reduction may be added to improve speech quality. A new VAD-enhanced helium speech unscrambler is proposed for use in adverse conditions or in speech recognition. This system, implementable on a single-chip DSP of current technology, is capable of working in real time.
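
As background for the autoregressive modeling step, the sketch below fits an AR model to a frame's autocorrelation sequence with the standard Levinson-Durbin recursion. It shows only generic AR fitting; the helium-specific corrections to the vocal tract transfer function or to the autocorrelation are not modeled, and the function names are hypothetical.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of one frame (no normalization)."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Fit predictor coefficients a[1..order] so that x[n] ~ sum(a[k] * x[n-k]).

    Returns (coefficients, prediction_error_energy).
    """
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # degenerate frame (e.g. all zeros): stop early
        # Reflection coefficient for model order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error

# Autocorrelation of an ideal first-order process with coefficient 0.5.
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coeffs, err)  # [0.5, 0.0] 0.75
```

Working on the autocorrelation sequence is convenient because a correction applied to `r` before the recursion changes the fitted vocal tract filter directly.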