
Keyword Search Result

[Keyword] voice(140hit)

61-80hit(140hit)

  • Voice Activity Detection Based on High Order Statistics and Online EM Algorithm

    David COURNAPEAU  Tatsuya KAWAHARA  

     
    PAPER-Speech and Hearing

    Vol: E91-D No:12  Page(s): 2854-2861

    A new online, unsupervised voice activity detection (VAD) method is proposed. The method is based on a feature derived from high-order statistics (HOS), enhanced by a second metric based on normalized autocorrelation peaks to improve its robustness to non-Gaussian noise. The feature is also designed to discriminate between close-talk and far-field speech, thus providing a VAD method for human-to-human interaction that is independent of the energy level. Classification is done by an online variant of the Expectation-Maximization (EM) algorithm, which tracks and adapts to noise variations in the speech signal. Performance of the proposed method is evaluated on in-house data and on CENSREC-1-C, a publicly available database for VAD in the context of automatic speech recognition (ASR). On both test sets, the proposed method outperforms a simple energy-based algorithm and is more robust to changes in speech sparsity, SNR, and noise type.
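The online EM step described above can be illustrated with a toy stepwise-EM update for a two-component 1-D Gaussian mixture. This is a minimal sketch under stated assumptions, not the paper's algorithm: the HOS and autocorrelation feature extraction is omitted, and the learning rate `eta` and initial parameters are arbitrary choices.

```python
import math
import random

class OnlineGMM2:
    """Two-component 1-D Gaussian mixture updated with stepwise EM.

    Each incoming frame feature nudges the mixture parameters with
    learning rate eta, so the model tracks slow noise variations.
    """
    def __init__(self, mu=(0.0, 1.0), var=(1.0, 1.0), w=(0.5, 0.5), eta=0.05):
        self.mu, self.var, self.w, self.eta = list(mu), list(var), list(w), eta

    def _pdf(self, x, k):
        return math.exp(-(x - self.mu[k]) ** 2 / (2 * self.var[k])) / math.sqrt(
            2 * math.pi * self.var[k])

    def update(self, x):
        # E-step: posterior responsibility of each component for x
        p = [self.w[k] * self._pdf(x, k) for k in (0, 1)]
        s = sum(p) or 1e-12
        r = [pk / s for pk in p]
        # M-step (stochastic): move parameters toward the sufficient statistics
        for k in (0, 1):
            self.w[k] += self.eta * (r[k] - self.w[k])
            self.mu[k] += self.eta * r[k] * (x - self.mu[k])
            self.var[k] += self.eta * r[k] * ((x - self.mu[k]) ** 2 - self.var[k])
            self.var[k] = max(self.var[k], 1e-4)
        return r[1]  # posterior of the "speech" component

random.seed(0)
gmm = OnlineGMM2()
# Feed noise-like features (around 0) and speech-like features (around 3)
for _ in range(2000):
    gmm.update(random.gauss(0.0, 0.5))
    gmm.update(random.gauss(3.0, 0.5))
speech_post = gmm.update(3.0)
noise_post = gmm.update(0.0)
```

After enough frames the two components separate, and the posterior of the second component serves as a per-frame speech probability.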

  • Objective Pathological Voice Quality Assessment Based on HOS Features

    Ji-Yeoun LEE  Sangbae JEONG  Hong-Shik CHOI  Minsoo HAHN  

     
    LETTER-Speech and Hearing

    Vol: E91-D No:12  Page(s): 2888-2891

    This work proposes new features to improve pathological voice quality classification performance: the means, variances, and perturbations of higher-order statistics (HOS) such as skewness and kurtosis. The HOS-based features show meaningful differences among normal, grade 1, grade 2, and grade 3 voices on the GRBAS scale. Jitter, shimmer, the harmonic-to-noise ratio (HNR), and the variance of short-time energy are used as the conventional features. Performance is measured with the classification and regression tree (CART) method. Specifically, the CART-based method using both the conventional features and the HOS-based ones proves effective for pathological voice quality measurement, with a classification accuracy of 87.8%.
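For reference, the HOS quantities named above (skewness and kurtosis) can be computed per frame as plain sample standardized moments; this sketch assumes the simple biased-moment definitions, which may differ in detail from the letter's estimators.

```python
def skewness(frame):
    """Sample skewness: the third standardized moment (0 for symmetric data)."""
    n = len(frame)
    mean = sum(frame) / n
    m2 = sum((x - mean) ** 2 for x in frame) / n
    m3 = sum((x - mean) ** 3 for x in frame) / n
    return m3 / (m2 ** 1.5 + 1e-12)

def kurtosis(frame):
    """Excess kurtosis: fourth standardized moment minus 3 (0 for a Gaussian)."""
    n = len(frame)
    mean = sum(frame) / n
    m2 = sum((x - mean) ** 2 for x in frame) / n
    m4 = sum((x - mean) ** 4 for x in frame) / n
    return m4 / (m2 ** 2 + 1e-12) - 3.0
```

The means, variances, and frame-to-frame perturbations of these two values over an utterance then form the HOS feature set.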

  • Objective Speech Quality Assessment Based on Payload Discrimination of Lost Packets for Cellular Phones in NGN Environment

    Satoshi UEMURA  Norihiro FUKUMOTO  Hideaki YAMADA  Hajime NAKAMURA  

     
    PAPER-Network Management/Operation

    Vol: E91-B No:11  Page(s): 3667-3676

    A feature of services provided in a Next Generation Network (NGN) is that end-to-end quality is guaranteed. This is quite a challenging issue, given the considerable fluctuation in network conditions within a Fixed Mobile Convergence (FMC) network. A novel approach, in which a network node and a mobile terminal such as a cellular phone cooperate to control service quality, is therefore essential. To achieve such cooperation, the mobile terminal needs to become more intelligent, so that it can estimate the service quality, including the user's perceptual quality, and report the measurement result to the network node. The network node then applies some service control function, such as resource and admission control, based on the notification from the mobile terminal. This paper focuses on the role of the mobile terminal in such a collaborative system. As part of a QoS/QoE measurement system, we describe an objective speech quality assessment based on payload discrimination of lost packets, which measures the user's perceptual quality of VoIP. The proposed assessment is simple enough to be implemented on a cellular phone, and we implemented it there as part of the QoS/QoE measurement system. Using the implemented system, we can measure the user's perceptual quality of VoIP as well as network QoS metrics such as packet loss rate, jitter, and burstiness in real time.
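The network QoS metrics mentioned (packet loss rate, jitter, loss burstiness) could be computed from a packet trace along the following lines. The `(sequence number, arrival time)` trace format and the burstiness definition (mean length of consecutive-loss runs) are illustrative assumptions, not the paper's exact measures.

```python
def qos_metrics(packets):
    """Compute loss rate, mean inter-arrival jitter, and mean loss-burst
    length from a list of (seq, arrival_time) pairs, sorted by seq."""
    seqs = [s for s, _ in packets]
    times = [t for _, t in packets]
    expected = seqs[-1] - seqs[0] + 1
    lost = expected - len(seqs)
    loss_rate = lost / expected
    # Jitter: mean absolute difference of consecutive inter-arrival gaps
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    jitter = sum(abs(g2 - g1) for g1, g2 in zip(gaps, gaps[1:])) / max(len(gaps) - 1, 1)
    # Burstiness: average length of each run of consecutively lost packets
    bursts = [s2 - s1 - 1 for s1, s2 in zip(seqs, seqs[1:]) if s2 - s1 > 1]
    burstiness = sum(bursts) / len(bursts) if bursts else 0.0
    return loss_rate, jitter, burstiness

# Hypothetical trace: packets 4, 5, and 8 of a 10-packet stream were lost
trace = [(1, 0.00), (2, 0.02), (3, 0.04), (6, 0.10), (7, 0.12), (9, 0.16), (10, 0.18)]
loss_rate, jitter, burstiness = qos_metrics(trace)
```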

  • Utterance Verification Using Word Voiceprint Models Based on Probabilistic Distributions of Phone-Level Log-Likelihood Ratio and Phone Duration

    Suk-Bong KWON  HoiRin KIM  

     
    LETTER-Speech and Hearing

    Vol: E91-D No:11  Page(s): 2746-2750

    This paper proposes word voiceprint models to verify the results of a speech recognition system. A word voiceprint model carries word-dependent information based on the distributions of phone-level log-likelihood ratios and phone durations, so a recognized word can be scored with a confidence measure that reflects the verification characteristics specific to that word. Additionally, when computing the log-likelihood-ratio-based word voiceprint score, this paper proposes a new log-scale normalization function based on the distribution of the phone-level log-likelihood ratio, instead of the sigmoid function widely used for phone-level log-likelihood ratios. This function emphasizes mis-recognized phones within a word, and the resulting word-specific information yields a more discriminative score against out-of-vocabulary words. The proposed method requires additional memory, but achieves a 16.9% relative reduction in equal error rate compared to the baseline system using simple phone log-likelihood ratios.
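One way to see why a log-scale normalization emphasizes mis-recognized phones is to contrast it with the sigmoid on a word containing one severely negative phone-level LLR: the sigmoid saturates near 0, while a log-scale score keeps falling. The log-sigmoid used here is a plausible stand-in, not the paper's exact function.

```python
import math

def sigmoid_norm(llr):
    """Conventional sigmoid normalization: saturates for very negative LLRs."""
    if llr >= 0:
        return 1.0 / (1.0 + math.exp(-llr))
    e = math.exp(llr)
    return e / (1.0 + e)

def log_sigmoid(llr):
    """Log-scale normalization (one plausible choice): log of the sigmoid,
    computed stably. Unlike the sigmoid it keeps decreasing as the LLR gets
    worse, so one badly mis-recognized phone dominates the word score."""
    return min(llr, 0.0) - math.log1p(math.exp(-abs(llr)))

def word_score(phone_llrs, norm):
    """Average the per-phone normalized scores into a word-level score."""
    return sum(norm(l) for l in phone_llrs) / len(phone_llrs)

good = [5.0, 5.0, 5.0]
bad = [5.0, 5.0, -50.0]   # one severely mis-recognized phone
```

With the sigmoid, `good` and `bad` differ by a bounded amount; with the log-scale score, the mis-recognized phone pulls `bad` far below `good`, which is exactly the discriminative behavior the letter is after.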

  • A Support Vector Machine-Based Voice Activity Detection Employing Effective Feature Vectors

    Q-Haing JO  Yun-Sik PARK  Kye-Hwan LEE  Joon-Hyuk CHANG  

     
    LETTER-Multimedia Systems for Communications

    Vol: E91-B No:6  Page(s): 2090-2093

    In this letter, we propose effective feature vectors to improve the performance of voice activity detection (VAD) employing a support vector machine (SVM), which learns an optimized nonlinear decision boundary between two classes. To extract the effective feature vectors, we present a novel scheme that combines the a posteriori SNR, a priori SNR, and predicted SNR, quantities widely adopted in conventional statistical model-based VAD.
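The SNR quantities combined into the feature vector are standard in statistical model-based VAD. A sketch of the a posteriori SNR and the usual decision-directed a priori SNR estimate follows; the "predicted" SNR in the letter is a further extrapolation not shown here, and the smoothing constant `alpha` is the conventional choice rather than the letter's.

```python
def a_posteriori_snr(noisy_power, noise_power):
    """Per-bin a posteriori SNR: gamma_k = |Y_k|^2 / lambda_N,k."""
    return [y / max(n, 1e-12) for y, n in zip(noisy_power, noise_power)]

def a_priori_snr_dd(prev_clean_power, prev_noise_power, gamma, alpha=0.98):
    """Decision-directed a priori SNR estimate (the standard recursion from
    statistical model-based VAD/enhancement):
      xi_k = alpha * |X_k(t-1)|^2 / lambda_N,k(t-1) + (1-alpha) * max(gamma_k - 1, 0)
    """
    return [alpha * xp / max(np_, 1e-12) + (1.0 - alpha) * max(g - 1.0, 0.0)
            for xp, np_, g in zip(prev_clean_power, prev_noise_power, gamma)]

gamma = a_posteriori_snr([4.0, 1.0], [1.0, 1.0])
xi = a_priori_snr_dd([2.0, 0.0], [1.0, 1.0], gamma)
```

Stacking these per-bin (or per-band) values frame by frame yields the kind of feature vector an SVM can then classify.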

  • Noise Robust Voice Activity Detection Based on Switching Kalman Filter

    Masakiyo FUJIMOTO  Kentaro ISHIZUKA  

     
    PAPER-Voice Activity Detection

    Vol: E91-D No:3  Page(s): 467-477

    This paper addresses the problem of voice activity detection (VAD) in noisy environments. The VAD method proposed in this paper is based on a statistical model approach, and estimates statistical models sequentially without a priori knowledge of noise. Namely, the proposed method constructs a clean speech / silence state transition model beforehand, and sequentially adapts the model to the noisy environment by using a switching Kalman filter when a signal is observed. In this paper, we carried out two evaluations. In the first, we observed that the proposed method significantly outperforms conventional methods as regards voice activity detection accuracy in simulated noise environments. Second, we evaluated the proposed method on a VAD evaluation framework, CENSREC-1-C. The evaluation results revealed that the proposed method significantly outperforms the baseline results of CENSREC-1-C as regards VAD accuracy in real environments. In addition, we confirmed that the proposed method helps to improve the accuracy of concatenated speech recognition in real environments.
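A full switching Kalman filter is beyond a short snippet, but the predict/update recursion it repeatedly applies per hypothesized state can be sketched for the scalar case; the random-walk state model and the parameter values here are arbitrary illustrations, not the paper's model.

```python
import random

def kalman_step(x_est, p_est, z, q=1e-3, r=1e-1):
    """One predict/update cycle of a scalar Kalman filter with random-walk
    state model x_t = x_{t-1} + w (variance q) and observation z = x + v
    (variance r). Returns the updated state estimate and its variance."""
    # Predict
    x_pred = x_est
    p_pred = p_est + q
    # Update
    k = p_pred / (p_pred + r)          # Kalman gain
    x_new = x_pred + k * (z - x_pred)  # correct toward the observation
    p_new = (1.0 - k) * p_pred
    return x_new, p_new

# Track a constant level from noisy observations
random.seed(1)
x, p = 0.0, 1.0
for _ in range(500):
    x, p = kalman_step(x, p, 2.0 + random.gauss(0.0, 0.3))
```

A switching variant runs one such recursion per discrete state (speech/silence) and weighs them by state posterior probabilities.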

  • Pathological Voice Detection Using Efficient Combination of Heterogeneous Features

    Ji-Yeoun LEE  Sangbae JEONG  Minsoo HAHN  

     
    LETTER-Speech and Hearing

    Vol: E91-D No:2  Page(s): 367-370

    A combination of mutually complementary features is necessary to cope with the varied patterns that distinguish normal from pathological voices. This paper proposes a method to improve pathological/normal voice classification performance by combining heterogeneous features. Different combinations of auditory-based and higher-order features are investigated, and their performance is measured with Gaussian mixture models (GMMs), linear discriminant analysis (LDA), and a classification and regression tree (CART) method. The proposed classification method using CART analysis proves effective for pathological voice detection, with a classification rate of 92.7%, a noticeable improvement corresponding to a 54.32% error reduction over the MFCC-based GMM algorithm.

  • Voice Navigation in Web-Based Learning Materials--An Investigation Using Eye Tracking

    Kiyoshi NOSU  Ayako KANDA  Takeshi KOIKE  

     
    PAPER-Human-computer Interaction

    Vol: E90-D No:11  Page(s): 1772-1778

    Eye tracking is a useful tool for accurately mapping where and for how long an individual learner looks at a video/image, in order to obtain immediate information regarding the distribution of a learner's attention among the elements of a video/image. This paper describes a quantitative investigation into the effect of voice navigation in web-based learning materials.

  • VLSI Architecture for the Low-Computation Cycle and Power-Efficient Recursive DFT/IDFT Design

    Lan-Da VAN  Chin-Teng LIN  Yuan-Chu YU  

     
    PAPER-Digital Signal Processing

    Vol: E90-A No:8  Page(s): 1644-1652

    In this paper, we propose a low-computation-cycle and power-efficient recursive discrete Fourier transform (DFT)/inverse DFT (IDFT) architecture adopting a hybrid of input strength reduction, the Chebyshev polynomial, and register-splitting schemes. Compared with existing recursive DFT/IDFT architectures, the proposed architecture halves the number of computation cycles. Applying this novel low-computation-cycle architecture, we can double the throughput rate and the channel density without increasing the operating frequency for the dual-tone multi-frequency (DTMF) detector in high-channel-density voice over packet (VoP) applications. Chip implementation results show that the proposed architecture is capable of processing over 128 channels, each consuming 9.77 µW at 1.2 V@20 MHz in a TSMC 0.13 µm 1P8M CMOS process. The VLSI implementation demonstrates the power-efficiency advantage of the low-computation-cycle architecture.
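For context, the classic recursive single-bin DFT used in DTMF detectors is the Goertzel algorithm, whose second-order recurrence is of the same Chebyshev-polynomial flavor as the scheme described above. This sketch illustrates that recursion, not the paper's VLSI architecture.

```python
import math

def goertzel_power(samples, freq, fs):
    """Squared magnitude of one DFT bin via the Goertzel recurrence
    s[n] = x[n] + 2*cos(w)*s[n-1] - s[n-2], the second-order
    (Chebyshev-type) recursion that makes per-bin recursive DFTs cheap."""
    w = 2.0 * math.pi * freq / fs
    coeff = 2.0 * math.cos(w)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

fs = 8000
n = 205  # standard DTMF block size at 8 kHz
# DTMF digit '1' = 697 Hz + 1209 Hz
tone = [math.sin(2 * math.pi * 697 * t / fs) + math.sin(2 * math.pi * 1209 * t / fs)
        for t in range(n)]
p_697 = goertzel_power(tone, 697, fs)   # tone present: large power
p_941 = goertzel_power(tone, 941, fs)   # tone absent: small power
```

A DTMF detector evaluates eight such bins per block and picks the strongest row/column pair.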

  • Identification of ARMA Speech Models Using an Effective Representation of Voice Source

    M. Shahidur RAHMAN  Tetsuya SHIMAMURA  

     
    LETTER-Speech and Hearing

    Vol: E90-D No:5  Page(s): 863-867

    A two-stage least-squares identification method is proposed for estimating ARMA (autoregressive moving average) coefficients from speech signals. A pulse-train-like input sequence is often employed to account for source effects when estimating vocal tract parameters of voiced speech. Because of glottal and radiation effects, however, the pulse train does not represent the effective voice source. We have previously proposed a simple but effective voice-source model for estimating AR (autoregressive) coefficients. This letter extends that approach to ARMA analysis, covering a wider variety of speech sounds including nasal vowels and consonants. Analysis results on both synthetic and natural nasal speech are presented to demonstrate the ability of the method.

  • Delay Distribution of Data Calls in Integrated Voice/Data CDMA Systems

    Insoo KOO  Jeongrok YANG  Kiseon KIM  

     
    LETTER-Wireless Communication Technologies

    Vol: E90-B No:3  Page(s): 668-671

    In this letter, we present a procedure to analyze the delay distribution of data traffic in CDMA systems supporting voice and delay-tolerant data services with a finite buffer. Queueing delay-tolerant traffic in a buffer can improve system utilization and the availability of system resources. Under the first-come, first-served (FCFS) discipline, we present a numerical procedure, based on a two-dimensional Markov model, for computing the delay distribution, defined as the probability that a new data call gets service within the maximum tolerable delay.

  • Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training

    Junichi YAMAGISHI  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

    Vol: E90-D No:2  Page(s): 533-543

    In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. For simultaneous adaptation of spectrum, F0 and phone duration within the HMM framework, we need to transform not only the state output distributions corresponding to spectrum and F0 but also the duration distributions corresponding to phone duration. However, it is not straightforward to adapt the state duration because the original HMM does not have explicit duration distributions. Therefore, we utilize the framework of the hidden semi-Markov model (HSMM), which is an HMM having explicit state duration distributions, and we apply an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system, and show their effectiveness from the results of subjective and objective evaluation tests.

  • Two-Band Excitation for HMM-Based Speech Synthesis

    Sang-Jin KIM  Minsoo HAHN  

     
    LETTER-Speech and Hearing

    Vol: E90-D No:1  Page(s): 378-381

    This letter describes a two-band excitation model for HMM-based speech synthesis. The HMM-based speech synthesis system generates speech from HMM training data of spectral and excitation parameters. Synthesized speech has the typical "vocoded" quality, mostly because of the simple excitation model with binary voiced/unvoiced selection. In this letter, a two-band excitation based on the harmonic-plus-noise speech model is proposed for generating the mixed excitation source. With this model, we can generate the mixed excitation more accurately and also reduce the memory required for the trained excitation data.
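The two-band idea, a harmonic source below a voicing cutoff plus noise above it, can be caricatured in a few lines. The crude first-difference high-pass and all parameter values are illustrative assumptions, not the letter's model.

```python
import math
import random

def two_band_excitation(f0, cutoff, fs, n, noise_gain=0.3, seed=0):
    """Toy two-band excitation: harmonics of f0 below `cutoff` model the
    voiced band; first-differenced white noise (a crude high-pass)
    models the unvoiced band above it."""
    rng = random.Random(seed)
    n_harm = int(cutoff // f0)
    # Voiced band: equal-amplitude harmonics up to the cutoff frequency
    harm = [sum(math.sin(2 * math.pi * f0 * k * t / fs)
                for k in range(1, n_harm + 1)) / max(n_harm, 1)
            for t in range(n)]
    # Unvoiced band: white noise pushed toward high frequencies
    white = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    hiss = [noise_gain * (white[i + 1] - white[i]) for i in range(n)]
    return [h + z for h, z in zip(harm, hiss)]

exc = two_band_excitation(f0=120.0, cutoff=1000.0, fs=8000, n=400)
```

In an actual synthesizer the cutoff (voicing boundary) would be estimated per frame rather than fixed.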

  • Hybrid Voice Conversion of Unit Selection and Generation Using Prosody Dependent HMM

    Tadashi OKUBO  Ryo MOCHIZUKI  Tetsunori KOBAYASHI  

     
    PAPER-Speech and Hearing

    Vol: E89-D No:11  Page(s): 2775-2782

    We propose a hybrid voice conversion method which combines HMM-based unit selection and spectrum generation. In the proposed method, HMM-based unit selection selects the most likely unit for the required phoneme context from the target speaker's corpus when candidate target units exist in the corpus. Unit selection is performed based on the sequence of spectral probability distributions obtained from the adapted HMMs. When a target unit does not exist in the corpus, a target waveform is instead generated from the adapted HMM sequence by maximizing the spectral likelihood. The proposed method also employs an HMM in which the spectral probability distribution is adjusted to the target prosody using a weight defined by the prosodic probability of each distribution. To show the effectiveness of the proposed method, sound quality and speaker individuality tests were conducted. The results revealed that the proposed method produces high-quality speech whose individuality is closer to the target speaker than that of conventional methods.

  • Analysis and Synthesis of Emotional Voice Based on Time-Frequency Pitch Distributions

    Mamoru KOBAYASHI  Shigeo WADA  

     
    PAPER

    Vol: E89-A No:8  Page(s): 2100-2106

    In this paper, analysis and synthesis methods for emotional voice are developed for natural man-machine interfaces. First, emotional voice (neutral, anger, sadness, joy, dislike) is analyzed using a time-frequency representation of speech and similarity analysis. Then, based on the result of the emotional analysis, a voice with neutral emotion is transformed to synthesize a particular emotional voice using time-frequency modifications. In the simulations, five types of emotion are analyzed using 50 samples of speech signals, and a high average discrimination rate is achieved in the similarity analysis. Further, the synthesized emotional voice is evaluated subjectively, confirming that emotional voice is generated naturally by the proposed time-frequency-based approach.

  • Adaptive Voice Smoothing with Optimal E-Model Method for VoIP Services

    Shyh-Fang HUANG  Pao-Chi CHANG  Eric Hsiao-kuang WU  

     
    PAPER-Multimedia Systems for Communications

    Vol: E89-B No:6  Page(s): 1862-1868

    VoIP, one of the emerging technologies, offers high-quality real-time voice services over IP-based broadband networks. However, voice quality is easily degraded by IP network impairments such as delay, jitter, and packet loss, which has motivated new technologies to address these problems. Among them, a playout buffer at the receiving end can compensate for jitter by trading off delay against loss. Adaptive smoothing algorithms dynamically adjust the smoothing size by introducing a variable delay based on network parameters, so as to avoid quality decay. This paper introduces an efficient and feasible perceived-quality method for buffer optimization to achieve the best voice quality. The work formulates an online loss model which incorporates buffer sizes and applies the ITU-T E-model approach to optimize the delay-loss tradeoff. Distinct from other optimal smoothers, the proposed smoother can be applied to most codecs and carries the lowest complexity. Since the adaptive smoothing scheme introduces variable playback delays, buffer re-synchronization between capture and playback becomes essential. The work also presents a buffer re-synchronization algorithm based on silence skipping to prevent an unacceptable increase in the buffer preloading delay and even buffer overflow. Simulation experiments validate that the proposed adaptive smoother achieves significant improvement in voice quality.
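The ITU-T E-model referred to above reduces delay and loss to a scalar rating R. A simplified version of the standard formulas (a G.107-style delay impairment and a G.113-style loss impairment) shows how a smoother could score candidate buffer operating points; the candidate (delay, loss) values and the default Bpl are hypothetical, and the paper's own loss model is richer than this.

```python
def r_factor(delay_ms, loss_pct, ie=0.0, bpl=25.1):
    """Simplified E-model rating R = 93.2 - Id - Ie_eff.
    Id is the common delay-impairment approximation (extra penalty past
    177.3 ms); Ie_eff follows the G.113-style packet-loss formula with
    codec loss-robustness factor Bpl (values here are illustrative)."""
    i_d = 0.024 * delay_ms
    if delay_ms > 177.3:
        i_d += 0.11 * (delay_ms - 177.3)
    ie_eff = ie + (95.0 - ie) * loss_pct / (loss_pct + bpl)
    return 93.2 - i_d - ie_eff

def mos_from_r(r):
    """Standard G.107 conversion from rating R to estimated MOS."""
    r = max(0.0, min(100.0, r))
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

# A smoother can pick the (delay, loss) operating point with the best R:
# larger buffers add delay but recover more late packets.
candidates = [(60.0, 2.0), (120.0, 0.5), (250.0, 0.1)]  # hypothetical buffer settings
best = max(candidates, key=lambda dl: r_factor(*dl))
```

Here the middle setting wins: the first loses too many packets, the third pays too much delay penalty.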

  • Statistical Model-Based VAD Algorithm with Wavelet Transform

    Yoon-Chang LEE  Sang-Sik AHN  

     
    PAPER

    Vol: E89-A No:6  Page(s): 1594-1600

    This paper presents a new statistical model-based voice activity detection (VAD) algorithm in the wavelet domain to improve performance in non-stationary environments. Owing to their efficient time-frequency localization and multi-resolution characteristics, wavelet representations are well suited to processing non-stationary signals such as speech. Exploiting the fact that the wavelet packet is a very efficient approximation of the discrete Fourier transform with built-in de-noising capability, we first apply wavelet packet decomposition to localize the energy in frequency space, use spectral subtraction, and employ matched filtering to enhance the SNR. Since conventional wavelet-based spectral subtraction eliminates low-power speech in onset and offset regions and generates musical noise, we derive an improved multi-band spectral subtraction. Furthermore, since a fixed threshold cannot follow fluctuations of time-varying noise power, and the inability to adapt to a time-varying environment severely limits VAD performance, we propose a statistical model-based VAD algorithm in the wavelet domain with an adaptive threshold. Extensive computer simulations and comparisons with conventional algorithms demonstrate the performance improvement of the proposed algorithm under various noise environments.
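The spectral-subtraction step can be sketched in its basic power-subtraction form with an over-subtraction factor and a spectral floor; the paper's improved multi-band variant differs (e.g. per-band subtraction factors), so treat this as the conventional baseline scheme only.

```python
import math

def spectral_subtract(mag, noise_mag, alpha=2.0, beta=0.01):
    """Power spectral subtraction on one frame's magnitude spectrum, with
    over-subtraction factor alpha and spectral floor beta. The floor keeps
    bins from going to zero, limiting musical-noise artifacts; a multi-band
    variant would use a different alpha per subband."""
    out = []
    for m, n in zip(mag, noise_mag):
        s = m * m - alpha * n * n                    # subtract scaled noise power
        out.append(math.sqrt(max(s, beta * n * n)))  # floor, return to magnitude
    return out

# Bin 0 clearly exceeds the noise; bin 1 is noise-level and gets floored
enhanced = spectral_subtract([2.0, 1.0], [1.0, 1.0])
```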

  • Applying Elliptical Basis Function Neural Networks to VAD for Wireless Communication Systems

    Hosun LEE  Sukyung KIM  Sungkwon PARK  

     
    LETTER-Fundamental Theories for Communications

    Vol: E89-B No:4  Page(s): 1423-1424

    Voice activity detection (VAD) determines whether a slice of waveform is voice or silence. The proposed VAD algorithm, applying Elliptical Basis Function (EBF) neural networks, uses k-means clustering and the least-mean-square algorithm for its updates. The error rates achieved by the EBF network are superior to those of G.729 Annex B and RBF networks.

  • Speech Analysis Based on Modeling the Effective Voice Source

    M. Shahidur RAHMAN  Tetsuya SHIMAMURA  

     
    PAPER-Speech Analysis

    Vol: E89-D No:3  Page(s): 1107-1115

    A new system-identification-based method is proposed for accurate estimation of vocal tract parameters. An often-encountered problem in conventional linear prediction analysis arises from the harmonic structure of the excitation source of voiced speech. This harmonic characteristic couples into the estimation of the autoregressive (AR) coefficients, making the vocal tract filter difficult to estimate. This paper models the effective voice source from the residual obtained through covariance analysis in a first pass, which is then used as the input to a second-pass least-squares analysis, achieving a better source-filter separation. The formant frequencies and corresponding bandwidths obtained with the proposed method for synthetic vowels are more than three times as accurate (in percent error) as those of the conventional method. Since the source characteristic is taken into account, local variations due to the positioning of the analysis window are reduced significantly. The validity of the proposed method is also examined by inspecting spectra obtained from natural vowel sounds uttered by a high-pitched female speaker.

  • Generating F0 Contours by Statistical Manipulation of Natural F0 Shapes

    Takashi SAITO  

     
    PAPER-Speech Analysis

    Vol: E89-D No:3  Page(s): 1100-1106

    This paper describes a method of generating F0 contours from natural F0 segmental shapes for speech synthesis. The extracted shapes of the F0 units are basically held invariant by eliminating any averaging operations in the analysis phase and by minimizing modification operations in the synthesis phase. The use of natural F0 shapes has great potential to cover a wide variety of speaking styles with the same framework, including not only read-aloud speech, but also dialogues and emotional speech. A linear-regression statistical model is used to "manipulate" the stored raw F0 shapes to build them up into a sentential F0 contour. Through experimental evaluations, the proposed model is shown to provide stable and robust F0 contour prediction for various speakers. By using this model, linguistically derived information about a sentence can be directly mapped, in a purely data-driven manner, to acoustic F0 values of the sentential intonation contour for a given target speaker.
