
Keyword Search Result

[Keyword] HMM(88hit)

1-20hit(88hit)

  • Effectiveness of Speech Mode Adaptation for Improving Dialogue Speech Synthesis

    Kazuki KAYA  Hiroki MORI  

     
    LETTER-Speech and Hearing

      Publicized:
    2019/06/13
      Vol:
    E102-D No:10
      Page(s):
    2064-2066

    The effectiveness of model adaptation in dialogue speech synthesis is explored. The proposed adaptation method is based on a conversion from a base model learned with a large dataset into a target, dialogue-style speech model. The proposed method is shown to improve the intelligibility of synthesized dialogue speech, while maintaining the speaking style of dialogue.

  • Prosody Correction Preserving Speaker Individuality for Chinese-Accented Japanese HMM-Based Text-to-Speech Synthesis Open Access

    Daiki SEKIZAWA  Shinnosuke TAKAMICHI  Hiroshi SARUWATARI  

     
    LETTER-Speech and Hearing

      Publicized:
    2019/03/11
      Vol:
    E102-D No:6
      Page(s):
    1218-1221

    This article proposes a prosody correction method based on partial model adaptation for Chinese-accented Japanese hidden Markov model (HMM)-based text-to-speech synthesis. Although text-to-speech synthesis built from non-native speech accurately reproduces the speaker's individuality in synthetic speech, the naturalness of the synthetic speech is strongly degraded. In the proposed model, to improve the naturalness while preserving the speaker individuality of Chinese-accented Japanese text-to-speech synthesis, we partially utilize HMM parameters of native Japanese speech to synthesize prosody-corrected synthetic speech. Results of an experimental evaluation demonstrate that duration and F0 correction are significantly effective for improving naturalness.

  • Text-Independent Online Writer Identification Using Hidden Markov Models

    Yabei WU  Huanzhang LU  Zhiyong ZHANG  

     
    PAPER-Human-computer Interaction

      Publicized:
    2016/11/02
      Vol:
    E100-D No:2
      Page(s):
    332-339

    In text-independent online writer identification, the Gaussian Mixture Model (GMM) writer model trained with the GMM-Universal Background Model (GMM-UBM) framework has achieved excellent performance. However, the system assumes that the items in the observation sequence are independent, which neglects the dynamic information between observations. This work shows that, even in the text-independent setting, the dynamic information between observations is still important for writer identification. To extend the GMM-UBM system to use this dynamic information, a hidden Markov model (HMM) with a Gaussian observation model is used to model each writer's handwriting, and a new training scheme is proposed. In particular, the observation model parameters of the writer-specific HMM are set to the Gaussian component parameters of the GMM writer model trained with the GMM-UBM framework, and the state transition matrix parameters are learned from the writer-specific data. Experiments show that incorporating the dynamic information improves the performance of the GMM-based system and that the proposed training method is effective for learning the HMM writer model.
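
    The relation between a GMM and an HMM that reuses its Gaussian components can be illustrated with a small Python sketch. It is not the authors' implementation: the one-dimensional Gaussians, the function names, and the numeric values in the usage note are assumptions made for illustration. The sketch scores a sequence with the forward algorithm and shows that when every row of the transition matrix equals the mixture weights, the HMM reduces to the GMM's independence assumption, so any other transition matrix is what carries the dynamic information.

```python
import math


def gauss_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian (illustrative observation model)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def gmm_loglik(obs, weights, means, variances):
    """Log-likelihood of a sequence under a GMM (observations independent)."""
    total = 0.0
    for x in obs:
        total += logsumexp([
            math.log(w) + gauss_logpdf(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ])
    return total


def hmm_loglik(obs, init, trans, means, variances):
    """Log-likelihood of a sequence under an HMM via the forward algorithm.

    Each state j emits from the Gaussian (means[j], variances[j]);
    trans[i][j] is the probability of moving from state i to state j.
    """
    n = len(init)
    alpha = [math.log(init[j]) + gauss_logpdf(obs[0], means[j], variances[j])
             for j in range(n)]
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + math.log(trans[i][j]) for i in range(n)])
            + gauss_logpdf(x, means[j], variances[j])
            for j in range(n)
        ]
    return logsumexp(alpha)
```

    For example, with weights [0.3, 0.7] used both as the initial distribution and as every row of the transition matrix, hmm_loglik and gmm_loglik return the same value for any sequence; replacing the rows with a "sticky" matrix such as [[0.9, 0.1], [0.1, 0.9]] changes the score because the order of the observations now matters.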

  • Non-Native Text-to-Speech Preserving Speaker Individuality Based on Partial Correction of Prosodic and Phonetic Characteristics

    Yuji OSHIMA  Shinnosuke TAKAMICHI  Tomoki TODA  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

      Publicized:
    2016/08/30
      Vol:
    E99-D No:12
      Page(s):
    3132-3139

    This paper presents a novel non-native speech synthesis technique that preserves the individuality of a non-native speaker. Cross-lingual speech synthesis based on voice conversion or Hidden Markov Model (HMM)-based speech synthesis is a technique to synthesize foreign language speech using a target speaker's natural speech uttered in his/her mother tongue. Although the technique holds promise to improve a wide variety of applications, it tends to cause degradation of the target speaker's individuality in synthetic speech compared to intra-lingual speech synthesis. This paper proposes a new approach to speech synthesis that preserves speaker individuality by using non-native speech spoken by the target speaker. Although the use of non-native speech makes it possible to preserve the speaker individuality in the synthesized target speech, naturalness is significantly degraded as the synthesized speech waveform is directly affected by unnatural prosody and pronunciation often caused by differences in the linguistic systems of the source and target languages. To improve naturalness while preserving speaker individuality, we propose (1) a prosody correction method based on model adaptation, and (2) a phonetic correction method based on spectrum replacement for unvoiced consonants. The experimental results using English speech uttered by native Japanese speakers demonstrate that (1) the proposed methods are capable of significantly improving naturalness while preserving the speaker individuality in synthetic speech, and (2) the proposed methods also improve intelligibility as confirmed by a dictation test.

  • Speaker Adaptive Training Localizing Speaker Modules in DNN for Hybrid DNN-HMM Speech Recognizers

    Tsubasa OCHIAI  Shigeki MATSUDA  Hideyuki WATANABE  Xugang LU  Chiori HORI  Hisashi KAWAI  Shigeru KATAGIRI  

     
    PAPER-Acoustic modeling

      Publicized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2431-2443

    Among various training concepts for speaker adaptation, Speaker Adaptive Training (SAT) has been successfully applied to a standard Hidden Markov Model (HMM) speech recognizer, whose state is associated with Gaussian Mixture Models (GMMs). On the other hand, focusing on the high discriminative power of Deep Neural Networks (DNNs), a new type of speech recognizer structure, which combines DNNs and HMMs, has been vigorously investigated in the speaker adaptation research field. Along these two lines, it is natural to conceive of further improvement to a DNN-HMM recognizer by employing the training concept of SAT. In this paper, we propose a novel speaker adaptation scheme that applies SAT to a DNN-HMM recognizer. Our SAT scheme allocates a Speaker Dependent (SD) module to one of the intermediate layers of DNN, treats its remaining layers as a Speaker Independent (SI) module, and jointly trains the SD and SI modules while switching the SD module in a speaker-by-speaker manner. We implement the scheme using a DNN-HMM recognizer, whose DNN has seven layers, and elaborate its utility over TED Talks corpus data. Our experimental results show that in the supervised adaptation scenario, our Speaker-Adapted (SA) SAT-based recognizer reduces the word error rate of the baseline SI recognizer and the lowest word error rate of the SA SI recognizer by 8.4% and 0.7%, respectively, and by 6.4% and 0.6% in the unsupervised adaptation scenario. The error reductions gained by our SA-SAT-based recognizers proved to be significant by statistical testing. The results also show that our SAT-based adaptation outperforms, regardless of the SD module layer selection, its counterpart SI-based adaptation, and that the inner layers of DNN seem more suitable for SD module allocation than the outer layers.

  • Investigation of DNN-Based Audio-Visual Speech Recognition

    Satoshi TAMURA  Hiroshi NINOMIYA  Norihide KITAOKA  Shin OSUGA  Yurie IRIBE  Kazuya TAKEDA  Satoru HAYAMIZU  

     
    PAPER-Acoustic modeling

      Publicized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2444-2451

    Audio-Visual Speech Recognition (AVSR) is one of the techniques for enhancing the robustness of speech recognizers in noisy or real environments. Meanwhile, Deep Neural Networks (DNNs) have recently attracted a great deal of attention from researchers in the speech recognition field, because recognition performance can be drastically improved by using DNNs. There are two ways to employ DNN techniques for speech recognition: a hybrid approach and a tandem approach. In the hybrid approach, an emission probability on each Hidden Markov Model (HMM) state is computed using a DNN, while in the tandem approach a DNN is incorporated into a feature extraction scheme. In this paper, we investigate and compare several DNN-based AVSR methods, mainly to clarify how audio and visual modalities should be incorporated using DNNs. We carried out recognition experiments using the CENSREC-1-AV corpus, and we discuss the results to find the best DNN-based AVSR modeling. The results show that a tandem-based method using audio Deep Bottle-Neck Features (DBNFs) and visual ones with multi-stream HMMs is the most suitable, followed by a hybrid approach and another tandem scheme using audio-visual DBNFs.

  • Unsupervised Learning of Continuous Density HMM for Variable-Length Spoken Unit Discovery

    Meng SUN  Hugo VAN HAMME  Yimin WANG  Xiongwei ZHANG  

     
    LETTER-Speech and Hearing

      Publicized:
    2015/10/21
      Vol:
    E99-D No:1
      Page(s):
    296-299

    Unsupervised spoken unit discovery or zero-source speech recognition is an emerging research topic which is important for spoken document analysis of languages or dialects with little human annotation. In this paper, we extend our earlier joint training framework for unsupervised learning of discrete density HMM to continuous density HMM (CDHMM) and apply it to spoken unit discovery. In the proposed recipe, we first cluster a group of Gaussians which then act as initializations to the joint training framework of nonnegative matrix factorization and semi-continuous density HMM (SCDHMM). In SCDHMM, all the hidden states share the same group of Gaussians but with different mixture weights. A CDHMM is subsequently constructed by tying the top-N activated Gaussians to each hidden state. Baum-Welch training is finally conducted to update the parameters of the Gaussians, mixture weights and HMM transition probabilities. Experiments were conducted on word discovery from TIDIGITS and phone discovery from TIMIT. For TIDIGITS, units were modeled by 10 states which turn out to be strongly related to words; while for TIMIT, units were modeled by 3 states which are likely to be phonemes.

  • F0 Parameterization of Glottalized Tones in HMM-Based Speech Synthesis for Hanoi Vietnamese

    Duy Khanh NINH  Yoichi YAMASHITA  

     
    PAPER-Speech and Hearing

      Publicized:
    2015/09/07
      Vol:
    E98-D No:12
      Page(s):
    2280-2289

    A conventional HMM-based speech synthesis system for Hanoi Vietnamese often suffers from hoarse quality due to incomplete F0 parameterization of glottalized tones. Since estimating F0 from a glottalized waveform is problematic for conventional F0 extractors, we propose a pitch marking algorithm where pitch marks are propagated from regular regions of a speech signal to glottalized ones, from which complete F0 contours for the glottalized tones are derived. Both objective and listening tests confirmed that the proposed F0 parameterization scheme significantly reduces the hoarseness while slightly improving the tone naturalness of synthetic speech. The pitch marking algorithm works as a refinement step based on the results of an F0 extractor. Therefore, the proposed scheme can be combined with any F0 extractor.

  • Image Recognition Based on Separable Lattice Trajectory 2-D HMMs

    Akira TAMAMORI  Yoshihiko NANKAKU  Keiichi TOKUDA  

     
    PAPER-Pattern Recognition

      Vol:
    E97-D No:7
      Page(s):
    1842-1854

    In this paper, a novel statistical model based on 2-D HMMs for image recognition is proposed. Recently, separable lattice 2-D HMMs (SL2D-HMMs) were proposed to model invariance to size and location deformation. However, their modeling accuracy is still insufficient because of the following two assumptions, which are inherited from 1-D HMMs: i) the stationary statistics within each state and ii) the conditional independence assumption of state output probabilities. To overcome these shortcomings in 1-D HMMs, trajectory HMMs were proposed and successfully applied to speech recognition and speech synthesis. This paper derives 2-D trajectory HMMs by reformulating the likelihood of SL2D-HMMs through the imposition of explicit relationships between static and dynamic features. The proposed model can efficiently capture dependencies between adjacent observations without increasing the number of model parameters. The effectiveness of the proposed model was evaluated in face recognition experiments on the XM2VTS database.

  • Integration of Spectral Feature Extraction and Modeling for HMM-Based Speech Synthesis

    Kazuhiro NAKAMURA  Kei HASHIMOTO  Yoshihiko NANKAKU  Keiichi TOKUDA  

     
    PAPER-HMM-based Speech Synthesis

      Vol:
    E97-D No:6
      Page(s):
    1438-1448

    This paper proposes a novel approach for integrating spectral feature extraction and acoustic modeling in hidden Markov model (HMM) based speech synthesis. The statistical modeling process of speech waveforms is typically divided into two component modules: the frame-by-frame feature extraction module and the acoustic modeling module. In the feature extraction module, the statistical mel-cepstral analysis technique has been used and the objective function is the likelihood of mel-cepstral coefficients for given speech waveforms. In the acoustic modeling module, the objective function is the likelihood of model parameters for given mel-cepstral coefficients. It is important to improve the performance of each component module for achieving higher quality synthesized speech. However, the final objective of speech synthesis systems is to generate natural speech waveforms from given texts, and the improvement of each component module does not always lead to the improvement of the quality of synthesized speech. Therefore, ideally all objective functions should be optimized based on an integrated criterion which well represents subjective speech quality of human perception. In this paper, we propose an approach to model speech waveforms directly and optimize the final objective function. Experimental results show that the proposed method outperformed the conventional methods in objective and subjective measures.

  • Improving Naturalness of HMM-Based TTS Trained with Limited Data by Temporal Decomposition

    Trung-Nghia PHUNG  Thanh-Son PHAN  Thang Tat VU  Mai Chi LUONG  Masato AKAGI  

     
    PAPER-Speech and Hearing

      Vol:
    E96-D No:11
      Page(s):
    2417-2426

    The most important advantage of HMM-based TTS is its high intelligibility. However, speech synthesized by HMM-based TTS is muffled and far from natural, especially under limited data conditions, which is mainly caused by its over-smoothness. Therefore, the motivation for this paper is to improve the naturalness of HMM-based TTS trained under limited data conditions while preserving its intelligibility. To this end, a hybrid TTS combining HMM-based TTS and the modified restricted Temporal Decomposition (MRTD), named HTD in this paper, is proposed. Here, TD is an interpolation model that decomposes a spectral or prosodic sequence of speech into sparse event targets and dynamic event functions, and MRTD is a simplified version of TD. With a determination of event functions close to the concept of co-articulation in speech, MRTD can synthesize smooth speech, and the smoothness in synthesized speech can be adjusted by manipulating event targets of MRTD. Previous studies have also found that event functions of MRTD can represent linguistic information of speech, which is important for perceived speech intelligibility, while sparse event targets can convey the non-linguistic information, which is important for the perceived naturalness of speech. Therefore, prosodic trajectories and MRTD event functions of the spectral trajectory generated by HMM-based TTS were kept unchanged to preserve the high and stable intelligibility of HMM-based TTS. In contrast, MRTD event targets of the spectral trajectory generated by HMM-based TTS were rendered with an original speech database to enhance the naturalness of synthesized speech. Experimental results with small Vietnamese datasets revealed that the proposed HTD was equivalent to HMM-based TTS in terms of intelligibility but was superior to it in terms of naturalness. Further discussions show that HTD had a small footprint. Therefore, the proposed HTD showed its strong efficiency under limited data conditions.

  • A 168-mW 2.4×-Real-Time 60-kWord Continuous Speech Recognition Processor VLSI

    Guangji HE  Takanobu SUGAHARA  Yuki MIYAMOTO  Shintaro IZUMI  Hiroshi KAWAGUCHI  Masahiko YOSHIMOTO  

     
    PAPER

      Vol:
    E96-C No:4
      Page(s):
    444-453

    This paper describes a low-power VLSI chip for speaker-independent 60-kWord continuous speech recognition based on a context-dependent Hidden Markov Model (HMM). It features a compression-decoding scheme to reduce the external memory bandwidth for Gaussian Mixture Model (GMM) computation and multi-path Viterbi transition units. We optimize the internal SRAM size using the max-approximation GMM calculation and adjusting the number of look-ahead frames. The test chip, fabricated in 40 nm CMOS technology, occupies 1.77 mm × 2.18 mm containing 2.52 M transistors for logic and 4.29 Mbit on-chip memory. The measured results show that our implementation achieves 34.2% required frequency reduction (83.3 MHz), 48.5% power consumption reduction (74.14 mW) for 60-kWord real-time continuous speech recognition compared to the previous work while 30% of the area is saved with recognition accuracy of 90.9%. This chip can maximally process 2.4× faster than real-time at 200 MHz and 1.1 V with power consumption of 168 mW. By increasing the beam width, better recognition accuracy (91.45%) can be achieved. In that case, the power consumption for real-time processing is increased to 97.4 mW and the max-performance is decreased to 2.08× because of the increased computation workload.
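
    The max-approximation mentioned above is commonly understood as replacing the log of a sum over mixture components with the largest single term. The following Python sketch illustrates that general idea only; it says nothing about how the chip implements it, and the one-dimensional Gaussians and function names are assumptions made for illustration.

```python
import math


def gauss_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian (illustrative; real systems use vectors)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def component_terms(x, weights, means, variances):
    """Per-component log(weight) + log N(x; mean, var)."""
    return [math.log(w) + gauss_logpdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]


def gmm_loglik_exact(x, weights, means, variances):
    """Exact GMM log-likelihood: log of the sum over components."""
    terms = component_terms(x, weights, means, variances)
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def gmm_loglik_max(x, weights, means, variances):
    """Max approximation: keep only the best-scoring component."""
    return max(component_terms(x, weights, means, variances))
```

    The approximation never exceeds the exact value and underestimates it by at most log(K) for K components, which is why it is often acceptable when only the ranking of hypotheses matters. It also avoids the exponentials and the logarithm of the exact sum.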

  • Statistical Approaches to Excitation Modeling in HMM-Based Speech Synthesis

    June Sig SUNG  Doo Hwa HONG  Hyun Woo KOO  Nam Soo KIM  

     
    LETTER-Speech and Hearing

      Vol:
    E96-D No:2
      Page(s):
    379-382

    In our previous study, we proposed the waveform interpolation (WI) approach to model the excitation signals for hidden Markov model (HMM)-based speech synthesis. This letter presents several techniques to improve excitation modeling within the WI framework. We propose both the time domain and frequency domain zero padding techniques to reduce the spectral distortion inherent in the synthesized excitation signal. Furthermore, we apply non-negative matrix factorization (NMF) to obtain a low-dimensional representation of the excitation signals. From a number of experiments, including a subjective listening test, the proposed method has been found to enhance the performance of the conventional excitation modeling techniques.

  • Online Speaker Clustering Using Incremental Learning of an Ergodic Hidden Markov Model

    Takafumi KOSHINAKA  Kentaro NAGATOMO  Koichi SHINODA  

     
    PAPER-Speech and Hearing

      Vol:
    E95-D No:10
      Page(s):
    2469-2478

    A novel online speaker clustering method based on a generative model is proposed. It employs an incremental variant of variational Bayesian learning and provides probabilistic (non-deterministic) decisions for each input utterance, on the basis of the history of preceding utterances. It can be expected to be robust against errors in cluster estimation and the classification of utterances, and hence to be applicable to many real-time applications. Experimental results show that it produces 50% fewer classification errors than does a conventional online method. They also show that it is possible to reduce the number of speech recognition errors by combining the method with unsupervised speaker adaptation.

  • Outlier Detection and Removal for HMM-Based Speech Synthesis with an Insufficient Speech Database

    Doo Hwa HONG  June Sig SUNG  Kyung Hwan OH  Nam Soo KIM  

     
    LETTER-Speech and Hearing

      Vol:
    E95-D No:9
      Page(s):
    2351-2354

    Decision tree-based clustering and parameter estimation are essential steps in the training part of an HMM-based speech synthesis system. These two steps are usually performed based on the maximum likelihood (ML) criterion. However, one of the drawbacks of the ML criterion is that it is sensitive to outliers which usually result in quality degradation of the synthesized speech. In this letter, we propose an approach to detect and remove outliers for HMM-based speech synthesis. Experimental results show that the proposed approach can improve the synthetic speech, particularly when the available training speech database is insufficient.

  • Selective Gammatone Envelope Feature for Robust Sound Event Recognition

    Yi Ren LENG  Huy Dat TRAN  Norihide KITAOKA  Haizhou LI  

     
    PAPER-Audio Processing

      Vol:
    E95-D No:5
      Page(s):
    1229-1237

    Conventional features for Automatic Speech Recognition and Sound Event Recognition such as Mel-Frequency Cepstral Coefficients (MFCCs) have been shown to perform poorly in noisy conditions. We introduce an auditory feature based on the gammatone filterbank, the Selective Gammatone Envelope Feature (SGEF), for Robust Sound Event Recognition, where channel selection and the filterbank envelope are used to reduce the effect of noise for specific noise environments. In experiments with Hidden Markov Model (HMM) recognizers, we show that our feature outperforms MFCCs significantly in four different noisy environments at various signal-to-noise ratios.

  • A VLSI Architecture with Multiple Fast Store-Based Block Parallel Processing for Output Probability and Likelihood Score Computations in HMM-Based Isolated Word Recognition

    Kazuhiro NAKAMURA  Ryo SHIMAZAKI  Masatoshi YAMAMOTO  Kazuyoshi TAKAGI  Naofumi TAKAGI  

     
    PAPER

      Vol:
    E95-C No:4
      Page(s):
    456-467

    This paper presents a memory-efficient VLSI architecture for output probability computations (OPCs) of continuous hidden Markov models (HMMs) and likelihood score computations (LSCs). These computations are the most time-consuming part of HMM-based isolated word recognition systems. We demonstrate multiple fast store-based block parallel processing (MultipleFastStoreBPP) for OPCs and LSCs and present a VLSI architecture that supports it. Compared with conventional fast store-based block parallel processing (FastStoreBPP) and stream-based block parallel processing (StreamBPP) architectures, the proposed architecture requires fewer registers and less processing time. The processing elements (PEs) used in the FastStoreBPP and StreamBPP architectures are identical to those used in the MultipleFastStoreBPP architecture. From a VLSI architectural viewpoint, a comparison shows that the proposed architecture is an improvement over the others, through efficient use of PEs and registers for storing input feature vectors.

  • Enhancing Eigenspace-Based MLLR Speaker Adaptation Using a Fuzzy Logic Learning Control Scheme

    Ing-Jr DING  

     
    PAPER

      Vol:
    E94-D No:10
      Page(s):
    1909-1916

    This study develops a fuzzy logic control mechanism in eigenspace-based MLLR speaker adaptation. Specifically, this mechanism can determine hidden Markov model parameters to enhance overall recognition performance under both ordinary and adverse conditions in the training and operating stages. The proposed mechanism regulates the influence of eigenspace-based MLLR adaptation given insufficient training data from a new speaker. This mechanism accounts for the amount of adaptation data available in transformation matrix parameter smoothing, and thus ensures the robustness of eigenspace-based MLLR adaptation against data scarcity. The proposed adaptive learning mechanism is computationally inexpensive. Experimental results show that eigenspace-based MLLR adaptation with fuzzy control outperforms conventional eigenspace-based MLLR, especially when the adaptation data acquired from a new speaker is insufficient.

  • VLSI Architecture of GMM Processing and Viterbi Decoder for 60,000-Word Real-Time Continuous Speech Recognition

    Hiroki NOGUCHI  Kazuo MIURA  Tsuyoshi FUJINAGA  Takanobu SUGAHARA  Hiroshi KAWAGUCHI  Masahiko YOSHIMOTO  

     
    PAPER

      Vol:
    E94-C No:4
      Page(s):
    458-467

    We propose a low-memory-bandwidth, high-efficiency VLSI architecture for 60-k word real-time continuous speech recognition. Our architecture includes a cache architecture using the locality of speech recognition, beam pruning using a dynamic threshold, two-stage language model searching, a parallel Gaussian Mixture Model (GMM) architecture based on the mixture level and frame level, a parallel Viterbi architecture, and pipeline operation between Viterbi transition and GMM processing. Results show that our architecture achieves 88.24% required frequency reduction (66.74 MHz) and 84.04% memory bandwidth reduction (549.91 MB/s) for real-time 60-k word continuous speech recognition.
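
    The abstract does not say how its dynamic threshold is computed. As a rough illustration of beam pruning in general, the Python sketch below assumes one common form, in which the threshold is recomputed for each frame as the best hypothesis score minus a beam width. The function name, the dictionary representation of hypotheses, and the scores in the usage note are hypothetical.

```python
def beam_prune(hypotheses, beam_width):
    """Keep hypotheses whose log score is within beam_width of the best one.

    hypotheses maps a state (or path) identifier to its log score for the
    current frame. The threshold moves with the best score, so it changes
    from frame to frame (an assumed reading of "dynamic threshold").
    """
    if not hypotheses:
        return {}
    threshold = max(hypotheses.values()) - beam_width
    return {state: score
            for state, score in hypotheses.items()
            if score >= threshold}
```

    For example, pruning {"a": -1.0, "b": -3.5, "c": -10.0} with a beam width of 5.0 keeps "a" and "b" and drops "c". A wider beam keeps more hypotheses alive, which raises the computation per frame in exchange for fewer search errors.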

  • Bayesian Context Clustering Using Cross Validation for Speech Recognition

    Kei HASHIMOTO  Heiga ZEN  Yoshihiko NANKAKU  Akinobu LEE  Keiichi TOKUDA  

     
    PAPER-Speech and Hearing

      Vol:
    E94-D No:3
      Page(s):
    668-678

    This paper proposes Bayesian context clustering using cross validation for hidden Markov model (HMM) based speech recognition. The Bayesian approach is a statistical technique for estimating reliable predictive distributions by treating model parameters as random variables. The variational Bayesian method, which is widely used as an efficient approximation of the Bayesian approach, has been applied to HMM-based speech recognition, and it shows good performance. Moreover, the Bayesian approach can select an appropriate model structure while taking account of the amount of training data. Since prior distributions which represent prior information about model parameters affect estimation of the posterior distributions and selection of model structure (e.g., decision tree based context clustering), the determination of prior distributions is an important problem. However, it has not been thoroughly investigated in speech recognition, and the determination technique of prior distributions has not performed well. The proposed method can determine reliable prior distributions without any tuning parameters and select an appropriate model structure while taking account of the amount of training data. Continuous phoneme recognition experiments show that the proposed method achieved a higher performance than the conventional methods.
