Author Search Result

[Author] Tsuneo NITTA (11 hits)

1-11 of 11 hits
  • Mapping Articulatory-Features to Vocal-Tract Parameters for Voice Conversion

    Narpendyah Wisjnu ARIWARDHANI  Masashi KIMURA  Yurie IRIBE  Kouichi KATSURADA  Tsuneo NITTA  

     
    PAPER-Speech and Hearing

    Vol: E97-D No:4
    Page(s): 911-918

    In this paper, we propose voice conversion (VC) based on articulatory-feature (AF) to vocal-tract-parameter (VTP) mapping. An artificial neural network (ANN) is applied to map AF to VTP and to convert a speaker's voice to a target-speaker's voice. The proposed system is not only text-independent, requiring no parallel utterances between the source and target speakers, but can also be used for an arbitrary source speaker; that is, our approach does not require source-speaker data to build the VC model. We also focus on the case of a small amount of target-speaker training data. For comparison, a baseline system based on the Gaussian mixture model (GMM) approach is evaluated. The experimental results with a small amount of training data show that the voice converted by our approach is intelligible and retains the speaker individuality of the target speaker.
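
    A minimal sketch of the AF-to-VTP mapping idea, assuming a one-hidden-layer feedforward network and illustrative feature dimensions; the abstract does not specify the network configuration, so everything below is a stand-in rather than the paper's model.

```python
# Sketch: map an articulatory-feature (AF) frame to a vocal-tract-parameter (VTP)
# frame with a small feedforward ANN. Layer sizes and the single hidden layer are
# illustrative assumptions; random weights stand in for a trained model.
import numpy as np

rng = np.random.default_rng(0)
N_AF, N_HIDDEN, N_VTP = 45, 128, 30   # hypothetical dimensions

W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_AF))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.1, size=(N_VTP, N_HIDDEN))
b2 = np.zeros(N_VTP)

def af_to_vtp(af_frame: np.ndarray) -> np.ndarray:
    """Map one AF frame to a VTP frame with a one-hidden-layer ANN."""
    h = np.tanh(W1 @ af_frame + b1)
    return W2 @ h + b2                # linear output for regression

af_sequence = rng.random((200, N_AF))                 # 200 frames of AF input
vtp_sequence = np.stack([af_to_vtp(f) for f in af_sequence])
print(vtp_sequence.shape)                             # (200, 30)
```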

  • Using Reversed Sequences and Grapheme Generation Rules to Extend the Feasibility of a Phoneme Transition Network-Based Grapheme-to-Phoneme Conversion

    Seng KHEANG  Kouichi KATSURADA  Yurie IRIBE  Tsuneo NITTA  

     
    PAPER-Speech and Hearing

    Publicized: 2016/01/06
    Vol: E99-D No:4
    Page(s): 1182-1192

    The automatic transcription of out-of-vocabulary words into their corresponding phoneme strings has been widely adopted for speech synthesis and spoken-term detection systems. By combining various methods in order to meet the challenges of grapheme-to-phoneme (G2P) conversion, this paper proposes a phoneme transition network (PTN)-based architecture for G2P conversion. The proposed method first builds a confusion network using multiple phoneme-sequence hypotheses generated by several G2P methods. It then determines the best final-output phoneme from each block of phonemes in the generated network. Moreover, in order to extend the feasibility and improve the performance of the proposed PTN-based model, we introduce a novel use of right-to-left (reversed) grapheme-phoneme sequences along with grapheme-generation rules. Both techniques are helpful not only for minimizing the number of methods or source models required in the proposed architecture but also for increasing the number of phoneme-sequence hypotheses without increasing the number of methods. The techniques therefore minimize the risk that combining accurate and inaccurate methods will degrade phoneme-prediction performance. Evaluation results using various pronunciation dictionaries show that the proposed model, when trained using the reversed grapheme-phoneme sequences, often outperformed the model trained on conventional left-to-right grapheme-phoneme sequences. In addition, the evaluation demonstrates that the proposed PTN-based method for G2P conversion is more accurate than all baseline approaches that were tested.
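
    A toy sketch of the block-wise decision over multiple G2P hypotheses. The paper builds a proper confusion network; this sketch assumes the hypotheses are already aligned to the same length, and the plain majority vote per block is an illustrative assumption.

```python
# Sketch: merge several G2P hypotheses into blocks of competing phonemes and
# choose one phoneme per block by majority vote (simplified stand-in for the
# PTN-based decision).
from collections import Counter

def ptn_decode(hypotheses):
    """hypotheses: list of equal-length phoneme lists from different G2P methods."""
    assert len({len(h) for h in hypotheses}) == 1, "sketch assumes aligned hypotheses"
    best = []
    for block in zip(*hypotheses):        # one block = competing phonemes at a position
        phoneme, _count = Counter(block).most_common(1)[0]
        best.append(phoneme)
    return best

# Hypothetical hypotheses for the word "phoneme", including one produced by a
# reversed (right-to-left) model and then re-reversed.
h1 = ["F", "OW", "N", "IY", "M"]
h2 = ["F", "AO", "N", "IY", "M"]
h3 = ["F", "OW", "N", "EH", "M"]
print(ptn_decode([h1, h2, h3]))           # ['F', 'OW', 'N', 'IY', 'M']
```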

  • Solving the Phoneme Conflict in Grapheme-to-Phoneme Conversion Using a Two-Stage Neural Network-Based Approach

    Seng KHEANG  Kouichi KATSURADA  Yurie IRIBE  Tsuneo NITTA  

     
    PAPER-Speech and Hearing

    Vol: E97-D No:4
    Page(s): 901-910

    To achieve high-quality output from speech synthesis systems, data-driven grapheme-to-phoneme (G2P) conversion is usually used to generate the phonetic transcription of out-of-vocabulary (OOV) words. To improve the performance of G2P conversion, this paper deals with the problem of conflicting phonemes, where an input grapheme can, in the same context, produce many possible output phonemes at the same time. To this end, we propose a two-stage neural network-based approach that converts the input text to phoneme sequences in the first stage and then predicts each output phoneme in the second stage using the phonemic information obtained. The first-stage neural network is fundamentally implemented as a many-to-many mapping model for the automatic conversion of words to phoneme sequences, while the second stage uses a combination of the obtained phoneme sequences to predict the output phoneme corresponding to each input grapheme in a given word. We evaluate the performance of this approach using the American English pronunciation dictionary known as the auto-aligned CMUDict corpus [1]. In terms of the phoneme and word accuracy of OOV words, comparison with several baseline approaches shows that our proposed approach improves on the previous one-stage neural network-based approach for G2P conversion. Comparison with another existing approach indicates that our approach provides higher phoneme accuracy but lower word accuracy on a general dataset, and slightly higher phoneme and word accuracy on a selection of words containing more than one phoneme conflict.
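
    A structural sketch of the two-stage data flow: stage 1 maps the grapheme sequence of a word to a phoneme sequence, and stage 2 re-predicts each phoneme from its grapheme plus the stage-1 phonemic context, so conflicting candidates for the same grapheme can be disambiguated. The lookup tables below stand in for the two trained neural networks and are purely illustrative.

```python
# Sketch of the two-stage G2P pipeline with placeholder "models".
def stage1(graphemes):
    # placeholder for the many-to-many grapheme-to-phoneme network
    first_pass = {"g": "G", "e": "EH", "m": "M"}
    return [first_pass.get(g, "?") for g in graphemes]

def stage2(grapheme, left_phone, right_phone):
    # placeholder for the second network: grapheme + phonemic context -> phoneme
    if grapheme == "g" and right_phone == "EH":
        return "JH"                      # e.g. "g" before a front vowel surfaces as /JH/
    return stage1([grapheme])[0]

word = list("gem")
pass1 = stage1(word)
pad = ["sil"] + pass1 + ["sil"]
pass2 = [stage2(g, pad[i], pad[i + 2]) for i, g in enumerate(word)]
print(pass1, pass2)                      # ['G', 'EH', 'M'] ['JH', 'EH', 'M']
```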

  • Development of TTS Card for PCs and TTS Software for WSs

    Yoshiyuki HARA  Tsuneo NITTA  Hiroyoshi SAITO  Ken'ichiro KOBAYASHI  

     
    PAPER

    Vol: E76-A No:11
    Page(s): 1999-2007

    Text-to-speech synthesis (TTS) is currently one of the most important media conversion techniques. In this paper, we describe a Japanese TTS card developed for constructing a personal-computer-based multimedia platform, and a TTS software package developed for a workstation-based multimedia platform. Some applications of this hardware and software are also discussed. The TTS consists of a linguistic processing stage for converting text into phonetic and prosodic information, and a speech processing stage for producing speech from the phonetic and prosodic symbols. The linguistic processing stage uses morphological analysis, rewriting rules for accent movement and pause insertion, and other techniques to impart correct accentuation and a natural-sounding intonation to the synthesized speech. The speech processing stage employs the cepstrum method with consonant-vowel (CV) syllables as the synthesis unit to achieve clear and smooth synthesized speech. All of the processing for converting Japanese text (consisting of mixed Japanese Kanji and Kana characters) to synthesized speech is done internally on the TTS card. This allows the card to be used widely in various applications, including electronic mail and telephone service systems without placing any processing burden on the personal computer. The TTS software was used for an E-mail reading tool on a workstation.
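
    A purely illustrative sketch of the two-stage TTS flow described above: a linguistic stage converts text into phonetic and prosodic symbols, and a speech stage turns those symbols into audio. The toy lexicon, the naive pause insertion, and the placeholder synthesis step are stand-ins; the actual card performs morphological analysis, accent/pause rewriting rules, and cepstral CV-unit synthesis.

```python
# Sketch of the linguistic-stage / speech-stage split with toy stand-ins.
def linguistic_stage(words):
    lexicon = {"mail": ("m e e r u", "accent:1"),
               "service": ("s a a b i s u", "accent:1")}
    phonetic, prosodic = [], []
    for w in words:
        phones, accent = lexicon.get(w, ("sil", "accent:0"))
        phonetic.append(phones)
        prosodic.append(accent)
        prosodic.append("pause")          # naive pause insertion after every word
    return " ".join(phonetic), prosodic

def speech_stage(phonetic, prosodic):
    # placeholder: a real system drives cepstral synthesis from CV units here
    return f"<waveform for [{phonetic}] with {prosodic}>"

print(speech_stage(*linguistic_stage(["mail", "service"])))
```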

  • Confidence Scoring for Accurate HMM-Based Speech Recognition by Using Monophone-Level Normalization Based on Subspace Method

    Muhammad GHULAM  Takaharu SATO  Takashi FUKUDA  Tsuneo NITTA  

     
    PAPER-Speech and Speaker Recognition

    Vol: E86-D No:3
    Page(s): 430-437

    In this paper, a novel confidence scoring method that is applied to the N-best hypotheses (word candidates) output from an HMM-based classifier is proposed. In the first pass of the proposed method, the HMM-based classifier with monophone models outputs the N-best hypotheses and the boundaries of all monophones in the hypotheses. In the second pass, an SM (Subspace Method)-based verifier tests the hypotheses by comparing confidence scores. To test the hypotheses, the SM-based verifier first calculates the similarity between phone vectors and an eigenvector set of the monophones, then converts this similarity score into a likelihood score with normalization of acoustic quality, and finally combines the word-level HMM-based likelihood and the monophone-level SM-based likelihood to formulate the confidence measure. Two kinds of experiments were performed to evaluate this confidence measure on speaker-independent word recognition. The results showed that the proposed confidence scoring method significantly reduced the word error rate from the 4.7% obtained by the standard HMM classifier to 2.0%, and in unknown-word rejection it reduced the equal error rate from 9.0% to 6.5%.
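
    A rough sketch of the subspace-method verification step: the similarity of a phone vector to a monophone class is its squared projection onto that class's eigenvector subspace, and the word-level confidence combines an HMM likelihood with the SM-based monophone scores. The normalization and the linear combination weight below are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: subspace-method similarity plus a combined HMM/SM confidence score.
import numpy as np

def sm_similarity(x, eigvecs):
    """eigvecs: (k, d) orthonormal basis of one monophone subspace."""
    x = x / (np.linalg.norm(x) + 1e-12)
    proj = eigvecs @ x                      # projections onto each basis vector
    return float(np.sum(proj ** 2))         # similarity in [0, 1]

def confidence(hmm_loglik, phone_vectors, eigvecs_per_phone, alpha=0.5):
    sm_scores = [np.log(sm_similarity(x, E) + 1e-12)
                 for x, E in zip(phone_vectors, eigvecs_per_phone)]
    # assumed linear combination of word-level HMM and monophone-level SM scores
    return alpha * hmm_loglik + (1 - alpha) * np.mean(sm_scores)

rng = np.random.default_rng(1)
d, k, n_phones = 24, 5, 3
bases = [np.linalg.qr(rng.normal(size=(d, k)))[0].T for _ in range(n_phones)]
phones = [rng.normal(size=d) for _ in range(n_phones)]
print(confidence(hmm_loglik=-120.0, phone_vectors=phones, eigvecs_per_phone=bases))
```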

  • Pitch-Synchronous Peak-Amplitude (PS-PA)-Based Feature Extraction Method for Noise-Robust ASR

    Muhammad GHULAM  Kouichi KATSURADA  Junsei HORIKAWA  Tsuneo NITTA  

     
    PAPER-Speech and Hearing

    Vol: E89-D No:11
    Page(s): 2766-2774

    A novel pitch-synchronous auditory-based feature extraction method for robust automatic speech recognition (ASR) is proposed. A pitch-synchronous zero-crossing peak-amplitude (PS-ZCPA)-based feature extraction method was proposed previously and showed improved performance except when modulation enhancement was integrated with Wiener filter (WF)-based noise reduction and auditory masking. However, since zero-crossing is not an auditory event, we propose a new pitch-synchronous peak-amplitude (PS-PA)-based method to render the feature extractor of ASR more auditory-like. We also examine the effects of WF-based noise reduction, modulation enhancement, and auditory masking in the proposed PS-PA method using the Aurora-2J database. The experimental results show the superiority of the proposed method over the PS-ZCPA and other conventional methods. Furthermore, the problem caused by reconstructing zero-crossings from a modulated envelope is eliminated. The experimental results also show the superiority of PS over PA in terms of ASR robustness, though PS and PA lead to significant improvement when applied together.
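
    A simplified sketch of the pitch-synchronous peak-amplitude idea: given pitch marks, the signal in each auditory channel is scanned one pitch period at a time and only the peak amplitude of each period is kept as that channel's contribution. The band signals, the number of channels, and the log compression are illustrative assumptions; the filterbank, masking, and noise reduction steps are omitted.

```python
# Sketch: per-pitch-period, per-band peak amplitudes as a feature matrix.
import numpy as np

def ps_pa_features(band_signals, pitch_marks):
    """band_signals: (n_bands, n_samples); pitch_marks: sample indices of period starts."""
    feats = []
    for start, end in zip(pitch_marks[:-1], pitch_marks[1:]):
        period = band_signals[:, start:end]
        peak = np.max(np.abs(period), axis=1)     # peak amplitude per band
        feats.append(np.log1p(peak))              # simple amplitude compression
    return np.array(feats)                        # (n_periods, n_bands)

rng = np.random.default_rng(2)
bands = rng.normal(size=(16, 1600))               # 16 hypothetical auditory bands
marks = np.arange(0, 1601, 160)                   # ~10 ms pitch periods at 16 kHz
print(ps_pa_features(bands, marks).shape)         # (10, 16)
```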

  • Canonicalization of Feature Parameters for Robust Speech Recognition Based on Distinctive Phonetic Feature (DPF) Vectors

    Mohammad NURUL HUDA  Muhammad GHULAM  Takashi FUKUDA  Kouichi KATSURADA  Tsuneo NITTA  

     
    PAPER-Feature Extraction

    Vol: E91-D No:3
    Page(s): 488-498

    This paper describes a robust automatic speech recognition (ASR) system with less computation. Acoustic models of a hidden Markov model (HMM)-based classifier include various types of hidden factors such as speaker-specific characteristics, coarticulation, and the acoustic environment. If there exists a canonicalization process that can recover the margin of acoustic likelihoods between correct phonemes and other phonemes that is degraded by these hidden factors, the robustness of ASR systems can be improved. In this paper, we introduce a canonicalization method that is composed of multiple distinctive phonetic feature (DPF) extractors, each corresponding to the canonicalization of one hidden factor, and a DPF selector which selects an optimum DPF vector as the input of the HMM-based classifier. The proposed method resolves gender factors and speaker variability, and eliminates noise factors by applying the canonicalization based on the DPF extractors and two-stage Wiener filtering. In the experiment on AURORA-2J, the proposed method provides higher word accuracy under clean training and a significant improvement of word accuracy at low signal-to-noise ratios (SNR) under multi-condition training, compared to a standard ASR system with mel-frequency cepstral coefficient (MFCC) parameters. Moreover, the proposed method requires only two-fifths as many Gaussian mixture components and less memory to achieve accurate ASR.
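
    A structural sketch of the canonicalization idea: several DPF extractors, each trained for one hidden factor (e.g. male, female, noisy speech), produce candidate DPF vectors, and a selector passes one of them to the HMM classifier. The extractors are stand-in functions, and the selection rule used here (prefer the candidate whose elements are most binary) is an illustrative assumption, not the selector described in the paper.

```python
# Sketch: factor-specific DPF extractors plus a simple candidate selector.
import numpy as np

def make_extractor(bias):
    # stand-in for a trained, factor-specific DPF extractor (15-dim DPF assumed)
    return lambda acoustic_frame: np.clip(acoustic_frame.mean() + bias
                                          + np.zeros(15), 0.0, 1.0)

extractors = {"male": make_extractor(0.1),
              "female": make_extractor(-0.1),
              "noisy": make_extractor(0.0)}

def select_dpf(acoustic_frame):
    candidates = {k: f(acoustic_frame) for k, f in extractors.items()}
    def distance_from_binary(v):              # assumed selection criterion
        return float(np.sum(np.minimum(v, 1.0 - v)))
    best = min(candidates, key=lambda k: distance_from_binary(candidates[k]))
    return best, candidates[best]

frame = np.random.default_rng(3).random(24)
print(select_dpf(frame)[0])
```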

  • PS-ZCPA Based Feature Extraction with Auditory Masking, Modulation Enhancement and Noise Reduction for Robust ASR

    Muhammad GHULAM  Takashi FUKUDA  Kouichi KATSURADA  Junsei HORIKAWA  Tsuneo NITTA  

     
    PAPER-Speech Recognition

    Vol: E89-D No:3
    Page(s): 1015-1023

    A pitch-synchronous (PS) auditory feature extraction method based on ZCPA (Zero-Crossings Peak-Amplitudes) was proposed previously and showed greater robustness than conventional ZCPA- and MFCC-based features. In this paper, firstly, a non-linear adaptive threshold adjustment procedure is introduced into the PS-ZCPA method to obtain optimal results in noisy conditions with different signal-to-noise ratios (SNR). Next, auditory masking, a well-known property of auditory perception, and modulation enhancement, which exploits the strong relationship between the modulation spectrum and the intelligibility of speech, are embedded into the PS-ZCPA method. Finally, a Wiener filter-based noise reduction procedure is integrated into the method to make it more noise-robust, and the performance is evaluated against ETSI ES202 (WI008), which is a standard front-end for distributed speech recognition. All the experiments were carried out on the Aurora-2J database. The experimental results demonstrated improved performance of the PS-ZCPA method when auditory masking was embedded into it, and slightly improved performance from modulation enhancement. The PS-ZCPA method with Wiener filter-based noise reduction also showed better performance than ETSI ES202 (WI008).
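
    A sketch of the ZCPA core that PS-ZCPA builds on: in each auditory band, upward zero-crossings are detected, the inverse of each crossing interval gives a frequency estimate, and a frequency histogram is accumulated with a weight derived from the peak amplitude between the two crossings. The toy band signal, bin edges, and log weighting are illustrative; pitch synchronization, thresholding, masking, and Wiener filtering are omitted.

```python
# Sketch: zero-crossing interval -> frequency bin, weighted by peak amplitude.
import numpy as np

def zcpa_histogram(band_signal, fs, bin_edges):
    hist = np.zeros(len(bin_edges) - 1)
    s = band_signal
    ups = np.where((s[:-1] < 0) & (s[1:] >= 0))[0]       # upward zero-crossings
    for a, b in zip(ups[:-1], ups[1:]):
        freq = fs / (b - a)                               # interval -> frequency
        peak = np.max(np.abs(s[a:b]))                     # peak amplitude in the interval
        idx = np.searchsorted(bin_edges, freq) - 1
        if 0 <= idx < len(hist):
            hist[idx] += np.log1p(peak)                   # amplitude-weighted count
    return hist

fs = 16000
t = np.arange(0, 0.02, 1 / fs)
band = np.sin(2 * np.pi * 440 * t)                        # toy band signal
edges = np.linspace(0, 4000, 17)                          # 16 frequency bins
print(zcpa_histogram(band, fs, edges))
```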

  • Distinctive Phonetic Feature (DPF) Extraction Based on MLNs and Inhibition/Enhancement Network

    Mohammad Nurul HUDA  Hiroaki KAWASHIMA  Tsuneo NITTA  

     
    PAPER-Speech and Hearing

    Vol: E92-D No:4
    Page(s): 671-680

    This paper describes a distinctive phonetic feature (DPF) extraction method for use in a phoneme recognition system; our method has a low computation cost. This method comprises three stages. The first stage uses two multilayer neural networks (MLNs): MLNLF-DPF, which maps continuous acoustic features, or local features (LFs), onto discrete DPF features, and MLNDyn, which constrains the DPF context at the phoneme boundaries. The second stage incorporates inhibition/enhancement (In/En) functionalities to discriminate whether the DPF dynamic patterns of trajectories are convex or concave, where convex patterns are enhanced and concave patterns are inhibited. The third stage decorrelates the DPF vectors using the Gram-Schmidt orthogonalization procedure before feeding them into a hidden Markov model (HMM)-based classifier. In an experiment on Japanese Newspaper Article Sentences (JNAS) utterances, the proposed feature extractor, which incorporates two MLNs and an In/En network, was found to provide a higher phoneme correct rate with fewer mixture components in the HMMs.
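
    A rough sketch of the inhibition/enhancement (In/En) idea: for each DPF element's trajectory over time, locally convex movements (peaks) are enhanced and locally concave movements (dips) are inhibited. Using the sign of the discrete second difference to detect convexity, and fixed gain factors, are illustrative assumptions about the functional form.

```python
# Sketch: enhance convex and inhibit concave points of one DPF trajectory.
import numpy as np

def in_en(trajectory, enhance=1.3, inhibit=0.7):
    """trajectory: (T,) values of one DPF element over time."""
    out = trajectory.copy()
    curv = np.diff(trajectory, n=2)                  # second difference at t = 1..T-2
    for t, c in enumerate(curv, start=1):
        if c < 0:                                    # convex (local peak) -> enhance
            out[t] *= enhance
        elif c > 0:                                  # concave (local dip) -> inhibit
            out[t] *= inhibit
    return out

traj = np.array([0.1, 0.4, 0.9, 0.5, 0.2, 0.3, 0.8])
print(np.round(in_en(traj), 2))
```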

  • Orthogonalized Distinctive Phonetic Feature Extraction for Noise-Robust Automatic Speech Recognition

    Takashi FUKUDA  Tsuneo NITTA  

     
    PAPER

    Vol: E87-D No:5
    Page(s): 1110-1118

    In this paper, we propose a noise-robust automatic speech recognition system that uses orthogonalized distinctive phonetic features (DPFs) as the input of an HMM with diagonal covariance. In the orthogonalized DPF extraction stage, a speech signal is first converted to acoustic features composed of local features (LFs) and ΔP; then a multilayer neural network (MLN) with 153 output units, composed of the context-dependent DPFs of a preceding context DPF vector, a current DPF vector, and a following context DPF vector, maps the LFs to DPFs. The Karhunen-Loeve transform (KLT) is then applied to orthogonalize each DPF vector in the context-dependent DPFs, using orthogonal bases calculated from the DPF vectors representing the 38 Japanese phonemes. The orthogonalized DPF vectors are finally decorrelated from one another using the Gram-Schmidt orthogonalization procedure. In experiments, after evaluating the parameters of the MLN input and output units in the DPF extractor, the orthogonalized DPFs are compared with the original DPFs. The orthogonalized DPFs are then evaluated in comparison with a standard parameter set of MFCCs and dynamic features. Next, noise robustness is tested using four types of additive noise. The experimental results show that the use of the proposed orthogonalized DPFs can significantly reduce the error rate in an isolated spoken-word recognition task, both with clean speech and with speech contaminated by additive noise. Furthermore, we achieved significant improvements when combining the orthogonalized DPFs with conventional static MFCCs and ΔP.
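
    A sketch of the two orthogonalization steps: a KLT basis computed from the DPF patterns of the phoneme set transforms each DPF vector, and the preceding/current/following context vectors are then decorrelated with Gram-Schmidt. The random stand-in for the 38 phoneme DPF patterns and the 15-dimensional DPF size are illustrative assumptions.

```python
# Sketch: KLT transform of DPF vectors followed by Gram-Schmidt decorrelation.
import numpy as np

rng = np.random.default_rng(4)
phoneme_dpfs = rng.random((38, 15))                  # stand-in for 38 phoneme DPF patterns

# KLT basis: eigenvectors of the covariance of the phoneme DPF patterns
cov = np.cov(phoneme_dpfs, rowvar=False)
_, klt_basis = np.linalg.eigh(cov)                   # columns are orthonormal eigenvectors

def klt(v):
    return klt_basis.T @ v

def gram_schmidt(vectors):
    """Decorrelate a list of vectors (e.g. preceding/current/following DPFs)."""
    ortho = []
    for v in vectors:
        w = v.copy()
        for u in ortho:
            w -= (w @ u) / (u @ u) * u               # remove components along earlier vectors
        ortho.append(w)
    return ortho

context = [klt(rng.random(15)) for _ in range(3)]    # preceding, current, following
prev_v, cur_v, next_v = gram_schmidt(context)
print(round(float(prev_v @ cur_v), 6))               # ~0: decorrelated
```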

  • Interaction Builder: A Rapid Prototyping Tool for Developing Web-Based MMI Applications

    Kouichi KATSURADA  Hiroaki ADACHI  Kunitoshi SATO  Hirobumi YAMADA  Tsuneo NITTA  

     
    PAPER

    Vol: E88-D No:11
    Page(s): 2461-2468

    We have developed Interaction Builder (IB), a rapid prototyping tool for constructing web-based Multi-Modal Interaction (MMI) applications. The goal of IB is to make it easy to develop MMI applications with speech recognition, life-like agents, speech synthesis, web browsing, etc. For this purpose, IB provides the following interface and functions: (1) a GUI for implementing MMI systems without knowledge of the details of MMI or the MMI description language, (2) functionality for handling synchronized multimodal inputs/outputs, and (3) a test-run mode for run-time testing. The results of evaluation tests showed that the application development cycle using IB was significantly shorter than with a text editor, both for MMI description language experts and for beginners.