
Author Search Result

[Author] Hisashi KAWAI (5 hits)

  • Development of the “VoiceTra” Multi-Lingual Speech Translation System Open Access

    Shigeki MATSUDA  Teruaki HAYASHI  Yutaka ASHIKARI  Yoshinori SHIGA  Hidenori KASHIOKA  Keiji YASUDA  Hideo OKUMA  Masao UCHIYAMA  Eiichiro SUMITA  Hisashi KAWAI  Satoshi NAKAMURA  

     
    INVITED PAPER

  Publicized:
    2017/01/13
      Vol:
    E100-D No:4
      Page(s):
    621-632

    This study introduces the large-scale field experiments of VoiceTra, the world's first speech-to-speech multilingual translation application for smartphones. Approximately 10 million input utterances have been collected since the experiments commenced, and the usage of the collected data is analyzed and discussed. The study makes several important contributions. First, it explains the system configuration, the communication protocol between clients and servers, and the details of the multilingual automatic speech recognition, multilingual machine translation, and multilingual speech synthesis subsystems. Second, it demonstrates the effects of mid-term system updates that used the collected data to improve an acoustic model, a language model, and a dictionary. Third, it analyzes system usage.

  • Probabilistic Concatenation Modeling for Corpus-Based Speech Synthesis

    Shinsuke SAKAI  Tatsuya KAWAHARA  Hisashi KAWAI  

     
    PAPER-Speech and Hearing

      Vol:
    E94-D No:10
      Page(s):
    2006-2014

    The measure of the goodness, or inversely the cost, of concatenating synthesis units plays an important role in concatenative speech synthesis. In this paper, we present a probabilistic approach to concatenation modeling in which the goodness of concatenation is measured by the conditional probability of observing the spectral shape of the current candidate unit given the previous unit and the current phonetic context. This conditional probability is modeled by a conditional Gaussian density whose mean vector has the form of a linear transform of the past spectral shape. Decision tree-based parameter tying is performed to achieve robust training that balances model complexity against the amount of training data available. The concatenation models are implemented for a corpus-based speech synthesizer, and the effectiveness of the proposed method was confirmed by an objective evaluation as well as a subjective listening test. We also demonstrate that the proposed method generalizes several popular conventional methods, in that those methods can be derived as special cases of the proposed method.
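    The concatenation cost described above can be sketched as the negative log-density of a conditional Gaussian whose mean is a linear transform of the previous unit's spectral vector. The following is a minimal illustration, not the paper's implementation: it assumes a diagonal covariance and toy dimensions, and omits the phonetic-context conditioning and decision-tree tying.

```python
import math

def concat_log_prob(x_curr, x_prev, A, b, var):
    """Log-density of the current unit's spectral vector x_curr under a
    conditional Gaussian whose mean is a linear transform of the previous
    unit's spectral vector: mean = A @ x_prev + b, diagonal covariance var."""
    d = len(x_curr)
    mean = [sum(A[i][j] * x_prev[j] for j in range(d)) + b[i] for i in range(d)]
    log_p = 0.0
    for i in range(d):
        diff = x_curr[i] - mean[i]
        log_p += -0.5 * (math.log(2 * math.pi * var[i]) + diff * diff / var[i])
    return log_p

def concat_cost(x_curr, x_prev, A, b, var):
    """Concatenation cost = negative log-probability: lower cost, smoother join."""
    return -concat_log_prob(x_curr, x_prev, A, b, var)
```

    With an identity transform, a candidate whose spectrum matches the prediction from the previous unit receives a lower cost than a mismatched one, which is the behavior a unit-selection search would exploit.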

  • Recognition of Connected Digit Speech in Japanese Collected over the Telephone Network

    Hisashi KAWAI  Tohru SHIMIZU  Norio HIGUCHI  

     
    PAPER-Speech and Hearing

      Vol:
    E84-D No:3
      Page(s):
    374-383

    This paper describes experimental results on whole-word HMM-based speech recognition of connected digits in Japanese, with special focus on the training data size and the "sheep and goats" problem. The training data comprises 757,000 digits uttered by 2,000 speakers, while the testing data comprises 399,000 digits uttered by 1,700 speakers. The best word error rate for unknown-length strings was 1.64%, obtained using context-dependent HMMs. The word error rate was measured for various subsets of the training data, reduced both in the number of speakers (s) and the number of utterances per speaker (u). As a result, an empirical formula D(E_w) = s[{min(0.62s^0.75, u)}^0.74 + {max(0, u - 0.62s^0.75)}^0.27] was developed, where E_w and D(E_w) denote the word error rate and the effective data size, respectively. Analyses were conducted on several aspects of the low-performance speakers accounting for the major part of recognition errors. Attempts were also made to improve their recognition performance. It was found that 33% of the low-performance speakers were improved to the normal level by speaker clustering centered around each low-performance speaker.
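    The empirical formula above can be written directly as a function of the number of speakers s and utterances per speaker u. This is a sketch of the formula as stated in the abstract; the exponents are taken at face value and the function name is illustrative.

```python
def effective_data_size(s, u):
    """Effective training-data size D(E_w) from the empirical formula
    D = s * [{min(0.62 s^0.75, u)}^0.74 + {max(0, u - 0.62 s^0.75)}^0.27]."""
    t = 0.62 * s ** 0.75            # per-speaker saturation point
    head = min(t, u) ** 0.74        # utterances below saturation
    tail = max(0.0, u - t) ** 0.27  # extra utterances, strongly diminished
    return s * (head + tail)
```

    Under this formula, spreading a fixed utterance budget over more speakers yields a larger effective data size than collecting many utterances from fewer speakers, since the exponent on the above-saturation term (0.27) is much smaller than on the below-saturation term (0.74).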

  • A Portable Text-to-Speech System Using a Pocket-Sized Formant Speech Synthesizer

    Norio HIGUCHI  Tohru SHIMIZU  Hisashi KAWAI  Seiichi YAMAMOTO  

     
    PAPER

      Vol:
    E76-A No:11
      Page(s):
    1981-1989

    The authors developed a portable Japanese text-to-speech system using a pocket-sized formant speech synthesizer. It consists of a linguistic processor and an acoustic processor. The linguistic processor runs on an MS-DOS personal computer and determines readings and prosodic information for input sentences written in mixed kana-kanji style. New techniques, such as minimizing a cost function over phrases, a rare-compound flag, semantic information, reading-selection information, and restrictions imposed by associated particles, are used to increase the accuracy of readings and accent positions. The accuracy of determining readings and accent positions is 98.6% for sentences from newspaper articles. The linguistic processor can also be used through an interface library developed by the authors, so the system can convert not only whole texts stored in text files but also parts of sentences sent sequentially through the library, with the readings and prosodic information optimized over the whole sentence at once. The acoustic processor is custom-made hardware that adopts new techniques for improved vowel-devoicing rules, control of phoneme durations, control of the phrase components of the voice fundamental frequency, and construction of the acoustic-parameter database. These modifications improved the naturalness of the synthetic speech generated by a Klatt-type formant speech synthesizer; in a naturalness test it was rated 3.61 on a five-point scale from 0 to 4.

  • Speaker Adaptive Training Localizing Speaker Modules in DNN for Hybrid DNN-HMM Speech Recognizers

    Tsubasa OCHIAI  Shigeki MATSUDA  Hideyuki WATANABE  Xugang LU  Chiori HORI  Hisashi KAWAI  Shigeru KATAGIRI  

     
    PAPER-Acoustic modeling

  Publicized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2431-2443

    Among various training concepts for speaker adaptation, Speaker Adaptive Training (SAT) has been successfully applied to standard Hidden Markov Model (HMM) speech recognizers whose states are associated with Gaussian Mixture Models (GMMs). On the other hand, exploiting the high discriminative power of Deep Neural Networks (DNNs), a new type of recognizer structure that combines DNNs and HMMs has been vigorously investigated in speaker adaptation research. Along these two lines, it is natural to seek further improvement of a DNN-HMM recognizer by employing the training concept of SAT. In this paper, we propose a novel speaker adaptation scheme that applies SAT to a DNN-HMM recognizer. Our SAT scheme allocates a Speaker Dependent (SD) module to one of the intermediate layers of the DNN, treats the remaining layers as a Speaker Independent (SI) module, and jointly trains the SD and SI modules while switching the SD module speaker by speaker. We implement the scheme in a DNN-HMM recognizer whose DNN has seven layers and evaluate its utility on the TED Talks corpus. Our experimental results show that, in the supervised adaptation scenario, our Speaker-Adapted (SA) SAT-based recognizer reduces the word error rate of the baseline SI recognizer and the lowest word error rate of the SA SI recognizer by 8.4% and 0.7%, respectively, and by 6.4% and 0.6% in the unsupervised adaptation scenario. The error reductions achieved by our SA SAT-based recognizers were shown to be statistically significant. The results also show that our SAT-based adaptation outperforms its SI-based counterpart regardless of the SD module layer selection, and that the inner layers of the DNN seem more suitable for SD module allocation than the outer layers.
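    The core structural idea, one intermediate layer held per speaker while the surrounding layers are shared, can be sketched in a few lines. This is an illustrative forward pass only, assuming tiny toy layers and tanh activations; it is not the paper's seven-layer recognizer and shows no training loop.

```python
import math

def layer(x, W, b):
    """One fully connected layer with tanh activation."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

class SATNetwork:
    """Sketch of SAT for a DNN-HMM front end: one intermediate layer is
    Speaker Dependent (SD) and swapped per speaker; the layers below and
    above it form the shared Speaker Independent (SI) module."""

    def __init__(self, si_below, sd_modules, si_above):
        self.si_below = si_below      # SI layers below the SD module
        self.sd_modules = sd_modules  # speaker_id -> (W, b) of the SD layer
        self.si_above = si_above      # SI layers above the SD module

    def forward(self, x, speaker_id):
        for W, b in self.si_below:
            x = layer(x, W, b)
        W, b = self.sd_modules[speaker_id]  # switch the SD module per speaker
        x = layer(x, W, b)
        for W, b in self.si_above:
            x = layer(x, W, b)
        return x
```

    During joint training, gradients from every speaker's data would update the SI layers, while each SD module is updated only by its own speaker's data; at test time, an unseen speaker gets a freshly adapted SD module while the SI module stays fixed.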