Author Search Result

[Author] Tohru SHIMIZU(4hit)

  • A Speech Translation System Applied to a Real-World Task/Domain and Its Evaluation Using Real-World Speech Data

    Atsushi NAKAMURA  Masaki NAITO  Hajime TSUKADA  Rainer GRUHN  Eiichiro SUMITA  Hideki KASHIOKA  Hideharu NAKAJIMA  Tohru SHIMIZU  Yoshinori SAGISAKA  

     
    PAPER-Speech and Hearing

    Vol: E84-D  No: 1  Page(s): 142-154

    This paper describes the application of a speech translation system to another real-world task/domain, using development data collected from real-world interactions. The total cost of this task alteration was calculated to be 9 person-months. The newly applied system was also evaluated using speech data collected from real-world interactions. For real-world speech with a machine-friendly speaking style, the newly applied system could recognize typical sentences with a word accuracy of 90% or better. We also found that, in terms of overall speech translation performance, the system could translate about 80% of the input Japanese speech into acceptable English sentences.

  • Recognition of Connected Digit Speech in Japanese Collected over the Telephone Network

    Hisashi KAWAI  Tohru SHIMIZU  Norio HIGUCHI  

     
    PAPER-Speech and Hearing

    Vol: E84-D  No: 3  Page(s): 374-383

    This paper describes experimental results on whole-word HMM-based speech recognition of connected digits in Japanese, with special focus on the training data size and the "sheep and goats" problem. The training data comprises 757000 digits uttered by 2000 speakers, while the testing data comprises 399000 digits uttered by 1700 speakers. The best word error rate for unknown-length strings was 1.64%, obtained using context-dependent HMMs. The word error rate was measured for various subsets of the training data, reduced both in the number of speakers ($s$) and in the number of utterances per speaker ($u$). As a result, the empirical formula $s\,[\{\min(0.62\,s^{0.75},\, u)\}^{0.74} + \{\max(0,\, u - 0.62\,s^{0.75})\}^{0.27}] = D(E_w)$ was developed, where $E_w$ and $D(E_w)$ designate the word error rate and the effective data size, respectively. Analyses were conducted on several aspects of the low-performance speakers, who account for the major part of the recognition errors. Attempts were also made to improve their recognition performance. It was found that 33% of the low-performance speakers were improved to the normal level by speaker clustering centered around each low-performance speaker.
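    The effective data size in the abstract above can be evaluated numerically. The following is a minimal sketch, assuming the formula reads s[{min(0.62 s^0.75, u)}^0.74 + {max(0, u - 0.62 s^0.75)}^0.27] = D(Ew), that is, with 0.75, 0.74 and 0.27 taken as exponents. The function name `effective_data_size` and the sample inputs are illustrative and do not come from the paper.

```python
def effective_data_size(s: float, u: float) -> float:
    """Effective data size D for s speakers and u utterances per speaker.

    Assumes 0.75, 0.74 and 0.27 are exponents in the abstract's formula.
    """
    # Number of utterances per speaker at which the second term starts to apply.
    knee = 0.62 * s ** 0.75
    # Utterances up to the knee contribute with exponent 0.74.
    below = min(knee, u) ** 0.74
    # Utterances beyond the knee contribute with the smaller exponent 0.27.
    above = max(0.0, u - knee) ** 0.27
    return s * (below + above)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise both branches.
    for s, u in [(16, 4), (16, 10)]:
        print(s, u, effective_data_size(s, u))
```

    Under this reading, each speaker's contribution grows more slowly once the number of utterances per speaker exceeds 0.62 s^0.75, because the exponent drops from 0.74 to 0.27.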

  • Phoneme-Balanced and Digit-Sequence-Preserving Connected Digit Patterns for Text-Prompted Speaker Verification

    Tsuneo KATO  Tohru SHIMIZU  

     
    PAPER

    Vol: E87-D  No: 5  Page(s): 1194-1199

    This paper presents a novel design of connected digit patterns to achieve high-accuracy text-prompted speaker verification over a cellular phone network. To reduce the error rate, a phoneme-balanced connected digit pattern for enrollment and digit-sequence-preserving connected digit patterns for verification (i.e., patterns preserving partial digit sequences of the enrollment pattern) are proposed. In addition, a decision procedure using multiple patterns has been designed to overcome the low quality of cellular phone speech. Experimental results on cellular phone speech showed that the phoneme-balanced patterns for enrollment and the digit-sequence-preserving patterns for verification reduced the equal error rate by more than 50% compared to the conventional method using randomly selected and randomly reordered digit patterns. The decision procedure reduced the error rate by 60%. This paper also shows that verification patterns depending on the pattern of a preceding utterance reduced the error rate by 10%. Overall, the error rate obtained by the proposed method was 1% for 99% of clients and 95% of impostors.
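    For readers unfamiliar with the equal error rate quoted in this abstract, the sketch below shows one common way to estimate it from verification scores: sweep a decision threshold and take the point where the false rejection rate (clients rejected) and the false acceptance rate (impostors accepted) are closest. This is a generic illustration, not the paper's evaluation procedure; the function name `equal_error_rate` and the scores are made up for the example.

```python
def equal_error_rate(client_scores, impostor_scores):
    """Return (eer, threshold) at the threshold where FAR and FRR are closest.

    A trial is accepted when its score is >= the threshold.
    """
    best = None
    # Every observed score is a candidate threshold.
    for t in sorted(set(client_scores) | set(impostor_scores)):
        # False rejection rate: fraction of genuine clients scoring below t.
        frr = sum(s < t for s in client_scores) / len(client_scores)
        # False acceptance rate: fraction of impostors scoring at or above t.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the average of the two rates at the closest point.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Illustrative scores only.
    clients = [0.9, 0.8, 0.7, 0.4]
    impostors = [0.1, 0.2, 0.3, 0.6]
    print(equal_error_rate(clients, impostors))
```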

  • A Portable Text-to-Speech System Using a Pocket-Sized Formant Speech Synthesizer

    Norio HIGUCHI  Tohru SHIMIZU  Hisashi KAWAI  Seiichi YAMAMOTO  

     
    PAPER

    Vol: E76-A  No: 11  Page(s): 1981-1989

    The authors developed a portable Japanese text-to-speech system using a pocket-sized formant speech synthesizer. It consists of a linguistic processor and an acoustic processor. The linguistic processor runs on an MS-DOS personal computer and determines readings and prosodic information for input sentences written in kana-kanji-mixed style. New techniques, such as minimization of a cost function for phrases, a rare-compound flag, semantic information, reading-selection information and restriction by associated particles, are used to increase the accuracy of readings and accent positions. The accuracy of determining readings and accent positions is 98.6% for sentences in newspaper articles. The linguistic processor can be used through an interface library, which was also developed by the authors. Consequently, it has become possible not only to convert whole texts stored in text files but also to convert parts of sentences sent sequentially through the interface library, with the readings and prosodic information optimized for the whole sentence at one time. The acoustic processor is custom-made hardware that adopts new techniques: improved rules for vowel devoicing, control of phoneme durations, control of the phrase components of voice fundamental frequency, and the construction of the acoustic parameter database. Owing to these modifications, the naturalness of synthetic speech generated by a Klatt-type formant speech synthesizer was improved. In a naturalness test it was rated 3.61 on a five-point scale from 0 to 4.