Author Search Result

[Author] Katsuhiko SHIRAI (11 hits)

Results 1-11 of 11
  • Linguistic Intelligent CAI System Using Speech Data-Base

    Kyu-Keon LEE  Katsuhiko SHIRAI  

     
    LETTER

    Vol: E78-A No:11  Page(s): 1562-1565

    This paper describes a new intelligent computer-assisted instruction (ICAI) system for Japanese beginners learning Korean composition. The system is supported by speech synthesis, generated for arbitrary Japanese and Korean sentences by a new method that uses a natural speech database.

  • Recognizing Reverberant Speech Based on Amplitude and Frequency Modulation

    Yotaro KUBO  Shigeki OKAWA  Akira KUREMATSU  Katsuhiko SHIRAI  

     
    PAPER-ASR under Reverberant Conditions

    Vol: E91-D No:3  Page(s): 448-456

    We have attempted to recognize reverberant speech using a novel speech recognition system that depends not only on the spectral envelope and amplitude modulation but also on frequency modulation. Most features used by modern speech recognition systems, such as MFCC, PLP, and TRAPS, are derived from the energy envelopes of narrowband signals, discarding the information in the carrier signals. However, some experiments show that, apart from the spectral/time envelope and its modulation, the information at the zero-crossing points of the carrier signals also plays a significant role in human speech recognition. In realistic environments, a feature that depends on only a limited set of signal properties may easily be corrupted. To use an automatic speech recognizer in an unknown environment, it is therefore important to exploit information from other signal properties and to combine them so as to minimize the effects of the environment. In this paper, we propose a method for analyzing the carrier signals that are discarded in most speech recognition systems. Our system consists of two nonlinear discriminant analyzers based on multilayer perceptrons. One is HATS, which can efficiently capture the amplitude modulation of narrowband signals. The other is a pseudo-instantaneous frequency analyzer proposed in this paper, which can efficiently capture the frequency modulation of narrowband signals. These two analyzers are combined by the method based on the entropy of the features introduced by Okawa et al. In Sect. 2, we first introduce pseudo-instantaneous frequencies to capture a property of the carrier signal. The previous AM analysis method is described in Sect. 3, the proposed system in Sect. 4, the experimental setup in Sect. 5, and the results in Sect. 6. We evaluate the performance of the proposed method on continuous digit recognition of reverberant speech. The proposed system shows considerable improvement over the MFCC-based feature extraction system.
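
    The combination step is only named in the abstract; as a rough illustration, the following sketch shows one common form of entropy-based stream combination, inverse-entropy weighting of two per-frame phoneme posterior streams. The function names and toy posteriors are hypothetical and not taken from the paper.

        import numpy as np

        def stream_entropy(posteriors, eps=1e-12):
            """Entropy of a per-frame phoneme posterior vector (low = confident)."""
            p = np.clip(posteriors, eps, 1.0)
            p = p / p.sum()
            return -np.sum(p * np.log(p))

        def combine_streams(am_post, fm_post):
            """Weight the AM and FM posterior streams by inverse entropy and renormalize."""
            w_am = 1.0 / (stream_entropy(am_post) + 1e-6)
            w_fm = 1.0 / (stream_entropy(fm_post) + 1e-6)
            w_am, w_fm = w_am / (w_am + w_fm), w_fm / (w_am + w_fm)
            combined = w_am * am_post + w_fm * fm_post
            return combined / combined.sum()

        # Toy example: the AM stream is confident, the FM stream is not,
        # so the combined posterior leans toward the AM stream.
        am = np.array([0.85, 0.05, 0.05, 0.05])
        fm = np.array([0.30, 0.25, 0.25, 0.20])
        print(combine_streams(am, fm))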

  • An Efficient Lip-Reading Method Robust to Illumination Variations

    Jinyoung KIM  Joohun LEE  Katsuhiko SHIRAI  

     
    LETTER-Speech and Hearing

    Vol: E85-A No:9  Page(s): 2164-2168

    In this paper, an efficient (smaller feature data size) and robust (better recognition under different lighting conditions) method is proposed for real-time, image-transform-based automatic lip-reading under illumination variations. The image-transform-based approach obtains a compressed representation of the pixel values of the speaker's mouth and is reported to show superior lip-reading performance. However, this approach inevitably produces large feature vectors of lip information, requiring much computation time for lip-reading even when principal component analysis (PCA) is applied. To reduce the necessary dimension of the feature vectors, the proposed method folds the lip image based on its symmetry within a frame. This folding also compensates for the unbalanced illumination between the left and right lip areas. Additionally, to filter out the inter-frame time-domain spectral distortion of each pixel contaminated by illumination noise, our method applies high-pass filtering to the variations of pixel values between consecutive frames. In experiments performed on a database recorded under various lighting conditions, the proposed lip-folding and/or inter-frame filtering greatly reduced the number of required features (principal components in this work) and showed a superior recognition rate compared to the conventional method.
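
    As a rough illustration of the two preprocessing ideas above, the sketch below folds a mouth-region image about its vertical symmetry axis (averaging the left half with the mirrored right half) and applies a simple first-order high-pass, a frame difference, over time. The exact fold and filter used in the paper are not specified here, so both functions are assumptions.

        import numpy as np

        def fold_lip_image(frame):
            """Fold an (H, W) mouth-region image about its vertical symmetry axis,
            averaging the left half with the mirrored right half. This halves the
            pixel count and compensates left/right illumination imbalance."""
            h, w = frame.shape
            left = frame[:, : w // 2]
            right = np.fliplr(frame[:, w - w // 2 :])
            return 0.5 * (left + right)

        def interframe_highpass(folded_frames):
            """First-order high-pass over time: per-pixel difference between
            consecutive frames, which suppresses slowly varying illumination."""
            x = np.stack(folded_frames)            # shape (T, H, W//2)
            return x[1:] - x[:-1]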

  • Phrase Recognition in Conversational Speech Using Prosodic and Phonemic Information

    Shigeki OKAWA  Takashi ENDO  Tetsunori KOBAYASHI  Katsuhiko SHIRAI  

     
    PAPER

    Vol: E76-D No:1  Page(s): 44-50

    In this paper, a new scheme for phrase recognition in conversational speech is proposed, in which prosodic and phonemic information processing are usefully combined. This approach is employed both to produce candidate phrase boundaries and to discriminate phonemes. The fundamental frequency patterns of continuous utterances are statistically analyzed, and the likelihood of the occurrence of a phrase boundary is calculated for every frame. At the same time, the likelihood of the phonemic characteristics of each frame can be obtained using a hierarchical clustering method. These two scores, along with lexical and grammatical constraints, can be effectively utilized to develop possible word sequences or word lattices corresponding to the continuous speech utterances. Our preliminary experiment shows the feasibility of applying prosody to continuous speech recognition, especially for conversational-style utterances.
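
    The statistical model behind the frame-wise boundary likelihood is not given in the abstract; the sketch below is a hypothetical stand-in that scores each frame by a log likelihood ratio of the local F0 change under "boundary" and "non-boundary" Gaussians. All parameter values are placeholders, not the paper's estimates.

        import numpy as np

        def boundary_log_likelihood_ratio(f0, boundary_mean=-40.0, boundary_std=15.0,
                                          nonboundary_mean=0.0, nonboundary_std=10.0):
            """Per-frame log likelihood ratio that a phrase boundary occurs, based on
            the local F0 change (Hz/frame). A sharp F0 fall/reset is typical near a
            boundary; the Gaussian parameters are hypothetical placeholders for
            statistics that would be estimated from labeled conversational speech."""
            df0 = np.gradient(np.asarray(f0, dtype=float))

            def log_gauss(x, mu, sd):
                return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))

            return (log_gauss(df0, boundary_mean, boundary_std)
                    - log_gauss(df0, nonboundary_mean, nonboundary_std))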

  • Development of a Lip-Sync Algorithm Based on an Audio-Visual Corpus

    Jinyoung KIM  Joohun LEE  Katsuhiko SHIRAI  

     
    LETTER-Databases

    Vol: E86-D No:2  Page(s): 334-339

    In this paper, we propose a corpus-based lip-sync algorithm for natural face animation. For this purpose, we constructed a Korean audio-visual (AV) corpus. Based on this AV corpus, we propose a concatenation method of AV units, similar to a corpus-based text-to-speech system. For our AV corpus, lip-related parameters were extracted from every video-recorded facial shot in which the speaker reads texts selected from newspapers. The spoken utterances were labeled with HTK, and prosodic information such as duration, pitch, and intensity was extracted as lip-sync parameters. Based on the constructed AV corpus, the basic synthesis units are CVC-syllable units. For the best concatenation performance, the best path is estimated by a general Viterbi search based on the phonetic-environment distance and the prosodic distance. From the computer simulation results, we found that information on pitch and intensity, in addition to duration, is useful for enhancing lip-sync performance, and that the reconstructed lip parameters are almost equal to the original parameters.
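
    As a minimal sketch of the unit-selection step described above, the following Viterbi search picks one AV unit per target CVC syllable by minimizing the sum of a prosodic target cost and a phonetic-environment concatenation cost. The cost functions and data structures are assumptions; the paper's actual distances are not reproduced here.

        import numpy as np

        def select_units(candidates, target_cost, concat_cost):
            """Minimal unit-selection Viterbi over CVC-syllable candidates.

            candidates[t] : list of candidate AV units for target syllable t
            target_cost   : f(unit, t)    -> prosodic distance to the target
            concat_cost   : f(prev, unit) -> phonetic-environment join distance
            Returns the lowest-cost sequence of units (one per target syllable)."""
            T = len(candidates)
            cost = [[target_cost(u, 0) for u in candidates[0]]]
            back = [[None] * len(candidates[0])]
            for t in range(1, T):
                cost.append([]); back.append([])
                for u in candidates[t]:
                    scores = [cost[t - 1][j] + concat_cost(p, u)
                              for j, p in enumerate(candidates[t - 1])]
                    j_best = int(np.argmin(scores))
                    cost[t].append(scores[j_best] + target_cost(u, t))
                    back[t].append(j_best)
            # Trace back the best path.
            j = int(np.argmin(cost[-1]))
            path = []
            for t in range(T - 1, -1, -1):
                path.append(candidates[t][j])
                j = back[t][j] if back[t][j] is not None else 0
            return path[::-1]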

  • Sounds of Speech Based Spoken Document Categorization: A Subword Representation Method

    Weidong QU  Katsuhiko SHIRAI  

     
    PAPER

    Vol: E87-D No:5  Page(s): 1175-1184

    In this paper, we explore a method for the problem of spoken document categorization, which is the task of automatically assigning spoken documents to a set of predetermined categories. To categorize spoken documents, subword unit representations are used as an alternative to word units generated by either keyword spotting or large vocabulary continuous speech recognition (LVCSR). An advantage of using subword acoustic unit representations for spoken document categorization is that it does not require prior knowledge about the contents of the spoken documents and addresses the out-of-vocabulary (OOV) problem. Moreover, this method relies on the sounds of speech rather than exact orthography. The use of subword units instead of words allows approximate matching on inaccurate transcriptions and makes "sounds-like" spoken document categorization possible. We also explore the performance of our method when the training set contains both perfect and errorful phonetic transcriptions, in the hope that the classifiers can learn the confusion characteristics of the recognizer and the pronunciation variants of words to improve the robustness of the whole system. Our experiments on both artificial and real corrupted data sets show that the proposed method is more effective and robust than the word-based method.
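
    A minimal sketch of subword-based categorization, assuming phone trigrams over recognizer output and a naive Bayes classifier (neither is necessarily the paper's exact configuration); the training and test phone strings are made up.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        # Each "document" is the phone-string output of a recognizer (hypothetical data).
        train_docs = ["a m e g a f u r u", "k a z e o h i k u"]
        train_labels = ["weather", "health"]

        # Phone trigrams over space-separated phone tokens; the n-gram order and the
        # classifier choice are assumptions for illustration only.
        clf = make_pipeline(
            TfidfVectorizer(ngram_range=(3, 3), token_pattern=r"[^ ]+"),
            MultinomialNB(),
        )
        clf.fit(train_docs, train_labels)
        print(clf.predict(["a s u w a a m e"]))   # hypothetical test utterance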

  • High Quality Synthetic Speech Generation Using Synchronized Oscillators

    Kenji HASHIMOTO  Takemi MOCHIDA  Yasuaki SATO  Tetsunori KOBAYASHI  Katsuhiko SHIRAI  

     
    PAPER

    Vol: E76-A No:11  Page(s): 1949-1956

    For the production of high-quality synthetic sounds in a text-to-speech system, an excellent method for synthesizing speech signals is indispensable. In this paper, a new speech analysis-synthesis method for the text-to-speech system is proposed. The signals of voiced speech, which have a line spectrum structure at intervals of the pitch in the linear frequency domain, can be represented approximately by the superposition of sinusoidal waves. In our system, analysis and synthesis are performed using this harmonic structure of the voiced speech signals. In the analysis phase, assuming an exact harmonic structure model at intervals of the pitch against the fine structure of the short-time power spectrum, the fundamental frequency f0 is decided so as to minimize the error of the log-power spectrum at each peak position. At the same time, according to the value of the above minimized error, the degree of periodicity of the speech signal is determined. The log-power spectrum envelope is then represented by a cosine series interpolating the data sampled at every pitch period. In the synthesis phase, numerical solutions of non-linear differential equations that generate sinusoidal waves are used. For voiced sounds, these equations behave as a group of mutually synchronized oscillators, and the sinusoidal waves are superposed so as to reconstruct the line spectrum structure. For voiceless sounds, the non-linear differential equations work as passive filters with input noise sources. Our system has the following characteristics. (1) Voiced and voiceless sounds can be treated in the same framework. (2) Since the phase and power information of each sinusoidal wave can be easily controlled, periodic waveforms in voiced sounds can, if necessary, be precisely reproduced in the time domain. (3) The fundamental frequency f0 and the phoneme duration can be easily changed without much degradation of the original sound quality.
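
    As a rough illustration of the synthesis idea (superposing sinusoids on the harmonic line spectrum), the sketch below reconstructs one voiced frame from f0 and a log-power spectral envelope. It does not reproduce the paper's synchronized-oscillator equations; the envelope callable and all parameter values are assumptions.

        import numpy as np

        def synthesize_voiced_frame(f0, envelope_db, fs=16000, duration=0.02, phases=None):
            """Reconstruct a voiced frame as a superposition of sinusoids at the
            harmonics of f0, with amplitudes read from a log-power spectral envelope.
            envelope_db is a callable Hz -> dB, a stand-in for the cosine-series
            envelope described in the paper."""
            t = np.arange(int(fs * duration)) / fs
            n_harm = int((fs / 2) // f0)
            if phases is None:
                phases = np.zeros(n_harm)
            frame = np.zeros_like(t)
            for k in range(1, n_harm + 1):
                amp = 10.0 ** (envelope_db(k * f0) / 20.0)
                frame += amp * np.cos(2 * np.pi * k * f0 * t + phases[k - 1])
            return frame

        # Example with a crude, hypothetical envelope falling 6 dB per octave.
        env = lambda f: -6.0 * np.log2(f / 100.0)
        x = synthesize_voiced_frame(120.0, env)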

  • Functional Design of a Special Purpose Processor Based on High Level Specification Description

    Hironobu KITABATAKE  Katsuhiko SHIRAI  

     
    PAPER

    Vol: E75-A No:10  Page(s): 1182-1190

    A design system for a special-purpose processor that executes algorithms described in a high-level language is discussed. The system can generate an optimized architecture for the processor and also supplies a specialized high-level language compiler for it. A new optimization procedure is introduced to find effective functional blocks that contribute to improved performance. Functional blocks are found by simulating the algorithm to identify frequently appearing execution patterns, and these patterns are used to yield useful combined instructions.
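
    A toy stand-in for the pattern search described above: count contiguous instruction n-grams in an execution trace and report the most frequent ones as candidates for combined instructions. The trace and instruction names are hypothetical, and the paper's actual selection criteria are not reproduced.

        from collections import Counter

        def candidate_combined_instructions(trace, max_len=3, top_k=5):
            """Count contiguous instruction n-grams (length 2..max_len) in an
            execution trace and return the most frequent ones as candidates
            for combined instructions."""
            counts = Counter()
            for n in range(2, max_len + 1):
                for i in range(len(trace) - n + 1):
                    counts[tuple(trace[i:i + n])] += 1
            return counts.most_common(top_k)

        # Hypothetical trace of a multiply-accumulate loop.
        trace = ["LOAD", "MUL", "ADD", "STORE", "LOAD", "MUL", "ADD", "STORE"]
        print(candidate_combined_instructions(trace))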

  • FOREWORD

    Katsuhiko SHIRAI  Sadaoki FURUI  

     
    FOREWORD

    Vol: E74-A No:7  Page(s): 1759-1760

  • FOREWORD

    Katsuhiko SHIRAI  

     
    FOREWORD

    Vol: E76-D No:1  Page(s): 1-1

  • Hybrid Method of Data Collection for Evaluating Speech Dialogue System

    Shu NAKAZATO  Ikuo KUDO  Katsuhiko SHIRAI  

     
    PAPER-Speech Processing and Acoustics

    Vol: E79-D No:1  Page(s): 41-46

    In this paper, we propose a new method of dialogue data collection that can be used to evaluate the modules of a spoken dialogue system. To evaluate a module, it is necessary to use suitable data. Human-human dialogue data have not been appropriate for module evaluation, because spontaneous data usually include too many specific phenomena such as fillers, restarts, pauses, and hesitations. Human-machine dialogue data have not been appropriate either, because the dialogue is unnatural and the available vocabulary is limited. Here, we propose a 'hybrid method' for the collection of spoken dialogue data. Its merit is that the collected data can be used as test data for the evaluation of a spoken dialogue system without any modification. In our method, a human takes the role of some modules of the system while the machine performs the remaining modules. For example, a human acts as the speech recognition and dialogue management modules, and the machine performs the remaining part, the response generation module. The collected data are well suited to the evaluation of the speech recognition and dialogue management modules, for the following reasons. (1) Lexicon: the lexicon was composed of a limited set of words and was task-dependent. (2) Grammar: the intentions expressed by the subjects were concise and clear. (3) Topics: there were few utterances outside the task domain. The collected data can thus be used as test data for the evaluation of a spoken dialogue system without any modification.