Keyword Search Results

[Keyword] continuous speech recognition (10 hits)

Hits 1-10 of 10
  • Continuous Speech Recognition Based on General Factor Dependent Acoustic Models

    Hiroyuki SUZUKI  Heiga ZEN  Yoshihiko NANKAKU  Chiyomi MIYAJIMA  Keiichi TOKUDA  Tadashi KITAMURA  

     
    PAPER-Feature Extraction and Acoustic Modeling

      Vol:
    E88-D No:3
      Page(s):
    410-417

    This paper describes continuous speech recognition that incorporates additional complementary information, e.g., voice characteristics, speaking style, linguistic information, and noise environment, into HMM-based acoustic modeling. In speech recognition systems, context-dependent HMMs, i.e., triphones, and tree-based context clustering have commonly been used. Several attempts have been made in recent years to utilize not only phonetic contexts but also such additional information, based on context (factor) dependent HMMs. However, when the additional factors are unobserved for the test data, a method for obtaining factor labels is required before decoding. In this paper, we propose a model integration technique based on general factor dependent HMMs for decoding. The integrated HMMs can be used by a conventional decoder as standard triphone HMMs with Gaussian mixture densities. Moreover, by using the results of context clustering, the proposed method can determine an optimal number of mixture components for each state, depending on the degree of influence of the additional factors. Phoneme recognition experiments using voice characteristic labels show significant improvements with a small number of model parameters, and a 19.3% error reduction was obtained in noise environment experiments.
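
    A minimal sketch of the model integration idea described above, assuming diagonal-covariance Gaussians and known factor priors (all names are illustrative, not the authors' implementation): when the factor label is unobserved at decoding time, the factor-dependent output densities can be folded into a single Gaussian mixture per state, p(o|state) = Σ_f P(f) N(o; μ_f, Σ_f), which a conventional triphone decoder can score directly.

    ```python
    # Hypothetical sketch: marginalize factor-dependent Gaussians into one GMM
    # per state so a standard decoder needs no factor labels at test time.
    import numpy as np

    def integrate_factor_models(factor_priors, means, variances):
        """Fold factor-dependent Gaussians into one mixture per state.

        factor_priors: dict factor -> prior P(f); means/variances: dict
        factor -> np.ndarray (diagonal Gaussians). Returns mixture parameters.
        """
        weights = np.array([factor_priors[f] for f in factor_priors])
        mus = np.stack([means[f] for f in factor_priors])
        sigmas = np.stack([variances[f] for f in factor_priors])
        return weights, mus, sigmas

    def gmm_loglike(obs, weights, mus, sigmas):
        """Log-likelihood of one observation under the integrated mixture."""
        comp = -0.5 * (np.log(2 * np.pi * sigmas) + (obs - mus) ** 2 / sigmas)
        return np.log(np.dot(weights, np.exp(comp.sum(axis=1))))
    ```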

  • A Study on Acoustic Modeling for Speech Recognition of Predominantly Monosyllabic Languages

    Ekkarit MANEENOI  Visarut AHKUPUTRA  Sudaporn LUKSANEEYANAWIN  Somchai JITAPUNKUL  

     
    PAPER

      Vol:
    E87-D No:5
      Page(s):
    1146-1163

    This paper presents a study on acoustic modeling for speech recognition of predominantly monosyllabic languages. Various speech units used in speech recognition systems have been investigated. To evaluate the effectiveness of these acoustic models, the Thai language is selected, since it is a predominantly monosyllabic language and has a complex vowel system. Several experiments have been carried out to find the speech unit that yields accurate acoustic models and a higher recognition rate. Recognition rates under the different acoustic models are given and compared. In addition, this paper proposes a new speech unit for speech recognition, namely the onset-rhyme unit. Two models are proposed: the Phonotactic Onset-Rhyme Model (PORM) and the Contextual Onset-Rhyme Model (CORM). The models comprise a pair of onset and rhyme units, which together make up a syllable. An onset comprises an initial consonant and its transition towards the following vowel; the rhyme consists of a steady vowel segment and a final consonant. Experimental results show that the onset-rhyme model improves on the efficiency of other speech units. It improves on the accuracy of the inter-syllable triphone model by nearly 9.3% and on that of the context-dependent Initial-Final model by nearly 4.7% for speaker-dependent systems using only an acoustic model, and by 5.6% and 4.5% respectively for speaker-dependent systems using both acoustic and language models. The results show that the onset-rhyme models attain a high recognition rate and are also more efficient in terms of system complexity.
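
    As a toy illustration of the onset-rhyme decomposition (the consonant inventory and labels below are invented, not the paper's), a romanized syllable can be split into an onset unit, the initial consonant plus its transition into the following vowel, and a rhyme unit, the steady vowel segment plus any final consonant:

    ```python
    # Hypothetical onset-rhyme splitter for romanized syllables.
    INITIALS = {"k", "kh", "t", "th", "p", "ph", "m", "n", "s"}  # invented set

    def onset_rhyme(syllable):
        """Split a syllable into (onset, rhyme) unit labels."""
        # Take the longest matching initial consonant cluster as the onset base.
        for length in (2, 1):
            head, tail = syllable[:length], syllable[length:]
            if head in INITIALS and tail:
                # PORM-style label: the onset carries its following-vowel context.
                return head + "+" + tail[0], tail
        return "", syllable  # vowel-initial syllable: empty onset

    print(onset_rhyme("khaam"))  # ('kh+a', 'aam')
    ```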

  • Topic Extraction based on Continuous Speech Recognition in Broadcast News Speech

    Katsutoshi OHTSUKI  Tatsuo MATSUOKA  Shoichi MATSUNAGA  Sadaoki FURUI  

     
    PAPER-Speech and Hearing

      Vol:
    E85-D No:7
      Page(s):
    1138-1144

    In this paper, we propose topic extraction models based on statistical relevance scores between topic words and words in articles, and report results of topic extraction experiments using continuous speech recognition of Japanese broadcast news utterances. We attempt to represent the topic of a news item as a combination of multiple topic words, which are important words in the news article or words relevant to the news. We statistically model the mapping from words in an article to topic words; using this mapping, the topic extraction model can extract topic words even if they do not appear in the article. We train a topic extraction model capable of computing the degree of relevance between a topic word and a word in an article using newspaper text covering a five-year period. The degree of relevance between two words is calculated with measures such as mutual information or the χ2 statistic. In experiments extracting five topic words using a χ2-based model, we achieve 72% precision and 12% recall for speech recognition results. Speech recognition results generally include a number of recognition errors, which degrade topic extraction performance. To mitigate this, we employ the N-best candidates and the likelihoods given by the acoustic and language models. In experiments, we find that extracting five topic words using N-best candidates and likelihood values achieves significantly improved precision.
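
    A minimal sketch, under assumed count data structures, of the χ2-based relevance described above: for a topic word t and an article word w, a 2x2 contingency table of document co-occurrence gives the χ2 statistic, and candidate topic words are ranked by their summed relevance to the article's words.

    ```python
    from collections import Counter

    def chi_square(n_tw, n_t, n_w, n_docs):
        """Chi-square statistic from the 2x2 t/w document co-occurrence table."""
        a = n_tw                       # documents containing both t and w
        b = n_t - n_tw                 # t only
        c = n_w - n_tw                 # w only
        d = n_docs - n_t - n_w + n_tw  # neither
        den = (a + b) * (c + d) * (a + c) * (b + d)
        return n_docs * (a * d - b * c) ** 2 / den if den else 0.0

    def extract_topics(article_words, topic_words, cooc, df, n_docs, k=5):
        """Rank topic words by summed chi-square relevance to the article."""
        scores = Counter()
        for t in topic_words:
            for w in article_words:
                scores[t] += chi_square(cooc.get((t, w), 0), df[t], df[w], n_docs)
        return [t for t, _ in scores.most_common(k)]
    ```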

  • Neural Networks and the Time-Sliced Paradigm for Speech Recognition

    Ingrid KIRSCHNING  Jun-Ichi AOE  

     
    PAPER-Speech Processing and Acoustics

      Vol:
    E79-D No:12
      Page(s):
    1690-1699

    The Time-Slicing paradigm is a newly developed method for training neural networks for speech recognition. The neural net is trained to spot the syllables in a continuous stream of speech, generating a transcription of the utterance, be it a word, a phrase, etc. Combined with a simple error-recovery method, the desired units (words or phrases) can be retrieved. The paradigm uses a recurrent neural network trained in a modular fashion with natural connectionist glue. It processes the input signal sequentially regardless of the input's length and immediately extracts the syllables spotted in the speech stream. The resulting character string is then compared to a set of possible words, and the five closest candidates are picked out. In this paper we describe the time-slicing paradigm and the training of the recurrent neural network, together with details about the training samples. We also introduce the concept of natural connectionist glue and the recurrent network architecture used for this purpose. Additionally, we explain the errors found in the output and the process used to reduce them and to recover the correct words. The recognition rates of the network and the recovery rates for the words are also given. The presented examples and recognition rates demonstrate the potential of the time-slicing method for continuous speech recognition.
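
    A sketch of the retrieval step mentioned above, using plain Levenshtein distance as an assumed stand-in for the paper's comparison measure: the syllable string produced by the network is matched against the lexicon and the five closest entries are returned.

    ```python
    def edit_distance(a, b):
        """Levenshtein distance between two syllable strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def closest_words(transcription, lexicon, k=5):
        """Return the k lexicon entries nearest to the spotted syllable string."""
        return sorted(lexicon, key=lambda w: edit_distance(transcription, w))[:k]
    ```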

  • Continuous Speech Recognition Using a Combination of Syntactic Constraints and Dependency Relationships

    Tsuyoshi MORIMOTO  

     
    PAPER-Speech Processing and Acoustics

      Vol:
    E79-D No:1
      Page(s):
    54-62

    This paper proposes a Japanese continuous speech recognition mechanism in which a full-sentence-level context-free grammar (CFG) and one kind of semantic constraint, called "dependency relationships between two bunsetsu (a kind of phrase)," are used during speech recognition in an integrated way. Each dependency relationship is a modification relationship between two bunsetsu; these relationships include the case-frame relationship of a noun bunsetsu to a predicate bunsetsu, and adnominal modification relationships such as that of a noun bunsetsu to a noun bunsetsu. To suppress the processing overhead caused by using such relationships during speech recognition, no rigorous semantic analysis is performed; instead, a simple "matching with examples" approach is adopted. An experiment was carried out and the results were compared with a case employing only CFG constraints. They show that speech recognition accuracy is improved and that the overhead is small enough.
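
    A minimal sketch, with invented example data, of the "matching with examples" idea: a candidate modifier/head bunsetsu pair is accepted when it is close enough to some stored example pair, with word similarity approximated by a crude thesaurus lookup rather than rigorous semantic analysis.

    ```python
    # Hypothetical example base and thesaurus; none of this is the paper's data.
    EXAMPLES = {("hoteru", "tomaru"), ("heya", "yoyaku-suru")}
    THESAURUS = {"ryokan": {"hoteru"}, "hoteru": {"ryokan"}}

    def similar(w1, w2):
        """Crude word similarity: identity or a direct thesaurus link."""
        return w1 == w2 or w2 in THESAURUS.get(w1, set())

    def dependency_ok(modifier, head):
        """Accept the dependency if some stored example pair matches word-by-word."""
        return any(similar(modifier, m) and similar(head, h)
                   for m, h in EXAMPLES)

    print(dependency_ok("ryokan", "tomaru"))  # True: ryokan ~ hoteru
    ```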

  • A Speech Dialogue System with Multimodal Interface for Telephone Directory Assistance

    Osamu YOSHIOKA  Yasuhiro MINAMI  Kiyohiro SHIKANO  

     
    PAPER

      Vol:
    E78-D No:6
      Page(s):
    616-621

    This paper describes a multimodal dialogue system employing speech input. The system uses three input methods (a speech recognizer, a mouse, and a keyboard) and two output methods (a display and sound). The speech recognizer employs an algorithm for large-vocabulary, speaker-independent continuous speech recognition based on the HMM-LR technique. The system is implemented for telephone directory assistance in order to evaluate the speech recognition algorithm and to investigate the variations in speech structure that users utter to computers. Speech input is used in a multimodal environment, and dialogue data between computers and users are also collected. Twenty telephone-number retrieval tasks are used to evaluate the system. In the experiments, all users receive the same training in using the dialogue system, with an interactive guidance system implemented on a workstation. Simplified city maps that indicate subscriber names and addresses are used to reduce the implicit restrictions imposed by written sentences, allowing each user to develop his own forms of expression. The task completion rate is 99.0%, and approximately 75% of the users say that they prefer this system to using a telephone book. Moreover, there is a significant decrease in non-keyword usage, i.e., the use of words other than names and addresses, for users who receive more utterance practice.

  • A Continuous Speech Recognition Algorithm Utilizing Island-Driven A* Search

    Yoshikazu YAMAGUCHI  Akio OGIHARA  Yasuhisa HAYASHI  Nobuyuki TAKASU  Kunio FUKUNAGA  

     
    LETTER

      Vol:
    E76-A No:7
      Page(s):
    1184-1186

    We propose a continuous speech recognition algorithm utilizing island-driven A* search. Conventional left-to-right A* search is liable to lose the optimal solution from a finite stack if the beginning of the input speech is obscure. The proposed island-driven A* search proceeds forward and backward from the clearest part of the input speech, and thus avoids losing the optimal solution from a finite stack.
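
    A schematic sketch of island-driven search under assumed scoring callbacks (none of this is the authors' code): hypotheses grow outward from the most reliable frame rather than strictly left to right, so an obscure utterance onset cannot push the optimal path off the finite stack.

    ```python
    import heapq

    def island_driven_search(frames, reliability, expand, stack_size=100):
        """Best-first search over (start, end, words, score) island hypotheses.

        reliability(i) scores frame i's clarity; expand(hyp, direction) yields
        successor hypotheses extended one step 'left' or 'right' (both assumed
        callbacks that respect the utterance boundaries).
        """
        seed = max(range(len(frames)), key=reliability)  # clearest frame
        heap = [(0.0, (seed, seed + 1, (), 0.0))]
        while heap:
            _, hyp = heapq.heappop(heap)
            start, end, words, score = hyp
            if start == 0 and end == len(frames):
                return words                    # the island spans the whole input
            for direction in ("left", "right"):
                for nxt in expand(hyp, direction):
                    heapq.heappush(heap, (-nxt[3], nxt))
            if len(heap) > stack_size:          # finite stack: keep the best only
                heap = heapq.nsmallest(stack_size, heap)
        return None
    ```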

  • A Linguistic Procedure for an Extension Number Guidance System

    Naomi INOUE  Izuru NOGAITO  Masahiko TAKAHASHI  

     
    PAPER

      Vol:
    E76-D No:1
      Page(s):
    106-111

    This paper describes the linguistic procedure of our speech dialogue system. The procedure is composed of two processes: syntactic analysis using a finite state network, and discourse analysis using a plan recognition model. The finite state network is compiled from a regular grammar, which is written so as to accept sentences in various styles, for example ellipsis and inversion, and is generated automatically from a skeleton grammar. The discourse analysis module understands each utterance, generates the next question for the user, and predicts the words likely to appear in the next utterance. For an extension number guidance task, we obtained correct recognition results for 93% of input sentences without word prediction, and for 98% when the prediction results include the proper words.
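
    A toy finite state network (the grammar fragment is invented) showing how word prediction falls out of the compiled network: the words on the arcs leaving the current state are exactly the predictions for the next word of the utterance.

    ```python
    # Hypothetical compiled network: state -> {word: next state}.
    FSN = {
        "S":    {"extension": "NUM", "please": "S"},
        "NUM":  {"number": "NAME"},
        "NAME": {"of": "WHO"},
        "WHO":  {"suzuki": "END", "tanaka": "END"},
    }

    def predict_next(partial, start="S"):
        """Walk the FSN over the words seen so far; return admissible next words."""
        state = start
        for word in partial:
            state = FSN[state][word]  # KeyError means the word is out of grammar
        return sorted(FSN.get(state, {}))

    print(predict_next(["extension", "number", "of"]))  # ['suzuki', 'tanaka']
    ```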

  • System Design, Data Collection and Evaluation of a Speech Dialogue System

    Katunobu ITOU  Satoru HAYAMIZU  Kazuyo TANAKA  Hozumi TANAKA  

     
    PAPER

      Vol:
    E76-D No:1
      Page(s):
    121-127

    This paper describes the design of a speech dialogue system, its evaluation, and the collection of spontaneous speech in a transportation guidance domain. Because it is difficult to collect spontaneous speech and to use a real system for collection and evaluation, the phenomena related to dialogue have not yet been quantitatively clarified. The authors constructed a speech dialogue system that operates in almost real time, with acceptable recognition accuracy and flexible dialogue control, and used it to collect spontaneous speech in a transportation guidance domain. Evaluated in this domain, the system achieves an understanding rate of 84.2% for utterances within the predefined grammar and lexicon. Some statistics of the collected spontaneous speech are also given.

  • An Efficient One-Pass Search Algorithm for Parsing Spoken Language

    Michio OKADA  

     
    PAPER-Speech

      Vol:
    E75-A No:7
      Page(s):
    944-953

    Spoken language systems such as speech-to-speech dialogue translation systems have been gaining attention in recent years. These systems require full integration of speech recognition and natural language understanding. This paper presents an efficient parsing algorithm that integrates the search problems of speech processing and language processing. The algorithm is an extension of the finite-state-network-directed, one-pass search algorithm to one directed by a context-free grammar, retaining the time-synchronous procedure. The extended search algorithm finds approximately globally optimal sentence hypotheses without the overhead that exists in, for example, hierarchical systems based on the lattice parsing approach; its computational complexity is proportional to the length of the input speech. Since the search process in speech recognition can directly take account of predictive information from sentence parsing, the framework can be extended to spoken language systems that deal with dynamically varying constraints in dialogue situations.
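
    A compact sketch of the baseline this work extends, a time-synchronous one-pass Viterbi search directed by a finite state network, with assumed placeholder scoring; every active state advances once per frame, so the work grows linearly with input length. The CFG-directed extension would replace these network states with grammar-derived ones and is not shown.

    ```python
    import math

    def one_pass_search(frames, arcs, emit, start, finals):
        """Time-synchronous one-pass search over a finite state network.

        arcs: iterable of (src, dst, label); emit(frame, label): log-likelihood
        of the frame under the arc's acoustic model (assumed callback).
        """
        scores, back = {start: 0.0}, {start: ()}
        for frame in frames:                  # advance all active states each frame
            new_scores, new_back = {}, {}
            for src, dst, label in arcs:
                if src not in scores:
                    continue
                s = scores[src] + emit(frame, label)
                if s > new_scores.get(dst, -math.inf):
                    new_scores[dst], new_back[dst] = s, back[src] + (label,)
            scores, back = new_scores, new_back
        reached = [st for st in finals if st in scores]
        if not reached:
            return None, ()
        best = max(reached, key=lambda st: scores[st])
        return scores[best], back[best]
    ```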