
IEICE TRANSACTIONS on Fundamentals


Volume E74-A No.7  (Publication Date:1991/07/25)

    Special Issue on Continuous Speech Recognition and Understanding
  • FOREWORD

    Katsuhiko SHIRAI  Sadaoki FURUI  

     
    FOREWORD

      Page(s):
    1759-1760
  • Robustness of Phoneme-Based HMMs against Speaking-Style Variations

    Tatsuo MATSUOKA  Kiyohiro SHIKANO  

     
    PAPER-Phoneme Recognition and Word Spotting

      Page(s):
    1761-1767

In a practical continuous speech recognition system, the target speech is often spoken in a different speaking style (e.g., speed or loudness) from the training speech. It is difficult to cope with such speaking-style variations because the amount of training speech is limited. Therefore, acoustic modeling should be robust against different styles of speech in order to obtain high recognition performance from the limited training speech. This paper describes the robustness of six types of phoneme-based HMMs against speaking-style variations. The six types of model were VQ- and FVQ-based discrete HMMs, and single-Gaussian and mixture-Gaussian HMMs with either diagonal or full covariance matrices. They were investigated using isolated word utterances, phrase-by-phrase utterances and fluently spoken utterances, with different utterance types for training and testing. The experimental results show that the mixture-Gaussian HMM with diagonal covariance matrices is the most promising choice. The FVQ-based HMM and the single-Gaussian HMM with full covariance matrices also achieved good results. The mixture-Gaussian HMM with full covariance matrices sometimes achieved very high accuracies, but often suffered from "overtuning" or a lack of training data. Finally, this paper proposes a new model-adaptation technique that combines multiple models with appropriate weighting factors. Each model has different characteristics (e.g., coverage of speaking styles and sensitivity to data), and the weighting factors can be estimated using "deleted interpolation". When the mixture-Gaussian diagonal covariance models were used as baseline models, this technique achieved better recognition accuracy than a model trained using all three utterance types at once. The advantage of this technique is that estimating the weighting factors is stable even from a limited amount of training speech, because there are few free parameters to be estimated.
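As a rough illustration of the deleted-interpolation idea mentioned in the abstract, the following sketch estimates weighting factors for combining multiple models by EM on held-out likelihoods. The function names and data layout are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch: estimate interpolation weights for combining
# several acoustic models, using EM on held-out per-frame likelihoods.
# Data layout (list of frames, each a list of per-model likelihoods)
# is an assumption for this sketch.

def estimate_weights(heldout_probs, iters=50):
    """heldout_probs: list of frames; each frame is a list of K
    per-model likelihoods p_k(x_t). Returns EM-estimated weights."""
    k = len(heldout_probs[0])
    w = [1.0 / k] * k
    for _ in range(iters):
        counts = [0.0] * k
        for frame in heldout_probs:
            # posterior responsibility of each model for this frame
            total = sum(wi * pi for wi, pi in zip(w, frame))
            for i in range(k):
                counts[i] += w[i] * frame[i] / total
        w = [c / len(heldout_probs) for c in counts]
    return w

def combined_prob(w, probs):
    """Weighted combination of the component model probabilities."""
    return sum(wi * pi for wi, pi in zip(w, probs))
```

Because each EM update normalizes the posterior counts, the weights always sum to one, and with only K free parameters the estimate stays stable on small held-out sets, which matches the stability argument in the abstract.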

  • Word Spotting Using Context-Dependent Phoneme-Based HMMs

    Tatsuo MATSUOKA  

     
    PAPER-Phoneme Recognition and Word Spotting

      Page(s):
    1768-1772

In a practical continuous speech recognition system, input speech includes many extraneous words. Furthermore, detecting the beginning point of the target word is very difficult. Under those circumstances, word spotting is useful for extracting and recognizing the target speech from such input speech. On the other hand, a phoneme-based HMM is useful for large-vocabulary word recognition. Training a phoneme-based HMM is easier and more stable than training a word-based HMM when training speech is limited, because there are several times more phoneme tokens than word tokens in the training speech. For these reasons, we use word spotting with phoneme-based HMMs. Furthermore, for more precise modeling, we chose context-dependent phoneme modeling. This paper proposes a new clustering method for context-dependent phoneme HMMs. This clustering method uses triphone contexts when training samples are sufficient, and automatically selects biphone and uniphone contexts if only a few training samples are given. Using this clustering method, context-dependent models were created and tested in phoneme recognition and word spotting experiments. The context-dependent models achieved 90.0% phoneme recognition accuracy, which is 7.6% higher than the context-independent models, and 69.2% word spotting accuracy, which is 7.0% higher than the context-independent models.
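The count-based context selection described in the abstract (triphone when training samples suffice, backing off to biphone and then uniphone contexts otherwise) can be sketched as follows; the threshold value and data structures are illustrative assumptions, not the paper's clustering algorithm itself.

```python
# Sketch of count-based context backoff for phoneme model selection:
# use the richest context for which enough training samples exist.
# The threshold (min_count) is an assumption for illustration.

def select_context(counts, left, phone, right, min_count=10):
    """counts maps context tuples to training-sample counts.
    Returns the context tuple to model: triphone, biphone or uniphone."""
    if counts.get((left, phone, right), 0) >= min_count:
        return (left, phone, right)   # triphone: enough samples
    if counts.get((left, phone), 0) >= min_count:
        return (left, phone)          # back off to left-biphone
    return (phone,)                   # context-independent uniphone
```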

  • A Japanese Text Dictation System Based on Phoneme Recognition and a Dependency Grammar

    Shozo MAKINO  Akinori ITO  Mitsuru ENDO  Ken'iti KIDO  

     
    PAPER-Dictation Systems

      Page(s):
    1773-1782

This paper gives an overview of a Japanese text dictation system composed of an acoustic processor and a linguistic processor. The system deals with 843 conceptual words and 431 functional words. Phoneme recognition is carried out using the modified LVQ2 method that we propose. The phoneme recognition score was 86.1% for 226 sentences uttered by two male speakers. The linguistic processor is composed of a processor for spotting Bunsetsu-units and a syntactic processor. The structure of the Bunsetsu-unit is effectively described by a finite-state automaton. The test-set perplexity of the finite-state automaton is 230. In the processor for spotting Bunsetsu-units, using a syntax-driven continuous-DP matching algorithm, the Bunsetsu-units are spotted from a recognized phoneme sequence and a Bunsetsu-unit lattice is generated. In the syntactic processor, the Bunsetsu-unit lattice is parsed based on the dependency grammar. The dependency grammar is expressed as the correspondence between a FEATURE marker in a modifier-Bunsetsu and a SLOT-FILLER marker in a head-Bunsetsu. The recognition scores for Bunsetsu-units and conceptual words were 73.2% and 85.7% for 226 sentences uttered by the two male speakers.

  • Japanese Phonetic Typewriter Using HMM Phone Recognition and Stochastic Phone-Sequence Modeling

    Takeshi KAWABATA  Toshiyuki HANAZAWA  Katsunobu ITOH  Kiyohiro SHIKANO  

     
    PAPER-Dictation Systems

      Page(s):
    1783-1787

A phonetic typewriter is an unlimited-vocabulary continuous speech recognition system that recognizes each phone in speech without the need for lexical information. This paper describes a Japanese phonetic typewriter system based on HMM phone recognition and syllable-based stochastic phone sequence modeling. Even though HMM methods have considerable capacity for recognizing speech, it is difficult to recognize individual phones in continuous speech without lexical information. HMM phone recognition is improved by incorporating syllable trigrams for phone sequence modeling. HMM phone units are trained using an isolated word database, and their duration parameters are modified according to speaking rate. Syllable trigram tables are made from a text database of over 300,000 syllables, and phone sequence probabilities calculated from the trigrams are combined with HMM probabilities. Using these probabilities to limit the number of intermediate candidates leads to an accurate phonetic typewriter system without requiring excessive computation time. An interpolated n-gram approach to phone sequence modeling is shown to be more effective than a simple trigram method.
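The combination of syllable n-gram probabilities with HMM scores can be illustrated with a minimal interpolated-trigram sketch. The interpolation weights, count layout and language-model weight below are assumptions for illustration, not the paper's actual values.

```python
# Sketch of an interpolated trigram language model: trigram, bigram
# and unigram relative-frequency estimates are linearly interpolated,
# and the language score is combined with an HMM acoustic log score.
import math

def interp_trigram(counts3, counts2, counts1, total, h2, h1, s,
                   lambdas=(0.6, 0.3, 0.1)):
    """Linearly interpolated estimate of P(s | h2, h1).
    counts3/counts2/counts1 map n-gram tuples (or symbols) to counts;
    total is the corpus size. The lambdas are illustrative."""
    l3, l2, l1 = lambdas
    c_hist2 = counts2.get((h2, h1), 0)
    p3 = counts3.get((h2, h1, s), 0) / c_hist2 if c_hist2 else 0.0
    c_hist1 = counts1.get(h1, 0)
    p2 = counts2.get((h1, s), 0) / c_hist1 if c_hist1 else 0.0
    p1 = counts1.get(s, 0) / total
    return l3 * p3 + l2 * p2 + l1 * p1

def combined_score(acoustic_logp, lm_prob, lm_weight=1.0):
    """Combine the HMM log likelihood with the language-model score."""
    return acoustic_logp + lm_weight * math.log(lm_prob)
```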

  • Connected Spoken Word Recognition Using the Markov Model for the Feature Vector

    Tomio TAKARA  Tomoki YAKABU  

     
    PAPER-Continuous Speech Recognition

      Page(s):
    1788-1796

This paper reports on a new application of the Markov model to an automatic speech recognition system, in which the feature vectors of speech are regarded as representing the states and the output symbols of the Markov model. The state-transition probability and the symbol-output probability are assumed to be represented by multidimensional normal density functions of the feature vector. The DP-matching algorithm is used to calculate the optimum time sequence of observed feature vectors. To confirm the efficiency of this system, we experimentally compared its performance with that of other approaches, such as those using the Mahalanobis distance or the Euclidean distance. Based on experiments in a speaker-independent mode, using a vocabulary of Japanese single-digit and four-digit numerals, the current system is shown to be more effective than the others.
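A minimal DP-matching (DTW) sketch of the kind of alignment the abstract compares against is given below; the optional per-dimension weighting stands in for a diagonal-covariance Mahalanobis distance. This is an illustrative sketch, not the authors' system.

```python
# Sketch of DP matching (dynamic time warping) between a reference
# feature-vector sequence and an observed one. With inv_var supplied,
# the frame distance becomes a diagonal Mahalanobis distance;
# otherwise it is squared Euclidean.

def dtw(ref, obs, inv_var=None):
    """ref, obs: lists of feature vectors (lists of floats).
    inv_var: optional per-dimension inverse variances.
    Returns the accumulated distance along the optimal warping path."""
    def dist(a, b):
        w = inv_var or [1.0] * len(a)
        return sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b))
    inf = float('inf')
    n, m = len(ref), len(obs)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # standard symmetric local path: insertion, deletion, match
            d[i][j] = dist(ref[i - 1], obs[j - 1]) + min(
                d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```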

  • An Integration of Knowledge and Neural Networks toward a Phoneme Typewriter without a Language Model

    Yasuhiro KOMORI  Kaichiro HATAZAKI  

     
    PAPER-Continuous Speech Recognition

      Page(s):
    1797-1805

In this paper, a speech recognition system aimed at a phoneme typewriter without a language model is proposed. The system is realized as an integration of spectrogram-reading knowledge and Time-Delay Neural Networks (TDNNs). The system mainly consists of two parts. In the consonant recognition part, a sophisticated integration of knowledge and a TDNN is proposed, which not only improves recognition performance and segmentation accuracy but also reduces insertion errors drastically. In the vowel recognition part, a TDNN is used for detection and rough segmentation, exploiting its time-shift tolerance; the knowledge part is mainly used for verification of categories and boundaries. A phoneme recognition experiment on 2,620 Japanese words uttered by one male speaker showed a 91.4% (11,612/12,710) recognition rate, a 3.6% deletion error rate, a 5.0% substitution error rate and a 20.7% insertion error rate for all Japanese phonemes. This good result was obtained without any language model.

  • Continuous Speech Recognition Using Two-Level LR Parsing

    Kenji KITA  Toshiyuki TAKEZAWA  Tsuyoshi MORIMOTO  

     
    PAPER-Continuous Speech Recognition

      Page(s):
    1806-1810

This paper describes a continuous speech recognition system using two-level LR parsing and phone-based HMMs. ATR has already implemented a predictive LR parsing algorithm in an HMM-based speech recognition system for Japanese. However, up to now, this system has used only intra-phrase grammatical constraints. In Japanese, a sentence is composed of several phrases; thus, two kinds of grammars, namely an intra-phrase grammar and an inter-phrase grammar, are sufficient for recognizing sentences. Two-level LR parsing makes it possible to use not only intra-phrase grammatical constraints but also inter-phrase grammatical constraints during speech recognition. The system was applied to Japanese sentence recognition, where sentences were uttered phrase by phrase, and attained a word accuracy of 95.9% and a sentence accuracy of 84.7%.

  • Processing Unknown Words in Continuous Speech Recognition

    Kenji KITA  Terumasa EHARA  Tsuyoshi MORIMOTO  

     
    PAPER-Continuous Speech Recognition

      Page(s):
    1811-1816

Current continuous speech recognition systems essentially ignore unknown words; systems are designed to recognize only words in the lexicon. However, to use speech recognition systems in a real application such as spoken-language processing, it is very important to handle unknown words. This paper proposes a continuous speech recognition method which accepts any utterance that might include unknown words. In this method, words not in the lexicon are transcribed as phone sequences, while words in the lexicon are recognized correctly. The HMM-LR speech recognition system, which is an integration of hidden Markov models and generalized LR parsing, is used as the baseline system and is enhanced with a syllable trigram model to take into account the stochastic characteristics of the language. In our approach, two kinds of grammars, a task grammar which describes the task and a phonetic grammar which describes constraints between phones, are merged and used in the HMM-LR system. The system can output a phonetic transcription for an unknown word by using the phonetic grammar. Experimental results indicate that our approach is very promising.

  • A Large Vocabulary Continuous Speech Recognition System with High Predictability

    Minoru SHIGENAGA  Yoshihiro SEKIGUCHI  Takehiro YAMAGUCHI  Ryouta MASUDA  

     
    PAPER-Continuous Speech Recognition

      Page(s):
    1817-1825

A large-vocabulary (1019 words and 1382 kinds of inflectional endings) continuous speech recognition system with high predictability, applicable to any task and having an unsupervised speaker-adaptation capability, is described. Phoneme identification is based on various features. Speaker adaptation is performed using reliably identified phonemes. Phrase boundaries are detected using prosodic information. The syntactic analyzer uses a syntactic state transition network and outputs syntactic interpretations. The semantic analyzer deals with the meaning of each word, the dependency relationships between words, the extended case structures of predicates, and an associative function, in universally applicable forms. The extended case grammar, with a four-item set of case structures, and the dependency relationships between words are based on the semantic attributes of the related words, and together with the associative function they realize a universally applicable, high prediction capability.

  • Continuous Speech Recognition Using a Dependency Grammar and Phoneme-Based HMMs

    Sho-ichi MATSUNAGA  Shigeru HOMMA  Shigeki SAGAYAMA  Sadaoki FURUI  

     
    PAPER-Continuous Speech Recognition

      Page(s):
    1826-1833

This paper describes two Japanese continuous speech recognition systems (system-1 and system-2) based on phoneme-based HMMs and a two-level grammar approach. The two grammars are an intra-phrase transition network grammar for phrase recognition and an inter-phrase dependency grammar for sentence recognition. A joint score, combining acoustic likelihood and linguistic certainty factors derived from phoneme-based HMMs and dependency rules, is maximized to obtain the best sentence recognition results. System-1 is tuned for sentences uttered phrase by phrase and system-2 is tuned for sentence utterances, to keep the amount of computation practical. In system-1, two efficient parsing algorithms are used, one for each grammar: a bi-directional network parser and a breadth-first dependency parser. With the phrase-network parser, input phrase utterances are parsed bi-directionally, both left-to-right and right-to-left, and optimal Viterbi paths are found along which the accumulated phonetic likelihood is maximized. The dependency parser utilizes efficient breadth-first search and beam search algorithms. For system-2, we have extended the dependency analysis algorithm to sentence utterances, using a technique for detecting most-likely multi-phrase candidates based on Viterbi phrase alignment. Where the perplexity of the phrase syntax is 40, system-1 and system-2 increase phrase recognition performance in the sentence by approximately 6% and 14%, respectively, showing the effectiveness of semantic dependency analysis.

  • Connectionist Approaches to Large Vocabulary Continuous Speech Recognition

    Hidefumi SAWAI  Yasuhiro MINAMI  Masanori MIYATAKE  Alex WAIBEL  Kiyohiro SHIKANO  

     
    PAPER-Continuous Speech Recognition

      Page(s):
    1834-1844

This paper describes recent progress in a connectionist large-vocabulary continuous speech recognition system integrating speech recognition and language processing. The speech recognition part consists of Large Phonemic Time-Delay Neural Networks (TDNNs) which can automatically spot all 24 Japanese phonemes (i.e., 18 consonants /b/, /d/, /g/, /p/, /t/, /k/, /m/, /n/, /N/, /s/, /sh/ ([ʃ]), /h/, /z/, /ch/ ([tʃ]), /ts/, /r/, /w/, /y/ ([j]), 5 vowels /a/, /i/, /u/, /e/, /o/, and a double consonant /Q/ or silence) by simply scanning the input speech without any specific segmentation techniques. The language processing part is made up of a predictive LR parser, in which the LR parser is guided by an LR parsing table automatically generated from context-free grammar rules and proceeds left-to-right without backtracking. Time alignment between the predicted phonemes and the sequence of TDNN phoneme outputs is carried out by the DTW matching method. We call this 'hybrid' integrated recognition system the 'TDNN-LR' method. We report that large-vocabulary isolated word and continuous speech recognition using the TDNN-LR method provided excellent speaker-dependent recognition performance, where incremental training using a small number of training tokens was found to be very effective for adaptation to speaking rate. Furthermore, we report some new achievements as extensions of the TDNN-LR method: (1) two proposed NN architectures provide robust phoneme recognition performance under variations in speaking manner, (2) a speaker-adaptation technique can be realized using an NN mapping function between input and standard speakers, and (3) new architectures proposed for speaker-independent recognition provide performance that nearly matches speaker-dependent recognition performance.

  • Spontaneous Speech Understanding Based on Cooperative Problem-Solving

    Akio KOMATSU  Eiji OOHIRA  Akira ICHIKAWA  

     
    PAPER-Speech Understanding

      Page(s):
    1845-1853

Natural spontaneous speech is so ambiguous that a system for understanding it requires the cooperation of many knowledge sources. Thus, in order to integrate speech processing and language processing, it is necessary to provide a system with a mechanism for supporting such cooperation. We propose here a general framework for cooperative problem-solving, based on the blackboard model and a TMS (truth maintenance system) with an enhanced proving function. In this framework, a reasonably consistent interpretation is automatically kept on the blackboard, while each knowledge source performs its own inference and puts the results on the blackboard. Based on this framework, a model has been established for a system which can understand spontaneous speech through the cooperation of independent knowledge sources. Most notably, prosodic information is used as suprasegmental cues to infer the structure of spontaneous speech. This allows robust parsing of spoken sentences. The feasibility and validity of our basic framework have been confirmed by computer simulation experiments on spontaneous speech.

  • Comparison of Syntax-Oriented Spoken Japanese Understanding System with Semantic-Oriented System

    Seiichi NAKAGAWA  Yoshimitsu HIRATA  Isao MURASE  Tomohiro TANOUE  

     
    PAPER-Speech Understanding

      Page(s):
    1854-1862

This paper describes syntax- and semantics-oriented spoken Japanese understanding systems named "SPOJUS-SYNO" and "SPOJUS-SEMO" and compares them. First, these systems build hidden Markov models (HMMs) of word units automatically by concatenating syllables. Then a word lattice is hypothesized for an input utterance by using a word-spotting algorithm and the word-based HMMs. In SPOJUS-SYNO, a time-synchronous left-to-right parsing algorithm is executed to find the best word sequence from the word lattice according to syntactic and semantic knowledge represented by a context-free semantic grammar. In SPOJUS-SEMO, the syntactic and semantic knowledge is represented by a dependency and case grammar. These systems were implemented for the "UNIX-QA" task with a vocabulary size of 521 words. Experimental results show that the sentence recognition/understanding rate was about 80/87% for six male speakers with SPOJUS-SYNO, but performance was very low with SPOJUS-SEMO.

  • SUSKIT: A Speech Understanding System Based on Robust Phone Spotting

    Yutaka KOBAYASHI  Masanori OMOTE  Hidenori ENDO  Yasuhisa NIIMI  

     
    PAPER-Speech Understanding

      Page(s):
    1863-1869

This paper gives an overview of our speech understanding system and reports recent results of sentence recognition experiments. The system, which we call SUSKIT, recognizes database queries in natural Japanese sentences. The user is expected to speak sentence by sentence. Among the difficult problems to overcome, this study paid prime attention to how to cope with the contextual variations of pronunciations and how to verify partial sentence hypotheses in a hierarchical system. SUSKIT predicts word strings in a top-down manner; however, the verification of hypotheses against the input speech is done using a unit independent of word boundaries. Words are not suitable units of verification, because the smoothing effect due to phonetic contexts makes it difficult to recognize short words. In order to avoid the misrecognition caused by the smoothing effect across word boundaries, SUSKIT dynamically extracts phoneme strings bounded by easily detectable phonemes from the predicted word string as verification templates. A left-to-right time-synchronous beam-search strategy was adopted for searching for likely sentences. We carried out sentence recognition experiments using a speech corpus consisting of 159 sentences read by three Japanese male speakers. The task perplexity was 8.3. Using speaker-dependent HMM parameters, we obtained sentence recognition rates of 83.0-92.5%.

  • A Generic Framework Based on ATMS for Speech Understanding Systems

    Shingo NISHIOKA  Osamu KAKUSHO  Riichiro MIZOGUCHI  

     
    PAPER-Speech Understanding

      Page(s):
    1870-1880

A speech understanding system is confronted with ambiguities caused by acoustic-phonetic errors and the multiple meanings of words, so an effective framework is required to resolve this ambiguity. The speech understanding system described in this paper deals with two different kinds of phrases to avoid combinatorial explosion, and it is constructed on an ATMS-based problem-solving system to extract maximum performance. Experimental results show that the time consumed by the speech understanding system is reduced to 1/10. Furthermore, to evaluate the generality and effectiveness of the ATMS-based problem-solving system, the results of another experiment are also presented in this paper.

  • MASCOTS: Dialog Management System for Speech Understanding System

    Tetsuya YAMAMOTO  Yoshikazu OHTA  Yoichi YAMASHITA  Osamu KAKUSHO  Riichiro MIZOGUCHI  

     
    PAPER-Speech Understanding

      Page(s):
    1881-1888

This paper describes a dialog management system called MASCOTS which manages a dialog between a user and a problem-solving system through spoken Japanese and helps the speech understanding system in its language processing. MASCOTS tries to predict the next user utterance based on an architecture for managing dialog with two stacks and plan information. MASCOTS not only contributes to making language processing efficient, but also works for the problem-solving system: it identifies the kind of utterance and standardizes its representation form on behalf of the problem-solving system. In this paper, the architecture of MASCOTS is discussed, focusing on the characteristics of dialog and on two ways of predicting the next user utterance by exchanging information with the language processing system.

  • Integration of Speech Recognition and Language Processing in a Japanese to English Spoken Language Translation System

    Tsuyoshi MORIMOTO  Kiyohiro SHIKANO  Kiyoshi KOGURE  Hitoshi IIDA  Akira KUREMATSU  

     
    PAPER-Speech Understanding

      Page(s):
    1889-1896

The experimental spoken language translation system (SL-TRANS) has been implemented. It can recognize Japanese speech, translate it into English, and output synthesized English speech. One of the most important problems in realizing such a system is how to integrate, or connect, speech recognition and language processing. In this paper, a new method realized in the system is described. The method is composed of three processes: grammar-driven predictive speech recognition, Kakariuke-dependency-based candidate filtering, and HPSG-based lattice parsing supplemented with a sentence preference mechanism. Input speech is uttered phrase by phrase. The speech recognizer takes an input phrase utterance and outputs several candidates with recognition scores for each phrase. A Japanese phrasal grammar is used in recognition; it contributes to the output of grammatically well-formed phrase candidates, as well as to the reduction of phone perplexity. The candidate filter takes a phrase lattice, which is a sequence of multiple candidates for a phrase, and outputs a reduced phrase lattice. It removes semantically inappropriate phrase candidates by applying the Kakariuke dependency relationship between phrases. Finally, the HPSG-based lattice parser takes a phrase lattice and chooses the most plausible sentence by checking syntactic and semantic legitimacy and evaluating sentential preference. Experimental results for the system are also reported, and the usefulness of the method is confirmed.

  • Comparison of Language Models by Context-Free Grammar, Bigram and Quasi/Simplified-Trigram

    Seiichi NAKAGAWA  Isao MURASE  

     
    PAPER-Language Modeling

      Page(s):
    1897-1905

In this paper, we investigate language models based on a context-free grammar, a bigram and a quasi/simplified-trigram. To calculate the statistics of the bigram and quasi/simplified-trigram, we used a set of sentences generated randomly from the CFG that are legal in terms of semantics. We compared the models on their perplexities and on sentence recognition accuracies. Sentence recognition was tested on the "UNIX-QA" task with a vocabulary size of 521 words. From these results, the perplexities of the bigram and quasi-trigram were about 1.5-1.7 times and 1.2-1.3 times larger than the perplexity of the CFG that corresponds to the most restricted grammar (perplexity = 10.0), and we found that the quasi-trigram has almost the same modeling ability as the restricted CFG when the set of plausible sentences in the task is given.
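Test-set perplexity, the comparison measure used above, can be sketched as follows. This is an illustrative definition only; the model interface (a probability function over word and history) is an assumption.

```python
# Sketch of test-set perplexity: 2 ** (cross-entropy in bits per word)
# over a held-out corpus, for any language model exposed as a
# conditional probability function prob(word, history).
import math

def perplexity(sentences, prob):
    """sentences: list of word lists; prob(w, history) -> P(w | history).
    Returns the test-set perplexity over all words."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        hist = []
        for w in sent:
            log_sum += math.log2(prob(w, tuple(hist)))
            hist.append(w)
            n += 1
    return 2 ** (-log_sum / n)
```

For a uniform model over a vocabulary of V words, this returns exactly V, which is why perplexity is often read as an effective branching factor.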

  • Creating Speech Corpora for Speech Science and Technology

    Shuichi ITAHASHI  

     
    PAPER-Speech Database

      Page(s):
    1906-1910

This paper describes recent speech database efforts in Japan in which the author has been involved. The JEIDA Japanese Common Speech Data Corpus was first reported in 1986 and has recently been converted to DAT. The JEIDA Noise Database has recently been released to the public; it contains various kinds of environmental noise and standard noise for sound level calibration. The 'Spoken Language' project collected speech data including continuous speech spoken by 10 males and 10 females. The 'Spoken Japanese' project, started in 1989, attempts to collect various dialectal speech from all over Japan and create speech databases. A compact disc containing a fairy tale and a weather forecast spoken by 20 dialect speakers has been produced. The paper also describes the Continuous Speech Database Committee, which was recently established by the Acoustical Society of Japan.

  • Regular Section
  • Alternate Approach to the Stability of Linear Combinations of Polynomials

    Norio FUKUMA  Takehiro MORI  

     
    PAPER-Control and Computing

      Page(s):
    1911-1914

The stability of convex combinations of polynomials and the stability margin of stable polynomials are studied using Hermite matrices for continuous-time systems. Available results are found to impose a heavy computational burden, especially in checking the stability of a polytope of polynomials by means of "the edge theorem". We propose alternative stability conditions and a margin which reduce the computational burden. In our approach, the stability condition reported by Bialas and Garloff can be derived readily.
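The edge theorem reduces the stability of a polytope of polynomials to one-parameter edges, i.e., convex combinations of pairs of vertex polynomials. A numerical sketch of checking Hurwitz stability along such an edge with a Routh array is given below; it illustrates the setting only, and the paper's Hermite-matrix conditions are different and not reproduced here.

```python
# Sketch: Hurwitz stability (all roots in the open left half-plane)
# via the Routh array, applied along a convex-combination edge
# lambda*p + (1 - lambda)*q of two polynomials.

def routh_stable(coeffs):
    """coeffs: polynomial coefficients, highest degree first.
    Returns True iff all roots have negative real part (regular case;
    zero pivot rows are not handled in this sketch)."""
    c = list(coeffs)
    if c[0] < 0:
        c = [-x for x in c]
    if any(x <= 0 for x in c):
        return False              # positivity is necessary for Hurwitz
    r1, r2 = c[0::2], c[1::2]     # first two rows of the Routh array
    while r2:
        pad = max(len(r1), len(r2) + 1)
        r1 = r1 + [0.0] * (pad - len(r1))
        r2 = r2 + [0.0] * (pad - len(r2))
        nxt = [(r2[0] * r1[i + 1] - r1[0] * r2[i + 1]) / r2[0]
               for i in range(pad - 1)]
        while nxt and nxt[-1] == 0.0:
            nxt.pop()
        if nxt and nxt[0] <= 0:
            return False          # sign change in the first column
        r1, r2 = r2, nxt
    return True

def edge_combination(p, q, lam):
    """Convex combination lam*p + (1 - lam)*q, coefficient-wise."""
    return [lam * a + (1 - lam) * b for a, b in zip(p, q)]
```

Sampling lambda over [0, 1] and testing each combination is the brute-force check whose cost motivates the closed-form conditions studied in the paper.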

  • A Note on Dual Trail Partition of a Plane Graph

    Shuichi UENO  Katsufumi TSUJI  Yoji KAJITANI  

     
    LETTER-Graphs, Networks and Matroids

      Page(s):
    1915-1917

    Given a plane graph G, a trail of G is said to be dual if it is also a trail in the geometric dual of G. We show that the problem of partitioning the edges of G into the minimum number of dual trails is NP-hard.