Hiroyuki SUZUKI Heiga ZEN Yoshihiko NANKAKU Chiyomi MIYAJIMA Keiichi TOKUDA Tadashi KITAMURA
This paper describes continuous speech recognition that incorporates additional complementary information, e.g., voice characteristics, speaking styles, linguistic information, and noise environment, into HMM-based acoustic modeling. Speech recognition systems commonly use context-dependent HMMs, i.e., triphones, together with tree-based context clustering. In recent years, several attempts have been made to use not only phonetic contexts but also such complementary information, through context (factor) dependent HMMs. However, when the additional factors for the test data are unobserved, a method for obtaining factor labels is required before decoding. In this paper, we propose a model integration technique based on general factor-dependent HMMs for decoding. The integrated HMMs can be used by a conventional decoder as standard triphone HMMs with Gaussian mixture densities. Moreover, by using the results of context clustering, the proposed method can determine the optimal number of mixture components for each state according to the degree of influence of the additional factors. Phoneme recognition experiments using voice characteristic labels show significant improvements with a small number of model parameters, and a 19.3% error reduction was obtained in noise-environment experiments.
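To make the integration idea concrete, the following is a minimal sketch, not the authors' implementation: factor-dependent Gaussians sharing one HMM state are collapsed into a single Gaussian-mixture state whose weights are the prior probabilities of the unobserved factors, so a conventional mixture-density decoder can evaluate it. The factor names, priors, and dimensions are illustrative assumptions.

```python
# Sketch: collapse factor-dependent Gaussians of one state into a mixture
# weighted by factor priors, so the state looks like a standard GMM state.
import numpy as np

def integrate_factor_state(factor_gaussians, factor_priors):
    """factor_gaussians: {factor: (mean, var)}; factor_priors: {factor: P(factor)}."""
    weights, means, variances = [], [], []
    for factor, (mean, var) in factor_gaussians.items():
        weights.append(factor_priors[factor])
        means.append(np.asarray(mean, dtype=float))
        variances.append(np.asarray(var, dtype=float))
    weights = np.asarray(weights)
    return weights / weights.sum(), np.stack(means), np.stack(variances)

def mixture_log_likelihood(x, weights, means, variances):
    """Log-likelihood of observation x under the integrated diagonal-covariance mixture."""
    x = np.asarray(x, dtype=float)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())

# Illustrative use: two voice-characteristic factors sharing one triphone state.
state = {"male": ([0.0, 1.0], [1.0, 1.0]), "female": ([0.5, 1.5], [1.2, 0.9])}
priors = {"male": 0.5, "female": 0.5}
w, mu, var = integrate_factor_state(state, priors)
print(mixture_log_likelihood([0.2, 1.1], w, mu, var))
```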
Ekkarit MANEENOI Visarut AHKUPUTRA Sudaporn LUKSANEEYANAWIN Somchai JITAPUNKUL
This paper presents a study on acoustic modeling for speech recognition of predominantly monosyllabic languages. Various speech units used in speech recognition systems are investigated. To evaluate the effectiveness of these acoustic models, the Thai language is selected, since it is a predominantly monosyllabic language with a complex vowel system. Several experiments were carried out to find the speech unit that can be modeled accurately and gives a higher recognition rate. Recognition rates under the different acoustic models are given and compared. In addition, this paper proposes a new speech unit for speech recognition, namely the onset-rhyme unit. Two models are proposed: the Phonotactic Onset-Rhyme Model (PORM) and the Contextual Onset-Rhyme Model (CORM). The models comprise a pair of onset and rhyme units that together make up a syllable. An onset comprises an initial consonant and its transition towards the following vowel, while the rhyme consists of a steady vowel segment and a final consonant. Experimental results show that the onset-rhyme model is more effective than other speech units. It improves on the accuracy of the inter-syllable triphone model by nearly 9.3% and of the context-dependent Initial-Final model by nearly 4.7% for speaker-dependent systems using only an acoustic model, and by 5.6% and 4.5%, respectively, for speaker-dependent systems using both acoustic and language models. The results show that the onset-rhyme models attain a high recognition rate and are also more efficient in terms of system complexity.
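The following is a minimal sketch of how a syllable might be split into the proposed onset and rhyme units. The phone inventory and the example syllables are illustrative assumptions, not the paper's actual Thai phone set.

```python
# Sketch: split a syllable [initial, vowel, (final)] into an onset unit
# (initial consonant plus its transition into the vowel) and a rhyme unit
# (steady vowel segment plus the final consonant).
INITIAL_CONSONANTS = {"k", "kh", "t", "th", "p", "ph", "m", "n", "s", "r", "l", "w", "j"}

def to_onset_rhyme(syllable_phones):
    """Return (onset, rhyme) units for a phone sequence of one syllable."""
    initial = syllable_phones[0] if syllable_phones[0] in INITIAL_CONSONANTS else ""
    rest = syllable_phones[1:] if initial else syllable_phones
    vowel = rest[0]
    final = rest[1] if len(rest) > 1 else ""
    onset = f"{initial}+{vowel}"   # consonant and its transition towards the vowel
    rhyme = f"{vowel}{final}"      # steady vowel segment and final consonant
    return onset, rhyme

print(to_onset_rhyme(["kh", "aa", "n"]))   # ('kh+aa', 'aan')
print(to_onset_rhyme(["m", "ii"]))         # ('m+ii', 'ii')
```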
Katsutoshi OHTSUKI Tatsuo MATSUOKA Shoichi MATSUNAGA Sadaoki FURUI
In this paper, we propose topic extraction models based on statistical relevance scores between topic words and the words in an article, and report results of topic extraction experiments on continuous speech recognition of Japanese broadcast news utterances. We represent the topic of a news item as a combination of multiple topic words, which are important words in the news article or words relevant to the news. We statistically model the mapping from words in an article to topic words; using this mapping, the topic extraction model can extract topic words even if they do not appear in the article. We train a topic extraction model that computes the degree of relevance between a topic word and a word in an article using newspaper text covering a five-year period. The degree of relevance between the words is calculated using measures such as mutual information or the χ2 statistic. In experiments extracting five topic words using a χ2-based model, we achieve 72% precision and 12% recall on speech recognition results. Speech recognition results generally include recognition errors, which degrade topic extraction performance. To mitigate this, we employ N-best candidates and the likelihoods given by the acoustic and language models. In experiments, we find that extracting five topic words using the N-best candidates and likelihood values significantly improves precision.
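The sketch below, with made-up counts and a hypothetical relevance table, illustrates the general shape of a χ2-style relevance score between topic words and article words, and how the best-scoring topic words could be extracted from (possibly errorful) recognized text.

```python
# Sketch: chi-square association from co-occurrence counts, and topic-word
# extraction by summing relevance scores over the recognized article words.
from collections import Counter

def chi_square(n_both, n_topic, n_word, n_total):
    """2x2 chi-square statistic from co-occurrence counts in training text."""
    a = n_both
    b = n_topic - n_both
    c = n_word - n_both
    d = n_total - n_topic - n_word + n_both
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n_total * (a * d - b * c) ** 2 / denom

def extract_topics(article_words, relevance, k=5):
    """Pick the k topic words with the highest accumulated relevance."""
    scores = Counter()
    for word in article_words:
        for topic_word, score in relevance.get(word, {}).items():
            scores[topic_word] += score
    return [t for t, _ in scores.most_common(k)]

# Illustrative relevance table: relevance[article_word][topic_word] = score.
relevance = {
    "earthquake": {"disaster": chi_square(40, 120, 90, 100000), "election": 0.1},
    "evacuate":   {"disaster": chi_square(25, 120, 40, 100000)},
}
print(extract_topics(["earthquake", "evacuate", "today"], relevance))
```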
Ingrid KIRSCHNING Jun-Ichi AOE
The Time-Slicing paradigm is a newly developed method for training neural networks for speech recognition. The neural net is trained to spot the syllables in a continuous stream of speech, generating a transcription of the utterance, be it a word, a phrase, etc. Combined with a simple error recovery method, the desired units (words or phrases) can be retrieved. The paradigm uses a recurrent neural network trained in a modular fashion with natural connectionist glue. It processes the input signal sequentially, regardless of the input's length, and immediately extracts the syllables spotted in the speech stream. The resulting syllable string is then compared to a set of possible words, and the five closest candidates are picked out. In this paper we describe the time-slicing paradigm and the training of the recurrent neural network, together with details about the training samples. We also introduce the concept of natural connectionist glue and the recurrent neural network architecture used for this purpose. Additionally, we explain the errors found in the output, the process used to reduce them and recover the correct words, and the recognition rates of the network and the word recovery rates. The presented examples and recognition rates demonstrate the potential of the time-slicing method for continuous speech recognition.
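A minimal sketch of the word-recovery step described above: the syllable string produced by the network is compared against a word list and the five closest candidates are returned. The lexicon is an illustrative assumption, and difflib's similarity ratio merely stands in for whatever distance measure the paper actually uses.

```python
# Sketch: recover candidate words from a noisy syllable transcription by
# string similarity against the lexicon, keeping the five best matches.
import difflib

def closest_words(syllable_string, lexicon, k=5):
    """Return the k lexicon entries most similar to the recognized string."""
    return sorted(
        lexicon,
        key=lambda w: difflib.SequenceMatcher(None, syllable_string, w).ratio(),
        reverse=True,
    )[:k]

lexicon = ["tokyo", "kyoto", "osaka", "nagoya", "okayama", "toyota"]
print(closest_words("tokyoo", lexicon))   # noisy transcription -> closest words
```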
This paper proposes a Japanese continuous speech recognition mechanism in which a full-sentence-level context-free grammar (CFG) and one kind of semantic constraint, called "dependency relationships between two bunsetsu (a kind of Japanese phrase)," are used during speech recognition in an integrated way. Each dependency relationship is a modification relationship between two bunsetsu; these relationships include the case-frame relationship of a noun bunsetsu to a predicate bunsetsu and adnominal modification relationships such as that of a noun bunsetsu to a noun bunsetsu. To suppress the processing overhead caused by using relationships of this type during speech recognition, no rigorous semantic analysis is performed; instead, a simple "matching with examples" approach is adopted. An experiment was carried out and the results were compared with a case employing only CFG constraints. They show that speech recognition accuracy is improved and that the overhead is small enough.
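The following is a minimal sketch of the "matching with examples" idea under stated assumptions: a candidate dependency between two bunsetsu is accepted when a similar (modifier, head) pair appears in a stored example set, instead of running full semantic analysis. The example pairs and the exact-match test are illustrative, not the paper's actual data or similarity measure.

```python
# Sketch: score a sentence hypothesis by how many of its proposed bunsetsu
# dependencies match stored example pairs.
EXAMPLE_DEPENDENCIES = {
    ("hon-o", "yomu"),   # "book-ACC" -> "read"  (case-frame relationship)
    ("akai", "hana"),    # "red" -> "flower"     (adnominal modification)
}

def dependency_plausible(modifier, head, examples=EXAMPLE_DEPENDENCIES):
    """Accept the dependency if a matching example pair has been seen."""
    return (modifier, head) in examples

def score_hypothesis(dependencies):
    """Count how many proposed dependencies are supported by examples."""
    return sum(dependency_plausible(m, h) for m, h in dependencies)

print(score_hypothesis([("hon-o", "yomu"), ("akai", "yomu")]))  # 1
```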
Osamu YOSHIOKA Yasuhiro MINAMI Kiyohiro SHIKANO
This paper describes a multimodal dialogue system employing speech input. The system uses three input methods (a speech recognizer, a mouse, and a keyboard) and two output methods (a display and sound). For the speech recognizer, an algorithm for large-vocabulary speaker-independent continuous speech recognition based on the HMM-LR technique is employed. The system is implemented for telephone directory assistance to evaluate the speech recognition algorithm and to investigate the variations in speech structure that users produce when speaking to computers, with speech input used in a multimodal environment. Dialogue data between computers and users are also collected. Twenty telephone-number retrieval tasks are used to evaluate the system. In the experiments, all users are equally trained in using the dialogue system with an interactive guidance system implemented on a workstation. Simplified city maps that indicate subscriber names and addresses are used to reduce the implicit restrictions imposed by written sentences, allowing each user to develop his or her own forms of expression. The task completion rate is 99.0%, and approximately 75% of the users say that they prefer this system to using a telephone book. Moreover, there is a significant decrease in non-keyword usage, i.e., the use of words other than names and addresses, for users who receive more utterance practice.
Yoshikazu YAMAGUCHI Akio OGIHARA Yasuhisa HAYASHI Nobuyuki TAKASU Kunio FUKUNAGA
We propose a continuous speech recognition algorithm utilizing island-driven A* search. The conventional left-to-right A* search is liable to lose the optimal solution from a finite stack if the beginning of the input speech is obscure. The proposed island-driven A* search proceeds forward and backward from the clearest part of the input speech, and thus avoids losing the optimal solution from a finite stack.
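A much simplified, best-first illustration of the island-driven idea follows: rather than expanding hypotheses strictly left to right, the search starts from the frame with the highest local score (the "island") and grows the hypothesis forward and backward, so an obscure utterance onset cannot push the good path out of the finite stack. The scores, pruning rule, and frame-level units are toy assumptions, not the paper's formulation.

```python
# Sketch: grow a hypothesis in both directions from the highest-scoring frame,
# keeping at most max_stack hypotheses (the "finite stack").
import heapq

def island_search(frame_scores, max_stack=8):
    """frame_scores[t] = log score of the best local unit at frame t."""
    island = max(range(len(frame_scores)), key=lambda t: frame_scores[t])
    # Each hypothesis: (negated accumulated score, left frame, right frame).
    stack = [(-frame_scores[island], island, island)]
    best = None
    while stack:
        neg_score, left, right = heapq.heappop(stack)
        if left == 0 and right == len(frame_scores) - 1:
            best = -neg_score          # hypothesis covers the whole utterance
            break
        if left > 0:                    # expand backward
            heapq.heappush(stack, (neg_score - frame_scores[left - 1], left - 1, right))
        if right < len(frame_scores) - 1:   # expand forward
            heapq.heappush(stack, (neg_score - frame_scores[right + 1], left, right + 1))
        if len(stack) > max_stack:
            stack = heapq.nsmallest(max_stack, stack)   # finite stack: keep the best
    return best

print(island_search([-5.0, -0.2, -0.1, -0.3, -4.0]))   # obscure start, clear middle
```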
Naomi INOUE Izuru NOGAITO Masahiko TAKAHASHI
This paper describes the linguistic processing of our speech dialogue system. The processing is composed of two stages: syntactic analysis using a finite state network and discourse analysis using a plan recognition model. The finite state network is compiled from a regular grammar, which is written so as to accept sentences in various styles, for example with ellipsis and inversion, and is automatically generated from a grammar skeleton. The discourse analysis module understands the utterance, generates the next question for the user, and predicts words that will appear in the next utterance. For an extension-number guidance task, we obtained correct recognition results for 93% of the input sentences without word prediction and for 98% when the prediction results include the proper words.
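The sketch below shows, under stated assumptions, the two linguistic components in miniature: a finite state network (written here directly as a transition table rather than compiled from a regular-grammar skeleton) and word prediction for the next utterance. The extension-number-guidance vocabulary is an illustrative guess.

```python
# Sketch: a tiny finite state network plus next-word prediction from a state.
FSN = {
    ("start", "extension"):  "want_number",
    ("start", "please"):     "start",        # fillers left permissive
    ("want_number", "of"):   "want_name",
    ("want_name", "suzuki"): "accept",
    ("want_name", "tanaka"): "accept",
}
ACCEPT_STATES = {"accept"}

def accepts(words):
    """Run the word sequence through the finite state network."""
    state = "start"
    for w in words:
        state = FSN.get((state, w))
        if state is None:
            return False
    return state in ACCEPT_STATES

def predict_next_words(state):
    """Words the discourse module could predict as continuations from a state."""
    return sorted(w for (s, w) in FSN if s == state)

print(accepts(["extension", "of", "suzuki"]))   # True
print(predict_next_words("want_name"))           # ['suzuki', 'tanaka']
```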
Katunobu ITOU Satoru HAYAMIZU Kazuyo TANAKA Hozumi TANAKA
This paper describes design issues of a speech dialogue system, the evaluation of the system, and the collection of spontaneous speech in a transportation guidance domain. Because it is difficult to collect spontaneous speech and to use a real system for collection and evaluation, the phenomena related to dialogues have not yet been quantitatively clarified. The authors constructed a speech dialogue system that operates in almost real time, with acceptable recognition accuracy and flexible dialogue control, and used it to collect spontaneous speech in a transportation guidance domain. In this domain, the system achieves an understanding rate of 84.2% for utterances within the predefined grammar and lexicon. Some statistics of the collected spontaneous speech are also given.
Spoken language systems such as speech-to-speech dialog translation systems have been gaining attention in recent years. These systems require full integration of speech recognition and natural language understanding. This paper presents an efficient parsing algorithm that integrates the search problems of speech processing and language processing. The parsing algorithm proposed here can be regarded as an extension of the finite-state-network-directed, one-pass search algorithm to one directed by a context-free grammar, while retaining the time-synchronous procedure. The extended search algorithm is used to find approximately globally optimal sentence hypotheses; it does not incur the overhead that exists in, for example, hierarchical systems based on the lattice parsing approach. The computational complexity of the search is proportional to the length of the input speech. Since the search process in speech recognition can directly take account of the predictive information in sentence parsing, the framework can be extended to spoken language systems that deal with dynamically varying constraints in dialogue situations.
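The following sketch illustrates only the time-synchronous, one-pass flavour of such a search: hypotheses are advanced frame by frame, and only word successors allowed by the grammar are expanded, so the work per frame is bounded and the total cost grows linearly with utterance length. For brevity the grammar constraint is reduced to a word-successor table; the paper's algorithm is directed by a full context-free grammar.

```python
# Sketch: time-synchronous one-pass search with a grammar-constrained
# successor table standing in for CFG-directed prediction.
GRAMMAR_SUCCESSORS = {
    "<s>":     ["show", "list"],
    "show":    ["flights"],
    "list":    ["flights"],
    "flights": ["</s>"],
}

def one_pass_search(frame_scores):
    """frame_scores[t][word] = acoustic log score of 'word' ending at frame t."""
    active = {"<s>": 0.0}                 # word -> best accumulated score
    for scores in frame_scores:           # time-synchronous frame loop
        new_active = {}
        for prev, acc in active.items():
            for nxt in GRAMMAR_SUCCESSORS.get(prev, []):
                if nxt in scores:
                    cand = acc + scores[nxt]
                    if cand > new_active.get(nxt, float("-inf")):
                        new_active[nxt] = cand
        # Carry hypotheses forward so a word may span several frames.
        new_active.update({w: max(s, new_active.get(w, float("-inf")))
                           for w, s in active.items()})
        active = new_active
    return active.get("</s>")

frames = [{"show": -1.0, "list": -2.0}, {"flights": -0.5}, {"</s>": 0.0}]
print(one_pass_search(frames))   # -1.5 for the best grammar-consistent path
```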