Jiaxin WU Bing LI Li ZHAO Xinzhou XU
Maaki SAKAI Kanon HOKAZONO Yoshiko HANADA
Xuecheng SUN Zheming LU
Yuanhe WANG Chao ZHANG
Jinfeng CHONG Niu JIANG Zepeng ZHUO Weiyu ZHANG
Xiangrun LI Qiyu SHENG Guangda ZHOU Jialong WEI Yanmin SHI Zhen ZHAO Yongwei LI Xingfeng LI Yang LIU
Meiting XUE Wenqi WU Jinfeng LUO Yixuan ZHANG Bei ZHAO
Rong WANG Changjun YU Zhe LYU Aijun LIU
Huijuan ZHOU Zepeng ZHUO Guolong CHEN
Feifei YAN Pinhui KE Zuling CHANG
Manabu HAGIWARA
Ziqin FENG Hong WAN Guan GUI
Sungryul LEE
Feng WANG Xiangyu WEN Lisheng LI Yan WEN Shidong ZHANG Yang LIU
Yanjun LI Jinjie GAO Haibin KAN Jie PENG Lijing ZHENG Changhui CHEN
Ho-Lim CHOI
Feng WEN Haixin HUANG Xiangyang YIN Junguang MA Xiaojie HU
Shi BAO Xiaoyan SONG Xufei ZHUANG Min LU Gao LE
Chen ZHONG Chengyu WU Xiangyang LI Ao ZHAN Zhengqiang WANG
Izumi TSUNOKUNI Gen SATO Yusuke IKEDA Yasuhiro OIKAWA
Feng LIU Helin WANG Conggai LI Yanli XU
Hongtian ZHAO Hua YANG Shibao ZHENG
Kento TSUJI Tetsu IWATA
Yueying LOU Qichun WANG
Menglong WU Jianwen ZHANG Yongfa XIE Yongchao SHI Tianao YAO
Jiao DU Ziwei ZHAO Shaojing FU Longjiang QU Chao LI
Yun JIANG Huiyang LIU Xiaopeng JIAO Ji WANG Qiaoqiao XIA
Qi QI Liuyi MENG Ming XU Bing BAI
Nihad A. A. ELHAG Liang LIU Ping WEI Hongshu LIAO Lin GAO
Dong Jae LEE Deukjo HONG Jaechul SUNG Seokhie HONG
Tetsuya ARAKI Shin-ichi NAKANO
Shoichi HIROSE Hidenori KUWAKADO
Yumeng ZHANG
Jun-Feng LIU Yuan FENG Zeng-Hui LI Jing-Wei TANG
Keita EMURA Kaisei KAJITA Go OHTAKE
Xiuping PENG Yinna LIU Hongbin LIN
Yang XIAO Zhongyuan ZHOU Mingjie SHENG Qi ZHOU
Kazuyuki MIURA
Yusaku HIRAI Toshimasa MATSUOKA Takatsugu KAMATA Sadahiro TANI Takao ONOYE
Ryuta TAMURA Yuichi TAKANO Ryuhei MIYASHIRO
Nobuyuki TAKEUCHI Kosei SAKAMOTO Takuro SHIRAYA Takanori ISOBE
Shion UTSUMI Kosei SAKAMOTO Takanori ISOBE
You GAO Ming-Yue XIE Gang WANG Lin-Zhi SHEN
Zhimin SHAO Chunxiu LIU Cong WANG Longtan LI Yimin LIU Zaiyan ZHOU
Xiaolong ZHENG Bangjie LI Daqiao ZHANG Di YAO Xuguang YANG
Takahiro IINUMA Yudai EBATO Sou NOBUKAWA Nobuhiko WAGATSUMA Keiichiro INAGAKI Hirotaka DOHO Teruya YAMANISHI Haruhiko NISHIMURA
Takeru INOUE Norihito YASUDA Hidetomo NABESHIMA Masaaki NISHINO Shuhei DENZUMI Shin-ichi MINATO
Zhan SHI
Hakan BERCAG Osman KUKRER Aykut HOCANIN
Ryoto KOIZUMI Xiaoyan WANG Masahiro UMEHIRA Ran SUN Shigeki TAKEDA
Hiroya HACHIYAMA Takamichi NAKAMOTO
Chuzo IWAMOTO Takeru TOKUNAGA
Changhui CHEN Haibin KAN Jie PENG Li WANG
Pingping JI Lingge JIANG Chen HE Di HE Zhuxian LIAN
Ho-Lim CHOI
Akira KITAYAMA Goichi ONO Hiroaki ITO
Koji NUIDA Tomoko ADACHI
Yingcai WAN Lijin FANG
Yuta MINAMIKAWA Kazumasa SHINAGAWA
Sota MORIYAMA Koichi ICHIGE Yuichi HORI Masayuki TACHI
Sendren Sheng-Dong XU Albertus Andrie CHRISTIAN Chien-Peng HO Shun-Long WENG
Zhikui DUAN Xinmei YU Yi DING
Hongbo LI Aijun LIU Qiang YANG Zhe LYU Di YAO
Yi XIONG Senanayake THILAK Yu YONEZAWA Jun IMAOKA Masayoshi YAMAMOTO
Feng LIU Qian XI Yanli XU
Yuling LI Aihuang GUO
Mamoru SHIBATA Ryutaroh MATSUMOTO
Haiyang LIU Xiaopeng JIAO Lianrong MA
Ruixiao LI Hayato YAMANA
Riaz-ul-haque MIAN Tomoki NAKAMURA Masuo KAJIYAMA Makoto EIKI Michihiro SHINTANI
Kundan LAL DAS Munehisa SEKIKAWA Tadashi TSUBONE Naohiko INABA Hideaki OKAZAKI
Katsuhiko SHIRAI Sadaoki FURUI
Tatsuo MATSUOKA Kiyohiro SHIKANO
In a practical continuous speech recognition system, the target speech is often spoken in a different speaking style (e.g., speed or loudness) from the training speech. It is difficult to cope with such speaking-style variations because the amount of training speech is limited. Therefore, acoustic modeling should be robust against different styles of speech in order to obtain high recognition performance from the limited training speech. This paper describes the robustness of six types of phoneme-based HMMs against speaking-style variations. The six types of model were VQ- and FVQ-based discrete HMMs, and single-Gaussian and mixture-Gaussian HMMs with either diagonal or full covariance matrices. They were investigated using isolated word utterances, phrase-by-phrase utterances and fluently spoken utterances, with different utterance types for training and testing. The experimental results show that the mixture-Gaussian HMM with diagonal covariance matrices is the most promising choice. The FVQ-based HMM and the single-Gaussian HMM with full covariance matrices also achieved good results. The mixture-Gaussian HMM with full covariance matrices sometimes achieved very high accuracies, but often suffered from "overtuning" or a lack of training data. Finally, this paper proposes a new model-adaptation technique that combines multiple models with appropriate weighting factors. Each model has different characteristics (e.g., coverage of speaking styles and sensitivity to data), and the weighting factors can be estimated using "deleted interpolation". When the mixture-Gaussian diagonal-covariance models were used as baseline models, this technique achieved better recognition accuracy than a model trained using all three utterance types at a time. The advantage of this technique is that estimating the weighting factors is stable even from a limited amount of training speech, because there are few free parameters to be estimated.
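The "deleted interpolation" weighting described above can be sketched as a small EM loop over held-out data. This is a minimal illustration, not the paper's implementation: the per-model probabilities below are invented, and only the mixture weights are free parameters, which is why the estimate remains stable on limited training speech.

```python
# Sketch of deleted-interpolation weight estimation. Given the probability each
# of M models assigns to every held-out sample, EM re-estimates the mixture
# weights; these are the only free parameters being trained here.
def estimate_weights(model_probs, iterations=50):
    """model_probs: list of samples, each a list of M per-model probabilities."""
    m = len(model_probs[0])
    weights = [1.0 / m] * m                      # start from uniform weights
    for _ in range(iterations):
        counts = [0.0] * m
        for probs in model_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            for i in range(m):                   # posterior responsibility of model i
                counts[i] += weights[i] * probs[i] / mix
        total = sum(counts)
        weights = [c / total for c in counts]    # normalized re-estimate
    return weights

# Toy example (invented numbers): model 0 fits the held-out data better on
# every sample, so EM shifts the weight toward it.
probs = [[0.9, 0.2], [0.8, 0.1], [0.7, 0.4], [0.6, 0.3]]
w = estimate_weights(probs)
```

The combined model then scores a sample as the weighted sum of the individual model probabilities.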
In a practical continuous speech recognition system, input speech includes many extraneous words. Furthermore, detecting the beginning point of the target word is very difficult. Under those circumstances, word-spotting is useful for extracting and recognizing the target speech from such input speech. On the other hand, a phoneme-based HMM is useful for large-vocabulary word recognition. Training a phoneme-based HMM is easier and more stable than training a word-based HMM when training speech is limited, because there are several times more phoneme tokens than word tokens in the training speech. For these reasons, we use word-spotting with phoneme-based HMMs. Furthermore, for more precise modeling, we chose context-dependent phoneme modeling. This paper proposes a new clustering method for context-dependent phoneme HMMs. This clustering method uses triphone contexts when training samples are sufficient, and automatically selects biphone and uniphone contexts if only a few training samples are given. Using this clustering method, context-dependent models were created and tested in phoneme recognition and word-spotting experiments. The context-dependent models achieved 90.0% phoneme recognition accuracy, which is 7.6% higher than that of the context-independent models, and 69.2% word-spotting accuracy, which is 7.0% higher.
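The count-based context selection described above can be sketched as a simple backoff rule. The threshold and the count table below are purely illustrative assumptions; the paper's actual clustering method is more elaborate.

```python
# Sketch of triphone -> biphone -> uniphone backoff: use a triphone model when
# enough training tokens exist, otherwise fall back to coarser contexts.
def select_context(left, phone, right, counts, threshold=10):
    """counts maps context tuples to numbers of training tokens (assumed data)."""
    if counts.get((left, phone, right), 0) >= threshold:
        return (left, phone, right)          # triphone: full left+right context
    if counts.get((phone, right), 0) >= threshold:
        return (phone, right)                # biphone: right context only
    return (phone,)                          # uniphone: context-independent

# Invented training counts for illustration.
counts = {("k", "a", "d"): 25, ("a", "d"): 4, ("i", "t"): 12}
```

With these counts, "a" between "k" and "d" gets a triphone model, while "a" in an unseen context falls all the way back to a uniphone.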
Shozo MAKINO Akinori ITO Mitsuru ENDO Ken'iti KIDO
This paper gives an overview of a Japanese text dictation system composed of an acoustic processor and a linguistic processor. The system deals with 843 conceptual words and 431 functional words. Phoneme recognition is carried out using a modified LVQ2 method that we propose. The phoneme recognition score was 86.1% for 226 sentences uttered by two male speakers. The linguistic processor is composed of a processor for spotting Bunsetsu-units and a syntactic processor. The structure of the Bunsetsu-unit is effectively described by a finite-state automaton, whose test-set perplexity is 230. The processor for spotting Bunsetsu-units uses a syntax-driven continuous-DP matching algorithm to spot Bunsetsu-units in a recognized phoneme sequence and generate a Bunsetsu-unit lattice. The syntactic processor parses the Bunsetsu-unit lattice based on a dependency grammar, expressed as the correspondence between a FEATURE marker in a modifier-Bunsetsu and a SLOT-FILLER marker in a head-Bunsetsu. The recognition scores for Bunsetsu-units and conceptual words were 73.2% and 85.7%, respectively, for 226 sentences uttered by the two male speakers.
Takeshi KAWABATA Toshiyuki HANAZAWA Katsunobu ITOH Kiyohiro SHIKANO
A phonetic typewriter is an unlimited-vocabulary continuous speech recognition system that recognizes each phone in speech without the need for lexical information. This paper describes a Japanese phonetic typewriter system based on HMM phone recognition and syllable-based stochastic phone sequence modeling. Even though HMM methods have considerable capacity for recognizing speech, it is difficult to recognize individual phones in continuous speech without lexical information. HMM phone recognition is improved by incorporating syllable trigrams for phone sequence modeling. HMM phone units are trained using an isolated word database, and their duration parameters are modified according to speaking rate. Syllable trigram tables are made from a text database of over 300,000 syllables, and phone sequence probabilities calculated from the trigrams are combined with HMM probabilities. Using these probabilities to limit the number of intermediate candidates leads to an accurate phonetic typewriter system without requiring excessive computation time. An interpolated n-gram approach to phone sequence modeling is shown to be more effective than a simple trigram method.
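The combination of HMM probabilities with trigram-based phone sequence probabilities can be illustrated with a log-linear score for ranking candidates. The weight and the probabilities below are assumptions for illustration, not values from the paper.

```python
import math

# Sketch of combining an acoustic (HMM) score with a language (syllable-trigram)
# score; alpha balances the two sources of evidence.
def combined_score(acoustic_prob, trigram_prob, alpha=0.7):
    """Log-linear combination used to rank candidate phone sequences."""
    return alpha * math.log(acoustic_prob) + (1 - alpha) * math.log(trigram_prob)

# A candidate with a plausible phone sequence can overtake one with a slightly
# better acoustic match but an unlikely sequence (numbers invented).
s1 = combined_score(0.30, 0.20)
s2 = combined_score(0.35, 0.01)
```

Pruning intermediate candidates by such a combined score is what keeps the search accurate without excessive computation.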
This paper reports on a new application of the Markov model to an automatic speech recognition system, in which the feature vectors of speech are regarded as representing the states and output symbols of the Markov model. The state-transition probability and the symbol-output probability are assumed to be represented by multidimensional normal density functions of the feature vector. The DP-matching algorithm is used to calculate the optimum time sequence of observed feature vectors. To confirm the efficiency of this system, we experimentally compared its performance with that of other approaches, such as those using the Mahalanobis distance or the Euclidean distance. In speaker-independent experiments on a vocabulary of Japanese single-digit and four-digit numerals, the proposed system is shown to be more effective than the others.
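The distance measures being compared can be sketched as follows; the diagonal-covariance form of the Mahalanobis distance is shown for simplicity, and the vectors and variances are invented for illustration.

```python
# Euclidean distance ignores feature covariance; Mahalanobis distance whitens
# each dimension by its variance (diagonal-covariance case shown).
def euclidean_sq(x, mean):
    """Squared Euclidean distance between a feature vector and a template mean."""
    return sum((a - b) ** 2 for a, b in zip(x, mean))

def mahalanobis_sq_diag(x, mean, variances):
    """Squared Mahalanobis distance with a diagonal covariance matrix."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, mean, variances))

# A high-variance dimension dominates the Euclidean distance but is correctly
# discounted by the Mahalanobis distance.
x, mean, var = [1.0, 4.0], [0.0, 0.0], [1.0, 16.0]
```

This discounting of unreliable dimensions is the same effect that the normal-density (probabilistic) formulation provides.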
Yasuhiro KOMORI Kaichiro HATAZAKI
In this paper, a speech recognition system aimed at a phoneme typewriter without a language model is proposed. The system is realized as an integration of spectrogram reading knowledge and Time-Delay Neural Networks (TDNNs). The system mainly consists of two parts. In the consonant recognition part, a sophisticated integration of knowledge and a TDNN is proposed; this not only improves recognition performance and segmentation accuracy, but also drastically reduces insertion errors. In the vowel recognition part, a TDNN is used for detection and rough segmentation, exploiting its time-shift tolerance, while the knowledge part is mainly used for verification of categories and boundaries. A phoneme recognition experiment on 2,620 Japanese words uttered by one male speaker showed a 91.4% (11,612/12,710) recognition rate, a 3.6% deletion error rate, a 5.0% substitution error rate and a 20.7% insertion error rate over all Japanese phonemes. This good result was obtained without any language model.
Kenji KITA Toshiyuki TAKEZAWA Tsuyoshi MORIMOTO
This paper describes a continuous speech recognition system using two-level LR parsing and phone-based HMMs. ATR has already implemented a predictive LR parsing algorithm in an HMM-based speech recognition system for Japanese. However, up to now, this system has used only intra-phrase grammatical constraints. In Japanese, a sentence is composed of several phrases, and thus two kinds of grammars, namely an intra-phrase grammar and an inter-phrase grammar, are sufficient for recognizing sentences. Two-level LR parsing makes it possible to use not only intra-phrase but also inter-phrase grammatical constraints during speech recognition. The system is applied to recognition of Japanese sentences uttered phrase by phrase, and attains a word accuracy of 95.9% and a sentence accuracy of 84.7%.
Kenji KITA Terumasa EHARA Tsuyoshi MORIMOTO
Current continuous speech recognition systems essentially ignore unknown words, since systems are designed to recognize only the words in the lexicon. However, for using speech recognition systems in a real application such as spoken-language processing, it is very important to process unknown words. This paper proposes a continuous speech recognition method which accepts any utterance that might include unknown words. In this method, words not in the lexicon are transcribed as phone sequences, while words in the lexicon are recognized correctly. The HMM-LR speech recognition system, an integration of Hidden Markov Models and generalized LR parsing, is used as the baseline system, and is enhanced with a trigram model of syllables to take the stochastic characteristics of the language into account. In our approach, two kinds of grammars, a task grammar which describes the task and a phonetic grammar which describes constraints between phones, are merged and used in the HMM-LR system. The system can output a phonetic transcription for an unknown word by using the phonetic grammar. Experimental results indicate that our approach is very promising.
Minoru SHIGENAGA Yoshihiro SEKIGUCHI Takehiro YAMAGUCHI Ryouta MASUDA
A large-vocabulary (1019 words and 1382 kinds of inflectional endings) continuous speech recognition system that offers high predictability, is applicable to any task, and has an unsupervised speaker adaptation capability is described. Phoneme identification is based on various features, and speaker adaptation is performed using reliably identified phonemes. Phrase boundaries are detected using prosodic information. The syntactic analyzer uses a syntactic state transition network and outputs syntactic interpretations. The semantic analyzer deals with the meaning of each word, the dependency relationships between words, the extended case structures of predicates, and an associative function, all in universally applicable forms. The extended case grammar, with a four-item case structure, and the dependency relationships between words are based on the semantic attributes of the related words; together with the associative function, they realize a universally applicable, highly predictive capability.
Sho-ichi MATSUNAGA Shigeru HOMMA Shigeki SAGAYAMA Sadaoki FURUI
This paper describes two Japanese continuous speech recognition systems (system-1 and system-2) based on phoneme-based HMMs and a two-level grammar approach. The two grammars are an intra-phrase transition network grammar for phrase recognition and an inter-phrase dependency grammar for sentence recognition. A joint score combining acoustic likelihood and linguistic certainty factors, derived from the phoneme-based HMMs and the dependency rules, is maximized to obtain the best sentence recognition results. To keep the amount of computation practical, system-1 is tuned for sentences uttered phrase by phrase and system-2 is tuned for sentence utterances. In system-1, two efficient parsing algorithms are used, one for each grammar: a bi-directional network parser and a breadth-first dependency parser. With the phrase-network parser, input phrase utterances are parsed bi-directionally, both left-to-right and right-to-left, and optimal Viterbi paths are found along which the accumulated phonetic likelihood is maximized. The dependency parser utilizes efficient breadth-first search and beam search algorithms. For system-2, we have extended the dependency analysis algorithm to sentence utterances, using a technique for detecting the most likely multi-phrase candidates based on Viterbi phrase alignment. Where the perplexity of the phrase syntax is 40, system-1 and system-2 increase phrase recognition performance in the sentence by approximately 6% and 14%, respectively, showing the effectiveness of semantic dependency analysis.
Hidefumi SAWAI Yasuhiro MINAMI Masanori MIYATAKE Alex WAIBEL Kiyohiro SHIKANO
This paper describes recent progress in a connectionist large-vocabulary continuous speech recognition system integrating speech recognition and language processing. The speech recognition part consists of Large Phonemic Time-Delay Neural Networks (TDNNs) which can automatically spot all 24 Japanese phonemes (i.e., 18 consonants /b/, /d/, /g/, /p/, /t/, /k/, /m/, /n/, /N/, /s/, /sh/ ([
Akio KOMATSU Eiji OOHIRA Akira ICHIKAWA
Natural spontaneous speech is so ambiguous that a system for understanding it requires the cooperation of many knowledge sources. Thus, in order to integrate speech processing and language processing, it is necessary to provide a system with a mechanism for supporting such cooperation. We propose here a general framework for cooperative problem solving, based on the blackboard model and a TMS (truth maintenance system) with an enhanced proving function. In this framework, a reasonably consistent interpretation is automatically maintained on the blackboard, while each knowledge source performs its own inference and puts the results on the blackboard. Based on this framework, a model has been established for a system which can understand spontaneous speech through the cooperation of independent knowledge sources. Most notably, prosodic information is used as suprasegmental cues to infer the structure of spontaneous speech. This allows robust parsing of spoken sentences. The feasibility and validity of our basic framework have been confirmed by computer simulation experiments on spontaneous speech.
Seiichi NAKAGAWA Yoshimitsu HIRATA Isao MURASE Tomohiro TANOUE
This paper describes and compares syntax- and semantics-oriented spoken Japanese understanding systems named "SPOJUS-SYNO" and "SPOJUS-SEMO". First, these systems automatically build word-based Hidden Markov Models (HMMs) by concatenating syllable models. A word lattice is then hypothesized for an input utterance by using a word-spotting algorithm with the word-based HMMs. In SPOJUS-SYNO, a time-synchronous left-to-right parsing algorithm finds the best word sequence in the word lattice according to syntactic and semantic knowledge represented by a context-free semantic grammar. In SPOJUS-SEMO, syntactic and semantic knowledge is represented by a dependency and case grammar. These systems were implemented for the "UNIX-QA" task with a vocabulary of 521 words. Experimental results show that the sentence recognition/understanding rate was about 80/87% over six male speakers for SPOJUS-SYNO, but performance was very low for SPOJUS-SEMO.
Yutaka KOBAYASHI Masanori OMOTE Hidenori ENDO Yasuhisa NIIMI
This paper describes an overview of our speech understanding system and reports on the recent results of the sentence recognition experiments. The system, we call SUSKIT-
Shingo NISHIOKA Osamu KAKUSHO Riichiro MIZOGUCHI
A speech understanding system confronts ambiguities caused by acoustic-phonetic errors and the multiple meanings of words, so an effective framework is required to resolve them. The speech understanding system described in this paper deals with two different kinds of phrases to avoid combinatorial explosion, and is constructed on an ATMS-based problem-solving system to extract maximum performance. Experimental results show that the time consumed by the speech understanding system is reduced to about one tenth. Furthermore, to evaluate the generality and effectiveness of the ATMS-based problem-solving system, the results of another experiment are also presented in this paper.
Tetsuya YAMAMOTO Yoshikazu OHTA Yoichi YAMASHITA Osamu KAKUSHO Riichiro MIZOGUCHI
This paper describes a dialog management system called MASCOTS which manages a dialog between a user and a problem solving system through spoken Japanese and helps the speech understanding system in its language processing. MASCOTS tries to predict the next user utterance based on an architecture for managing dialog with two stacks and plan information. MASCOTS not only contributes to making language processing efficient, but also works for the problem solving system: it identifies the kind of utterance and standardizes its representation form in place of the problem solving system. In this paper, the architecture of MASCOTS is discussed, focusing on the characteristics of dialog and on two ways of predicting the next user utterance while exchanging information with the language processing system.
Tsuyoshi MORIMOTO Kiyohiro SHIKANO Kiyoshi KOGURE Hitoshi IIDA Akira KUREMATSU
The experimental spoken language translation system (SL-TRANS) has been implemented. It can recognize Japanese speech, translate it to English, and output synthesized English speech. One of the most important problems in realizing such a system is how to integrate, or connect, speech recognition and language processing. In this paper, a new method realized in the system is described. The method is composed of three processes: grammar-driven predictive speech recognition, Kakariuke-dependency-based candidate filtering, and HPSG-based lattice parsing supplemented with a sentence preference mechanism. Input speech is uttered phrase by phrase. The speech recognizer takes an input phrase utterance and outputs several candidates with recognition scores for each phrase. A Japanese phrasal grammar is used in recognition; it contributes to the output of grammatically well-formed phrase candidates, as well as to the reduction of phone perplexity. The candidate filter takes a phrase lattice, which is a sequence of multiple candidates for a phrase, and outputs a reduced phrase lattice. It removes semantically inappropriate phrase candidates by applying the Kakariuke dependency relationship between phrases. Finally, the HPSG-based lattice parser takes a phrase lattice and chooses the most plausible sentence by checking syntactic and semantic legitimacy and evaluating sentential preference. Experimental results for the system are also reported, and the usefulness of the method is confirmed.
In this paper, we investigate language models using a context-free grammar, a bigram and a quasi/simplified-trigram. To calculate the statistics of the bigram and quasi/simplified-trigram, we used a set of sentences generated randomly from the CFG that are semantically legal. We compared the models on perplexity and sentence recognition accuracy. Sentence recognition was tested on the "UNIX-QA" task with a vocabulary of 521 words. The perplexities of the bigram and quasi-trigram were about 1.5-1.7 times and 1.2-1.3 times larger, respectively, than the perplexity of the CFG corresponding to the most restricted grammar (perplexity = 10.0), and we found that the quasi-trigram has almost the same modeling ability as the restricted CFG when the set of plausible sentences in the task is given.
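Test-set perplexity, the figure of merit used above, can be computed from the per-word probabilities a model assigns to a test corpus; the probabilities below are illustrative. A uniform model over 10 equally likely words has perplexity exactly 10, matching the branching-factor reading of "perplexity = 10.0".

```python
import math

# Sketch of test-set perplexity: the geometric mean of the inverse per-word
# probabilities assigned by the language model over the whole test set.
def perplexity(word_probs):
    """word_probs: the model's probability for each word in the test set."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# A model that assigns probability 0.1 to every word behaves like a uniform
# choice among 10 alternatives.
pp = perplexity([0.1] * 20)
```

Lower perplexity means the grammar constrains the recognizer's search more tightly, which is why the CFG outperforms the n-gram approximations here.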
This paper describes recent speech database efforts in Japan in which the author has been involved. The JEIDA Japanese Common Speech Data Corpus was first reported in 1986 and has recently been converted to DAT. The JEIDA Noise Database has been released to the public recently; it contains various kinds of environmental noise and standard noise for sound level calibration. The 'Spoken Language' project collected speech data including continuous speech spoken by 10 males and 10 females. The 'Spoken Japanese' project, started in 1989, attempts to collect various dialectal speech from all over Japan and create speech databases; a compact disc containing a fairy tale and a weather forecast spoken by 20 dialect speakers has been produced. This paper also describes the Continuous Speech Database Committee, which was established recently by the Acoustical Society of Japan.
The stability of convex combinations of polynomials and the stability margin of stable polynomials are studied using Hermite matrices for continuous-time systems. Available results are found to impose a heavy computational burden, especially in checking the stability of a polytope of polynomials by means of "the edge theorem". We propose alternative stability conditions and a margin that reduce the computational burden. In our approach, the stability condition reported by Bialas and Garloff can be derived readily.
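The edge-theorem burden mentioned above can be illustrated with a naive numerical check: sampling a polytope edge and testing each sampled polynomial for Hurwitz stability with a Routh array. This is a sketch for intuition only; sampling is not a proof of edge stability, and the Hermite-matrix conditions in the paper avoid exactly this kind of brute-force test.

```python
# Routh-array test for Hurwitz stability (coefficients from highest degree down).
def routh_stable(coeffs):
    """True iff all roots of the polynomial lie in the open left half-plane."""
    n = len(coeffs) - 1
    rows = [list(coeffs[0::2]), list(coeffs[1::2]) or [0.0]]
    first_col = [rows[0][0], rows[1][0]]
    for _ in range(n - 1):
        prev, cur = rows[-2], rows[-1]
        if abs(cur[0]) < 1e-12:
            return False                     # singular array: not strictly stable
        nxt = []
        for i in range(max(len(prev) - 1, 1)):
            a = prev[i + 1] if i + 1 < len(prev) else 0.0
            b = cur[i + 1] if i + 1 < len(cur) else 0.0
            nxt.append((cur[0] * a - prev[0] * b) / cur[0])
        rows.append(nxt)
        first_col.append(nxt[0])
    return all(c > 0 for c in first_col)     # no sign change in first column

def edge_stable(p, q, samples=101):
    """Sample the convex combination (1-t)p + t q along an edge and test each
    sample. A numerical check only, not a stability proof for the whole edge."""
    return all(
        routh_stable([(1 - t) * a + t * b for a, b in zip(p, q)])
        for t in (k / (samples - 1) for k in range(samples))
    )
```

Checking every edge of a polytope this way is what makes the edge-theorem approach expensive; conditions that certify a whole edge at once remove the sampling loop entirely.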
Shuichi UENO Katsufumi TSUJI Yoji KAJITANI
Given a plane graph G, a trail of G is said to be dual if it is also a trail in the geometric dual of G. We show that the problem of partitioning the edges of G into the minimum number of dual trails is NP-hard.