IEICE TRANSACTIONS on Information

  • Impact Factor

    0.59

  • Eigenfactor

    0.002

  • article influence

    0.1

  • Cite Score

    1.4

Volume E88-D No.3  (Publication Date:2005/03/01)

    Special Section on Corpus-Based Speech Technologies
  • FOREWORD

    Kiyohiro SHIKANO  

     
    FOREWORD

      Page(s):
    365-365
  • Recent Progress in Corpus-Based Spontaneous Speech Recognition

    Sadaoki FURUI  

     
    INVITED PAPER

      Page(s):
    366-375

    This paper overviews recent progress in the development of corpus-based spontaneous speech recognition technology. Although speech is in almost any situation spontaneous, recognition of spontaneous speech is an area which has only recently emerged in the field of automatic speech recognition. Broadening the application of speech recognition depends crucially on raising recognition performance for spontaneous speech. For this purpose, it is necessary to build large spontaneous speech corpora for constructing acoustic and language models. This paper focuses on various achievements of a Japanese 5-year national project "Spontaneous Speech: Corpus and Processing Technology" that has recently been completed. Because of various spontaneous-speech specific phenomena, such as filled pauses, repairs, hesitations, repetitions and disfluencies, recognition of spontaneous speech requires various new techniques. These new techniques include flexible acoustic modeling, sentence boundary detection, pronunciation modeling, acoustic as well as language model adaptation, and automatic summarization. Particularly automatic summarization including indexing, a process which extracts important and reliable parts of the automatic transcription, is expected to play an important role in building various speech archives, speech-based information retrieval systems, and human-computer dialogue systems.

  • Developments in Corpus-Based Speech Synthesis: Approaching Natural Conversational Speech

    Nick CAMPBELL  

     
    INVITED PAPER

      Page(s):
    376-383

    This paper describes the special demands of conversational speech in the context of corpus-based speech synthesis. The author proposed the CHATR system of prosody-based unit-selection for concatenative waveform synthesis seven years ago, and now extends this work to incorporate the results of an analysis of five years of recordings of spontaneous conversational speech in a wide range of actual daily-life situations. The paper proposes that the expression of affect (often translated as 'kansei' in Japanese) is the main factor differentiating laboratory speech from real-world conversational speech, and presents a framework for the specification of affect through differences in speaking style and voice quality. Having an enormous corpus of speech samples available for concatenation allows the selection of complete phrase-sized utterance segments, and shifts the focus of unit selection from segmental or phonetic continuity to prosodic and discoursal appropriateness. Samples of the resulting large-corpus-based synthesis can be heard at http://feast.his.atr.jp/AESOP.

  • Multiple Regression of Log Spectra for In-Car Speech Recognition Using Multiple Distributed Microphones

    Weifeng LI  Tetsuya SHINDE  Hiroshi FUJIMURA  Chiyomi MIYAJIMA  Takanori NISHINO  Katunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA  

     
    PAPER-Feature Extraction and Acoustic Modelings

      Page(s):
    384-390

    This paper describes a new multi-channel method of noisy speech recognition, which estimates the log spectrum of speech at a close-talking microphone based on the multiple regression of the log spectra (MRLS) of noisy signals captured by distributed microphones. The advantages of the proposed method are as follows: 1) the method requires neither a sensitive geometric layout, nor calibration of the sensors, nor additional pre-processing for tracking the speech source; 2) the system requires very little computation; and 3) the regression weights can be statistically optimized over the given training data. Once the optimal regression weights are obtained by regression learning, they can be used to generate the estimated log spectrum in the recognition phase, where close-talking speech is no longer required. The performance of the proposed method is illustrated by speech recognition of real in-car dialogue data. In comparison to the nearest distant microphone and a multi-microphone adaptive beamformer, the proposed approach obtains relative word error rate (WER) reductions of 9.8% and 3.6%, respectively.
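
The regression idea described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: it handles a single frequency bin, includes a bias term, and fits the weights by ordinary least squares through the normal equations. The function names `solve_linear_system`, `fit_regression_weights`, and `estimate_log_spectrum` are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so that row operations act on both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_regression_weights(distant_frames, close_values):
    """Least-squares weights (bias first) for one frequency bin.

    distant_frames: one list per frame, holding the log-spectral value
                    observed at each distant microphone.
    close_values:   log-spectral value at the close-talking microphone
                    for each frame (needed only during training).
    """
    rows = [[1.0] + list(frame) for frame in distant_frames]
    dim = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(dim)]
           for i in range(dim)]
    xty = [sum(r[i] * y for r, y in zip(rows, close_values))
           for i in range(dim)]
    return solve_linear_system(xtx, xty)


def estimate_log_spectrum(weights, distant_frame):
    """Apply trained weights to one frame of distant-microphone values."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], distant_frame))
```

In this sketch the close-talking signal appears only in `fit_regression_weights`, which mirrors the statement that close-talking speech is not needed in the recognition phase.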

  • Automatic Generation of Non-uniform and Context-Dependent HMMs Based on the Variational Bayesian Approach

    Takatoshi JITSUHIRO  Satoshi NAKAMURA  

     
    PAPER-Feature Extraction and Acoustic Modelings

      Page(s):
    391-400

    We propose a new method, based on the Variational Bayesian (VB) approach, both for automatically creating non-uniform, context-dependent HMM topologies and for selecting the number of mixture components. Although the Maximum Likelihood (ML) criterion is generally used to create HMM topologies, it has an over-fitting problem. Recently, to avoid this problem, the VB approach has been applied to create acoustic models for speech recognition. We introduce the VB approach to the Successive State Splitting (SSS) algorithm, which can create both contextual and temporal variations for HMMs. Experimental results indicate that the proposed method can automatically create a more efficient model than the original method. We also evaluated a method that increases the number of mixture components by using the VB approach and considering temporal structures. The VB approach obtained almost the same performance as ML-based methods with a smaller number of mixture components.

  • Applying Sparse KPCA for Feature Extraction in Speech Recognition

    Amaro LIMA  Heiga ZEN  Yoshihiko NANKAKU  Keiichi TOKUDA  Tadashi KITAMURA  Fernando G. RESENDE  

     
    PAPER-Feature Extraction and Acoustic Modelings

      Page(s):
    401-409

    This paper presents an analysis of the applicability of Sparse Kernel Principal Component Analysis (SKPCA) to feature extraction in speech recognition, as well as a proposed approach that makes the SKPCA technique realizable for a large amount of training data, which is the usual situation in speech recognition systems. Although Kernel Principal Component Analysis (KPCA) has proved to be an efficient technique for speech recognition, it has the disadvantage of requiring training data reduction when the amount of data is excessively large. This data reduction is important to avoid computational infeasibility and/or an extremely high computational burden in the feature representation step of the training and test data evaluations. The standard approach to this data reduction is to randomly choose frames from the original data set, which does not necessarily provide a good statistical representation of the original data set. To solve this problem, a likelihood-related re-estimation procedure was applied to the KPCA framework, thus creating the SKPCA, which nevertheless is not realizable for large training databases. The proposed approach consists of clustering the training data and applying an SKPCA-like data reduction technique to these clusters, generating reduced data clusters. These reduced data clusters are merged and reduced in a recursive procedure until just one cluster is obtained, making the SKPCA approach realizable for a large amount of training data. The experimental results show the efficiency of the SKPCA technique with the proposed approach over KPCA with the standard sparse solution using randomly chosen frames and over the standard feature extraction techniques.

  • Continuous Speech Recognition Based on General Factor Dependent Acoustic Models

    Hiroyuki SUZUKI  Heiga ZEN  Yoshihiko NANKAKU  Chiyomi MIYAJIMA  Keiichi TOKUDA  Tadashi KITAMURA  

     
    PAPER-Feature Extraction and Acoustic Modelings

      Page(s):
    410-417

    This paper describes continuous speech recognition that incorporates additional complement information, e.g., voice characteristics, speaking styles, linguistic information and noise environment, into HMM-based acoustic modeling. In speech recognition systems, context-dependent HMMs, i.e., triphones, and tree-based context clustering have commonly been used. Several attempts have been made in recent years to utilize not only phonetic contexts but also additional complement information, based on context (factor) dependent HMMs. However, when the additional factors for testing data are unobserved, methods for obtaining factor labels are required before decoding. In this paper, we propose a model integration technique based on general factor dependent HMMs for decoding. The integrated HMMs can be used by a conventional decoder as standard triphone HMMs with Gaussian mixture densities. Moreover, by using the results of context clustering, the proposed method can determine an optimal number of mixture components for each state depending on the degree of influence from additional factors. Phoneme recognition experiments using voice characteristic labels show significant improvements with a small number of model parameters, and a 19.3% error reduction was obtained in noise environment experiments.

  • Parameter Sharing in Mixture of Factor Analyzers for Speaker Identification

    Hiroyoshi YAMAMOTO  Yoshihiko NANKAKU  Chiyomi MIYAJIMA  Keiichi TOKUDA  Tadashi KITAMURA  

     
    PAPER-Feature Extraction and Acoustic Modelings

      Page(s):
    418-424

    This paper investigates the parameter tying structures of a mixture of factor analyzers (MFA) and discriminative training of MFA for speaker identification. The parameters of factor loading matrices or diagonal matrices are shared among the different mixtures of the MFA. Then, minimum classification error (MCE) training is applied to the MFA parameters to enhance the discrimination ability. The result of a text-independent speaker identification experiment shows that MFA outperforms the conventional Gaussian mixture model (GMM) with diagonal or full covariance matrices and achieves the best performance when sharing the diagonal matrices, resulting in a relative gain of 26% over the GMM with diagonal covariance matrices. The improvement is especially pronounced when training data are sparse. The recognition performance is further improved by MCE training, with an additional gain of 3% error reduction.

  • Deterministic Annealing EM Algorithm in Acoustic Modeling for Speaker and Speech Recognition

    Yohei ITAYA  Heiga ZEN  Yoshihiko NANKAKU  Chiyomi MIYAJIMA  Keiichi TOKUDA  Tadashi KITAMURA  

     
    PAPER-Feature Extraction and Acoustic Modelings

      Page(s):
    425-431

    This paper investigates the effectiveness of the DAEM (Deterministic Annealing EM) algorithm in acoustic modeling for speaker and speech recognition. Although the EM algorithm has been widely used to approximate the ML estimates, it has the problem of initialization dependence. To relax this problem, the DAEM algorithm has been proposed, and its effectiveness has been confirmed on small artificial tasks. In this paper, we applied the DAEM algorithm to practical speech recognition tasks: speaker recognition based on GMMs and continuous speech recognition based on HMMs. Experimental results show that the DAEM algorithm can improve the recognition performance as compared to the standard EM algorithm with conventional initialization algorithms, especially in flat-start training for continuous speech recognition.
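
As a rough illustration of how deterministic annealing changes the E-step, the sketch below computes component posteriors for a one-dimensional Gaussian mixture with the joint likelihood raised to a temperature parameter `beta`. It assumes the usual formulation in which `beta` is increased gradually toward 1; the function names and the one-dimensional setting are hypothetical simplifications, and the abstract itself does not give these details.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def tempered_posteriors(x, weights, means, variances, beta):
    """Posterior over mixture components with temperature parameter beta.

    beta = 1 gives the standard EM posterior; beta close to 0 flattens
    the posterior toward uniform, which reduces the influence of the
    initial parameter values early in training.
    """
    # beta * log(pi_k * N(x | mean_k, var_k)) for each component k
    scaled = [beta * (math.log(w) + gaussian_log_pdf(x, m, v))
              for w, m, v in zip(weights, means, variances)]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

A training loop under this assumption would run EM iterations at a small `beta`, raise `beta` in steps, and finish with ordinary EM at `beta = 1`.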

  • A Data-Driven Model Parameter Compensation Method for Noise-Robust Speech Recognition

    Yongjoo CHUNG  

     
    LETTER

      Page(s):
    432-434

    A data-driven approach that compensates the HMM parameters for noisy speech recognition is proposed. Instead of assuming statistical approximations as in conventional methods such as the PMC, the various statistical information necessary for the HMM parameter adaptation is directly estimated by using the Baum-Welch algorithm. The proposed method has shown improved results compared with the PMC for noisy speech recognition.

  • Feature Extraction with Combination of HMT-Based Denoising and Weighted Filter Bank Analysis for Robust Speech Recognition

    Sungyun JUNG  Jongmok SON  Keunsung BAE  

     
    LETTER

      Page(s):
    435-438

    In this paper, we propose a new feature extraction method that combines HMT-based denoising and weighted filter bank analysis for robust speech recognition. The proposed method is made up of two stages in cascade. The first stage is a denoising process based on the wavelet-domain Hidden Markov Tree model, and the second is a filter bank analysis with weighting coefficients obtained from the residual noise in the first stage. To evaluate the performance of the proposed method, recognition experiments were carried out for additive white Gaussian and pink noise with signal-to-noise ratios from 25 dB to 0 dB. Experimental results demonstrate the superiority of the proposed method to the conventional ones.

  • Language Model Adaptation Based on PLSA of Topics and Speakers for Automatic Transcription of Panel Discussions

    Yuya AKITA  Tatsuya KAWAHARA  

     
    PAPER-Spoken Language Systems

      Page(s):
    439-445

    Appropriate language modeling is one of the major issues for automatic transcription of spontaneous speech. We propose an adaptation method for statistical language models based on both topic and speaker characteristics. This approach is applied to automatic transcription of meetings and panel discussions, in which multiple participants speak on a given topic in their own speaking style. The baseline language model is a mixture of two models, which are trained with different corpora covering various topics and speakers, respectively. Then, probabilistic latent semantic analysis (PLSA) is performed on the same respective corpora and the initial ASR result to provide two sets of unigram probabilities conditioned on input speech, with regard to topics and speaker characteristics, respectively. Finally, the baseline model is adapted by scaling N-gram probabilities with these unigram probabilities. For speaker adaptation, we make use of a portion of the Corpus of Spontaneous Japanese (CSJ) in which a large number of speakers gave talks on given topics. Experimental evaluation with real discussions showed that both topic and speaker adaptation reduced test-set perplexity, and in total, an average reduction rate of 8.5% was obtained. Furthermore, an improvement in word accuracy was also achieved by the proposed adaptation method.
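
The final adaptation step, scaling N-gram probabilities with adapted unigram probabilities, might look roughly like the sketch below. The abstract does not state the exact formula, so this assumes the commonly used unigram rescaling form, in which each word's N-gram probability is multiplied by the ratio of its adapted to its baseline unigram probability and the result is renormalized. The function name and the `exponent` parameter are hypothetical.

```python
def adapt_ngram_distribution(base_ngram, base_unigram, adapted_unigram,
                             exponent=1.0):
    """Rescale P(w | history) for one fixed history and renormalize.

    base_ngram:      word -> baseline P(w | history)
    base_unigram:    word -> baseline unigram P(w)
    adapted_unigram: word -> adapted unigram P(w), e.g. from PLSA
    exponent:        assumed damping factor for the scaling ratio
    """
    scaled = {
        word: prob * (adapted_unigram[word] / base_unigram[word]) ** exponent
        for word, prob in base_ngram.items()
    }
    total = sum(scaled.values())
    return {word: value / total for word, value in scaled.items()}
```

Words that the adapted unigram model favors gain probability mass in every history, while the relative ordering imposed by the baseline N-gram context is otherwise preserved.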

  • Dialogue Speech Recognition by Combining Hierarchical Topic Classification and Language Model Switching

    Ian R. LANE  Tatsuya KAWAHARA  Tomoko MATSUI  Satoshi NAKAMURA  

     
    PAPER-Spoken Language Systems

      Page(s):
    446-454

    An efficient, scalable speech recognition architecture combining topic detection and topic-dependent language modeling is proposed for multi-domain spoken language systems. In the proposed approach, the inferred topic is automatically detected from the user's utterance, and speech recognition is then performed by applying an appropriate topic-dependent language model. This approach enables users to freely switch between domains while maintaining high recognition accuracy. As topic detection is performed on a single utterance, detection errors may occur and propagate through the system. To improve robustness, a hierarchical back-off mechanism is introduced where detailed topic models are applied when topic detection is confident and wider models that cover multiple topics are applied in cases of uncertainty. The performance of the proposed architecture is evaluated when combined with two topic detection methods: unigram likelihood and SVMs (Support Vector Machines). On the ATR Basic Travel Expression Corpus, both methods provide a significant reduction in WER (9.7% and 10.3%, respectively) compared to a single language model system. Furthermore, recognition accuracy is comparable to performing decoding with all topic-dependent models in parallel, while the required computational cost is much reduced.
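
The hierarchical back-off idea can be illustrated with a small sketch. It assumes that topic detection yields posterior-like confidences over leaf topics that sum to one, and that the confidence of a wider node is the sum over the leaves it covers; neither assumption comes from the abstract. The function names and the topic labels in the usage note are hypothetical.

```python
def is_under(leaf, node, parent):
    """Return True if `node` is `leaf` itself or one of its ancestors."""
    current = leaf
    while current is not None:
        if current == node:
            return True
        current = parent.get(current)
    return False


def select_language_model(leaf_confidence, parent, threshold):
    """Pick the most specific topic node whose coverage is confident enough.

    leaf_confidence: leaf topic -> detection confidence
    parent:          node -> parent node (the root maps to None)
    threshold:       minimum summed confidence needed to stop backing off
    """
    # Start from the single most likely detailed topic.
    node = max(leaf_confidence, key=leaf_confidence.get)
    while True:
        covered = sum(conf for leaf, conf in leaf_confidence.items()
                      if is_under(leaf, node, parent))
        # Stop when confident enough, or when there is no wider model left.
        if covered >= threshold or parent.get(node) is None:
            return node
        node = parent[node]
```

With a hierarchy such as `hotel` and `hostel` under `accommodation`, and `accommodation` and `transport` under `all`, a confident detection returns the detailed `hotel` model, while an ambiguous one backs off to `accommodation` or to the single model covering all topics.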

  • Verification of Multi-Class Recognition Decision: A Classification Approach

    Tomoko MATSUI  Frank K. SOONG  Biing-Hwang JUANG  

     
    PAPER-Spoken Language Systems

      Page(s):
    455-462

    We investigate strategies to improve the utterance verification performance using a 2-class pattern classification approach, including: utilizing N-best candidate scores, modifying segmentation boundaries, applying background and out-of-vocabulary filler models, incorporating contexts, and minimizing verification errors via discriminative training. A connected-digit database recorded in a noisy, moving car with a hands-free microphone mounted on the sun-visor is used to evaluate the verification performance. The equal error rate (EER) of word verification is employed as the sole performance measure. All factors and their effects on the verification performance are presented in detail. The EER is reduced from 29%, using the standard likelihood ratio test, down to 21.4%, when all features are properly integrated.
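
For readers unfamiliar with the equal error rate used as the performance measure here, the following sketch shows one simple way to estimate it from verification scores. It sweeps every observed score as a threshold and reports the operating point where the false rejection and false acceptance rates are closest; the function name and this particular estimation rule are assumptions for illustration, not the procedure used in the paper.

```python
def equal_error_rate(accept_scores, reject_scores):
    """Estimate the EER from two lists of verification scores.

    accept_scores: scores of items that should be accepted
    reject_scores: scores of items that should be rejected
    An item is accepted when its score is at or above the threshold.
    """
    best_gap = None
    eer = None
    for threshold in sorted(set(accept_scores) | set(reject_scores)):
        false_reject = (sum(s < threshold for s in accept_scores)
                        / len(accept_scores))
        false_accept = (sum(s >= threshold for s in reject_scores)
                        / len(reject_scores))
        gap = abs(false_reject - false_accept)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (false_reject + false_accept) / 2.0
    return eer
```

With few scores the two error rates may never be exactly equal, which is why the sketch averages them at the closest point.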

  • An Unsupervised Speaker Adaptation Method for Lecture-Style Spontaneous Speech Recognition Using Multiple Recognition Systems

    Seiichi NAKAGAWA  Tomohiro WATANABE  Hiromitsu NISHIZAKI  Takehito UTSURO  

     
    PAPER-Spoken Language Systems

      Page(s):
    463-471

    This paper describes an accurate unsupervised speaker adaptation method for lecture-style spontaneous speech recognition using multiple LVCSR systems. In an unsupervised speaker adaptation framework, the improvement in recognition performance obtained by adapting acoustic models depends strongly on the accuracy of labels such as phonemes and syllables. Therefore, extraction of the adaptation data guided by a confidence measure is effective for unsupervised adaptation. In this paper, we identified high-confidence portions based on the agreement between two LVCSR systems, adapted acoustic models using these portions together with their highly accurate labels, and thereby improved the recognition accuracy. We applied our method to the Corpus of Spontaneous Japanese (CSJ), and the method improved the recognition rate by about 2.1% in comparison with a traditional method.

  • Improving Keyword Recognition of Spoken Queries by Combining Multiple Speech Recognizer's Outputs for Speech-driven WEB Retrieval Task

    Masahiko MATSUSHITA  Hiromitsu NISHIZAKI  Takehito UTSURO  Seiichi NAKAGAWA  

     
    PAPER-Spoken Language Systems

      Page(s):
    472-480

    This paper presents speech-driven Web retrieval models which accept spoken search topics (queries) in the NTCIR-3 Web retrieval task. The major focus of this paper is on improving speech recognition accuracy of spoken queries and then improving retrieval accuracy in speech-driven Web retrieval. We experimentally evaluated techniques for combining the outputs of multiple LVCSR models in recognition of spoken queries. As model combination techniques, we compared the SVM learning technique with conventional voting schemes such as ROVER. In addition, to investigate the effect of the language model's vocabulary size on retrieval performance, we prepared two language models: one with a vocabulary size of 20,000 and the other with a vocabulary size of 60,000. Then, we evaluated the differences in the recognition rates of the spoken queries and in the retrieval performance. We showed that the techniques of multiple LVCSR model combination could achieve improvement in both speech recognition and retrieval accuracies in speech-driven text retrieval. Comparing the retrieval accuracies obtained when an LM with a vocabulary size of 20,000 or 60,000 is used in an LVCSR system, we found that the larger the vocabulary size, the better the retrieval accuracy.

  • Perceptually-Related F0 Parameters for Automatic Classification of Phrase Final Tones

    Carlos Toshinori ISHI  

     
    PAPER-Speech Synthesis and Prosody

      Page(s):
    481-488

    Automatic labeling of prosodic features is an important topic when constructing large speech databases for speech synthesis or analysis purposes. Perceptually-related F0 parameters are proposed with the aim of automatically classifying phrase final tones. Analyses are conducted to verify how consistently subjects are able to categorize phrase final tones, and how perceptual features are related to the categories. Three types of acoustic parameters are proposed and analyzed for representing the perceptual features related to the tone categories: one related to pitch movement within the phrase final, one related to pitch reset prior to the phrase final, and one related to the length of the phrase final. A classification tree is constructed to evaluate automatic classification of phrase final tones, resulting in 79.2% accuracy for the consistently categorized samples, using the best combination among the proposed acoustic parameters.

  • Fundamental Frequency Modeling for Speech Synthesis Based on a Statistical Learning Technique

    Shinsuke SAKAI  

     
    PAPER-Speech Synthesis and Prosody

      Page(s):
    489-495

    This paper proposes a novel multi-layer approach to fundamental frequency modeling for concatenative speech synthesis based on a statistical learning technique called additive models. We define an additive F0 contour model consisting of a long-term, intonational phrase-level component and a short-term, accentual phrase-level component, along with a least-squares error criterion that includes a regularization term. A backfitting algorithm derived from this error criterion estimates both components simultaneously by iteratively applying cubic spline smoothers. When this method is applied to a 7,000-utterance Japanese speech corpus, it achieves F0 RMS errors of 28.9 and 29.8 Hz on the training and test data, respectively, with corresponding correlation coefficients of 0.806 and 0.777. The automatically determined intonational and accentual phrase components turn out to behave smoothly, systematically, and intuitively under a variety of prosodic conditions.

  • Automatic Scoring for Prosodic Proficiency of English Sentences Spoken by Japanese Based on Utterance Comparison

    Yoichi YAMASHITA  Keisuke KATO  Kazunori NOZAWA  

     
    PAPER-Speech Synthesis and Prosody

      Page(s):
    496-501

    This paper describes techniques of scoring prosodic proficiency of English sentences spoken by Japanese. The multiple regression model predicts the prosodic proficiency using new prosodic measures based on the characteristics of Japanese novice learners of English. Prosodic measures are calculated by comparing prosodic parameters, such as F0, power and duration, of learner's and native speaker's speech. The new measures include the approximation error of the fitting line and the comparison result of prosodic parameters for a limited segment of the word boundary rather than the whole utterance. This paper reveals that the introduction of the new measures improved the correlation by 0.1 between the teachers' and automatic scores.

  • Acoustic Modeling of Speaking Styles and Emotional Expressions in HMM-Based Speech Synthesis

    Junichi YAMAGISHI  Koji ONISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech Synthesis and Prosody

      Page(s):
    502-509

    This paper describes the modeling of various emotional expressions and speaking styles in synthetic speech using HMM-based speech synthesis. We show two methods for modeling speaking styles and emotional expressions. In the first method called style-dependent modeling, each speaking style and emotional expression is modeled individually. In the second one called style-mixed modeling, each speaking style and emotional expression is treated as one of contexts as well as phonetic, prosodic, and linguistic features, and all speaking styles and emotional expressions are modeled simultaneously by using a single acoustic model. We chose four styles of read speech -- neutral, rough, joyful, and sad -- and compared the above two modeling methods using these styles. The results of subjective evaluation tests show that both modeling methods have almost the same accuracy, and that it is possible to synthesize speech with the speaking style and emotional expression similar to those of the target speech. In a test of classification of styles in synthesized speech, more than 80% of speech samples generated using both the models were judged to be similar to the target styles. We also show that the style-mixed modeling method gives fewer output and duration distributions than the style-dependent modeling method.

  • Modeling Improved Prosody Generation from High-Level Linguistically Annotated Corpora

    Gerasimos XYDAS  Dimitris SPILIOTOPOULOS  Georgios KOUROUPETROGLOU  

     
    PAPER-Speech Synthesis and Prosody

      Page(s):
    510-518

    Synthetic speech usually suffers from a poor F0 contour surface. Robust prediction of the underlying pitch targets relies on the quality of the predicted prosodic structures, i.e., the corresponding sequences of tones and breaks. In the present work, we have utilized a linguistically enriched annotated corpus to build data-driven models for predicting prosodic structures with increased accuracy. We have then used a linear regression approach for the F0 modeling. An appropriate XML annotation scheme has been introduced to encode syntax, grammar, new or already given information, phrase subject/object information, as well as rhetorical elements in the corpus, by exploiting a Natural Language Generator (NLG) system. To demonstrate the benefits of introducing the enriched input meta-information, we first show that while tone and break CART predictors have high accuracy when standing alone (92.35% for breaks, 87.76% for accents and 99.03% for endtones), their application in the TtS chain degrades the Linear Regression pitch target model. On the other hand, the enriched linguistic meta-information minimizes the errors of the models, leading to a more natural F0 surface. Both objective and subjective evaluations were adopted for the intonation contours by taking into account the propagated errors introduced by each model in the synthesis chain.

  • Designing Target Cost Function Based on Prosody of Speech Database

    Kazuki ADACHI  Tomoki TODA  Hiromichi KAWANAMI  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Speech Synthesis and Prosody

      Page(s):
    519-524

    This research aims to construct a high-quality Japanese TTS (Text-to-Speech) system that has high flexibility in treating prosody. Many TTS systems have implemented a prosody control system, but such systems have been fundamentally designed to output speech with a standard pitch and speech rate. In this study, we employ a unit selection-concatenation method and also introduce an analysis-synthesis process to provide precisely controlled prosody in output speech. Speech quality degrades in proportion to the amount of prosody modification; therefore, a target cost for prosody is set to evaluate the prosodic difference between the target prosody and speech candidates in such a unit selection system. However, the conventional cost ignores the original prosody of speech segments, although it is assumed that the quality deterioration tendency varies in relation to the pitch or speech rate of the original speech. In this paper, we propose a novel cost function design based on the prosody of speech segments. First, we recorded nine databases of Japanese speech with different prosodic characteristics. Then, with respect to the speech databases, we investigated the relationships between the amount of prosody modification and the perceptual degradation. The results indicate that the tendency of perceptual degradation differs according to the prosodic features of the original speech. On the basis of these results, we propose a new cost function design, which changes the cost function according to the prosody of a speech database. Results of preference testing of synthetic speech show that the proposed cost functions generate speech of higher quality than the conventional method.

  • A VoiceFont Creation Framework for Generating Personalized Voices

    Takashi SAITO  Masaharu SAKAMOTO  

     
    PAPER-Speech Synthesis and Prosody

      Page(s):
    525-534

    This paper presents a new framework for effectively creating VoiceFonts for speech synthesis. A VoiceFont in this paper represents a voice inventory aimed at generating personalized voices. Creating well-formed voice inventories is a time-consuming and laborious task. This has become a critical issue for speech synthesis systems that make an attempt to synthesize many high quality voice personalities. The framework we propose here aims to drastically reduce the burden with a twofold approach. First, in order to substantially enhance the accuracy and robustness of automatic speech segmentation, we introduce a multi-layered speech segmentation algorithm with a new measure of segmental reliability. Secondly, to minimize the amount of human intervention in the process of VoiceFont creation, we provide easy-to-use functions in a data viewer and compiler to facilitate checking and validation of the automatically extracted data. We conducted experiments to investigate the accuracy of the automatic speech segmentation, and its robustness to speaker and style variations. The results of the experiments on six speech corpora with a fairly large variation of speaking styles show that the speech segmentation algorithm is quite accurate and robust in extracting segments of both phonemes and accentual phrases. In addition, to subjectively evaluate VoiceFonts created by using the framework, we conducted a listening test for speaker recognizability. The results show that the voice personalities of synthesized speech generated by the VoiceFont-based speech synthesizer are fairly close to those of the donor speakers.

  • AURORA-2J: An Evaluation Framework for Japanese Noisy Speech Recognition

    Satoshi NAKAMURA  Kazuya TAKEDA  Kazumasa YAMAMOTO  Takeshi YAMADA  Shingo KUROIWA  Norihide KITAOKA  Takanobu NISHIURA  Akira SASOU  Mitsunori MIZUMACHI  Chiyomi MIYAJIMA  Masakiyo FUJIMOTO  Toshiki ENDO  

     
    PAPER-Speech Corpora and Related Topics

      Page(s):
    535-544

    This paper introduces an evaluation framework for Japanese noisy speech recognition named AURORA-2J. Speech recognition systems must still be improved to be robust to noisy environments, but this improvement requires the development of a standard evaluation corpus and assessment technologies. Recently, the Aurora 2, 3 and 4 corpora and their evaluation scenarios have had a significant impact on noisy speech recognition research. AURORA-2J is a Japanese connected digits corpus, and its evaluation scripts are designed in the same way as those of Aurora 2 with the help of the European Telecommunications Standards Institute (ETSI) AURORA group. This paper describes the data collection, the baseline scripts, and the baseline performance. We also propose a new performance analysis method that considers differences in recognition performance among speakers. This method is based on the word accuracy per speaker, revealing the degree of individual difference in recognition performance. We also propose a categorization of the modifications applied to the original HTK baseline system, which helps in comparing systems and in identifying the technologies that improve performance most within the same category.

  • Robust Dependency Parsing of Spontaneous Japanese Spoken Language

    Tomohiro OHNO  Shigeki MATSUBARA  Nobuo KAWAGUCHI  Yasuyoshi INAGAKI  

     
    PAPER-Speech Corpora and Related Topics

      Page(s):
    545-552

    Spontaneously spoken Japanese includes a lot of grammatically ill-formed linguistic phenomena such as fillers, hesitations, inversions, and so on, which do not appear in written language. This paper proposes a novel method of robust dependency parsing using a large-scale spoken language corpus, and evaluates the availability and robustness of the method using spontaneously spoken dialogue sentences. By utilizing stochastic information about the appearance of ill-formed phenomena, the method can robustly parse spoken Japanese including fillers, inversions, or dependencies over utterance units. Experimental results reveal that the parsing accuracy reached 87.0%, and we confirmed that it is effective to utilize the location information of a bunsetsu, and the distance information between bunsetsus as stochastic information.

  • Construction and Evaluation of a Large In-Car Speech Corpus

    Kazuya TAKEDA  Hiroshi FUJIMURA  Katsunobu ITOU  Nobuo KAWAGUCHI  Shigeki MATSUBARA  Fumitada ITAKURA  

     
    PAPER-Speech Corpora and Related Topics

      Page(s):
    553-561

    In this paper, we discuss the construction of a large in-car spoken dialogue corpus and the result of its analysis. We have developed a system specially built into a Data Collection Vehicle (DCV) which supports the synchronous recording of multichannel audio data from 16 microphones that can be placed in flexible positions, multichannel video data from 3 cameras, and vehicle related data. Multimedia data has been collected for three sessions of spoken dialogue with different modes of navigation, during approximately a 60 minute drive by each of 800 subjects. We have characterized the collected dialogues across the three sessions. Some characteristics such as sentence complexity and SNR are found to differ significantly among the sessions. Linear regression analysis results also clarify the relative importance of various corpus characteristics.

  • Gemination of Consonant in Spontaneous Speech: An Analysis of the "Corpus of Spontaneous Japanese"

    Masako FUJIMOTO  Takayuki KAGOMIYA  

     
    PAPER-Speech Corpora and Related Topics

      Page(s):
    562-568

    In Japanese, there is frequent alternation between CV morae and moraic geminate consonants. In this study, we analyzed the phonemic environments of consonant gemination (CG) using the "Corpus of Spontaneous Japanese (CSJ)." The results revealed that the environment in which gemination occurs is, to some extent, parallel to that of vowel devoicing. However, there are two crucial differences. One difference is that the CG tends to occur in a /kVk/ environment, whereas such is not the case for vowel devoicing. The second difference is that when the preceding consonant is /r/, gemination occurs, but not vowel devoicing. These observations suggest that the mechanism leading to CG differs from that which leads to vowel devoicing.

  • An Objective Method for Evaluating Speech Translation System: Using a Second Language Learner's Corpus

    Keiji YASUDA  Fumiaki SUGAYA  Toshiyuki TAKEZAWA  Genichiro KIKUI  Seiichi YAMAMOTO  Masuzo YANAGIDA  

     
    PAPER-Speech Corpora and Related Topics

      Page(s):
    569-577

    In this paper we propose an objective method for assessing the capability of a speech translation system. It automates the translation paired comparison method proposed by Sugaya et al., which yields a simple, easy-to-understand TOEIC score that succinctly evaluates a speech translation system. To avoid the high evaluation cost of the original method, which requires large manual effort, the new method automates the procedure by employing objective metrics such as BLEU and a DP-based measure. The evaluation results obtained by the proposed method are similar to those of the original method. The proposed method is also used to evaluate the usefulness of a speech translation system. We find that our speech translation system is useful in general, even to users with a higher TOEIC score than the system's.

  • CIAIR In-Car Speech Corpus--Influence of Driving Status--

    Nobuo KAWAGUCHI  Shigeki MATSUBARA  Kazuya TAKEDA  Fumitada ITAKURA  

     
    LETTER

      Page(s):
    578-582

    CIAIR, Nagoya University, has been compiling an in-car speech database since 1999. This paper discusses the basic information contained in this database and an analysis on the effects of driving status based on the database. We have developed a system called the Data Collection Vehicle (DCV), which supports synchronous recording of multi-channel audio data from 12 microphones which can be placed throughout the vehicle, multi-channel video recording from three cameras, and the collection of vehicle-related data. In the compilation process, each subject had conversations with three types of dialog system: a human, a "Wizard of Oz" system, and a spoken dialog system. Vehicle information such as speed, engine RPM, accelerator/brake-pedal pressure, and steering-wheel motion was also recorded. In this paper, we report on the effect that driving status has on phenomena specific to spoken language.

    Regular Section
  • On Dependency Pair Method for Proving Termination of Higher-Order Rewrite Systems

    Masahiko SAKAI  Keiichirou KUSAKARI  

     
    PAPER-Computation and Computational Models

      Page(s):
    583-593

    This paper explores how to extend the dependency pair technique for proving termination of higher-order rewrite systems. In the first-order case, the termination of term rewriting systems is proved by showing the non-existence of an infinite R-chain of the dependency pairs. However, the termination and the non-existence of an infinite R-chain do not coincide in the higher-order case. We introduce a new notion of dependency forest that characterizes infinite reductions and infinite R-chains, and show that the termination property of higher-order rewrite systems R can be checked by showing the non-existence of an infinite R-chain, if R is strongly linear or non-nested.

  • Assessing the Quality of Fuzzy Partitions Using Relative Intersection

    Dae-Won KIM  Young-il KIM  Doheon LEE  Kwang Hyung LEE  

     
    PAPER-Computation and Computational Models

      Page(s):
    594-602

    In this paper, conventional validity indexes are reviewed and the shortcomings of the fuzzy cluster validation index based on inter-cluster proximity are examined. Based on these considerations, a new cluster validity index is proposed for fuzzy partitions obtained from the fuzzy c-means algorithm. The proposed validity index is defined as the average value of the relative intersections of all possible pairs of fuzzy clusters in the system. It computes the overlap between two fuzzy clusters by considering the intersection of each data point in the overlap. The optimal number of clusters is obtained by minimizing the validity index with respect to c. Experiments in which the proposed validity index and several conventional validity indexes were applied to well known data sets highlight the superior qualities of the proposed index.

  • ADPE: Agent-Based Decentralized Process Engine

    Shih-Chien CHOU  

     
    PAPER-Software Engineering

      Page(s):
    603-609

    Process-centered software engineering environments (PSEEs) facilitate controlling complicated software processes. Traditional PSEEs are generally centrally controlled, which may result in the following drawbacks: (1) the server may become a bottleneck and (2) when the server is down, processes need to be suspended. To overcome the drawbacks, we developed a decentralized process engine ADPE (agent-based decentralized process engine). ADPE can be embedded in any PSEE to decentralize the PSEE. This paper presents ADPE.

  • Delay Fault Testing of Processor Cores in Functional Mode

    Virendra SINGH  Michiko INOUE  Kewal K. SALUJA  Hideo FUJIWARA  

     
    PAPER-Dependable Computing

      Page(s):
    610-618

    This paper proposes an efficient methodology of delay fault testing of processor cores using their instruction sets. These test vectors can be applied in the functional mode of operation, hence, self-testing of processor core becomes possible for path delay fault testing. The proposed approach uses a graph theoretic model (represented as an Instruction Execution Graph) of the datapath and a finite state machine model of the controller for the elimination of functionally untestable paths at the early stage without looking into the circuit details and extraction of constraints for the paths that can potentially be tested. Parwan and DLX processors are used to demonstrate the effectiveness of our method.

  • Extended Role Based Access Control with Procedural Constraints for Trusted Operating Systems

    Wook SHIN  Jong-Youl PARK  Dong-Ik LEE  

     
    PAPER-Application Information Security

      Page(s):
    619-627

    The current scheme of access control judges the legality of each access based on immediate information without considering associated information hidden in a series of accesses. Due to this deficiency, access control systems do not efficiently limit attacks consisting of ordinary operations. For trusted operating system development, we extended RBAC and added negative procedural constraints to refuse those attacks. With the procedural constraints, the access control of trusted operating systems can discriminate attack trials from normal behaviors. This paper shows the specification of the extended concept and model, and presents simple analysis results.

  • A Kernel-Based Fisher Discriminant Analysis for Face Detection

    Takio KURITA  Toshiharu TAGUCHI  

     
    PAPER-Pattern Recognition

      Page(s):
    628-635

    This paper presents a modification of kernel-based Fisher discriminant analysis (FDA) to design a one-class classifier for face detection. In face detection, it is reasonable to assume that "face" images cluster in a certain way, but "non face" images usually do not cluster since different kinds of images are included. It is difficult to model "non face" images as a single distribution in the discriminant space constructed by the usual two-class FDA. Also, the dimension of the discriminant space constructed by the usual two-class FDA is bounded by 1. This means that we cannot obtain a higher dimensional discriminant space. To overcome these drawbacks of the usual two-class FDA, the discriminant criterion of FDA is modified such that the trace of the covariance matrix of the "face" class is minimized and the sum of squared errors between the average vector of the "face" class and the feature vectors of "non face" images is maximized. By this modification a higher dimensional discriminant space can be obtained. Experiments are conducted on "face" and "non face" classification using face images gathered from the available face databases and many face images on the Web. The results show that the proposed method can outperform the support vector machine (SVM). A close relationship between the proposed kernel-based FDA and kernel-based Principal Component Analysis (PCA) is also discussed.

  • Optimal Quantization Noise Allocation and Coding Gain in Transform Coding with Two-Dimensional Morphological Haar Wavelet

    Yasunari YOKOTA  Xiaoyong TAN  

     
    PAPER-Image Processing and Video Processing

      Page(s):
    636-645

    This paper analytically formulates both the optimal quantization noise allocation ratio and the coding gain of the two-dimensional morphological Haar wavelet transform. The two-dimensional morphological Haar wavelet transform has been proposed as a nonlinear wavelet transform. It has been anticipated for application to nonlinear transform coding. To utilize a transformation to transform coding, both the optimal quantization noise allocation ratio and the coding gain of the transformation should be derived beforehand regardless of whether the transformation is linear or nonlinear. The derivation is crucial for progress of nonlinear transform image coding with nonlinear wavelet because the two-dimensional morphological Haar wavelet is the most basic nonlinear wavelet. We derive both the optimal quantization noise allocation ratio and the coding gain of the two-dimensional morphological Haar wavelet transform by introducing appropriate approximations to handle the cumbersome nonlinear operator included in the transformation. Numerical experiments confirmed the validity of formulations.

  • Dynamic and Adaptive Morphing of Three-Dimensional Mesh Using Control Maps

    Tong-Yee LEE  Chien-Chi HUANG  

     
    PAPER-Computer Graphics

      Page(s):
    646-651

    This paper describes a dynamic and adaptive scheme for three-dimensional mesh morphing. Using several control maps, the connectivity of intermediate meshes is dynamically changed and the mesh vertices are adaptively modified. The 2D control maps in parametric space, which include a curvature map, an area deformation map and a distance map, are used to schedule the insertion and deletion of vertices in each frame. Then, the vertices are adaptively moved to better positions using a weighted centroidal Voronoi diagram (WCVD), and a Delaunay triangulation is finally used to determine the connectivity of the mesh. In contrast to most previous work, the intermediate mesh connectivity gradually changes and is much less complicated. We demonstrate several examples of aesthetically pleasing morphs created by the proposed method.

  • An Optimal Load Balancing Method for the Web-Server Cluster Based on the ANFIS Model

    Ilseok HAN  Wanyoung KIM  Hagbae KIM  

     
    LETTER-Computer Systems

      Page(s):
    652-653

    This paper presents an optimal load balancing algorithm based on both ANFIS (Adaptive Neuro-Fuzzy Inference System) modeling and the FIS (Fuzzy Inference System) for the local status of real servers. It also shows the substantial benefits provided by the suggested method, such as the removal of load-scheduling overhead, QoS (Quality of Service) provisioning, and highly available servers.

  • A Video Streaming File Server Framework for Digital Video Broadcasting Environments

    Eunkyo KIM  Wonjun LEE  Choonhwa LEE  

     
    LETTER-Computer Systems

      Page(s):
    654-657

    This letter presents the design and implementation of a video streaming file server system, which has been implemented in the context of a distributed digital multimedia broadcasting environment that has been prototyped. To make a performance analysis of file systems and distributed object services for continuous media (CM) provisioning, we validate the performance of the system against that of a conventional file system, Unix file system, through an experimental evaluation.

  • Comparison of Deadline-Based Scheduling Algorithms for Periodic Real-Time Tasks on Multiprocessor

    Minkyu PARK  Sangchul HAN  Heeheon KIM  Seongje CHO  Yookun CHO  

     
    LETTER-System Programs

      Page(s):
    658-661

    Multiprocessor architectures are becoming common in real-time systems as the workload of real-time systems increases. New deadline-based (EDF-based) multiprocessor scheduling algorithms have recently been devised, and comparative studies on the performance of these algorithms are necessary. In this paper, we compare EDZL, a hybrid of EDF and LLF, with other deadline-based scheduling algorithms such as EDF, EDF-US[m/(2m-1)], and fpEDF. We show that EDZL schedules all task sets schedulable by EDF. The experimental results show that the number of preemptions of EDZL is comparable to that of EDF and that the schedulable utilization bound of EDZL is higher than those of the other algorithms we consider.

  • Context-Dependent Phoneme Duration Modeling with Tree-Based State Tying

    Sung-Joon PARK  Myoung-Wan KOO  Chu-Shik JHON  

     
    LETTER-Speech and Hearing

      Page(s):
    662-666

    This letter presents two methods of modeling phoneme durations. One is the context-independent phoneme duration modeling in which duration parameters are stored in each phoneme. The other is the context-dependent duration modeling in which duration parameters are stored in each state shared by context-dependent phonemes. The phoneme duration model is compared with a without-duration model and a state duration model. Experiments are performed on a database collected over the telephone network. Experimental results show that duration information rejects out-of-task (OOT) words well and that the context-dependent duration model yields the best performance among the tested models.

  • Speech Recognition Using Finger Tapping Timings

    Hiromitsu BAN  Chiyomi MIYAJIMA  Katsunobu ITOU  Kazuya TAKEDA  Fumitada ITAKURA  

     
    LETTER-Speech and Hearing

      Page(s):
    667-670

    Behavioral synchronization between speech and finger tapping provides a novel approach to improving speech recognition accuracy. We combine a sequence of finger tapping timings recorded alongside an utterance using two distinct methods: in the first method, HMM state transition probabilities at the word boundaries are controlled by the timing of the finger tapping; in the second, the probability (relative frequency) of the finger tapping is used as a 'feature' and combined with MFCC in an HMM recognition system. We evaluate these methods through connected digit recognition under different noise conditions (AURORA-2J). Leveraging the synchrony between speech and finger tapping provides a 46% relative improvement in connected digit recognition experiments.

  • An Efficient Method for Dynamic Shadow Texture Generation

    Kyoung-Su OH  Byeong-Seok SHIN  

     
    LETTER-Computer Graphics

      Page(s):
    671-674

    We propose a novel shadow texture generation method with linear processing time using a shadow depth buffer (SZ-Buffer). We also present a method that achieves further speedup using temporal coherence. If transitions between the dynamic and static states are not frequent, the depth values of static objects do not vary significantly. We can therefore reuse the depth values for static objects and render only dynamic objects.

  • Pruning Rule for kMER-Based Acquisition of the Global Topographic Feature Map

    Eiji UCHINO  Noriaki SUETAKE  Chuhei ISHIGAKI  

     
    LETTER-Biocybernetics, Neurocomputing

      Page(s):
    675-678

    For kernel-based topographic map formation, kMER (kernel-based maximum entropy learning rule) was proposed by Van Hulle, and some effective learning rules related to kMER have been proposed so far with many applications. However, no discussions have been made concerning the determination of the number of units in kMER. This letter describes a unit-pruning rule, which permits automatic construction of an appropriate-sized map to acquire the global topographic features underlying the input data. The effectiveness and the validity of the present rule have been confirmed by some preliminary computer simulations.

  • A Genetic Algorithm for Routing with an Upper Bound Constraint

    Jun INAGAKI  Miki HASEYAMA  

     
    LETTER-Biocybernetics, Neurocomputing

      Page(s):
    679-681

    This paper presents a method of searching for the shortest route via the most designated points with the length not exceeding the preset upper bound. The proposed algorithm can obtain the quasi-optimum route efficiently and its effectiveness is verified by applying the algorithm to the actual map data.