
IEICE TRANSACTIONS on Information

Volume E78-D No.6 (Publication Date: 1995/06/25)

    Special Issue on Spoken Language Processing
  • FOREWORD

    Kazuyo TANAKA

    FOREWORD
      Page(s): 607-608
  • Multimodal Interaction in Human Communication

    Keiko WATANUKI  Kenji SAKAMOTO  Fumio TOGAWA

    PAPER
      Page(s): 609-615

    We are developing multimodal man-machine interfaces through which users can communicate by integrating speech, gaze, facial expressions, and gestures such as nodding and finger pointing. Such multimodal interfaces are expected to provide more flexible, natural and productive communication between humans and computers. To achieve this goal, we have taken the approach of modeling human behavior in the context of ordinary face-to-face conversations. As the first step, we have implemented a system which uses video and audio recording equipment to capture verbal and nonverbal information in interpersonal communication. Using this system, we have collected data from a task-oriented conversation between a guest (subject) and a receptionist at a company reception desk, and quantitatively analyzed this data with respect to the modalities that would be functional in fluid interactions. This paper presents detailed analyses of the data collected: (1) head nodding and eye contact are related to the beginning and end of speaking turns, acting to supplement speech information; (2) listener responses occur an average of 0.35 sec after the receptionist's utterance of a keyword, and turn-taking for tag-questions occurs after an average of 0.44 sec; and (3) there is a rhythmical coordination between speakers and listeners.

  • A Speech Dialogue System with Multimodal Interface for Telephone Directory Assistance

    Osamu YOSHIOKA  Yasuhiro MINAMI  Kiyohiro SHIKANO

    PAPER
      Page(s): 616-621

    This paper describes a multimodal dialogue system employing speech input. The system uses three input methods (a speech recognizer, a mouse, and a keyboard) and two output methods (a display and sound). The speech recognizer employs an algorithm for large-vocabulary, speaker-independent, continuous speech recognition based on the HMM-LR technique. The system is implemented for telephone directory assistance to evaluate the speech recognition algorithm and to investigate the variations in the speech structures that users address to computers. Speech input is thus used in a multimodal environment, and dialogue data between computers and users are also collected. Twenty telephone-number retrieval tasks are used to evaluate the system. In the experiments, all users are equally trained in using the dialogue system with an interactive guidance system implemented on a workstation. Simplified city maps that indicate subscriber names and addresses are used to reduce the implicit restrictions imposed by written sentences, thus allowing each user to develop his own forms of expression. The task completion rate is 99.0%, and approximately 75% of the users say that they prefer this system to using a telephone book. Moreover, there is a significant decrease in nonkeyword usage, i.e., the use of words other than names and addresses, for users who receive more utterance practice.

  • An Utterance Prediction Method Based on the Topic Transition Model

    Yoichi YAMASHITA  Takashi HIRAMATSU  Osamu KAKUSHO  Riichiro MIZOGUCHI

    PAPER
      Page(s): 622-628

    This paper describes a method for predicting the user's next utterances in spoken dialogue based on a topic transition model named TPN. Templates are prepared for each utterance-pair pattern modeled by SR-plan; they are represented in terms of five kinds of topic-independent sentence constituents. The topic of an utterance is predicted from the TPN model and used to instantiate the templates. The language processing unit then analyzes the speech recognition result using the templates. An experiment shows that introducing the TPN model improves utterance recognition performance and drastically reduces the search space of candidates in the input bunsetsu lattice.

  • Cooperative Spoken Dialogue Model Using Bayesian Network and Event Hierarchy

    Masahiro ARAKI  Shuji DOSHITA

    PAPER
      Page(s): 629-635

    In this paper, we propose a dialogue model that reflects two important requirements for a spoken dialogue system: to be 'robust' and to be 'cooperative'. For this purpose, our model has two main inference spaces: the Conversational Space (CS) and the Problem Solving Space (PSS). The CS is a kind of dynamic Bayesian network that represents the meaning of an utterance and general dialogue rules; the 'robust' aspect is treated in the CS. The PSS is a network called an Event Hierarchy that represents the structure of task-domain problems; the 'cooperative' aspect is mainly treated in the PSS. In constructing the CS and making inferences on the PSS, the system's processing, from meaning understanding through response generation, is modeled as five steps: (1) meaning understanding, (2) intention understanding, (3) communicative effect, (4) reaction generation, and (5) response generation. The meaning understanding step constructs the CS, and the response generation step composes a surface expression of the system's response from part of the CS. The intention understanding step maps utterance types in the CS to actions in the PSS. The reaction generation step selects a cooperative reaction in the PSS and expands it into utterance types in the CS. The status of problem solving and the user's declared preferences are recorded in the mental state by the communicative effect step. From this point of view, cooperative problem-solving dialogue is regarded as a process of constructing the CS and achieving goals in the PSS through these five steps.

  • Error Analysis of Field Trial Results of a Spoken Dialogue System for Telecommunications Applications

    Shingo KUROIWA  Kazuya TAKEDA  Masaki NAITO  Naomi INOUE  Seiichi YAMAMOTO

    PAPER
      Page(s): 636-641

    We carried out a one-year field trial of a voice-activated automatic telephone exchange service at KDD Laboratories, which has about 200 branch phones. The system has DSP-based continuous speech recognition hardware which can process incoming calls in real time using a vocabulary of 300 words. Its recognition accuracy was found to be 92.5%, independent of the speaker, for speech read from a written text under laboratory conditions. In this paper, we describe the performance of the system obtained as a result of the field trial. Apart from recognition accuracy, there was an error rate of about 20% due to out-of-vocabulary input and incorrect detection of speech endpoints, factors which had not been allowed for in the laboratory experiments. We also found that the recognition accuracy for actual speech was about 18% lower than for speech read from text, even when there were no out-of-vocabulary words. We examine error variations for individual data to pinpoint the causes of incorrect recognition. Experiments on the collected data showed that the pause model, the filled-pause grammar, and differences in channel frequency response seriously affected recognition accuracy. With the help of simple techniques to overcome these problems, we finally obtained a recognition accuracy of 88.7% on real data.

  • Automatic Determination of the Number of Mixture Components for Continuous HMMs Based on a Uniform Variance Criterion

    Tetsuo KOSAKA  Shigeki SAGAYAMA

    PAPER
      Page(s): 642-647

    We discuss how to automatically determine the number of mixture components in continuous mixture density HMMs (CHMMs). A notable trend in recent years has been the use of CHMMs. One of the major problems with a CHMM is how to determine its structure: how many mixture components and states it has, and its optimal topology. The number of mixture components has so far been determined heuristically. To solve this problem, we first investigate the influence of the number of mixture components on the model parameters and the output log-likelihood value. As a result, in contrast to the "mixture number uniformity" applied in conventional approaches to determine the number of mixture components, we propose the principle of "distribution size uniformity". An algorithm is introduced for automatically determining the number of mixture components. The performance of this algorithm is shown through recognition experiments involving all Japanese phonemes. Two types of experiments are carried out. One assumes that the number of mixture components for each state is the same within a phonetic model but may vary between states belonging to different phonemes. The other assumes that each state has a variable number of mixture components. Both experiments give better results than the conventional method.
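
    As a concrete illustration of the "distribution size uniformity" idea, the following Python sketch (our own, not the authors' algorithm; the target-size threshold and the diagonal-covariance choice are assumptions) grows the mixture order of one state until every component's average variance falls below a common target, so that all components end up roughly uniform in size.

        # Hypothetical sketch: pick the mixture order by a uniform
        # "distribution size" criterion rather than a fixed mixture number.
        import numpy as np
        from sklearn.mixture import GaussianMixture

        def fit_uniform_size_gmm(frames, target_size, max_components=32):
            for m in range(1, max_components + 1):
                gmm = GaussianMixture(n_components=m,
                                      covariance_type="diag").fit(frames)
                sizes = gmm.covariances_.mean(axis=1)  # mean variance per component
                if sizes.max() <= target_size:         # all components small enough
                    return gmm
            return gmm                                 # fall back to the largest order

        # e.g., 500 twelve-dimensional cepstral frames for one HMM state
        state_frames = np.random.randn(500, 12)
        model = fit_uniform_size_gmm(state_frames, target_size=1.1)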

  • An HMM State Duration Control Algorithm Applied to Large-Vocabulary Spontaneous Speech Recognition

    Satoshi TAKAHASHI  Yasuhiro MINAMI  Kiyohiro SHIKANO

    PAPER
      Page(s): 648-653

    Although hidden Markov modeling (HMM) is widely and successfully used in many speech recognition applications, duration control for HMMs is still an important issue in improving recognition accuracy, since an HMM places no constraints on duration. To compensate for this defect, duration control algorithms that employ precise duration models have been proposed; however, they suffer from greatly increased computational complexity. This paper proposes a new state duration control algorithm for limiting both the maximum and the minimum state durations. The algorithm applies to the HMM trellis likelihood calculation, not the Viterbi calculation. The amount of computation required is only of order one (O(1)) in the maximum state duration n; that is, it is independent of the maximum state duration, while many conventional duration control algorithms require computation of order n or order n^2. Thus, the algorithm can drastically reduce the computation needed for duration control. The algorithm uses the property that the trellis likelihood is a summation of many path likelihoods. At each frame, the likelihood of any path that exceeds the maximum duration is subtracted from the forward probability, and the likelihood of any path that satisfies the minimum duration is added. By iterating this procedure, the algorithm calculates the trellis likelihood efficiently. The algorithm was evaluated using a large-vocabulary speaker-independent spontaneous speech recognition system for telephone directory assistance. The average reduction in error rate for sentence understanding was about 7% with context-independent HMMs and 3% with context-dependent HMMs. We could confirm the improvement due to the proposed state duration control algorithm even though the maximum and minimum state durations were not optimized for the task (speaker-independent duration settings obtained from a different task were used).
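
    The O(1) bookkeeping can be pictured as a prefix-sum over entry likelihoods. The sketch below is our reconstruction from the abstract, not the published algorithm; it computes, for one state, the forward mass of all paths whose duration lies in [d_min, d_max], using one addition and one subtraction per frame regardless of d_max (in practice the running product B must be rescaled to avoid underflow).

        # Our reconstruction (not the authors' code) of an O(1)-per-frame
        # duration-constrained forward computation for one HMM state.
        def duration_constrained_forward(alpha_in, b, d_min, d_max):
            # alpha_in[t]: likelihood of entering the state at frame t
            # b[t]: state output probability of frame t
            T = len(b)
            B = 1.0                    # running product b[0]*...*b[t-1]
            prefix = [0.0]             # prefix sums of alpha_in[s] / B[s-1]
            out = [0.0] * T
            for t in range(T):
                prefix.append(prefix[-1] + alpha_in[t] / B)
                B *= b[t]
                lo = max(0, t - d_max + 1)    # earliest admissible entry frame
                hi = min(t, t - d_min + 1)    # latest admissible entry frame
                if hi >= lo:
                    # sums durations d_min..d_max in O(1) work per frame
                    out[t] = B * (prefix[hi + 1] - prefix[lo])
            return out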

  • Duration Modeling with Decreased Intra-Group Temporal Variation for HMM-Based Phoneme Recognition

    Nobuaki MINEMATSU  Keikichi HIROSE

    PAPER
      Page(s): 654-661

    A new clustering method is proposed to increase the effect of duration modeling in HMM-based phoneme recognition. A close examination of the temporal correspondences between a phoneme HMM with single-Gaussian output probabilities and its training data revealed two extreme cases: phoneme classes with several mutually different types of correspondence, and classes with only one type. Although duration modeling is commonly used to incorporate temporal information into HMMs, a good duration model cannot be obtained in the former case. Further examination of phoneme HMMs with Gaussian-mixture output probabilities showed that some HMMs still had multiple temporal correspondences, though the number of such phonemes was smaller than with single-Gaussian modeling. An appropriate duration model cannot be obtained for these phoneme HMMs by conventional methods, where the duration distribution of each HMM state is represented by a single distribution function. To cope with this problem, a new method is proposed that clusters phoneme classes with multiple types of temporal correspondence into sub-classes. The clustering is conducted so as to reduce the variation of the temporal correspondences within each sub-class; after clustering, an HMM is constructed for each sub-class. Using the proposed method, speaker-dependent recognition experiments were performed on phonemes segmented from isolated words. A few-percent increase in the recognition rate was realized, which was not obtained by an alternative method based on duration modeling with a Gaussian mixture.

  • A New HMnet Construction Algorithm Requiring No Contextual Factors

    Motoyuki SUZUKI  Shozo MAKINO  Akinori ITO  Hirotomo ASO  Hiroshi SHIMODAIRA

    PAPER
      Page(s): 662-668

    Many methods have been proposed for constructing context-dependent phoneme models using hidden Markov models (HMMs) to improve performance. These conventional methods require previously defined contextual factors; if these factors are deficient, the methods exhibit poor recognition performance. In this paper, we propose a new construction algorithm for the HMnet which does not require pre-defined contextual factors. Experiments demonstrated that the new algorithm could construct the HMnet even in cases where the Successive State Splitting (SSS) algorithm could not, and that it produced better phoneme recognition performance than the SSS algorithm.

  • A Comparative Study of Output Probability Functions in HMMs

    Seiichi NAKAGAWA  Li ZHAO  Hideyuki SUZUKI

    PAPER
      Page(s): 669-675

    One of the most effective methods in speech recognition is the HMM, which has been used to model speech statistically. Discrete-distribution and continuous-distribution HMMs have been widely used in various applications. In recent years, however, HMMs with various output probability functions have been proposed to further improve recognition performance, e.g., the Gaussian-mixture continuous and the semi-continuous HMMs. We have also recently proposed the RBF (radial basis function)-based HMM and the VQ-distortion-based HMM, which use an RBF and a VQ-distortion measure at each state instead of the output probability density function used by traditional HMMs. In this paper, we describe the RBF-based HMM and the VQ-distortion-based HMM and compare them with the discrete, Gaussian-mixture, and semi-continuous HMMs through experiments on speaker-independent spoken digit recognition. Our results confirm that the RBF-based and VQ-distortion-based HMMs are more robust than, and superior to, traditional HMMs.
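
    For intuition, the VQ-distortion idea can be sketched as follows (a minimal illustration under our own conventions, not the authors' exact formulation): each state holds a small codebook, and a frame's state score is the negated minimum quantization distortion, used in place of a log output probability during decoding.

        import numpy as np

        def vq_state_score(frame, codebook):
            # codebook: (K, D) array of centroid vectors for one HMM state
            distortion = np.min(np.sum((codebook - frame) ** 2, axis=1))
            return -distortion   # larger is better, like a log-probability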

  • Neural Predictive Hidden Markov Model for Speech Recognition

    Eiichi TSUBOKA  Yoshihiro TAKADA

    PAPER
      Page(s): 676-684

    This paper describes new modeling methods combining neural networks and hidden Markov models, applicable to modeling a time series such as a speech signal. The idea assumes that the sequence is nonstationary and is a nonlinear autoregressive process whose parameters are controlled by a hidden Markov chain. In one model, a nonlinear predictor composed of a multi-layered neural network is defined at each state; in the other, a multi-layered neural network is defined so that the paths from the input layer to the output layer are divided into path-groups, each of which corresponds to a state of the Markov chain. The latter is an extension of the former. Parameter estimation methods for these models are given, and two previously proposed models, the Neural Prediction Model and the Linear Predictive HMM, are shown to be special cases of the NPHMM proposed here. The experimental results affirm the validity of the proposed models.

  • Tone Recognition of Chinese Dissyllables Using Hidden Markov Models

    Xinhui HU  Keikichi HIROSE

    PAPER
      Page(s): 685-691

    A tone recognition method has been developed for dissyllabic speech of Standard Chinese based on discrete hidden Markov modeling. As feature parameters for recognition, a combination of macroscopic and microscopic parameters of fundamental frequency contours was shown to give better results than either parameter used alone. Speaker normalization was realized by introducing an offset to the fundamental frequency. To avoid recognition errors due to syllable segmentation, a scheme of concatenated learning was adopted for training the hidden Markov models. Based on observations of the fundamental frequency contours of dissyllables, a contour is represented by a series of three tone models: two for the first and second syllables and one for the transition part around the syllabic boundary. Owing to a voiceless consonant at the start of the second syllable, the fundamental frequency contour of a dissyllable may include a part without fundamental frequencies; this part is linearly interpolated in the current method. To prove the validity of the proposed method, it was compared with other methods, such as representing each dissyllabic contour as the concatenation of two models, or assigning a special code to the voiceless part. Tone sandhi was also taken into account by introducing two additional models, one for the half-third tone and one for the first 4th tone in a combination of two 4th tones. With the proposed method, an average recognition rate of 96% was achieved for 5 male and 5 female speakers.
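
    The offset normalization and the interpolation of the voiceless part can be sketched in a few lines. This is our illustration only; the 0.0-marks-voiceless convention and the per-speaker offset value are assumptions, not the paper's exact procedure.

        import numpy as np

        def normalize_and_interpolate(f0, speaker_offset):
            # f0: contour in Hz, one value per frame, 0.0 for voiceless frames
            f0 = np.asarray(f0, dtype=float)
            voiced = f0 > 0
            out = f0 - speaker_offset          # speaker normalization by offset
            # linearly interpolate the voiceless gap from neighboring voiced frames
            out[~voiced] = np.interp(np.flatnonzero(~voiced),
                                     np.flatnonzero(voiced), out[voiced])
            return out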

  • Speech Recognition Using Function-Word N-Grams and Content-Word N-Grams

    Ryosuke ISOTANI  Shoichi MATSUNAGA  Shigeki SAGAYAMA

    PAPER
      Page(s): 692-697

    This paper proposes a new stochastic language model for speech recognition based on function-word N-grams and content-word N-grams. Conventional word N-gram models are effective for speech recognition, but they represent only local constraints within a few successive words and lack the ability to capture global syntactic or semantic relationships between words. To represent more global constraints, the proposed language model gives the N-gram probabilities of word sequences, with attention given only to function words or only to content words. The sequences of function words and of content words are expected to represent syntactic and semantic constraints, respectively. Probabilities of function-word bigrams and content-word bigrams were estimated from a 10,000-sentence text database, and analysis using an information-theoretic measure showed that the expected constraints were extracted appropriately. As an application of this model to speech recognition, a post-processor was constructed to select the optimum sentence candidate from a phrase lattice obtained by a phrase recognition system. The phrase candidate sequence with the highest total acoustic and linguistic score was found by dynamic programming. Experiments on the utterances of 12 speakers showed that the proposed method is more accurate than a CFG-based method, demonstrating its effectiveness in improving speech recognition performance.
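
    A minimal sketch of the separated-stream bigram estimation (our own illustration; the tag names and the add-one smoothing are assumptions): each sentence is filtered down to one word class, and bigram probabilities are estimated over that subsequence.

        from collections import Counter

        def train_stream_bigrams(sentences, keep_tag):
            # sentences: lists of (word, tag) pairs
            uni, bi = Counter(), Counter()
            for sent in sentences:
                stream = [w for w, tag in sent if tag == keep_tag]
                for a, b in zip(["<s>"] + stream, stream + ["</s>"]):
                    uni[a] += 1
                    bi[(a, b)] += 1
            vocab = len(uni)
            return lambda a, b: (bi[(a, b)] + 1) / (uni[a] + vocab)

        # function_p = train_stream_bigrams(corpus, "FUNC")
        # content_p  = train_stream_bigrams(corpus, "CONT")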

  • Relationship among Recognition Rate, Rejection Rate and False Alarm Rate in a Spoken Word Recognition System

    Atsuhiko KAI  Seiichi NAKAGAWA

    PAPER
      Page(s): 698-704

    Detecting an unknown (out-of-vocabulary) word uttered by the user is necessary for a practical spoken-language user interface. This paper evaluates an unknown-word processing method for a subword-unit-based spoken word recognizer. We assessed the relationship between the word recognition accuracy of a system and its detection rate for unknown words, both by simulation and by experiments with the unknown-word processing method. We found that the resulting detection accuracy is significantly influenced by the original word recognition accuracy, and that the degree of this effect depends on the vocabulary size.

  • Automatic Language Identification Using Sequential Information of Phonemes

    Takayuki ARAI

    PAPER
      Page(s): 705-711

    In this paper, approaches to language identification based on the sequential information of phonemes are described. These approaches assume that each language can be identified from its own phoneme structure, or phonotactics. To extract this phoneme structure, we use phoneme classifiers and grammars for each language. The phoneme classifier for each language is implemented as a multi-layer perceptron trained on quasi-phonetic hand-labeled transcriptions. After training the phoneme classifiers, the grammar for each language is calculated as a set of transition probabilities for each phoneme pair. Because of the interest in automatic language identification for worldwide voice communication, we decided to use telephone speech for this study. The data were drawn from the OGI (Oregon Graduate Institute)-TS (telephone speech) corpus, a standard corpus for this type of research. To investigate the basic issues of this approach, two languages, Japanese and English, were selected. The language classification algorithms are based on a Viterbi search constrained by a bigram grammar and by minimum and maximum durations. Using a phoneme classifier trained only on English phonemes, we achieved 81.1% accuracy; using one trained on Japanese phonemes, 79.3%. Using both the English and the Japanese phoneme classifiers together, we obtained our best result: 83.3%. These results are comparable to those obtained by other methods, such as those based on hidden Markov models.
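
    The scoring step can be pictured as follows. This sketch is ours (the floor probability for unseen phoneme pairs is an assumption), and it omits the duration constraints and the full Viterbi search described above:

        import math

        def sequence_logprob(phonemes, bigram_p):
            score = 0.0
            for a, b in zip(phonemes, phonemes[1:]):
                score += math.log(bigram_p.get((a, b), 1e-6))  # floor unseen pairs
            return score

        def identify_language(phonemes, grammars):
            # grammars: dict of language name -> {(ph1, ph2): probability}
            return max(grammars,
                       key=lambda lang: sequence_logprob(phonemes, grammars[lang]))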

  • A Study on Speaker Adaptation for Mandarin Syllable Recognition with Minimum Error Discriminative Training

    Chih-Heng LIN  Chien-Hsing WU  Pao-Chung CHANG

    PAPER
      Page(s): 712-718

    This paper investigates a discriminative method of speaker adaptation for Mandarin syllable recognition. Based on the minimum classification error (MCE) criterion, we use the generalized probabilistic descent (GPD) algorithm to iteratively adjust the parameters of hidden Markov models (HMMs). Experiments on the multi-speaker Mandarin syllable database of Telecommunication Laboratories (T.L.) yield the following results: 1) efficient speaker adaptation can be achieved through discriminative training using the MCE criterion and the GPD algorithm; 2) the required computation can be reduced by using the confusion sets of Mandarin base syllables; 3) in discriminative training, adjusting the mean values of the Gaussian mixtures has the most prominent effect on speaker adaptation; and 4) the discriminative training approach can be used to enhance the speaker adaptation capability of the maximum a posteriori (MAP) approach.
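
    As a schematic of one GPD step under the MCE criterion (the textbook form, our illustration rather than the paper's exact recipe): the misclassification measure compares the correct model's score with its best rival, a sigmoid smooths it into a differentiable loss, and the parameters move down the loss gradient.

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def gpd_step(theta, g_correct, g_rival, grad_correct, grad_rival,
                     lr=0.01, gamma=1.0):
            # g_*: discriminant scores; grad_*: their gradients w.r.t. theta
            d = g_rival - g_correct                 # misclassification measure
            s = sigmoid(gamma * d)                  # smoothed 0/1 loss
            slope = gamma * s * (1.0 - s)           # derivative of loss w.r.t. d
            return theta - lr * slope * (grad_rival - grad_correct)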

  • Speaker-Consistent Parsing for Speaker-Independent Continuous Speech Recognition

    Kouichi YAMAGUCHI  Harald SINGER  Shoichi MATSUNAGA  Shigeki SAGAYAMA

    PAPER
      Page(s): 719-724

    This paper describes a novel speaker-independent speech recognition method, called "speaker-consistent parsing", which is based on an intra-speaker correlation called the speaker-consistency principle. We focus on the fact that a sentence or a string of words is uttered by a single speaker even in a speaker-independent task. Thus, the proposed method searches through speaker variations in addition to the contents of utterances. As a result of the recognition process, an appropriate standard speaker is selected for speaker adaptation. This new method is experimentally compared with a conventional speaker-independent speech recognition method. Since the speaker-consistency principle best demonstrates its effect with a large number of training and test speakers, a small-scale experiment may not fully exploit it; nevertheless, even the results of our small-scale experiment show that the new method significantly outperforms the conventional method. In addition, the framework's speaker selection mechanism can drastically reduce the likelihood-map computation.

  • A Scheme for Word Detection in Continuous Speech Using Likelihood Scores of Segments Modified by Their Context Within a Word

    Sumio OHNO  Keikichi HIROSE  Hiroya FUJISAKI

    PAPER
      Page(s): 725-731

    In conventional word-spotting methods for automatic recognition of continuous speech, individual frames or segments of the input speech are assigned labels and local likelihood scores solely on the basis of their own acoustic characteristics. On the other hand, experiments on human speech perception conducted by the present authors and others show that human perception of words in connected speech is based not only on the acoustic characteristics of individual segments, but also on the acoustic and linguistic contexts in which these segments occur; in other words, individual segments are not correctly perceived by humans unless they are accompanied by their context. These findings about human speech perception should be applied to automatic speech recognition to improve its performance. From this point of view, the present paper proposes a new scheme for detecting words in continuous speech based on template matching, in which the likelihood of each segment of a word is determined not only by its own characteristics but also by the likelihood of its context within the word. This is accomplished by modifying the likelihood score of each segment by the likelihood score of its phonetic context, the latter representing the degree of similarity of the context to that of a candidate word in the lexicon: the segmental likelihood score is enhanced more strongly when the likelihood score of its context is higher. The advantage of the proposed scheme over conventional schemes is demonstrated by an experiment on constructing a word lattice from connected Japanese speech uttered by a male speaker. The results indicate that the scheme is especially effective in producing correct recognition when two or more candidate words have almost equal raw segmental likelihood scores.
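
    A toy version of the score modification (entirely our illustration; the linear enhancement and the equal weighting are placeholders for the paper's actual enhancement function): each segment's score against a candidate word is boosted in proportion to how well the remaining segments of the same word match.

        def modified_segment_score(seg_score, context_score, weight=0.5):
            # both scores normalized to [0, 1]; better context -> bigger boost
            return seg_score * (1.0 + weight * context_score)

        def word_score(segment_scores):
            # segment_scores: raw per-segment likelihoods for one candidate word
            total = 0.0
            n = len(segment_scores)
            for i, s in enumerate(segment_scores):
                context = (sum(segment_scores) - s) / max(n - 1, 1)
                total += modified_segment_score(s, context)
            return total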

  • Uniform and Non-uniform Normalization of Vocal Tracts Measured by MRI Across Male, Female and Child Subjects

    Chang-Sheng YANG  Hideki KASUYA

    PAPER
      Page(s): 732-737

    Three-dimensional vocal tract shapes of a male, a female, and a child subject are measured from magnetic resonance (MR) images during sustained phonation of the Japanese vowels /a, i, u, e, o/. Non-uniform dimensional differences between the subjects' vocal tract shapes are quantitatively measured. The vocal tract area functions of the female and child subjects are normalized to those of the male on the basis of non-uniform and uniform scalings of the vocal tract length and compared with each other. A comparison is also made between the formant frequencies computed from the area functions normalized by the two different scalings. The comparisons suggest that non-uniformity in the vocal tract dimensions is not essential for the normalization of the five Japanese vowels.

  • Simultaneous Estimation of Vocal Tract and Voice Source Parameters Based on an ARX Model

    Wen DING  Hideki KASUYA  Shuichi ADACHI

    PAPER
      Page(s): 738-743

    A novel adaptive pitch-synchronous analysis method is proposed to simultaneously estimate vocal tract (formant/antiformant) and voice source parameters from speech waveforms. We use the parametric Rosenberg-Klatt (RK) model to generate a glottal waveform and an autoregressive-exogenous (ARX) model to represent the voiced speech production process. A Kalman filter is used to estimate the formant/antiformant parameters from the coefficients of the ARX model, and simulated annealing is employed as a nonlinear optimization approach to estimate the voice source parameters. The two approaches work together in a system identification procedure to find the best parameter set for both models. On synthetic speech, the new method proved superior to several other approaches in terms of the accuracy of the estimated parameter values. We also show that the proposed method can accurately estimate the parameters of natural speech sounds. A major application of the analysis method lies in a concatenative formant synthesizer, which allows flexible control of the voice quality of synthetic speech.
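
    The ARX production model at the heart of the method can be written down directly. The sketch below uses our own notation and an assumed sign convention for the coefficients; it shows the one-step prediction whose coefficients the Kalman filter would estimate, with the RK glottal waveform as the exogenous input u.

        import numpy as np

        def arx_predict(y_past, u_n, a, b0):
            # y_hat[n] = -(a[0]*y[n-1] + ... + a[p-1]*y[n-p]) + b0 * u[n]
            # y_past: the p most recent output samples, newest last
            return -np.dot(a, y_past[::-1]) + b0 * u_n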

  • Characteristics of Multi-Layer Perceptron Models in Enhancing Degraded Speech

    Thanh Tung LE  John MASON  Tadashi KITAMURA

    PAPER
      Page(s): 744-750

    A multi-layer perceptron (MLP) acting directly in the time domain is applied as a speech signal enhancer, and its performance is examined in the context of three common classes of degradation: non-linear system degradation (low bit-rate CELP coding), additive noise, and convolution by a linear system. The investigation focuses on two topics: (i) the influence of non-linearities within the network, and (ii) network topology, comparing single- and multiple-output structures. The objective is to examine how these characteristics influence network performance and whether this depends on the class of degradation. Experimental results show the importance of matching the enhancer to the class of degradation. For the CELP coder, the standard MLP with its inherently non-linear characteristics is consistently better than any equivalent linear structure (up to 3.2 dB SNR improvement, compared with 1.6 dB). In contrast, when the degradation is additive noise, a linear enhancer is always superior.

  • An Objective Measure Based on an Auditory Model for Assessing Low-Rate Coded Speech

    Toshiro WATANABE  Shinji HAYASHI

    PAPER
      Page(s): 751-757

    We propose an objective measure for assessing low-rate coded speech. The underlying model, which emulates several known features of the perceptual processing of speech sounds by the human ear, is based on the Hertz-to-Bark transformation, critical-band filtering with preemphasis to boost higher frequencies, nonlinear conversion to subjective loudness, and temporal (forward) masking. The effectiveness of the measure, called the Bark spectral distortion rating (BSDR), was validated by second-order polynomial regression between computed BSDR values and subjective MOS ratings obtained for a large number of utterances coded by several versions of CELP coders and one VSELP coder under three degradation conditions: input speech level, transmission error rate, and background noise level. BSDR values correspond better to MOS ratings than several commonly used measures, so BSDR can be used to accurately predict subjective scores.
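
    For reference, a common textbook Hertz-to-Bark approximation (Zwicker-style) and a bare-bones distortion average are sketched below. This is our illustration; the paper's full BSDR model additionally includes critical-band filtering with preemphasis, loudness conversion, and forward masking, all omitted here.

        import numpy as np

        def hz_to_bark(f_hz):
            # widely used approximation of the Hertz-to-Bark transformation
            return (13.0 * np.arctan(0.00076 * f_hz)
                    + 3.5 * np.arctan((f_hz / 7500.0) ** 2))

        def bark_spectral_distortion(loud_ref, loud_coded):
            # per-frame loudness spectra on a Bark-spaced axis, shape (T, nbands)
            return np.mean(np.sum((loud_ref - loud_coded) ** 2, axis=1))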

  • 4 kbps Improved Pitch Prediction CELP Speech Coding with 20 msec Frame

    Masahiro SERIZAWA  Kazunori OZAWA

    PAPER
      Page(s): 758-763

    This paper proposes a new pitch prediction method for 4 kbps CELP (Code-Excited LPC) speech coding with a 20 msec frame, aimed at the future ITU-T 4 kbps speech coding standardization. In conventional CELP coding, synthetic speech quality deteriorates rapidly at 4 kbps, especially for female and children's speech with short pitch periods, for which pitch prediction performance is significantly degraded. The main reason is that when the pitch period is shorter than the subframe length, the adaptive codebook usually performs a simple repetition of the past excitation signal based on the estimated lag rather than true pitch prediction. The proposed method carries out pitch prediction without this approximation by utilizing the current subframe excitation codevector signal when the pitch prediction parameters are determined. To further improve performance, a split-vector synthesis and perceptual spectral weighting method and a low-complexity perceptual harmonic and spectral weighting method have also been developed. Informal listening tests show that the 4 kbps coder with a 20 msec frame, utilizing all of the proposed improvements, achieves a MOS 0.2 higher than the coder without them.

  • Regular Section
  • A Structured Video Handling Technique for Multimedia Systems

    Yoshinobu TONOMURA  Akihito AKUTSU

    PAPER-Image Processing, Computer Graphics and Pattern Recognition
      Page(s): 764-777

    This paper proposes a functional video handling technique based on structured video. The video handling architecture, which includes a video data structure, a file management structure, and a visual interface structure, is introduced as the core concept of this technique. One of the key features of this architecture is that the newly proposed video indexing is performed automatically by image processing. The video data structure, which plays an important role in the architecture, comprises two kinds of structures: content and node. The central idea behind these structures is to separate the video contents from the processing operations and to create links between them; video indexes work as a back-end mechanism in structuring video content. A prototype video handling system called MediaBENCH, a hypermedia basic environment for computer-human interaction, demonstrates the actual implementation of the proposed concept and technique. Basic functions such as browsing and editing, built on the architecture, exhibit the advantages of structured video handling. The concept and methods proposed in this paper support various video-computer applications, which will play major roles in the multimedia field.

  • A Note on One-way Auxiliary Pushdown Automata

    Yue WANG  Jian-Liang XU  Katsushi INOUE  Akira ITO

    LETTER-Automata, Languages and Theory of Computing
      Page(s): 778-782

    This paper establishes a relationship among the accepting powers of deterministic, nondeterministic, and alternating one-way auxiliary pushdown automata, for any tape bound below n. Some other related results are also presented.