Keyword Search Result

[Keyword] SPE(2504hit)

2261-2280hit(2504hit)

  • Unsupervised Speaker Adaptation Using All-Phoneme Ergodic Hidden Markov Network

    Yasunage MIYAZAWA  Jun-ichi TAKAMI  Shigeki SAGAYAMA  Shoichi MATSUNAGA  

     
    PAPER-Speech Processing and Acoustics

    Vol: E78-D No:8  Page(s): 1044-1050

    This paper proposes an unsupervised speaker adaptation method using an "all-phoneme ergodic Hidden Markov Network" that combines allophonic (context-dependent phone) acoustic models with stochastic language constraints. A Hidden Markov Network (HMnet) for allophone modeling and allophonic bigram probabilities derived from a large text database are combined to yield a single large ergodic HMM that represents arbitrary speech signals in a particular language, so that the model parameters can be re-estimated from text-unknown speech samples with the Baum-Welch algorithm. When combined with the Vector Field Smoothing (VFS) technique, unsupervised speaker adaptation can be performed effectively. In experiments, this method performed better than our previous unsupervised adaptation method, which used conventional phonetic HMMs and phoneme bigram probabilities, especially when the amount of training data was small.

  • 8-kb/s Low-Delay Speech Coding with 4-ms Frame Size

    Yoshiaki ASAKAWA  Preeti RAO  Hidetoshi SEKINE  

     
    PAPER

    Vol: E78-A No:8  Page(s): 927-933

    This paper describes modifications to a previously proposed 8-kb/s 4-ms-delay CELP speech coding algorithm with a view to improving the speech quality while maintaining low delay and only moderately increasing complexity. The modifications are intended to improve the effectiveness of interframe pitch lag prediction and the sub-optimality level of the excitation coding to the backward adapted synthesis filter by using delayed decision and joint optimization techniques. Results of subjective listening tests using Japanese speech indicate that the coded speech quality is significantly superior to that of the 8-kb/s VSELP coder which has a 20-ms delay. A method that reduces the computational complexity of closed-loop 3-tap pitch prediction with no perceptible degradation in speech quality is proposed, based on representing the pitch-tap vector as the product of a scalar pitch gain and a normalized shape codevector.

  • Novel Architecture and MMIC's for an Integrated Front-End of a Spectrum Analyzer

    Tsutomu TAKENAKA  Atsushi MIYAZAKI  Hiroyuki MATSUURA  

     
    PAPER

    Vol: E78-C No:8  Page(s): 911-918

    This paper proposes a novel architecture and MMIC's for an integrated 2-32 GHz front-end of a spectrum analyzer. The architecture achieves miniaturization by eliminating the large YIG tracking filter and also achieves multi-octave measurement with less than one octave sweep of the first local oscillator. The MMIC's demonstrate ultra-wideband performance with reduced chip sizes by utilizing newly developed FET cells for power combination, multi-order frequency conversion, low leakage variable resistance, and active impedance translation. The MMIC's are a fundamental/harmonic frequency converter, a variable attenuator, a single-pole triple-throw switch, a single-pole double-throw switch, a distributed pre-amplifier, and an active LC lowpass filter. All the MMIC's are smaller than 1 mm², except the pre-amplifier and the filter.

  • Using Process Algebras for the Semantic Analysis of Data Flow Networks

    Cinzia BERNARDESCHI  Andrea BONDAVALLI  Luca SIMONCINI  

     
    PAPER-Computer Systems

    Vol: E78-D No:8  Page(s): 959-968

    Data flow is a paradigm for concurrent computations in which a collection of parallel processes communicate asynchronously. Many semantic models have been defined for nondeterministic data flow networks; however, it is complex to reason about the semantics of a network. In this paper, we introduce a transformation between data flow networks and the LOTOS specification language, so that theories and tools developed for process algebras become available for trace-based semantic analysis of the networks. The transformation does not establish a one-to-one mapping between the traces of a data flow network and those of the LOTOS specification, but maps each network into a specification that usually contains more traces. The resulting system specification has the same set of traces as the corresponding network if the traces are finite; otherwise, unfair traces are also included. Formal analysis and verification methods can still be applied to prove properties of the original data flow network and, for networks with finite traces, to prove network equivalence as well.

  • Dynamic Analysis of Uniplanar Guided-Wave Structures with Trapezoidal Conductor Profile and Microshielding Enclosure

    Tongqing WANG  Ke WU  

     
    PAPER

    Vol: E78-C No:8  Page(s): 1100-1105

    This work is concerned with a dynamic analysis of complex uniplanar guided-wave structures for MMICs at millimeter-wave frequencies. The enhanced spectral domain approach is used to model such uniplanar structures with trapezoidal conducting strips and microshielding enclosures. A wide range of line propagation and impedance characteristics is obtained for slotline and coplanar waveguide (CPW). The effect of different conductor profiles on line characteristics is discussed in detail. Results show excellent agreement with other works. A class of dispersion-related curves is presented for design consideration.

  • Higher Order Spectra Analysis of Nonstationary Harmonizable Random Processes

    Pavol ZAVARSKY  Nobuo FUJII  

     
    PAPER-Digital Signal Processing

    Vol: E78-A No:7  Page(s): 854-859

    In this correspondence, discrete Wigner higher order spectra (WHOS) of harmonizable random signals are addressed and their relations with polyspectra (HOS) are illustrated. It is shown that the discrete WHOS of a stationary random signal do not reduce to the aliased polyspectra in the way that the Wigner distribution (WD) reduces to the power spectrum of a random signal. The Wigner 2nd-order time-frequency distribution of deterministic signals and the 3rd-order spectrum of stationary signals are presented in modified forms that can be used to estimate the time-varying third-order spectrum of discrete nonstationary harmonizable random processes.

  • The Spread Spectrum Code Hopping System

    Takeshi ONIZAWA  Takaaki HASEGAWA  

     
    PAPER

    Vol: E78-A No:7  Page(s): 795-804

    In this paper, the spread spectrum code hopping (CH) system, which has some analogy to frequency hopping systems, is described. The CH system is robust to the code interference caused by restricting the kinds of PN matched filters (MFs). The mean acquisition time is shown by theoretical analysis and computer simulation. Acquisition rate results are obtained under single code interference, which seriously affects direct sequence systems, and for an asynchronous two-user channel. Moreover, using theoretical analysis and computer simulation, the bit error rate (BER) performance under single code interference is evaluated. It is shown that CH systems perform better than conventional ones under single code interference.

  • Performance of Spread Spectrum Medical Telemetry System in a Sharing Frequency Band with Current Telemetry System

    Masaki KYOSO  Toshiaki TAKANE  Akihiko UCHIYAMA  

     
    LETTER

    Vol: E78-B No:6  Page(s): 862-865

    To make medical telemetry systems more reliable in severe electromagnetic environments, we applied spread spectrum communication to ECG data transmission. Spread spectrum communication systems have shown performance superior to that of other systems, especially with respect to anti-jamming, which allows them to share the frequency band with current telemetry systems. In this study, we show the characteristics of a spread spectrum transmitter when it is used in the same frequency band as a narrow-band transmitter. The results show that the spread spectrum telemetry system can use the same frequency band permitted for medical telemetry systems.

  • Relationship among Recognition Rate, Rejection Rate and False Alarm Rate in a Spoken Word Recognition System

    Atsuhiko KAI  Seiichi NAKAGAWA  

     
    PAPER

    Vol: E78-D No:6  Page(s): 698-704

    Detection of an unknown word or non-vocabulary word uttered by the user is necessary in realizing a practical spoken language user-interface. This paper describes the evaluation of an unknown word processing method for a subword unit based spoken word recognizer. We have assessed the relationship between the word recognition accuracy of a system and the detection rate of unknown words both by simulation and by experiment of the unknown word processing method. We found that the resultant detection accuracies using the unknown word processing are significantly influenced by the original word recognition accuracy while the degree of such effect depends on the vocabulary size.

  • Error Analysis of Field Trial Results of a Spoken Dialogue System for Telecommunications Applications

    Shingo KUROIWA  Kazuya TAKEDA  Masaki NAITO  Naomi INOUE  Seiichi YAMAMOTO  

     
    PAPER

    Vol: E78-D No:6  Page(s): 636-641

    We carried out a one year field trial of a voice-activated automatic telephone exchange service at KDD Laboratories which has about 200 branch phones. This system has DSP-based continuous speech recognition hardware which can process incoming calls in real time using a vocabulary of 300 words. The recognition accuracy was found to be 92.5% for speech read from a written text under laboratory conditions independent of the speaker. In this paper, we describe the performance of the system obtained as a result of the field trial. Apart from recognition accuracy, there was about 20% error due to out-of-vocabulary input and incorrect detection of speech endpoints which had not been allowed for in the laboratory experiments. Also, we found that the recognition accuracy for actual speech was about 18% lower than for speech read from text even if there were no out-of-vocabulary words. In this paper, we examine error variations for individual data in order to try and pinpoint the cause of incorrect recognition. It was found from experiments on the collected data that the pause model used, filled pause grammar and differences of channel frequency response seriously affected recognition accuracy. With the help of simple techniques to overcome these problems, we finally obtained a recognition accuracy of 88.7% for real data.

  • A Speech Dialogue System with Multimodal Interface for Telephone Directory Assistance

    Osamu YOSHIOKA  Yasuhiro MINAMI  Kiyohiro SHIKANO  

     
    PAPER

    Vol: E78-D No:6  Page(s): 616-621

    This paper describes a multimodal dialogue system employing speech input. This system uses three input methods (through a speech recognizer, a mouse, and a keyboard) and two output methods (through a display and using sound). For the speech recognizer, an algorithm is employed for large-vocabulary speaker-independent continuous speech recognition based on the HMM-LR technique. This system is implemented for telephone directory assistance to evaluate the speech recognition algorithm and to investigate the variations in speech structure that users utter to computers. Speech input is used in a multimodal environment. The collecting of dialogue data between computers and users is also carried out. Twenty telephone-number retrieval tasks are used to evaluate this system. In the experiments, all the users are equally trained in using the dialogue system with an interactive guidance system implemented on a workstation. Simplified city maps that indicate subscriber names and addresses are used to reduce the implicit restrictions imposed by written sentences, thus allowing each user to develop his own forms of expression. The task completion rate is 99.0% and approximately 75% of the users say that they prefer this system to using a telephone book. Moreover, there is a significant decrease in nonkeyword usage, i.e., the usage of words other than names and addresses, for users who receive more utterance practice.

  • Automatic Determination of the Number of Mixture Components for Continuous HMMs Based on a Uniform Variance Criterion

    Tetsuo KOSAKA  Shigeki SAGAYAMA  

     
    PAPER

    Vol: E78-D No:6  Page(s): 642-647

    We discuss how to determine automatically the number of mixture components in continuous mixture density HMMs (CHMMs). A notable trend has been the use of CHMMs in recent years. One of the major problems with a CHMM is how to determine its structure, that is, how many mixture components and states it has and its optimal topology. The number of mixture components has been determined heuristically so far. To solve this problem, we first investigate the influence of the number of mixture components on model parameters and the output log likelihood value. As a result, in contrast to the "mixture number uniformity" which is applied in conventional approaches to determine the number of mixture components, we propose the principle of "distribution size uniformity". An algorithm is introduced for automatically determining the number of mixture components. The performance of this algorithm is shown through recognition experiments involving all Japanese phonemes. Two types of experiments are carried out. One assumes that the number of mixture components for each state is the same within a phonetic model but may vary between states belonging to different phonemes. The other assumes that each state has a variable number of mixture components. These two experiments give better results than the conventional method.

  • Duration Modeling with Decreased Intra-Group Temporal Variation for HMM-Based Phoneme Recognition

    Nobuaki MINEMATSU  Keikichi HIROSE  

     
    PAPER

    Vol: E78-D No:6  Page(s): 654-661

    A new clustering method was proposed to increase the effect of duration modeling on the HMM-based phoneme recognition. A precise observation on the temporal correspondences between a phoneme HMM with output probabilities by single Gaussian modeling and its training data indicated that there were two extreme cases, one with several types of correspondences in a phoneme class completely different from each other, and the other with only one type of correspondence. Although duration modeling was commonly used to incorporate the temporal information in the HMMs, a good modeling could not be obtained for the former case. Further observation for phoneme HMMs with output probabilities by Gaussian mixture modeling also showed that some HMMs still had multiple temporal correspondences, though the number of such phonemes was reduced as compared to the case of single Gaussian modeling. An appropriate duration modeling cannot be obtained for these phoneme HMMs by the conventional methods, where the duration distribution for each HMM state is represented by a distribution function. In order to cope with the problem, a new method was proposed which was based on the clustering of phoneme classes with plural types of temporal correspondences into sub-classes. The clustering was conducted so as to reduce the variations of the temporal correspondences in sub-classes. After the clustering, an HMM was constructed for each sub-class. Using the proposed method, speaker dependent recognition experiments were performed for phonemes segmented from isolated words. A few-percent increase was realized in the recognition rate, which was not obtained by another method based on the duration modeling with a Gaussian mixture.

  • A Study on Speaker Adaptation for Mandarin Syllable Recognition with Minimum Error Discriminative Training

    Chih-Heng LIN  Chien-Hsing WU  Pao-Chung CHANG  

     
    PAPER

    Vol: E78-D No:6  Page(s): 712-718

    This paper investigates a different method of speaker adaptation for Mandarin syllable recognition. Based on the minimum classification error (MCE) criterion, we use the generalized probabilistic descent (GPD) algorithm to adjust iteratively the parameters of the hidden Markov models (HMMs). The experiments on the multi-speaker Mandarin syllable database of Telecommunication Laboratories (T.L.) yield the following results: 1) Efficient speaker adaptation can be achieved through discriminative training using the MCE criterion and the GPD algorithm. 2) The computations required can be reduced through the use of the confusion sets in Mandarin base syllables. 3) For the discriminative training, the adjustment of the mean values of the Gaussian mixtures has the most prominent effect on speaker adaptation. 4) The discriminative training approach can be used to enhance the speaker adaptation capability of the maximum a posteriori (MAP) approach.

  • Speaker-Consistent Parsing for Speaker-Independent Continuous Speech Recognition

    Kouichi YAMAGUCHI  Harald SINGER  Shoichi MATSUNAGA  Shigeki SAGAYAMA  

     
    PAPER

    Vol: E78-D No:6  Page(s): 719-724

    This paper describes a novel speaker-independent speech recognition method, called "speaker-consistent parsing", which is based on an intra-speaker correlation called the speaker-consistency principle. We focus on the fact that a sentence or a string of words is uttered by an individual speaker even in a speaker-independent task. Thus, the proposed method searches through speaker variations in addition to the contents of utterances. As a result of the recognition process, an appropriate standard speaker is selected for speaker adaptation. This new method is experimentally compared with a conventional speaker-independent speech recognition method. Since the speaker-consistency principle best demonstrates its effect with a large number of training and test speakers, a small-scale experiment may not fully exploit this principle. Nevertheless, even the results of our small-scale experiment show that the new method significantly outperforms the conventional method. In addition, this framework's speaker selection mechanism can drastically reduce the likelihood map computation.

  • A Scheme for Word Detection in Continuous Speech Using Likelihood Scores of Segments Modified by Their Context Within a Word

    Sumio OHNO  Keikichi HIROSE  Hiroya FUJISAKI  

     
    PAPER

    Vol: E78-D No:6  Page(s): 725-731

    In conventional word-spotting methods for automatic recognition of continuous speech, individual frames or segments of the input speech are assigned labels and local likelihood scores solely on the basis of their own acoustic characteristics. On the other hand, experiments on human speech perception conducted by the present authors and others show that human perception of words in connected speech is based not only on the acoustic characteristics of individual segments, but also on the acoustic and linguistic contexts in which these segments occur. In other words, individual segments are not correctly perceived by humans unless they are accompanied by their context. These findings on the process of human speech perception have to be applied in automatic speech recognition in order to improve performance. From this point of view, the present paper proposes a new scheme for detecting words in continuous speech based on template matching, where the likelihood of each segment of a word is determined not only by its own characteristics but also by the likelihood of its context within the framework of a word. This is accomplished by modifying the likelihood score of each segment by the likelihood score of its phonetic context, the latter representing the degree of similarity of the context to that of a candidate word in the lexicon. Higher enhancement is given to the segmental likelihood score if the likelihood score of its context is higher. The advantage of the proposed scheme over conventional schemes is demonstrated by an experiment on constructing a word lattice using connected Japanese speech uttered by a male speaker. The result indicates that the scheme is especially effective in giving correct recognition in cases where there are two or more candidate words which are almost equal in raw segmental likelihood scores.

  • A Comparative Study of Output Probability Functions in HMMs

    Seiichi NAKAGAWA  Li ZHAO  Hideyuki SUZUKI  

     
    PAPER

    Vol: E78-D No:6  Page(s): 669-675

    One of the most effective methods in speech recognition is the HMM, which has been used to model speech statistically. The discrete distribution and the continuous distribution HMMs have been widely used in various applications. However, in recent years, HMMs with various output probability functions have been proposed to further improve recognition performance, e.g. the Gaussian mixture continuous and the semi-continuous distributed HMMs. We have also recently proposed the RBF (radial basis function)-based HMM and the VQ-distortion based HMM, which use an RBF function and a VQ-distortion measure at each state instead of the output probability density function used by traditional HMMs. In this paper, we describe the RBF-based HMM and the VQ-distortion based HMM and compare them with the discrete distributed, the Gaussian mixture distributed and the semi-continuous distributed HMMs on the basis of their speech recognition rates in experiments on speaker-independent spoken digit recognition. Our results confirmed that the RBF-based and VQ-distortion based HMMs are more robust and superior to traditional HMMs.

  • Characteristics of Multi-Layer Perceptron Models in Enhancing Degraded Speech

    Thanh Tung LE  John MASON  Tadashi KITAMURA  

     
    PAPER

    Vol: E78-D No:6  Page(s): 744-750

    A multi-layer perceptron (MLP) acting directly in the time-domain is applied as a speech signal enhancer, and its performance is examined in the context of three common classes of degradation, namely low bit-rate CELP coding (a non-linear system degradation), additive noise, and convolution by a linear system. The investigation focuses on two topics: (i) the influence of non-linearities within the network and (ii) network topology, comparing single and multiple output structures. The objective is to examine how these characteristics influence network performance and whether this depends on the class of degradation. Experimental results show the importance of matching the enhancer to the class of degradation. In the case of the CELP coder the standard MLP with its inherently non-linear characteristics is shown to be consistently better than any equivalent linear structure (up to 3.2 dB compared with 1.6 dB SNR improvement). In contrast, when the degradation is from additive noise, a linear enhancer is always superior.

  • Neural Predictive Hidden Markov Model for Speech Recognition

    Eiichi TSUBOKA  Yoshihiro TAKADA  

     
    PAPER

    Vol: E78-D No:6  Page(s): 676-684

    This paper describes new modeling methods that combine a neural network and a hidden Markov model and are applicable to modeling time series such as speech signals. The idea assumes that the sequence is nonstationary and is a nonlinear autoregressive process whose parameters are controlled by a hidden Markov chain. One is a model in which a non-linear predictor composed of a multi-layered neural network is defined at each state; the other is a model in which a multi-layered neural network is defined so that the path from the input layer to the output layer is divided into path-groups, each of which corresponds to a state of the Markov chain. The latter is an extension of the former. Parameter estimation methods for these models are shown, and other previously proposed models--one called the Neural Prediction Model and another called the Linear Predictive HMM--are shown to be special cases of the NPHMM proposed here. The experimental results support the validity of the proposed models.

  • Speech Recognition Using Function-Word N-Grams and Content-Word N-Grams

    Ryosuke ISOTANI  Shoichi MATSUNAGA  Shigeki SAGAYAMA  

     
    PAPER

    Vol: E78-D No:6  Page(s): 692-697

    This paper proposes a new stochastic language model for speech recognition based on function-word N-grams and content-word N-grams. The conventional word N-gram models are effective for speech recognition, but they represent only local constraints within a few successive words and lack the ability to capture global syntactic or semantic relationships between words. To represent more global constraints, the proposed language model gives the N-gram probabilities of word sequences, with attention given only to function words or to content words. The sequences of function words and of content words are expected to represent syntactic and semantic constraints, respectively. Probabilities of function-word bigrams and content-word bigrams were estimated from a 10,000-sentence text database, and analysis using information theoretic measure showed that expected constraints were extracted appropriately. As an application of this model to speech recognition, a post-processor was constructed to select the optimum sentence candidate from a phrase lattice obtained by a phrase recognition system. The phrase candidate sequence with the highest total acoustic and linguistic score was sought by dynamic programming. The results of experiments carried out on the utterances of 12 speakers showed that the proposed method is more accurate than a CFG-based method, thus demonstrating its effectiveness in improving speech recognition performance.
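
The idea of estimating bigrams separately over function words and over content words can be illustrated with a minimal sketch. This is not the authors' implementation: the function-word list, the toy English corpus, the unsmoothed maximum-likelihood estimate, and all function names below are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical, tiny function-word list; the paper's actual inventory is not given here.
FUNCTION_WORDS = {"the", "a", "of", "to", "in", "is", "on"}


def split_streams(sentence):
    """Split a tokenized sentence into function-word and content-word subsequences."""
    func = [w for w in sentence if w in FUNCTION_WORDS]
    cont = [w for w in sentence if w not in FUNCTION_WORDS]
    return func, cont


def bigram_probs(sequences):
    """Maximum-likelihood bigram probabilities P(w2 | w1), without smoothing (an assumption)."""
    pair_counts = Counter()
    left_counts = Counter()
    for seq in sequences:
        for w1, w2 in zip(seq, seq[1:]):
            pair_counts[(w1, w2)] += 1
            left_counts[w1] += 1
    return {pair: n / left_counts[pair[0]] for pair, n in pair_counts.items()}


def train(corpus):
    """Estimate one bigram table per stream from a non-empty list of tokenized sentences."""
    func_seqs, cont_seqs = zip(*(split_streams(s) for s in corpus))
    return bigram_probs(func_seqs), bigram_probs(cont_seqs)


# Toy corpus, invented for this example.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on a mat".split(),
]
func_bigrams, cont_bigrams = train(corpus)
```

In this toy corpus, `cont_bigrams[("sat", "mat")]` links two content words that are not adjacent in the original sentences, which is the kind of longer-range constraint the abstract attributes to the model. A practical system would need smoothing and a principled word classification.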
