  • Speaker-Consistent Parsing for Speaker-Independent Continuous Speech Recognition

    Kouichi YAMAGUCHI  Harald SINGER  Shoichi MATSUNAGA  Shigeki SAGAYAMA  


    E78-D No:6

    This paper describes a novel speaker-independent speech recognition method, called "speaker-consistent parsing", which is based on an intra-speaker correlation called the speaker-consistency principle. We focus on the fact that a sentence or a string of words is uttered by an individual speaker even in a speaker-independent task. Thus, the proposed method searches through speaker variations in addition to the contents of utterances. As a result of the recognition process, an appropriate standard speaker is selected for speaker adaptation. This new method is experimentally compared with a conventional speaker-independent speech recognition method. Since the speaker-consistency principle best demonstrates its effect with a large number of training and test speakers, a small-scale experiment may not fully exploit this principle. Nevertheless, even the results of our small-scale experiment show that the new method significantly outperforms the conventional method. In addition, this framework's speaker selection mechanism can drastically reduce the likelihood map computation.

  • A Scheme for Word Detection in Continuous Speech Using Likelihood Scores of Segments Modified by Their Context Within a Word

    Sumio OHNO  Keikichi HIROSE  Hiroya FUJISAKI  


    E78-D No:6

    In conventional word-spotting methods for automatic recognition of continuous speech, individual frames or segments of the input speech are assigned labels and local likelihood scores solely on the basis of their own acoustic characteristics. On the other hand, experiments on human speech perception conducted by the present authors and others show that human perception of words in connected speech is based not only on the acoustic characteristics of individual segments, but also on the acoustic and linguistic contexts in which these segments occur. In other words, individual segments are not correctly perceived by humans unless they are accompanied by their context. These findings on the process of human speech perception have to be applied in automatic speech recognition in order to improve the performance. From this point of view, the present paper proposes a new scheme for detecting words in continuous speech based on template matching where the likelihood of each segment of a word is determined not only by its own characteristics but also by the likelihood of its context within the framework of a word. This is accomplished by modifying the likelihood score of each segment by the likelihood score of its phonetic context, the latter representing the degree of similarity of the context to that of a candidate word in the lexicon. Higher enhancement is given to the segmental likelihood score if the likelihood score of its context is higher. The advantage of the proposed scheme over conventional schemes is demonstrated by an experiment on constructing a word lattice using connected speech of Japanese uttered by a male speaker. The result indicates that the scheme is especially effective in giving correct recognition in cases where there are two or more candidate words which are almost equal in raw segmental likelihood scores.

  • Neural Predictive Hidden Markov Model for Speech Recognition

    Eiichi TSUBOKA  Yoshihiro TAKADA  


    E78-D No:6

    This paper describes new modeling methods that combine a neural network with a hidden Markov model and are applicable to modeling a time series such as a speech signal. The idea assumes that the sequence is nonstationary and is a nonlinear autoregressive process whose parameters are controlled by a hidden Markov chain. One is the model where a non-linear predictor composed of a multi-layered neural network is defined at each state; the other is the model where a multi-layered neural network is defined so that the path from the input layer to the output layer is divided into path-groups, each of which corresponds to a state of the Markov chain. The latter is an extended model of the former. The parameter estimation methods for these models are shown, and other previously proposed models--one called Neural Prediction Model and another called Linear Predictive HMM--are shown to be special cases of the Neural Predictive HMM (NPHMM) proposed here. Experimental results support the validity of the proposed models.

  • Speech Recognition Using Function-Word N-Grams and Content-Word N-Grams

    Ryosuke ISOTANI  Shoichi MATSUNAGA  Shigeki SAGAYAMA  


    E78-D No:6

    This paper proposes a new stochastic language model for speech recognition based on function-word N-grams and content-word N-grams. The conventional word N-gram models are effective for speech recognition, but they represent only local constraints within a few successive words and lack the ability to capture global syntactic or semantic relationships between words. To represent more global constraints, the proposed language model gives the N-gram probabilities of word sequences, with attention given only to function words or to content words. The sequences of function words and of content words are expected to represent syntactic and semantic constraints, respectively. Probabilities of function-word bigrams and content-word bigrams were estimated from a 10,000-sentence text database, and analysis using information theoretic measure showed that expected constraints were extracted appropriately. As an application of this model to speech recognition, a post-processor was constructed to select the optimum sentence candidate from a phrase lattice obtained by a phrase recognition system. The phrase candidate sequence with the highest total acoustic and linguistic score was sought by dynamic programming. The results of experiments carried out on the utterances of 12 speakers showed that the proposed method is more accurate than a CFG-based method, thus demonstrating its effectiveness in improving speech recognition performance.
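
    The idea of scoring function words and content words as two separate streams can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function-word set, the probability tables, the flooring of unseen pairs, and all function names are hypothetical, and the abstract does not say how the two scores are combined (a plain sum of log-probabilities is assumed here).

```python
import math

# Hypothetical closed set of function words; a real system would use a tagged lexicon.
FUNCTION_WORDS = {"the", "of", "to", "in", "a", "is"}

def split_streams(words):
    """Split a word sequence into its function-word and content-word subsequences."""
    func = [w for w in words if w in FUNCTION_WORDS]
    cont = [w for w in words if w not in FUNCTION_WORDS]
    return func, cont

def bigram_logprob(seq, table, floor=1e-4):
    """Sum of log P(w_i | w_{i-1}) over one stream; unseen pairs get an assumed floor."""
    return sum(math.log(table.get((prev, cur), floor))
               for prev, cur in zip(seq, seq[1:]))

def linguistic_score(words, func_table, cont_table):
    """Assumed combination: add the log-probabilities of the two streams."""
    func, cont = split_streams(words)
    return bigram_logprob(func, func_table) + bigram_logprob(cont, cont_table)

# Toy probability tables (made up for illustration only).
func_table = {("the", "in"): 0.2, ("in", "the"): 0.5}
cont_table = {("cat", "sat"): 0.1, ("sat", "garden"): 0.05}

words = "the cat sat in the garden".split()
score = linguistic_score(words, func_table, cont_table)
```

    Note that the function-word bigram ("the", "in") links words that are three positions apart in the sentence, which is how the separated streams capture constraints beyond adjacent words.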

  • Error Analysis of Field Trial Results of a Spoken Dialogue System for Telecommunications Applications

    Shingo KUROIWA  Kazuya TAKEDA  Masaki NAITO  Naomi INOUE  Seiichi YAMAMOTO  


    E78-D No:6

    We carried out a one year field trial of a voice-activated automatic telephone exchange service at KDD Laboratories which has about 200 branch phones. This system has DSP-based continuous speech recognition hardware which can process incoming calls in real time using a vocabulary of 300 words. The recognition accuracy was found to be 92.5% for speech read from a written text under laboratory conditions independent of the speaker. In this paper, we describe the performance of the system obtained as a result of the field trial. Apart from recognition accuracy, there was about 20% error due to out-of-vocabulary input and incorrect detection of speech endpoints which had not been allowed for in the laboratory experiments. Also, we found that the recognition accuracy for actual speech was about 18% lower than for speech read from text even if there were no out-of-vocabulary words. In this paper, we examine error variations for individual data in order to try and pinpoint the cause of incorrect recognition. It was found from experiments on the collected data that the pause model used, filled pause grammar and differences of channel frequency response seriously affected recognition accuracy. With the help of simple techniques to overcome these problems, we finally obtained a recognition accuracy of 88.7% for real data.

  • Characteristics of Multi-Layer Perceptron Models in Enhancing Degraded Speech

    Thanh Tung LE  John MASON  Tadashi KITAMURA  


    E78-D No:6

    A multi-layer perceptron (MLP) acting directly in the time-domain is applied as a speech signal enhancer, and the performance examined in the context of three common classes of degradation, namely non-linear system degradation from a low bit-rate CELP coder, additive noise, and convolution by a linear system. The investigation focuses on two topics: (i) the influence of non-linearities within the network and (ii) network topology, comparing single and multiple output structures. The objective is to examine how these characteristics influence network performance and whether this depends on the class of degradation. Experimental results show the importance of matching the enhancer to the class of degradation. In the case of the CELP coder the standard MLP with its inherently non-linear characteristics is shown to be consistently better than any equivalent linear structure (up to 3.2 dB compared with 1.6 dB SNR improvement). In contrast, when the degradation is from additive noise, a linear enhancer is always superior.

  • 4 kbps Improved Pitch Prediction CELP Speech Coding with 20 msec Frame

    Masahiro SERIZAWA  Kazunori OZAWA  


    E78-D No:6

    This paper proposes a new pitch prediction method for 4 kbps CELP (Code Excited LPC) speech coding with 20 msec frame, for the future ITU-T 4 kbps speech coding standardization. In the conventional CELP speech coding, synthetic speech quality deteriorates rapidly at 4 kbps, especially for female and children's speech with short pitch period. The pitch prediction performance is significantly degraded for such speech. The important reason is that when the pitch period is shorter than the subframe length, the simple repetition of the past excitation signal based on the estimated lag, not the pitch prediction, is usually carried out in the adaptive codebook operation. The proposed pitch prediction method can carry out the pitch prediction without the above approximation by utilizing the current subframe excitation codevector signal, when the pitch prediction parameters are determined. To further improve the performance, a split vector synthesis and perceptually spectral weighting method, and a low-complexity perceptually harmonic and spectral weighting method have also been developed. The informal listening test result shows that the 4 kbps speech coder with 20 msec frame, utilizing all of the proposed improvements, achieves 0.2 MOS higher results than the coder without them.

  • Coding for Multi-Pulse PPM with Imperfect Slot Synchronization in Optical Direct-Detection Channels

    Kazumi SATO  Tomoaki OHTSUKI  Iwao SASASE  

    PAPER-Optical Communication

    E78-B No:6

    The performance of coded multi-pulse pulse position modulation (MPPM) consisting of m slots and 2 pulses, denoted as (m, 2) MPPM, with imperfect slot synchronization is analyzed. Convolutional codes and Reed-Solomon (RS) codes are employed for (m, 2) MPPM, and the bit error probability of coded (m, 2) MPPM in the presence of the timing offset is derived. In each coded (m, 2) MPPM, we compare the performance of some different code rate systems. Moreover, we compare the performance of both systems at the same information bit rate. It is shown that in both coded systems, the performance of code rate-1/2 coded (m, 2) MPPM is the best when the timing offset is small. When the timing offset is somewhat large, however, uncoded (m, 2) MPPM is shown to perform better than coded (m, 2) MPPM. Further, convolutional coded (m, 2) MPPM with the constraint length k = 7 is shown to perform better than RS coded (m, 2) MPPM for the same code rate.
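
    As a small worked example of the (m, 2) MPPM signal set: placing 2 pulses in m slots gives C(m, 2) distinct symbols. The sketch below counts them and, under the assumption (not stated in the abstract) that only a power-of-two subset of symbols is mapped to bits, reports the whole number of bits carried per symbol. The function names are illustrative.

```python
import math

def mppm_symbol_count(m, pulses=2):
    """Number of distinct ways to place `pulses` pulses in `m` slots."""
    return math.comb(m, pulses)

def mppm_bits_per_symbol(m, pulses=2):
    """Whole bits per symbol, assuming only a power-of-two subset of symbols is used."""
    # bit_length() - 1 is an exact integer floor(log2(n)) for n >= 1.
    return mppm_symbol_count(m, pulses).bit_length() - 1

for m in (8, 12):
    print(m, mppm_symbol_count(m), mppm_bits_per_symbol(m))
```

    For instance, m = 12 gives 66 symbols, of which 64 can carry 6 bits each under this assumption.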

  • Automatic Language Identification Using Sequential Information of Phonemes

    Takayuki ARAI  


    E78-D No:6

    In this paper approaches to language identification based on the sequential information of phonemes are described. These approaches assume that each language can be identified from its own phoneme structure, or phonotactics. To extract this phoneme structure, we use phoneme classifiers and grammars for each language. The phoneme classifier for each language is implemented as a multi-layer perceptron trained on quasi-phonetic hand-labeled transcriptions. After training the phoneme classifiers, the grammars for each language are calculated as a set of transition probabilities for each phoneme pair. Because of the interest in automatic language identification for worldwide voice communication, we decided to use telephone speech for this study. The data for this study were drawn from the OGI (Oregon Graduate Institute)-TS (telephone speech) corpus, a standard corpus for this type of research. To investigate the basic issues of this approach, two languages, Japanese and English, were selected. The language classification algorithms are based on Viterbi search constrained by a bigram grammar and by minimum and maximum durations. Using a phoneme classifier trained only on English phonemes, we achieved 81.1% accuracy. We achieved 79.3% accuracy using a phoneme classifier trained on Japanese phonemes. Using both the English and the Japanese phoneme classifiers together, we obtained our best result: 83.3%. Our results were comparable to those obtained by other methods such as that based on the hidden Markov model.
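
    The phonotactic part of this approach, a per-language grammar of transition probabilities over phoneme pairs, can be sketched as below. This is a simplified illustration under stated assumptions: it scores an already-decoded symbol sequence rather than running a duration-constrained Viterbi search over classifier outputs, it uses add-one smoothing (not mentioned in the abstract), and the three-symbol alphabet (C, V, N) and the training strings are toy data invented for the example.

```python
import math
from collections import Counter

def train_bigram(sequences, alphabet, alpha=1.0):
    """Estimate P(cur | prev) for every symbol pair with additive smoothing (assumed)."""
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    size = len(alphabet)
    return {(p, c): (pair_counts[(p, c)] + alpha) / (prev_counts[p] + alpha * size)
            for p in alphabet for c in alphabet}

def log_score(seq, grammar):
    """Log-probability of the transitions in `seq` under one language's grammar."""
    return sum(math.log(grammar[(p, c)]) for p, c in zip(seq, seq[1:]))

def identify(seq, grammars):
    """Pick the language whose grammar gives the highest transition score."""
    return max(grammars, key=lambda lang: log_score(seq, grammars[lang]))

# Toy broad-class alphabet and training data (hypothetical, for illustration only).
ALPHABET = ["C", "V", "N"]
grammars = {
    "japanese": train_bigram(["CVCVCV", "CVNCV", "VCVCV"], ALPHABET),
    "english": train_bigram(["CCVCC", "CVCC", "CCVC"], ALPHABET),
}

guess_open_syllables = identify("CVCVCV", grammars)
guess_clusters = identify("CCVCC", grammars)
```

    In the toy data the open-syllable pattern scores higher under the "japanese" grammar and the consonant-cluster pattern under the "english" grammar, which is the kind of phonotactic contrast the approach relies on.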

  • An Objective Measure Based on an Auditory Model for Assessing Low-Rate Coded Speech

    Toshiro WATANABE  Shinji HAYASHI  


    E78-D No:6

    We propose an objective measure for assessing low-rate coded speech. The model for this objective measure, in which several known features of the perceptual processing of speech sounds by the human ear are emulated, is based on the Hertz-to-Bark transformation, critical-band filtering with preemphasis to boost higher frequencies, nonlinear conversion for subjective loudness, and temporal (forward) masking. The effectiveness of the measure, called the Bark spectral distortion rating (BSDR), was validated by second-order polynomial regression analysis between the computed BSDR values and subjective MOS ratings obtained for a large number of utterances coded by several versions of CELP coders and one VSELP coder under three degradation conditions: input speech levels, transmission error rates, and background noise levels. The BSDR values correspond better to MOS ratings than several commonly used measures. Thus, BSDR can be used to accurately predict subjective scores.
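
    The first stage named above, the Hertz-to-Bark transformation, can be illustrated with one widely used closed-form approximation (Zwicker and Terhardt). The abstract does not say which formula the authors used, so the choice here is an assumption made purely for illustration.

```python
import math

def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of the critical-band rate (Bark) for f in Hz.

    Assumed for illustration; the paper may use a different Hertz-to-Bark mapping.
    """
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

# Map a few telephone-band frequencies onto the Bark scale.
for f in (100, 500, 1000, 2000, 3400):
    print(f, round(hz_to_bark(f), 2))
```

    The mapping is roughly linear below about 500 Hz and compressive above it, so equal steps in Bark cover progressively wider bands in Hertz.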

  • An HMM State Duration Control Algorithm Applied to Large-Vocabulary Spontaneous Speech Recognition

    Satoshi TAKAHASHI  Yasuhiro MINAMI  Kiyohiro SHIKANO  


    E78-D No:6

    Although Hidden Markov Modeling (HMM) is widely and successfully used in many speech recognition applications, duration control for HMMs is still an important issue in improving recognition accuracy since an HMM places no constraints on duration. To compensate for this defect, some duration control algorithms that employ precise duration models have been proposed. However, they suffer from greatly increased computational complexity. This paper proposes a new state duration control algorithm for limiting both the maximum and the minimum state durations. The algorithm is for the HMM trellis likelihood calculation, not for the Viterbi calculation. The amount of computation required by this algorithm is only order one (O(1)) for the maximum state duration n; that is, the computation amount is independent of the maximum state duration, whereas many conventional duration control algorithms require computation of order n or order n². Thus, the algorithm can drastically reduce the computation needed for duration control. The algorithm uses the property that the trellis likelihood calculation is a summation of many path likelihoods. At each frame, the path likelihood that exceeds the maximum likelihood is subtracted, and the path likelihood that satisfies the minimum likelihood is added to the forward probability. By iterating this procedure, the algorithm calculates the trellis likelihood efficiently. The algorithm was evaluated using a large-vocabulary speaker-independent spontaneous speech recognition system for telephone directory assistance. The average reduction in error rate for sentence understanding was about 7% when using context-independent HMMs, and 3% when using context-dependent HMMs. We could confirm the improvement by using the proposed state duration control algorithm even though the maximum and the minimum state durations were not optimized for the task (speaker-independent duration settings obtained from a different task were used).
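
    The add-and-subtract idea can be illustrated for a single state. The sketch below is a reconstruction under assumptions, not the authors' algorithm: it reads the per-frame update as "subtract the one path whose stay has just exceeded the maximum duration, add the one path whose stay has just reached the minimum duration", it works with plain probabilities (a practical system would need scaling or log-domain arithmetic), and the entry masses `e`, emission likelihoods `b`, and self-loop probability `a` are made-up inputs. A brute-force sum over all entry frames is included to check the constant-work update.

```python
import math

def path(e, b, W, s, t):
    """Likelihood of the single path entering the state at frame s and staying through t."""
    # W[t] / W[s] equals the product of a * b[u] for u = s+1 .. t.
    return e[s] * b[s] * W[t] / W[s]

def duration_limited_mass_fast(e, b, a, dmin, dmax):
    """Per-frame mass of paths whose stay length d satisfies dmin <= d <= dmax.

    Each frame does a fixed number of operations, independent of dmax.
    """
    W, acc = [], 1.0
    for bt in b:                      # cumulative product of a * b[u]
        acc *= a * bt
        W.append(acc)
    out, prev = [], 0.0
    for t in range(len(b)):
        cur = prev * a * b[t]         # extend every surviving path by one frame
        s_add = t - dmin + 1          # path that has just reached the minimum duration
        if s_add >= 0:
            cur += path(e, b, W, s_add, t)
        s_sub = t - dmax              # path that has just exceeded the maximum duration
        if s_sub >= 0:
            cur -= path(e, b, W, s_sub, t)
        out.append(cur)
        prev = cur
    return out

def duration_limited_mass_brute(e, b, a, dmin, dmax):
    """Reference: explicit sum over every entry frame."""
    out = []
    for t in range(len(b)):
        total = 0.0
        for s in range(t + 1):
            if dmin <= t - s + 1 <= dmax:
                lik = e[s] * b[s]
                for u in range(s + 1, t + 1):
                    lik *= a * b[u]
                total += lik
        out.append(total)
    return out

# Toy inputs (hypothetical): entry mass and emission likelihood per frame.
e = [1.0, 0.0, 0.3, 0.0, 0.2, 0.0, 0.0, 0.1]
b = [0.5, 0.8, 0.6, 0.9, 0.4, 0.7, 0.3, 0.5]
fast = duration_limited_mass_fast(e, b, a=0.6, dmin=2, dmax=4)
brute = duration_limited_mass_brute(e, b, a=0.6, dmin=2, dmax=4)
```

    The two functions agree up to floating-point rounding on the toy inputs, while the first one does the same amount of work per frame regardless of the duration limits.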

  • Development of Liquid Helium-Free Superconducting Magnet

    Junji SAKURABA  Mamoru ISHIHARA  Seiji YASUHARA  Kazunori JIKIHARA  Keiichi WATAZAWA  Tsuginori HASEBE  Chin Kung CHONG  Yutaka YAMADA  Kazuo WATANABE  

    INVITED PAPER-Applications of small-size high field superconducting magnet

    E78-C No:5

    Cryocooler-cooled superconducting magnets using Bismuth-based high-Tc current leads have been successfully demonstrated. The magnets mainly consisted of a superconducting coil, current leads and a radiation shield, which are cooled by a two-stage Gifford-McMahon cryocooler without using liquid helium. Our first liquid helium-free 4.6 T (Nb, Ti)3Sn superconducting magnet with a room temperature bore of 38 mm operated at 11 K has recorded continuous operation at 3.7 T for 1,200 hours and a total cooling time of over 10,000 hours without trouble. As a next step, we constructed a (Nb, Ti)3Sn liquid helium-free superconducting magnet with a wider room temperature bore of 60 mm. The coil temperature reached 8.3 K in 37 hours after starting the cryocooler. The magnet generated 5.0 T at the center of the 60 mm room temperature bore at an operating current of 140 A. Operation at a field of 5 T was confirmed to be stable even when the cryocooler was stopped for 4 minutes. These results show that liquid helium-free superconducting magnets can provide excellent performance for new applications of superconducting magnets.

  • Identifying Strategies Using Decision Lists from Trace Information

    Satoshi KOBAYASHI  

    PAPER-Machine Learning and Its Applications

    E78-D No:5

    This paper concerns the issue of learning strategies for problem solvers from trace data. Many works on Explanation Based Learning have proposed methods for speeding up a given problem solver (or a Prolog program) by optimizing it on some subspace of problem instances with a high probability of occurrence. However, in the current paper, we discuss the issue of identifying a target strategy exactly from trace data. The learning criterion used in this paper is identification in the limit, as proposed by Gold. Further, we use the tree pattern language to represent preconditions of operators, and propose a class of strategies, called decision list strategies. One of the interesting features of our learning algorithm is the coupled use of state and operator sequence information of traces. Theoretically, we show that the proposed algorithm identifies some subclass of decision list strategies in the limit with the conjectures updated in polynomial time. Further, an experimental result on the N-puzzle domain is presented.

  • Phenomenological Description of Temperature and Frequency Dependence of Surface Resistance of High-Tc Superconductors by Improved Three-Fluid Model

    Tadashi IMAI  Yoshio KOBAYASHI  

    PAPER-Microwave devices

    E78-C No:5

    A calculation method based on the improved three-fluid model is presented to describe phenomenologically the temperature and frequency dependence of the surface resistance Rs of high-Tc superconductors. It is verified that this model is useful for describing the temperature dependence of Rs for such high-Tc superconducting films as Y-Ba-Cu-O (YBCO), Eu-Ba-Cu-O, and Tl-Ba-Ca-Cu-O films. Furthermore, for the frequency dependence of Rs of a YBCO bulk, the measured results, which do not follow an f² dependence in the frequency range 10-25 GHz, can be described successfully by this model. Finally, a figure of merit is proposed to evaluate material quality for high-Tc superconductors from the values of electron densities and momentum relaxation time determined by the present model.

  • All-Optical Timing Clock Extraction Using Multiple Wavelength Pumped Brillouin Amplifier

    Hiroto KAWAKAMI  Yutaka MIYAMOTO  Tomoyoshi KATAOKA  Kazuo HAGIMOTO  


    E78-B No:5

    This paper discusses an all-optical tank circuit that uses the comb-shaped gain spectrum generated by a Brillouin amplifier. The theory of timing clock extraction is shown for two cases: with two gains and with three gains. In both cases, the waveform of the extracted timing clock is simulated. According to the simulation, unlike an ordinary tank circuit, the amplitude of the extracted clock is not constant even though the quality factor (Q) is infinite. The extracted clock is clearly influenced by the pattern of the original data stream if the Brillouin gain is finite. The ratio of the maximum extracted clock amplitude to the minimum extracted amplitude is calculated as a function of Brillouin gain. The detuning of the pump light frequency is also discussed. It induces not only changes in the Brillouin gain, but also phase shift in the amplified light. The relation between the frequency drift of the pump lights and the jitter of the extracted timing clock is shown for both cases, with two pump lights and with three pump lights. It is numerically shown that when all the pump lights have the same frequency drift, i.e., their frequency separation is constant, the phase of the extracted clock is not influenced by the frequency drift of the pump lights. The operation principle is demonstrated at 5 Gbit/s, 2.5 Gbit/s, and 2 Gbit/s using two pumping techniques. The quality factor and the suppression ratio in the baseband domain are measured. Q and the suppression ratio are found to be 160 and 28 dB, respectively.

  • High-Tc Superconducting Quantum Interference Device with Additional Positive Feedback

    Akira ADACHI  Ken'ichi OKAJIMA  Youichi TAKADA  Saburo TANAKA  Hideo ITOZAKI  Haruhisa TOYODA  Hisashi KADO  

    PAPER-SQUID sensor and multi-channel SQUID system

    E78-C No:5

    This study shows that using the direct offset integration technique (DOIT) and additional positive feedback (APF) in a high-Tc dc superconducting quantum interference device (SQUID) improves the effective flux-to-voltage transfer function and reduces the flux noise of a magnetometer, thus improving the magnetic field noise. The effective flux-to-voltage transfer function and the flux noise with APF were measured at different values of the positive feedback parameter βa, which depends on the resistance of the APF circuit. These quantities were also compared between conditions with and without APF. This investigation showed that there is an optimal value of βa for minimizing the flux noise of a magnetometer with APF, namely βa = 0.77. The effective flux-to-voltage transfer function with APF is about three times what it is without APF (93 µV/Φ0 vs. 32 µV/Φ0). The magnetic field noise of a magnetometer with APF is improved by a factor of about 3 (242 fT/√Hz vs. 738 fT/√Hz).

  • Characteristics of High-Tc Superconducting Flux Flow Transistors

    Kazunori MIYAHARA  Koji TSURU  Shugo KUBO  Minoru SUZUKI  

    INVITED PAPER-Three terminal devices and Josephson Junctions

    E78-C No:5

    High-Tc superconducting flux flow transistors were fabricated with co-evaporated thin films of YBaCuO. The vortex flow channels (2 µm in width) and the device patterns were formed by Ar ion milling. The three-terminal characteristics, vortex flow characteristics, transresistance, and current gain of the device were measured. The AC input-output characteristics of the device with an Au load resistor were also measured. The measured flow voltage, transresistance and current gain are discussed in relation to these AC input-output measurements.

  • Passive Sonar-Ranging System Based on Adaptive Filter Technique

    Chang-Yu SUN  Qi-Hu LI  Takashi SOMA  

    PAPER-Digital Signal Processing

    E78-A No:5

    This paper introduces a noise-cancelling sonar-ranging system based on the adaptive filtering technique, which can automatically adapt itself to changes in the environmental noise field and improve passive sonar-ranging/goniometric precision. The software and hardware design principles of the system, which uses high-speed VLSI (Very Large Scale Integration) DSP (Digital Signal Processing) chips, and practical test results are also presented. In comparison with the traditional ranging system, the system not only clearly improves ranging precision but also offers a simple structure, rapid operation, large data-storage volume, easy programming, and high reliability.

  • High-Speed and Low-Power n+-p+ Double-Gate SOI CMOS

    Kunihiro SUZUKI  Tetsu TANAKA  Yoshiharu TOSAKA  Hiroshi HORIE  Toshihiro SUGII  

    PAPER-Device Technology

    E78-C No:4

    We propose and fabricate n+-p+ double-gate SOI MOSFETs for which threshold voltage is controlled by interaction between the two gates. Devices have excellent short channel immunity, despite a low channel doping concentration of 10¹⁵ cm⁻³, and enable us to design a threshold voltage below 0.3 V while maintaining an almost ideal subthreshold swing. We demonstrated 27 ps CMOS inverter delay with a gate length of 0.19 µm, which is, to our knowledge, the lowest delay for this gate length despite a rather thick 9 nm gate oxide. This high performance is a result of the low threshold voltage and negligible drain capacitance. We also showed theoretically that we can design a 0.1 µm gate length device with an ideal subthreshold swing, and that we can expect less than 10 ps inverter delay at a supply voltage of 1 V.

  • A New Approach of Parsing and Search Based on the Divide and Conquer Strategy for Continuous Speech Recognition

    Ming-Sheng WANG  Satoshi IMAI  

    PAPER-Speech Processing and Acoustics

    E78-D No:4

    In this paper, we report a new approach to the parsing and search problem for a given phonetic lattice. The approach is based on the Divide and Conquer (DC) strategy. By dividing the phonetic lattice, we first construct a PD-tree to represent this lattice; we then parse through this PD-tree to identify the possible sentence that is supposed to be the speech utterance. Next, we propose a new search scheme, called the Downward Request (DR) search model, to decrease the computation costs; this search model gives the optimal or N-best solutions. Experiments performed on Chinese speech recognition show good results.
