Author Search Result

[Author] Seiichi YAMAMOTO(13hit)

1-13hit
  • Phoneme Set Design for Speech Recognition of English by Japanese

    Xiaoyun WANG  Jinsong ZHANG  Masafumi NISHIDA  Seiichi YAMAMOTO  

     
    PAPER-Speech and Hearing

      Publicized:
    2014/10/01
      Vol:
    E98-D No:1
      Page(s):
    148-156

    This paper describes a novel method to improve the performance of second language speech recognition when the mother tongue of users is known. Considering that second language speech usually includes less fluent pronunciation and more frequent pronunciation mistakes, the authors propose using a reduced phoneme set generated by a phonetic decision tree (PDT)-based top-down sequential splitting method instead of the canonical one of the second language. The authors verify the efficacy of the proposed method using second language speech collected with a translation game type dialogue-based English CALL system. Experiments show that a speech recognizer achieved higher recognition accuracy with the reduced phoneme set than with the canonical phoneme set.

  • An Objective Method for Evaluating Speech Translation System: Using a Second Language Learner's Corpus

    Keiji YASUDA  Fumiaki SUGAYA  Toshiyuki TAKEZAWA  Genichiro KIKUI  Seiichi YAMAMOTO  Masuzo YANAGIDA  

     
    PAPER-Speech Corpora and Related Topics

      Vol:
    E88-D No:3
      Page(s):
    569-577

    In this paper we propose an objective method for assessing the capability of a speech translation system. It automates the translation paired comparison method proposed by Sugaya et al., which expresses a system's capability as a simple, easy-to-understand TOEIC score, to succinctly evaluate a speech translation system. To avoid the high cost of the original method, which requires large manual effort, the new method automates the procedure by employing an objective metric such as BLEU or a DP-based measure. The evaluation results obtained by the proposed method are similar to those of the original method. The proposed method is also used to evaluate the usefulness of a speech translation system, and our speech translation system is found to be useful in general, even to users with a higher TOEIC score than the system's.

  • An Adaptive Echo Canceller with Variable Step Gain Method

    Seiichi YAMAMOTO  Seishi KITAYAMA  

     
    PAPER-Transmission Systems

      Vol:
    E65-E No:1
      Page(s):
    1-8

    As a means of improving the rate of convergence of the conventional echo canceller using the learning identification method, the authors have previously proposed a linear predictive algorithm. This algorithm shows better convergence than the learning identification method. However, in this algorithm, as well as in the learning identification method, a compromise is necessary between the relatively large step gain required for fast convergence and the relatively small step gain needed for insensitivity to noise. In this paper a new algorithm based on the linear predictive algorithm is proposed, in which the step gain is determined as a function of the estimated values of noise and the parameters-error of the echo path model in order to improve both the rate of convergence and the noise insensitivity simultaneously. The efficiency of the proposed algorithm is examined by computer simulations. It has been shown that the proposed algorithm gives about twice the rate of convergence and about 10 dB lower parameters-error in the stationary state in comparison with the learning identification method. In addition, it has been proved that this algorithm guarantees non-divergence of the echo path model even during the period of "double-talking" without any control device such as a double-talking detector.
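    The step-gain compromise described in this abstract can be illustrated with a small simulation. The sketch below assumes that the "learning identification method" corresponds to what is now usually called the normalized LMS (NLMS) algorithm; it is not the authors' variable step gain algorithm. The echo path, signal model, and the names `nlms_echo_canceller` and `simulate` are hypothetical choices made only for illustration.

```python
import random

def nlms_echo_canceller(far_end, mic, num_taps=4, step_gain=0.5, eps=1e-8):
    """Adapt an FIR echo-path model with NLMS; return (weights, residual)."""
    w = [0.0] * num_taps
    residual = []
    for n in range(len(far_end)):
        # Most recent num_taps far-end samples, newest first, zero-padded at the start.
        x = [far_end[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        echo_replica = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - echo_replica              # residual echo (plus any noise)
        norm = sum(xk * xk for xk in x) + eps  # input energy normalisation
        w = [wk + step_gain * e * xk / norm for wk, xk in zip(w, x)]
        residual.append(e)
    return w, residual

def simulate(step_gain, noise_std=0.0, seed=0, n=2000):
    """Return the squared parameter error after n samples (hypothetical setup)."""
    rng = random.Random(seed)
    echo_path = [0.5, -0.3, 0.2, 0.1]          # assumed echo path, illustration only
    far_end = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mic = []
    for i in range(n):
        echo = sum(h * (far_end[i - k] if i - k >= 0 else 0.0)
                   for k, h in enumerate(echo_path))
        mic.append(echo + rng.gauss(0.0, noise_std))
    w, _ = nlms_echo_canceller(far_end, mic, num_taps=4, step_gain=step_gain)
    return sum((wk - hk) ** 2 for wk, hk in zip(w, echo_path))
```

    Calling `simulate(1.0, noise_std=0.1)` and `simulate(0.05, noise_std=0.1)` and comparing the results over several seeds gives a rough feel for the trade-off: a larger step gain adapts faster but typically leaves a larger parameter error when noise is present.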

  • Error Analysis of Field Trial Results of a Spoken Dialogue System for Telecommunications Applications

    Shingo KUROIWA  Kazuya TAKEDA  Masaki NAITO  Naomi INOUE  Seiichi YAMAMOTO  

     
    PAPER

      Vol:
    E78-D No:6
      Page(s):
    636-641

    We carried out a one-year field trial of a voice-activated automatic telephone exchange service at KDD Laboratories, which has about 200 branch phones. This system has DSP-based continuous speech recognition hardware which can process incoming calls in real time using a vocabulary of 300 words. The recognition accuracy was found to be 92.5% for speech read from a written text under laboratory conditions independent of the speaker. In this paper, we describe the performance of the system obtained as a result of the field trial. Apart from recognition accuracy, there was about 20% error due to out-of-vocabulary input and incorrect detection of speech endpoints, which had not been allowed for in the laboratory experiments. Also, we found that the recognition accuracy for actual speech was about 18% lower than for speech read from text even if there were no out-of-vocabulary words. We examine error variations for individual data in order to pinpoint the cause of incorrect recognition. It was found from experiments on the collected data that the pause model used, the filled pause grammar and differences in channel frequency response seriously affected recognition accuracy. With the help of simple techniques to overcome these problems, we finally obtained a recognition accuracy of 88.7% for real data.

  • Internet Metronome: An Experimental Remote Jazz Jam Session with Uncompressed HDTV Transmission over Lightpaths Open Access

    Osamu NAKAMURA  Kazunori SUGIURA  Seiichi YAMAMOTO  Noriyuki SHIGECHIKA  Akira KATO  Katsuyuki HASEBE  Jun MURAI  

     
    INVITED PAPER

      Vol:
    E89-B No:4
      Page(s):
    1052-1058

    An experimental remote jazz jam session with uncompressed HDTV over the Internet was conducted on September 21st as a Grand Final event of the Aichi Exposition 2005. Professional jazz musicians located at the venue of the Aichi Exposition and at SARA in Amsterdam held the jam session with new mechanisms called "Internet Metronome" and "delay-control unit" using an international "lightpath." This was the first music collaboration using a new methodology and one of the challenging demonstrations of transporting uncompressed HDTV streams with timing control under current software and hardware architectures. "Internet Metronome" and "delay-control unit" made it possible to set a tempo by using and controlling delay, and the "lightpath" minimized network jitter. Using these new mechanisms and technology, the musicians could play in a new music collaboration environment over the Internet despite long communication delay, and enjoyed the remote jazz jam session at both ends.

  • Utterance Intent Classification for Spoken Dialogue System with Data-Driven Untying of Recursive Autoencoders Open Access

    Tsuneo KATO  Atsushi NAGAI  Naoki NODA  Jianming WU  Seiichi YAMAMOTO  

     
    PAPER-Natural Language Processing

      Publicized:
    2019/03/04
      Vol:
    E102-D No:6
      Page(s):
    1197-1205

    Data-driven untying of a recursive autoencoder (RAE) is proposed for utterance intent classification for spoken dialogue systems. Although an RAE expresses a nonlinear operation on two neighboring child nodes in a parse tree in the application of spoken language understanding (SLU) of spoken dialogue systems, the nonlinear operation is considered to be intrinsically different depending on the types of child nodes. To reduce the gap between the single nonlinear operation of an RAE and intrinsically different operations depending on the node types, a data-driven untying of autoencoders using part-of-speech (PoS) tags at leaf nodes is proposed. Experimental results on two corpora, the ATIS English data set and a Japanese data set of a smartphone-based spoken dialogue system, showed that the proposed method improved accuracy over the tied RAE and revealed a reasonable difference in untying between the two languages.

  • A Study of Aspect Calculus

    Kazuo HASHIMOTO  Tohru ASAMI  Seiichi YAMAMOTO  

     
    PAPER-Foundations of Artificial Intelligence and Knowledge Processing

      Vol:
    E75-A No:3
      Page(s):
    436-450

    Since Vendler classified aspect into four categories, state, achievement, activity, and accomplishment, much effort has been made to define the notion of aspect logically. It is commonly agreed that aspect represents the general temporal characteristics of events and states. However, there still remains a considerable amount of disagreement about its formal treatment. One of the major problems is that the aspect of a sentence shifts under certain types of sentence construction. For instance, adding time adverbials to a sentence modifies the original aspect, taking the progressive form of the verb changes the aspect, and so on. These phenomena are known as aspect shifts. The other is the problem known as the imperfective paradox, a problem of the truth definition of progressives. The truth condition of the progressive form of a sentence is defined at an internal subinterval of the temporal range of the corresponding non-progressive sentence. If the truth condition of the progressive form of the sentence is defined using the truth condition of the non-progressive form of the sentence, there are logical contradictions of truth definition in a sentence such as "Max was building a house, but he never built it". These problems cause much confusion (1) in the truth definition of aspects, (2) in the definition of aspect operations, such as initiative, terminative, progressive, perfective, etc., and also (3) in the definition of adding time adverbials. This paper reviews the semantic problems with respect to aspect, and presents a consistent mechanism of aspect interpretation in order to settle all these semantic puzzles at once. For the sake of logical clarity, we construct a formal language, Lt, where every meaningful formula is a pair of a meaningful sentence and its aspect. The syntax of Lt describes the phenomenology of aspect shifts. The semantics of Lt defines a temporal interpretation for all the meaningful sentences of Lt, assuming the temporal interpretations of three inherent aspects: state, achievement, and activity. The proposed aspect interpretation gives a reasonable account of aspect shifts, and solves the imperfective paradox by assuming the time structure to be backwards linear.

  • Classification of Utterances Based on Multiple BLEU Scores for Translation-Game-Type CALL Systems

    Reiko KUWA  Tsuneo KATO  Seiichi YAMAMOTO  

     
    PAPER-Speech and Hearing

      Publicized:
    2017/12/04
      Vol:
    E101-D No:3
      Page(s):
    750-757

    This paper proposes a classification method of second-language-learner utterances for interactive computer-assisted language learning systems. This classification method uses three types of bilingual evaluation understudy (BLEU) scores as features for a classifier. The three BLEU scores are calculated in accordance with three subsets of a learner corpus divided according to the quality of utterances. To overcome the data-sparseness problem, this classification method uses BLEU scores calculated using a mixture of word and part-of-speech (POS)-tag sequences converted from word sequences based on a POS-replacement rule according to which words are replaced with POS tags in n-grams. Experiments on classifying English utterances by Japanese speakers demonstrated that the proposed classification method achieved a classification accuracy of 78.2%, which was 12.3 points higher than a baseline with one BLEU score.
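    To make the idea of mixing words and POS tags in n-grams more concrete, here is a minimal sketch of the clipped n-gram precision that BLEU is built on, together with a toy replacement step. It is only an illustration: full BLEU also combines several n-gram orders and applies a brevity penalty, and the abstract does not spell out the paper's POS-replacement rule, so the `keep_words` rule, the tag names, and the function names below are all assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """BLEU-style modified n-gram precision against a single reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def replace_with_pos(tokens, pos_tags, keep_words):
    """Hypothetical rule: keep listed words, replace all others with their POS tag."""
    return [tok if tok in keep_words else tag for tok, tag in zip(tokens, pos_tags)]

learner = "i want a ticket".split()
reference = "i want the ticket".split()
tags = ["PRON", "VERB", "DET", "NOUN"]      # assumed tags for both sentences
keep = {"i", "want", "ticket"}

word_level = clipped_precision(learner, reference, 2)
mixed_level = clipped_precision(replace_with_pos(learner, tags, keep),
                                replace_with_pos(reference, tags, keep), 2)
```

    In this toy case the word-level bigram precision is 1/3, while the mixed word/POS sequences match on every bigram, which shows how replacing some words with tags can reduce sparseness in n-gram matching.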

  • Automatic Induction of Romanization Systems from Bilingual Corpora

    Keiko TAGUCHI  Andrew FINCH  Seiichi YAMAMOTO  Eiichiro SUMITA  

     
    PAPER-Artificial Intelligence, Data Mining

      Publicized:
    2014/11/14
      Vol:
    E98-D No:2
      Page(s):
    381-393

    In this article we present a novel corpus-based method for inducing romanization systems for languages through a bilingual alignment of transliteration word pairs. First, the word pairs are aligned using a non-parametric Bayesian approach, and then for each grapheme sequence to be romanized, a particular romanization is selected according to a user-specified criterion. As far as we are aware, this paper is the only one to describe a method for automatically deriving complete romanization systems. Unlike existing human-derived romanization systems, the proposed method is able to discover induced romanization systems tailored for specific purposes, for example, for use in data mining, or efficient user input methods. Our experiments study the romanization of four totally different languages: Russian, Japanese, Hindi and Myanmar. The first two languages already have standard romanization systems in regular use, Hindi has a large number of diverse systems, and Myanmar has no standard system for romanization. We compare our induced romanization system to existing systems for Russian and Japanese. We find that the systems so induced are almost identical to the existing system for Russian, and 69% identical to that for Japanese. We applied our approach to the task of transliteration mining, and used Levenshtein distance as the romanization selection criterion. Our experiments show that our induced romanization system was able to match the performance of the human-created system for Russian, and offer substantially improved mining performance for Japanese. We provide an analysis of the mechanism our approach uses to improve mining performance, and also analyse the differences in characteristics between the induced system for Japanese and the official Japanese Nihon-shiki system. In order to investigate the limits of our approach, we studied the romanization of Myanmar, a low-resource language with a large vocabulary of graphemes. We estimate the approximate corpus size required to effectively romanize the k most frequent graphemes in the language for all values of k up to 1800.
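    The abstract names Levenshtein distance as the romanization selection criterion but does not describe how it is applied. As a rough illustration only, the sketch below implements the standard edit-distance recurrence and a hypothetical helper, `select_closest`, that picks the candidate romanization closest to a target spelling; the helper and the example strings are assumptions, not the paper's procedure.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]

def select_closest(candidates, target):
    """Hypothetical selection rule: choose the candidate nearest to the target."""
    return min(candidates, key=lambda c: levenshtein(c, target))

# Two candidate romanizations of the same Japanese word, compared with an
# English-side spelling (illustrative strings only).
best = select_closest(["susi", "sushi"], "sushi")
```

    Under this assumed rule, a romanization that sits closer to the other language's spelling is preferred, which is one plausible reason an induced system could help transliteration mining.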

  • Phoneme Set Design Based on Integrated Acoustic and Linguistic Features for Second Language Speech Recognition

    Xiaoyun WANG  Tsuneo KATO  Seiichi YAMAMOTO  

     
    PAPER-Speech and Hearing

      Publicized:
    2016/12/29
      Vol:
    E100-D No:4
      Page(s):
    857-864

    Recognition of second language (L2) speech is a challenging task even for state-of-the-art automatic speech recognition (ASR) systems, partly because pronunciation by L2 speakers is usually significantly influenced by the mother tongue of the speakers. Considering that the expressions of non-native speakers are usually simpler than those of native ones, and that second language speech usually includes mispronunciation and less fluent pronunciation, we propose a novel method that maximizes a unified acoustic and linguistic objective function to derive a phoneme set for second language speech recognition. The authors verify the efficacy of the proposed method using second language speech collected with a translation game type dialogue-based computer assisted language learning (CALL) system. In this paper, the authors examine the performance based on acoustic likelihood, linguistic discrimination ability and the integrated objective function for second language speech. Experiments demonstrate the validity of the phoneme set derived by the proposed method.

  • Speech Recognition of English by Japanese Using Lexicon Represented by Multiple Reduced Phoneme Sets

    Xiaoyun WANG  Seiichi YAMAMOTO  

     
    PAPER-Speech and Hearing

      Publicized:
    2015/09/10
      Vol:
    E98-D No:12
      Page(s):
    2271-2279

    Recognition of second language (L2) speech is still a challenging task even for state-of-the-art automatic speech recognition (ASR) systems, partly because pronunciation by L2 speakers is usually significantly influenced by the mother tongue of the speakers. The authors previously proposed using a reduced phoneme set (RPS) instead of the canonical one of L2 when the mother tongue of speakers is known, and demonstrated that this reduced phoneme set improved the recognition performance through experiments using English utterances spoken by Japanese. However, the proficiency of L2 speakers varies widely, as does the influence of the mother tongue on their pronunciation. As a result, the effect of the reduced phoneme set is different depending on the speakers' proficiency in L2. In this paper, the authors examine the relation between proficiency of speakers and a reduced phoneme set customized for them. The experimental results are then used as the basis of a novel speech recognition method using a lexicon in which the pronunciation of each lexical item is represented by multiple reduced phoneme sets, and the implementation of a language model most suitable for that lexicon is described. Experimental results demonstrate the high validity of the proposed method.

  • An Adaptive Echo Canceller with Linear Predictor

    Seiichi YAMAMOTO  Seishi KITAYAMA  Junso TAMURA  Hikoichi ISHIGAMI  

     
    PAPER-Communication Theory

      Vol:
    E62-E No:12
      Page(s):
    851-857

    This paper describes the algorithm and convergence properties of an adaptive echo canceller with a linear predictor. Conventional echo cancellers based on the learning identification algorithm may not provide good performance, because the rate of convergence is low due to the high correlation of speech signals, and echoes at the beginning of calls cannot be cancelled. In order to obtain better convergence properties, the new echo canceller adopts linear prediction as the method for decorrelating the speech signals. The identification of the echo path and the generation of the echo replica are conducted independently, and the identification of the echo path is carried out with prediction errors of the speech signals and echo signal when predictor coefficients are decided by the linear prediction of speech signals. The echo replica is generated by putting the received speech signal through the echo path model. Computer simulation has shown that the new echo canceller converges faster than conventional echo cancellers and that the convergence properties are better as the degree of linear prediction is higher and the predictor coefficients are more accurate. When the degree is five, the rate of convergence is about twice as high and Echo Return Loss Enhancement (ERLE) increases by over 10 dB in comparison with the conventional one.
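    For readers unfamiliar with the ERLE figure quoted above, the sketch below computes it using the commonly used definition, the ratio in decibels of echo power before cancellation to residual power after cancellation. The abstract does not state the exact measurement procedure used in the paper, so the function `erle_db` and the sample values are illustrative assumptions.

```python
import math

def erle_db(echo, residual):
    """Echo Return Loss Enhancement in dB: 10*log10(echo power / residual power).

    Assumes both sequences are non-empty and the residual power is nonzero.
    """
    echo_power = sum(s * s for s in echo) / len(echo)
    residual_power = sum(s * s for s in residual) / len(residual)
    return 10.0 * math.log10(echo_power / residual_power)

# A residual with one tenth of the echo amplitude has one hundredth of its power.
example = erle_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1])
```

    With this definition, a 10 dB increase in ERLE corresponds to a tenfold reduction in residual echo power.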

  • A Portable Text-to-Speech System Using a Pocket-Sized Formant Speech Synthesizer

    Norio HIGUCHI  Tohru SHIMIZU  Hisashi KAWAI  Seiichi YAMAMOTO  

     
    PAPER

      Vol:
    E76-A No:11
      Page(s):
    1981-1989

    The authors developed a portable Japanese text-to-speech system using a pocket-sized formant speech synthesizer. It consists of a linguistic processor and an acoustic processor. The linguistic processor runs on an MS-DOS personal computer and has functions to determine readings and prosodic information for input sentences written in kana-kanji-mixed style. New techniques, such as minimization of a cost function for phrases, rare-compound flag, semantic information, information of reading selection and restriction by associated particles, are used to increase the accuracy of readings and accent positions. The accuracy of determining readings and accent positions is 98.6% for sentences in newspaper articles. It is possible to use the linguistic processor through an interface library which has also been developed by the authors. Consequently, it has become possible not only to convert whole texts stored in text files but also to convert parts of sentences sent by the interface library sequentially, and the readings and prosodic information are optimized for the whole sentence at one time. The acoustic processor is custom-made hardware, and it has adopted new techniques for the improvement of rules for vowel devoicing, control of phoneme durations, control of the phrase components of voice fundamental frequency and the construction of the acoustic parameter database. Due to the above-mentioned modifications, the naturalness of synthetic speech generated by a Klatt-type formant speech synthesizer was improved. On a naturalness test it was rated 3.61 on a 5-point scale from 0 to 4.