
Author Search Result

[Author] Yoichi YAMASHITA (12 hits)

1-12 of 12 hits
  • Robust Talker Direction Estimation Based on Weighted CSP Analysis and Maximum Likelihood Estimation

    Yuki DENDA  Takanobu NISHIURA  Yoichi YAMASHITA  

     
    PAPER-Speech Enhancement

      Vol:
    E89-D No:3
      Page(s):
    1050-1057

    This paper describes a new talker direction estimation method for front-end processing to capture distant-talking speech with a microphone array. The proposed method consists of two algorithms: one is a TDOA (Time Delay of Arrival) estimation algorithm based on a weighted CSP (Cross-power Spectrum Phase) analysis with an average speech spectrum and CSP coefficient subtraction; the other is a talker direction estimation algorithm based on ML (Maximum Likelihood) estimation over a time sequence of the estimated TDOAs. To evaluate the effectiveness of the proposed method, talker direction estimation experiments were carried out in an actual office room. The results confirmed that the talker direction estimation performance of the proposed method is superior to that of conventional methods in both diffuse- and directional-noise environments.
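
    As a rough illustration, CSP analysis is closely related to the standard GCC-PHAT technique. The Python sketch below assumes a frequency weight `speech_weight` standing in for the average speech spectrum, and a simple Gaussian model standing in for the ML step over the TDOA sequence; the paper's exact weighting and CSP coefficient subtraction are not reproduced here.

```python
import numpy as np

def weighted_csp_tdoa(x1, x2, speech_weight=None):
    """Estimate a TDOA (in samples) between two channels with a weighted
    CSP (GCC-PHAT-style) analysis. `speech_weight` is an assumed stand-in
    for the average speech spectrum weight (one value per rfft bin)."""
    n = len(x1)
    cross = np.fft.rfft(x1) * np.conj(np.fft.rfft(x2))
    csp_spec = cross / (np.abs(cross) + 1e-12)   # phase transform
    if speech_weight is not None:
        csp_spec = csp_spec * speech_weight      # emphasize speech-dominant bins
    csp = np.fft.fftshift(np.fft.irfft(csp_spec, n))
    return int(np.argmax(csp)) - n // 2          # lag of the CSP peak

def ml_direction(tdoa_sequence, candidate_tdoas, sigma=1.0):
    """Pick the candidate direction whose predicted TDOA maximizes a
    Gaussian likelihood of the observed TDOA sequence -- a simple
    stand-in for the paper's ML estimation over time."""
    t = np.asarray(tdoa_sequence, dtype=float)
    log_lik = [-np.sum((t - c) ** 2) / (2 * sigma ** 2) for c in candidate_tdoas]
    return int(np.argmax(log_lik))
```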

  • Joint Analysis of Sound Events and Acoustic Scenes Using Multitask Learning

    Noriyuki TONAMI  Keisuke IMOTO  Ryosuke YAMANISHI  Yoichi YAMASHITA  

     
    PAPER-Speech and Hearing

      Publicized:
    2020/11/19
      Vol:
    E104-D No:2
      Page(s):
    294-301

    Sound event detection (SED) and acoustic scene classification (ASC) are important research topics in environmental sound analysis. Many research groups have addressed SED and ASC using neural-network-based methods, such as the convolutional neural network (CNN), recurrent neural network (RNN), and convolutional recurrent neural network (CRNN). The conventional methods address SED and ASC separately, even though sound events and acoustic scenes are closely related to each other. For example, in the acoustic scene “office,” the sound events “mouse clicking” and “keyboard typing” are likely to occur. Therefore, information on sound events and acoustic scenes can be expected to benefit both SED and ASC. In this paper, we propose multitask learning for the joint analysis of sound events and acoustic scenes, in which the parts of the networks that hold information common to sound events and acoustic scenes are shared. Experimental results obtained using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of SED and ASC by 1.31 and 1.80 percentage points in terms of the F-score, respectively, compared with the conventional CRNN-based method.
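
    The shared-network idea can be sketched with a small PyTorch model: a common convolutional front end and recurrent layer feed two task heads, one for frame-wise SED and one for clip-level ASC. Layer sizes, pooling, and loss weighting below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultitaskSEDASC(nn.Module):
    """Minimal multitask CRNN sketch: shared layers, two task heads."""

    def __init__(self, n_mels=64, n_events=25, n_scenes=4):
        super().__init__()
        self.shared_cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),            # pool over frequency only
        )
        self.shared_gru = nn.GRU(32 * (n_mels // 4), 64,
                                 batch_first=True, bidirectional=True)
        self.sed_head = nn.Linear(128, n_events)   # per-frame event activity
        self.asc_head = nn.Linear(128, n_scenes)   # per-clip scene label

    def forward(self, spec):                 # spec: (batch, 1, frames, mels)
        h = self.shared_cnn(spec)            # (batch, 32, frames, mels // 4)
        b, c, t, f = h.shape
        h, _ = self.shared_gru(h.permute(0, 2, 1, 3).reshape(b, t, c * f))
        sed = torch.sigmoid(self.sed_head(h))       # multi-label per frame
        asc = self.asc_head(h.mean(dim=1))          # pooled over time
        return sed, asc

# Training would minimize a weighted sum of the two task losses, e.g.
# loss = bce(sed, event_targets) + alpha * ce(asc, scene_target).
```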

  • F0 Parameterization of Glottalized Tones in HMM-Based Speech Synthesis for Hanoi Vietnamese

    Duy Khanh NINH  Yoichi YAMASHITA  

     
    PAPER-Speech and Hearing

      Publicized:
    2015/09/07
      Vol:
    E98-D No:12
      Page(s):
    2280-2289

    A conventional HMM-based speech synthesis system for Hanoi Vietnamese often suffers from hoarse quality due to incomplete F0 parameterization of glottalized tones. Since estimating F0 from glottalized waveforms is rather problematic for usual F0 extractors, we propose a pitch marking algorithm in which pitch marks are propagated from regular regions of a speech signal to glottalized ones, from which complete F0 contours for the glottalized tones are derived. Both objective and listening tests confirmed that the proposed F0 parameterization scheme significantly reduces the hoarseness while slightly improving the tone naturalness of synthetic speech. The pitch marking algorithm works as a refinement step on the results of an F0 extractor, so the proposed scheme can be combined with any F0 extractor.
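
    A hypothetical sketch of the propagation idea, under the simplifying assumption that new marks are placed at intervals of the local pitch period starting from the last reliable mark; the paper's actual propagation rule may differ.

```python
import numpy as np

def propagate_pitch_marks(marks, t0, glottal_end):
    """Extend pitch marks from a regular region into an adjacent
    glottalized one by stepping forward at the local pitch period t0
    (in samples) until the glottalized region ends."""
    marks = list(marks)
    pos = marks[-1]
    while pos + t0 <= glottal_end:
        pos += t0
        marks.append(pos)
    return marks

# F0 for the glottalized stretch then follows from the mark spacing:
# f0[i] = fs / (marks[i + 1] - marks[i]).
```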

  • MASCOTS: Dialog Management System for Speech Understanding System

    Tetsuya YAMAMOTO  Yoshikazu OHTA  Yoichi YAMASHITA  Osamu KAKUSHO  Riichiro MIZOGUCHI  

     
    PAPER-Speech Understanding

      Vol:
    E74-A No:7
      Page(s):
    1881-1888

    This paper describes a dialog management system called MASCOTS, which manages a dialog between a user and a problem-solving system through spoken Japanese and helps the speech understanding system in its language processing. MASCOTS tries to predict the next user utterance based on an architecture that manages the dialog with two stacks and plan information. MASCOTS not only makes language processing more efficient but also works for the problem-solving system: it identifies the kind of utterance and standardizes its representation form in place of the problem-solving system. In this paper, the architecture of MASCOTS is discussed, focusing on the characteristics of dialog and on two ways of predicting the next user utterance while exchanging information with the language processing system.
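
    A toy illustration (not the MASCOTS implementation) of how two stacks can drive utterance prediction: one stack holds pending problem-solving goals, the other holds embedded subdialogs awaiting resumption, and the prediction follows whichever is active.

```python
class DialogManager:
    """Toy two-stack dialog manager predicting the next utterance type."""

    def __init__(self):
        self.goal_stack = []       # pending problem-solving goals
        self.subdialog_stack = []  # interrupted exchanges awaiting resumption

    def predict_next_utterance(self):
        # An embedded subdialog (e.g., a clarification) takes precedence;
        # otherwise the user is expected to address the current goal.
        if self.subdialog_stack:
            return ("answer", self.subdialog_stack[-1])
        if self.goal_stack:
            return ("request", self.goal_stack[-1])
        return ("new_topic", None)
```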

  • Tree-Based Approaches to Automatic Generation of Speech Synthesis Rules for Prosodic Parameters

    Yoichi YAMASHITA  Manabu TANAKA  Yoshitake AMAKO  Yasuo NOMURA  Yoshikazu OHTA  Atsunori KITOH  Osamu KAKUSHO  Riichiro MIZOGUCHI  

     
    PAPER

      Vol:
    E76-A No:11
      Page(s):
    1934-1941

    This paper describes the automatic generation of speech synthesis rules that predict a stress level for each bunsetsu in long noun phrases. The rules are inductively inferred from a large amount of speech data using two kinds of tree-based methods, the conventional decision tree and the SBR-tree. The rule sets generated by the two methods have almost the same performance and reduce the prediction error of the accent component value from 23 Hz to about 14 Hz. The rate of correct reproduction of the change for adjacent bunsetsu pairs is also used as a measure for evaluating the generated rule sets; they correctly reproduce about 80% of the changes. The effectiveness of the rule sets is verified through a listening test. With regard to the comprehensibility of the generated rules, the rules produced by the SBR-tree method are very compact, easy for human experts to interpret, and consistent with earlier studies.
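
    As an illustrative stand-in for inducing such rules from data, a scikit-learn decision tree can predict an accent component value (Hz) from numeric linguistic features. The features and data below are invented placeholders, and the SBR-tree method is not part of scikit-learn.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical features per bunsetsu: mora count, phrase depth, position.
X = [[2, 0, 1], [3, 1, 0], [1, 0, 0], [4, 1, 1]]
y = [30.0, 55.0, 20.0, 60.0]             # accent component value (Hz)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
# The induced tree can be printed as human-readable rules.
print(export_text(tree, feature_names=["morae", "depth", "position"]))
```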

  • A Stochastic F0 Contour Model Based on Clustering and a Probabilistic Measure

    Yoichi YAMASHITA  Tomoyoshi ISHIDA  Kazuki SHIMADERA  

     
    PAPER-Speech Synthesis and Prosody

      Vol:
    E86-D No:3
      Page(s):
    543-549

    One of the fundamental issues in F0 contour modeling is the relationship between F0 parameters and the linguistic information of a sentence. This paper proposes a stochastic F0 model that probabilistically models the relationship between the F0 contour and the linguistic information. For speech synthesis applications, an F0 generator selects the most probable F0 contour from candidates given by the probabilistic F0 model. The F0 contour of a Japanese sentence is represented by the concatenation of F0 patterns of a Japanese syntactic unit, the bunsetsu. A bunsetsu F0 pattern is composed of an F0 average and an F0 shape. The F0 average is predicted independently for each bunsetsu by quantification theory from the linguistic features of the bunsetsu. The most probable sequence of bunsetsu F0 shapes for a sentence is found in the F0 shape database based on a probabilistic measure. The probability that an F0 contour is observed for a sentence is defined by two kinds of probabilities: the F0 shape production probability and the F0 shape bigram. The latter is the probability of adjacent occurrence of two F0 shapes, analogous to a word bigram in speech recognition. Several typical bunsetsu F0 shapes are extracted by clustering the training data and stored in the F0 shape database. The F0 shape production probability is computed for each bunsetsu based on the distribution of values of the linguistic features in a cluster. The RMS prediction error of the F0 contour is 0.26 octaves.
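
    Combining a per-bunsetsu production probability with a shape bigram makes the search a standard Viterbi problem. Below is a minimal sketch under that reading of the abstract; shapes are cluster indices and inputs are log-probability arrays, not the paper's exact search.

```python
import numpy as np

def best_shape_sequence(prod_logp, bigram_logp):
    """Viterbi search for the most probable F0 shape sequence.

    prod_logp:   (n_bunsetsu, n_shapes) log P(shape | linguistic features)
    bigram_logp: (n_shapes, n_shapes)   log P(shape_j follows shape_i)
    """
    prod_logp = np.asarray(prod_logp, dtype=float)
    bigram_logp = np.asarray(bigram_logp, dtype=float)
    n, k = prod_logp.shape
    delta = prod_logp[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        scores = delta[:, None] + bigram_logp + prod_logp[t][None, :]
        back[t] = scores.argmax(axis=0)     # best predecessor per shape
        delta = scores.max(axis=0)
    seq = [int(delta.argmax())]             # backtrack from the best end
    for t in range(n - 1, 0, -1):
        seq.append(int(back[t][seq[-1]]))
    return seq[::-1]
```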

  • FOREWORD (Open Access)

    Yoichi YAMASHITA  

     
    FOREWORD

      Vol:
    E97-D No:6
      Page(s):
    1402-1402

  • Multiple Sound Source Localization Based on Inter-Channel Correlation Using a Distributed Microphone System in a Real Environment

    Kook CHO  Hajime OKUMURA  Takanobu NISHIURA  Yoichi YAMASHITA  

     
    PAPER-Microphone Array

      Vol:
    E93-D No:9
      Page(s):
    2463-2471

    In real environments, ambient noise and room reverberation seriously degrade the accuracy of sound source localization. In addition, conventional sound source localization methods cannot accurately localize multiple sound sources in real noisy environments. This paper proposes a new method of multiple sound source localization using a distributed microphone system, a recording system with multiple microphones dispersed over a wide area. The proposed method localizes a sound source by finding the position that maximizes the accumulated correlation coefficient over multiple channel pairs. After the first sound source is estimated, the typical pattern of the accumulated correlation for a single sound source is subtracted from the observed distribution of the accumulated correlation, and the second sound source is then searched for. To evaluate the effectiveness of the proposed method, experiments on two-sound-source localization were carried out in an office room. The results show that the sound source localization accuracy is about 99.7%, confirming that the proposed method localizes multiple sound sources robustly and stably.
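
    A simplified sketch of the search criterion: score each candidate position by accumulating pairwise correlation coefficients after compensating each microphone pair for the propagation delay implied by that position. Circular shifting via `np.roll` is a simplification; the paper's pattern subtraction step is only noted in a comment.

```python
import numpy as np

def accumulated_correlation(position, mics, signals, fs, c=343.0):
    """Accumulate inter-channel correlation over all microphone pairs of a
    distributed array for one candidate source position."""
    position = np.asarray(position, dtype=float)
    delays = [np.linalg.norm(position - np.asarray(m)) / c for m in mics]
    score = 0.0
    for i in range(len(mics)):
        for j in range(i + 1, len(mics)):
            lag = int(round((delays[i] - delays[j]) * fs))
            a, b = signals[i], np.roll(signals[j], lag)  # delay compensation
            score += np.corrcoef(a, b)[0, 1]
    return score

# The position maximizing this score is taken as the first source; for a
# second source, a typical single-source correlation pattern would be
# subtracted from the score map before searching again.
```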

  • MASCOTS II: A Dialog Manager in General Interface for Speech Input and Output

    Yoichi YAMASHITA  Hideaki YOSHIDA  Takashi HIRAMATSU  Yasuo NOMURA  Riichiro MIZOGUCHI  

     
    PAPER

      Vol:
    E76-D No:1
      Page(s):
    74-83

    This paper describes a general interface system for speech input and output and a dialog management system, MASCOTS, which is a component of the interface system. The authors designed the interface system with attention to its generality; that is, it does not depend on the problem-solving system to which it is connected. The previous version of MASCOTS handled dialog processing only for speech input, based on SR-plans. We extend MASCOTS to cover speech output to the user. The revised version, named MASCOTS II, makes use of topic information given by the topic packet network (TPN), which models the topic transitions in dialogs. Input and output messages are described with a concept representation based on the case structure. For speech input, prediction of the user's utterances is enhanced by using the TPN, which compensates for the shortcomings of the SR-plan and improves the accuracy of prediction for stimulus utterances of the user. For dialog processing in speech output, MASCOTS II extracts emphatic words and restores missing words in the output message when necessary, e.g., in order to report the results of speech recognition. The basic mechanisms of the SR-plan and the TPN are shared between the speech input and output processes in MASCOTS II.
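
    A toy reading of the TPN as a topic transition graph: from the current topic, the most likely next topics are proposed to constrain utterance prediction. The topics and transition weights below are invented placeholders, not taken from the paper.

```python
# Hypothetical topic transition graph standing in for a TPN.
TPN = {
    "greeting":      {"task_request": 0.7, "greeting": 0.3},
    "task_request":  {"clarification": 0.5, "confirmation": 0.5},
    "clarification": {"task_request": 0.6, "confirmation": 0.4},
}

def predict_next_topics(current, k=2):
    """Return the k most likely next topics from the current one."""
    nxt = TPN.get(current, {})
    return sorted(nxt, key=nxt.get, reverse=True)[:k]

print(predict_next_topics("task_request"))
```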

  • Omnidirectional Audio-Visual Talker Localization Based on Dynamic Fusion of Audio-Visual Features Using Validity and Reliability Criteria

    Yuki DENDA  Takanobu NISHIURA  Yoichi YAMASHITA  

     
    PAPER-Applications

      Vol:
    E91-D No:3
      Page(s):
    598-606

    This paper proposes a robust omnidirectional audio-visual (AV) talker localizer for AV applications. The proposed localizer consists of two innovations. The first is a set of robust omnidirectional audio and visual features: direction-of-arrival (DOA) estimation using an equilateral triangular microphone array and human position estimation using an omnidirectional video camera extract the AV features. The second is a dynamic fusion of the AV features. A validity criterion, called the audio- or visual-localization counter, validates each audio or visual feature. A reliability criterion, called the speech arriving evaluator, acts as a dynamic weight that eliminates any prior statistical properties from the fusion procedure. Under a single fusion rule, the proposed localizer achieves both talker localization during speech activity and user localization during non-speech activity. Talker localization experiments were conducted in an actual room to evaluate the effectiveness of the proposed localizer. The results confirmed that the talker localization performance of the proposed AV localizer using the validity and reliability criteria is superior to that of conventional localizers.
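
    A toy weighted-fusion sketch of the idea: validity flags gate each modality and a speech-reliability weight shifts the balance between them. The flags and the 0..1 reliability value are invented placeholders; the paper's actual criteria are more elaborate.

```python
def fuse_av_direction(audio_doa, visual_doa, audio_valid, visual_valid,
                      speech_reliability):
    """Fuse audio and visual direction estimates with dynamic weights.
    `audio_valid`/`visual_valid` play the role of the validity criterion
    (localization counters); `speech_reliability` plays the role of the
    reliability criterion (speech arriving evaluator)."""
    w_audio = speech_reliability if audio_valid else 0.0
    w_visual = (1.0 - speech_reliability) if visual_valid else 0.0
    if w_audio + w_visual == 0.0:
        return None                        # no trustworthy estimate
    return (w_audio * audio_doa + w_visual * visual_doa) / (w_audio + w_visual)
```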

  • Automatic Scoring for Prosodic Proficiency of English Sentences Spoken by Japanese Based on Utterance Comparison

    Yoichi YAMASHITA  Keisuke KATO  Kazunori NOZAWA  

     
    PAPER-Speech Synthesis and Prosody

      Vol:
    E88-D No:3
      Page(s):
    496-501

    This paper describes techniques for scoring the prosodic proficiency of English sentences spoken by Japanese learners. A multiple regression model predicts the prosodic proficiency using new prosodic measures based on the characteristics of Japanese novice learners of English. The prosodic measures are calculated by comparing prosodic parameters, such as F0, power, and duration, of the learner's and a native speaker's speech. The new measures include the approximation error of a fitted line and the comparison of prosodic parameters over a limited segment around the word boundary rather than the whole utterance. This paper shows that introducing the new measures improves the correlation between teachers' scores and automatic scores by 0.1.
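
    An illustrative stand-in for the scoring model: a multiple regression from prosodic comparison measures to a teacher's score. The measures, data, and score scale below are invented placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical columns: F0 contour distance, power distance, duration
# ratio error, fitting-line approximation error near word boundaries.
X = np.array([[0.8, 0.5, 0.3, 0.2],
              [0.4, 0.3, 0.2, 0.1],
              [1.2, 0.9, 0.6, 0.5]])
y = np.array([3.0, 4.0, 2.0])             # teacher scores (e.g., 1-5 scale)

model = LinearRegression().fit(X, y)
print(model.predict([[0.6, 0.4, 0.25, 0.15]]))   # score a new utterance
```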

  • An Utterance Prediction Method Based on the Topic Transition Model

    Yoichi YAMASHITA  Takashi HIRAMATSU  Osamu KAKUSHO  Riichiro MIZOGUCHI  

     
    PAPER

      Vol:
    E78-D No:6
      Page(s):
    622-628

    This paper describes a method for predicting the user's next utterances in spoken dialog based on a topic transition model, named the TPN. Templates are prepared for each utterance-pair pattern modeled by the SR-plan; they are represented in terms of five kinds of topic-independent sentence constituents. The topic of an utterance is predicted based on the TPN model, and the prediction instantiates the templates. The language processing unit analyzes the speech recognition result using the templates. An experiment shows that introducing the TPN model improves the performance of utterance recognition and drastically reduces the search space of candidates in the input bunsetsu lattice.
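
    A toy sketch of topic-driven template instantiation: a predicted topic supplies slot values for topic-independent templates, and the instantiated sentences then constrain the analysis of the recognition output. Template forms and slots are invented placeholders.

```python
# Hypothetical topic-independent templates with named slots.
TEMPLATES = [
    "please {action} the {object}",
    "what is the {attribute} of the {object}",
]

def instantiate(topic_slots):
    """Fill templates with the slot values of a predicted topic; skip
    templates that need a slot the topic does not provide."""
    out = []
    for t in TEMPLATES:
        try:
            out.append(t.format(**topic_slots))
        except KeyError:
            pass
    return out

print(instantiate({"action": "move", "object": "file"}))
```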