
Author Search Result

[Author] Mikio NAKANO(5hit)

1-5hit
  • Automatic Allocation of Training Data for Speech Understanding Based on Multiple Model Combinations

    Kazunori KOMATANI  Mikio NAKANO  Masaki KATSUMARU  Kotaro FUNAKOSHI  Tetsuya OGATA  Hiroshi G. OKUNO  

     
    PAPER-Speech and Hearing

    Vol: E95-D No: 9
    Page(s): 2298-2307

    The optimal way to build speech understanding modules depends on the amount of training data available. When only a small amount of training data is available, effective allocation of the data is crucial to preventing overfitting of statistical methods. We have developed a method for allocating a limited amount of training data in accordance with the amount available. When the amount of data is small, our method exploits the rule-based methods included in our speech understanding framework, which is based on multiple model combinations, i.e., multiple automatic speech recognition (ASR) modules and multiple language understanding (LU) modules. It then allocates training data preferentially to the modules that dominate the overall performance of speech understanding. Experimental evaluation showed that our allocation method consistently outperforms baseline methods that use a single ASR module and a single LU module as the amount of training data increases.

  • Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances

    Kazunori KOMATANI  Naoki HOTTA  Satoshi SATO  Mikio NAKANO  

     
    PAPER-Speech and Hearing

    Publicized: 2015/07/24
    Vol: E98-D No: 11
    Page(s): 1923-1931

    Appropriate turn-taking is as important in spoken dialogue systems as generating correct responses. Especially if the dialogue features quick responses, a user utterance is often incorrectly segmented by voice activity detection (VAD) due to short pauses within it. Incorrectly segmented utterances cause problems in both the automatic speech recognition (ASR) results and turn-taking: an incorrect VAD result leads to ASR errors and causes the system to start responding while the user is still speaking. We develop a method that performs a posteriori restoration for incorrectly segmented utterances and implement it as a plug-in for the MMDAgent open-source software. A crucial part of the method is to classify whether the restoration is required or not. We cast it as a binary classification problem of detecting originally single utterances from pairs of utterance fragments. Various features are used, representing timing, prosody, and ASR result information. Experiments show that the proposed method outperformed a baseline with manually selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant and domain-independent features were utterance intervals and results from the Gaussian mixture model (GMM).

  • Discriminating Unknown Objects from Known Objects Using Image and Speech Information

    Yuko OZASA  Mikio NAKANO  Yasuo ARIKI  Naoto IWAHASHI  

     
    PAPER-Multimedia Pattern Processing

    Publicized: 2014/12/16
    Vol: E98-D No: 3
    Page(s): 704-711

    This paper deals with a problem where a robot identifies an object that a human asks it to bring by voice when there is a set of objects that the human and the robot can see. When the robot knows the requested object, it must identify the object and when it does not know the object, it must say it does not. This paper presents a new method for discriminating unknown objects from known objects using object images and human speech. It uses a confidence measure that integrates image recognition confidences and speech recognition confidences based on logistic regression.
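The abstract above does not give the form of the confidence measure, so the following is only a minimal sketch of how two recognition confidences might be combined with a logistic function. The function names, the weights, the bias, and the decision threshold are all hypothetical placeholders for illustration; in practice such parameters would be fitted to training data, and the paper's actual features and model may differ.

```python
import math


def integrated_confidence(image_conf, speech_conf,
                          w_image=4.0, w_speech=4.0, bias=-4.0):
    """Combine two confidences with a logistic (sigmoid) function.

    The weights and bias are hypothetical values chosen for illustration,
    not parameters reported in the paper.
    """
    z = bias + w_image * image_conf + w_speech * speech_conf
    return 1.0 / (1.0 + math.exp(-z))


def is_known_object(image_conf, speech_conf, threshold=0.5):
    """Treat the requested object as known if the integrated confidence
    reaches the (assumed) threshold; otherwise treat it as unknown."""
    return integrated_confidence(image_conf, speech_conf) >= threshold


if __name__ == "__main__":
    # High confidence from both modalities: treated as a known object.
    print(is_known_object(0.9, 0.9))
    # Low confidence from both modalities: treated as an unknown object.
    print(is_known_object(0.1, 0.1))
```

With this kind of measure, a low score in one modality can be offset by a high score in the other, which is one plausible motivation for integrating the two confidences rather than thresholding each separately.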

  • Ranking Multiple Dialogue States by Corpus Statistics to Improve Discourse Understanding in Spoken Dialogue Systems

    Ryuichiro HIGASHINAKA  Mikio NAKANO  

     
    PAPER-Natural Language Processing

    Vol: E92-D No: 9
    Page(s): 1771-1782

    This paper discusses the discourse understanding process in spoken dialogue systems. This process enables a system to understand user utterances from the context of a dialogue. Ambiguity in user utterances caused by multiple speech recognition hypotheses and parsing results sometimes makes it difficult for a system to decide on a single interpretation of a user intention. As a solution, the idea of retaining possible interpretations as multiple dialogue states and resolving the ambiguity using succeeding user utterances has been proposed. Although this approach has proven to improve discourse understanding accuracy, carefully created hand-crafted rules are necessary in order to accurately rank the dialogue states. This paper proposes automatically ranking multiple dialogue states using statistical information obtained from dialogue corpora. The experimental results in the train ticket reservation and weather information service domains show that the statistical information can significantly improve the ranking accuracy of dialogue states as well as the slot accuracy and the concept error rate of the top-ranked dialogue states.

  • A Method for Predicting Stressed Words in Teaching Materials for English Jazz Chants

    Ryo NAGATA  Kotaro FUNAKOSHI  Tatsuya KITAMURA  Mikio NAKANO  

     
    PAPER-Educational Technology

    Vol: E95-D No: 11
    Page(s): 2658-2663

    To acquire a second language, one must develop an ear and tongue for the correct stress and intonation patterns of that language. In English language teaching, there is an effective method called Jazz Chants for working on the sound system. In this paper, we propose a method for predicting stressed words, which play a crucial role in Jazz Chants. The proposed method is specially designed for stress prediction in Jazz Chants. It exploits several sources of information, including words, POSs, sentence types, and the constraint on the number of stressed words in a chant text. Experiments show that the proposed method achieves an F-measure of 0.939 and outperforms the other methods implemented for comparison. The proposed method is expected to be useful in supporting non-native teachers of English when they teach chants to students and create chant texts with stress marks from arbitrary texts.
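The abstract above reports an F-measure without stating how it is computed. As a rough illustration, the sketch below assumes the balanced F-measure (F1) over the positions of stressed words, comparing predicted positions against gold annotations. The function name and the position-based representation are assumptions made for this example, not details taken from the paper.

```python
def f_measure(predicted, gold):
    """Balanced F-measure (F1) between predicted and gold stressed-word
    positions. Both arguments are iterables of word indices."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # No overlap (or an empty prediction): define the score as 0.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical chant line of six words: the gold annotation stresses
    # words 0, 2 and 5, while the prediction stresses words 0, 2 and 3.
    print(f_measure({0, 2, 3}, {0, 2, 5}))
```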