
Keyword Search Result

[Keyword] spoken dialogue (14 hits)

1-14 of 14 hits
  • Utterance Intent Classification for Spoken Dialogue System with Data-Driven Untying of Recursive Autoencoders Open Access

    Tsuneo KATO  Atsushi NAGAI  Naoki NODA  Jianming WU  Seiichi YAMAMOTO  

     
    PAPER-Natural Language Processing
    Publicized: 2019/03/04  Vol: E102-D No:6  Page(s): 1197-1205

    Data-driven untying of a recursive autoencoder (RAE) is proposed for utterance intent classification in spoken dialogue systems. Although an RAE expresses a nonlinear operation on two neighboring child nodes in a parse tree when applied to spoken language understanding (SLU) in spoken dialogue systems, that single operation is considered to be intrinsically different depending on the types of the child nodes. To reduce the gap between the single nonlinear operation of an RAE and the intrinsically different operations required for different node types, a data-driven untying of autoencoders based on part-of-speech (PoS) tags at leaf nodes is proposed. Experimental results on two corpora, the English ATIS data set and a Japanese data set from a smartphone-based spoken dialogue system, showed improved accuracies over the tied RAE, as well as a reasonable difference in untying between the two languages.
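
    The paper does not include code; purely as an illustration, here is a minimal Python/NumPy sketch of the untying idea, with an invented PoS-to-cluster mapping and toy dimensions. Which autoencoder composes a pair of children is selected by the (clustered) types of those children:

```python
import numpy as np

# Illustrative only: an RAE composes two child vectors into a parent
# vector; an untied RAE keeps several autoencoders and picks one by the
# (clustered) types of the child nodes. The PoS clustering is learned
# from data in the paper; here it is hard-coded.
DIM = 50
rng = np.random.default_rng(0)

pos_cluster = {"NOUN": 0, "PROPN": 0, "VERB": 1, "ADP": 1, "PHRASE": 2}
n_clusters = 3
W = {(i, j): rng.normal(0, 0.1, (DIM, 2 * DIM))
     for i in range(n_clusters) for j in range(n_clusters)}
b = {key: np.zeros(DIM) for key in W}

def compose(left_vec, left_tag, right_vec, right_tag):
    """Compose two children with the autoencoder untied by node type."""
    key = (pos_cluster[left_tag], pos_cluster[right_tag])
    parent = np.tanh(W[key] @ np.concatenate([left_vec, right_vec]) + b[key])
    return parent, "PHRASE"          # inner nodes get a generic phrase type

# Bottom-up composition over a toy parse: (flight (to boston))
v = {w: rng.normal(0, 0.1, DIM) for w in ["flight", "to", "boston"]}
p, t = compose(v["to"], "ADP", v["boston"], "PROPN")
root, _ = compose(v["flight"], "NOUN", p, t)
# `root` would feed the intent classifier in the full system.
```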

  • Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances

    Kazunori KOMATANI  Naoki HOTTA  Satoshi SATO  Mikio NAKANO  

     
    PAPER-Speech and Hearing
    Publicized: 2015/07/24  Vol: E98-D No:11  Page(s): 1923-1931

    Appropriate turn-taking is as important in spoken dialogue systems as generating correct responses. In particular, when the dialogue features quick responses, voice activity detection (VAD) often segments a user utterance incorrectly at short pauses within it. Incorrectly segmented utterances cause problems both in automatic speech recognition (ASR) results and in turn-taking: an incorrect VAD result leads to ASR errors and causes the system to start responding while the user is still speaking. We develop a method that performs a posteriori restoration of incorrectly segmented utterances and implement it as a plug-in for the MMDAgent open-source software. A crucial part of the method is classifying whether restoration is required or not. We cast this as a binary classification problem of detecting originally single utterances from pairs of utterance fragments, using various features representing timing, prosody, and ASR result information. Experiments show that the proposed method outperformed a baseline with manually selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant, domain-independent features were utterance intervals and results from a Gaussian mixture model (GMM).
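
    As a rough illustration of the restoration decision (not the paper's actual classifier or feature set), the following sketch casts a pair of adjacent fragments as one binary-classification instance; the feature names are hypothetical:

```python
# Minimal sketch: given two adjacent utterance fragments, decide whether
# they were originally a single utterance split by VAD. Features stand in
# for the paper's timing, prosody, and ASR-result information.
from sklearn.linear_model import LogisticRegression
import numpy as np

def pair_features(frag_a, frag_b):
    return np.array([
        frag_b["start"] - frag_a["end"],        # interval between fragments
        frag_a["f0_end"] - frag_b["f0_start"],  # pitch continuity (prosody)
        frag_a["asr_conf"], frag_b["asr_conf"], # ASR confidence scores
    ])

# Toy training data: label 1 = fragments came from one utterance.
X = np.array([[0.15, 5.0, 0.9, 0.8], [1.80, 40.0, 0.95, 0.9]])
y = np.array([1, 0])
clf = LogisticRegression().fit(X, y)

frag_a = {"end": 3.2, "f0_end": 180.0, "asr_conf": 0.88}
frag_b = {"start": 3.35, "f0_start": 176.0, "asr_conf": 0.84}
if clf.predict([pair_features(frag_a, frag_b)])[0] == 1:
    print("restore: rerun ASR on the concatenated segment")
```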

  • Automatic Allocation of Training Data for Speech Understanding Based on Multiple Model Combinations

    Kazunori KOMATANI  Mikio NAKANO  Masaki KATSUMARU  Kotaro FUNAKOSHI  Tetsuya OGATA  Hiroshi G. OKUNO  

     
    PAPER-Speech and Hearing
    Vol: E95-D No:9  Page(s): 2298-2307

    The optimal way to build speech understanding modules depends on the amount of training data available. When only a small amount of training data is available, effective allocation of the data is crucial to preventing overfitting of statistical methods. We have developed a method for allocating a limited amount of training data in accordance with the amount available. Our speech understanding framework is based on multiple model combinations, i.e., multiple automatic speech recognition (ASR) modules and multiple language understanding (LU) modules, and includes rule-based methods that are effective when the amount of data is small. Our method allocates training data preferentially to the modules that dominate the overall performance of speech understanding. Experimental evaluation showed that our allocation method consistently outperforms baseline methods that use a single ASR module and a single LU module as the amount of training data increases.
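
    A minimal sketch of the preferential-allocation loop, assuming a placeholder estimated_gain function that predicts how much overall understanding accuracy improves if a given module is trained on the next batch (the paper's actual criterion differs):

```python
# Sketch (not the paper's exact procedure): each batch of training data
# goes to the module whose estimated retraining gain is largest.
def allocate(batches, modules, estimated_gain):
    """
    modules: dict mapping module name (e.g. 'asr_1', 'lu_1') to a
             train(batch) callable.
    estimated_gain(name, batch): placeholder returning the expected
             improvement in overall understanding accuracy.
    """
    allocation = []
    for batch in batches:
        best = max(modules, key=lambda name: estimated_gain(name, batch))
        modules[best](batch)           # train only the dominant module
        allocation.append(best)
    return allocation

# Toy usage: two ASR modules and one LU module with dummy trainers.
trained = {"asr_1": 0, "asr_2": 0, "lu_1": 0}
def make_trainer(name):
    def train(batch):
        trained[name] += len(batch)
    return train
mods = {n: make_trainer(n) for n in trained}
print(allocate([range(10), range(20)], mods,
               lambda n, b: len(b) if n.startswith("lu") else 1))
```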

  • Selecting Help Messages by Using Robust Grammar Verification for Handling Out-of-Grammar Utterances in Spoken Dialogue Systems

    Kazunori KOMATANI  Yuichiro FUKUBAYASHI  Satoshi IKEDA  Tetsuya OGATA  Hiroshi G. OKUNO  

     
    PAPER-Speech and Hearing
    Vol: E93-D No:12  Page(s): 3359-3367

    We address the issue of out-of-grammar (OOG) utterances in spoken dialogue systems by generating help messages. Help message generation for OOG utterances is a challenge because language understanding based on automatic speech recognition (ASR) of OOG utterances is usually erroneous; important words are often misrecognized or missing from such utterances. Our grammar verification method uses a weighted finite-state transducer to accurately identify the grammar rule that the user intended to use for the utterance, even if important words are missing from the ASR results. We then use a ranking algorithm, RankBoost, to rank help message candidates in order of likely usefulness. Its features include the grammar verification results and the utterance history representing the user's experience.
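
    The ranking step might be sketched as follows; the features and weights here are invented stand-ins (the paper learns the ranking with RankBoost rather than using a fixed weighted sum):

```python
# Hedged sketch of the final ranking: help-message candidates are scored
# from features such as grammar-verification scores and the user's
# utterance history, then sorted best-first.
from dataclasses import dataclass

@dataclass
class HelpCandidate:
    message: str
    verification_score: float   # from WFST grammar verification
    rule_usage_count: int       # how often the user used this rule before

def rank_help_messages(candidates, w_verif=1.0, w_novelty=0.3):
    def score(c):
        novelty = 1.0 / (1 + c.rule_usage_count)  # prefer unfamiliar rules
        return w_verif * c.verification_score + w_novelty * novelty
    return sorted(candidates, key=score, reverse=True)

cands = [
    HelpCandidate("You can say: 'list flights from <city>'", 0.8, 5),
    HelpCandidate("You can say: 'what is the fare to <city>'", 0.7, 0),
]
for c in rank_help_messages(cands):
    print(c.message)
```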

  • Ranking Multiple Dialogue States by Corpus Statistics to Improve Discourse Understanding in Spoken Dialogue Systems

    Ryuichiro HIGASHINAKA  Mikio NAKANO  

     
    PAPER-Natural Language Processing
    Vol: E92-D No:9  Page(s): 1771-1782

    This paper discusses the discourse understanding process in spoken dialogue systems, which enables a system to understand user utterances from the context of a dialogue. Ambiguity in user utterances caused by multiple speech recognition hypotheses and parsing results sometimes makes it difficult for a system to decide on a single interpretation of a user intention. As a solution, the idea of retaining possible interpretations as multiple dialogue states and resolving the ambiguity using succeeding user utterances has been proposed. Although this approach has been proven to improve discourse understanding accuracy, carefully created hand-crafted rules are necessary to rank the dialogue states accurately. This paper proposes automatically ranking multiple dialogue states using statistical information obtained from dialogue corpora. Experimental results in the train ticket reservation and weather information service domains show that the statistical information significantly improves the ranking accuracy of dialogue states as well as the slot accuracy and the concept error rate of the top-ranked dialogue states.
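
    As a toy illustration of corpus-statistics ranking (the paper's statistics are richer), the following sketch scores each candidate dialogue state by smoothed log-probabilities of its filled slots, estimated from a corpus:

```python
# Illustrative only: keep multiple candidate dialogue states and rank
# them by a corpus-derived score, here a smoothed unigram log-probability
# of the slots filled so far.
import math
from collections import Counter

# Slot occurrence counts collected from a (toy) dialogue corpus.
corpus_slot_counts = Counter({"from=tokyo": 120, "to=osaka": 90,
                              "to=osake": 1, "date=today": 200})
total = sum(corpus_slot_counts.values())

def state_score(state):
    """state: list of slot=value strings from one interpretation."""
    return sum(math.log((corpus_slot_counts[s] + 1) / (total + 1))
               for s in state)

# Two interpretations of an ambiguous utterance, ranked by the corpus.
states = [["from=tokyo", "to=osake"],   # likely an ASR error
          ["from=tokyo", "to=osaka"]]
print(max(states, key=state_score))     # -> ['from=tokyo', 'to=osaka']
```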

  • An Interactive Open-Vocabulary Chinese Name Input System Using Syllable Spelling and Character Description Recognition Modules for Error Correction

    Nick Jui Chang WANG  

     
    PAPER-Speech and Hearing
    Vol: E90-D No:11  Page(s): 1796-1804

    Open-vocabulary name recognition is one of the most challenging tasks in automatic Chinese speech recognition. It can serve as a free name input method for telephony speech applications and automatic directory assistance systems. A Chinese name usually has two to three characters, each pronounced as a single tonal syllable, so recognizing a three-syllable word from millions to billions of possible candidates is highly confusable. A novel interactive automatic-speech-recognition system is proposed to resolve this highly challenging task. The system was built as an open-vocabulary Chinese name recognition system using character-based approaches. Two character-input speech-recognition modules were designed as back-off approaches to complete the name input or to correct any misrecognized characters. Finite-state networks for these two modules were compiled from regular grammars of syllable spellings and character descriptions. The possible candidate names number more than five billion. The system has been tested publicly and proved to be a robust way to interact with the speaker, achieving an 86.7% name recognition success rate.
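
    The interactive back-off might be organized roughly as below; every recognizer and confirmation function here is a hypothetical placeholder, not the system's actual API:

```python
# Hypothetical control flow: first try whole-name recognition; on
# rejection, re-enter each character via the syllable-spelling or
# character-description recognizer.
def input_chinese_name(recognize_name, confirm, recognize_char):
    name = recognize_name()            # full two/three-syllable hypothesis
    if confirm(f"Is your name {name}?"):
        return name
    chars = []
    for i, ch in enumerate(name):
        if not confirm(f"Is character {i + 1} '{ch}' correct?"):
            # back off to the character-input recognition modules
            ch = recognize_char(position=i)
        chars.append(ch)
    return "".join(chars)

# Canned demo: the "recognizers" below just return fixed values.
print(input_chinese_name(lambda: "王小明",
                         lambda prompt: "小" not in prompt,  # rejects char 2
                         lambda position: "曉"))
```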

  • Interface for Barge-in Free Spoken Dialogue System Using Nullspace Based Sound Field Control and Beamforming

    Shigeki MIYABE  Hiroshi SARUWATARI  Kiyohiro SHIKANO  Yosuke TATEKURA  

     
    PAPER-Speech/Audio Processing
    Vol: E89-A No:3  Page(s): 716-726

    In this paper, we describe a new interface for a barge-in free spoken dialogue system combining multichannel sound field control and beamforming, in which the response sound from the system is canceled out at the microphone points. The conventional method prevents the user from moving because it forces the user to stay at a fixed position where the response sound is reproduced. Since the proposed method does not set control points for reproducing the response sound to the user, the user is free to move. Furthermore, relaxing the strict reproduction of the response sound enables us to design a stable system with fewer loudspeakers than the conventional method. The proposed method shows higher performance in speech recognition experiments.
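
    The core linear-algebra idea can be illustrated in a single narrowband (one frequency bin) toy: with more loudspeakers than microphones, any driving weight vector in the nullspace of the loudspeaker-to-microphone transfer matrix produces zero response sound at the microphones. A minimal sketch, not the paper's full broadband design:

```python
# Narrowband toy: choose loudspeaker driving weights in the nullspace of
# the loudspeaker-to-microphone transfer matrix G, so the system's
# response sound arrives with zero amplitude at the microphones.
import numpy as np

rng = np.random.default_rng(1)
n_mics, n_spk = 2, 4                       # more loudspeakers than mics
G = rng.normal(size=(n_mics, n_spk)) + 1j * rng.normal(size=(n_mics, n_spk))

# Nullspace basis of G via SVD: columns v with G @ v == 0.
_, s, Vh = np.linalg.svd(G)
null_basis = Vh[n_mics:].conj().T          # shape (n_spk, n_spk - n_mics)

w = null_basis @ rng.normal(size=n_spk - n_mics)  # any nullspace combination
print(np.abs(G @ w).max())                 # ~1e-16: silent at the mics
```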

  • Proposal of a Multimodal Interaction Description Language for Various Interactive Agents

    Masahiro ARAKI  Akiko KOUZAWA  Kenji TACHIBANA  

     
    PAPER
    Vol: E88-D No:11  Page(s): 2469-2476

    In this paper, we propose a new multimodal interaction description language, MIML (Multimodal Interaction Markup Language), which defines dialogue patterns between humans and various types of interactive agents. The key feature of this language is its three-layered description of agent-based interactive systems. The high-level description is a task definition from which typical agent-based interactive task control information can easily be constructed. The middle-level description is an interaction description that defines the agent's behavior and the user's input at the granularity of a dialogue segment. The low-level description is a platform-dependent description that can override the pre-defined functions in the interaction description. The task level is connected to the interaction level by generating interaction description templates from the task-level description; the interaction level is connected to the platform level by an XML binding mechanism. Compared with other languages, MIML has advantages in high-level interaction description, modality extensibility, and compatibility with standardized technologies.

  • Interface for Barge-in Free Spoken Dialogue System Combining Adaptive Sound Field Control and Microphone Array

    Tatsunori ASAI  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    LETTER-Speech and Hearing
    Vol: E88-A No:6  Page(s): 1613-1618

    This paper describes a new interface for a barge-in free spoken dialogue system combining adaptive sound field control and a microphone array. To achieve robustness against changes in transfer functions caused by various interferences, a barge-in free spoken dialogue system using sound field control and a microphone array was previously proposed by one of the authors. However, that method cannot follow changes in the transfer functions because it consists of fixed filters. To solve this problem, we introduce a new adaptive sound field control that follows the changes in the transfer functions.
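
    As a stand-in for the adaptive part (the paper's adaptation scheme may differ), a textbook normalized-LMS recursion shows how a transfer-function estimate can track changes online:

```python
# Toy NLMS update: re-estimate an FIR transfer function online so the
# control keeps canceling the response sound as the room changes.
import numpy as np

def nlms_step(h_est, x_buf, mic_sample, mu=0.5, eps=1e-8):
    """One NLMS update of the transfer-function estimate h_est."""
    err = mic_sample - h_est @ x_buf          # residual response sound
    h_est = h_est + mu * err * x_buf / (x_buf @ x_buf + eps)
    return h_est, err

# Noiseless identification demo: the estimate converges to the truth.
rng = np.random.default_rng(2)
h_true, h_est = rng.normal(size=8), np.zeros(8)
x = rng.normal(size=1000)
for n in range(8, 1000):
    buf = x[n - 8:n][::-1]
    h_est, e = nlms_step(h_est, buf, h_true @ buf)
print(np.linalg.norm(h_true - h_est))        # small residual error
```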

  • Dialogue Speech Recognition by Combining Hierarchical Topic Classification and Language Model Switching

    Ian R. LANE  Tatsuya KAWAHARA  Tomoko MATSUI  Satoshi NAKAMURA  

     
    PAPER-Spoken Language Systems
    Vol: E88-D No:3  Page(s): 446-454

    An efficient, scalable speech recognition architecture combining topic detection and topic-dependent language modeling is proposed for multi-domain spoken language systems. In the proposed approach, the inferred topic is automatically detected from the user's utterance, and speech recognition is then performed by applying an appropriate topic-dependent language model. This approach enables users to freely switch between domains while maintaining high recognition accuracy. As topic detection is performed on a single utterance, detection errors may occur and propagate through the system. To improve robustness, a hierarchical back-off mechanism is introduced where detailed topic models are applied when topic detection is confident and wider models that cover multiple topics are applied in cases of uncertainty. The performance of the proposed architecture is evaluated when combined with two topic detection methods: unigram likelihood and SVMs (Support Vector Machines). On the ATR Basic Travel Expression Corpus, both methods provide a significant reduction in WER (9.7% and 10.3%, respectively) compared to a single language model system. Furthermore, recognition accuracy is comparable to performing decoding with all topic-dependent models in parallel, while the required computational cost is much reduced.
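
    The back-off logic might look roughly like this sketch, with made-up topic names, hierarchy, and confidence threshold:

```python
# Sketch of hierarchical back-off: use the detailed topic-dependent LM
# only when topic detection is confident; otherwise back off to a wider
# model covering several topics.
def select_language_model(topic_scores, threshold=0.7,
                          parent={"hotel": "accommodation",
                                  "ryokan": "accommodation",
                                  "flight": "transport",
                                  "train": "transport"}):
    topic, conf = max(topic_scores.items(), key=lambda kv: kv[1])
    if conf >= threshold:
        return topic                      # detailed topic-dependent LM
    return parent.get(topic, "general")   # wider back-off LM

print(select_language_model({"hotel": 0.9, "flight": 0.1}))   # -> hotel
print(select_language_model({"hotel": 0.4, "ryokan": 0.35}))  # -> accommodation
```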

  • Example-Based Query Generation for Spontaneous Speech

    Hiroya MURAO  Nobuo KAWAGUCHI  Shigeki MATSUBARA  Yasuyoshi INAGAKI  

     
    LETTER-Speech and Hearing
    Vol: E88-D No:2  Page(s): 324-329

    This paper proposes a new method of example-based query generation for spontaneous speech. Modeling the information flow of human dialogues, the authors have designed a system that allows users to retrieve information while driving a car. The system searches a dialogue corpus for an example similar to the input speech and generates a query from that example. Experimental results for the prototype system show that 1) for transcribed text input, it produces the correct query in about 64% of cases and a partially correct query in about 88%, and 2) unlike a conventional keyword extraction method, it can create correct queries for utterances that contain no keywords.
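
    A minimal sketch of the example-based idea, using bag-of-words TF-IDF cosine similarity as a stand-in for the system's actual similarity measure, with a two-example toy corpus:

```python
# Find the corpus example most similar to the input utterance and reuse
# its query. Example utterances and query frames are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

examples = [
    ("are there any italian restaurants nearby",
     {"action": "search_shop", "genre": "italian"}),
    ("how is the traffic on route one",
     {"action": "traffic_info", "road": "route one"}),
]
vec = TfidfVectorizer().fit([u for u, _ in examples])

def generate_query(utterance):
    sims = cosine_similarity(vec.transform([utterance]),
                             vec.transform([u for u, _ in examples]))[0]
    return examples[sims.argmax()][1]   # reuse the best example's query

print(generate_query("any good italian places around here"))
```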

  • A Spoken Dialogue Interface for TV Operations Based on Data Collected by Using WOZ Method

    Jun GOTO  Kazuteru KOMINE  Masaru MIYAZAKI  Yeun-Bae KIM  Noriyoshi URATANI  

     
    PAPER
    Vol: E87-D No:6  Page(s): 1397-1404

    The development of multi-channel digital broadcasting has generated demand not only for new services but also for smart, highly functional capabilities in all broadcast-related devices, especially TV receivers on the viewer's side. With the aim of achieving a friendly interface that anybody can use with ease, we built a prototype spoken dialogue interface for TV operation based on data collected by using the Wizard of Oz (WOZ) method. At the current stage of our research, we are using this system to investigate the usefulness and problem areas of an interactive voice interface for TV operation.

  • Conversation Robot Participating in Group Conversation

    Yosuke MATSUSAKA  Tsuyoshi TOJO  Tetsunori KOBAYASHI  

     
    INVITED PAPER
    Vol: E86-D No:1  Page(s): 26-36

    We developed a conversation system that can participate in a group conversation, a form of conversation in which three or more participants talk to each other about a topic on an equal footing. Conventional conversation systems have been designed under the assumption that the system talks with only one person. Group conversation differs from this in the following points: the system must understand the conversational situation, such as who is speaking, to whom they are speaking, and to whom the other participants are paying attention; and the system itself must try to affect the situation appropriately. In this study, we realized the function of recognizing the conversational situation by combining image processing and acoustic processing, and the function of working on the conversational situation by utilizing the facial and body actions of the robot. A robot that can join in group conversation was thus realized.

  • A Distributed Agent Architecture for Intelligent Multi-Domain Spoken Dialogue Systems

    Bor-Shen LIN  Hsin-Min WANG  Lin-Shan LEE  

     
    PAPER-Speech and Hearing
    Vol: E84-D No:9  Page(s): 1217-1230

    Multi-domain spoken dialogue systems with a high degree of intelligence and domain extensibility have long been desired but are difficult to achieve. When the user freely moves among different topics during the dialogue, it is very difficult for the system to control the switching of topics and domains while keeping the dialogue consistent, and to decide when and how to take the initiative. This paper presents a distributed agent architecture for multi-domain spoken dialogue systems with high domain extensibility and intelligence. Under this architecture, different spoken dialogue agents (SDAs) handling different domains can be developed independently and then cooperate smoothly with one another to achieve the user's multiple goals, while a user interface agent (UIA) accesses the correct spoken dialogue agent through a domain switching protocol and carries over the dialogue state and history so that knowledge is processed coherently across domains.
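
    A toy sketch of the routing idea; the scoring and hand-over below are invented placeholders for the paper's domain switching protocol:

```python
# A user interface agent (UIA) routes each utterance to the spoken
# dialogue agent (SDA) whose domain matches best, handing over shared
# dialogue state and history across domain switches.
class SDA:
    def __init__(self, domain, keywords):
        self.domain, self.keywords = domain, keywords
    def can_handle(self, utterance):
        return sum(k in utterance for k in self.keywords)
    def respond(self, utterance, shared_state):
        shared_state["history"].append((self.domain, utterance))
        return f"[{self.domain}] handling: {utterance}"

class UIA:
    def __init__(self, sdas):
        self.sdas = sdas
        self.state = {"history": []}   # carried across domain switches
    def handle(self, utterance):
        sda = max(self.sdas, key=lambda a: a.can_handle(utterance))
        return sda.respond(utterance, self.state)

uia = UIA([SDA("weather", ["rain", "sunny"]), SDA("ticket", ["book", "train"])])
print(uia.handle("will it rain tomorrow"))
print(uia.handle("book a train to taipei"))
```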