Keyword Search Result

[Keyword] dialogue(46hit)

1-20hit(46hit)

  • Intrinsic Representation Mining for Zero-Shot Slot Filling

    Sixia LI  Shogo OKADA  Jianwu DANG  

     
    PAPER-Natural Language Processing

      Publicized:
    2022/08/19
      Vol:
    E105-D No:11
      Page(s):
    1947-1956

    Zero-shot slot filling is a domain adaptation approach to handle unseen slots in new domains without training instances. Previous studies implemented zero-shot slot filling by predicting both slot entities and slot types. Because of the lack of knowledge about new domains, existing methods often fail to predict slot entities for new domains and cannot effectively predict unseen slot types even when slot entities are correctly identified. Moreover, for some seen slot types, those methods may suffer from the domain shift problem, because the unseen context in new domains may change the explanations of the slots. In this study, we propose intrinsic representations to alleviate the domain shift problems above. Specifically, we propose a multi-relation-based representation to capture both the general and specific characteristics of slot entities, and an ontology-based representation to provide complementary knowledge on the relationships between slots and values across domains, for handling both unseen slot types and unseen contexts. We constructed a two-step pipeline model using the proposed representations to solve the domain shift problem. Experimental results in terms of the F1 score on three large datasets (Snips, SGD, and MultiWOZ 2.3) showed that our model outperformed state-of-the-art baselines by 29.62, 10.38, and 3.89, respectively. The detailed analysis with the average slot F1 score showed that our model improved the prediction by 25.82 for unseen slot types and by 10.51 for seen slot types. The results demonstrated that the proposed intrinsic representations can effectively alleviate the domain shift problem for both unseen slot types and seen slot types with unseen contexts.

  • A Hierarchical Memory Model for Task-Oriented Dialogue System

    Ya ZENG  Li WAN  Qiuhong LUO  Mao CHEN  

     
    PAPER-Natural Language Processing

      Publicized:
    2022/05/16
      Vol:
    E105-D No:8
      Page(s):
    1481-1489

    Traditional pipeline methods for task-oriented dialogue systems are designed individually and expensively. Existing memory augmented end-to-end methods directly map the inputs to outputs and achieve promising results. However, most existing end-to-end solutions store the dialogue history and knowledge base (KB) information in the same memory and represent KB information in the form of KB triples. This makes the memory reader's reasoning over the memory more difficult, so the system struggles to retrieve the correct information from the memory to generate a response. Some methods introduce many manual annotations to strengthen reasoning. To reduce the use of manual annotations while strengthening reasoning, we propose a hierarchical memory model (HM2Seq) for task-oriented systems. HM2Seq uses a hierarchical memory to separate the dialogue history and KB information into two memories and stores the KB in KB rows; we then use a memory-row pointer combined with an entity decoder to perform hierarchical reasoning over memory. The experimental results on two publicly available task-oriented dialogue datasets confirm our hypothesis and show the outstanding performance of our HM2Seq by outperforming the baselines.

  • Toward Generating Robot-Robot Natural Counseling Dialogue

    Tomoya HASHIGUCHI  Takehiro YAMAMOTO  Sumio FUJITA  Hiroaki OHSHIMA  

     
    PAPER

      Publicized:
    2022/02/07
      Vol:
    E105-D No:5
      Page(s):
    928-935

    In this study, we generate dialogue contents in which two systems discuss their distress with each other. The user inputs sentences that include environment and feelings of distress. The system generates the dialogue content from the input. In this study, we created dialogue data about distress in order to generate them using deep learning. The generative model is obtained by fine-tuning a pre-trained GPT model using the TransferTransfo method. The contribution of this study is the creation of a conversational dataset using publicly available data. This study used EmpatheticDialogues, an existing empathetic dialogue dataset, and Reddit r/offmychest, a public dataset of distress. The models fine-tuned with each dataset were evaluated both automatically (such as by the BLEU and ROUGE scores) and manually (such as by relevance and empathy) by human assessors.

  • Hierarchical Argumentation Structure for Persuasive Argumentative Dialogue Generation

    Kazuki SAKAI  Ryuichiro HIGASHINAKA  Yuichiro YOSHIKAWA  Hiroshi ISHIGURO  Junji TOMITA  

     
    PAPER-Natural Language Processing

      Publicized:
    2019/10/30
      Vol:
    E103-D No:2
      Page(s):
    424-434

    Argumentation is a process of reaching a consensus through premises and rebuttals. If an artificial dialogue system can perform argumentation, it can improve users' decisions and ability to negotiate with others. Previously, researchers have studied argumentative dialogue systems through a structured database regarding argumentation structure and evaluated the logical consistency of the dialogue. However, these systems could not change their responses based on the user's agreement or disagreement with their last utterance. Furthermore, the persuasiveness of the generated dialogue has not been evaluated. In this study, a method is proposed to generate persuasive arguments through a hierarchical argumentation structure that considers human agreement and disagreement. Persuasiveness is evaluated through a crowdsourcing platform wherein participants' written impressions of shown dialogue texts are scored via a third-person Likert scale evaluation. The proposed method was compared to the baseline method wherein argument response texts were generated without consideration of the user's agreement or disagreement. Experiment results suggest that the proposed method can generate a more persuasive dialogue than the baseline method. Further analysis implied that perceived persuasiveness was induced by evaluations of the behavior of the dialogue system, which was inherent in the hierarchical argumentation structure.

  • Effectiveness of Speech Mode Adaptation for Improving Dialogue Speech Synthesis

    Kazuki KAYA  Hiroki MORI  

     
    LETTER-Speech and Hearing

      Publicized:
    2019/06/13
      Vol:
    E102-D No:10
      Page(s):
    2064-2066

    The effectiveness of model adaptation in dialogue speech synthesis is explored. The proposed adaptation method is based on a conversion from a base model learned with a large dataset into a target, dialogue-style speech model. The proposed method is shown to improve the intelligibility of synthesized dialogue speech, while maintaining the speaking style of dialogue.

  • Utterance Intent Classification for Spoken Dialogue System with Data-Driven Untying of Recursive Autoencoders Open Access

    Tsuneo KATO  Atsushi NAGAI  Naoki NODA  Jianming WU  Seiichi YAMAMOTO  

     
    PAPER-Natural Language Processing

      Publicized:
    2019/03/04
      Vol:
    E102-D No:6
      Page(s):
    1197-1205

    Data-driven untying of a recursive autoencoder (RAE) is proposed for utterance intent classification for spoken dialogue systems. Although an RAE expresses a nonlinear operation on two neighboring child nodes in a parse tree in the application of spoken language understanding (SLU) of spoken dialogue systems, the nonlinear operation is considered to be intrinsically different depending on the types of child nodes. To reduce the gap between the single nonlinear operation of an RAE and intrinsically different operations depending on the node types, a data-driven untying of autoencoders using part-of-speech (PoS) tags at leaf nodes is proposed. With the proposed method, the experimental results on two corpora, the ATIS English data set and a Japanese data set of a smartphone-based spoken dialogue system, showed improved accuracies compared to the tied RAE, as well as a reasonable difference in untying between the two languages.

  • Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances

    Kazunori KOMATANI  Naoki HOTTA  Satoshi SATO  Mikio NAKANO  

     
    PAPER-Speech and Hearing

      Publicized:
    2015/07/24
      Vol:
    E98-D No:11
      Page(s):
    1923-1931

    Appropriate turn-taking is as important in spoken dialogue systems as generating correct responses. Especially if the dialogue features quick responses, a user utterance is often incorrectly segmented by voice activity detection (VAD) because of short pauses within it. Incorrectly segmented utterances cause problems both in the automatic speech recognition (ASR) results and turn-taking: i.e., an incorrect VAD result leads to ASR errors and causes the system to start responding though the user is still speaking. We develop a method that performs a posteriori restoration for incorrectly segmented utterances and implement it as a plug-in for the MMDAgent open-source software. A crucial part of the method is to classify whether the restoration is required or not. We cast it as a binary classification problem of detecting originally single utterances from pairs of utterance fragments. Various features are used representing timing, prosody, and ASR result information. Experiments show that the proposed method outperformed a baseline with manually-selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant and domain-independent features were utterance intervals and results from the Gaussian mixture model (GMM).

  • Effects of Conversational Agents on Activation of Communication in Thought-Evoking Multi-Party Dialogues

    Kohji DOHSAKA  Ryota ASAI  Ryuichiro HIGASHINAKA  Yasuhiro MINAMI  Eisaku MAEDA  

     
    PAPER-Natural Language Processing

      Vol:
    E97-D No:8
      Page(s):
    2147-2156

    This paper presents an experimental study that analyzes how conversational agents activate human communication in thought-evoking multi-party dialogues between multiple users and multiple agents. A thought-evoking dialogue is a kind of interaction in which agents act to provoke user thinking, and it has the potential to activate multi-party interactions. This paper focuses on quiz-style multi-party dialogues between two users and two agents as an example of thought-evoking multi-party dialogues. The experimental results revealed that the presence of a peer agent significantly improved user satisfaction and increased the number of user utterances in quiz-style multi-party dialogues. We also found that agents' empathic expressions significantly improved user satisfaction, improved user ratings of the peer agent, and increased the number of user utterances. Our findings should be useful for activating multi-party communications in various applications such as pedagogical agents and community facilitators.

  • Automatic Allocation of Training Data for Speech Understanding Based on Multiple Model Combinations

    Kazunori KOMATANI  Mikio NAKANO  Masaki KATSUMARU  Kotaro FUNAKOSHI  Tetsuya OGATA  Hiroshi G. OKUNO  

     
    PAPER-Speech and Hearing

      Vol:
    E95-D No:9
      Page(s):
    2298-2307

    The optimal way to build speech understanding modules depends on the amount of training data available. When only a small amount of training data is available, effective allocation of the data is crucial to preventing overfitting of statistical methods. We have developed a method for allocating a limited amount of training data in accordance with the amount available. Our method exploits rule-based methods when the amount of data is small; these are included in our speech understanding framework based on multiple model combinations, i.e., multiple automatic speech recognition (ASR) modules and multiple language understanding (LU) modules. It then allocates training data preferentially to the modules that dominate the overall performance of speech understanding. Experimental evaluation showed that our allocation method consistently outperforms baseline methods that use a single ASR module and a single LU module while the amount of training data increases.

  • Selecting Help Messages by Using Robust Grammar Verification for Handling Out-of-Grammar Utterances in Spoken Dialogue Systems

    Kazunori KOMATANI  Yuichiro FUKUBAYASHI  Satoshi IKEDA  Tetsuya OGATA  Hiroshi G. OKUNO  

     
    PAPER-Speech and Hearing

      Vol:
    E93-D No:12
      Page(s):
    3359-3367

    We address the issue of out-of-grammar (OOG) utterances in spoken dialogue systems by generating help messages. Help message generation for OOG utterances is a challenge because language understanding based on automatic speech recognition (ASR) of OOG utterances is usually erroneous; important words are often misrecognized or missing from such utterances. Our grammar verification method uses a weighted finite-state transducer to accurately identify the grammar rule that the user intended to use for the utterance, even if important words are missing from the ASR results. We then use a ranking algorithm, RankBoost, to rank help message candidates in order of likely usefulness. Its features include the grammar verification results and the utterance history representing the user's experience.

  • A Single-Chip Speech Dialogue Module and Its Evaluation on a Personal Robot, PaPeRo-Mini

    Miki SATO  Toru IWASAWA  Akihiko SUGIYAMA  Toshihiro NISHIZAWA  Yosuke TAKANO  

     
    PAPER-Digital Signal Processing

      Vol:
    E93-A No:1
      Page(s):
    261-271

    This paper presents a single-chip speech dialogue module and its evaluation on a personal robot. This module is implemented on an application processor that was developed primarily for mobile phones to provide a compact size, low power consumption, and low cost. It performs speech recognition with preprocessing functions such as direction-of-arrival (DOA) estimation, noise cancellation, beamforming with an array of microphones, and echo cancellation. The module is also equipped with text-to-speech (TTS) conversion. Evaluation results obtained on a new personal robot, PaPeRo-mini, which is a scaled-down version of PaPeRo, demonstrate an 85% correct rate in DOA estimation, and as much as 54% and 30% higher speech recognition rates in noisy environments and during robot utterances, respectively. These results are shown to be comparable to those obtained by PaPeRo.

  • Activating Humans with Humor -- A Dialogue System That Users Want to Interact with

    Pawel DYBALA  Michal PTASZYNSKI  Rafal RZEPKA  Kenji ARAKI  

     
    PAPER-Spoken Dialogue System

      Vol:
    E92-D No:12
      Page(s):
    2394-2401

    The topic of Human Computer Interaction (HCI) has been gathering more and more scientific attention of late. A very important but often undervalued area in this field is human engagement, that is, a person's commitment to take part in and continue the interaction. In this paper we describe work on a humor-equipped casual conversational system (chatterbot) and investigate the effect of humor on a user's engagement in the conversation. A group of users was made to converse with two systems: one with and one without humor. The chat logs were then analyzed using an emotive analysis system to check user reactions and attitudes towards each system. Results were projected on Russell's two-dimensional emotiveness space to evaluate the positivity/negativity and activation/deactivation of these emotions. This analysis indicated that emotions elicited by the humor-equipped system were more positively active and less negatively active than those elicited by the system without humor. The implications of the results and the relation between them and user engagement in the conversation are discussed. We also propose a distinction between positive and negative engagement.

  • Ranking Multiple Dialogue States by Corpus Statistics to Improve Discourse Understanding in Spoken Dialogue Systems

    Ryuichiro HIGASHINAKA  Mikio NAKANO  

     
    PAPER-Natural Language Processing

      Vol:
    E92-D No:9
      Page(s):
    1771-1782

    This paper discusses the discourse understanding process in spoken dialogue systems. This process enables a system to understand user utterances from the context of a dialogue. Ambiguity in user utterances caused by multiple speech recognition hypotheses and parsing results sometimes makes it difficult for a system to decide on a single interpretation of a user intention. As a solution, the idea of retaining possible interpretations as multiple dialogue states and resolving the ambiguity using succeeding user utterances has been proposed. Although this approach has proven to improve discourse understanding accuracy, carefully created hand-crafted rules are necessary in order to accurately rank the dialogue states. This paper proposes automatically ranking multiple dialogue states using statistical information obtained from dialogue corpora. The experimental results in the train ticket reservation and weather information service domains show that the statistical information can significantly improve the ranking accuracy of dialogue states as well as the slot accuracy and the concept error rate of the top-ranked dialogue states.

  • Facial Expression Generation from Speaker's Emotional States in Daily Conversation

    Hiroki MORI  Koh OHSHIMA  

     
    PAPER-Media Communication

      Vol:
    E91-D No:6
      Page(s):
    1628-1633

    A framework for generating facial expressions from emotional states in daily conversation is described. It provides a mapping between emotional states and facial expressions, where the former is represented by vectors with psychologically-defined abstract dimensions, and the latter is coded by the Facial Action Coding System. In order to obtain the mapping, parallel data with rated emotional states and facial expressions were collected for utterances of a female speaker, and a neural network was trained with the data. The effectiveness of the proposed method is verified by a subjective evaluation test. As a result, the Mean Opinion Score with respect to the suitability of generated facial expression was 3.86 for the speaker, which was close to that of hand-made facial expressions.

  • An Interactive Open-Vocabulary Chinese Name Input System Using Syllable Spelling and Character Description Recognition Modules for Error Correction

    Nick Jui Chang WANG  

     
    PAPER-Speech and Hearing

      Vol:
    E90-D No:11
      Page(s):
    1796-1804

    The open-vocabulary name recognition technique is one of the most challenging tasks in the application of automatic Chinese speech recognition technology. It can be used as the free name input method for telephony speech applications and automatic directory assistance systems. A Chinese name usually has two to three characters, each of which is pronounced as a single tonal syllable. Obviously, it is very difficult to recognize a three-syllable word from millions to billions of possible candidates. A novel interactive automatic-speech-recognition system is proposed to resolve this highly challenging task. This system was built as an open-vocabulary Chinese name recognition system using character-based approaches. Two important character-input speech-recognition modules were designed as backoff approaches in this system to complete the name input or to correct any misrecognized characters. Finite-state networks were compiled from regular grammars of syllable spellings and character descriptions for these two speech recognition modules. The possible candidate names number more than five billion. This system has been tested publicly and proved to be a robust way to interact with the speaker. An 86.7% name recognition success rate was achieved by the interactive open-vocabulary Chinese name input system.

  • Interface for Barge-in Free Spoken Dialogue System Using Nullspace Based Sound Field Control and Beamforming

    Shigeki MIYABE  Hiroshi SARUWATARI  Kiyohiro SHIKANO  Yosuke TATEKURA  

     
    PAPER-Speech/Audio Processing

      Vol:
    E89-A No:3
      Page(s):
    716-726

    In this paper, we describe a new interface for a barge-in free spoken dialogue system combining multichannel sound field control and beamforming, in which the response sound from the system can be canceled out at the microphone points. The conventional method inhibits a user from moving because the system forces the user to stay at a fixed position where the response sound is reproduced. However, since the proposed method does not set control points for the reproduction of the response sound to the user, the user is allowed to move. Furthermore, the relaxation of strict reproduction for the response sound enables us to design a stable system with fewer loudspeakers than those used in the conventional method. The proposed method shows a higher performance in speech recognition experiments.

  • Proposal of a Multimodal Interaction Description Language for Various Interactive Agents

    Masahiro ARAKI  Akiko KOUZAWA  Kenji TACHIBANA  

     
    PAPER

      Vol:
    E88-D No:11
      Page(s):
    2469-2476

    In this paper, we propose a new multimodal interaction description language, MIML (Multimodal Interaction Markup Language), which defines dialogue patterns between humans and various types of interactive agents. The distinguishing feature of this language is its three-layered description of agent-based interactive systems. The high-level description is a task definition that can easily construct typical agent-based interactive task control information. The middle-level description is an interaction description that defines the agent's behavior and the user's input at the granularity of a dialogue segment. The low-level description is a platform-dependent description that can override the pre-defined function in the interaction description. The connection between the task level and the interaction level is realized by generation of interaction description templates from the task-level description. The connection between the interaction level and the platform level is realized by a binding mechanism of XML. A comparison with other languages shows that MIML has advantages in high-level interaction description, modality extensibility, and compatibility with standardized technologies.

  • Interface for Barge-in Free Spoken Dialogue System Combining Adaptive Sound Field Control and Microphone Array

    Tatsunori ASAI  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    LETTER-Speech and Hearing

      Vol:
    E88-A No:6
      Page(s):
    1613-1618

    This paper describes a new interface for a barge-in free spoken dialogue system combining an adaptive sound field control and a microphone array. In order to actualize robustness against the change of transfer functions due to the various interferences, the barge-in free spoken dialogue system which uses sound field control and a microphone array has been proposed by one of the authors. However, this method cannot follow the change of transfer functions because the method consists of fixed filters. To solve the problem, we introduce a new adaptive sound field control that follows the change of transfer functions.

  • Dialogue Speech Recognition by Combining Hierarchical Topic Classification and Language Model Switching

    Ian R. LANE  Tatsuya KAWAHARA  Tomoko MATSUI  Satoshi NAKAMURA  

     
    PAPER-Spoken Language Systems

      Vol:
    E88-D No:3
      Page(s):
    446-454

    An efficient, scalable speech recognition architecture combining topic detection and topic-dependent language modeling is proposed for multi-domain spoken language systems. In the proposed approach, the inferred topic is automatically detected from the user's utterance, and speech recognition is then performed by applying an appropriate topic-dependent language model. This approach enables users to freely switch between domains while maintaining high recognition accuracy. As topic detection is performed on a single utterance, detection errors may occur and propagate through the system. To improve robustness, a hierarchical back-off mechanism is introduced where detailed topic models are applied when topic detection is confident and wider models that cover multiple topics are applied in cases of uncertainty. The performance of the proposed architecture is evaluated when combined with two topic detection methods: unigram likelihood and SVMs (Support Vector Machines). On the ATR Basic Travel Expression Corpus, both methods provide a significant reduction in WER (9.7% and 10.3%, respectively) compared to a single language model system. Furthermore, recognition accuracy is comparable to performing decoding with all topic-dependent models in parallel, while the required computational cost is much reduced.
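The hierarchical back-off described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the topic hierarchy `PARENT`, the function name `select_model`, the confidence dictionary, and the threshold value are all hypothetical and chosen only for illustration. The idea shown is that a detailed topic model is used when detection is confident, and a wider model covering several topics is used otherwise.

```python
from typing import Dict, Optional

# Hypothetical topic hierarchy: child -> parent (None marks the root model
# that covers every topic). These names are illustrative assumptions.
PARENT: Dict[str, Optional[str]] = {
    "all": None,
    "travel": "all",
    "hotel": "travel",
    "flight": "travel",
}


def select_model(topic: str, confidence: Dict[str, float],
                 threshold: float = 0.7) -> str:
    """Return the name of the language model to decode with.

    Starts at the detected (most detailed) topic and backs off to the
    parent node while the confidence for the current node is below the
    threshold. The root is always accepted as the last resort.
    """
    node = topic
    while PARENT[node] is not None and confidence.get(node, 0.0) < threshold:
        node = PARENT[node]
    return node


# Confident detection keeps the detailed model; uncertainty widens it.
print(select_model("hotel", {"hotel": 0.9}))
print(select_model("hotel", {"hotel": 0.4, "travel": 0.8}))
```

How the confidence of a wider node would be computed (for example, from the scores of its child topics) is left open here, since the abstract does not specify it.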

  • Example-Based Query Generation for Spontaneous Speech

    Hiroya MURAO  Nobuo KAWAGUCHI  Shigeki MATSUBARA  Yasuyoshi INAGAKI  

     
    LETTER-Speech and Hearing

      Vol:
    E88-D No:2
      Page(s):
    324-329

    This paper proposes a new method of example-based query generation for spontaneous speech. Along with modeling the information flows of human dialogues, the authors have designed a system that allows users to retrieve information while driving a car. The system refers to the dialogue corpus to find an example that is similar to input speech, and it generates a query from the example. The experimental results for the prototype system show that 1) for transcribed text input, it provides the correct query in about 64% of cases and the partially correct query in about 88%, and 2) it has the ability to create correct queries for utterances that do not include keywords, compared with the conventional keyword extraction method.
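The example-based idea in the abstract above (find the most similar example in a dialogue corpus, then derive a query from it) can be sketched roughly as follows. This is an assumption-laden illustration, not the paper's method: the toy `CORPUS`, the query strings, the Jaccard word-overlap similarity, and the function names are all hypothetical, and the sketch simply reuses the stored query of the closest example rather than adapting it.

```python
from typing import List, Set, Tuple

# Hypothetical dialogue corpus of (utterance, query) pairs; both the
# utterances and the query format are made up for illustration.
CORPUS: List[Tuple[str, str]] = [
    ("is there a gas station near here", "search category=gas_station"),
    ("i am hungry find me some noodles", "search category=restaurant food=noodles"),
]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity; an assumed stand-in for the paper's measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def generate_query(utterance: str) -> str:
    """Return the query attached to the most similar corpus example."""
    words = set(utterance.lower().split())
    _, query = max(CORPUS, key=lambda ex: jaccard(words, set(ex[0].split())))
    return query


print(generate_query("find a gas station"))
```

Because the match is made against whole example utterances rather than a keyword list, an input can still be mapped to a query when it shares only ordinary words with a stored example.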
