
Keyword Search Result

[Keyword] dialogue system (19 hits)

1-19 of 19 hits
  • Intrinsic Representation Mining for Zero-Shot Slot Filling

    Sixia LI  Shogo OKADA  Jianwu DANG

    PAPER-Natural Language Processing
    Publicized: 2022/08/19  Vol: E105-D No:11  Page(s): 1947-1956

    Zero-shot slot filling is a domain adaptation approach for handling unseen slots in new domains without training instances. Previous studies implemented zero-shot slot filling by predicting both slot entities and slot types. Because of the lack of knowledge about new domains, existing methods often fail to predict slot entities in new domains and cannot effectively predict unseen slot types even when slot entities are correctly identified. Moreover, for some seen slot types, these methods may suffer from the domain shift problem, because unseen contexts in new domains can change how the slots are expressed. In this study, we propose intrinsic representations to alleviate these domain shift problems. Specifically, we propose a multi-relation-based representation that captures both the general and specific characteristics of slot entities, and an ontology-based representation that provides complementary knowledge on the relationships between slots and values across domains, for handling both unseen slot types and unseen contexts. We constructed a two-step pipeline model using the proposed representations to address the domain shift problem. Experimental results in terms of the F1 score on three large datasets (Snips, SGD, and MultiWOZ 2.3) showed that our model outperformed state-of-the-art baselines by 29.62, 10.38, and 3.89 points, respectively. A detailed analysis with the average slot F1 score showed that our model improved prediction by 25.82 points for unseen slot types and by 10.51 points for seen slot types. These results demonstrate that the proposed intrinsic representations effectively alleviate the domain shift problem both for unseen slot types and for seen slot types with unseen contexts.
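
    The abstract outlines a two-step pipeline: first detect slot-entity spans, then assign each span a possibly unseen slot type by comparing entity and slot representations. Below is a minimal sketch of the second, representation-matching step only; it is not the authors' code, and the encoder model, slot names, and descriptions are illustrative assumptions.

        # Sketch of matching an extracted span to an unseen slot type via
        # description similarity (a simplification of the ontology-based idea).
        from sentence_transformers import SentenceTransformer, util

        encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

        slot_descriptions = {                  # hypothetical slot ontology
            "restaurant_name": "the name of a restaurant",
            "city": "the city where something is located",
            "time": "a time of day for a booking",
        }
        names = list(slot_descriptions)
        desc_emb = encoder.encode(list(slot_descriptions.values()),
                                  convert_to_tensor=True)

        def classify_span(span, context):
            """Assign the slot type whose description best matches the span."""
            span_emb = encoder.encode(f"{span} in the context: {context}",
                                      convert_to_tensor=True)
            return names[int(util.cos_sim(span_emb, desc_emb)[0].argmax())]

        print(classify_span("7 pm", "book a table for two at 7 pm"))  # time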

  • Toward Generating Robot-Robot Natural Counseling Dialogue

    Tomoya HASHIGUCHI  Takehiro YAMAMOTO  Sumio FUJITA  Hiroaki OHSHIMA

    PAPER
    Publicized: 2022/02/07  Vol: E105-D No:5  Page(s): 928-935

    In this study, we generate dialogue in which two systems discuss their distress with each other. The user inputs sentences describing the circumstances and feelings behind the distress, and the system generates the dialogue from this input. To generate such dialogues with deep learning, we created dialogue data about distress. The generative model fine-tunes a pre-trained GPT model using the TransferTransfo method. The contribution of this study is the creation of a conversational dataset from publicly available data: we used EmpatheticDialogues, an existing empathetic dialogue dataset, and Reddit r/offmychest, a public source of posts about distress. The models fine-tuned on each dataset were evaluated both automatically (e.g., BLEU and ROUGE scores) and manually by human assessors (e.g., relevance and empathy).
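
    As a rough illustration of the training setup, the sketch below fine-tunes a pre-trained GPT-style model on dialogue text with Hugging Face Transformers. TransferTransfo additionally uses a next-utterance classification head and speaker tokens, which are omitted here; the file name and hyperparameters are assumptions, not the paper's.

        # Minimal causal-LM fine-tuning sketch (language-modeling loss only).
        from datasets import load_dataset
        from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                  DataCollatorForLanguageModeling, Trainer,
                                  TrainingArguments)

        tok = AutoTokenizer.from_pretrained("gpt2")
        tok.pad_token = tok.eos_token
        model = AutoModelForCausalLM.from_pretrained("gpt2")

        # hypothetical file: one dialogue exchange per line
        ds = load_dataset("text", data_files={"train": "distress_dialogues.txt"})
        ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=256),
                    batched=True, remove_columns=["text"])

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="out", num_train_epochs=3,
                                   per_device_train_batch_size=4),
            train_dataset=ds["train"],
            data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
        )
        trainer.train()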

  • Utterance Intent Classification for Spoken Dialogue System with Data-Driven Untying of Recursive Autoencoders Open Access

    Tsuneo KATO  Atsushi NAGAI  Naoki NODA  Jianming WU  Seiichi YAMAMOTO

    PAPER-Natural Language Processing
    Publicized: 2019/03/04  Vol: E102-D No:6  Page(s): 1197-1205

    Data-driven untying of a recursive autoencoder (RAE) is proposed for utterance intent classification in spoken dialogue systems. In spoken language understanding (SLU) for spoken dialogue systems, an RAE applies a single nonlinear operation to every pair of neighboring child nodes in a parse tree, even though the appropriate operation is intrinsically different depending on the types of the child nodes. To reduce the gap between the single nonlinear operation of a tied RAE and the intrinsically different operations required by different node types, we propose data-driven untying of autoencoders based on part-of-speech (PoS) tags at leaf nodes. Experimental results on two corpora, the English ATIS data set and a Japanese data set from a smartphone-based spoken dialogue system, showed improved accuracy compared with the tied RAE, as well as reasonable differences in the untying between the two languages.
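
    To make the untying concrete, here is a small PyTorch sketch (not the paper's code) of an RAE cell whose encoder weights are selected by the types of the two child nodes; the type inventory and composition rule are illustrative.

        import torch
        import torch.nn as nn

        class UntiedRAECell(nn.Module):
            """RAE composition with one encoder per (left, right) child-type pair."""
            def __init__(self, dim, node_types):
                super().__init__()
                self.enc = nn.ModuleDict({
                    f"{l}|{r}": nn.Linear(2 * dim, dim)
                    for l in node_types for r in node_types
                })

            def forward(self, left, right, left_type, right_type):
                h = torch.cat([left, right], dim=-1)
                return torch.tanh(self.enc[f"{left_type}|{right_type}"](h))

        cell = UntiedRAECell(16, ["NOUN", "VERB", "PHRASE"])
        parent = cell(torch.randn(16), torch.randn(16), "NOUN", "VERB")
        print(parent.shape)  # torch.Size([16]): composed phrase vector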

  • Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances

    Kazunori KOMATANI  Naoki HOTTA  Satoshi SATO  Mikio NAKANO

    PAPER-Speech and Hearing
    Publicized: 2015/07/24  Vol: E98-D No:11  Page(s): 1923-1931

    Appropriate turn-taking is as important in spoken dialogue systems as generating correct responses. Especially when a dialogue features quick responses, voice activity detection (VAD) often segments a user utterance incorrectly at short pauses within it. Incorrectly segmented utterances cause problems in both the automatic speech recognition (ASR) results and turn-taking: an incorrect VAD result leads to ASR errors and causes the system to start responding even though the user is still speaking. We develop a method that performs a posteriori restoration of incorrectly segmented utterances and implement it as a plug-in for the MMDAgent open-source software. A crucial part of the method is classifying whether restoration is required or not. We cast this as a binary classification problem of detecting originally single utterances from pairs of utterance fragments, using features that represent timing, prosody, and ASR result information. Experiments show that the proposed method outperformed a baseline with manually selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant, domain-independent features were utterance intervals and results from a Gaussian mixture model (GMM).
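
    The core decision, detecting whether two fragments were originally one utterance, can be sketched as a binary classifier over timing, prosody, and ASR features. The feature set, values, and model below are illustrative assumptions, not the paper's exact setup.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # per fragment pair: [pause between fragments (s), F0 slope at the
        # boundary, ASR confidence of fragment 1, ASR confidence of fragment 2]
        X = np.array([[0.12, -0.4, 0.55, 0.48],   # short pause, low confidence
                      [1.50,  0.1, 0.92, 0.95],   # long pause, confident
                      [0.08, -0.2, 0.60, 0.51],
                      [2.10,  0.3, 0.90, 0.93]])
        y = np.array([1, 0, 1, 0])                # 1 = restore as one utterance

        clf = LogisticRegression().fit(X, y)

        def should_restore(pause, f0_slope, conf1, conf2):
            return bool(clf.predict([[pause, f0_slope, conf1, conf2]])[0])

        print(should_restore(0.10, -0.3, 0.58, 0.50))  # likely True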

  • Effects of Conversational Agents on Activation of Communication in Thought-Evoking Multi-Party Dialogues

    Kohji DOHSAKA  Ryota ASAI  Ryuichiro HIGASHINAKA  Yasuhiro MINAMI  Eisaku MAEDA

    PAPER-Natural Language Processing
    Vol: E97-D No:8  Page(s): 2147-2156

    This paper presents an experimental study that analyzes how conversational agents activate human communication in thought-evoking multi-party dialogues between multiple users and multiple agents. A thought-evoking dialogue is a kind of interaction in which agents act to provoke user thinking, and it has the potential to activate multi-party interactions. This paper focuses on quiz-style multi-party dialogues between two users and two agents as an example of thought-evoking multi-party dialogues. The experimental results revealed that the presence of a peer agent significantly improved user satisfaction and increased the number of user utterances in quiz-style multi-party dialogues. We also found that the agents' empathic expressions significantly improved user satisfaction, improved user ratings of the peer agent, and increased the number of user utterances. Our findings should be useful for activating multi-party communication in various applications such as pedagogical agents and community facilitators.

  • Automatic Allocation of Training Data for Speech Understanding Based on Multiple Model Combinations

    Kazunori KOMATANI  Mikio NAKANO  Masaki KATSUMARU  Kotaro FUNAKOSHI  Tetsuya OGATA  Hiroshi G. OKUNO

    PAPER-Speech and Hearing
    Vol: E95-D No:9  Page(s): 2298-2307

    The optimal way to build speech understanding modules depends on the amount of training data available. When only a small amount of training data is available, effective allocation of the data is crucial to preventing overfitting of statistical methods. We have developed a method for allocating a limited amount of training data in accordance with the amount available. Our speech understanding framework is based on multiple model combinations, i.e., multiple automatic speech recognition (ASR) modules and multiple language understanding (LU) modules, and includes rule-based methods that are effective when the amount of data is small. Our method allocates training data preferentially to the modules that dominate the overall performance of speech understanding. Experimental evaluation showed that our allocation method consistently outperforms baseline methods that use a single ASR module and a single LU module as the amount of training data increases.
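
    The allocation policy can be pictured as a greedy loop: give each new batch of data to whichever module currently limits end-to-end accuracy. The toy modules and accuracy model below are purely illustrative, not the paper's framework.

        class ToyModule:
            """Stand-in for an ASR or LU module with a measurable accuracy."""
            def __init__(self, acc):
                self.acc = acc
            def train_on(self, batch):
                self.acc = min(1.0, self.acc + 0.02 * len(batch))

        def allocate(batches, modules):
            for batch in batches:
                # the weakest module dominates overall performance, so train it
                weakest = min(modules, key=lambda name: modules[name].acc)
                modules[weakest].train_on(batch)
            return modules

        modules = {"ASR": ToyModule(0.70), "LU": ToyModule(0.55)}
        allocate([range(5)] * 10, modules)
        print({name: round(m.acc, 2) for name, m in modules.items()})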

  • Selecting Help Messages by Using Robust Grammar Verification for Handling Out-of-Grammar Utterances in Spoken Dialogue Systems

    Kazunori KOMATANI  Yuichiro FUKUBAYASHI  Satoshi IKEDA  Tetsuya OGATA  Hiroshi G. OKUNO

    PAPER-Speech and Hearing
    Vol: E93-D No:12  Page(s): 3359-3367

    We address the issue of out-of-grammar (OOG) utterances in spoken dialogue systems by generating help messages. Help message generation for OOG utterances is a challenge because language understanding based on automatic speech recognition (ASR) of OOG utterances is usually erroneous; important words are often misrecognized or missing from such utterances. Our grammar verification method uses a weighted finite-state transducer to accurately identify the grammar rule that the user intended to use for the utterance, even if important words are missing from the ASR results. We then use a ranking algorithm, RankBoost, to rank help message candidates in order of likely usefulness. Its features include the grammar verification results and the utterance history representing the user's experience.
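
    RankBoost itself is a boosting-based ranking algorithm; as a plain stand-in, the sketch below learns a pairwise ranker with logistic regression over feature differences. The candidates, features (verification score, history match), and preference pairs are all invented for illustration.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        cands = {"help_date": [0.9, 0.2],   # [verification score, history match]
                 "help_place": [0.4, 0.7],
                 "help_cmd": [0.1, 0.1]}

        # training pairs (better, worse) from annotated usefulness
        pairs = [("help_date", "help_cmd"), ("help_place", "help_cmd")]
        X = np.array([np.subtract(cands[a], cands[b]) for a, b in pairs] +
                     [np.subtract(cands[b], cands[a]) for a, b in pairs])
        y = np.array([1] * len(pairs) + [0] * len(pairs))
        ranker = LogisticRegression().fit(X, y)

        def score(name):  # linear pairwise model => can score single candidates
            return ranker.decision_function([cands[name]])[0]

        print(sorted(cands, key=score, reverse=True))  # most useful help first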

  • Ranking Multiple Dialogue States by Corpus Statistics to Improve Discourse Understanding in Spoken Dialogue Systems

    Ryuichiro HIGASHINAKA  Mikio NAKANO

    PAPER-Natural Language Processing
    Vol: E92-D No:9  Page(s): 1771-1782

    This paper discusses the discourse understanding process in spoken dialogue systems. This process enables a system to understand user utterances from the context of a dialogue. Ambiguity in user utterances caused by multiple speech recognition hypotheses and parsing results sometimes makes it difficult for a system to decide on a single interpretation of a user intention. As a solution, the idea of retaining possible interpretations as multiple dialogue states and resolving the ambiguity using succeeding user utterances has been proposed. Although this approach has proven to improve discourse understanding accuracy, carefully created hand-crafted rules are necessary in order to accurately rank the dialogue states. This paper proposes automatically ranking multiple dialogue states using statistical information obtained from dialogue corpora. The experimental results in the train ticket reservation and weather information service domains show that the statistical information can significantly improve the ranking accuracy of dialogue states as well as the slot accuracy and the concept error rate of the top-ranked dialogue states.
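
    The ranking idea can be sketched as scoring each candidate dialogue state with corpus-derived statistics instead of hand-crafted rules. The slot-value co-occurrence table below is invented for illustration.

        from math import log

        # corpus statistics for slot-value pairs, here made-up numbers
        cooc = {("depart", "Tokyo"): 0.12, ("arrive", "Tokyo"): 0.03,
                ("depart", "Kyoto"): 0.05, ("arrive", "Kyoto"): 0.10}

        def score(state):
            """Sum of log statistics over the slot-value pairs in a state."""
            return sum(log(cooc.get(pair, 1e-6)) for pair in state.items())

        hypotheses = [{"depart": "Tokyo", "arrive": "Kyoto"},
                      {"depart": "Kyoto", "arrive": "Tokyo"}]
        print(max(hypotheses, key=score))  # better-supported interpretation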

  • An Interactive Open-Vocabulary Chinese Name Input System Using Syllable Spelling and Character Description Recognition Modules for Error Correction

    Nick Jui Chang WANG

    PAPER-Speech and Hearing
    Vol: E90-D No:11  Page(s): 1796-1804

    The open-vocabulary name recognition technique is one of the most challenging tasks in the application of automatic Chinese speech recognition technology. It can be used as a free name input method for telephony speech applications and automatic directory assistance systems. A Chinese name usually has two to three characters, each of which is pronounced as a single tonal syllable; recognizing a three-syllable word from among millions to billions of possible candidates is therefore highly confusable. A novel interactive automatic speech recognition system is proposed to resolve this challenging task. The system was built as an open-vocabulary Chinese name recognition system using character-based approaches. Two character-input speech recognition modules were designed as backoff approaches in this system, to complete the name input or to correct any misrecognized characters. Finite-state networks for these two modules were compiled from regular grammars of syllable spellings and character descriptions. The possible candidate names number more than five billion. The system has been tested publicly and proved to be a robust way to interact with the speaker. An 86.7% name recognition success rate was achieved by the interactive open-vocabulary Chinese name input system.

  • Interface for Barge-in Free Spoken Dialogue System Using Nullspace Based Sound Field Control and Beamforming

    Shigeki MIYABE  Hiroshi SARUWATARI  Kiyohiro SHIKANO  Yosuke TATEKURA

    PAPER-Speech/Audio Processing
    Vol: E89-A No:3  Page(s): 716-726

    In this paper, we describe a new interface for a barge-in-free spoken dialogue system that combines multichannel sound field control and beamforming, in which the response sound from the system is canceled out at the microphone points. The conventional method inhibits the user from moving because it forces the user to stay at a fixed position where the response sound is reproduced. Since the proposed method does not set control points for reproducing the response sound at the user's position, the user is allowed to move. Furthermore, relaxing the strict reproduction requirement for the response sound enables us to design a stable system with fewer loudspeakers than the conventional method requires. The proposed method shows higher performance in speech recognition experiments.
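
    In a narrowband toy model, the cancellation idea reduces to choosing loudspeaker weights w with Gw = 0, where G is the microphone-by-loudspeaker transfer matrix, so the response sound vanishes at the microphones. A real system works with measured transfer functions per frequency; the matrix below is random, for illustration only.

        import numpy as np

        rng = np.random.default_rng(0)
        G = rng.standard_normal((2, 4))    # 2 microphones, 4 loudspeakers

        # right-singular vectors beyond rank(G) span the nullspace of G
        _, s, Vt = np.linalg.svd(G)
        w = Vt[len(s):][0]                 # any vector in the nullspace

        print(np.abs(G @ w).max())         # ~1e-16: canceled at the mics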

  • Interface for Barge-in Free Spoken Dialogue System Combining Adaptive Sound Field Control and Microphone Array

    Tatsunori ASAI  Hiroshi SARUWATARI  Kiyohiro SHIKANO

    LETTER-Speech and Hearing
    Vol: E88-A No:6  Page(s): 1613-1618

    This paper describes a new interface for a barge-in-free spoken dialogue system combining adaptive sound field control and a microphone array. To achieve robustness against changes in the transfer functions caused by various interferences, a barge-in-free spoken dialogue system using sound field control and a microphone array was previously proposed by one of the authors. However, that method cannot follow changes in the transfer functions because it consists of fixed filters. To solve this problem, we introduce a new adaptive sound field control that follows such changes.

  • Conversation Robot Participating in Group Conversation

    Yosuke MATSUSAKA  Tsuyoshi TOJO  Tetsunori KOBAYASHI

    INVITED PAPER
    Vol: E86-D No:1  Page(s): 26-36

    We developed a conversation system that can participate in a group conversation. Group conversation is a form of conversation in which three or more participants talk to each other about a topic on an equal footing. Conventional conversation systems have been designed under the assumption that the system talks with only one person. Group conversation differs from such one-to-one conversation in the following points. It is necessary for the system to understand the conversational situation, such as who is speaking, to whom they are speaking, and to whom the other participants are paying attention. It is also necessary for the system itself to try to affect the situation appropriately. In this study, we realized the function of recognizing the conversational situation by combining image processing and acoustic processing, and the function of working on the conversational situation by utilizing the facial and body actions of the robot. Thus, a robot that can join in a group conversation was realized.

  • Man-Machine Interaction Using a Vision System with Dual Viewing Angles

    Ying-Jieh HUANG  Hiroshi DOHI  Mitsuru ISHIZUKA

    PAPER-Image Processing, Computer Graphics and Pattern Recognition
    Vol: E80-D No:11  Page(s): 1074-1083

    This paper describes a vision system with dual viewing angles, i.e., wide and narrow, and a scheme for a user-friendly speech dialogue environment based on this vision system. The wide viewing angle provides a wide field for wide-range motion tracking, and the narrow viewing angle can follow a target within the wide field to capture its image with sufficient resolution. For fast and robust motion tracking, modified motion energy (MME) and existence energy (EE) are defined to detect the motion of the target and extract the motion region at the same time. Instead of using a physical device such as the foot switch commonly used in speech dialogue systems, the beginning and end of an utterance are detected from the movement of the user's mouth. Rather than recognizing the movement of the lips directly, the shape variation of the region between the lips is tracked for more stable recognition of the span of a dialogue. Without any special hardware, the tracking speed is about 10 frames/sec when no recognition is performed and about 5 frames/sec when both tracking and recognition are performed.
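
    A simplified version of motion-energy detection (our simplification, not the paper's exact MME/EE definitions) accumulates frame differences and thresholds them to find the moving region:

        import numpy as np

        def motion_mask(frames, thresh=60):
            """frames: list of grayscale images (H x W uint8 arrays).
            Returns a boolean mask of pixels that changed over the sequence."""
            diffs = [np.abs(frames[i + 1].astype(int) - frames[i].astype(int))
                     for i in range(len(frames) - 1)]
            energy = np.sum(diffs, axis=0)     # accumulated change per pixel
            return energy > thresh

        rng = np.random.default_rng(1)
        frames = [rng.integers(0, 256, (48, 64), dtype=np.uint8)
                  for _ in range(3)]
        mask = motion_mask(frames)
        print(mask.shape, int(mask.sum()))     # mask marks "moving" pixels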

  • Hybrid Method of Data Collection for Evaluating Speech Dialogue System

    Shu NAKAZATO  Ikuo KUDO  Katsuhiko SHIRAI

    PAPER-Speech Processing and Acoustics
    Vol: E79-D No:1  Page(s): 41-46

    In this paper, we propose a new method of dialogue data collection that can be used to evaluate the modules of a spoken dialogue system. Evaluating a module requires suitable data. Human-human dialogue data are not appropriate for module evaluation, because spontaneous data usually include too many specific phenomena such as fillers, restarts, pauses, and hesitations. Human-machine dialogue data are also not appropriate, because the dialogue is unnatural and the available vocabulary is limited. Here, we propose a 'hybrid method' for collecting spoken dialogue data. Its merit is that the collected data can be used as test data for the evaluation of a spoken dialogue system without any modification. In our method, a human takes on the role of some modules of the system while the machine performs the remaining modules. For example, a human acts as the speech recognition and dialogue management modules while the machine handles the response generation module. The collected data are well suited to evaluating the speech recognition and dialogue management modules, for the following reasons. (1) Lexicon: the lexicon was composed of a limited set of task-dependent words. (2) Grammar: the intentions expressed by the subjects were concise and clear. (3) Topics: there were few utterances outside the task domain.

  • A Speech Dialogue System with Multimodal Interface for Telephone Directory Assistance

    Osamu YOSHIOKA  Yasuhiro MINAMI  Kiyohiro SHIKANO

    PAPER
    Vol: E78-D No:6  Page(s): 616-621

    This paper describes a multimodal dialogue system employing speech input. The system uses three input methods (a speech recognizer, a mouse, and a keyboard) and two output methods (a display and sound). The speech recognizer employs an algorithm for large-vocabulary, speaker-independent, continuous speech recognition based on the HMM-LR technique. The system is implemented for telephone directory assistance to evaluate the speech recognition algorithm and to investigate the variations in speech structure that users utter to computers. Speech input is used in a multimodal environment, and dialogue data between computers and users are also collected. Twenty telephone-number retrieval tasks are used to evaluate the system. In the experiments, all users are equally trained in using the dialogue system with an interactive guidance system implemented on a workstation. Simplified city maps that indicate subscriber names and addresses are used to reduce the implicit restrictions imposed by written sentences, thus allowing each user to develop their own forms of expression. The task completion rate is 99.0%, and approximately 75% of the users say that they prefer this system to using a telephone book. Moreover, there is a significant decrease in nonkeyword usage, i.e., the use of words other than names and addresses, among users who receive more utterance practice.

  • Error Analysis of Field Trial Results of a Spoken Dialogue System for Telecommunications Applications

    Shingo KUROIWA  Kazuya TAKEDA  Masaki NAITO  Naomi INOUE  Seiichi YAMAMOTO

    PAPER
    Vol: E78-D No:6  Page(s): 636-641

    We carried out a one-year field trial of a voice-activated automatic telephone exchange service at KDD Laboratories, which has about 200 branch phones. The system has DSP-based continuous speech recognition hardware that can process incoming calls in real time with a vocabulary of 300 words. The recognition accuracy was found to be 92.5%, independent of the speaker, for speech read from a written text under laboratory conditions. In this paper, we describe the performance of the system obtained in the field trial. Apart from recognition accuracy, there was an additional error of about 20% due to out-of-vocabulary input and incorrect detection of speech endpoints, neither of which had been allowed for in the laboratory experiments. We also found that the recognition accuracy for actual speech was about 18% lower than for speech read from text, even when there were no out-of-vocabulary words. We examine the error variation across individual data to pinpoint the causes of incorrect recognition. Experiments on the collected data showed that the pause model, the filled-pause grammar, and differences in channel frequency response seriously affected recognition accuracy. With the help of simple techniques to overcome these problems, we finally obtained a recognition accuracy of 88.7% on real data.

  • A Linguistic Procedure for an Extension Number Guidance System

    Naomi INOUE  Izuru NOGAITO  Masahiko TAKAHASHI

    PAPER
    Vol: E76-D No:1  Page(s): 106-111

    This paper describes the linguistic procedure of our speech dialogue system. The procedure is composed of two processes: syntactic analysis using a finite-state network, and discourse analysis using a plan recognition model. The finite-state network is compiled from a regular grammar. The regular grammar is written so as to accept sentences with various styles, such as ellipsis and inversion, and is generated automatically from a skeleton of the grammar. The discourse analysis module understands each utterance, generates the next question for the user, and predicts the words likely to appear in the next utterance. For an extension number guidance task, we obtained correct recognition results for 93% of input sentences without word prediction, and for 98% when the prediction results included the proper words.
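
    The syntactic analysis step, a finite-state network compiled from a regular grammar that tolerates ellipsis and inversion, can be sketched with a tiny hand-written acceptor; the grammar fragment below is invented, not the paper's.

        ACCEPT = "F"
        # state -> {word: next state}; accepts "extension of NAME" (full form)
        # and "NAME extension" (inverted form)
        fsn = {
            "S": {"extension": "A", "suzuki": "B", "tanaka": "B"},
            "A": {"of": "C"},
            "B": {"extension": ACCEPT},
            "C": {"suzuki": ACCEPT, "tanaka": ACCEPT},
        }

        def accepts(words):
            state = "S"
            for w in words:
                state = fsn.get(state, {}).get(w)
                if state is None:
                    return False
            return state == ACCEPT

        print(accepts("extension of suzuki".split()))  # True
        print(accepts("tanaka extension".split()))     # True (inversion)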

  • System Design, Data Collection and Evaluation of a Speech Dialogue System

    Katunobu ITOU  Satoru HAYAMIZU  Kazuyo TANAKA  Hozumi TANAKA

    PAPER
    Vol: E76-D No:1  Page(s): 121-127

    This paper describes the design of a speech dialogue system, its evaluation, and the collection of spontaneous speech in a transportation guidance domain. Because it is difficult to collect spontaneous speech and to use a real system for collection and evaluation, dialogue-related phenomena have not yet been quantitatively clarified. The authors constructed a speech dialogue system that operates in almost real time, with acceptable recognition accuracy and flexible dialogue control. The system was used to collect spontaneous speech in a transportation guidance domain. In this domain, the system achieved an understanding rate of 84.2% for utterances within the predefined grammar and lexicon. Some statistics of the collected spontaneous speech are also given.

  • How Might One Comfortably Converse with a Machine?

    Yasuhisa NIIMI

    INVITED PAPER
    Vol: E76-D No:1  Page(s): 9-16

    Progress in speech recognition based on the hidden Markov model has made it possible to realize man-machine dialogue systems capable of operating in real time. In spite of considerable effort, however, few systems have been developed successfully, because of the lack of appropriate dialogue models. This paper reports on some of the technology necessary to develop a dialogue system with which one can converse comfortably. The emphasis is placed on three points: how a human converses with a machine; how errors of speech recognition can be recovered through conversation; and what it means for a machine to be cooperative. We examine the first problem by investigating dialogues between human speakers, and dialogues between a human speaker and a simulated machine. As considerations in the design of dialogue control, we discuss the relation between the efficiency and cooperativeness of a dialogue, methods for confirming what the machine has recognized, and dynamic adaptation of the machine. Third, we review research on the friendliness of natural language interfaces, mainly concerning the exchange of initiative, corrective and suggestive answers, and indirect questions. Lastly, we briefly describe the current state of the art in speech recognition and synthesis, and suggest what should be done to accept spontaneous speech and to produce a voice suitable for the output of a dialogue system.