
Keyword Search Results

[Keyword] spoken language (11 hits)

1-11 of 11 hits
  • Intrinsic Representation Mining for Zero-Shot Slot Filling

    Sixia LI  Shogo OKADA  Jianwu DANG  

     
    PAPER-Natural Language Processing

    Publicized: 2022/08/19
    Vol: E105-D No:11
    Page(s): 1947-1956

    Zero-shot slot filling is a domain adaptation approach for handling unseen slots in new domains without training instances. Previous studies implemented zero-shot slot filling by predicting both slot entities and slot types. Because of the lack of knowledge about new domains, existing methods often fail to predict slot entities in new domains and cannot effectively predict unseen slot types even when slot entities are correctly identified. Moreover, for some seen slot types, these methods may suffer from the domain shift problem, because unseen contexts in new domains may change the interpretation of the slots. In this study, we propose intrinsic representations to alleviate these domain shift problems. Specifically, we propose a multi-relation-based representation to capture both the general and specific characteristics of slot entities, and an ontology-based representation to provide complementary knowledge on the relationships between slots and values across domains, for handling both unseen slot types and unseen contexts. We constructed a two-step pipeline model using the proposed representations to solve the domain shift problem. Experimental results in terms of the F1 score on three large datasets (Snips, SGD, and MultiWOZ 2.3) showed that our model outperformed state-of-the-art baselines by 29.62, 10.38, and 3.89 points, respectively. A detailed analysis with the average slot F1 score showed that our model improved prediction by 25.82 points for unseen slot types and by 10.51 points for seen slot types. The results demonstrate that the proposed intrinsic representations can effectively alleviate the domain shift problem for both unseen slot types and seen slot types with unseen contexts.
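
    As a rough illustration of the two-step idea in this abstract (extract slot entities, then match each to slot-type knowledge), here is a minimal Python sketch. The bag-of-words similarity and all identifiers and data are hypothetical stand-ins, not the paper's representations or model.

    ```python
    # Minimal sketch of a two-step zero-shot slot-filling pipeline:
    # step 1 extracts candidate slot entities, step 2 assigns each entity
    # the unseen slot type whose description is most similar. The toy
    # similarity below is illustrative, not the paper's model.
    import numpy as np

    def embed(text, vocab):
        """Toy bag-of-words embedding; a real system would use a trained encoder."""
        v = np.zeros(len(vocab))
        for tok in text.lower().split():
            if tok in vocab:
                v[vocab[tok]] += 1.0
        return v

    def assign_slot_types(entities, slot_descriptions):
        """Step 2: match each extracted entity to the closest slot description."""
        vocab = {w: i for i, w in enumerate(
            sorted({w for t in list(entities) + list(slot_descriptions.values())
                    for w in t.lower().split()}))}
        assignments = {}
        for ent in entities:
            e = embed(ent, vocab)
            best, best_sim = None, -1.0
            for slot, desc in slot_descriptions.items():
                d = embed(desc, vocab)
                sim = float(e @ d / ((np.linalg.norm(e) * np.linalg.norm(d)) or 1.0))
                if sim > best_sim:
                    best, best_sim = slot, sim
            assignments[ent] = best
        return assignments

    # Step 1 (entity extraction) is stubbed out; assume it returned these spans.
    entities = ["jazz music", "living room"]
    slot_descriptions = {"music_genre": "genre of music such as jazz or rock",
                         "room_name": "name of a room such as living room"}
    print(assign_slot_types(entities, slot_descriptions))
    ```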

  • Construction of Spontaneous Emotion Corpus from Indonesian TV Talk Shows and Its Application on Multimodal Emotion Recognition

    Nurul LUBIS  Dessi LESTARI  Sakriani SAKTI  Ayu PURWARIANTI  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

    Publicized: 2018/05/10
    Vol: E101-D No:8
    Page(s): 2092-2100

    As human-computer interaction continues to develop toward the most natural form possible, it becomes increasingly urgent to incorporate emotion into the equation. This paper describes a step toward extending research on emotion recognition to Indonesian. The field continues to develop, yet exploration of the subject in Indonesian is still lacking. In particular, this paper highlights two contributions: (1) the construction of the first emotional audio-visual database in Indonesian, and (2) the first multimodal emotion recognizer in Indonesian, built from the aforementioned corpus. In constructing the corpus, we aim at natural emotions corresponding to real-life occurrences. However, the collection of emotional corpora is notoriously labor-intensive and expensive. To reduce the cost, we collect the emotional data from recordings of television programs, eliminating the need for an elaborate recording setup and experienced participants. In particular, we choose television talk shows for their natural conversational content, which yields spontaneous occurrences of emotion. To cover a broad range of emotions, we collected three episodes in different genres: politics, humanity, and entertainment. In this paper, we report analyses of the data and annotations. The acquired emotion corpus serves as a foundation for further research on emotion. Subsequently, in the experiment, we employ the support vector machine (SVM) algorithm to model the emotions in the collected data. We perform multimodal emotion recognition utilizing the predictions of three modalities: acoustic, semantic, and visual. Compared to the unimodal results, the multimodal feature combination attains identical accuracy for arousal at 92.6% and a significant improvement for the valence classification task at 93.8%. We hope to continue this work toward a finer-grained, more precise quantification of emotion.
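
    To illustrate the general shape of decision-level multimodal fusion with SVMs, here is a minimal sketch; the synthetic features, train/test split, and simple probability averaging are assumptions for illustration, not the paper's setup.

    ```python
    # Minimal sketch of decision-level fusion of per-modality SVM classifiers,
    # in the spirit of combining acoustic, semantic, and visual predictions.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n = 200
    y = rng.integers(0, 2, n)                       # e.g. high/low arousal labels

    # Stand-ins for per-modality feature vectors (acoustic, semantic, visual).
    modalities = {m: rng.normal(size=(n, 8)) + y[:, None] * 0.8
                  for m in ("acoustic", "semantic", "visual")}

    # One SVM per modality, each producing a class-probability score.
    models = {m: SVC(probability=True, random_state=0).fit(X[:150], y[:150])
              for m, X in modalities.items()}

    # Fuse by averaging the per-modality probabilities (one common scheme).
    probs = np.mean([models[m].predict_proba(modalities[m][150:])[:, 1]
                     for m in modalities], axis=0)
    fused_pred = (probs >= 0.5).astype(int)
    print("fused accuracy:", float(np.mean(fused_pred == y[150:])))
    ```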

  • Fuzzy Matching of Semantic Class in Chinese Spoken Language Understanding

    Yanling LI  Qingwei ZHAO  Yonghong YAN  

     
    PAPER-Natural Language Processing

    Vol: E96-D No:8
    Page(s): 1845-1852

    Semantic concepts in an utterance are obtained by fuzzy matching methods to address problems in spoken language understanding (SLU) such as word variations induced by automatic speech recognition (ASR) or key information fields omitted by users. A two-stage method is proposed: first, we adopt conditional random fields (CRF) to build probabilistic models that segment and label entity names in an input sentence. Second, fuzzy matching based on a similarity function is conducted between the named entities labeled by the CRF model and the reference characters of a dictionary. The experiments compare performance in terms of accuracy and processing speed. Among the four similarity measures examined, Dice similarity and cosine similarity based on TF scores achieve the best accuracy, both reaching 93% or higher in F1-measure. In particular, the latter improves on q-gram and improved edit distance, two conventional methods for string fuzzy matching, by 8.8% and 9%, respectively.
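
    A minimal sketch of the second stage follows: fuzzy matching a recognized entity string against dictionary references with Dice similarity and TF-based cosine similarity. The character-bigram granularity and the toy dictionary are assumptions, not the paper's exact configuration.

    ```python
    # Fuzzy matching of an ASR-distorted entity against a reference dictionary.
    import math
    from collections import Counter

    def bigrams(s):
        return [s[i:i+2] for i in range(len(s) - 1)]

    def dice(a, b):
        """Dice coefficient over character-bigram sets."""
        A, B = set(bigrams(a)), set(bigrams(b))
        return 2 * len(A & B) / (len(A) + len(B)) if A or B else 0.0

    def cosine_tf(a, b):
        """Cosine similarity over term-frequency vectors of character bigrams."""
        ta, tb = Counter(bigrams(a)), Counter(bigrams(b))
        dot = sum(ta[g] * tb[g] for g in ta)
        na = math.sqrt(sum(v * v for v in ta.values()))
        nb = math.sqrt(sum(v * v for v in tb.values()))
        return dot / (na * nb) if na and nb else 0.0

    def best_match(entity, dictionary, sim):
        return max(dictionary, key=lambda ref: sim(entity, ref))

    # Toy example: a misrecognized city name matched against the dictionary.
    dictionary = ["san francisco", "san diego", "santa fe"]
    print(best_match("sann francisco", dictionary, dice))
    print(best_match("sann francisco", dictionary, cosine_tf))
    ```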

  • Error Corrective Fusion of Classifier Scores for Spoken Language Recognition

    Omid DEHZANGI  Bin MA  Eng Siong CHNG  Haizhou LI  

     
    PAPER-Speech and Hearing

    Vol: E94-D No:12
    Page(s): 2503-2512

    This paper investigates a new method for fusing the scores generated by multiple classification sub-systems to further reduce the classification error rate in Spoken Language Recognition (SLR). In recent studies, a variety of effective classification algorithms have been developed for SLR. Hence, it has been common practice in the National Institute of Standards and Technology (NIST) Language Recognition Evaluations (LREs) to fuse the results from several classification sub-systems to boost the performance of SLR systems. In this work, we introduce a discriminative performance measure to optimize the fusion of 7 language classifiers developed as IIR's submission to the 2009 NIST LRE. We present an Error Corrective Fusion (ECF) method in which we iteratively learn the fusion weights to minimize the error rate of the fusion system. Experiments conducted on the 2009 NIST LRE corpus demonstrate a significant improvement over the individual sub-systems. A comparison study is also conducted to show the effectiveness of the ECF method.
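
    As a simplified stand-in for the idea of iteratively learning fusion weights to reduce error rate, here is a minimal sketch; the coordinate search and toy scores are assumptions, not the exact ECF update rule.

    ```python
    # Learn linear fusion weights over sub-system scores by nudging each weight
    # and keeping any change that lowers the classification error rate.
    import numpy as np

    def error_rate(weights, scores, labels):
        # scores: (n_systems, n_trials, n_classes); fuse, then pick argmax class.
        fused = np.tensordot(weights, scores, axes=1)
        return float(np.mean(np.argmax(fused, axis=1) != labels))

    def learn_fusion_weights(scores, labels, step=0.1, iters=50):
        w = np.ones(scores.shape[0]) / scores.shape[0]
        best = error_rate(w, scores, labels)
        for _ in range(iters):
            for i in range(len(w)):
                for delta in (+step, -step):
                    cand = w.copy()
                    cand[i] = max(0.0, cand[i] + delta)
                    e = error_rate(cand, scores, labels)
                    if e < best:
                        w, best = cand, e
        return w, best

    # Toy data: 3 sub-systems scoring 100 trials over 4 target languages.
    rng = np.random.default_rng(1)
    labels = rng.integers(0, 4, 100)
    scores = rng.normal(size=(3, 100, 4))
    for s in range(3):                     # make each system weakly informative
        scores[s, np.arange(100), labels] += 0.5 + 0.5 * s
    w, err = learn_fusion_weights(scores, labels)
    print("weights:", np.round(w, 2), "error rate:", err)
    ```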

  • Incremental Parsing with Adjoining Operation

    Yoshihide KATO  Shigeki MATSUBARA  

     
    PAPER-Morphological/Syntactic Analysis

    Vol: E92-D No:12
    Page(s): 2306-2312

    This paper describes an incremental parser based on an adjoining operation. By using this operation, we can avoid the problem of infinite local ambiguity. The paper further proposes a restricted version of the adjoining operation that preserves the lexical dependencies of partial parse trees. Our experimental results show that the restriction improves the accuracy of incremental parsing.
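
    To make the adjoining idea concrete, here is a minimal sketch of an adjoining-style splice on a partial parse tree: a new node is inserted above an existing constituent so the parser can extend structure it has already built. The Node class and splice rule are illustrative assumptions, not the paper's parser.

    ```python
    # Splice a new node into a partial parse tree (adjoining-style operation).
    class Node:
        def __init__(self, label, children=None):
            self.label = label
            self.children = children or []

        def __repr__(self):
            if not self.children:
                return self.label
            return f"({self.label} {' '.join(map(repr, self.children))})"

    def adjoin(root, target_label, new_label):
        """Insert a new node above the rightmost descendant labeled target_label."""
        for i in range(len(root.children) - 1, -1, -1):
            child = root.children[i]
            if child.label == target_label:
                root.children[i] = Node(new_label, [child])
                return True
            if adjoin(child, target_label, new_label):
                return True
        return False

    tree = Node("S", [Node("NP", [Node("John")]), Node("VP", [Node("saw")])])
    adjoin(tree, "VP", "VP")                  # make room for a further complement
    tree.children[-1].children.append(Node("NP", [Node("Mary")]))
    print(tree)                               # (S (NP John) (VP (VP saw) (NP Mary)))
    ```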

  • On the Use of Structures for Spoken Language Understanding: A Two-Step Approach

    Minwoo JEONG  Gary Geunbae LEE  

     
    PAPER-Natural Language Processing

    Vol: E91-D No:5
    Page(s): 1552-1561

    Spoken language understanding (SLU) aims to map a user's speech into a semantic frame. Since most previous work uses semantic structures for SLU, we verify that such structure is useful even for noisy input. We apply a structured prediction method to the SLU problem and compare it to an unstructured one. In addition, we present a combined method that embeds long-distance dependencies between entities in a cascaded manner. On air travel data, we show that our approach improves performance over baseline models.
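
    The following minimal sketch contrasts unstructured per-token prediction with structured decoding for slot labeling; the scores and label set are toy assumptions, not the paper's model.

    ```python
    # Unstructured argmax ignores label transitions; a Viterbi decode over the
    # same scores enforces sequence structure (e.g. I-city cannot follow O).
    import numpy as np

    labels = ["O", "B-city", "I-city"]
    # Toy emission scores for a 3-token input such as "to new york".
    emis = np.array([[2.0, 0.1, 0.5],     # token 1: clearly O
                     [0.2, 1.4, 1.5],     # token 2: I-city narrowly beats B-city
                     [0.1, 0.6, 1.2]])    # token 3: I-city
    # Transition scores: I-city must not follow O (no span has started yet).
    trans = np.array([[0.5, 0.5, -9.0],
                      [0.0, 0.0,  1.0],
                      [0.0, 0.0,  1.0]])

    def viterbi(emis, trans):
        n, k = emis.shape
        dp, back = emis[0].copy(), np.zeros((n, k), dtype=int)
        for t in range(1, n):
            cand = dp[:, None] + trans + emis[t][None, :]
            back[t], dp = cand.argmax(axis=0), cand.max(axis=0)
        path = [int(dp.argmax())]
        for t in range(n - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return [labels[i] for i in reversed(path)]

    print("unstructured:", [labels[i] for i in emis.argmax(axis=1)])  # invalid O I-city
    print("structured:  ", viterbi(emis, trans))                      # O B-city I-city
    ```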

  • Dialogue Languages and Persons with Disabilities

    Akira ICHIKAWA  

     
    INVITED PAPER

    Vol: E87-D No:6
    Page(s): 1312-1319

    Utterances in dialogue, whether in spoken language or sign language, have functions that enable recipients to understand them easily in real time and to control the conversation smoothly despite its volatile character. In this paper, we present experimental evidence of these functions. Prosody plays a very important role not only in spoken language (aural language) but also in sign language (visual language) and finger braille (tactile language). Skilled users of a language can detect word boundaries in utterances and immediately estimate sentence structure using prosody. The gestures and glances of a recipient may influence the utterances of the sender, leading to amendments of the contents of utterances and smooth turn-taking. Individuality and emotion in utterances are also very important for effective communication support systems for persons with disabilities, even more so than for non-disabled persons. The trials described herein follow universal design. Some trials carried out to develop these systems are also reported.

  • Incremental Transfer in English-Japanese Machine Translation

    Shigeki MATSUBARA  Yasuyoshi INAGAKI  

     
    PAPER-Artificial Intelligence and Cognitive Science

    Vol: E80-D No:11
    Page(s): 1122-1130

    Since spontaneously spoken language expressions appear continuously, the transfer stage of a spoken language machine translation system has to work incrementally. In such a system, a high degree of incrementality is strongly required, even ahead of translation quality. This paper proposes an incremental machine translation system that translates English spoken words into Japanese in the order in which they appear. The system is composed of three modules: incremental parsing, transfer, and generation, which work synchronously. The transfer module utilizes features and phenomena characteristic of spoken Japanese: flexible word order, ellipses, repetitions, and so forth. This is motivated by the observation that such characteristics frequently appear in the Japanese uttered by English-Japanese interpreters. Their frequent utilization is the key to the success of highly incremental translation between English and Japanese, which have different word orders. We have implemented a prototype system, Sync/Trans, which parses English dialogues incrementally and generates Japanese immediately. To evaluate Sync/Trans, we have conducted an experiment on conversations consisting of 27 dialogues and 218 sentences. Of these sentences, 190 are translated correctly, a success rate of 87.2%. This result shows our incremental method to be a promising technique for spoken language translation with acceptable accuracy and strong real-time performance.
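
    To illustrate the idea of three synchronized modules passing words along as soon as they arrive, here is a minimal sketch using chained Python generators; the toy tags and lexicon are hypothetical, not Sync/Trans itself.

    ```python
    # Three pipeline stages (parse, transfer, generate) that each consume and
    # emit one word at a time, so output begins before the input sentence ends.
    def parser(words):
        for w in words:                        # per-word analysis: tag each token
            yield (w, "N" if w in {"dog", "cat"} else "X")

    def transfer(tagged):
        toy_lexicon = {"the": "sono", "dog": "inu", "runs": "hashiru"}
        for w, tag in tagged:                  # per-word transfer to target lexemes
            yield toy_lexicon.get(w, w)

    def generator(lexemes):
        for lx in lexemes:                     # emit each target word immediately
            print(lx, end=" ", flush=True)
        print()

    generator(transfer(parser(["the", "dog", "runs"])))
    ```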

  • Multimedia Spoken Language Translation

    Jae-Woo YANG  Youngjik LEE  Jin-H. KIM  

     
    INVITED PAPER

    Vol: E79-D No:6
    Page(s): 653-658

    This paper is concerned with spoken language translation in a multimedia communication environment. We summarize current research activities by describing various spoken language translation systems and the multimodal technology related to spoken language translation. We propose a spoken language translation system that exploits the multimedia communication environment to overcome the limits caused by imperfect speech recognition. Our approach is in contrast to that of most conventional speech translation systems, which limit their dialogue domains to obtain better speech recognition. We also propose a performance measure for spoken language translation systems, defined as the ratio of the information quantities at each end of the communication. Using this measure, we show that multimedia enhances spoken language translation systems.
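
    One possible reading of a ratio-of-information-quantities measure is sketched below, approximating each side's information quantity by the self-information of the message under its empirical unigram model; this interpretation and the toy messages are assumptions, not the paper's exact definition.

    ```python
    # Illustrative "information ratio": information received / information sent.
    import math
    from collections import Counter

    def information(tokens):
        """Total self-information of a token sequence under its own unigram model."""
        counts = Counter(tokens)
        n = len(tokens)
        return -sum(c * math.log2(c / n) for c in counts.values())

    sent = "please book a hotel room near the station".split()
    received = "book hotel room near station".split()
    print(f"information ratio: {information(received) / information(sent):.2f}")
    ```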

  • Cooperative Spoken Dialogue Model Using Bayesian Network and Event Hierarchy

    Masahiro ARAKI  Shuji DOSHITA  

     
    PAPER

    Vol: E78-D No:6
    Page(s): 629-635

    In this paper, we propose a dialogue model that reflects two important aspects of a spoken dialogue system: being robust and being cooperative. For this purpose, our model has two main inference spaces: the Conversational Space (CS) and the Problem Solving Space (PSS). The CS is a kind of dynamic Bayesian network that represents the meaning of an utterance and general dialogue rules; the robust aspect is treated in the CS. The PSS is a network, called an Event Hierarchy, that represents the structure of task domain problems; the cooperative aspect is mainly treated in the PSS. In constructing the CS and making inferences on the PSS, the system's processing, from meaning understanding through response generation, is modeled in five steps: (1) meaning understanding, (2) intention understanding, (3) communicative effect, (4) reaction generation, and (5) response generation. The meaning understanding step constructs the CS, and the response generation step composes a surface expression of the system's response from part of the CS. The intention understanding step maps utterance types in the CS to actions in the PSS. The reaction generation step selects a cooperative reaction in the PSS and expands it into utterance types in the CS. The status of problem solving and the user's declared preferences are recorded in the mental state by the communicative effect step. From this point of view, cooperative problem-solving dialogue is regarded as a process of constructing the CS and achieving goals in the PSS through these five steps.
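
    The five-step flow can be pictured as a plain function pipeline, as in the minimal sketch below; every lookup table and data structure here is a toy placeholder, not the CS/PSS formalism itself.

    ```python
    # The five steps from meaning understanding to response generation.
    def meaning_understanding(utterance):
        return {"tokens": utterance.split()}                 # build a CS fragment

    def intention_understanding(meaning):
        return "ask_time" if "when" in meaning["tokens"] else "inform"

    def communicative_effect(state, intention):
        state.setdefault("history", []).append(intention)    # record in mental state
        return state

    def reaction_generation(intention):
        return {"ask_time": "provide_time", "inform": "acknowledge"}[intention]

    def response_generation(reaction):
        return {"provide_time": "It departs at 9:00.",
                "acknowledge": "I see."}[reaction]

    state = {}
    intent = intention_understanding(meaning_understanding("when does the train leave"))
    state = communicative_effect(state, intent)
    print(response_generation(reaction_generation(intent)))
    ```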

  • An Efficient One-Pass Search Algorithm for Parsing Spoken Language

    Michio OKADA  

     
    PAPER-Speech

    Vol: E75-A No:7
    Page(s): 944-953

    Spoken language systems such as speech-to-speech dialogue translation systems have been gaining more attention in recent years. These systems require full integration of speech recognition and natural language understanding. This paper presents an efficient parsing algorithm that integrates the search problems of speech processing and language processing. The parsing algorithm we propose is an extension of the finite-state-network-directed, one-pass search algorithm to one directed by a context-free grammar, retaining the time-synchronous procedure. The extended search algorithm is used to find approximately globally optimal sentence hypotheses; it avoids the overhead that exists in, for example, hierarchical systems based on the lattice parsing approach. The computational complexity of this search algorithm is proportional to the length of the input speech. Since the search process in speech recognition can directly take account of the predictive information in the sentence parsing, this framework can be extended to spoken language systems that deal with dynamically varying constraints in dialogue situations.
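
    A minimal sketch of the time-synchronous, one-pass flavor of such a search follows: at each frame, every hypothesis is extended and pruned in lock-step, so cost grows linearly with input length. The toy successor table standing in for context-free predictions, and the random frame scores, are assumptions, not the paper's algorithm.

    ```python
    # Time-synchronous beam search in which a grammar directs hypothesis expansion.
    import numpy as np

    words = ["<s>", "show", "me", "flights"]
    successors = {"<s>": {"show"}, "show": {"me"}, "me": {"flights"}, "flights": set()}

    rng = np.random.default_rng(2)
    T = 8
    # Toy per-frame acoustic log-scores per word (stand-ins for real models).
    frame_scores = {w: rng.normal(size=T) for w in words}

    beam = {("<s>",): 0.0}                 # hypothesis: word sequence -> score
    for t in range(T):
        new_beam = {}
        for hyp, score in beam.items():
            last = hyp[-1]
            stay = score + frame_scores[last][t]          # stay in current word
            if new_beam.get(hyp, -1e9) < stay:
                new_beam[hyp] = stay
            for nxt in successors[last]:                  # grammar-directed expansion
                ext = hyp + (nxt,)
                adv = score + frame_scores[nxt][t]
                if new_beam.get(ext, -1e9) < adv:
                    new_beam[ext] = adv
        # Pruning once per frame keeps the search linear in input length.
        beam = dict(sorted(new_beam.items(), key=lambda kv: -kv[1])[:5])

    print(max(beam.items(), key=lambda kv: kv[1])[0])
    ```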