
Keyword Search Result

[Keyword] language (282 hits)

Showing results 101-120 of 282

  • Training Set Selection for Building Compact and Efficient Language Models

    Keiji YASUDA  Hirofumi YAMAMOTO  Eiichiro SUMITA

    PAPER-Natural Language Processing
    Vol: E92-D No:3  Page(s): 506-511

    Statistical language model training requires corpora matched to the target domain. However, training corpora sometimes include both target-domain-matched and unmatched sentences, in which case training set selection is effective for both reducing model size and improving model performance. In this paper, a training set selection method for statistical language model training is described. The method provides two advantages: it improves language model performance, and it reduces the computational load of the language model. The method has four steps. 1) Sentence clustering is applied to all available corpora. 2) A language model is trained on each cluster. 3) Perplexity on the development set is calculated using each of these language models. 4) The final language model is trained on the clusters whose language models yield low perplexities. The experimental results indicate that the language model trained on the data selected by our method gives lower perplexity on an open test set than a language model trained on all available corpora.
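
    As a rough illustration of the four-step procedure, the Python sketch below assumes whitespace-tokenized sentences, an add-one-smoothed unigram model standing in for the paper's language models, and clusters computed beforehand (step 1); the function and parameter names (train_unigram, select_training_set, keep_ratio) are illustrative, not from the paper.

        # Minimal sketch of perplexity-based training-set selection (illustrative only).
        import math
        from collections import Counter

        def train_unigram(sentences):
            """Add-one-smoothed unigram model over whitespace-tokenized sentences."""
            counts = Counter(w for s in sentences for w in s.split())
            total = sum(counts.values())
            vocab = len(counts) + 1  # +1 reserves mass for unseen words
            return lambda w: (counts[w] + 1) / (total + vocab)

        def perplexity(model, sentences):
            logp, n = 0.0, 0
            for s in sentences:
                for w in s.split():
                    logp += math.log(model(w))
                    n += 1
            return math.exp(-logp / max(n, 1))

        def select_training_set(clusters, dev_set, keep_ratio=0.5):
            """Steps 2-4: train a model per cluster, score it on the dev set,
            and keep the clusters whose models give the lowest perplexity."""
            scored = sorted(clusters, key=lambda c: perplexity(train_unigram(c), dev_set))
            kept = scored[:max(1, int(len(scored) * keep_ratio))]
            return [s for cluster in kept for s in cluster]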

  • Polynomial Time Inductive Inference of TTSP Graph Languages from Positive Data

    Ryoji TAKAMI  Yusuke SUZUKI  Tomoyuki UCHIDA  Takayoshi SHOUDAI

    PAPER
    Vol: E92-D No:2  Page(s): 181-190

    Two-Terminal Series Parallel (TTSP, for short) graphs are used as data models in applications for electric networks and scheduling problems. We propose a TTSP term graph which is a TTSP graph having structured variables, that is, a graph pattern over a TTSP graph. Let TG_TTSP be the set of all TTSP term graphs whose variable labels are mutually distinct. For a TTSP term graph g in TG_TTSP, the TTSP graph language of g, denoted by L(g), is the set of all TTSP graphs obtained from g by substituting arbitrary TTSP graphs for all variables in g. Firstly, when a TTSP graph G and a TTSP term graph g are given as inputs, we present a polynomial time matching algorithm which decides whether or not L(g) contains G. The minimal language problem for the class L_TTSP = {L(g) | g ∈ TG_TTSP} is, given a set S of TTSP graphs, to find a TTSP term graph g in TG_TTSP such that L(g) is minimal among all TTSP graph languages which contain all TTSP graphs in S. Secondly, we give a polynomial time algorithm for solving the minimal language problem for L_TTSP. Finally, we show that L_TTSP is polynomial time inductively inferable from positive data.
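
    The series/parallel structure of TTSP graphs, and the substitution of graphs for variables that defines L(g), can be pictured with the simplified recursive representation below. This is only an illustrative Python sketch (the paper works on the graphs themselves, with variables carrying two distinguished ports); all class and function names are invented here.

        # Illustrative recursive view of TTSP graphs: every TTSP graph is a single
        # edge or the series/parallel composition of two smaller TTSP graphs;
        # a TTSP term graph additionally contains structured variables.
        from dataclasses import dataclass
        from typing import Union

        @dataclass
        class Edge:
            label: str

        @dataclass
        class Series:          # connect the sink of g1 to the source of g2
            g1: "TTSP"
            g2: "TTSP"

        @dataclass
        class Parallel:        # identify the sources and the sinks of g1 and g2
            g1: "TTSP"
            g2: "TTSP"

        @dataclass
        class Variable:        # structured variable of a TTSP term graph
            name: str

        TTSP = Union[Edge, Series, Parallel, Variable]

        def substitute(g: TTSP, binding: dict) -> TTSP:
            """Replace every variable by the TTSP graph bound to its name;
            applying this to a term graph g yields one member of L(g)."""
            if isinstance(g, Variable):
                return binding[g.name]
            if isinstance(g, (Series, Parallel)):
                return type(g)(substitute(g.g1, binding), substitute(g.g2, binding))
            return g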

  • Monotone Increasing Binary Similarity and Its Application to Automatic Document-Acquisition of a Category

    Izumi SUZUKI  Yoshiki MIKAMI  Ario OHSATO

    PAPER-Knowledge Acquisition
    Vol: E91-D No:11  Page(s): 2545-2551

    A technique that acquires documents in the same category as a given short text is introduced. Regarding the given text as a training document, the system marks the most similar document, or sufficiently similar documents, in the document domain (or the entire Web). The system then adds the marked documents to the training set, learns from the enlarged set, and repeats this process until no more documents are marked. Requiring the similarity to increase monotonically as the system learns enables it to 1) detect the correct point at which no more documents remain to be marked and 2) decide the threshold value that the classifier uses. In addition, under the condition that normalization is limited to dividing term weights by a p-norm of the weights, the linear classifier in which training documents are indexed in a binary manner is the only instance that satisfies the monotone increasing property. The feasibility of the proposed technique was confirmed through an examination of binary similarity, using English and German documents randomly selected from the Web.
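
    A minimal Python sketch of the acquisition loop, together with a binary, p-norm-normalized similarity, is given below. The function names, the threshold handling, and the whitespace tokenization are illustrative assumptions, not the paper's exact formulation.

        # Sketch of the iterative document-acquisition loop (illustrative only).
        def acquire_category(seed_text, domain_docs, similarity, threshold):
            """Grow a training set from one seed text: repeatedly mark documents
            sufficiently similar to the current training set, until none remain."""
            training = [seed_text]
            remaining = list(domain_docs)
            while True:
                marked = [d for d in remaining if similarity(training, d) >= threshold]
                if not marked:
                    return training
                training.extend(marked)
                remaining = [d for d in remaining if d not in marked]

        def binary_similarity(training, doc, p=2):
            """Linear classifier with binary (presence/absence) term weights,
            normalized by the p-norm of the weight vector."""
            vocab = {w for t in training for w in t.split()}
            weights = {w: 1.0 for w in vocab}                       # binary indexing
            norm = (sum(v ** p for v in weights.values()) ** (1.0 / p)) or 1.0
            return sum(weights[w] for w in vocab & set(doc.split())) / norm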

  • Some Results on Primitive Words, Square-Free Words, and Disjunctive Languages

    Tetsuo MORIYA

    LETTER-Automata and Formal Language Theory
    Vol: E91-D No:10  Page(s): 2514-2516

    In this paper, we give some results on primitive words, square-free words, and disjunctive languages. We show that for a word u ∈ Σ+, every element of λ(cp(u)) is d-primitive iff it is square-free, where cp(u) is the set of all cyclic permutations of u and λ(cp(u)) is the set of their primitive roots. Next we show that p^m q^n is a primitive word for all n, m ≥ 1 and primitive words p, q, under the condition that |p| = |q| and (m, n) ≠ (1, 1). We also give a condition for a language to be disjunctive.
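
    For readers who want to experiment with the basic notions, the Python helpers below implement the standard definitions of primitive roots, cyclic permutations, and square-freeness (d-primitivity is not implemented); they are generic textbook routines, not code from the letter.

        def primitive_root(u: str) -> str:
            """Shortest word p with u = p^k; u is primitive iff primitive_root(u) == u."""
            n = len(u)
            for d in range(1, n + 1):
                if n % d == 0 and u[:d] * (n // d) == u:
                    return u[:d]
            return u

        def is_primitive(u: str) -> bool:
            return primitive_root(u) == u

        def is_square_free(u: str) -> bool:
            """True iff u contains no factor of the form xx with x non-empty."""
            n = len(u)
            return not any(u[i:i + l] == u[i + l:i + 2 * l]
                           for l in range(1, n // 2 + 1)
                           for i in range(0, n - 2 * l + 1))

        def cyclic_permutations(u: str) -> set:
            return {u[i:] + u[:i] for i in range(len(u))}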

  • A Method for Recognizing Noisy Romanized Japanese Words in Learner English

    Ryo NAGATA  Jun-ichi KAKEGAWA  Hiromi SUGIMOTO  Yukiko YABUTA

    PAPER-Educational Technology
    Vol: E91-D No:10  Page(s): 2458-2466

    This paper describes a method for recognizing romanized Japanese words in learner English. Such words become noise and cause problems in a variety of systems and tools for language learning and teaching, including text analysis, spell checking, and grammatical error detection, because they are Japanese words and thus mostly unknown to those systems and tools. A further difficulty is that the spelling rules of romanized Japanese are often violated in learner English. To address this problem, the described method uses a clustering algorithm reinforced by a small set of rules. Experiments show that it achieves an F-measure of 0.879 and outperforms other methods. They also show that it requires only the target text and an English word list of reasonable size.
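
    The rule-based side of such a recognizer can be approximated by checking whether a token parses into Hepburn-style romaji syllables, as in the rough Python sketch below. The syllable pattern is a simplified guess (it ignores geminate consonants, for example) and is not the paper's rule set; the clustering step is omitted entirely.

        # Rough rule-based check: does the whole token segment into romaji-like
        # syllables (optional consonant cluster + vowel, or a syllabic "n")?
        import re

        ROMAJI_SYLLABLE = re.compile(
            r"(?:(?:ky|gy|sh|ch|ny|hy|my|ry|by|py|ts|[kgsztdnhbpmyrwjf]?)[aiueo]|n)",
            re.IGNORECASE,
        )

        def looks_like_romaji(token: str) -> bool:
            pos = 0
            while pos < len(token):
                m = ROMAJI_SYLLABLE.match(token, pos)
                if not m:
                    return False
                pos = m.end()
            return True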

  • On Algebraic Properties of Delay-Nonconflicting Languages in Supervisory Control under Communication Delays

    Jung-Min YANG  Seong-Jin PARK

    LETTER-Systems and Control
    Vol: E91-A No:8  Page(s): 2237-2239

    In networked control systems, uncontrollable events may unexpectedly occur in a plant before a proper control action is applied to the plant due to communication delays. In the area of supervisory control of discrete event systems, Park and Cho [5] proposed the notion of delay-nonconflictingness for the existence of a supervisor achieving a given language specification under communication delays. In this paper, we present the algebraic properties of delay-nonconflicting languages which are necessary for solving supervisor synthesis problems under communication delays. Specifically, we show that the class of prefix-closed and delay-nonconflicting languages is closed under intersection, which leads to the existence of a unique infimal prefix-closed and delay-nonconflicting superlanguage of a given language specification.

  • Decomposition of Task-Level Concurrency on C Programs Applied to the Design of Multiprocessor SoC

    Mohammad ZALFANY URFIANTO  Tsuyoshi ISSHIKI  Arif ULLAH KHAN  Dongju LI  Hiroaki KUNIEDA

    PAPER-VLSI Design Technology and CAD
    Vol: E91-A No:7  Page(s): 1748-1756

    This paper presents a simple extension that assists the decomposition of task-level concurrency within C programs. The concurrency decomposition is meant to serve as the entry point of the design flow for Multiprocessor System-on-Chip (MPSoC) architectures. Our methodology allows the (re)use of readily available reference C programs and enables easy and rapid exploration of alternative task partitioning strategies, a crucial step that greatly influences the overall quality of the designed MPSoC. A test case using a JPEG encoder application has been carried out, and the results are presented in this paper.

  • Study of Spatial Configurations of Equipment for Online Sign Interpretation Service

    Kaoru NAKAZONO  Saori TANAKA

    PAPER-Media Communication
    Vol: E91-D No:6  Page(s): 1613-1621

    This paper discusses the design of videophone equipment configurations for online sign interpretation. We classified interpretation services into three types of situation: on-site interpretation, partial online interpretation, and full online interpretation. For each situation, the spatial configurations of the equipment are considered with the issue of nonverbal signals in mind. Simulation experiments of sign interpretation were performed using these spatial configurations, and the quality of each configuration was assessed. The preferred configurations had the common characteristic that the hearing subject could see the face of his/her principal conversation partner, that is, the deaf subject. The results imply that hearing people who do not understand sign language use nonverbal signals to facilitate interpreter-mediated conversation.

  • On the Use of Structures for Spoken Language Understanding: A Two-Step Approach

    Minwoo JEONG  Gary Geunbae LEE

    PAPER-Natural Language Processing
    Vol: E91-D No:5  Page(s): 1552-1561

    Spoken language understanding (SLU) aims to map a user's speech into a semantic frame. Since most previous work exploits semantic structures for SLU, we verify that such structure is useful even for noisy input. We apply a structured prediction method to the SLU problem and compare it to an unstructured one. In addition, we present a combined method that embeds long-distance dependencies between entities in a cascaded manner. On air travel data, we show that our approach improves performance over baseline models.

  • Query Language for Location-Based Services: A Model Checking Approach

    Christian HOAREAU  Ichiro SATOH

    PAPER-Ubiquitous Computing
    Vol: E91-D No:4  Page(s): 976-985

    We present a model checking approach to the rationale, implementation, and applications of a query language for location-based services. Such query mechanisms are necessary so that users, objects, and/or services can effectively benefit from the location-awareness of their surrounding environment. The underlying data model is founded on a symbolic model of space organized in a tree structure. Once extended to a semantic model for modal logic, location query processing can be regarded as a model checking problem, and location queries are thus defined as hybrid logic-based formulas. Our approach differs from existing research in that it explores the connection between location models and query processing in ubiquitous computing systems, relies on a sound theoretical basis, and provides modal logic-based query mechanisms for expressive searches over a decentralized data structure. A prototype implementation is also presented and discussed.
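
    A toy version of the underlying idea, a symbolic location model organized as a tree and queried for containment, is sketched below in Python. The class, method, and location names are invented for illustration; the paper's hybrid-logic query language is far more expressive than this.

        # Symbolic space model as a tree of named locations, with a simple
        # containment query evaluated by walking the tree.
        class Location:
            def __init__(self, name, children=()):
                self.name = name
                self.children = list(children)

            def find(self, name):
                """Return the subtree rooted at the location called `name`, if any."""
                if self.name == name:
                    return self
                for child in self.children:
                    hit = child.find(name)
                    if hit is not None:
                        return hit
                return None

            def contains(self, name):
                """Query with a model-checking flavour: is `name` somewhere below self?"""
                return self.find(name) is not None

        # Example hierarchy (names are invented).
        building = Location("building", [
            Location("floor1", [Location("room101"), Location("room102")]),
            Location("floor2", [Location("room201", [Location("printer")])]),
        ])
        print(building.contains("printer"))                  # True
        print(building.find("floor2").contains("room101"))   # False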

  • Automatic Language Identification with Discriminative Language Characterization Based on SVM

    Hongbin SUO  Ming LI  Ping LU  Yonghong YAN

    PAPER-Language Identification
    Vol: E91-D No:3  Page(s): 567-575

    Robust automatic language identification (LID) is the task of identifying the language of a short utterance spoken by an unknown speaker. The mainstream approaches include parallel phone recognition language modeling (PPRLM), support vector machines (SVM) and general Gaussian mixture models (GMMs). These systems map the cepstral features of spoken utterances into high-level scores using classifiers. In this paper, in order to increase the dimension of the score vector and alleviate the inter-speaker variability within the same language, multiple data groups based on supervised speaker clustering are employed to generate discriminative language characterization score vectors (DLCSV). Back-end SVM classifiers are used to model the probability distribution of each target language in the DLCSV space. Finally, the output scores of the back-end classifiers are calibrated by a pair-wise posterior probability estimation (PPPE) algorithm. The proposed language identification frameworks are evaluated on the 2003 NIST Language Recognition Evaluation (LRE) databases, and the experiments show that the system described in this paper produces results comparable to those of existing systems. In particular, the SVM framework achieves an equal error rate (EER) of 4.0% in the 30-second task and outperforms state-of-the-art systems by more than 30% relative error reduction. In addition, the proposed PPRLM and GMM systems achieve EERs of 5.1% and 5.0%, respectively.
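
    The overall flow from per-cluster scores to a back-end SVM can be sketched as follows. This Python fragment assumes NumPy and scikit-learn are available, stubs out the front-end per-cluster scorers, and uses Platt-style probability outputs merely as a stand-in for the PPPE calibration described above; none of the names are from the paper.

        # Forming discriminative language characterization score vectors (DLCSV)
        # from per-(language, speaker-cluster) scorers and training a back-end SVM.
        import numpy as np
        from sklearn.svm import SVC

        def dlcsv(utterance_features, cluster_models):
            """Concatenate the scores of every per-cluster model into one vector,
            raising the dimension of the score space."""
            return np.array([model(utterance_features) for model in cluster_models])

        def train_backend(train_features, train_langs, cluster_models):
            X = np.vstack([dlcsv(f, cluster_models) for f in train_features])
            backend = SVC(kernel="linear", probability=True)  # probability output
            backend.fit(X, train_langs)                        # stands in for PPPE
            return backend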

  • Language Modeling Using PLSA-Based Topic HMM

    Atsushi SAKO  Tetsuya TAKIGUCHI  Yasuo ARIKI

    PAPER-Language Modeling
    Vol: E91-D No:3  Page(s): 522-528

    In this paper, we propose a PLSA-based language model for sports-related live speech. This model is implemented using a unigram rescaling technique that combines a topic model and an n-gram. In the conventional method, unigram rescaling is performed with a topic distribution estimated from a recognized transcription history. This method can improve the performance, but it cannot express topic transition. By incorporating the concept of topic transition, it is expected that the recognition performance will be improved. Thus, the proposed method employs a "Topic HMM" instead of a history to estimate the topic distribution. The Topic HMM is an Ergodic HMM that expresses typical topic distributions as well as topic transition probabilities. Word accuracy results from our experiments confirmed the superiority of the proposed method over a trigram and a PLSA-based conventional method that uses a recognized history.
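
    Unigram rescaling itself can be summarized in one line: the n-gram probability is scaled by how strongly the topic model favours a word relative to its unigram probability. The Python function below is a generic sketch (the topic distribution would come from the Topic HMM in the proposed method), the exponent beta is an illustrative parameter, and renormalization over the vocabulary is left implicit.

        def rescaled_prob(word, history, ngram_prob, topic_prob, unigram_prob, beta=1.0):
            """Unnormalized unigram-rescaling score:
            P_ngram(w|h) * (P_topic(w) / P_unigram(w)) ** beta."""
            return ngram_prob(word, history) * (topic_prob(word) / unigram_prob(word)) ** beta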

  • Learning of Finite Unions of Tree Patterns with Internal Structured Variables from Queries

    Satoshi MATSUMOTO  Takayoshi SHOUDAI  Tomoyuki UCHIDA  Tetsuhiro MIYAHARA  Yusuke SUZUKI

    PAPER-Algorithmic Learning Theory
    Vol: E91-D No:2  Page(s): 222-230

    A linear term tree is defined as an edge-labeled rooted tree pattern with ordered children and internal structured variables whose labels are mutually distinct. A variable can be replaced with arbitrary edge-labeled rooted ordered trees. We consider the polynomial time learnability of finite unions of linear term trees in the exact learning model formalized by Angluin. The language L(t) of a linear term tree t is the set of all trees obtained from t by substituting arbitrary edge-labeled rooted ordered trees for all variables in t. Moreover, for a finite set S of linear term trees, we define L(S) = ∪_{t∈S} L(t). A target of learning, denoted by T*, is a finite set of linear term trees, where the number of edge labels is infinite. In this paper, for any set T* of m linear term trees (m ≥ 0), we present a query learning algorithm which exactly identifies T* in polynomial time using at most 2mn^2 Restricted Subset queries and at most m+1 Equivalence queries, where n is the maximum size of counterexamples. Finally, we note that finite sets of linear term trees are not learnable in polynomial time using Restricted Equivalence, Membership and Subset queries.

  • A Machine Learning Approach for an Indonesian-English Cross Language Question Answering System

    Ayu PURWARIANTI  Masatoshi TSUCHIYA  Seiichi NAKAGAWA

    PAPER-Natural Language Processing
    Vol: E90-D No:11  Page(s): 1841-1852

    We have built a CLQA (Cross Language Question Answering) system for a source language with limited data resources (e.g. Indonesian) using a machine learning approach. The CLQA system consists of four modules: question analyzer, keyword translator, passage retriever and answer finder. We used machine learning in two modules, the question classifier (part of the question analyzer) and the answer finder. In the question classifier, we classify the EAT (Expected Answer Type) of a question using an SVM (Support Vector Machine). The features for this classification module are basically the output of our shallow question parsing module. To improve the classification score, we use statistical information extracted from our Indonesian corpus. In the answer finder module, unlike the common approach in which the answer is located by matching the named entities of the corpus text with the EAT of the question, we locate the answer by text chunking over the corpus text. The features for the SVM-based text chunking process consist of question features, corpus-text features and similarity scores between the corpus text and the question keywords. In this way, we eliminate the named entity tagging process for the target document. As for the keyword translator module, we use an Indonesian-English dictionary to translate Indonesian keywords into English. We also use some simple patterns to transform borrowed English words. The keywords are then combined in boolean queries in order to retrieve relevant passages using IDF scores. We first conducted an experiment using 2,837 questions (about 10% of which were used as test data) obtained from 18 Indonesian college students. We then conducted a similar experiment using the NTCIR (NII Test Collection for IR Systems) 2005 CLQA task by translating the English questions into Indonesian. Compared to the Japanese-English and Chinese-English CLQA results in NTCIR 2005, our system is superior to the others except for one system that uses rich data resources (three dictionaries). Further, a rough comparison with two other Indonesian-English CLQA systems revealed that our system achieved a higher accuracy score.
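
    The four-module structure can be pictured as the thin Python skeleton below, in which every component is a stub passed in by the caller; the real modules (the SVM classifiers, the Indonesian-English dictionary, and the IDF-based retrieval) are only summarized in the comments, and all names here are illustrative.

        # Skeleton of the four-module CLQA pipeline described in the abstract.
        def answer_question(question, analyze, translate, retrieve, find_answer):
            analysis = analyze(question)                    # question analyzer: EAT + keywords
            en_keywords = translate(analysis["keywords"])   # keyword translator (dictionary-based)
            passages = retrieve(en_keywords)                # passage retriever (boolean queries + IDF)
            return find_answer(passages, analysis)          # answer finder (SVM text chunking)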

  • Statistical-Based Approach to Non-segmented Language Processing

    Virach SORNLERTLAMVANICH  Thatsanee CHAROENPORN  Shisanu TONGCHIM  Canasai KRUENGKRAI  Hitoshi ISAHARA

    PAPER
    Vol: E90-D No:10  Page(s): 1565-1573

    Several approaches have been studied to cope with the distinctive features of non-segmented languages. When there is no explicit information about word boundaries, segmenting an input text is a formidable task in language processing. Not only a contemporary word list but also the usages of those words have to be maintained to cover current texts. The accuracy and efficiency of higher-level processing rely heavily on this word boundary identification task. In this paper, we introduce statistics-based approaches to tackle the ambiguity in word segmentation. The word boundary identification problem is then treated as one part of a unified approach to language processing. To demonstrate this unified approach, we selectively study the tasks of language identification, word extraction, and dictionary-less search engines.
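
    As a generic illustration of statistical word boundary identification (not the specific approaches evaluated in the paper), the Python sketch below segments an unsegmented string by dynamic programming over unigram word probabilities; the parameter names and the out-of-vocabulary fallback are assumptions for the example.

        import math

        def segment(text, word_probs, max_len=8, oov_logp=-20.0):
            """Return the segmentation of `text` maximizing the sum of unigram
            log-probabilities; unknown single characters fall back to `oov_logp`."""
            n = len(text)
            best = [(-math.inf, 0)] * (n + 1)   # (best log-prob, start of last word)
            best[0] = (0.0, 0)
            for end in range(1, n + 1):
                for start in range(max(0, end - max_len), end):
                    word = text[start:end]
                    if word in word_probs:
                        logp = math.log(word_probs[word])
                    elif len(word) == 1:
                        logp = oov_logp
                    else:
                        continue
                    score = best[start][0] + logp
                    if score > best[end][0]:
                        best[end] = (score, start)
            words, pos = [], n                  # backtrack through the table
            while pos > 0:
                start = best[pos][1]
                words.append(text[start:pos])
                pos = start
            return list(reversed(words))

        # segment("thisisatest", {"this": 0.4, "is": 0.3, "a": 0.1, "test": 0.2})
        # -> ['this', 'is', 'a', 'test']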

  • On the Generative Powers of Some Extensions of Minimal Linear Grammars

    Kaoru ONODERA

    PAPER-Automata and Formal Language Theory
    Vol: E90-D No:6  Page(s): 895-904

    This paper concerns the Geffert normal forms for phrase structure grammars. We first generalize them to have a new formulation of minimal linear grammars with cancellation productions, called "cancel minimal linear grammars". Then the generative powers of some classes of those grammars are investigated. It is shown that the class of languages generated by grammars with a unique {AB}-cancellation production properly includes the class of linear languages, while it is included in the class of context-free languages. Furthermore, the corresponding class of languages generated by grammars with a unique {AA}-cancellation production is shown to be a proper subclass of linear languages.

  • Application of the CKY Algorithm to Recognition of Tree Structures for Linear, Monadic Context-Free Tree Grammars

    Akio FUJIYOSHI

    PAPER-Formal Languages
    Vol: E90-D No:2  Page(s): 388-394

    In this paper, a recognition algorithm for the class of tree languages generated by linear, monadic context-free tree grammars (LM-CFTGs) is proposed. LM-CFTGs define an important class of tree languages because LM-CFTGs are weakly equivalent to tree adjoining grammars (TAGs). The algorithm uses the CKY algorithm as a subprogram and recognizes whether an input tree can be derived from a given LM-CFTG in O(n^4) time, where n is the number of nodes of the input tree.
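
    For reference, the classic CKY recognizer for a context-free grammar in Chomsky normal form, which the proposed algorithm uses as a subprogram, looks as follows in Python; the grammar encoding chosen here is illustrative. String recognition with this table-filling runs in O(n^3), while the paper's tree version costs O(n^4) in the number of tree nodes.

        # CKY recognition for a CFG in Chomsky normal form.
        from itertools import product

        def cky_recognize(words, terminal_rules, binary_rules, start="S"):
            """terminal_rules: {terminal: {nonterminals}} for A -> a;
               binary_rules:   {(B, C): {nonterminals}} for A -> B C."""
            n = len(words)
            table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
            for i, w in enumerate(words):
                table[i][i + 1] = set(terminal_rules.get(w, ()))
            for span in range(2, n + 1):
                for i in range(n - span + 1):
                    j = i + span
                    for k in range(i + 1, j):
                        for B, C in product(table[i][k], table[k][j]):
                            table[i][j] |= binary_rules.get((B, C), set())
            return start in table[0][n]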

  • Incremental Language Modeling for Automatic Transcription of Broadcast News

    Katsutoshi OHTSUKI  Long NGUYEN

    PAPER-Speech and Hearing
    Vol: E90-D No:2  Page(s): 526-532

    In this paper, we address the task of incremental language modeling for automatic transcription of broadcast news speech. Daily broadcast news naturally contains new words that are not in the lexicon of the speech recognition system but are important for downstream applications such as information retrieval or machine translation. To recognize those new words, the lexicon and the language model of the speech recognition system need to be updated periodically. We propose a method of estimating a list of words to be added to the lexicon based on some time-series text data. The experimental results on the RT04 Broadcast News data and other TV audio data showed that this method provided an impressive and stable reduction in both out-of-vocabulary rates and speech recognition word error rates.
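
    A simplified sketch of the vocabulary-update step is shown below: out-of-lexicon words in the latest time-series text are ranked by frequency and proposed as additions before the language model is re-estimated. The thresholds and function names are illustrative assumptions, not the estimation method of the paper.

        # Propose lexicon additions from recent text data (illustrative only).
        from collections import Counter

        def propose_new_words(recent_texts, lexicon, min_count=3, top_k=1000):
            """Count words in the latest texts and return frequent out-of-lexicon
            words as candidates to add to the recognizer's vocabulary."""
            counts = Counter(w for text in recent_texts for w in text.split())
            candidates = [(w, c) for w, c in counts.items()
                          if w not in lexicon and c >= min_count]
            candidates.sort(key=lambda wc: wc[1], reverse=True)
            return [w for w, _ in candidates[:top_k]]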

  • Analogical Conception of Chomsky Normal Form and Greibach Normal Form for Linear, Monadic Context-Free Tree Grammars

    Akio FUJIYOSHI

    PAPER-Automata and Formal Language Theory
    Vol: E89-D No:12  Page(s): 2933-2938

    This paper presents the analogical conception of Chomsky normal form and Greibach normal form for linear, monadic context-free tree grammars (LM-CFTGs). LM-CFTGs generate the same class of languages as four well-known mildly context-sensitive grammars. It will be shown that any LM-CFTG can be transformed into equivalent ones in both normal forms. As Chomsky normal form and Greibach normal form for context-free grammars (CFGs) play a very important role in the study of formal properties of CFGs, it is expected that the Chomsky-like normal form and the Greibach-like normal form for LM-CFTGs will provide deeper analyses of the class of languages generated by mildly context-sensitive grammars.
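
    For reference, the string-grammar normal forms being generalized restrict the productions of a context-free grammar to the following shapes (the standard textbook formulation, with A, B, C, B_i nonterminals and a a terminal); the tree-grammar analogues defined in the paper are not reproduced here.

        \begin{align*}
        \text{Chomsky normal form:}  \quad & A \rightarrow B\,C \quad\text{or}\quad A \rightarrow a, \\
        \text{Greibach normal form:} \quad & A \rightarrow a\,B_1 B_2 \cdots B_k \qquad (k \ge 0).
        \end{align*}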

  • Formal Design of Arithmetic Circuits Based on Arithmetic Description Language

    Naofumi HOMMA  Yuki WATANABE  Takafumi AOKI  Tatsuo HIGUCHI

    PAPER-Circuit Synthesis
    Vol: E89-A No:12  Page(s): 3500-3509

    This paper presents a formal design of arithmetic circuits using an arithmetic description language called ARITH. The key idea in ARITH is to describe arithmetic algorithms directly with high-level mathematical objects (i.e., number representation systems and arithmetic operations/formulae). Using ARITH, we can provide formal description of arithmetic algorithms including those using unconventional number systems. In addition, the described arithmetic algorithms can be formally verified by equivalence checking with formula manipulations. The verified ARITH descriptions are easily translated into the equivalent HDL descriptions. In this paper, we also present an application of ARITH to an arithmetic module generator, which supports a variety of hardware algorithms for 2-operand adders, multi-operand adders, multipliers, constant-coefficient multipliers and multiply accumulators. The language processing system of ARITH incorporated in the generator verifies the correctness of ARITH descriptions in a formal method. As a result, we can obtain highly-reliable arithmetic modules whose functions are completely verified at the algorithm level.
