
Keyword Search Result

[Keyword] language (282 hits)

Showing results 101-120 of 282

  • Training Set Selection for Building Compact and Efficient Language Models

    Keiji YASUDA  Hirofumi YAMAMOTO  Eiichiro SUMITA

    PAPER-Natural Language Processing
    Vol: E92-D No:3  Page(s): 506-511

    Statistical language model training requires corpora matched to the target domain. However, training corpora sometimes include both target-domain-matched and unmatched sentences, in which case training set selection is effective for both reducing model size and improving model performance. In this paper, a training set selection method for statistical language model training is described. The method provides two advantages: it improves language model performance, and it reduces the computational load of the language model. The method has four steps. 1) Sentence clustering is applied to all available corpora. 2) A language model is trained on each cluster. 3) Perplexity on the development set is calculated using each of these language models. 4) The final language model is trained on the clusters whose language models yield low perplexities. The experimental results indicate that the language model trained on the data selected by our method gives lower perplexity on an open test set than a language model trained on all available corpora.
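
    As a rough illustration of the four-step procedure, the Python sketch below assumes whitespace-tokenized sentences, an add-one-smoothed unigram model standing in for the paper's language models, and clusters computed beforehand (step 1); the function and parameter names (train_unigram, select_training_set, keep_ratio) are illustrative, not from the paper.

        # Minimal sketch of perplexity-based training-set selection (illustrative only).
        import math
        from collections import Counter

        def train_unigram(sentences):
            """Add-one-smoothed unigram model over whitespace-tokenized sentences."""
            counts = Counter(w for s in sentences for w in s.split())
            total = sum(counts.values())
            vocab = len(counts) + 1  # +1 reserves mass for unseen words
            return lambda w: (counts[w] + 1) / (total + vocab)

        def perplexity(model, sentences):
            logp, n = 0.0, 0
            for s in sentences:
                for w in s.split():
                    logp += math.log(model(w))
                    n += 1
            return math.exp(-logp / max(n, 1))

        def select_training_set(clusters, dev_set, keep_ratio=0.5):
            """Steps 2-4: train a model per cluster, score it on the dev set,
            and keep the clusters whose models give the lowest perplexity."""
            scored = sorted(clusters, key=lambda c: perplexity(train_unigram(c), dev_set))
            kept = scored[:max(1, int(len(scored) * keep_ratio))]
            return [s for cluster in kept for s in cluster]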

  • Polynomial Time Inductive Inference of TTSP Graph Languages from Positive Data

    Ryoji TAKAMI  Yusuke SUZUKI  Tomoyuki UCHIDA  Takayoshi SHOUDAI

    PAPER
    Vol: E92-D No:2  Page(s): 181-190

    Two-Terminal Series Parallel (TTSP, for short) graphs are used as data models in applications for electric networks and scheduling problems. We propose a TTSP term graph which is a TTSP graph having structured variables, that is, a graph pattern over a TTSP graph. Let TG_TTSP be the set of all TTSP term graphs whose variable labels are mutually distinct. For a TTSP term graph g in TG_TTSP, the TTSP graph language of g, denoted by L(g), is the set of all TTSP graphs obtained from g by substituting arbitrary TTSP graphs for all variables in g. Firstly, when a TTSP graph G and a TTSP term graph g are given as inputs, we present a polynomial time matching algorithm which decides whether or not L(g) contains G. The minimal language problem for the class L_TTSP = {L(g) | g ∈ TG_TTSP} is, given a set S of TTSP graphs, to find a TTSP term graph g in TG_TTSP such that L(g) is minimal among all TTSP graph languages which contain all TTSP graphs in S. Secondly, we give a polynomial time algorithm for solving the minimal language problem for L_TTSP. Finally, we show that L_TTSP is polynomial time inductively inferable from positive data.
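
    The series/parallel structure of TTSP graphs, and the substitution of graphs for variables that defines L(g), can be pictured with the simplified recursive representation below. This is only an illustrative Python sketch (the paper works on the graphs themselves, with variables carrying two distinguished ports); all class and function names are invented here.

        # Illustrative recursive view of TTSP graphs: every TTSP graph is a single
        # edge or the series/parallel composition of two smaller TTSP graphs;
        # a TTSP term graph additionally contains structured variables.
        from dataclasses import dataclass
        from typing import Union

        @dataclass
        class Edge:
            label: str

        @dataclass
        class Series:          # connect the sink of g1 to the source of g2
            g1: "TTSP"
            g2: "TTSP"

        @dataclass
        class Parallel:        # identify the sources and the sinks of g1 and g2
            g1: "TTSP"
            g2: "TTSP"

        @dataclass
        class Variable:        # structured variable of a TTSP term graph
            name: str

        TTSP = Union[Edge, Series, Parallel, Variable]

        def substitute(g: TTSP, binding: dict) -> TTSP:
            """Replace every variable by the TTSP graph bound to its name;
            applying this to a term graph g yields one member of L(g)."""
            if isinstance(g, Variable):
                return binding[g.name]
            if isinstance(g, (Series, Parallel)):
                return type(g)(substitute(g.g1, binding), substitute(g.g2, binding))
            return g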

  • Monotone Increasing Binary Similarity and Its Application to Automatic Document-Acquisition of a Category

    Izumi SUZUKI  Yoshiki MIKAMI  Ario OHSATO

    PAPER-Knowledge Acquisition
    Vol: E91-D No:11  Page(s): 2545-2551

    A technique that acquires documents in the same category as a given short text is introduced. Regarding the given text as a training document, the system marks the most similar document, or sufficiently similar documents, in the document domain (or the entire Web). The system then adds the marked documents to the training set, learns from the enlarged set, and repeats this process until no more documents are marked. Requiring the similarity to increase monotonically as the system learns enables it to 1) detect the correct point at which no more documents remain to be marked and 2) decide the threshold value that the classifier uses. In addition, under the condition that normalization is limited to dividing term weights by a p-norm of the weights, the linear classifier in which training documents are indexed in a binary manner is the only instance that satisfies the monotone increasing property. The feasibility of the proposed technique was confirmed through an examination of binary similarity, using English and German documents randomly selected from the Web.
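
    A minimal Python sketch of the acquisition loop, together with a binary, p-norm-normalized similarity, is given below. The function names, the threshold handling, and the whitespace tokenization are illustrative assumptions, not the paper's exact formulation.

        # Sketch of the iterative document-acquisition loop (illustrative only).
        def acquire_category(seed_text, domain_docs, similarity, threshold):
            """Grow a training set from one seed text: repeatedly mark documents
            sufficiently similar to the current training set, until none remain."""
            training = [seed_text]
            remaining = list(domain_docs)
            while True:
                marked = [d for d in remaining if similarity(training, d) >= threshold]
                if not marked:
                    return training
                training.extend(marked)
                remaining = [d for d in remaining if d not in marked]

        def binary_similarity(training, doc, p=2):
            """Linear classifier with binary (presence/absence) term weights,
            normalized by the p-norm of the weight vector."""
            vocab = {w for t in training for w in t.split()}
            weights = {w: 1.0 for w in vocab}                       # binary indexing
            norm = (sum(v ** p for v in weights.values()) ** (1.0 / p)) or 1.0
            return sum(weights[w] for w in vocab & set(doc.split())) / norm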

  • Some Results on Primitive Words, Square-Free Words, and Disjunctive Languages

    Tetsuo MORIYA

    LETTER-Automata and Formal Language Theory
    Vol: E91-D No:10  Page(s): 2514-2516

    In this paper, we give some results on primitive words, square-free words, and disjunctive languages. We show that for a word u ∈ Σ+, every element of λ(cp(u)) is d-primitive iff it is square-free, where cp(u) is the set of all cyclic permutations of u and λ(cp(u)) is the set of their primitive roots. Next we show that p^m q^n is a primitive word for all n, m ≥ 1 and primitive words p, q, under the condition that |p| = |q| and (m, n) ≠ (1, 1). We also give a condition for a language to be disjunctive.
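
    For readers who want to experiment with the basic notions, the Python helpers below implement the standard definitions of primitive roots, cyclic permutations, and square-freeness (d-primitivity is not implemented); they are generic textbook routines, not code from the letter.

        def primitive_root(u: str) -> str:
            """Shortest word p with u = p^k; u is primitive iff primitive_root(u) == u."""
            n = len(u)
            for d in range(1, n + 1):
                if n % d == 0 and u[:d] * (n // d) == u:
                    return u[:d]
            return u

        def is_primitive(u: str) -> bool:
            return primitive_root(u) == u

        def is_square_free(u: str) -> bool:
            """True iff u contains no factor of the form xx with x non-empty."""
            n = len(u)
            return not any(u[i:i + l] == u[i + l:i + 2 * l]
                           for l in range(1, n // 2 + 1)
                           for i in range(0, n - 2 * l + 1))

        def cyclic_permutations(u: str) -> set:
            return {u[i:] + u[:i] for i in range(len(u))}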

  • A Method for Recognizing Noisy Romanized Japanese Words in Learner English

    Ryo NAGATA  Jun-ichi KAKEGAWA  Hiromi SUGIMOTO  Yukiko YABUTA

    PAPER-Educational Technology
    Vol: E91-D No:10  Page(s): 2458-2466

    This paper describes a method for recognizing romanized Japanese words in learner English. Such words become noise and cause problems in a variety of systems and tools for language learning and teaching, including text analysis, spell checking, and grammatical error detection, because they are Japanese words and thus mostly unknown to those systems and tools. A further difficulty is that the spelling rules of romanized Japanese are often violated in learner English. To address this problem, the described method uses a clustering algorithm reinforced by a small set of rules. Experiments show that it achieves an F-measure of 0.879 and outperforms other methods. They also show that it requires only the target text and an English word list of reasonable size.
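
    The rule-based side of such a recognizer can be approximated by checking whether a token parses into Hepburn-style romaji syllables, as in the rough Python sketch below. The syllable pattern is a simplified guess (it ignores geminate consonants, for example) and is not the paper's rule set; the clustering step is omitted entirely.

        # Rough rule-based check: does the whole token segment into romaji-like
        # syllables (optional consonant cluster + vowel, or a syllabic "n")?
        import re

        ROMAJI_SYLLABLE = re.compile(
            r"(?:(?:ky|gy|sh|ch|ny|hy|my|ry|by|py|ts|[kgsztdnhbpmyrwjf]?)[aiueo]|n)",
            re.IGNORECASE,
        )

        def looks_like_romaji(token: str) -> bool:
            pos = 0
            while pos < len(token):
                m = ROMAJI_SYLLABLE.match(token, pos)
                if not m:
                    return False
                pos = m.end()
            return True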

  • On Algebraic Properties of Delay-Nonconflicting Languages in Supervisory Control under Communication Delays

    Jung-Min YANG  Seong-Jin PARK

    LETTER-Systems and Control
    Vol: E91-A No:8  Page(s): 2237-2239

    In networked control systems, uncontrollable events may unexpectedly occur in a plant before a proper control action is applied to the plant due to communication delays. In the area of supervisory control of discrete event systems, Park and Cho [5] proposed the notion of delay-nonconflictingness for the existence of a supervisor achieving a given language specification under communication delays. In this paper, we present the algebraic properties of delay-nonconflicting languages which are necessary for solving supervisor synthesis problems under communication delays. Specifically, we show that the class of prefix-closed and delay-nonconflicting languages is closed under intersection, which leads to the existence of a unique infimal prefix-closed and delay-nonconflicting superlanguage of a given language specification.

  • Decomposition of Task-Level Concurrency on C Programs Applied to the Design of Multiprocessor SoC

    Mohammad ZALFANY URFIANTO  Tsuyoshi ISSHIKI  Arif ULLAH KHAN  Dongju LI  Hiroaki KUNIEDA

    PAPER-VLSI Design Technology and CAD
    Vol: E91-A No:7  Page(s): 1748-1756

    This paper presents a simple extension that assists the decomposition of task-level concurrency within C programs. The concurrency decomposition is meant to serve as the entry point of the design flow for Multiprocessor System-on-Chip (MPSoC) architectures. Our methodology allows the (re)use of readily available reference C programs and enables easy and rapid exploration of alternative task partitioning strategies, a crucial step that greatly influences the overall quality of the designed MPSoC. A test case using a JPEG encoder application has been carried out, and the results are presented in this paper.

  • Study of Spatial Configurations of Equipment for Online Sign Interpretation Service

    Kaoru NAKAZONO  Saori TANAKA

    PAPER-Media Communication
    Vol: E91-D No:6  Page(s): 1613-1621

    This paper discusses the design of videophone equipment configurations for online sign interpretation. We classified interpretation services into three types of situation: on-site interpretation, partial online interpretation, and full online interpretation. For each situation, the spatial configurations of the equipment are considered with the issue of nonverbal signals in mind. Simulation experiments of sign interpretation were performed using these spatial configurations, and the quality of each configuration was assessed. The preferred configurations had the common characteristic that the hearing subject could see the face of his/her principal conversation partner, that is, the deaf subject. The results imply that hearing people who do not understand sign language use nonverbal signals to facilitate interpreter-mediated conversation.

  • On the Use of Structures for Spoken Language Understanding: A Two-Step Approach

    Minwoo JEONG  Gary Geunbae LEE

    PAPER-Natural Language Processing
    Vol: E91-D No:5  Page(s): 1552-1561

    Spoken language understanding (SLU) aims to map a user's speech into a semantic frame. Since most previous work exploits semantic structures for SLU, we verify that such structure is useful even for noisy input. We apply a structured prediction method to the SLU problem and compare it to an unstructured one. In addition, we present a combined method that embeds long-distance dependencies between entities in a cascaded manner. On air travel data, we show that our approach improves performance over baseline models.

  • Query Language for Location-Based Services: A Model Checking Approach

    Christian HOAREAU  Ichiro SATOH

    PAPER-Ubiquitous Computing
    Vol: E91-D No:4  Page(s): 976-985

    We present a model checking approach to the rationale, implementation, and applications of a query language for location-based services. Such query mechanisms are necessary so that users, objects, and/or services can effectively benefit from the location-awareness of their surrounding environment. The underlying data model is founded on a symbolic model of space organized in a tree structure. Once extended to a semantic model for modal logic, location query processing can be regarded as a model checking problem, and location queries are thus defined as hybrid logic-based formulas. Our approach differs from existing research in that it explores the connection between location models and query processing in ubiquitous computing systems, relies on a sound theoretical basis, and provides modal logic-based query mechanisms for expressive searches over a decentralized data structure. A prototype implementation is also presented and discussed.
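
    A toy version of the underlying idea, a symbolic location model organized as a tree and queried for containment, is sketched below in Python. The class, method, and location names are invented for illustration; the paper's hybrid-logic query language is far more expressive than this.

        # Symbolic space model as a tree of named locations, with a simple
        # containment query evaluated by walking the tree.
        class Location:
            def __init__(self, name, children=()):
                self.name = name
                self.children = list(children)

            def find(self, name):
                """Return the subtree rooted at the location called `name`, if any."""
                if self.name == name:
                    return self
                for child in self.children:
                    hit = child.find(name)
                    if hit is not None:
                        return hit
                return None

            def contains(self, name):
                """Query with a model-checking flavour: is `name` somewhere below self?"""
                return self.find(name) is not None

        # Example hierarchy (names are invented).
        building = Location("building", [
            Location("floor1", [Location("room101"), Location("room102")]),
            Location("floor2", [Location("room201", [Location("printer")])]),
        ])
        print(building.contains("printer"))                  # True
        print(building.find("floor2").contains("room101"))   # False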

  • Automatic Language Identification with Discriminative Language Characterization Based on SVM

    Hongbin SUO  Ming LI  Ping LU  Yonghong YAN

    PAPER-Language Identification
    Vol: E91-D No:3  Page(s): 567-575

    Robust automatic language identification (LID) is the task of identifying the language of a short utterance spoken by an unknown speaker. The mainstream approaches include parallel phone recognition language modeling (PPRLM), support vector machines (SVM) and general Gaussian mixture models (GMMs). These systems map the cepstral features of spoken utterances into high-level scores using classifiers. In this paper, in order to increase the dimension of the score vector and alleviate the inter-speaker variability within the same language, multiple data groups based on supervised speaker clustering are employed to generate discriminative language characterization score vectors (DLCSV). Back-end SVM classifiers are used to model the probability distribution of each target language in the DLCSV space. Finally, the output scores of the back-end classifiers are calibrated by a pair-wise posterior probability estimation (PPPE) algorithm. The proposed language identification frameworks are evaluated on the 2003 NIST Language Recognition Evaluation (LRE) databases, and the experiments show that the system described in this paper produces results comparable to those of existing systems. In particular, the SVM framework achieves an equal error rate (EER) of 4.0% in the 30-second task and outperforms state-of-the-art systems by more than 30% relative error reduction. In addition, the proposed PPRLM and GMM systems achieve EERs of 5.1% and 5.0%, respectively.
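
    The overall flow from per-cluster scores to a back-end SVM can be sketched as follows. This Python fragment assumes NumPy and scikit-learn are available, stubs out the front-end per-cluster scorers, and uses Platt-style probability outputs merely as a stand-in for the PPPE calibration described above; none of the names are from the paper.

        # Forming discriminative language characterization score vectors (DLCSV)
        # from per-(language, speaker-cluster) scorers and training a back-end SVM.
        import numpy as np
        from sklearn.svm import SVC

        def dlcsv(utterance_features, cluster_models):
            """Concatenate the scores of every per-cluster model into one vector,
            raising the dimension of the score space."""
            return np.array([model(utterance_features) for model in cluster_models])

        def train_backend(train_features, train_langs, cluster_models):
            X = np.vstack([dlcsv(f, cluster_models) for f in train_features])
            backend = SVC(kernel="linear", probability=True)  # probability output
            backend.fit(X, train_langs)                        # stands in for PPPE
            return backend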

  • Language Modeling Using PLSA-Based Topic HMM

    Atsushi SAKO  Tetsuya TAKIGUCHI  Yasuo ARIKI

    PAPER-Language Modeling
    Vol: E91-D No:3  Page(s): 522-528

    In this paper, we propose a PLSA-based language model for sports-related live speech. This model is implemented using a unigram rescaling technique that combines a topic model and an n-gram. In the conventional method, unigram rescaling is performed with a topic distribution estimated from a recognized transcription history. This method can improve the performance, but it cannot express topic transition. By incorporating the concept of topic transition, it is expected that the recognition performance will be improved. Thus, the proposed method employs a "Topic HMM" instead of a history to estimate the topic distribution. The Topic HMM is an Ergodic HMM that expresses typical topic distributions as well as topic transition probabilities. Word accuracy results from our experiments confirmed the superiority of the proposed method over a trigram and a PLSA-based conventional method that uses a recognized history.
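
    Unigram rescaling itself can be summarized in one line: the n-gram probability is scaled by how strongly the topic model favours a word relative to its unigram probability. The Python function below is a generic sketch (the topic distribution would come from the Topic HMM in the proposed method), the exponent beta is an illustrative parameter, and renormalization over the vocabulary is left implicit.

        def rescaled_prob(word, history, ngram_prob, topic_prob, unigram_prob, beta=1.0):
            """Unnormalized unigram-rescaling score:
            P_ngram(w|h) * (P_topic(w) / P_unigram(w)) ** beta."""
            return ngram_prob(word, history) * (topic_prob(word) / unigram_prob(word)) ** beta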

  • Learning of Finite Unions of Tree Patterns with Internal Structured Variables from Queries

    Satoshi MATSUMOTO  Takayoshi SHOUDAI  Tomoyuki UCHIDA  Tetsuhiro MIYAHARA  Yusuke SUZUKI

    PAPER-Algorithmic Learning Theory
    Vol: E91-D No:2  Page(s): 222-230

    A linear term tree is defined as an edge-labeled rooted tree pattern with ordered children and internal structured variables whose labels are mutually distinct. A variable can be replaced with arbitrary edge-labeled rooted ordered trees. We consider the polynomial time learnability of finite unions of linear term trees in the exact learning model formalized by Angluin. The language L(t) of a linear term tree t is the set of all trees obtained from t by substituting arbitrary edge-labeled rooted ordered trees for all variables in t. Moreover, for a finite set S of linear term trees, we define L(S) = ∪_{t∈S} L(t). A target of learning, denoted by T*, is a finite set of linear term trees, where the number of edge labels is infinite. In this paper, for any set T* of m linear term trees (m ≥ 0), we present a query learning algorithm which exactly identifies T* in polynomial time using at most 2mn^2 Restricted Subset queries and at most m+1 Equivalence queries, where n is the maximum size of counterexamples. Finally, we note that finite sets of linear term trees are not learnable in polynomial time using Restricted Equivalence, Membership and Subset queries.

  • A Machine Learning Approach for an Indonesian-English Cross Language Question Answering System

    Ayu PURWARIANTI  Masatoshi TSUCHIYA  Seiichi NAKAGAWA

    PAPER-Natural Language Processing
    Vol: E90-D No:11  Page(s): 1841-1852

    We have built a CLQA (Cross Language Question Answering) system for a source language with limited data resources (e.g. Indonesian) using a machine learning approach. The CLQA system consists of four modules: question analyzer, keyword translator, passage retriever and answer finder. We used machine learning in two modules, the question classifier (part of the question analyzer) and the answer finder. In the question classifier, we classify the EAT (Expected Answer Type) of a question using an SVM (Support Vector Machine). The features for this classification module are basically the output of our shallow question parsing module. To improve the classification score, we use statistical information extracted from our Indonesian corpus. In the answer finder module, unlike the common approach in which the answer is located by matching the named entities of the corpus text with the EAT of the question, we locate the answer by text chunking over the corpus text. The features for the SVM-based text chunking process consist of question features, corpus-text features and similarity scores between the corpus text and the question keywords. In this way, we eliminate the named entity tagging process for the target document. As for the keyword translator module, we use an Indonesian-English dictionary to translate Indonesian keywords into English. We also use some simple patterns to transform borrowed English words. The keywords are then combined in boolean queries in order to retrieve relevant passages using IDF scores. We first conducted an experiment using 2,837 questions (about 10% of which were used as test data) obtained from 18 Indonesian college students. We then conducted a similar experiment using the NTCIR (NII Test Collection for IR Systems) 2005 CLQA task by translating the English questions into Indonesian. Compared to the Japanese-English and Chinese-English CLQA results in NTCIR 2005, our system is superior to the others except for one system that uses rich data resources (three dictionaries). Further, a rough comparison with two other Indonesian-English CLQA systems revealed that our system achieved a higher accuracy score.
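
    The four-module structure can be pictured as the thin Python skeleton below, in which every component is a stub passed in by the caller; the real modules (the SVM classifiers, the Indonesian-English dictionary, and the IDF-based retrieval) are only summarized in the comments, and all names here are illustrative.

        # Skeleton of the four-module CLQA pipeline described in the abstract.
        def answer_question(question, analyze, translate, retrieve, find_answer):
            analysis = analyze(question)                    # question analyzer: EAT + keywords
            en_keywords = translate(analysis["keywords"])   # keyword translator (dictionary-based)
            passages = retrieve(en_keywords)                # passage retriever (boolean queries + IDF)
            return find_answer(passages, analysis)          # answer finder (SVM text chunking)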

  • Statistical-Based Approach to Non-segmented Language Processing

    Virach SORNLERTLAMVANICH  Thatsanee CHAROENPORN  Shisanu TONGCHIM  Canasai KRUENGKRAI  Hitoshi ISAHARA

    PAPER
    Vol: E90-D No:10  Page(s): 1565-1573

    Several approaches have been studied to cope with the distinctive features of non-segmented languages. When there is no explicit information about word boundaries, segmenting an input text is a formidable task in language processing. Not only a contemporary word list but also the usages of those words have to be maintained to cover current texts. The accuracy and efficiency of higher-level processing rely heavily on this word boundary identification task. In this paper, we introduce statistics-based approaches to tackle the ambiguity in word segmentation. The word boundary identification problem is then treated as one part of a unified approach to language processing. To demonstrate this unified approach, we selectively study the tasks of language identification, word extraction, and dictionary-less search engines.
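
    As a generic illustration of statistical word boundary identification (not the specific approaches evaluated in the paper), the Python sketch below segments an unsegmented string by dynamic programming over unigram word probabilities; the parameter names and the out-of-vocabulary fallback are assumptions for the example.

        import math

        def segment(text, word_probs, max_len=8, oov_logp=-20.0):
            """Return the segmentation of `text` maximizing the sum of unigram
            log-probabilities; unknown single characters fall back to `oov_logp`."""
            n = len(text)
            best = [(-math.inf, 0)] * (n + 1)   # (best log-prob, start of last word)
            best[0] = (0.0, 0)
            for end in range(1, n + 1):
                for start in range(max(0, end - max_len), end):
                    word = text[start:end]
                    if word in word_probs:
                        logp = math.log(word_probs[word])
                    elif len(word) == 1:
                        logp = oov_logp
                    else:
                        continue
                    score = best[start][0] + logp
                    if score > best[end][0]:
                        best[end] = (score, start)
            words, pos = [], n                  # backtrack through the table
            while pos > 0:
                start = best[pos][1]
                words.append(text[start:pos])
                pos = start
            return list(reversed(words))

        # segment("thisisatest", {"this": 0.4, "is": 0.3, "a": 0.1, "test": 0.2})
        # -> ['this', 'is', 'a', 'test']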

  • On the Generative Powers of Some Extensions of Minimal Linear Grammars

    Kaoru ONODERA

    PAPER-Automata and Formal Language Theory
    Vol: E90-D No:6  Page(s): 895-904

    This paper concerns the Geffert normal forms for phrase structure grammars. We first generalize them to have a new formulation of minimal linear grammars with cancellation productions, called "cancel minimal linear grammars". Then the generative powers of some classes of those grammars are investigated. It is shown that the class of languages generated by grammars with a unique {AB}-cancellation production properly includes the class of linear languages, while it is included in the class of context-free languages. Furthermore, the corresponding class of languages generated by grammars with a unique {AA}-cancellation production is shown to be a proper subclass of linear languages.

  • Application of the CKY Algorithm to Recognition of Tree Structures for Linear, Monadic Context-Free Tree Grammars

    Akio FUJIYOSHI

    PAPER-Formal Languages
    Vol: E90-D No:2  Page(s): 388-394

    In this paper, a recognition algorithm for the class of tree languages generated by linear, monadic context-free tree grammars (LM-CFTGs) is proposed. LM-CFTGs define an important class of tree languages because LM-CFTGs are weakly equivalent to tree adjoining grammars (TAGs). The algorithm uses the CKY algorithm as a subprogram and recognizes whether an input tree can be derived from a given LM-CFTG in O(n^4) time, where n is the number of nodes of the input tree.
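
    For reference, the classic CKY recognizer for a context-free grammar in Chomsky normal form, which the proposed algorithm uses as a subprogram, looks as follows in Python; the grammar encoding chosen here is illustrative. String recognition with this table-filling runs in O(n^3), while the paper's tree version costs O(n^4) in the number of tree nodes.

        # CKY recognition for a CFG in Chomsky normal form.
        from itertools import product

        def cky_recognize(words, terminal_rules, binary_rules, start="S"):
            """terminal_rules: {terminal: {nonterminals}} for A -> a;
               binary_rules:   {(B, C): {nonterminals}} for A -> B C."""
            n = len(words)
            table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
            for i, w in enumerate(words):
                table[i][i + 1] = set(terminal_rules.get(w, ()))
            for span in range(2, n + 1):
                for i in range(n - span + 1):
                    j = i + span
                    for k in range(i + 1, j):
                        for B, C in product(table[i][k], table[k][j]):
                            table[i][j] |= binary_rules.get((B, C), set())
            return start in table[0][n]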

  • Incremental Language Modeling for Automatic Transcription of Broadcast News

    Katsutoshi OHTSUKI  Long NGUYEN

    PAPER-Speech and Hearing
    Vol: E90-D No:2  Page(s): 526-532

    In this paper, we address the task of incremental language modeling for automatic transcription of broadcast news speech. Daily broadcast news naturally contains new words that are not in the lexicon of the speech recognition system but are important for downstream applications such as information retrieval or machine translation. To recognize those new words, the lexicon and the language model of the speech recognition system need to be updated periodically. We propose a method of estimating a list of words to be added to the lexicon based on some time-series text data. The experimental results on the RT04 Broadcast News data and other TV audio data showed that this method provided an impressive and stable reduction in both out-of-vocabulary rates and speech recognition word error rates.
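
    A simplified sketch of the vocabulary-update step is shown below: out-of-lexicon words in the latest time-series text are ranked by frequency and proposed as additions before the language model is re-estimated. The thresholds and function names are illustrative assumptions, not the estimation method of the paper.

        # Propose lexicon additions from recent text data (illustrative only).
        from collections import Counter

        def propose_new_words(recent_texts, lexicon, min_count=3, top_k=1000):
            """Count words in the latest texts and return frequent out-of-lexicon
            words as candidates to add to the recognizer's vocabulary."""
            counts = Counter(w for text in recent_texts for w in text.split())
            candidates = [(w, c) for w, c in counts.items()
                          if w not in lexicon and c >= min_count]
            candidates.sort(key=lambda wc: wc[1], reverse=True)
            return [w for w, _ in candidates[:top_k]]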

  • Analogical Conception of Chomsky Normal Form and Greibach Normal Form for Linear, Monadic Context-Free Tree Grammars

    Akio FUJIYOSHI

    PAPER-Automata and Formal Language Theory
    Vol: E89-D No:12  Page(s): 2933-2938

    This paper presents the analogical conception of Chomsky normal form and Greibach normal form for linear, monadic context-free tree grammars (LM-CFTGs). LM-CFTGs generate the same class of languages as four well-known mildly context-sensitive grammars. It will be shown that any LM-CFTG can be transformed into equivalent ones in both normal forms. As Chomsky normal form and Greibach normal form for context-free grammars (CFGs) play a very important role in the study of formal properties of CFGs, it is expected that the Chomsky-like normal form and the Greibach-like normal form for LM-CFTGs will provide deeper analyses of the class of languages generated by mildly context-sensitive grammars.
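
    For reference, the string-grammar normal forms being generalized restrict the productions of a context-free grammar to the following shapes (the standard textbook formulation, with A, B, C, B_i nonterminals and a a terminal); the tree-grammar analogues defined in the paper are not reproduced here.

        \begin{align*}
        \text{Chomsky normal form:}  \quad & A \rightarrow B\,C \quad\text{or}\quad A \rightarrow a, \\
        \text{Greibach normal form:} \quad & A \rightarrow a\,B_1 B_2 \cdots B_k \qquad (k \ge 0).
        \end{align*}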

  • Formal Design of Arithmetic Circuits Based on Arithmetic Description Language

    Naofumi HOMMA  Yuki WATANABE  Takafumi AOKI  Tatsuo HIGUCHI

    PAPER-Circuit Synthesis
    Vol: E89-A No:12  Page(s): 3500-3509

    This paper presents a formal design of arithmetic circuits using an arithmetic description language called ARITH. The key idea in ARITH is to describe arithmetic algorithms directly with high-level mathematical objects (i.e., number representation systems and arithmetic operations/formulae). Using ARITH, we can provide formal description of arithmetic algorithms including those using unconventional number systems. In addition, the described arithmetic algorithms can be formally verified by equivalence checking with formula manipulations. The verified ARITH descriptions are easily translated into the equivalent HDL descriptions. In this paper, we also present an application of ARITH to an arithmetic module generator, which supports a variety of hardware algorithms for 2-operand adders, multi-operand adders, multipliers, constant-coefficient multipliers and multiply accumulators. The language processing system of ARITH incorporated in the generator verifies the correctness of ARITH descriptions in a formal method. As a result, we can obtain highly-reliable arithmetic modules whose functions are completely verified at the algorithm level.
