Omid DEHZANGI Bin MA Eng Siong CHNG Haizhou LI
This paper investigates a new method for fusing the scores generated by multiple classification sub-systems, which helps to further reduce the classification error rate in Spoken Language Recognition (SLR). In recent studies, a variety of effective classification algorithms have been developed for SLR. Hence, it has become common practice in the National Institute of Standards and Technology (NIST) Language Recognition Evaluations (LREs) to fuse the results from several classification sub-systems to boost the performance of SLR systems. In this work, we introduce a discriminative performance measure to optimize the fusion of 7 language classifiers developed as IIR's submission to the 2009 NIST LRE. We present an Error Corrective Fusion (ECF) method in which we iteratively learn the fusion weights to minimize the error rate of the fusion system. Experiments conducted on the 2009 NIST LRE corpus demonstrate a significant improvement compared to the individual sub-systems. A comparison study is also conducted to show the effectiveness of the ECF method.
Learning to rank refers to machine learning techniques for training a model in a ranking task. Learning to rank is useful for many applications in Information Retrieval, Natural Language Processing, and Data Mining. Intensive studies have been conducted on the problem and significant progress has been made [1],[2]. This short paper gives an introduction to learning to rank, and it specifically explains the fundamental problems, existing approaches, and future work of learning to rank. Several learning to rank methods using SVM techniques are described in detail.
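As an illustration of what an SVM-style learning to rank method can look like, the following is a minimal sketch of the pairwise approach: each pair of items with different relevance becomes a constraint on a linear scoring vector, trained with a hinge loss by subgradient descent. The function names, hyperparameters, and toy data are assumptions made for this example and are not taken from the paper.

```python
import random

def train_pairwise_rank_svm(features, relevance, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Learn a linear scoring vector w so that more relevant items score higher."""
    rng = random.Random(seed)
    dim = len(features[0])
    # Preference pairs (i, j): item i should be ranked above item j.
    pairs = [(i, j)
             for i in range(len(features))
             for j in range(len(features))
             if relevance[i] > relevance[j]]
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            diff = [a - b for a, b in zip(features[i], features[j])]
            margin = sum(wk * dk for wk, dk in zip(w, diff))
            # Subgradient step on reg/2 * ||w||^2 + max(0, 1 - w . diff).
            for k in range(dim):
                hinge_grad = -diff[k] if margin < 1.0 else 0.0
                w[k] -= lr * (reg * w[k] + hinge_grad)
    return w

def score(w, x):
    """Linear ranking score of one item."""
    return sum(wk * xk for wk, xk in zip(w, x))

# Toy data (made up): three items with graded relevance 2 > 1 > 0.
docs = [[1.0, 0.2], [0.6, 0.1], [0.1, 0.9]]
labels = [2, 1, 0]
w = train_pairwise_rank_svm(docs, labels)
ranking = sorted(range(len(docs)), key=lambda i: score(w, docs[i]), reverse=True)
print(ranking)
```

The sketch works on difference vectors, which is the usual way a ranking problem is reduced to binary classification; real systems would build pairs per query rather than over all items.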
Kaoru FUJIOKA Hirofumi KATSUNO
This paper concerns cancel minimal linear grammars ([5]), which were introduced to generalize Geffert normal forms for phrase structure grammars. We consider the generative power of restricted cancel minimal linear grammars: the grammars have only one nonterminal symbol C besides the start symbol S, and their productions consist of context-free type productions, whose left-hand side is S and whose right-hand side contains at most one occurrence of S, and a unique cancellation production C^m → ε that replaces the string C^m by the empty string ε. We show that, for any given positive integer m, the class of languages generated by cancel minimal linear grammars with C^m → ε is properly included in the class of linear languages. Conversely, we show that for any linear language L, there exists some positive integer m such that a cancel minimal linear grammar with C^m → ε generates L. We also show how the generative power of cancel minimal linear grammars with a unique cancellation production C^m → ε varies according to changes of m and restrictions imposed on occurrences of terminal symbols in the right-hand side of productions.
The Krivine-style evaluation mechanism is well known in the implementation of higher-order functions, as it avoids some useless closure building. A few type systems have been proposed that can verify the safety of the mechanism. Incorporating those proposals into an existing compiler, however, would require significant changes to the compiler's type system, because they rely on dedicated forms of types and typing rules. This limitation motivates us to propose an alternative lightweight Krivine typing mechanism that does not need to extend any existing type system significantly. This paper shows how GADTs (generalized algebraic data types) can be used for typing a ZINC machine following the Krivine-style evaluation mechanism. This idea is new as far as we know. Some existing typed compilers such as GHC (Glasgow Haskell Compiler) already support GADTs; they can benefit from the Krivine-style evaluation mechanism in the operational semantics with no particular extension to their type systems for safety. We show that the GHC type checker can mechanically prove that ZINC instructions are well-typed, which highlights the effectiveness of GADTs.
Prachya BOONKWAN Thepchai SUPNITHI
This paper presents a syntax-based framework for gap resolution in analytic languages. CCG, reputable for dealing with deletion under coordination, is extended with a memory mechanism similar to the slot-and-filler mechanism, resulting in wider coverage of syntactic gap patterns. Though our grammar formalism is more expressive than canonical CCG, its generative power is bounded by Partially Linear Indexed Grammar. Despite the spurious ambiguity originating from the memory mechanism, we also show that its probabilistic parsing is feasible by using the dual decomposition algorithm.
Rodion MOISEEV Shinpei HAYASHI Motoshi SAEKI
Object Constraint Language (OCL) is frequently applied in software development for stipulating formal constraints on software models. Its platform-independent characteristic allows for wide usage during the design phase. However, application in platform-specific processes, such as coding, is less obvious because it requires the use of bespoke tools for that platform. In this paper we propose an approach to generate assertion code for OCL constraints for multiple platform-specific languages, using a unified framework based on structural similarities of programming languages. We have succeeded in automating the process of assertion code generation for four different languages using our tool. To show the effectiveness of our approach in terms of development effort, an experiment was carried out and its results are summarised.
Michael PAUL Andrew FINCH Eiichiro SUMITA
This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source language text in order to improve the translation quality of statistical machine translation (SMT) approaches. The method can be applied to any language pair in which the source language is unsegmented and the target language segmentation is known. In the first step, an iterative bootstrap method is applied to learn multiple segmentation schemes that are consistent with the phrasal segmentations of an SMT system trained on the resegmented bitext. In the second step, multiple segmentation schemes are integrated into a single SMT system by characterizing the source language side and merging identical translation pairs of differently segmented SMT models. Experimental results translating five Asian languages into English revealed that the proposed method of integrating multiple segmentation schemes outperforms SMT models trained on any of the learned word segmentations and performs comparably to available monolingually built segmentation tools.
Yan DENG Wei-Qiang ZHANG Yan-Min QIAN Jia LIU
One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, various phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained for a known language, which is assumed to extract complementary information for effective fusion. However, this method is limited by the large amount of training samples for which word or phone level transcription is required. Also, score fusion is not the optimal method, as fusion at the feature or model level retains more information than fusion at the score level. This paper presents a new strategy to build and fuse parallel phone recognizers (PPR). This is achieved by training multiple acoustically diversified phone recognizers and fusing them at the feature level. The phone recognizers are trained on the same speech data but use different acoustic features and model training techniques. For the acoustic features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are both employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For the model training, we examine the use of the maximum likelihood and feature minimum phone error methods to train complementary acoustic models. In this study, we fuse the phonotactic features of the acoustically diversified phone recognizers using a simple linear fusion method to build the PPRVSM system. A novel logistic regression optimized weighting (LROW) approach is introduced for fusion factor optimization. The experimental results show that fusion at the feature level is more effective than at the score level. Moreover, the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. The best performing system reported in this paper achieves an equal error rate (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.
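For readers unfamiliar with the equal error rate quoted above, the following minimal sketch shows one common way to estimate it from detection scores: sweep a decision threshold and take the point where the false rejection rate and false acceptance rate are closest. The function name and the toy scores are assumptions for illustration; this is not the NIST scoring tool and not necessarily the procedure used by the authors.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Toy scores (made up): higher means "more likely the target language".
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
print(f"EER = {eer:.2%}")
```

With few trials the two error rates may never be exactly equal, which is why the sketch averages them at the closest point instead of interpolating.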
Junbin ZHANG Yong QI Di HOU Ming LI
Context-aware applications are a key aspect of pervasive computing. The core issue of context-aware application development is how to make the application behave suitably according to the changing context without coupling such context dependencies into the program. Several programming paradigms and languages have been proposed to facilitate development, but they either lack sufficient flexibility or are somewhat complex to program and deploy. A reference programming model is proposed in this paper to address the inadequacies of those approaches. In the model, virtual tables constructed by the system and maintained by the space manager connect the knowledge of both the developer and the space manager, while separating the dependency between context and application logic from the base program. The hierarchy and architecture of the model are presented, and implementation suggestions are also discussed. Validation and evaluation show that the programming model is lightweight and easy to implement and deploy. Moreover, the model brings better flexibility to the development of context-aware applications.
Pablo Rosales TEJADA Jae-Yoon JUNG
Ubiquitous technologies such as sensor networks and RFID have enabled companies to realize more rapid and agile manufacturing and service systems. In this paper, we address how the huge amount of real-time events coming from these devices can be filtered and integrated into business processes such as manufacturing, logistics, and supply chain processes. In particular, we focus on complex event processing of sensor and RFID events in order to integrate them into business rules in business activities. We also illustrate a ubiquitous event processing system, named ueFilter, which helps to filter and aggregate sensor events, to detect event patterns from sensors and RFID by means of event pattern languages (EPL), and to trigger event-condition-action (ECA) rules in logistics processes.
Chinese new words and their part-of-speech (POS) tags are particularly problematic in Chinese natural language processing. With the fast development of the internet and information technology, it is impossible to obtain a complete system dictionary for Chinese natural language processing, as new words outside the basic system dictionary are always being created. A latent semi-CRF model, which combines the strengths of LDCRF (Latent-Dynamic Conditional Random Field) and semi-CRF, is proposed to detect new words together with their POS simultaneously, regardless of the types of the new words, from Chinese text that has not been pre-segmented. Unlike the original semi-CRF, the LDCRF is applied to generate the candidate entities for training and testing the latent semi-CRF, which speeds up training and decreases the computation cost. The complexity of the latent semi-CRF can be further adjusted by tuning the number of hidden variables in the LDCRF and the number of candidate entities taken from the N-best outputs of the LDCRF. A new-words-generating framework is proposed for model training and testing, under which the definitions and distributions of the new words conform to those existing in real text. Specific features called "Global Fragment Information" for new word detection and POS tagging are adopted in model training and testing. The experimental results show that the proposed method is capable of detecting even low-frequency new words together with their POS tags. The proposed model is found to perform competitively with the state-of-the-art models presented.
Chul-Joo KIM Jeong-Han YUN Seonggun KIM Kwang-Moo CHOE Taisook HAN
Esterel is an imperative synchronous language for control-dominant reactive systems. Despite the imperative features of Esterel, the combination of parallel execution and preemption makes it difficult to build control flow graphs (CFGs) of Esterel programs. Simple and convenient CFGs can help to analyze Esterel programs. However, previous approaches are not suitable for flow analyses of imperative languages. In this work, we present a method to construct over-approximated CFGs for Pure Esterel. The generated CFGs expose invisible interferences among threads and show program structures explicitly, so that they are useful for program analyses based on graph theory or control and data flows.
The manner of a person's eye movement conveys much nonverbal information and emotional intent beyond speech. This paper describes work on expressing emotion through eye behaviors in virtual agents, based on parameters selected from the AU-Coded facial expression database and real-time eye movement data (pupil size, blink rate and saccade). A rule-based approach that utilizes the MPEG4 FAPs (facial animation parameters) to generate primary emotions (joyful, sad, angry, afraid, disgusted and surprised) and intermediate emotions (emotions that can be represented as a mixture of two primary emotions) is introduced. In addition, based on our research, we propose a scripting tool named EEMML (Emotional Eye Movement Markup Language) that enables authors to describe and generate the emotional eye movement of virtual agents.
Ryo NAGATA Jun-ichi KAKEGAWA Yukiko YABUTA
This paper proposes a topic-independent method for automatically scoring essay content. Unlike conventional topic-dependent methods, it predicts the human-assigned score of a given essay without training essays written on the same topic as the target essay. To achieve this, this paper introduces a new measure called MIDF that measures how important and relevant a word is in a given essay. The proposed method predicts the score relying on the distribution of MIDF. Surprisingly, experiments show that the proposed method achieves an accuracy of 0.848 and performs as well as or even better than conventional topic-dependent methods.
In this paper, we obtain some refinements of representation theorems for context-free languages by using Dyck languages, insertion systems, strictly locally testable languages, and morphisms. For instance, we improve the Chomsky-Schützenberger representation theorem and show that each context-free language L can be represented in the form L=h(D ∩ R), where D is a Dyck language, R is a strictly 3-testable language, and h is a morphism. A similar representation for context-free languages can be obtained using insertion systems of weight (3,0) and strictly 4-testable languages.
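Two of the ingredients named above, membership in a Dyck language D and the application of a morphism h, can be sketched as follows. This is only an illustration of the objects involved: the bracket alphabet and the particular morphism are made up for the example, the strictly locally testable language R is omitted, and nothing here reproduces the paper's construction.

```python
def is_dyck(word, pairs):
    """Check whether word is a well-nested string over the given bracket pairs."""
    closers = {close: open_ for open_, close in pairs.items()}
    stack = []
    for symbol in word:
        if symbol in pairs:                 # opening bracket: remember it
            stack.append(symbol)
        elif symbol in closers:             # closing bracket: must match the top
            if not stack or stack.pop() != closers[symbol]:
                return False
        else:                               # symbol outside the bracket alphabet
            return False
    return not stack                        # every opener must have been closed

def apply_morphism(word, h):
    """Apply a morphism given as a symbol-to-string mapping, symbol by symbol."""
    return "".join(h[symbol] for symbol in word)

# Hypothetical alphabet and morphism: square brackets are erased,
# round brackets are renamed to the terminals a and b.
PAIRS = {"(": ")", "[": "]"}
H = {"(": "a", ")": "b", "[": "", "]": ""}

print(is_dyck("([])()", PAIRS))     # well nested
print(is_dyck("([)]", PAIRS))       # crossing brackets
print(apply_morphism("([])", H))
```

Erasing some bracket pairs, as H does for the square brackets, is how a morphism can hide the bookkeeping symbols that a Dyck word carries.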
Tomohisa SANO Shiho Hoshi NOBESAWA Hiroyuki OKAMOTO Hiroya SUSUKI Masaki MATSUBARA Hiroaki SAITO
Toponyms and other named entities are major issues in unknown word processing. Our purpose is to salvage unknown toponyms, not only to avoid noise but also to provide them with information about the candidate areas to which they may belong. Most previous toponym resolution methods targeted disambiguation among area candidates, which is needed because the same toponym can exist in multiple places. These approaches were mostly based on gazetteers and contexts. When it comes to documents that may contain toponyms from anywhere in the world, such as newspaper articles, toponym resolution is not just ambiguity resolution but the selection of area candidates from all the areas on Earth. Thus we propose an automatic toponym resolution method that identifies the area candidates of a toponym based only on surface statistics, in place of dictionary-lookup approaches. Our method combines two modules, area candidate reduction and area candidate examination, the latter of which uses block-unit data, to obtain high accuracy without reducing the recall rate. Our empirical results showed a precision of 85.54%, a recall of 91.92% and an F-measure of 0.89 on average. This method is a flexible and robust approach to toponym resolution targeting an unrestricted number of areas.
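As a quick consistency check on the figures above, the F-measure can be recomputed as the harmonic mean of precision and recall. This assumes the balanced F1 definition, which the abstract does not state explicitly; since the reported values are averages, the F-measure of the averaged precision and recall need not equal the averaged F-measure exactly.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
f = f_measure(0.8554, 0.9192)
print(round(f, 2))
```

The result rounds to the 0.89 reported in the abstract.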
Michiko YASUKAWA Hui Tian LIM Hidetoshi YOKOO
In the Malay language, there are no conjugations or declensions, and affixes have important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversations, it is essential to use precise words in formal speech or written texts. In Malay, derivative words are used to make sentences clear. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in the written language of educated Malay speakers. Therefore, the composition of Malay words may be complicated. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. Although several types of stemming algorithms are available for text processing in English and some other languages, they cannot be used to overcome the difficulties in Malay word stemming. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The set of rules is aimed at reducing the occurrence of under-stemming errors, while the dictionaries are intended to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. The text data used in the experiment were actual web pages collected from the World Wide Web, in order to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
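The interplay between affix rules and the two dictionaries can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' stemmer: the affix lists are a small hypothetical subset, the dictionaries hold only a few example entries, and the lookup order is a design guess.

```python
# Hypothetical, deliberately tiny affix lists; a real stemmer needs many more rules.
PREFIXES = ("mem", "men", "me", "ber", "di", "ter")
SUFFIXES = ("kan", "an", "i")

def stem(word, root_words, derivative_words):
    """Return the root of word using dictionaries first, then affix rules."""
    if word in root_words:                 # already a root: never strip it
        return word
    if word in derivative_words:           # known derivative: use the listed root
        return derivative_words[word]
    # Try every prefix/suffix combination (including "no affix") and accept
    # the first stripped form that the root-word dictionary confirms.
    for prefix in ("",) + PREFIXES:
        for suffix in ("",) + SUFFIXES:
            if not (word.startswith(prefix) and word.endswith(suffix)):
                continue
            if len(word) <= len(prefix) + len(suffix):
                continue
            core = word[len(prefix):len(word) - len(suffix)]
            if core in root_words:
                return core
    return word                            # unknown word: leave it unchanged

# Toy dictionaries (assumed entries for illustration only).
ROOTS = {"makan", "baca", "ajar"}
DERIVATIVES = {"pelajar": "ajar"}

for w in ["makan", "makanan", "membaca", "dibacakan", "pelajar", "komputer"]:
    print(w, "->", stem(w, ROOTS, DERIVATIVES))
```

Checking the root-word dictionary before applying any rule is what keeps "makan" from being over-stemmed to "mak", while the rules let "dibacakan" reach "baca" without a dictionary entry of its own.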
Yoshihide KATO Shigeki MATSUBARA
This paper describes an incremental parser based on an adjoining operation. By using the operation, we can avoid the problem of infinite local ambiguity. This paper further proposes a restricted version of the adjoining operation, which preserves lexical dependencies of partial parse trees. Our experimental results showed that the restriction enhances the accuracy of the incremental parsing.
Sutee SUDPRASERT Asanee KAWTRAKUL Christian BOITET Vincent BERMENT
In this paper, we present a new dependency parsing method for languages that have only a very small annotated corpus and for which methods of segmentation and morphological analysis producing a unique (automatically disambiguated) result are very unreliable. Our method works on a morphosyntactic lattice factorizing all possible segmentation and part-of-speech tagging results. The quality of the input to syntactic analysis is hence much better than that of an unreliable unique sequence of lemmatized and tagged words. We propose an adaptation of Eisner's algorithm for finding the k-best dependency trees in a morphosyntactic lattice structure encoding multiple results of morphosyntactic analysis. Moreover, we present how to use Dependency Insertion Grammar in order to adjust the scores and filter out invalid trees, how to use a language model to rescore the parse trees, and the k-best extension of our parsing model. The highest parsing accuracy reported in this paper is 74.32%, which represents a 6.31% improvement compared to the model taking its input from the unreliable morphosyntactic analysis tools.
Vera SHEINMAN Takenobu TOKUNAGA
In this study we introduce AdjScales, a method for scaling similar adjectives by their strength. It combines existing Web-based computational linguistic techniques in order to automatically differentiate by strength between similar adjectives that describe the same property. Though this kind of information is rarely present in most lexical resources and dictionaries, it may be useful for language learners who try to distinguish between similar words. Additionally, learners might gain from a simple visualization of these differences using unidimensional scales. The method is evaluated by comparison with annotation of a subset of adjectives from WordNet by four native English speakers. It is also compared against two non-native speakers of English. The collected annotation is an interesting resource in its own right. This work is a first step toward automatic differentiation of meaning between similar words for language learners. AdjScales can be useful for lexical resource enhancement.