Peerasak INTARAPAIBOON Thanaruk THEERAMUNKONG
Multi-slot information extraction, also known as frame extraction, is a task that identifies several related entities simultaneously. Most research on this task is concerned with applying IE patterns (rules) to extract related entities from unstructured documents. An important obstacle to success in this task is not knowing where the text portions containing the information of interest are located. This problem is more complicated for languages with sentence boundary ambiguity, e.g. the Thai language. Applying IE rules to all reasonable text portions can mitigate this obstacle, but it raises another problem: incorrect (unwanted) extractions. This paper presents a method for removing these incorrect extractions. In the method, extractions are represented as intuitionistic fuzzy sets (IFSs), and a similarity measure for IFSs is used to calculate the distance between the IFS of an unclassified extraction and that of each already-classified extraction. The concept of k nearest neighbors is adopted to decide whether the unclassified extraction is correct or not. In experiments on various domains, the proposed technique improves extraction precision while satisfactorily preserving recall.
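The classification step described above can be illustrated with a minimal sketch. The abstract does not say which IFS similarity measure is used, so the sketch assumes one minus the normalized Hamming distance over membership, non-membership, and hesitation degrees. The function names and the data layout (a list of (membership, non-membership) pairs) are hypothetical.

```python
from collections import Counter

def ifs_similarity(a, b):
    """Return 1 minus the normalized Hamming distance between two IFSs.

    Each IFS is a list of (membership, non_membership) pairs defined over
    the same elements. This particular measure is an assumption; the
    abstract does not name the measure actually used.
    """
    total = 0.0
    for (mu_a, nu_a), (mu_b, nu_b) in zip(a, b):
        pi_a = 1.0 - mu_a - nu_a  # hesitation degree of a
        pi_b = 1.0 - mu_b - nu_b  # hesitation degree of b
        total += abs(mu_a - mu_b) + abs(nu_a - nu_b) + abs(pi_a - pi_b)
    return 1.0 - total / (2 * len(a))

def knn_classify(query, labeled, k=3):
    """Label an unclassified extraction by majority vote of its k most similar neighbors.

    `labeled` is a list of (ifs, label) pairs for already-classified extractions.
    """
    ranked = sorted(labeled,
                    key=lambda item: ifs_similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

An extraction would be kept when `knn_classify` returns the label used for correct extractions and discarded otherwise.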
Hyun-Joo KIM Jong-Hyun KIM Jung-Tai KIM Ik-Kyun KIM Tai-Myung CHUNG
Recent cyber-attacks use various malware as a means of achieving the attacker's malicious purposes. They aim to steal confidential information or seize control of major facilities after infiltrating the network of a target organization. Attackers generally create new malware or many different types of malware by using an automatic malware creation tool, which makes remote control of a target system easy and hinders trace-back of these attacks. The paper proposes a method for generating malware behavior patterns, together with detection techniques, in order to detect known and even unknown malware efficiently. The behavior patterns of malware are generated with Multiple Sequence Alignment (MSA) of the API call sequences of malware. We define these behavior patterns as the “feature-chain” of malware for analytical purposes. The initial generation of the feature-chain consists of extracting API call sequences with an API hooking library, classifying malware samples by similar behavior, and deriving representative sequences from the MSA results. Detection is performed by measuring the similarity between the API call sequence of a target process (a suspicious executable) and the feature-chain of malware. By comparison with other existing methods, we demonstrate the effectiveness of our proposed method based on the Longest Common Subsequence (LCS) algorithm. We also show that our method outperforms other antivirus systems by a factor of 2.55 in detection rate and 1.33 in accuracy rate for malware detection.
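As a rough illustration of comparing an API call sequence with a feature-chain, the sketch below computes a Longest Common Subsequence length and normalizes it by the length of the feature-chain. The normalization and the function names are assumptions made for illustration; the abstract does not specify how the similarity score is derived.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def chain_similarity(process_calls, feature_chain):
    """Fraction of the feature-chain matched, in order, by the process's API calls.

    Dividing by the chain length is an assumed normalization, not the paper's.
    """
    if not feature_chain:
        return 0.0
    return lcs_length(process_calls, feature_chain) / len(feature_chain)
```

A process could then be flagged when `chain_similarity` exceeds a chosen threshold, which would have to be tuned on labeled samples.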
In our previous work, we proposed combining ConceptNet and WordNet for Word Sense Disambiguation (WSD). ConceptNet was automatically disambiguated through Normalized Google Distance (NGD) similarity. In this letter, we present several techniques to enhance the performance of the ConceptNet disambiguation and use this enriched semantic knowledge in the WSD task. We propose to enrich both the WordNet semantic knowledge and NGD to disambiguate the concepts in ConceptNet. Furthermore, we apply the enriched semantic knowledge to improve the performance of WSD. In a number of experiments, the proposed method obtained improved results.
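For reference, the standard Normalized Google Distance between two terms can be computed from page counts as in the sketch below. This is the usual definition of NGD, not the enriched variant proposed in the letter; the parameter names are hypothetical and the counts would come from a search index.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts.

    fx, fy : number of pages containing term x and term y, respectively
    fxy    : number of pages containing both terms
    n      : total number of indexed pages (an estimate in practice)
    """
    if fxy == 0:
        return math.inf  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

Smaller values indicate terms that tend to occur together; a value of 0 means the two terms always co-occur.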
The ground-truth-based image segmentation evaluation paradigm plays an important role in the objective evaluation of segmentation algorithms. So far, many evaluation methods based on comparing clusterings have been developed in the machine learning field. However, most traditional pairwise similarity measures, which only compare a machine-generated clustering to a “true” clustering, have limitations in some cases, e.g. when multiple ground truths are available for the same image. In this letter, we propose utilizing an information-theoretic measure, named NJMI (Normalized Joint Mutual Information), to handle the situations that the pairwise measures cannot deal with. We illustrate the effectiveness of NJMI for both unsupervised and supervised segmentation evaluation.
Raul Ernesto MENENDEZ-MORA Ryutaro ICHISE
The ability to assess similarity lies close to the core of cognition. Understanding it supports the comprehension of human success in tasks such as problem solving, categorization, memory retrieval, and inductive reasoning, which is the main reason it is a common research topic. In this paper, we introduce the idea of semantic differences and commonalities between words to the similarity computation process. Five new semantic similarity metrics are obtained after applying this scheme to traditional WordNet-based measures. We also combine the node-based similarity measures with a corpus-independent way of computing the information content. In an experimental evaluation of our approach on two standard word-pair datasets, four of the measures outperformed their classical versions, while the other performed as well as its unmodified counterpart.
Yoo Rhee OH Yong Guk KIM Mina KIM Hong Kook KIM Mi Suk LEE Hyun Joo BAE
In this paper, we propose a text corpus design method for a Korean stereo super-wideband speech database. Since a small text corpus is generally required for speech coding, the corpus should be designed to comply with the pronunciation behavior of natural conversation in order to ensure efficient speech quality tests. To this end, the proposed design method utilizes a similarity measure between the phoneme distribution occurring in natural conversation and that of the designed text corpus. We first collect and refine text data from textbooks and websites. Next, a corpus is designed from the refined text data based on the similarity measure used to compare phoneme distributions. We then construct a Korean stereo super-wideband speech (K-SW) database using the designed text corpus, where the recording environment is set to meet the conditions defined by ITU-T. Finally, the subjective quality of the K-SW database is evaluated using an ITU-T super-wideband codec in order to demonstrate that the K-SW database is useful for developing and evaluating super-wideband codecs.
Sang-Hyuk LEE Keun Ho RYU Gyoyong SOHN
In this study, we investigated the relationship between similarity measures and entropy for fuzzy sets. First, we developed fuzzy entropy by using the distance measure for fuzzy sets. We pointed out that the distance between the fuzzy set and the corresponding crisp set equals fuzzy entropy. We also found that the sum of the similarity measure and the entropy between the fuzzy set and the corresponding crisp set constitutes the total information in the fuzzy set. Finally, we derived a similarity measure from entropy and showed by a simple example that the maximum similarity measure can be obtained using a minimum entropy formulation.
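The relationship described above can be illustrated with a small sketch. It assumes the normalized Hamming distance as the distance measure, takes the "corresponding crisp set" to be the nearest crisp set (memberships rounded at 0.5), and defines similarity as one minus distance. These choices are assumptions for illustration and may differ from the paper's definitions; all function names are hypothetical.

```python
def nearest_crisp(a):
    """Crisp set closest to fuzzy set a: membership 1 where a >= 0.5, else 0."""
    return [1.0 if m >= 0.5 else 0.0 for m in a]

def hamming(a, b):
    """Normalized Hamming distance between two membership vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def fuzzy_entropy(a):
    """Entropy taken as the distance from a to its nearest crisp set (assumed form)."""
    return hamming(a, nearest_crisp(a))

def similarity_to_crisp(a):
    """Similarity between a and its nearest crisp set, assumed to be 1 - distance."""
    return 1.0 - hamming(a, nearest_crisp(a))
```

Under these assumptions the similarity and the entropy always sum to 1, so maximizing the similarity to the crisp set is the same as minimizing the entropy.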
The deployment of historical trajectories of moving objects has greatly increased for various applications in road networks. For instance, similar patterns of moving-object trajectories are very useful for designing the transportation network of a new city. In this paper, we define a spatio-temporal similarity measure based on road network distance rather than Euclidean distance. We also propose a new similar-trajectory search algorithm that is based on the spatio-temporal measure and uses an efficient pruning mechanism. Finally, we demonstrate the performance of our algorithm in terms of both retrieval accuracy and retrieval efficiency.
Bo-Yeong KANG Dae-Won KIM Qing LI
A great deal of research has been conducted to model the vagueness and uncertainty in information retrieval. One such line of research is fuzzy ranking models, which have shown superior performance in handling the uncertainty involved in the retrieval process. However, these conventional fuzzy ranking models have a limited ability to incorporate user preference when calculating the rank of documents. To address this issue, we develop a new fuzzy ranking model based on user preference. Through experiments on the TREC-2 collection of Wall Street Journal documents, we show that the proposed method outperforms the conventional fuzzy ranking models.
Juan D. VELASQUEZ Hiroshi YASUDA Terumasa AOKI Richard WEBER
The behavior of visitors browsing a web site offers a lot of information about their requirements and the way they use the site. Analyzing such behavior can provide the information needed to improve the web site's structure. The literature already contains several suggestions on how to characterize web site usage and to identify visitor requirements based on clustering of visitor sessions. Here we propose to combine visitor behavior with the content of the respective web pages and the similarity between different page sequences in order to define a similarity measure between different visits. This similarity serves as input for clustering of visitor sessions. The application of our approach to a bank's web site and its visitor sessions shows its potential for internet-based businesses.
Cheng-Jian LIN Cheng-Hung CHEN
In this paper, a Compensatory Neuro-Fuzzy Network (CNFN) for nonlinear system control is proposed. The compensatory fuzzy reasoning method uses adaptive fuzzy operations of the neural fuzzy network, which can make the fuzzy logic system more adaptive and effective. An on-line learning algorithm is proposed to automatically construct the CNFN. The fuzzy rules are created and adapted as on-line learning proceeds via simultaneous structure and parameter learning. The structure learning is based on the fuzzy similarity measure, and the parameter learning is based on the backpropagation algorithm. The advantages of the proposed learning algorithm are that it converges quickly and that the obtained fuzzy rules are more precise. The performance of the CNFN compares favorably with that of various other existing models.
Most conventional methods used in character recognition extract geometrical features, such as stroke direction and connectivity, and compare them with reference patterns in a stored dictionary. Unfortunately, geometrical features are easily degraded by blurs and stains, and by graphical designs such as those used in Japanese newspaper headlines. This noise must be removed before recognition commences, but no preprocessing method is perfectly accurate. This paper proposes a method for recognizing degraded characters as well as characters printed on graphical designs. This method extracts features from binary images, and a new similarity measure, the complementary similarity measure, is used as a discriminant function; it compares the similarity and dissimilarity of binary patterns with reference dictionary patterns. Experiments are conducted using the standard character database ETL-2, which consists of machine-printed Kanji, Hiragana, Katakana, alphanumeric, and special characters. The results show that our method is much more robust against noise than the conventional geometrical-feature method. It also achieves high recognition rates of over 97% for characters with textured foregrounds, over 99% for characters with textured backgrounds, over 98% for outline fonts, and over 99% for reverse-contrast characters. The experiments on recognizing both font styles and character categories show that it also achieves high recognition rates against noise.
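The abstract does not give the formula for the complementary similarity measure. The sketch below uses the form usually cited for it, (a*e - b*c) / sqrt(T*(n - T)), where a, b, c, e count the four agreement and disagreement cases between an input binary pattern and a reference pattern, T is the number of 1-pixels in the reference, and n is the pattern size. Treat this form as an assumption to be checked against the paper; the function name is hypothetical.

```python
import math

def complementary_similarity(f, t):
    """Complementary similarity between binary input f and binary reference t.

    f and t are equal-length sequences of 0/1 values. The formula used is
    the commonly cited form of the measure (assumed here, not quoted from
    the abstract).
    """
    a = sum(1 for fi, ti in zip(f, t) if fi == 1 and ti == 1)  # both black
    b = sum(1 for fi, ti in zip(f, t) if fi == 0 and ti == 1)  # missing in input
    c = sum(1 for fi, ti in zip(f, t) if fi == 1 and ti == 0)  # extra in input
    e = sum(1 for fi, ti in zip(f, t) if fi == 0 and ti == 0)  # both white
    n = len(t)
    ones_t = a + b  # number of 1-pixels in the reference pattern
    denom = math.sqrt(ones_t * (n - ones_t))
    if denom == 0:
        return 0.0  # reference is all 0s or all 1s; the measure is undefined
    return (a * e - b * c) / denom
```

Because the numerator rewards agreement on both black and white pixels and penalizes only the product of the two kinds of disagreement, an input that contains the whole reference pattern plus extra background pixels still scores positively.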