Semih YUMUSAK Erdogan DOGDU Halife KODAZ
Linked data sets are created using Semantic Web technologies; they are usually large, and the number of such data sets is growing. Query execution is therefore costly, and knowing the content of such data sets should help in targeted querying. Our aim in this paper is to classify linked data sets by their knowledge content. Earlier projects such as LOD Cloud, LODStats, and SPARQLES analyze linked data sources in terms of content, availability, and infrastructure. In these projects, linked data sets are classified and tagged principally using the VoID vocabulary. Although all linked data sources listed in these projects appear to be classified or tagged, there are only a limited number of studies on automated tagging and classification of newly arriving linked data sets. Here, we focus on automated classification of linked data sets using semantic scoring methods. We collected the SPARQL endpoints of 1,328 unique linked data sets from the Datahub, LOD Cloud, LODStats, SPARQLES, and SpEnD projects. We then queried textual descriptions of resources in these data sets using their rdfs:comment and rdfs:label property values. We analyzed these texts with document analysis techniques, treating every SPARQL endpoint as a separate document. In this regard, we used the WordNet semantic relations library combined with an adapted term frequency-inverse document frequency (tf-idf) analysis on the words and their semantic neighbours. From the WordNet database, we extracted information about comment/label objects in linked data sources using the hypernym, hyponym, holonym, meronym, region, topic, and usage semantic relations. We obtained significant results for the hypernym and topic semantic relations; we can find words that identify data sets, and these can be used in automatic classification and tagging of linked data sources. Using these words, we experimented with different classifiers and scoring methods, which yields improved classification accuracy.
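To make the scoring idea concrete, the following is a minimal sketch, not the authors' implementation, of tf-idf analysis over endpoint descriptions expanded with WordNet hypernyms. The endpoint texts, the use of only the first synset, and the top-5 cutoff are illustrative assumptions; the sketch requires the NLTK WordNet corpus and scikit-learn.

```python
# Minimal sketch: tf-idf over per-endpoint "documents" whose terms are expanded
# with WordNet hypernym lemmas. Requires: nltk (with the wordnet corpus
# downloaded) and scikit-learn. Endpoint texts below are placeholders.
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer

def expand_with_hypernyms(text):
    """Append first-sense hypernym lemmas to each word in the text."""
    tokens = text.lower().split()
    expanded = list(tokens)
    for tok in tokens:
        synsets = wn.synsets(tok)
        if not synsets:
            continue
        for hyper in synsets[0].hypernyms():
            expanded.extend(l.name().replace("_", " ") for l in hyper.lemmas())
    return " ".join(expanded)

# One "document" per SPARQL endpoint, built from rdfs:label / rdfs:comment values.
endpoint_texts = {
    "endpoint_a": "protein gene enzyme pathway",
    "endpoint_b": "city country population capital",
}
corpus = [expand_with_hypernyms(t) for t in endpoint_texts.values()]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# The top-scoring terms per endpoint can serve as candidate tags.
terms = vectorizer.get_feature_names_out()
for name, row in zip(endpoint_texts, tfidf.toarray()):
    top = sorted(zip(row, terms), reverse=True)[:5]
    print(name, [t for _, t in top])
```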
Natthawut KERTKEIDKACHORN Ryutaro ICHISE
Knowledge graphs (KGs) play a crucial role in many modern applications. However, constructing a KG from natural language text is challenging due to the complex structure of the text. Recently, many approaches have been proposed to transform natural language text into triples to obtain KGs. Such approaches have not yet provided effective results for mapping extracted elements of triples, especially the predicate, to their equivalent elements in a KG. Predicate mapping is essential because it can reduce the heterogeneity of the data and increase the searchability over a KG. In this article, we propose T2KG, an automatic KG creation framework for natural language text, which maps extracted predicates to their corresponding KG predicates more effectively. In our framework, a hybrid combination of a rule-based approach and a similarity-based approach is presented for mapping a predicate to its corresponding predicate in a KG. Based on experimental results, the hybrid approach can identify more similar predicate pairs than a baseline method in the predicate mapping task. An experiment on KG creation is also conducted to investigate the performance of T2KG. The experimental results show that T2KG also outperforms the baseline in KG creation. Although KG creation is conducted in open domains, in which prior knowledge is not provided, T2KG still achieves an F1 score of approximately 50% when generating triples in the KG creation task. In addition, an empirical study on knowledge population using various text sources is conducted, and the results indicate that T2KG can be used to obtain knowledge that is not currently available from DBpedia.
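As an illustration of the hybrid idea (not T2KG's actual rules or similarity measure), the sketch below first tries a normalized exact-match rule and then falls back to a simple string-similarity score; the candidate KG predicates and the threshold are assumptions.

```python
# Hedged sketch of hybrid predicate mapping: rule-based exact match after
# normalization, then a string-similarity fallback. The predicate list and
# threshold are illustrative only.
from difflib import SequenceMatcher

KG_PREDICATES = ["dbo:birthPlace", "dbo:spouse", "dbo:author", "dbo:locatedIn"]

def normalize(p):
    # Drop the prefix, remove separators, lowercase.
    return p.split(":")[-1].replace("_", "").replace(" ", "").lower()

def map_predicate(extracted, candidates=KG_PREDICATES, threshold=0.7):
    # Rule-based step: normalized exact match.
    for cand in candidates:
        if normalize(cand) == normalize(extracted):
            return cand
    # Similarity-based step: best string similarity above a threshold.
    scored = [(SequenceMatcher(None, normalize(extracted), normalize(c)).ratio(), c)
              for c in candidates]
    score, best = max(scored)
    return best if score >= threshold else None

print(map_predicate("birth place"))  # matched via the rule after normalization
print(map_predicate("authored by"))  # matched via string similarity
```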
Semih YUMUSAK Erdogan DOGDU Halife KODAZ Andreas KAMILARIS Pierre-Yves VANDENBUSSCHE
Linked data endpoints are online query gateways to semantically annotated linked data sources. To query these data sources, the SPARQL query language is used as a standard. Although a linked data endpoint (i.e., SPARQL endpoint) is a basic Web service, it provides a platform for federated online querying and data linking methods. For linked data consumers, SPARQL endpoint availability and discovery are crucial for live querying and semantic information retrieval. Current studies show that the availability of linked data sets is very low, while the locations of linked data endpoints change frequently. There are linked data repositories that collect and list the available linked data endpoints or resources. It is observed that around half of the endpoints listed in existing repositories are not accessible (temporarily or permanently offline). These endpoint URLs are shared through repository websites such as Datahub.io; however, they are weakly maintained and revised only by their publishers. In this study, a novel metacrawling method is proposed for discovering and monitoring linked data sources on the Web. We implemented the method in a prototype system named SPARQL Endpoints Discovery (SpEnD). SpEnD starts with a “search keyword” discovery process for finding keywords relevant to the linked data domain, and specifically to SPARQL endpoints. The collected search keywords are then used to find linked data sources via popular search engines (Google, Bing, Yahoo, Yandex). Using this method, most of the SPARQL endpoints currently listed in existing endpoint repositories, as well as a significant number of new SPARQL endpoints, have been discovered. We analyze our findings in detail in comparison with the Datahub collection.
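The monitoring aspect can be illustrated with a simple availability probe. The sketch below, which is not the SpEnD implementation, sends a trivial ASK query to each candidate endpoint using the SPARQLWrapper library; the endpoint URLs and timeout are examples.

```python
# Minimal availability probe for SPARQL endpoints: an endpoint is considered
# online if it answers a trivial ASK query within the timeout.
from SPARQLWrapper import SPARQLWrapper, JSON

def is_alive(endpoint_url, timeout=10):
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setTimeout(timeout)
    sparql.setReturnFormat(JSON)
    sparql.setQuery("ASK { ?s ?p ?o }")
    try:
        result = sparql.query().convert()
        return bool(result.get("boolean", False))
    except Exception:
        # Any network or protocol failure counts as offline.
        return False

candidates = ["https://dbpedia.org/sparql", "http://example.org/sparql"]
for url in candidates:
    print(url, "online" if is_alive(url) else "offline")
```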
Natthawut KERTKEIDKACHORN Ryutaro ICHISE
Mapping instances to the Linked Open Data (LOD) cloud plays an important role in enriching instance information, since the LOD cloud contains abundant interlinked instances that describe them. Consequently, many techniques have been introduced for mapping instances to a LOD data set; however, most of them focus merely on tackling the problem of heterogeneity. Unfortunately, the problem posed by the large number of LOD data sets has yet to be addressed. Owing to this number, mapping an instance to a single LOD data set is not sufficient, because an identical instance might not exist in that data set. In this article, we therefore introduce a heuristic-expansion-based framework for mapping instances to LOD data sets. The key idea of the framework is to gradually expand the search space from one data set to another in order to discover identical instances. In experiments, the framework successfully mapped instances to the LOD data sets, increasing the coverage to 90.36%. Experimental results also indicate that the heuristic function in the framework can efficiently limit the expansion to a reasonable space. With this limited expansion space, the framework effectively reduces the number of candidate pairs to 9.73% of the baseline without affecting performance.
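The expansion idea can be sketched as a best-first search over data sets guided by a heuristic score. The following outline is illustrative only; the neighbors, find_identical, and h functions are hypothetical placeholders, not the framework's actual components.

```python
# Illustrative best-first expansion across data sets. The caller supplies:
#   neighbors(dataset)            -> data sets linked to the given one
#   find_identical(inst, dataset) -> an identical instance in that data set, or None
#   h(inst, dataset)              -> heuristic score (higher = more promising)
import heapq

def heuristic_expand(instance, start_dataset, neighbors, find_identical, h):
    frontier = [(-h(instance, start_dataset), start_dataset)]
    visited = set()
    while frontier:
        _, dataset = heapq.heappop(frontier)
        if dataset in visited:
            continue
        visited.add(dataset)
        match = find_identical(instance, dataset)  # candidate comparison within one data set
        if match is not None:
            return match
        for nxt in neighbors(dataset):             # expand only toward linked data sets
            if nxt not in visited:
                heapq.heappush(frontier, (-h(instance, nxt), nxt))
    return None
```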
Linked data entity resolution is the detection of instances that reside in different repositories but describe the same topic. The quality of the resolution result depends on the appropriateness of the configuration, including the selected matching properties and the similarity measures. Because such configuration details are currently set differently across domains and repositories, a general resolution approach that works for every repository is necessary. In this paper, we present cLink, a system that can perform entity resolution effectively on any input by using a learning algorithm to find the optimal configuration. Experiments show that cLink achieves high performance even when given only a small amount of training data. cLink also outperforms recent systems, including those that use a supervised learning approach.
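To illustrate what finding an "optimal configuration" can mean, the sketch below exhaustively scores (property, similarity measure) configurations by F1 on labeled training pairs. The properties, measures, threshold, and data structures are assumptions and do not represent cLink's actual learning algorithm.

```python
# Hedged sketch: pick the (matching property, similarity measure) pair that
# maximizes F1 on labeled training pairs. Instances are dicts of property values.
from difflib import SequenceMatcher
from itertools import product

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def edit_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

MEASURES = {"jaccard": jaccard, "edit": edit_sim}

def f1(config, training_pairs, threshold=0.8):
    prop, measure = config
    tp = fp = fn = 0
    for src, dst, is_match in training_pairs:  # (instance, instance, bool)
        pred = MEASURES[measure](src.get(prop, ""), dst.get(prop, "")) >= threshold
        tp += pred and is_match
        fp += pred and not is_match
        fn += (not pred) and is_match
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def best_config(properties, training_pairs):
    return max(product(properties, MEASURES), key=lambda c: f1(c, training_pairs))
```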
Khalid MAHMOOD Asif RAZA Madan KRISHNAMURTHY Hironao TAKAHASHI
The growing trend of Internet usage for data and knowledge sharing calls for dynamic classification of web contents, particularly at the edges of the Internet. Rather than considering Linked Data as an integral part of Big Data, we propose the Autonomous Decentralized Semantic-based Content Classifier (ADSCC) for dynamic classification of unstructured web contents, using Linked Data and web metadata in a Content Delivery Network (CDN). The proposed framework ensures efficient categorization of URLs (even for overlapping categories) by dynamically mapping changing user-defined categories to ontology categories/classes. The proposed system performs this dynamic classification through three main algorithms/modules: a Dynamic Mapping algorithm, an Autonomous coordination-based Inference algorithm, and Context-based Disambiguation. Evaluation results show that the proposed system achieves, on average, precision, recall, and F-measure within the 93-97% range.
In recent years, there has been significant growth in the importance of mining graph-structured data, owing to the rapid increase in both the scale and the application areas of this technology. Many previous studies have investigated decision tree learning on Semantic Web-based linked data to uncover implicit knowledge. In the present paper, we propose a new random forest algorithm for linked data to overcome the underlying limitations of the decision tree algorithm, such as locally optimal decisions and generalization error. Moreover, we designed a parallel processing environment for random forest learning to handle large-scale linked data and increase the efficiency of generating multiple trees. For this purpose, we modified the candidate feature searching method of the previous decision tree algorithm for linked data to reduce the feature search space of random forest learning, and developed feature selection methods adjusted to linked data. Using a distributed index-based search engine, we designed a parallel random forest learning system for linked data that generates random forests in parallel. Our proposed system enables users to simultaneously generate multiple decision trees from linked data stored in a distributed manner. To evaluate the performance of the proposed algorithm, we performed experiments comparing its classification accuracy with that of the single decision tree algorithm. The experimental results revealed that our random forest algorithm is more accurate than the single decision tree algorithm.
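As a generic illustration of parallel random forest learning (not the paper's linked-data-specific, distributed-index-based system), the sketch below trains trees on bootstrap samples and random feature subsets across worker processes using scikit-learn decision trees.

```python
# Generic parallel random forest sketch: each tree is trained on a bootstrap
# sample and a random feature subset in a separate process. Assumes numeric
# features and non-negative integer class labels; worker functions must be
# defined at module level for a process pool.
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.tree import DecisionTreeClassifier

def train_tree(args):
    X, y, seed = args
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), len(X))          # bootstrap sample
    n_feat = max(1, int(np.sqrt(X.shape[1])))      # sqrt(d) random features
    feats = rng.choice(X.shape[1], n_feat, replace=False)
    tree = DecisionTreeClassifier(random_state=int(seed)).fit(X[idx][:, feats], y[idx])
    return tree, feats

def train_forest(X, y, n_trees=10, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_tree, [(X, y, s) for s in range(n_trees)]))

def predict(forest, X):
    # Majority vote over the trees, each applied to its own feature subset.
    votes = np.array([t.predict(X[:, feats]) for t, feats in forest])
    return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```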
Md-Mizanur RAHOMAN Ryutaro ICHISE
Keyword-based linked data information retrieval is an easy choice for general-purpose users, but implementing such an approach is a challenge because mere keywords do not hold semantic information. Some studies have incorporated templates in an effort to bridge this gap, but most such approaches have proven ineffective because of inefficient template management. Because linked data is presented in a structured format, we can assume that the data's internal statistics can be used to effectively guide template management. In this work, we explore the use of these statistics for template creation, ranking, and scaling. We then demonstrate how our proposal for automatic linked data information retrieval can be used alongside familiar keyword-based information retrieval methods, and can also be combined with other techniques, such as ontology inclusion and sophisticated matching, to achieve higher performance.
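A small, hypothetical sketch of statistics-driven template management is shown below: candidate triple-pattern templates are generated from a keyword and ranked by a match-count function that is assumed to be supplied by the caller (for example, by running COUNT queries against the data set). The patterns themselves are illustrative, not the paper's templates.

```python
# Hypothetical sketch: generate keyword-derived SPARQL graph-pattern templates
# and rank them by how many solutions they have in the data set.
def build_templates(keyword):
    return [
        f'?s rdfs:label "{keyword}"@en .',
        f'?s ?p ?o . ?o rdfs:label "{keyword}"@en .',
        f'?s rdfs:comment ?c . FILTER(CONTAINS(LCASE(?c), "{keyword}"))',
    ]

def rank_templates(keyword, count_matches):
    """count_matches(pattern) -> number of solutions in the data set (assumed)."""
    return sorted(build_templates(keyword), key=count_matches, reverse=True)
```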
The Linking Open Data (LOD) cloud is a collection of linked Resource Description Framework (RDF) data with over 31 billion RDF triples. Accessing linked data is a challenging task because each data set in the LOD cloud has a specific ontology schema, and familiarity with the schema used is required in order to query various linked data sets. Manually checking each data set, however, is time-consuming, especially when many data sets from various domains are used. This difficulty can be overcome without user interaction by an automatic method that integrates different ontology schemas. In this paper, we propose a Mid-Ontology learning approach that can automatically construct a simple ontology linking related ontology predicates (classes or properties) in different data sets. Our Mid-Ontology learning approach consists of three main phases: data collection, predicate grouping, and Mid-Ontology construction. Experiments show that our Mid-Ontology learning approach successfully integrates diverse ontology schemas with high quality, and effectively retrieves related information with the constructed Mid-Ontology.
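As an illustration of the predicate grouping phase (a hedged sketch, not the paper's method), the code below clusters predicates from different data sets by label similarity and emits one mid-ontology property per multi-member group; the URIs, namespace, and threshold are examples.

```python
# Hedged sketch of predicate grouping for a mid-ontology: predicates whose
# local names are similar are grouped, and each group with more than one
# member yields a mid-ontology property linked to its members.
from difflib import SequenceMatcher

def label_of(predicate_uri):
    return predicate_uri.rstrip("/").split("/")[-1].split("#")[-1].lower()

def group_predicates(predicates, threshold=0.8):
    groups = []
    for p in predicates:
        for g in groups:
            if SequenceMatcher(None, label_of(p), label_of(g[0])).ratio() >= threshold:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

def build_mid_ontology(groups, ns="http://example.org/mid#"):
    return {ns + label_of(g[0]): g for g in groups if len(g) > 1}

preds = ["http://dbpedia.org/ontology/birthPlace",
         "http://example.org/schema#birthplace",
         "http://dbpedia.org/ontology/author"]
print(build_mid_ontology(group_predicates(preds)))
```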