SpEnD: Linked Data SPARQL Endpoints Discovery Using Search Engines

Semih YUMUSAK; Erdogan DOGDU; Halife KODAZ; Andreas KAMILARIS; Pierre-Yves VANDENBUSSCHE

doi:10.1587/transinf.2016DAP0025

Summary:

Linked data endpoints are online query gateways to semantically annotated linked data sources. In order to query these data sources, the SPARQL query language is used as a standard. Although a linked data endpoint (i.e., a SPARQL endpoint) is a basic Web service, it provides a platform for federated online querying and data linking methods. For linked data consumers, SPARQL endpoint availability and discovery are crucial for live querying and semantic information retrieval. Current studies show that the availability of linked datasets is very low, while the locations of linked data endpoints change frequently. There are linked data repositories that collect and list the available linked data endpoints or resources. It is observed that around half of the endpoints listed in existing repositories are not accessible (temporarily or permanently offline). These endpoint URLs are shared through repository websites, such as Datahub.io; however, they are weakly maintained and revised only by their publishers. In this study, a novel metacrawling method is proposed for discovering and monitoring linked data sources on the Web. We implemented the method in a prototype system, named SPARQL Endpoints Discovery (SpEnD). SpEnD starts with a “search keyword” discovery process for finding relevant keywords for the linked data domain and specifically SPARQL endpoints. Then, the collected search keywords are utilized to find linked data sources via popular search engines (Google, Bing, Yahoo, Yandex). By using this method, most of the currently listed SPARQL endpoints in existing endpoint repositories, as well as a significant number of new SPARQL endpoints, have been discovered. We analyze our findings in comparison to the Datahub collection in detail.
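
The summary does not say how a URL returned by a search engine is confirmed to be a live SPARQL endpoint. As a rough illustration only, and not the authors' implementation, one plausible check is to send a trivial ASK query and see whether the reply looks like a SPARQL result. The function names, the probe query, and the 10-second timeout below are assumptions made for this sketch.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical probe query: the cheapest question any non-empty endpoint can answer.
PROBE_QUERY = "ASK { ?s ?p ?o }"
ACCEPT = "application/sparql-results+json, application/sparql-results+xml;q=0.9"


def build_probe_url(candidate, query=PROBE_QUERY):
    """Return the candidate URL with a `query` parameter set to the probe query."""
    parts = urlsplit(candidate)
    # Keep existing parameters, but replace any `query` parameter already present.
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != "query"]
    params.append(("query", query))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(params), ""))


def looks_like_sparql_response(status, content_type, body):
    """Heuristic: HTTP 200 plus a SPARQL results media type or a JSON boolean result."""
    if status != 200:
        return False
    media_type = content_type.lower()
    if "sparql-results" in media_type:
        return True
    if "json" in media_type:
        try:
            doc = json.loads(body)
        except ValueError:
            return False
        return isinstance(doc, dict) and isinstance(doc.get("boolean"), bool)
    return False


def probe(candidate, timeout=10.0):
    """Send the probe over HTTP GET; any network or URL error counts as 'not an endpoint'."""
    request = urllib.request.Request(build_probe_url(candidate),
                                     headers={"Accept": ACCEPT})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read(65536).decode("utf-8", errors="replace")
            content_type = response.headers.get("Content-Type", "")
            return looks_like_sparql_response(response.status, content_type, body)
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

Running `probe` periodically over a stored list of URLs would give a crude form of the monitoring mentioned in the summary. A real system would also need to handle redirects, rate limits, and endpoints that only accept POST.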

Publication: IEICE TRANSACTIONS on Information Vol.E100-D No.4 pp.758-767

Publication Date: 2017/04/01

Publicized: 2017/01/17

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2016DAP0025

Type of Manuscript: Special Section PAPER (Special Section on Data Engineering and Information Management)

Authors

Semih YUMUSAK
  KTO Karatay Univ.
Erdogan DOGDU
  Cankaya University
Halife KODAZ
  Selcuk University
Andreas KAMILARIS
  Insight Research Centre for Data Analytics
Pierre-Yves VANDENBUSSCHE
  Fujitsu Ireland Limited

Keyword

linked data, semantic Web, SPARQL endpoint, endpoint discovery, metasearch, knowledge graph

Semih YUMUSAK, Erdogan DOGDU, Halife KODAZ, Andreas KAMILARIS, Pierre-Yves VANDENBUSSCHE, "SpEnD: Linked Data SPARQL Endpoints Discovery Using Search Engines" in IEICE TRANSACTIONS on Information, vol. E100-D, no. 4, pp. 758-767, April 2017, doi: 10.1587/transinf.2016DAP0025.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016DAP0025/_p

@ARTICLE{e100-d_4_758,
author={Semih YUMUSAK and Erdogan DOGDU and Halife KODAZ and Andreas KAMILARIS and Pierre-Yves VANDENBUSSCHE},
journal={IEICE TRANSACTIONS on Information},
title={SpEnD: Linked Data SPARQL Endpoints Discovery Using Search Engines},
year={2017},
volume={E100-D},
number={4},
pages={758-767},
abstract={Linked data endpoints are online query gateways to semantically annotated linked data sources. In order to query these data sources, the SPARQL query language is used as a standard. Although a linked data endpoint (i.e., a SPARQL endpoint) is a basic Web service, it provides a platform for federated online querying and data linking methods. For linked data consumers, SPARQL endpoint availability and discovery are crucial for live querying and semantic information retrieval. Current studies show that the availability of linked datasets is very low, while the locations of linked data endpoints change frequently. There are linked data repositories that collect and list the available linked data endpoints or resources. It is observed that around half of the endpoints listed in existing repositories are not accessible (temporarily or permanently offline). These endpoint URLs are shared through repository websites, such as Datahub.io; however, they are weakly maintained and revised only by their publishers. In this study, a novel metacrawling method is proposed for discovering and monitoring linked data sources on the Web. We implemented the method in a prototype system, named SPARQL Endpoints Discovery (SpEnD). SpEnD starts with a “search keyword” discovery process for finding relevant keywords for the linked data domain and specifically SPARQL endpoints. Then, the collected search keywords are utilized to find linked data sources via popular search engines (Google, Bing, Yahoo, Yandex). By using this method, most of the currently listed SPARQL endpoints in existing endpoint repositories, as well as a significant number of new SPARQL endpoints, have been discovered. We analyze our findings in comparison to the Datahub collection in detail.},
keywords={linked data, semantic Web, SPARQL endpoint, endpoint discovery, metasearch, knowledge graph},
doi={10.1587/transinf.2016DAP0025},
ISSN={1745-1361},
month={April}
}

TY  - JOUR
TI  - SpEnD: Linked Data SPARQL Endpoints Discovery Using Search Engines
T2  - IEICE TRANSACTIONS on Information
SP  - 758
EP  - 767
AU  - Semih YUMUSAK
AU  - Erdogan DOGDU
AU  - Halife KODAZ
AU  - Andreas KAMILARIS
AU  - Pierre-Yves VANDENBUSSCHE
PY  - 2017
DO  - 10.1587/transinf.2016DAP0025
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E100-D
IS  - 4
JA  - IEICE TRANSACTIONS on Information
Y1  - April 2017
AB  - Linked data endpoints are online query gateways to semantically annotated linked data sources. In order to query these data sources, the SPARQL query language is used as a standard. Although a linked data endpoint (i.e., a SPARQL endpoint) is a basic Web service, it provides a platform for federated online querying and data linking methods. For linked data consumers, SPARQL endpoint availability and discovery are crucial for live querying and semantic information retrieval. Current studies show that the availability of linked datasets is very low, while the locations of linked data endpoints change frequently. There are linked data repositories that collect and list the available linked data endpoints or resources. It is observed that around half of the endpoints listed in existing repositories are not accessible (temporarily or permanently offline). These endpoint URLs are shared through repository websites, such as Datahub.io; however, they are weakly maintained and revised only by their publishers. In this study, a novel metacrawling method is proposed for discovering and monitoring linked data sources on the Web. We implemented the method in a prototype system, named SPARQL Endpoints Discovery (SpEnD). SpEnD starts with a “search keyword” discovery process for finding relevant keywords for the linked data domain and specifically SPARQL endpoints. Then, the collected search keywords are utilized to find linked data sources via popular search engines (Google, Bing, Yahoo, Yandex). By using this method, most of the currently listed SPARQL endpoints in existing endpoint repositories, as well as a significant number of new SPARQL endpoints, have been discovered. We analyze our findings in comparison to the Datahub collection in detail.
ER  -