Grid-Based Parallel Algorithms of Join Queries for Analyzing Multi-Dimensional Data on MapReduce

Miyoung JANG; Jae-Woo CHANG

doi:10.1587/transinf.2016IIP0010

Summary:

Join processing of large-scale datasets in MapReduce environments has recently become an important issue. However, existing MapReduce-based join algorithms incur excessive overhead for constructing and updating the data index. Moreover, their similarity computation cost is high because they partition data without considering the data distribution. In this paper, we propose two grid-based join algorithms for MapReduce. First, we propose a similarity join algorithm that evenly distributes join candidates using a dynamic grid index, which partitions data according to data density and the similarity threshold. We use a bottom-up approach, merging initial grid cells into partitions and assigning them to MapReduce jobs. Second, we propose a k-NN join query processing algorithm for MapReduce. To reduce the data transmission cost, we determine an optimal grid cell size by considering the data distribution of randomly selected samples. Then, we perform the k-NN join by assigning only the related join data to a reducer. Our performance analysis shows that the proposed similarity join and k-NN join algorithms outperform existing algorithms by up to 10 times in terms of query processing time.
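
The general idea of a grid-partitioned similarity join can be illustrated with a minimal single-process sketch. This is not the authors' algorithm: it assumes a uniform grid whose cell side equals the similarity threshold, Euclidean distance, and a self-join, and it omits the density-aware cell merging and the actual MapReduce job assignment described in the summary. The names `cell_of`, `grid_similarity_join`, and `eps` are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def cell_of(point, cell_size):
    """Map a point to the integer coordinates of its grid cell."""
    return tuple(floor(c / cell_size) for c in point)


def grid_similarity_join(points, eps):
    """Return sorted index pairs (i, j), i < j, with Euclidean distance <= eps.

    Assumption: a uniform grid with cell side eps, so any two points within
    eps of each other lie in the same cell or in adjacent cells.
    """
    # "Map" step: key every point by its grid cell.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[cell_of(p, eps)].append(idx)

    dims = len(points[0]) if points else 0
    offsets = list(product((-1, 0, 1), repeat=dims))

    # "Reduce" step: each cell compares its points only against points in
    # itself and its neighbouring cells, instead of against the whole dataset.
    result = set()
    for cell, members in cells.items():
        for off in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in cells.get(neighbor, ()):
                    if i < j and dist(points[i], points[j]) <= eps:
                        result.add((i, j))
    return sorted(result)


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.5, 0.5), (3.0, 3.0), (3.2, 3.1), (10.0, 10.0)]
    print(grid_similarity_join(sample, eps=1.0))
```

In a real MapReduce setting each cell (or merged group of cells) would become a reducer key, and points near cell borders would have to be replicated to neighbouring partitions; the sketch sidesteps that by keeping all cells in one in-memory dictionary.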

Publication: IEICE TRANSACTIONS on Information and Systems Vol.E101-D No.4 pp.964-976

Publication Date: 2018/04/01

Publicized: 2018/01/19

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2016IIP0010

Type of Manuscript: Special Section PAPER (Special Section on Intelligent Information and Communication Technology and its Applications to Creative Activity Support)

Category: Technologies for Knowledge Support Platform

Authors

Miyoung JANG
Electronics and Telecommunications Research Institute (ETRI)
Jae-Woo CHANG
Chonbuk National Univ.

Keywords

MapReduce-based join query processing, similarity join algorithm, k-NN join algorithm, grid partitioning method

Cite this

Miyoung JANG, Jae-Woo CHANG, "Grid-Based Parallel Algorithms of Join Queries for Analyzing Multi-Dimensional Data on MapReduce," in IEICE TRANSACTIONS on Information and Systems, vol. E101-D, no. 4, pp. 964-976, April 2018, doi: 10.1587/transinf.2016IIP0010.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016IIP0010/_p

@ARTICLE{e101-d_4_964,
author={Miyoung JANG and Jae-Woo CHANG},
journal={IEICE TRANSACTIONS on Information and Systems},
title={Grid-Based Parallel Algorithms of Join Queries for Analyzing Multi-Dimensional Data on MapReduce},
year={2018},
volume={E101-D},
number={4},
pages={964-976},
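keywords={MapReduce-based join query processing, similarity join algorithm, k-NN join algorithm, grid partitioning method},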
doi={10.1587/transinf.2016IIP0010},
ISSN={1745-1361},
month={April}
}

TY - JOUR
TI - Grid-Based Parallel Algorithms of Join Queries for Analyzing Multi-Dimensional Data on MapReduce
T2 - IEICE TRANSACTIONS on Information and Systems
SP - 964
EP - 976
AU - Miyoung JANG
AU - Jae-Woo CHANG
PY - 2018
DO - 10.1587/transinf.2016IIP0010
JO - IEICE TRANSACTIONS on Information and Systems
SN - 1745-1361
VL - E101-D
IS - 4
JA - IEICE TRANSACTIONS on Information and Systems
Y1 - 2018/04/01
ER -