Naive Bayes Classifier Based Partitioner for MapReduce

Lei CHEN; Wei LU; Ergude BAO; Liqiang WANG; Weiwei XING; Yuanyuan CAI

doi:10.1587/transfun.E101.A.778

IEICE TRANSACTIONS on Fundamentals

Summary:

MapReduce is an effective framework for processing large datasets in parallel over a cluster. Data locality and data skew on the reduce side are two essential issues in MapReduce. Improving data locality decreases network traffic by moving reduce tasks to the nodes where the reducer input data is located, whereas data skew leads to load imbalance among reducer nodes. Partitioning is an important feature of MapReduce because it determines the reducer node to which each map output result is sent. An effective partitioner can therefore improve MapReduce performance by increasing data locality and decreasing data skew on the reduce side. Previous studies that consider both issues fall into two categories: those that prioritize data locality, such as LEEN, and those that prioritize load balance, such as CLP. However, these studies ignore the fact that, for different types of jobs, prioritizing data locality or prioritizing data skew on the reduce side may affect the execution time differently. In this paper, we propose BAPM, a naive Bayes classifier based partitioner that achieves better performance by automatically choosing the proper algorithm (LEEN or CLP) with a naive Bayes classifier that uses job type and bandwidth as classification attributes. Our experiments are performed on a Hadoop cluster, and the results show that BAPM boosts the computing performance of MapReduce. The selection accuracy reaches 95.15%. Further, compared with other popular algorithms, BAPM achieves an improvement of up to 31.31% under specific bandwidths.
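
The selection step described in the summary can be illustrated with a minimal sketch of a categorical naive Bayes classifier that picks LEEN or CLP from two attributes, job type and bandwidth. This is not the authors' implementation. The class name `CategoricalNaiveBayes`, the attribute values (`shuffle_heavy`, `compute_heavy`, `low`, `high`), the discretization of bandwidth into two bins, the use of add-one smoothing, and the six training rows are all assumptions invented for illustration; the rows do not reflect any measurement from the paper.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes over categorical attributes with add-one (Laplace) smoothing."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[(attribute index, label)] counts attribute values per class.
        self.value_counts = defaultdict(Counter)
        # Distinct values seen per attribute, used in the smoothing denominator.
        self.values = defaultdict(set)
        for sample, label in zip(samples, labels):
            for i, value in enumerate(sample):
                self.value_counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def log_posteriors(self, sample):
        """Return the unnormalized log posterior of each class for one sample."""
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / self.total)  # log prior
            for i, value in enumerate(sample):
                numerator = self.value_counts[(i, label)][value] + 1
                denominator = count + len(self.values[i])
                score += math.log(numerator / denominator)  # log likelihood
            scores[label] = score
        return scores

    def predict(self, sample):
        scores = self.log_posteriors(sample)
        return max(scores, key=scores.get)


# Hypothetical training rows: (job type, bandwidth bin) -> algorithm that was
# faster. These values are made up for the sketch, not taken from the paper.
TRAINING_SAMPLES = [
    ("shuffle_heavy", "low"),
    ("shuffle_heavy", "low"),
    ("shuffle_heavy", "high"),
    ("compute_heavy", "low"),
    ("compute_heavy", "high"),
    ("compute_heavy", "high"),
]
TRAINING_LABELS = ["LEEN", "LEEN", "CLP", "LEEN", "CLP", "CLP"]

if __name__ == "__main__":
    selector = CategoricalNaiveBayes().fit(TRAINING_SAMPLES, TRAINING_LABELS)
    for job in [("shuffle_heavy", "low"), ("compute_heavy", "high")]:
        print(job, "->", selector.predict(job))
```

With these invented rows, the classifier selects LEEN for a `("shuffle_heavy", "low")` job and CLP for a `("compute_heavy", "high")` job. Working in log space avoids numeric underflow when more attributes are added, and add-one smoothing keeps an attribute value that never co-occurred with a class from forcing that class's probability to zero.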

Publication: IEICE TRANSACTIONS on Fundamentals Vol.E101-A No.5 pp.778-786

Publication Date: 2018/05/01

Online ISSN: 1745-1337

DOI: 10.1587/transfun.E101.A.778

Type of Manuscript: PAPER

Category: Graphs and Networks

Authors

Lei CHEN
  Beijing Jiaotong University
Wei LU
  Beijing Jiaotong University
Ergude BAO
  Beijing Jiaotong University
Liqiang WANG
  University of Central Florida
Weiwei XING
  Beijing Jiaotong University
Yuanyuan CAI
  Beijing Technology and Business University

Keyword

MapReduce, Hadoop, data locality, data skew, naive Bayes, bandwidth, job type

Cite this

Lei CHEN, Wei LU, Ergude BAO, Liqiang WANG, Weiwei XING, Yuanyuan CAI, "Naive Bayes Classifier Based Partitioner for MapReduce" in IEICE TRANSACTIONS on Fundamentals, vol. E101-A, no. 5, pp. 778-786, May 2018, doi: 10.1587/transfun.E101.A.778.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/transfun.E101.A.778/_p

@ARTICLE{e101-a_5_778,
author={Lei CHEN and Wei LU and Ergude BAO and Liqiang WANG and Weiwei XING and Yuanyuan CAI},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Naive Bayes Classifier Based Partitioner for MapReduce},
year={2018},
volume={E101-A},
number={5},
pages={778--786},
keywords={MapReduce, Hadoop, data locality, data skew, naive Bayes, bandwidth, job type},
doi={10.1587/transfun.E101.A.778},
ISSN={1745-1337},
month={May},
}
