Hadoop I/O Performance Improvement by File Layout Optimization

Eita FUJISHIMA; Kenji NAKASHIMA; Saneyasu YAMAGUCHI

doi:10.1587/transinf.2017EDP7114

Summary:

Hadoop is a popular open-source MapReduce implementation. In jobs such as TeraSort, where the large output files of all relevant Map tasks are transferred to the Reduce tasks, the Reduce tasks become the bottleneck and are I/O-bound because they must process many large files. In most cases, including TeraSort, the intermediate data, which include the output files of the Map tasks, are large and accessed sequentially. Improving the performance of these jobs therefore requires improving sequential access performance. In this paper, we propose methods for improving the performance of the Reduce tasks of such jobs based on two observations: these files are accessed sequentially on an HDD, and each zone of an HDD has different sequential I/O performance. The proposed methods control where intermediate data are stored by modifying the filesystem's block bitmap, which records whether each block on the HDD is free or used. In addition, we propose a striping layout for applying these methods in virtualized environments that use image files. We then present a performance evaluation of the proposed methods and demonstrate that they improve Hadoop application performance.
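
The block-bitmap idea in the summary can be illustrated with a toy model. The sketch below is not the authors' implementation: it assumes a simplified in-memory bitmap and a first-fit allocator, and all names (`mask_outside_zone`, `allocate_in_zone`, the zone boundaries) are hypothetical. It only shows how temporarily marking blocks outside a chosen zone as used can steer an unmodified allocator into that zone.

```python
from typing import List, Optional

FREE, USED = 0, 1


def mask_outside_zone(bitmap: List[int], zone_start: int, zone_end: int) -> List[int]:
    """Mark every free block outside [zone_start, zone_end) as used.

    Returns the indices that were masked so they can be restored later.
    """
    masked = []
    for i, state in enumerate(bitmap):
        if state == FREE and not (zone_start <= i < zone_end):
            bitmap[i] = USED
            masked.append(i)
    return masked


def unmask(bitmap: List[int], masked: List[int]) -> None:
    """Restore blocks that were only masked, not really allocated."""
    for i in masked:
        bitmap[i] = FREE


def allocate_first_fit(bitmap: List[int], count: int) -> Optional[int]:
    """Allocate `count` contiguous free blocks; return the start index or None."""
    run = 0
    for i, state in enumerate(bitmap):
        run = run + 1 if state == FREE else 0
        if run == count:
            start = i - count + 1
            for j in range(start, i + 1):
                bitmap[j] = USED
            return start
    return None


def allocate_in_zone(bitmap: List[int], count: int,
                     zone_start: int, zone_end: int) -> Optional[int]:
    """Steer the unmodified allocator into a zone by masking the rest of the disk."""
    masked = mask_outside_zone(bitmap, zone_start, zone_end)
    try:
        return allocate_first_fit(bitmap, count)
    finally:
        unmask(bitmap, masked)


if __name__ == "__main__":
    disk = [FREE] * 16
    disk[2] = USED  # one block already holds real data
    print(allocate_in_zone(disk, 4, zone_start=8, zone_end=16))
```

The allocator itself is left untouched in this model; only the bitmap it consults is changed, and the masked blocks are released again once the allocation is done.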

Publication: IEICE TRANSACTIONS on Information Vol.E101-D No.2 pp.415-427

Publication Date: 2018/02/01

Publicized: 2017/11/22

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2017EDP7114

Type of Manuscript: PAPER

Category: Data Engineering, Web Information Systems

Authors

Eita FUJISHIMA
  Kogakuin University
Kenji NAKASHIMA
  Kogakuin University
Saneyasu YAMAGUCHI
  Kogakuin University

Keywords

Hadoop, big data, HDD, filesystem

Cite this

Eita FUJISHIMA, Kenji NAKASHIMA, Saneyasu YAMAGUCHI, "Hadoop I/O Performance Improvement by File Layout Optimization" in IEICE TRANSACTIONS on Information, vol. E101-D, no. 2, pp. 415-427, February 2018, doi: 10.1587/transinf.2017EDP7114.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2017EDP7114/_p

@ARTICLE{e101-d_2_415,
author={Eita FUJISHIMA and Kenji NAKASHIMA and Saneyasu YAMAGUCHI},
journal={IEICE TRANSACTIONS on Information},
title={Hadoop I/O Performance Improvement by File Layout Optimization},
year={2018},
volume={E101-D},
number={2},
pages={415-427},
keywords={Hadoop, big data, HDD, filesystem},
doi={10.1587/transinf.2017EDP7114},
ISSN={1745-1361},
month={February},
}

TY - JOUR
TI - Hadoop I/O Performance Improvement by File Layout Optimization
T2 - IEICE TRANSACTIONS on Information
SP - 415
EP - 427
AU - Eita FUJISHIMA
AU - Kenji NAKASHIMA
AU - Saneyasu YAMAGUCHI
PY - 2018
DO - 10.1587/transinf.2017EDP7114
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E101-D
IS - 2
JA - IEICE TRANSACTIONS on Information
Y1 - February 2018
ER -