
Keyword Search Result

[Keyword] hadoop (15 hits)

1-15 hits
  • Job-Aware File-Storage Optimization for Improved Hadoop I/O Performance

    Makoto NAKAGAMI  Jose A.B. FORTES  Saneyasu YAMAGUCHI  

     
    PAPER-Software System

      Publicized: 2020/06/30
      Vol: E103-D No:10
      Page(s): 2083-2093

    Hadoop is a popular data-analytics platform based on Google's MapReduce programming model. Hard-disk drives (HDDs) are generally used in big-data analysis, and the effectiveness of the Hadoop platform can be optimized by enhancing its I/O performance. HDD performance varies depending on whether the data are stored in the inner or outer disk zones. This paper proposes a method that utilizes knowledge of job characteristics to realize efficient data storage in HDDs, which, in turn, helps improve Hadoop performance. Under the proposed method, job files that need to be accessed frequently are stored in outer disk tracks, which provide higher sequential-access speeds than inner tracks. Thus, the proposed method stores temporary and permanent files in the outer and inner zones, respectively, thereby facilitating fast access to frequently required data. Results of a performance evaluation demonstrate that the proposed method improves Hadoop performance by 15.4% compared to the normal case in which this file placement is not used. Additionally, the proposed method outperforms a previously proposed placement approach by 11.1%.
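
    A minimal, self-contained Java sketch of the placement policy described above: temporary or intermediate files go to the fast outer zone, permanent files to the inner zone. The class name ZonePlanner, the path patterns, and the classification rule are illustrative assumptions, not the authors' implementation.

```java
// Illustrative sketch only: maps Hadoop file roles to HDD zones the way the
// policy above describes (temporary/intermediate data -> fast outer zone,
// permanent data -> inner zone). Class name and path patterns are hypothetical.
import java.util.List;

public class ZonePlanner {
    enum Zone { OUTER, INNER }

    // Intermediate/temporary files are accessed frequently and sequentially,
    // so the policy sends them to the faster outer tracks.
    static Zone zoneFor(String path) {
        boolean temporary = path.contains("/intermediate/")
                || path.endsWith(".tmp")
                || path.contains("/mapred/local/");   // assumption: typical temp locations
        return temporary ? Zone.OUTER : Zone.INNER;
    }

    public static void main(String[] args) {
        List<String> files = List.of(
                "/data/mapred/local/spill0.out.tmp",
                "/user/hive/warehouse/sales/part-00000");
        for (String f : files) {
            System.out.printf("%-45s -> %s zone%n", f, zoneFor(f));
        }
    }
}
```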

  • An Efficient Block Assignment Policy in Hadoop Distributed File System for Multimedia Data Processing

    Cheolgi KIM  Daechul LEE  Jaehyun LEE  Jaehwan LEE  

     
    LETTER-Computer System

      Publicized: 2019/05/21
      Vol: E102-D No:8
      Page(s): 1569-1571

    Hadoop, a distributed processing framework for big data, is now widely used for multimedia processing. However, when processing video data stored in the Hadoop Distributed File System (HDFS), unnecessary network traffic is generated because the HDFS block-slicing policy is inefficient for the picture frames in video files. We propose a new block replication policy to solve this problem and compare the newly proposed HDFS with the original HDFS via extensive experiments. The proposed HDFS reduces network traffic and increases locality between processing cores and file locations.

  • Delay Distribution Based Remote Data Fetch Scheme for Hadoop Clusters in Public Cloud

    Ravindra Sandaruwan RANAWEERA  Eiji OKI  Nattapong KITSUWAN  

     
    PAPER-Network

      Publicized: 2019/02/04
      Vol: E102-B No:8
      Page(s): 1617-1625

    Apache Hadoop and its ecosystem have become the de facto platform for processing large-scale data, or Big Data, because they hide the complexity of distributed computing, scheduling, and communication while providing fault tolerance. Cloud-based environments are becoming a popular platform for hosting Hadoop clusters due to their low initial cost and virtually limitless capacity. However, cloud-based Hadoop clusters bring their own challenges because of contradictory design principles: Hadoop is designed on the shared-nothing principle, while the cloud is based on consolidation and resource sharing. Most of Hadoop's features are designed for on-premises data centers where the cluster topology is known. Hadoop depends on the rack assignment of servers (configured by the cluster administrator) to calculate the distance between servers, which it uses to find the best remote server from which to fetch non-local data. However, public cloud providers do not share the rack information of virtual servers with their tenants. Without rack information, Hadoop may fetch data from a remote server that is on the other side of the data center. To overcome this problem, we propose a delay-distribution-based scheme to find the closest server from which to fetch non-local data in public cloud-based Hadoop clusters. The proposed scheme bases server selection on the delay distributions between server pairs, which are calculated by periodically measuring the round-trip time between servers. Our experiments show that the proposed scheme outperforms conventional Hadoop by nearly 12% in terms of non-local data fetch time. This reduction in data fetch time will lead to a reduction in job run time, especially in real-world multi-user clusters where non-local data fetching can happen frequently.
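
    The following Java sketch illustrates the selection step under stated assumptions: round-trip times are sampled periodically per server, and the replica holder with the lowest median RTT is chosen. The median statistic and all names (DelayAwareSelector, recordRtt, pickReplica) are hypothetical; the paper's actual scheme may use a different delay-distribution statistic.

```java
// Illustrative sketch: choose the "closest" replica holder from periodic
// round-trip-time samples, in the spirit of a delay-distribution-based scheme.
// The statistic used here (median RTT) is an assumption for illustration.
import java.util.*;

public class DelayAwareSelector {
    private final Map<String, List<Double>> rttSamples = new HashMap<>();

    // Called periodically with a fresh RTT measurement (milliseconds).
    public void recordRtt(String server, double rttMs) {
        rttSamples.computeIfAbsent(server, s -> new ArrayList<>()).add(rttMs);
    }

    private static double median(List<Double> xs) {
        List<Double> s = new ArrayList<>(xs);
        Collections.sort(s);
        int n = s.size();
        return n % 2 == 1 ? s.get(n / 2) : (s.get(n / 2 - 1) + s.get(n / 2)) / 2.0;
    }

    // Pick the replica holder whose measured delays look closest.
    public String pickReplica(Collection<String> replicaHolders) {
        return replicaHolders.stream()
                .filter(rttSamples::containsKey)
                .min(Comparator.comparingDouble(s -> median(rttSamples.get(s))))
                .orElseThrow(() -> new IllegalStateException("no RTT data"));
    }

    public static void main(String[] args) {
        DelayAwareSelector sel = new DelayAwareSelector();
        sel.recordRtt("nodeA", 0.4); sel.recordRtt("nodeA", 0.5);
        sel.recordRtt("nodeB", 2.1); sel.recordRtt("nodeB", 1.9);
        System.out.println("fetch from: " + sel.pickReplica(List.of("nodeA", "nodeB")));
    }
}
```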

  • Distributed Video Decoding on Hadoop

    Illo YOON  Saehanseul YI  Chanyoung OH  Hyeonjin JUNG  Youngmin YI  

     
    PAPER-Cluster Computing

      Publicized: 2018/09/18
      Vol: E101-D No:12
      Page(s): 2933-2941

    Video analytics is usually time-consuming, as it not only requires video decoding as a first step but also typically applies complex computer vision and machine learning algorithms to the decoded frames. To achieve high efficiency in video analytics with ever-increasing frame sizes, many studies have investigated distributed video processing using Hadoop. However, most approaches focus on processing multiple video files on multiple nodes. Such approaches require many video files to achieve any speedup and can easily result in load imbalance when video files are long, since each video file is processed sequentially. In contrast, we propose a distributed video decoding method, built on an extended FFmpeg and a VideoRecordReader, by which a single large video file can be processed in parallel across multiple nodes in Hadoop. The experimental results show that case studies of face detection and a SURF-based system achieve speedups of 40.6 times and 29.1 times, respectively, on a four-node cluster with 12 mappers per node, showing good scalability.

  • Naive Bayes Classifier Based Partitioner for MapReduce

    Lei CHEN  Wei LU  Ergude BAO  Liqiang WANG  Weiwei XING  Yuanyuan CAI  

     
    PAPER-Graphs and Networks

      Vol: E101-A No:5
      Page(s): 778-786

    MapReduce is an effective framework for processing large datasets in parallel over a cluster. Data locality and data skew on the reduce side are two essential issues in MapReduce. Improving data locality can decrease network traffic by moving reduce tasks to the nodes where the reducer input data are located, while data skew leads to load imbalance among reducer nodes. Partitioning is an important feature of MapReduce because it determines the reducer nodes to which map output results are sent. Therefore, an effective partitioner can improve MapReduce performance by increasing data locality and decreasing data skew on the reduce side. Previous studies considering both issues can be divided into two categories: those that preferentially improve data locality, such as LEEN, and those that preferentially improve load balance, such as CLP. However, these studies ignore the fact that, for different types of jobs, prioritizing data locality or reduce-side data skew may affect the execution time differently. In this paper, we propose a naive Bayes classifier based partitioner, namely BAPM, which achieves better performance because it automatically chooses the proper algorithm (LEEN or CLP) by leveraging the naive Bayes classifier, with job type and bandwidth as classification attributes. Our experiments are performed on a Hadoop cluster, and the results show that BAPM boosts the computing performance of MapReduce. The selection accuracy reaches 95.15%, and, compared with other popular algorithms, the improvement BAPM achieves under specific bandwidths is up to 31.31%.
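
    A toy Java sketch of the selection idea, assuming two categorical attributes (job type and bandwidth level) and the two classes named above (LEEN and CLP): a tiny naive Bayes classifier with Laplace smoothing picks the partitioning algorithm. The training examples are invented placeholders, not data from the paper.

```java
// Illustrative sketch of the selection step: a tiny naive Bayes classifier over
// two categorical attributes (job type, bandwidth level) that chooses which
// partitioning algorithm (LEEN or CLP) to apply. Training examples are invented.
import java.util.*;

public class PartitionerSelector {
    // classCounts.get(c) = number of training examples with class c
    // attrCounts.get(c).get(i).get(v) = examples of class c whose attribute i has value v
    private final Map<String, Integer> classCounts = new HashMap<>();
    private final Map<String, List<Map<String, Integer>>> attrCounts = new HashMap<>();
    private final int numAttrs = 2;

    public void train(String[] attrs, String clazz) {
        classCounts.merge(clazz, 1, Integer::sum);
        List<Map<String, Integer>> perAttr = attrCounts.computeIfAbsent(clazz, c -> {
            List<Map<String, Integer>> l = new ArrayList<>();
            for (int i = 0; i < numAttrs; i++) l.add(new HashMap<>());
            return l;
        });
        for (int i = 0; i < numAttrs; i++) perAttr.get(i).merge(attrs[i], 1, Integer::sum);
    }

    public String classify(String[] attrs) {
        int total = classCounts.values().stream().mapToInt(Integer::intValue).sum();
        String best = null;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (String clazz : classCounts.keySet()) {
            double logP = Math.log((double) classCounts.get(clazz) / total);
            for (int i = 0; i < numAttrs; i++) {
                int count = attrCounts.get(clazz).get(i).getOrDefault(attrs[i], 0);
                // Laplace smoothing (assuming two values per attribute) avoids zero probabilities.
                logP += Math.log((count + 1.0) / (classCounts.get(clazz) + 2.0));
            }
            if (logP > bestLog) { bestLog = logP; best = clazz; }
        }
        return best;
    }

    public static void main(String[] args) {
        PartitionerSelector s = new PartitionerSelector();
        s.train(new String[]{"shuffle-heavy", "lowBW"}, "LEEN");   // fabricated toy examples
        s.train(new String[]{"shuffle-heavy", "highBW"}, "LEEN");
        s.train(new String[]{"compute-heavy", "highBW"}, "CLP");
        s.train(new String[]{"compute-heavy", "lowBW"}, "CLP");
        System.out.println(s.classify(new String[]{"shuffle-heavy", "lowBW"}));  // -> LEEN
    }
}
```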

  • Hadoop I/O Performance Improvement by File Layout Optimization

    Eita FUJISHIMA  Kenji NAKASHIMA  Saneyasu YAMAGUCHI  

     
    PAPER-Data Engineering, Web Information Systems

      Publicized: 2017/11/22
      Vol: E101-D No:2
      Page(s): 415-427

    Hadoop is a popular open-source MapReduce implementation. In jobs such as TeraSort, in which the large output files of all relevant Map tasks are transmitted to Reduce tasks, the Reduce tasks become the bottleneck and are I/O bound because they process many large output files. In most cases, including TeraSort, the intermediate data, which include the output files of the Map tasks, are large and accessed sequentially. To improve the performance of such jobs, it is important to increase sequential-access performance. In this paper, we propose methods for improving the performance of the Reduce tasks of such jobs by considering two facts: these files are accessed sequentially on an HDD, and each zone of an HDD has different sequential I/O performance. The proposed methods control where intermediate data are stored by modifying the block bitmap of the filesystem, which manages the utilization (free or used) of blocks in an HDD. In addition, we propose a striping layout for applying these methods to virtualized environments that use image files. We then present a performance evaluation of the proposed methods and demonstrate that they improve Hadoop application performance.
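
    A conceptual Java simulation (not the authors' filesystem code) of steering an allocator with a block bitmap: inner-zone blocks are temporarily marked as used so that a first-fit allocator places intermediate data in the outer zone, after which the reservation is released. Zone boundaries, sizes, and names are assumptions.

```java
// Conceptual simulation: steer a first-fit allocator toward outer-zone blocks by
// temporarily marking inner-zone blocks as "used" in a block bitmap before
// intermediate data are written, then releasing the reservation afterwards.
import java.util.BitSet;

public class BitmapSteering {
    static final int TOTAL_BLOCKS = 1000;
    static final int OUTER_ZONE_END = 400;   // assumption: blocks 0..399 map to outer tracks

    public static void main(String[] args) {
        BitSet used = new BitSet(TOTAL_BLOCKS);   // a set bit means the block is occupied

        // Reserve the inner zone so the allocator only sees outer-zone blocks.
        used.set(OUTER_ZONE_END, TOTAL_BLOCKS);

        int b1 = allocate(used);   // intermediate data land in the outer zone
        int b2 = allocate(used);
        System.out.println("intermediate blocks: " + b1 + ", " + b2);

        // Release the reservation once the intermediate files have been placed.
        used.clear(OUTER_ZONE_END, TOTAL_BLOCKS);
    }

    // First-fit allocation over the bitmap.
    static int allocate(BitSet used) {
        int block = used.nextClearBit(0);
        if (block >= TOTAL_BLOCKS) throw new IllegalStateException("disk full");
        used.set(block);
        return block;
    }
}
```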

  • HVTS: Hadoop-Based Video Transcoding System for Media Services

    Seokhyun SON  Myoungjin KIM  

     
    LETTER-Graphs and Networks

      Vol: E100-A No:5
      Page(s): 1248-1253

    In this letter, we propose a Hadoop-based Video Transcoding System (HVTS), which is designed to run on all major cloud computing services. HVTS is highly adapted to the structure and policies of Hadoop and thus has additional capabilities for transcoding, task distribution, load balancing, and content replication and distribution. To evaluate our proposed system, we carry out two performance tests on our local testbed: transcoding performance and robustness to data-node and task failures. The results confirm that our system delivers satisfactory performance in facilitating seamless streaming services in cloud computing environments.

  • A New Efficient Resource Management Framework for Iterative MapReduce Processing in Large-Scale Data Analysis

    Seungtae HONG  Kyongseok PARK  Chae-Deok LIM  Jae-Woo CHANG  

    This paper was cancelled on September 5, 2019 due to a violation of the duplicate submission policy of IEICE Transactions on Information and Systems.
     
    PAPER

      Publicized: 2017/01/17
      Vol: E100-D No:4
      Page(s): 704-717
    • Errata [uploaded on March 1, 2018]

    To analyze large-scale data efficiently, studies on Hadoop, one of the most popular MapReduce frameworks, have been actively conducted. Meanwhile, most large-scale data analysis applications, e.g., data clustering, need to execute the same map and reduce functions repeatedly. However, Hadoop cannot provide optimal performance for iterative MapReduce jobs because it derives a result from a single pass of map and reduce functions. To solve these problems, in this paper we propose a new, efficient resource management framework for iterative MapReduce processing in large-scale data analysis. For this, we first design an iterative job state machine for managing iterative MapReduce jobs. Second, we propose an invariant data caching mechanism for reducing the I/O cost of data accesses. Third, we propose an iterative resource management technique for efficiently managing the resources of a Hadoop cluster. Fourth, we devise a stop-condition check mechanism for preventing unnecessary computation. Finally, we show the performance superiority of the proposed framework by comparing it with existing frameworks.
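
    A skeleton of the control flow described above, written as plain Java with hypothetical interfaces rather than the proposed framework's API: the invariant input is passed in once (standing in for the caching mechanism), each loop iteration stands for one map/reduce round, and a stop-condition check ends the iteration early.

```java
// Skeleton of an iterative driver: cache invariant input once, iterate
// map/reduce rounds, and stop when convergence or an iteration cap is reached.
// Interfaces and the toy computation are illustrative assumptions.
import java.util.List;
import java.util.function.BiFunction;

public class IterativeDriver {
    public static <S> S run(List<String> invariantInput,          // cached once, reused each round
                            BiFunction<List<String>, S, S> mapReduceRound,
                            BiFunction<S, S, Boolean> converged,
                            S initialState, int maxIterations) {
        S previous = initialState;
        for (int i = 0; i < maxIterations; i++) {
            S next = mapReduceRound.apply(invariantInput, previous);  // one map+reduce phase
            if (converged.apply(previous, next)) return next;         // stop-condition check
            previous = next;
        }
        return previous;
    }

    public static void main(String[] args) {
        // Toy example: repeatedly move the state toward the data mean until it stops changing.
        List<String> data = List.of("2", "4", "6");
        Double result = IterativeDriver.<Double>run(data,
                (in, state) -> {
                    double mean = in.stream().mapToDouble(Double::parseDouble).average().orElse(0);
                    return (state + mean) / 2.0;
                },
                (a, b) -> Math.abs(a - b) < 1e-6,
                0.0, 100);
        System.out.println("converged to " + result);
    }
}
```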

  • System Status Aware Hadoop Scheduling Methods for Job Performance Improvement

    Masatoshi KAWARASAKI  Hyuma WATANABE  

     
    PAPER-Fundamentals of Information Systems

      Publicized: 2015/03/26
      Vol: E98-D No:7
      Page(s): 1275-1285

    MapReduce and its open-source implementation Hadoop are now widely deployed for big data analysis. As MapReduce runs over a large cluster of machines, data transfer often becomes a bottleneck in job processing. In this paper, we explore the influence of data transfer on job processing performance and analyze the mechanism of job performance deterioration caused by data-transfer-oriented congestion at disk I/O and/or network I/O. Based on this analysis, we extend Hadoop's Heartbeat messages to carry the real-time system status of each machine, such as disk I/O and link usage rates. This enhancement makes Hadoop's scheduler aware of each machine's workload and lets it make more accurate scheduling decisions. Experiments were conducted to evaluate the effectiveness of the enhanced scheduling methods, and discussions are provided to compare the several proposed scheduling policies.
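
    A minimal Java sketch of the idea, assuming a heartbeat record extended with per-node disk and link utilization and a scheduler that prefers the least congested node. The field names and the congestion score are assumptions, not Hadoop's actual Heartbeat format.

```java
// Illustrative sketch: a heartbeat record carrying per-node disk and network
// utilization, and a scheduler that assigns work to the least-loaded node.
import java.util.*;

public class StatusAwareScheduler {
    record Heartbeat(String node, double diskIoUtil, double linkUtil) {
        double load() { return Math.max(diskIoUtil, linkUtil); } // simple congestion score (assumption)
    }

    private final Map<String, Heartbeat> latest = new HashMap<>();

    public void onHeartbeat(Heartbeat hb) { latest.put(hb.node(), hb); }

    // Assign the next task to the node whose disk/network are least congested.
    public String pickNodeForTask() {
        return latest.values().stream()
                .min(Comparator.comparingDouble(Heartbeat::load))
                .map(Heartbeat::node)
                .orElseThrow(() -> new IllegalStateException("no heartbeats yet"));
    }

    public static void main(String[] args) {
        StatusAwareScheduler sched = new StatusAwareScheduler();
        sched.onHeartbeat(new Heartbeat("node1", 0.90, 0.20));  // disk-congested
        sched.onHeartbeat(new Heartbeat("node2", 0.30, 0.25));
        System.out.println("schedule on: " + sched.pickNodeForTask());  // node2
    }
}
```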

  • Long-Term Performance Evaluation of Hadoop Jobs in Public and Community Clouds

    Kento AIDA  Omar ABDUL-RAHMAN  Eisaku SAKANE  Kazutaka MOTOYAMA  

     
    PAPER-Computer System

      Publicized: 2015/02/25
      Vol: E98-D No:6
      Page(s): 1176-1184

    Cloud computing is a widely used computing platform in business and academic communities. Performance is an important issue when a user runs an application in the cloud. The user may want to estimate the application-execution time beforehand to guarantee the application performance or to choose the most suitable cloud. Moreover, the cloud system architect and the designer need to understand the application performance characteristics, such as the scalability or the utilization of cloud platforms, to improve performance. However, because application performance in clouds sometimes fluctuates, estimating the application performance is difficult. In this paper, we discuss the performance fluctuation of Hadoop jobs in both a public cloud and a community cloud over one to three months. The experimental results indicate phenomena that we cannot see without long-term experiments, as well as phenomena inherent in Hadoop. The results suggest better ways to estimate Hadoop application performance in clouds. For example, we should be aware of application characteristics (CPU intensive or communication intensive), datacenter characteristics (busy or not), and time frame (time of day and day of the week) to estimate the performance fluctuation due to workload congestion in cloud platforms. Furthermore, we should be aware of performance degradation due to task re-execution in Hadoop applications.

  • A Distributed and Cooperative NameNode Cluster for a Highly-Available Hadoop Distributed File System

    Yonghwan KIM  Tadashi ARARAGI  Junya NAKAMURA  Toshimitsu MASUZAWA  

     
    PAPER-Computer System

      Publicized: 2014/12/26
      Vol: E98-D No:4
      Page(s): 835-851

    Recently, Hadoop has attracted much attention from engineers and researchers as an emerging and effective framework for Big Data. HDFS (Hadoop Distributed File System) can manage a huge amount of data with high performance and reliability using only commodity hardware. However, HDFS requires a single master node, called a NameNode, to manage the entire namespace (or all the i-nodes) of a file system. This causes the SPOF (Single Point Of Failure) problem, because the file system becomes inaccessible when the NameNode fails. It also creates an efficiency bottleneck, since all access requests to the file system have to contact the NameNode. Hadoop 2.0 resolves the SPOF problem by introducing manual failover based on two NameNodes, Active and Standby. However, it still has the efficiency bottleneck, since all access requests have to contact the Active NameNode during ordinary execution. It may also lose the advantage of using commodity hardware, since the two NameNodes have to share highly reliable, sophisticated storage. In this paper, we propose a new HDFS architecture to resolve all the problems mentioned above.

  • Reducing I/O Cost in OLAP Query Processing with MapReduce

    Woo-Lam KANG  Hyeon-Gyu KIM  Yoon-Joon LEE  

     
    LETTER-Data Engineering, Web Information Systems

      Publicized: 2014/10/22
      Vol: E98-D No:2
      Page(s): 444-447

    This paper presents a method to reduce I/O cost in MapReduce when online analytical processing (OLAP) queries are used for data analysis. The proposed method consists of two basic ideas. First, to reduce network transmission cost, mappers are organized to receive only the data necessary to perform a map task, rather than the entire set of input data. Second, to reduce storage consumption, only record IDs are stored for checkpointing, not the raw records. Experiments conducted with the TPC-H benchmark show that the proposed method is about 40% faster than Hive, the well-known data warehouse solution for MapReduce, while reducing the size of the data stored for checkpointing to about 80%.
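
    A small Java sketch of the second idea under simplifying assumptions: the checkpoint stores only record IDs, and the raw records are re-read from the source (standing in for the input stored in HDFS) during recovery. All names are hypothetical.

```java
// Illustrative sketch: checkpoint record IDs instead of raw records, and
// rebuild the processed records from the source on recovery. The in-memory
// "source" map stands in for the original input data kept in storage.
import java.util.*;
import java.util.stream.Collectors;

public class IdCheckpoint {
    // Source records keyed by record ID (stand-in for the persisted input).
    private final Map<Long, String> source = new HashMap<>();
    private final List<Long> checkpointedIds = new ArrayList<>();

    void process(long id, String record) {
        source.put(id, record);
        checkpointedIds.add(id);          // far smaller than storing the record itself
    }

    // On failure, rebuild the processed set from IDs by re-reading the source.
    List<String> recover() {
        return checkpointedIds.stream().map(source::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        IdCheckpoint cp = new IdCheckpoint();
        cp.process(1L, "record-one");
        cp.process(2L, "record-two");
        System.out.println("recovered " + cp.recover().size() + " records from IDs only");
    }
}
```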

  • MapReduce Job Scheduling Based on Remaining Job Sizes

    Tatsuma MATSUKI  Tetsuya TAKINE  

     
    PAPER-Network System

      Vol: E98-B No:1
      Page(s): 180-189

    The MapReduce job scheduler implemented in Hadoop is the mechanism that decides which job is allowed to use idle resources in Hadoop. In terms of mean job response time, the performance of the job scheduler strongly depends on the job arrival pattern, which includes the job sizes (i.e., the amount of required resources) and their arrival order. However, because existing schedulers do not utilize information about job sizes, they suffer severe performance degradation under some arrival patterns. In this paper, we propose a scheduler that estimates and utilizes remaining job sizes in order to achieve good performance regardless of the job arrival pattern. Through simulation experiments, we confirm that, for various arrival patterns, the proposed scheduler achieves better performance than existing schedulers.
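
    A minimal Java sketch of size-aware scheduling in the spirit of the proposal: when resources become idle, they go to the job with the smallest estimated remaining size. The estimation rule used here (unfinished task count) is an assumption; the paper's estimator may differ.

```java
// Illustrative sketch: give an idle slot to the job with the smallest
// estimated remaining size. Job structure and estimator are assumptions.
import java.util.*;

public class RemainingSizeScheduler {
    static class Job {
        final String name;
        final int totalTasks;
        int finishedTasks;
        Job(String name, int totalTasks) { this.name = name; this.totalTasks = totalTasks; }
        // Remaining size estimated from tasks not yet finished.
        int estimatedRemaining() { return totalTasks - finishedTasks; }
    }

    // Choose which running job may use the idle slot.
    static Job pick(List<Job> running) {
        return running.stream()
                .min(Comparator.comparingInt(Job::estimatedRemaining))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Job big = new Job("sort-large", 5000);
        Job small = new Job("log-count", 40);
        small.finishedTasks = 10;
        System.out.println("idle slot goes to: " + pick(List.of(big, small)).name);  // log-count
    }
}
```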

  • Efficient K-Nearest Neighbor Graph Construction Using MapReduce for Large-Scale Data Sets

    Tomohiro WARASHINA  Kazuo AOYAMA  Hiroshi SAWADA  Takashi HATTORI  

     
    PAPER-Data Engineering, Web Information Systems

      Vol: E97-D No:12
      Page(s): 3142-3154

    This paper presents an efficient method using Hadoop MapReduce for constructing a K-nearest neighbor graph (K-NNG) from a large-scale data set. The K-NNG has been utilized as a data structure for data analysis techniques in various applications, and applying these techniques to a large-scale data set requires an efficient K-NNG construction method. We focus on NN-Descent, a recently proposed method that efficiently constructs an approximate K-NNG. NN-Descent is implemented on a shared-memory system with OpenMP-based parallelization, and extending it to the Hadoop MapReduce framework is desirable for data sets too large for a shared-memory system to handle. However, a straightforward extension to the Hadoop MapReduce framework is impractical because it demands extremely high system performance, owing to the high memory consumption and the low data-transmission efficiency of MapReduce jobs. The proposed method relaxes this requirement by improving the MapReduce jobs, employing an appropriate key-value pair format and an efficient sampling strategy. Experiments on large-scale data sets demonstrate that the proposed method both works efficiently and is scalable in terms of the data size, the number of machine nodes, and the graph structural parameter K.

  • An Efficiency-Aware Scheduling for Data-Intensive Computations on MapReduce Clusters

    Hui ZHAO  Shuqiang YANG  Hua FAN  Zhikun CHEN  Jinghu XU  

     
    PAPER

      Vol: E96-D No:12
      Page(s): 2654-2662

    Scheduling plays a key role in MapReduce systems. In this paper, we explore the efficiency of a MapReduce cluster running many independent and continuously arriving MapReduce jobs. Data locality and load balancing are two important factors for improving computation efficiency in MapReduce systems for data-intensive computations. Traditional cluster scheduling technologies are not well suited to the MapReduce environment; some schedulers are in use for the popular open-source Hadoop MapReduce implementation, but they cannot optimize both factors well. Our main objective is to minimize the total flowtime of all jobs; since this is a strongly NP-hard problem, we adopt effective heuristics to seek satisfactory solutions. In this paper, we formalize the scheduling problem as a job selection problem and propose a load-balance-aware job selection algorithm; at the task level, we design a strict data-locality scheduling algorithm for map tasks on map machines and a load-balance-aware scheduling algorithm for reduce tasks on reduce machines. Comprehensive experiments have been conducted to compare our scheduling strategy with well-known Hadoop scheduling strategies. The experimental results validate the efficiency of our proposed scheduling strategy.
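
    A simplified Java sketch of the strict data-locality rule for map tasks: a machine is assigned a map task only if it holds that task's input block locally; otherwise it receives no map task for now. Data structures and names are illustrative assumptions, not the paper's algorithm.

```java
// Illustrative sketch: strict data-locality assignment for map tasks.
// A machine only ever runs map tasks whose input block it stores locally.
import java.util.*;

public class LocalityStrictMapScheduler {
    // machine -> set of block IDs stored locally
    private final Map<String, Set<String>> localBlocks = new HashMap<>();
    // pending map tasks, each identified by the input block it reads
    private final Deque<String> pendingBlocks = new ArrayDeque<>();

    void addMachine(String machine, Set<String> blocks) { localBlocks.put(machine, blocks); }
    void addTask(String blockId) { pendingBlocks.add(blockId); }

    // Returns the block whose task should run on this machine, or empty if no local work exists.
    Optional<String> assign(String machine) {
        Set<String> local = localBlocks.getOrDefault(machine, Set.of());
        Iterator<String> it = pendingBlocks.iterator();
        while (it.hasNext()) {
            String block = it.next();
            if (local.contains(block)) {
                it.remove();
                return Optional.of(block);
            }
        }
        return Optional.empty();   // strict: never assign a non-local map task
    }

    public static void main(String[] args) {
        LocalityStrictMapScheduler s = new LocalityStrictMapScheduler();
        s.addMachine("m1", Set.of("blk_1", "blk_3"));
        s.addMachine("m2", Set.of("blk_2"));
        s.addTask("blk_2");
        s.addTask("blk_3");
        System.out.println("m1 gets " + s.assign("m1"));   // blk_3
        System.out.println("m2 gets " + s.assign("m2"));   // blk_2
    }
}
```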