
Keyword Search Result

[Keyword] MapReduce (26 hits)

Showing results 1-20 of 26

  • Job-Aware File-Storage Optimization for Improved Hadoop I/O Performance

    Makoto NAKAGAMI  Jose A.B. FORTES  Saneyasu YAMAGUCHI  

     
    PAPER-Software System

    Publicized: 2020/06/30
    Vol: E103-D No:10
    Page(s): 2083-2093

    Hadoop is a popular data-analytics platform based on Google's MapReduce programming model. Hard-disk drives (HDDs) are generally used in big-data analysis, and the effectiveness of the Hadoop platform can be improved by enhancing its I/O performance. HDD performance varies depending on whether the data are stored in the inner or outer disk zones. This paper proposes a method that utilizes knowledge of job characteristics to store data efficiently in HDDs and thereby improve Hadoop performance. Under the proposed method, job files that need to be accessed frequently are stored on outer disk tracks, which provide higher sequential-access speeds than inner tracks. Specifically, the method stores temporary and permanent files in the outer and inner zones, respectively, enabling fast access to frequently required data. Performance evaluations demonstrate that the proposed method improves Hadoop performance by 15.4% compared to the normal case in which no placement policy is applied, and that it outperforms a previously proposed placement approach by 11.1%.
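
    The placement rule itself is simple enough to sketch. The following minimal Python sketch (a hypothetical illustration, not the authors' implementation; the zone labels, marker strings, and file names are all assumptions) classifies job files as temporary or permanent and assigns them to outer or inner disk zones accordingly.

        import os

        # Hypothetical zone labels; a real system would map these to block
        # ranges, since outer HDD tracks deliver higher sequential throughput.
        OUTER_ZONE = "outer"   # fast tracks, for frequently accessed files
        INNER_ZONE = "inner"   # slower tracks, for permanent files

        def choose_zone(path: str) -> str:
            """Place temporary (frequently accessed) job files on outer tracks."""
            temp_markers = ("_temporary", ".tmp", "spill")
            name = os.path.basename(path)
            return OUTER_ZONE if any(m in name for m in temp_markers) else INNER_ZONE

        if __name__ == "__main__":
            for f in ["part-r-00000", "attempt_0/spill0.out", "job.tmp"]:
                print(f, "->", choose_zone(f))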

  • Skew-Aware Collective Communication for MapReduce Shuffling

    Harunobu DAIKOKU  Hideyuki KAWASHIMA  Osamu TATEBE  

     
    PAPER-Computer System

    Publicized: 2019/07/29
    Vol: E102-D No:12
    Page(s): 2389-2399

    This paper proposes and examines three in-memory shuffling methods designed to address problems in MapReduce shuffling caused by skewed data. The Coupled Shuffle Architecture (CSA) employs a single pairwise all-to-all exchange to shuffle both blocks, the units of shuffle transfer, and meta-blocks, which contain the metadata of the corresponding blocks. The Decoupled Shuffle Architecture (DSA) separates the shuffling of meta-blocks and blocks and applies a different all-to-all exchange algorithm to each shuffling process, attempting to mitigate the impact of stragglers under strongly skewed distributions. The Decoupled Shuffle Architecture with Skew-Aware Meta-Shuffle (DSA w/ SMS) autonomously determines the proper placement of blocks based on the memory consumption of each worker process, targeting extremely skewed situations in which some worker processes could exceed their nodes' memory limits. This study evaluates implementations of the three shuffling methods in our prototype in-memory MapReduce engine, which employs high-performance interconnects such as InfiniBand and Intel Omni-Path. Our results suggest that DSA w/ SMS is the only viable solution for extremely skewed data distributions. We also present a detailed investigation of the performance of CSA and DSA under various degrees of skew.
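
    As a minimal sketch of the skew-aware idea (a greedy heuristic assumed here for illustration; it is not the paper's SMS mechanism), block sizes gathered from meta-blocks can drive placement of each block onto the worker with the smallest projected memory footprint:

        import heapq

        def skew_aware_placement(block_sizes, n_workers):
            """Greedily assign blocks (largest first) to the least-loaded worker.

            block_sizes: dict of block id -> size in bytes (from meta-blocks).
            Returns a dict of block id -> worker index.
            """
            heap = [(0, w) for w in range(n_workers)]   # (projected load, worker)
            heapq.heapify(heap)
            placement = {}
            for blk, size in sorted(block_sizes.items(), key=lambda kv: -kv[1]):
                load, worker = heapq.heappop(heap)
                placement[blk] = worker
                heapq.heappush(heap, (load + size, worker))
            return placement

        # One huge block ends up alone on a worker; the rest share the other.
        print(skew_aware_placement({"b0": 900, "b1": 100, "b2": 80, "b3": 70}, 2))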

  • An Efficient Parallel Triangle Enumeration on the MapReduce Framework

    Hongyeon KIM  Jun-Ki MIN  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2019/07/11
    Vol: E102-D No:10
    Page(s): 1902-1915

    Triangle enumeration is one of the fundamental problems in graph data processing. Although several MapReduce-based triangle enumeration algorithms have been proposed, they still generate a large amount of intermediate data. In this paper, we propose efficient MapReduce algorithms that enumerate every triangle in a massive graph based on a vertex partition. Since a triangle is composed of an edge and a wedge, our algorithms check for the existence of an edge connecting the end nodes of each wedge. To generate every triangle of a graph in parallel, we first split the graph into several vertex partitions and group the edges and wedges of the graph for each pair of vertex partitions. Then, we form the triangles appearing in each group. Furthermore, to enhance performance, we remove the duplicated wedges that exist in several groups. Our experimental evaluation shows that the performance of the proposed algorithm is better than that of the state-of-the-art algorithm in diverse environments.
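
    The wedge-plus-closing-edge idea is easy to see in a single-machine Python sketch (only an illustration under simplifying assumptions; the paper's partitioned, duplicate-removing MapReduce algorithms are more involved):

        from itertools import combinations

        def enumerate_triangles(edges):
            """A triangle is a wedge a-center-b plus an edge closing (a, b)."""
            adj, edge_set = {}, set()
            for u, v in edges:
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
                edge_set.add(frozenset((u, v)))
            triangles = set()
            for center, nbrs in adj.items():
                for a, b in combinations(sorted(nbrs), 2):   # wedge a-center-b
                    if frozenset((a, b)) in edge_set:        # closing edge?
                        triangles.add(frozenset((a, center, b)))
            return [tuple(sorted(t)) for t in triangles]

        print(enumerate_triangles([(1, 2), (2, 3), (1, 3), (3, 4)]))  # [(1, 2, 3)]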

  • Proxy Responses by FPGA-Based Switch for MapReduce Stragglers

    Koya MITSUZUKA  Michihiro KOIBUCHI  Hideharu AMANO  Hiroki MATSUTANI  

     
    PAPER-Computer System

    Publicized: 2018/06/15
    Vol: E101-D No:9
    Page(s): 2258-2268

    In parallel processing applications, a few worker nodes called “stragglers”, which execute their tasks significantly more slowly than the others, increase the execution time of the whole job. In this paper, we propose a network-switch-based straggler handling system that mitigates the burden on the compute nodes. We also propose how to offload straggler detection and the computation of their results to the network switch without additional communication between worker nodes. We introduce several approximation techniques for the proxy computation and response at the switch; hence our switch is called “ApproxSW.” Simulation results show that the proposed approximation based on task similarity achieves the best accuracy in terms of the quality of the generated Map outputs. We also analyze how to suppress unnecessary proxy computation by ApproxSW. We implemented ApproxSW on a NetFPGA-SUME board that has four 10Gbit Ethernet (10GbE) interfaces and a Virtex-7 FPGA. Experimental results show that the ApproxSW functions do not degrade the original 10GbE switch performance.

  • Naive Bayes Classifier Based Partitioner for MapReduce

    Lei CHEN  Wei LU  Ergude BAO  Liqiang WANG  Weiwei XING  Yuanyuan CAI  

     
    PAPER-Graphs and Networks

    Vol: E101-A No:5
    Page(s): 778-786

    MapReduce is an effective framework for processing large datasets in parallel over a cluster. Data locality and data skew on the reduce side are two essential issues in MapReduce. Improving data locality can decrease network traffic by moving reduce tasks to the nodes where the reducer input data is located, while data skew leads to load imbalance among reducer nodes. Partitioning is an important feature of MapReduce because it determines the reducer nodes to which map output results are sent; therefore, an effective partitioner can improve MapReduce performance by increasing data locality and decreasing data skew on the reduce side. Previous studies addressing both issues can be divided into two categories: those that preferentially improve data locality, such as LEEN, and those that preferentially improve load balance, such as CLP. However, these studies ignore the fact that, for different types of jobs, prioritizing data locality or reduce-side data skew may affect the execution time differently. In this paper, we propose a naive Bayes classifier based partitioner, BAPM, which achieves better performance because it automatically chooses the proper algorithm (LEEN or CLP) by leveraging the naive Bayes classifier, with job type and bandwidth as classification attributes. Our experiments are performed on a Hadoop cluster, and the results show that BAPM boosts the computing performance of MapReduce. The selection accuracy reaches 95.15%, and compared with other popular algorithms under specific bandwidths, the improvement BAPM achieves is up to 31.31%.
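
    To make the classifier-driven selection concrete, here is a tiny hand-rolled naive Bayes choosing between LEEN and CLP. The training rows, attribute values, and smoothing are made up for illustration, though the attributes mirror the paper's (job type and bandwidth):

        from collections import Counter, defaultdict

        class TinyNaiveBayes:
            """Naive Bayes over categorical attributes with simple smoothing."""
            def fit(self, rows, labels):
                self.labels = Counter(labels)
                self.counts = defaultdict(Counter)  # (attr idx, label) -> values
                for row, y in zip(rows, labels):
                    for i, v in enumerate(row):
                        self.counts[(i, y)][v] += 1
                return self

            def predict(self, row):
                def score(y):
                    p = self.labels[y] / sum(self.labels.values())
                    for i, v in enumerate(row):
                        c = self.counts[(i, y)]
                        p *= (c[v] + 1) / (sum(c.values()) + len(c) + 1)
                    return p
                return max(self.labels, key=score)

        # Toy data: (job type, bandwidth level) -> partitioner that won.
        rows = [("shuffle-heavy", "low"), ("shuffle-heavy", "high"),
                ("skewed-keys", "low"), ("skewed-keys", "high")]
        labels = ["LEEN", "LEEN", "CLP", "CLP"]
        model = TinyNaiveBayes().fit(rows, labels)
        print(model.predict(("skewed-keys", "low")))   # -> CLP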

  • Grid-Based Parallel Algorithms of Join Queries for Analyzing Multi-Dimensional Data on MapReduce

    Miyoung JANG  Jae-Woo CHANG  

     
    PAPER-Technologies for Knowledge Support Platform

    Publicized: 2018/01/19
    Vol: E101-D No:4
    Page(s): 964-976

    Recently, join processing over large-scale datasets in MapReduce environments has become an important issue. However, existing MapReduce-based join algorithms suffer from high overhead for constructing and updating the data index. Moreover, their similarity computation cost is high because they partition data without considering the data distribution. In this paper, we propose two grid-based join algorithms for MapReduce. First, we propose a similarity join algorithm that evenly distributes join candidates using a dynamic grid index, which partitions data with consideration of data density and the similarity threshold. We use a bottom-up approach that merges initial grid cells into partitions and assigns them to MapReduce jobs. Second, we propose a k-NN join query processing algorithm for MapReduce. To reduce the data transmission cost, we determine an optimal grid cell size by considering the data distribution of randomly selected samples; we then perform the k-NN join by assigning only the related join data to each reducer. Our performance analysis shows that the proposed similarity join and k-NN join algorithms outperform existing algorithms by up to 10 times in terms of query processing time.
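
    The grid intuition behind the similarity join can be sketched on a single machine (a toy with the cell size fixed to the threshold eps; the paper's density-aware dynamic index and MapReduce distribution are not reproduced):

        import math
        from collections import defaultdict

        def grid_similarity_join(points, eps):
            """Self-join: pairs within eps. With cell size eps, a match can
            only lie in a point's own grid cell or one of its 8 neighbors."""
            grid = defaultdict(list)
            for idx, (x, y) in enumerate(points):
                grid[(int(x // eps), int(y // eps))].append(idx)
            pairs = set()
            for (cx, cy), members in grid.items():
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        for i in members:
                            for j in grid.get((cx + dx, cy + dy), []):
                                if i < j and math.dist(points[i], points[j]) <= eps:
                                    pairs.add((i, j))
            return sorted(pairs)

        print(grid_similarity_join([(0.0, 0.0), (0.5, 0.2), (3.0, 3.0)], 1.0))  # [(0, 1)]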

  • Simultaneous Processing of Multi-Skyline Queries with MapReduce

    Junsu KIM  Kyong-Ha LEE  Myoung-Ho KIM  

     
    LETTER-Fundamentals of Information Systems

    Publicized: 2017/04/07
    Vol: E100-D No:7
    Page(s): 1516-1520

    With the rapid increase in both the number of applications and the size of data, multi-query processing on the MapReduce framework has gained much attention. Meanwhile, there has been much interest in skyline query processing owing to its power for multi-criteria decision making and analysis. Recently, there have been attempts to optimize multi-query processing in MapReduce. However, they are not suitable for processing multiple skyline queries efficiently, and they also require modifications to the Hadoop internals. In this paper, we propose an efficient method for processing multiple skyline queries with MapReduce without any modification of the Hadoop internals. Through various experiments, we show that our approach outperforms previous studies by orders of magnitude.
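
    For reference, the dominance test at the heart of any skyline method fits in a few lines (a block-nested-loop toy assuming smaller values are preferred; the paper's multi-query MapReduce machinery is not shown):

        def dominates(p, q):
            """p dominates q: no worse in every dimension, better in at least one."""
            return (all(a <= b for a, b in zip(p, q))
                    and any(a < b for a, b in zip(p, q)))

        def skyline(points):
            return [p for p in points
                    if not any(dominates(q, p) for q in points if q != p)]

        print(skyline([(1, 4), (2, 2), (3, 1), (4, 4)]))  # [(1, 4), (2, 2), (3, 1)]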

  • A New Efficient Resource Management Framework for Iterative MapReduce Processing in Large-Scale Data Analysis

    Seungtae HONG  Kyongseok PARK  Chae-Deok LIM  Jae-Woo CHANG  

    This paper has been cancelled due to a violation of the duplicate submission policy of IEICE Transactions on Information and Systems, on September 5, 2019.
     
    PAPER

    Publicized: 2017/01/17
    Vol: E100-D No:4
    Page(s): 704-717

    To analyze large-scale data efficiently, studies on Hadoop, one of the most popular MapReduce frameworks, have been actively conducted. Meanwhile, most large-scale data analysis applications, e.g., data clustering, need to execute the same map and reduce functions repeatedly. However, Hadoop cannot provide optimal performance for iterative MapReduce jobs because it derives a result from a single pass of map and reduce functions. To solve these problems, in this paper, we propose a new, efficient resource management framework for iterative MapReduce processing in large-scale data analysis. For this, we first design an iterative job state machine for managing iterative MapReduce jobs. Second, we propose an invariant data caching mechanism for reducing the I/O cost of data accesses. Third, we propose an iterative resource management technique for efficiently managing the resources of a Hadoop cluster. Fourth, we devise a stop-condition check mechanism for preventing unnecessary computation. Finally, we show the performance superiority of the proposed framework by comparing it with existing frameworks.

  • 2PTS: A Two-Phase Task Scheduling Algorithm for MapReduce

    Byungnam LIM  Yeeun SHIM  Yon Dohn CHUNG  

     
    LETTER-Fundamentals of Information Systems

    Publicized: 2016/06/06
    Vol: E99-D No:9
    Page(s): 2377-2380

    For efficient processing of large data in a distributed system, Hadoop MapReduce performs task scheduling such that tasks are distributed with consideration of data locality. Data locality, however, is exploited only to a limited extent, since it is pursued one node at a time without considering global optimality. In this paper, we propose a novel task scheduling algorithm that considers data locality globally. Through experiments, we show that our algorithm improves the performance of MapReduce in various situations.

  • DynamicAdjust: Dynamic Resource Adjustment for Mitigating Skew in MapReduce

    Zhihong LIU  Aimal KHAN  Peixin CHEN  Yaping LIU  Zhenghu GONG  

     
    LETTER-Fundamentals of Information Systems

    Publicized: 2016/03/07
    Vol: E99-D No:6
    Page(s): 1686-1689

    MapReduce still suffers from a problem known as skew, where load is unevenly distributed among tasks. Existing solutions follow a similar pattern: they estimate the load of each task and then rebalance the load among tasks. However, these solutions often incur heavy overhead due to the load estimation and rebalancing. In this paper, we present DynamicAdjust, a dynamic resource adjustment technique for mitigating skew in MapReduce. Rather than rebalancing the load among tasks, DynamicAdjust dynamically adds resources to the tasks that need more computation, thereby accelerating them. Through experiments using real MapReduce workloads on a 21-node Hadoop cluster, we show that DynamicAdjust can effectively mitigate skew and reduce job completion time by up to 37.27% compared to native Hadoop YARN.

  • Optimizing Hash Join with MapReduce on Multi-Core CPUs

    Tong YUAN  Zhijing LIU  Hui LIU  

     
    PAPER-Data Engineering, Web Information Systems

    Publicized: 2016/02/04
    Vol: E99-D No:5
    Page(s): 1316-1325

    In this paper, we exploit the MapReduce framework and other optimizations to improve the performance of hash join algorithms on multi-core CPUs, covering both the no-partition hash join and the partition hash join. We first implement the hash join algorithms with a shared-memory MapReduce model on multi-core CPUs, consisting of a partition phase, a build phase, and a probe phase. We then design an improved cuckoo hash table for our hash join, which combines a cuckoo hash table with a chained hash table. Based on our implementation, we also propose two optimizations, one for the usage of SIMD instructions and the other for the partition phase. Through experimental results and analysis, we find that the partition hash join often outperforms the no-partition hash join, and that our hash join algorithm is faster than previous work by an average of 30%.
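
    The build/probe structure shared by all hash joins is easy to sketch. This toy uses a chained hash table (a Python dict of lists); the paper's improved cuckoo hash table and SIMD optimizations are not reproduced:

        from collections import defaultdict

        def hash_join(left, right):
            """Equi-join on the first field: build a table on the smaller
            relation (build phase), then probe it with the other (probe phase)."""
            build, probe = (left, right) if len(left) <= len(right) else (right, left)
            table = defaultdict(list)
            for row in build:                          # build phase
                table[row[0]].append(row)
            return [(match, row)                       # probe phase
                    for row in probe
                    for match in table.get(row[0], [])]

        users = [(1, "alice"), (2, "bob")]
        orders = [(1, "book"), (2, "pen"), (1, "ink")]
        print(hash_join(users, orders))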

  • The Efficient Algorithms for Constructing Enhanced Quadtrees Using MapReduce

    Hongyeon KIM  Sungmin KANG  Seokjoo LEE  Jun-Ki MIN  

     
    PAPER

    Publicized: 2016/01/14
    Vol: E99-D No:4
    Page(s): 918-926

    MapReduce is considered the de facto framework for storing and processing massive data owing to its fascinating features: simplicity, flexibility, fault tolerance, and scalability. However, since the MapReduce framework does not provide an efficient access method to the data (i.e., an index), the whole data set must be retrieved even when a user wants to access only a small portion of it. Thus, in this paper, we devise efficient algorithms for constructing quadtrees with MapReduce. Our proposed algorithms reduce the index construction time by utilizing a sampling technique to partition the data set. To improve query performance, we extend the quadtree construction algorithm so that adjacent nodes of a quadtree are merged when the number of points located in those nodes is less than a predefined threshold. Furthermore, we present an effective algorithm for incremental updates. Our experimental results show the efficiency of our proposed algorithms in diverse environments.
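
    A minimal recursive quadtree build conveys the basic structure (the sampling-based partitioning, node merging, and incremental updates from the paper are omitted; the boundary format and capacity are assumptions):

        def build_quadtree(points, boundary, capacity=2):
            """Split a square region until each leaf holds <= capacity points.
            boundary is (x0, y0, side); leaf nodes carry their points."""
            x0, y0, side = boundary
            inside = [(x, y) for x, y in points
                      if x0 <= x < x0 + side and y0 <= y < y0 + side]
            if len(inside) <= capacity:
                return {"boundary": boundary, "points": inside}
            half = side / 2
            children = [build_quadtree(inside, (x0 + dx, y0 + dy, half), capacity)
                        for dx in (0, half) for dy in (0, half)]
            return {"boundary": boundary, "children": children}

        tree = build_quadtree([(1, 1), (2, 9), (9, 2), (8, 8), (1, 2)], (0, 0, 16))
        print(len(tree["children"]))   # 4 quadrants after one split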

  • Reusing the Results of Queries in MapReduce Systems by Adopting Shared Storage

    Zhanye WANG  Chuanyi LIU  Dongsheng WANG  

     
    PAPER

    Vol: E99-B No:2
    Page(s): 315-325

    Over the last few years, Apache MapReduce has become the prevailing framework for large-scale data processing. Instead of writing MapReduce programs, which are cumbersome for expressing complex queries, many developers adopt high-level query languages such as Hive or Pig Latin. These languages automatically compile each query into a workflow of MapReduce jobs, so they greatly facilitate the querying and management of large datasets. One way to speed up the execution of workflows is to save previously produced results and reuse them in the future when needed. In this paper we present SuperRack, which uses shared storage devices to store the results of each workflow and allows a new query to reuse these results in order to avoid redundant computation and speed up execution. We propose several novel techniques to improve the access and storage efficiency of the stored results. We also evaluate SuperRack to verify its feasibility and effectiveness. Experiments show that our solution significantly outperforms Hive under the TPC-H benchmark and real-life workloads.

  • A Cloud-Friendly Communication-Optimal Implementation for Strassen's Matrix Multiplication Algorithm

    Jie ZHOU  Feng YU  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2015/07/27
    Vol: E98-D No:11
    Page(s): 1896-1905

    Due to its on-demand and pay-as-you-go properties, cloud computing has become an attractive alternative for HPC applications. However, communication-intensive applications with complex communication patterns still cannot be performed efficiently on cloud platforms equipped with MapReduce technologies such as Hadoop and Spark. One major obstacle is that MapReduce's simple programming model cannot explicitly manipulate data transfers between compute nodes. Another is the cloud's relatively poor network performance compared with traditional HPC platforms. The traditional Strassen's algorithm for square matrix multiplication has a recursive and complex communication pattern on HPC platforms, so it cannot be applied directly to the cloud platform. In this paper, we demonstrate how to make Strassen's algorithm, with its complex communication patterns, “cloud-friendly”. By reorganizing Strassen's algorithm in an iterative pattern, we completely separate its computations and communications, making it fit the MapReduce programming model. By adopting a novel data/task parallel strategy, we resolve Strassen's data dependency problems and keep the computation well balanced. This is the first instance of Strassen's algorithm in MapReduce-style systems, and it also matches Strassen's communication lower bound. Further experimental results show that it achieves a speedup ranging from 1.03× to 2.50× over the classical Θ(n³) algorithm. We believe the principle can be applied to many other complex scientific applications.
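
    For reference, Strassen's seven-product recursion in its classic recursive form (a NumPy sketch assuming square power-of-two matrices; the paper's contribution, the iterative MapReduce-friendly reorganization, is not captured here):

        import numpy as np

        def strassen(A, B, cutoff=64):
            """Multiply square power-of-two matrices with 7 recursive products."""
            n = A.shape[0]
            if n <= cutoff:
                return A @ B
            h = n // 2
            A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
            B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
            M1 = strassen(A11 + A22, B11 + B22, cutoff)
            M2 = strassen(A21 + A22, B11, cutoff)
            M3 = strassen(A11, B12 - B22, cutoff)
            M4 = strassen(A22, B21 - B11, cutoff)
            M5 = strassen(A11 + A12, B22, cutoff)
            M6 = strassen(A21 - A11, B11 + B12, cutoff)
            M7 = strassen(A12 - A22, B21 + B22, cutoff)
            return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                             [M2 + M4, M1 - M2 + M3 + M6]])

        A, B = np.random.rand(128, 128), np.random.rand(128, 128)
        print(np.allclose(strassen(A, B), A @ B))   # True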

  • System Status Aware Hadoop Scheduling Methods for Job Performance Improvement

    Masatoshi KAWARASAKI  Hyuma WATANABE  

     
    PAPER-Fundamentals of Information Systems

    Publicized: 2015/03/26
    Vol: E98-D No:7
    Page(s): 1275-1285

    MapReduce and its open-source implementation Hadoop are now widely deployed for big data analysis. As MapReduce runs over a cluster of many machines, data transfer often becomes a bottleneck in job processing. In this paper, we explore the influence of data transfer on job processing performance and analyze the mechanism of job performance deterioration caused by data-transfer-oriented congestion at disk I/O and/or network I/O. Based on this analysis, we extend Hadoop's heartbeat messages to carry the real-time system status of each machine, such as the disk I/O and link usage rates. This enhancement makes Hadoop's scheduler aware of each machine's workload so that it can make more accurate scheduling decisions. Experiments have been conducted to evaluate the effectiveness of the enhanced scheduling methods, and discussions are provided to compare the several proposed scheduling policies.

  • 3D Objects Tracking by MapReduce GPGPU-Enhanced Particle Filter

    Jieyun ZHOU  Xiaofeng LI  Haitao CHEN  Rutong CHEN  Masayuki NUMAO  

     
    PAPER

    Publicized: 2015/01/21
    Vol: E98-D No:5
    Page(s): 1035-1044

    Object tracking methods have been widely used in video surveillance, motion monitoring, robotics, and other fields. The particle filter is one of the promising methods, but it is difficult to apply to real-time object tracking because of its high computation cost. In order to reduce the processing cost without sacrificing tracking quality, this paper proposes a new method for real-time 3D object tracking that parallelizes the particle filter algorithm with a MapReduce architecture running on a GPGPU. Our methods are as follows. First, we use a Kinect to obtain the 3D information of objects. Unlike conventional 2D-based object tracking, 3D object tracking adds depth information: it can track not only along the x and y axes but also along the z axis, and the depth information can correct some errors in 2D object tracking. Second, to solve the high-computation-cost problem, we use the MapReduce architecture on the GPGPU to parallelize the particle filter algorithm. We implement the particle filter algorithms on the GPU and evaluate the performance by running the program on CUDA 5.5.
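
    A 1D toy shows how one particle-filter iteration decomposes MapReduce-style: a per-particle map (predict and weight, embarrassingly parallel) followed by a global reduce (normalize and resample). The motion and likelihood models below are arbitrary assumptions, not the paper's GPU implementation:

        import random

        def particle_filter_step(particles, observation, noise=0.5):
            # Map phase: predict each particle, weight it against the observation.
            weighted = []
            for p in particles:
                p += random.gauss(0.0, noise)             # toy motion model
                w = 1.0 / (1e-9 + abs(observation - p))   # toy likelihood
                weighted.append((p, w))
            # Reduce phase: normalize the weights, then resample.
            total = sum(w for _, w in weighted)
            return random.choices([p for p, _ in weighted],
                                  weights=[w / total for _, w in weighted],
                                  k=len(particles))

        particles = [random.uniform(0, 10) for _ in range(1000)]
        for obs in [3.0, 3.2, 3.4]:                       # noisy object positions
            particles = particle_filter_step(particles, obs)
        print(sum(particles) / len(particles))            # rough estimate near 3.4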

  • k-Dominant Skyline Query Computation in MapReduce Environment

    Md. Anisuzzaman SIDDIQUE  Hao TIAN  Yasuhiko MORIMOTO  

     
    PAPER

    Publicized: 2015/01/21
    Vol: E98-D No:5
    Page(s): 1027-1034

    Filtering uninteresting data is important for utilizing “big data”. The skyline query is a popular technique for filtering uninteresting data; it selects, from a given large database, the set of objects that are not dominated by any other object. However, a skyline query often retrieves too many objects to analyze intensively, especially for high-dimensional datasets. To solve this problem, k-dominant skyline queries have been introduced. Moreover, database sizes sometimes become too large to compute in a centralized environment, and conventional algorithms for computing k-dominant skyline queries are not well suited to parallel and distributed environments such as the MapReduce framework. In this paper, we propose an efficient parallel algorithm for processing k-dominant skyline queries in the MapReduce framework. Extensive experiments demonstrate the scalability of the proposed algorithm on synthetic big datasets under different settings of data distribution, dimensionality, and cardinality.
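
    The k-dominance test itself can be sketched directly (a toy assuming smaller values are preferred; the parallel MapReduce algorithm is not reproduced). Relaxing dominance to any k of the d dimensions shrinks the result set compared with the ordinary skyline:

        def k_dominates(p, q, k):
            """p k-dominates q: p is no worse than q in at least k dimensions
            and strictly better in at least one of them."""
            no_worse = sum(a <= b for a, b in zip(p, q))
            strictly = any(a < b for a, b in zip(p, q))
            return no_worse >= k and strictly

        def k_dominant_skyline(points, k):
            return [p for p in points
                    if not any(k_dominates(q, p, k) for q in points if q != p)]

        pts = [(1, 5, 3), (2, 2, 2), (3, 1, 4), (4, 4, 4)]
        print(k_dominant_skyline(pts, 2))   # only (2, 2, 2) survives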

  • Reducing I/O Cost in OLAP Query Processing with MapReduce

    Woo-Lam KANG  Hyeon-Gyu KIM  Yoon-Joon LEE  

     
    LETTER-Data Engineering, Web Information Systems

    Publicized: 2014/10/22
    Vol: E98-D No:2
    Page(s): 444-447

    This paper presents a method to reduce I/O cost in MapReduce when online analytical processing (OLAP) queries are used for data analysis. The proposed method consists of two basic ideas. First, to reduce network transmission cost, mappers are organized to receive only the data necessary to perform a map task, rather than the entire input data set. Second, to reduce storage consumption, only record IDs are stored for checkpointing, not the raw records. Experiments conducted with the TPC-H benchmark show that the proposed method is about 40% faster than Hive, the well-known data warehouse solution for MapReduce, while reducing the size of the data stored for checkpointing to about 80%.

  • MapReduce Job Scheduling Based on Remaining Job Sizes

    Tatsuma MATSUKI  Tetsuya TAKINE  

     
    PAPER-Network System

    Vol: E98-B No:1
    Page(s): 180-189

    The MapReduce job scheduler implemented in Hadoop is the mechanism that decides which job is allowed to use idle resources in Hadoop. In terms of the mean job response time, the performance of the job scheduler strongly depends on the job arrival pattern, which includes job sizes (i.e., the amounts of required resources) and their arrival order. Because existing schedulers do not utilize information about job sizes, however, they suffer severe performance degradation with some arrival patterns. In this paper, we propose a scheduler that estimates and utilizes remaining job sizes in order to achieve good performance regardless of the job arrival pattern. Through simulation experiments, we confirm that the proposed scheduler achieves better performance than the existing schedulers for various arrival patterns.

  • Efficient K-Nearest Neighbor Graph Construction Using MapReduce for Large-Scale Data Sets

    Tomohiro WARASHINA  Kazuo AOYAMA  Hiroshi SAWADA  Takashi HATTORI  

     
    PAPER-Data Engineering, Web Information Systems

    Vol: E97-D No:12
    Page(s): 3142-3154

    This paper presents an efficient method that uses Hadoop MapReduce to construct a K-nearest neighbor graph (K-NNG) from a large-scale data set. The K-NNG has been utilized as a data structure for data analysis techniques in various applications, so an efficient construction method is desirable when those techniques are applied to large-scale data sets. We focus on NN-Descent, a recently proposed method that efficiently constructs an approximate K-NNG. NN-Descent is implemented on a shared-memory system with OpenMP-based parallelization, and its extension to the Hadoop MapReduce framework is suggested for data sets too large for a shared-memory system to handle. However, a simple extension to the Hadoop MapReduce framework is impractical, since it requires extremely high system performance because of the high memory consumption and the low data transmission efficiency of MapReduce jobs. The proposed method relaxes this requirement by improving the MapReduce jobs, employing an appropriate key-value pair format and an efficient sampling strategy. Experiments on large-scale data sets demonstrate that the proposed method both works efficiently and is scalable in terms of the data size, the number of machine nodes, and the graph structural parameter K.
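
    The core NN-Descent intuition, that a neighbor of a neighbor is likely to be a neighbor, fits in a short sketch (this omits real NN-Descent's reverse-neighbor and sampling refinements, and the paper's MapReduce formulation entirely):

        import math, random

        def nn_descent(points, K, iters=5):
            """Refine random K-NN lists by scanning neighbors of neighbors."""
            n = len(points)
            def d(i, j):
                return math.dist(points[i], points[j])
            # Start from random neighbor lists: node -> [(distance, neighbor)].
            knn = {i: sorted((d(i, j), j)
                             for j in random.sample([x for x in range(n) if x != i], K))
                   for i in range(n)}
            for _ in range(iters):
                for i in range(n):
                    cand = {j for _, j in knn[i]}
                    cand |= {c for _, j in knn[i] for _, c in knn[j]}
                    cand.discard(i)
                    knn[i] = sorted((d(i, c), c) for c in cand)[:K]
            return {i: [j for _, j in pairs] for i, pairs in knn.items()}

        pts = [(random.random(), random.random()) for _ in range(50)]
        print(nn_descent(pts, K=3)[0])   # approximate 3-NN list of point 0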

Showing results 1-20 of 26