
Keyword Search Result

[Keyword] hadoop (15 hits)

1-15 hits
  • Job-Aware File-Storage Optimization for Improved Hadoop I/O Performance

    Makoto NAKAGAMI  Jose A.B. FORTES  Saneyasu YAMAGUCHI  

     
    PAPER-Software System

      Publicized: 2020/06/30
      Vol: E103-D No:10
      Page(s): 2083-2093

    Hadoop is a popular data-analytics platform based on Google's MapReduce programming model. Hard-disk drives (HDDs) are generally used in big-data analysis, and the effectiveness of the Hadoop platform can be optimized by enhancing its I/O performance. HDD performance varies depending on whether the data are stored in the inner or outer disk zones. This paper proposes a method that utilizes knowledge of job characteristics to realize efficient data storage in HDDs, which, in turn, helps improve Hadoop performance. Under the proposed method, job files that need to be accessed frequently are stored in outer disk tracks, which provide higher sequential-access speeds than inner tracks. Thus, the proposed method stores temporary and permanent files in the outer and inner zones, respectively, thereby facilitating fast access to frequently required data. Results of a performance evaluation demonstrate that the proposed method improves Hadoop performance by 15.4% compared to the normal case in which this file placement is not used. Additionally, the proposed method outperforms a previously proposed placement approach by 11.1%.
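
    A minimal, self-contained Java sketch of the placement policy described above: temporary or intermediate files go to the fast outer zone, permanent files to the inner zone. The class name ZonePlanner, the path patterns, and the classification rule are illustrative assumptions, not the authors' implementation.

```java
// Illustrative sketch only: maps Hadoop file roles to HDD zones the way the
// policy above describes (temporary/intermediate data -> fast outer zone,
// permanent data -> inner zone). Class name and path patterns are hypothetical.
import java.util.List;

public class ZonePlanner {
    enum Zone { OUTER, INNER }

    // Intermediate/temporary files are accessed frequently and sequentially,
    // so the policy sends them to the faster outer tracks.
    static Zone zoneFor(String path) {
        boolean temporary = path.contains("/intermediate/")
                || path.endsWith(".tmp")
                || path.contains("/mapred/local/");   // assumption: typical temp locations
        return temporary ? Zone.OUTER : Zone.INNER;
    }

    public static void main(String[] args) {
        List<String> files = List.of(
                "/data/mapred/local/spill0.out.tmp",
                "/user/hive/warehouse/sales/part-00000");
        for (String f : files) {
            System.out.printf("%-45s -> %s zone%n", f, zoneFor(f));
        }
    }
}
```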

  • An Efficient Block Assignment Policy in Hadoop Distributed File System for Multimedia Data Processing

    Cheolgi KIM  Daechul LEE  Jaehyun LEE  Jaehwan LEE  

     
    LETTER-Computer System

      Publicized: 2019/05/21
      Vol: E102-D No:8
      Page(s): 1569-1571

    Hadoop, a distributed processing framework for big data, is now widely used for multimedia processing. However, when processing video data stored in the Hadoop Distributed File System (HDFS), unnecessary network traffic is generated because the HDFS block-slicing policy is inefficient for the picture frames in video files. We propose a new block replication policy to solve this problem and compare the newly proposed HDFS with the original HDFS via extensive experiments. The proposed HDFS reduces network traffic and increases locality between processing cores and file locations.

  • Delay Distribution Based Remote Data Fetch Scheme for Hadoop Clusters in Public Cloud

    Ravindra Sandaruwan RANAWEERA  Eiji OKI  Nattapong KITSUWAN  

     
    PAPER-Network

      Publicized: 2019/02/04
      Vol: E102-B No:8
      Page(s): 1617-1625

    Apache Hadoop and its ecosystem have become the de facto platform for processing large-scale data, or Big Data, because they hide the complexity of distributed computing, scheduling, and communication while providing fault tolerance. Cloud-based environments are becoming a popular platform for hosting Hadoop clusters due to their low initial cost and virtually limitless capacity. However, cloud-based Hadoop clusters bring their own challenges because of contradictory design principles: Hadoop is designed on the shared-nothing principle, while the cloud is based on consolidation and resource sharing. Most of Hadoop's features are designed for on-premises data centers where the cluster topology is known. Hadoop depends on the rack assignment of servers (configured by the cluster administrator) to calculate the distance between servers, which it uses to find the best remote server from which to fetch non-local data. However, public cloud providers do not share the rack information of virtual servers with their tenants. Without rack information, Hadoop may fetch data from a remote server that is on the other side of the data center. To overcome this problem, we propose a delay-distribution-based scheme to find the closest server from which to fetch non-local data in public cloud-based Hadoop clusters. The proposed scheme bases server selection on the delay distributions between server pairs, which are calculated by periodically measuring the round-trip time between servers. Our experiments show that the proposed scheme outperforms conventional Hadoop by nearly 12% in terms of non-local data fetch time. This reduction in data fetch time will lead to a reduction in job run time, especially in real-world multi-user clusters where non-local data fetching can happen frequently.
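
    The following Java sketch illustrates the selection step under stated assumptions: round-trip times are sampled periodically per server, and the replica holder with the lowest median RTT is chosen. The median statistic and all names (DelayAwareSelector, recordRtt, pickReplica) are hypothetical; the paper's actual scheme may use a different delay-distribution statistic.

```java
// Illustrative sketch: choose the "closest" replica holder from periodic
// round-trip-time samples, in the spirit of a delay-distribution-based scheme.
// The statistic used here (median RTT) is an assumption for illustration.
import java.util.*;

public class DelayAwareSelector {
    private final Map<String, List<Double>> rttSamples = new HashMap<>();

    // Called periodically with a fresh RTT measurement (milliseconds).
    public void recordRtt(String server, double rttMs) {
        rttSamples.computeIfAbsent(server, s -> new ArrayList<>()).add(rttMs);
    }

    private static double median(List<Double> xs) {
        List<Double> s = new ArrayList<>(xs);
        Collections.sort(s);
        int n = s.size();
        return n % 2 == 1 ? s.get(n / 2) : (s.get(n / 2 - 1) + s.get(n / 2)) / 2.0;
    }

    // Pick the replica holder whose measured delays look closest.
    public String pickReplica(Collection<String> replicaHolders) {
        return replicaHolders.stream()
                .filter(rttSamples::containsKey)
                .min(Comparator.comparingDouble(s -> median(rttSamples.get(s))))
                .orElseThrow(() -> new IllegalStateException("no RTT data"));
    }

    public static void main(String[] args) {
        DelayAwareSelector sel = new DelayAwareSelector();
        sel.recordRtt("nodeA", 0.4); sel.recordRtt("nodeA", 0.5);
        sel.recordRtt("nodeB", 2.1); sel.recordRtt("nodeB", 1.9);
        System.out.println("fetch from: " + sel.pickReplica(List.of("nodeA", "nodeB")));
    }
}
```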

  • Distributed Video Decoding on Hadoop

    Illo YOON  Saehanseul YI  Chanyoung OH  Hyeonjin JUNG  Youngmin YI  

     
    PAPER-Cluster Computing

      Publicized: 2018/09/18
      Vol: E101-D No:12
      Page(s): 2933-2941

    Video analytics is usually time-consuming, as it not only requires video decoding as a first step but also typically applies complex computer vision and machine learning algorithms to the decoded frames. To achieve high efficiency in video analytics with ever-increasing frame sizes, many studies have investigated distributed video processing using Hadoop. However, most approaches focus on processing multiple video files on multiple nodes. Such approaches require many video files to achieve any speedup and can easily result in load imbalance when video files are long, since each video file is processed sequentially. In contrast, we propose a distributed video decoding method, built on an extended FFmpeg and a VideoRecordReader, by which a single large video file can be processed in parallel across multiple nodes in Hadoop. The experimental results show that case studies of face detection and a SURF-based system achieve speedups of 40.6 times and 29.1 times, respectively, on a four-node cluster with 12 mappers per node, showing good scalability.

  • Naive Bayes Classifier Based Partitioner for MapReduce

    Lei CHEN  Wei LU  Ergude BAO  Liqiang WANG  Weiwei XING  Yuanyuan CAI  

     
    PAPER-Graphs and Networks

      Vol: E101-A No:5
      Page(s): 778-786

    MapReduce is an effective framework for processing large datasets in parallel over a cluster. Data locality and data skew on the reduce side are two essential issues in MapReduce. Improving data locality can decrease network traffic by moving reduce tasks to the nodes where the reducer input data are located, while data skew leads to load imbalance among reducer nodes. Partitioning is an important feature of MapReduce because it determines the reducer nodes to which map output results are sent. Therefore, an effective partitioner can improve MapReduce performance by increasing data locality and decreasing data skew on the reduce side. Previous studies considering both issues can be divided into two categories: those that preferentially improve data locality, such as LEEN, and those that preferentially improve load balance, such as CLP. However, these studies ignore the fact that, for different types of jobs, prioritizing data locality or reduce-side data skew may affect the execution time differently. In this paper, we propose a naive Bayes classifier based partitioner, namely BAPM, which achieves better performance because it automatically chooses the proper algorithm (LEEN or CLP) by leveraging the naive Bayes classifier, with job type and bandwidth as classification attributes. Our experiments are performed on a Hadoop cluster, and the results show that BAPM boosts the computing performance of MapReduce. The selection accuracy reaches 95.15%, and, compared with other popular algorithms, the improvement BAPM achieves under specific bandwidths is up to 31.31%.
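
    A toy Java sketch of the selection idea, assuming two categorical attributes (job type and bandwidth level) and the two classes named above (LEEN and CLP): a tiny naive Bayes classifier with Laplace smoothing picks the partitioning algorithm. The training examples are invented placeholders, not data from the paper.

```java
// Illustrative sketch of the selection step: a tiny naive Bayes classifier over
// two categorical attributes (job type, bandwidth level) that chooses which
// partitioning algorithm (LEEN or CLP) to apply. Training examples are invented.
import java.util.*;

public class PartitionerSelector {
    // classCounts.get(c) = number of training examples with class c
    // attrCounts.get(c).get(i).get(v) = examples of class c whose attribute i has value v
    private final Map<String, Integer> classCounts = new HashMap<>();
    private final Map<String, List<Map<String, Integer>>> attrCounts = new HashMap<>();
    private final int numAttrs = 2;

    public void train(String[] attrs, String clazz) {
        classCounts.merge(clazz, 1, Integer::sum);
        List<Map<String, Integer>> perAttr = attrCounts.computeIfAbsent(clazz, c -> {
            List<Map<String, Integer>> l = new ArrayList<>();
            for (int i = 0; i < numAttrs; i++) l.add(new HashMap<>());
            return l;
        });
        for (int i = 0; i < numAttrs; i++) perAttr.get(i).merge(attrs[i], 1, Integer::sum);
    }

    public String classify(String[] attrs) {
        int total = classCounts.values().stream().mapToInt(Integer::intValue).sum();
        String best = null;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (String clazz : classCounts.keySet()) {
            double logP = Math.log((double) classCounts.get(clazz) / total);
            for (int i = 0; i < numAttrs; i++) {
                int count = attrCounts.get(clazz).get(i).getOrDefault(attrs[i], 0);
                // Laplace smoothing (assuming two values per attribute) avoids zero probabilities.
                logP += Math.log((count + 1.0) / (classCounts.get(clazz) + 2.0));
            }
            if (logP > bestLog) { bestLog = logP; best = clazz; }
        }
        return best;
    }

    public static void main(String[] args) {
        PartitionerSelector s = new PartitionerSelector();
        s.train(new String[]{"shuffle-heavy", "lowBW"}, "LEEN");   // fabricated toy examples
        s.train(new String[]{"shuffle-heavy", "highBW"}, "LEEN");
        s.train(new String[]{"compute-heavy", "highBW"}, "CLP");
        s.train(new String[]{"compute-heavy", "lowBW"}, "CLP");
        System.out.println(s.classify(new String[]{"shuffle-heavy", "lowBW"}));  // -> LEEN
    }
}
```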

  • Hadoop I/O Performance Improvement by File Layout Optimization

    Eita FUJISHIMA  Kenji NAKASHIMA  Saneyasu YAMAGUCHI  

     
    PAPER-Data Engineering, Web Information Systems

      Publicized: 2017/11/22
      Vol: E101-D No:2
      Page(s): 415-427

    Hadoop is a popular open-source MapReduce implementation. In jobs such as TeraSort, in which the large output files of all relevant Map tasks are transmitted to Reduce tasks, the Reduce tasks become the bottleneck and are I/O bound because they process many large output files. In most cases, including TeraSort, the intermediate data, which include the output files of the Map tasks, are large and accessed sequentially. To improve the performance of such jobs, it is important to increase sequential-access performance. In this paper, we propose methods for improving the performance of the Reduce tasks of such jobs by considering two facts: these files are accessed sequentially on an HDD, and each zone of an HDD has different sequential I/O performance. The proposed methods control where intermediate data are stored by modifying the block bitmap of the filesystem, which manages the utilization (free or used) of blocks in an HDD. In addition, we propose a striping layout for applying these methods to virtualized environments that use image files. We then present a performance evaluation of the proposed methods and demonstrate that they improve Hadoop application performance.
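
    A conceptual Java simulation (not the authors' filesystem code) of steering an allocator with a block bitmap: inner-zone blocks are temporarily marked as used so that a first-fit allocator places intermediate data in the outer zone, after which the reservation is released. Zone boundaries, sizes, and names are assumptions.

```java
// Conceptual simulation: steer a first-fit allocator toward outer-zone blocks by
// temporarily marking inner-zone blocks as "used" in a block bitmap before
// intermediate data are written, then releasing the reservation afterwards.
import java.util.BitSet;

public class BitmapSteering {
    static final int TOTAL_BLOCKS = 1000;
    static final int OUTER_ZONE_END = 400;   // assumption: blocks 0..399 map to outer tracks

    public static void main(String[] args) {
        BitSet used = new BitSet(TOTAL_BLOCKS);   // a set bit means the block is occupied

        // Reserve the inner zone so the allocator only sees outer-zone blocks.
        used.set(OUTER_ZONE_END, TOTAL_BLOCKS);

        int b1 = allocate(used);   // intermediate data land in the outer zone
        int b2 = allocate(used);
        System.out.println("intermediate blocks: " + b1 + ", " + b2);

        // Release the reservation once the intermediate files have been placed.
        used.clear(OUTER_ZONE_END, TOTAL_BLOCKS);
    }

    // First-fit allocation over the bitmap.
    static int allocate(BitSet used) {
        int block = used.nextClearBit(0);
        if (block >= TOTAL_BLOCKS) throw new IllegalStateException("disk full");
        used.set(block);
        return block;
    }
}
```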

  • HVTS: Hadoop-Based Video Transcoding System for Media Services

    Seokhyun SON  Myoungjin KIM  

     
    LETTER-Graphs and Networks

      Vol: E100-A No:5
      Page(s): 1248-1253

    In this letter, we propose a Hadoop-based Video Transcoding System (HVTS), which is designed to run on all major cloud computing services. HVTS is highly adapted to the structure and policies of Hadoop and thus has additional capabilities for transcoding, task distribution, load balancing, and content replication and distribution. To evaluate our proposed system, we carry out two performance tests on our local testbed: transcoding performance and robustness to data-node and task failures. The results confirm that our system delivers satisfactory performance in facilitating seamless streaming services in cloud computing environments.

  • A New Efficient Resource Management Framework for Iterative MapReduce Processing in Large-Scale Data Analysis

    Seungtae HONG  Kyongseok PARK  Chae-Deok LIM  Jae-Woo CHANG  

    This paper was cancelled on September 5, 2019 due to a violation of the duplicate submission policy of IEICE Transactions on Information and Systems.
     
    PAPER

      Publicized: 2017/01/17
      Vol: E100-D No:4
      Page(s): 704-717
    • Errata [uploaded on March 1, 2018]

    To analyze large-scale data efficiently, studies on Hadoop, one of the most popular MapReduce frameworks, have been actively conducted. Meanwhile, most large-scale data analysis applications, e.g., data clustering, need to execute the same map and reduce functions repeatedly. However, Hadoop cannot provide optimal performance for iterative MapReduce jobs because it derives a result from a single pass of map and reduce functions. To solve these problems, in this paper we propose a new, efficient resource management framework for iterative MapReduce processing in large-scale data analysis. For this, we first design an iterative job state machine for managing iterative MapReduce jobs. Second, we propose an invariant data caching mechanism for reducing the I/O cost of data accesses. Third, we propose an iterative resource management technique for efficiently managing the resources of a Hadoop cluster. Fourth, we devise a stop-condition check mechanism for preventing unnecessary computation. Finally, we show the performance superiority of the proposed framework by comparing it with existing frameworks.
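
    A skeleton of the control flow described above, written as plain Java with hypothetical interfaces rather than the proposed framework's API: the invariant input is passed in once (standing in for the caching mechanism), each loop iteration stands for one map/reduce round, and a stop-condition check ends the iteration early.

```java
// Skeleton of an iterative driver: cache invariant input once, iterate
// map/reduce rounds, and stop when convergence or an iteration cap is reached.
// Interfaces and the toy computation are illustrative assumptions.
import java.util.List;
import java.util.function.BiFunction;

public class IterativeDriver {
    public static <S> S run(List<String> invariantInput,          // cached once, reused each round
                            BiFunction<List<String>, S, S> mapReduceRound,
                            BiFunction<S, S, Boolean> converged,
                            S initialState, int maxIterations) {
        S previous = initialState;
        for (int i = 0; i < maxIterations; i++) {
            S next = mapReduceRound.apply(invariantInput, previous);  // one map+reduce phase
            if (converged.apply(previous, next)) return next;         // stop-condition check
            previous = next;
        }
        return previous;
    }

    public static void main(String[] args) {
        // Toy example: repeatedly move the state toward the data mean until it stops changing.
        List<String> data = List.of("2", "4", "6");
        Double result = IterativeDriver.<Double>run(data,
                (in, state) -> {
                    double mean = in.stream().mapToDouble(Double::parseDouble).average().orElse(0);
                    return (state + mean) / 2.0;
                },
                (a, b) -> Math.abs(a - b) < 1e-6,
                0.0, 100);
        System.out.println("converged to " + result);
    }
}
```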

  • System Status Aware Hadoop Scheduling Methods for Job Performance Improvement

    Masatoshi KAWARASAKI  Hyuma WATANABE  

     
    PAPER-Fundamentals of Information Systems

      Publicized: 2015/03/26
      Vol: E98-D No:7
      Page(s): 1275-1285

    MapReduce and its open-source implementation Hadoop are now widely deployed for big data analysis. As MapReduce runs over a large cluster of machines, data transfer often becomes a bottleneck in job processing. In this paper, we explore the influence of data transfer on job processing performance and analyze the mechanism of job performance deterioration caused by data-transfer-oriented congestion at disk I/O and/or network I/O. Based on this analysis, we extend Hadoop's Heartbeat messages to carry the real-time system status of each machine, such as disk I/O and link usage rates. This enhancement makes Hadoop's scheduler aware of each machine's workload and lets it make more accurate scheduling decisions. Experiments were conducted to evaluate the effectiveness of the enhanced scheduling methods, and discussions are provided to compare the several proposed scheduling policies.
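
    A minimal Java sketch of the idea, assuming a heartbeat record extended with per-node disk and link utilization and a scheduler that prefers the least congested node. The field names and the congestion score are assumptions, not Hadoop's actual Heartbeat format.

```java
// Illustrative sketch: a heartbeat record carrying per-node disk and network
// utilization, and a scheduler that assigns work to the least-loaded node.
import java.util.*;

public class StatusAwareScheduler {
    record Heartbeat(String node, double diskIoUtil, double linkUtil) {
        double load() { return Math.max(diskIoUtil, linkUtil); } // simple congestion score (assumption)
    }

    private final Map<String, Heartbeat> latest = new HashMap<>();

    public void onHeartbeat(Heartbeat hb) { latest.put(hb.node(), hb); }

    // Assign the next task to the node whose disk/network are least congested.
    public String pickNodeForTask() {
        return latest.values().stream()
                .min(Comparator.comparingDouble(Heartbeat::load))
                .map(Heartbeat::node)
                .orElseThrow(() -> new IllegalStateException("no heartbeats yet"));
    }

    public static void main(String[] args) {
        StatusAwareScheduler sched = new StatusAwareScheduler();
        sched.onHeartbeat(new Heartbeat("node1", 0.90, 0.20));  // disk-congested
        sched.onHeartbeat(new Heartbeat("node2", 0.30, 0.25));
        System.out.println("schedule on: " + sched.pickNodeForTask());  // node2
    }
}
```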

  • Long-Term Performance Evaluation of Hadoop Jobs in Public and Community Clouds

    Kento AIDA  Omar ABDUL-RAHMAN  Eisaku SAKANE  Kazutaka MOTOYAMA  

     
    PAPER-Computer System

      Publicized: 2015/02/25
      Vol: E98-D No:6
      Page(s): 1176-1184

    Cloud computing is a widely used computing platform in business and academic communities. Performance is an important issue when a user runs an application in the cloud. The user may want to estimate the application-execution time beforehand to guarantee the application performance or to choose the most suitable cloud. Moreover, the cloud system architect and the designer need to understand the application performance characteristics, such as the scalability or the utilization of cloud platforms, to improve performance. However, because application performance in clouds sometimes fluctuates, estimating the application performance is difficult. In this paper, we discuss the performance fluctuation of Hadoop jobs in both a public cloud and a community cloud over one to three months. The experimental results indicate phenomena that we cannot see without long-term experiments, as well as phenomena inherent in Hadoop. The results suggest better ways to estimate Hadoop application performance in clouds. For example, we should be aware of application characteristics (CPU intensive or communication intensive), datacenter characteristics (busy or not), and time frame (time of day and day of the week) to estimate the performance fluctuation due to workload congestion in cloud platforms. Furthermore, we should be aware of performance degradation due to task re-execution in Hadoop applications.

  • A Distributed and Cooperative NameNode Cluster for a Highly-Available Hadoop Distributed File System

    Yonghwan KIM  Tadashi ARARAGI  Junya NAKAMURA  Toshimitsu MASUZAWA  

     
    PAPER-Computer System

      Publicized: 2014/12/26
      Vol: E98-D No:4
      Page(s): 835-851

    Recently, Hadoop has attracted much attention from engineers and researchers as an emerging and effective framework for Big Data. HDFS (Hadoop Distributed File System) can manage a huge amount of data with high performance and reliability using only commodity hardware. However, HDFS requires a single master node, called a NameNode, to manage the entire namespace (or all the i-nodes) of a file system. This causes the SPOF (Single Point Of Failure) problem, because the file system becomes inaccessible when the NameNode fails. It also creates an efficiency bottleneck, since all access requests to the file system have to contact the NameNode. Hadoop 2.0 resolves the SPOF problem by introducing manual failover based on two NameNodes, Active and Standby. However, it still has the efficiency bottleneck, since all access requests have to contact the Active NameNode during ordinary execution. It may also lose the advantage of using commodity hardware, since the two NameNodes have to share highly reliable, sophisticated storage. In this paper, we propose a new HDFS architecture to resolve all the problems mentioned above.

  • Reducing I/O Cost in OLAP Query Processing with MapReduce

    Woo-Lam KANG  Hyeon-Gyu KIM  Yoon-Joon LEE  

     
    LETTER-Data Engineering, Web Information Systems

      Publicized: 2014/10/22
      Vol: E98-D No:2
      Page(s): 444-447

    This paper presents a method to reduce I/O cost in MapReduce when online analytical processing (OLAP) queries are used for data analysis. The proposed method consists of two basic ideas. First, to reduce network transmission cost, mappers are organized to receive only the data necessary to perform a map task, rather than the entire set of input data. Second, to reduce storage consumption, only record IDs are stored for checkpointing, not the raw records. Experiments conducted with the TPC-H benchmark show that the proposed method is about 40% faster than Hive, the well-known data warehouse solution for MapReduce, while reducing the size of the data stored for checkpointing to about 80%.
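
    A small Java sketch of the second idea under simplifying assumptions: the checkpoint stores only record IDs, and the raw records are re-read from the source (standing in for the input stored in HDFS) during recovery. All names are hypothetical.

```java
// Illustrative sketch: checkpoint record IDs instead of raw records, and
// rebuild the processed records from the source on recovery. The in-memory
// "source" map stands in for the original input data kept in storage.
import java.util.*;
import java.util.stream.Collectors;

public class IdCheckpoint {
    // Source records keyed by record ID (stand-in for the persisted input).
    private final Map<Long, String> source = new HashMap<>();
    private final List<Long> checkpointedIds = new ArrayList<>();

    void process(long id, String record) {
        source.put(id, record);
        checkpointedIds.add(id);          // far smaller than storing the record itself
    }

    // On failure, rebuild the processed set from IDs by re-reading the source.
    List<String> recover() {
        return checkpointedIds.stream().map(source::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        IdCheckpoint cp = new IdCheckpoint();
        cp.process(1L, "record-one");
        cp.process(2L, "record-two");
        System.out.println("recovered " + cp.recover().size() + " records from IDs only");
    }
}
```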

  • MapReduce Job Scheduling Based on Remaining Job Sizes

    Tatsuma MATSUKI  Tetsuya TAKINE  

     
    PAPER-Network System

      Vol: E98-B No:1
      Page(s): 180-189

    The MapReduce job scheduler implemented in Hadoop is the mechanism that decides which job is allowed to use idle resources in Hadoop. In terms of mean job response time, the performance of the job scheduler strongly depends on the job arrival pattern, which includes the job sizes (i.e., the amount of required resources) and their arrival order. However, because existing schedulers do not utilize information about job sizes, they suffer severe performance degradation under some arrival patterns. In this paper, we propose a scheduler that estimates and utilizes remaining job sizes in order to achieve good performance regardless of the job arrival pattern. Through simulation experiments, we confirm that, for various arrival patterns, the proposed scheduler achieves better performance than existing schedulers.
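
    A minimal Java sketch of size-aware scheduling in the spirit of the proposal: when resources become idle, they go to the job with the smallest estimated remaining size. The estimation rule used here (unfinished task count) is an assumption; the paper's estimator may differ.

```java
// Illustrative sketch: give an idle slot to the job with the smallest
// estimated remaining size. Job structure and estimator are assumptions.
import java.util.*;

public class RemainingSizeScheduler {
    static class Job {
        final String name;
        final int totalTasks;
        int finishedTasks;
        Job(String name, int totalTasks) { this.name = name; this.totalTasks = totalTasks; }
        // Remaining size estimated from tasks not yet finished.
        int estimatedRemaining() { return totalTasks - finishedTasks; }
    }

    // Choose which running job may use the idle slot.
    static Job pick(List<Job> running) {
        return running.stream()
                .min(Comparator.comparingInt(Job::estimatedRemaining))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Job big = new Job("sort-large", 5000);
        Job small = new Job("log-count", 40);
        small.finishedTasks = 10;
        System.out.println("idle slot goes to: " + pick(List.of(big, small)).name);  // log-count
    }
}
```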

  • Efficient K-Nearest Neighbor Graph Construction Using MapReduce for Large-Scale Data Sets

    Tomohiro WARASHINA  Kazuo AOYAMA  Hiroshi SAWADA  Takashi HATTORI  

     
    PAPER-Data Engineering, Web Information Systems

      Vol: E97-D No:12
      Page(s): 3142-3154

    This paper presents an efficient method using Hadoop MapReduce for constructing a K-nearest neighbor graph (K-NNG) from a large-scale data set. The K-NNG has been utilized as a data structure for data analysis techniques in various applications, and applying these techniques to a large-scale data set requires an efficient K-NNG construction method. We focus on NN-Descent, a recently proposed method that efficiently constructs an approximate K-NNG. NN-Descent is implemented on a shared-memory system with OpenMP-based parallelization, and extending it to the Hadoop MapReduce framework is desirable for data sets too large for a shared-memory system to handle. However, a straightforward extension to the Hadoop MapReduce framework is impractical because it demands extremely high system performance, owing to the high memory consumption and the low data-transmission efficiency of MapReduce jobs. The proposed method relaxes this requirement by improving the MapReduce jobs, employing an appropriate key-value pair format and an efficient sampling strategy. Experiments on large-scale data sets demonstrate that the proposed method both works efficiently and is scalable in terms of the data size, the number of machine nodes, and the graph structural parameter K.

  • An Efficiency-Aware Scheduling for Data-Intensive Computations on MapReduce Clusters

    Hui ZHAO  Shuqiang YANG  Hua FAN  Zhikun CHEN  Jinghu XU  

     
    PAPER

      Vol: E96-D No:12
      Page(s): 2654-2662

    Scheduling plays a key role in MapReduce systems. In this paper, we explore the efficiency of a MapReduce cluster running many independent and continuously arriving MapReduce jobs. Data locality and load balancing are two important factors for improving computation efficiency in MapReduce systems for data-intensive computations. Traditional cluster scheduling technologies are not well suited to the MapReduce environment; some schedulers are in use for the popular open-source Hadoop MapReduce implementation, but they cannot optimize both factors well. Our main objective is to minimize the total flowtime of all jobs; since this is a strongly NP-hard problem, we adopt effective heuristics to seek satisfactory solutions. In this paper, we formalize the scheduling problem as a job selection problem and propose a load-balance-aware job selection algorithm; at the task level, we design a strict data-locality scheduling algorithm for map tasks on map machines and a load-balance-aware scheduling algorithm for reduce tasks on reduce machines. Comprehensive experiments have been conducted to compare our scheduling strategy with well-known Hadoop scheduling strategies. The experimental results validate the efficiency of our proposed scheduling strategy.
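
    A simplified Java sketch of the strict data-locality rule for map tasks: a machine is assigned a map task only if it holds that task's input block locally; otherwise it receives no map task for now. Data structures and names are illustrative assumptions, not the paper's algorithm.

```java
// Illustrative sketch: strict data-locality assignment for map tasks.
// A machine only ever runs map tasks whose input block it stores locally.
import java.util.*;

public class LocalityStrictMapScheduler {
    // machine -> set of block IDs stored locally
    private final Map<String, Set<String>> localBlocks = new HashMap<>();
    // pending map tasks, each identified by the input block it reads
    private final Deque<String> pendingBlocks = new ArrayDeque<>();

    void addMachine(String machine, Set<String> blocks) { localBlocks.put(machine, blocks); }
    void addTask(String blockId) { pendingBlocks.add(blockId); }

    // Returns the block whose task should run on this machine, or empty if no local work exists.
    Optional<String> assign(String machine) {
        Set<String> local = localBlocks.getOrDefault(machine, Set.of());
        Iterator<String> it = pendingBlocks.iterator();
        while (it.hasNext()) {
            String block = it.next();
            if (local.contains(block)) {
                it.remove();
                return Optional.of(block);
            }
        }
        return Optional.empty();   // strict: never assign a non-local map task
    }

    public static void main(String[] args) {
        LocalityStrictMapScheduler s = new LocalityStrictMapScheduler();
        s.addMachine("m1", Set.of("blk_1", "blk_3"));
        s.addMachine("m2", Set.of("blk_2"));
        s.addTask("blk_2");
        s.addTask("blk_3");
        System.out.println("m1 gets " + s.assign("m1"));   // blk_3
        System.out.println("m2 gets " + s.assign("m2"));   // blk_2
    }
}
```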