
Keyword Search Result

[Keyword] HDF (8 hits)

Hits 1-8
  • Delay Distribution Based Remote Data Fetch Scheme for Hadoop Clusters in Public Cloud

    Ravindra Sandaruwan RANAWEERA  Eiji OKI  Nattapong KITSUWAN  

     
    PAPER-Network

    Publicized: 2019/02/04 | Vol: E102-B No:8 | Page(s): 1617-1625

    Apache Hadoop and its ecosystem have become the de facto platform for processing large-scale data, or Big Data, because it hides the complexity of distributed computing, scheduling, and communication while providing fault tolerance. Cloud-based environments are becoming a popular platform for hosting Hadoop clusters due to their low initial cost and virtually limitless capacity. However, cloud-based Hadoop clusters bring their own challenges because of contradictory design principles: Hadoop is designed on the shared-nothing principle, while the cloud is based on consolidation and resource sharing. Most of Hadoop's features are designed for on-premises data centers where the cluster topology is known. Hadoop depends on the rack assignment of servers (configured by the cluster administrator) to calculate the distance between servers, and it uses this distance to choose the best remote server from which to fetch non-local data. However, public cloud providers do not share the rack information of virtual servers with their tenants. Without rack information, Hadoop may fetch data from a remote server on the other side of the data center. To overcome this problem, we propose a delay distribution based scheme for public cloud-based Hadoop clusters that finds the closest server from which to fetch non-local data. The proposed scheme bases server selection on the delay distributions between server pairs, calculated by periodically measuring the round-trip time between servers. Our experiments show that the proposed scheme outperforms conventional Hadoop by nearly 12% in terms of non-local data fetch time. This reduction in data fetch time leads to a reduction in job run time, especially in real-world multi-user clusters where non-local data fetching happens frequently.
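
    The abstract does not give the exact statistic used to rank servers, so the following is only a minimal sketch of the idea, assuming a rolling window of round-trip-time samples per remote server and selection by the median of each measured delay distribution; all names and parameters here are illustrative, not from the paper.

        import random
        from collections import defaultdict, deque

        WINDOW = 100  # number of recent RTT samples kept per server (assumed)

        class DelayAwareSelector:
            """Rank candidate replica holders by their measured delay distribution."""

            def __init__(self):
                self.rtt_samples = defaultdict(lambda: deque(maxlen=WINDOW))

            def record_rtt(self, server: str, rtt_ms: float) -> None:
                # Fed by a background prober that pings peers periodically.
                self.rtt_samples[server].append(rtt_ms)

            def pick_replica_source(self, candidates: list[str]) -> str:
                # Prefer the candidate whose delay distribution has the lowest
                # median; fall back to random choice for unmeasured servers.
                def median_delay(server: str) -> float:
                    samples = sorted(self.rtt_samples[server])
                    if not samples:
                        return float("inf")
                    return samples[len(samples) // 2]

                best = min(candidates, key=median_delay)
                return best if self.rtt_samples[best] else random.choice(candidates)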

  • Avoiding Performance Impacts by Re-Replication Workload Shifting in HDFS Based Cloud Storage

    Thanda SHWE  Masayoshi ARITSUGI  

     
    PAPER-Cloud Computing

    Publicized: 2018/09/18 | Vol: E101-D No:12 | Page(s): 2958-2967

    Data replication in cloud storage systems brings many benefits from both reliability and performance perspectives, such as fault tolerance, data availability, data locality, and load balancing. However, each time a datanode fails, the data blocks stored on the failed datanode must be restored to maintain the replication level. This can place a large burden on a system whose resources are already highly utilized by users' application workloads. Although there have been many proposals for replication, re-replication has not yet been properly addressed. In this paper, we present a deferred re-replication algorithm that dynamically shifts the re-replication workload based on the current resource utilization of the system. As workload patterns vary with the time of day, simulation results on a synthetic workload demonstrate a large opportunity to minimize the impact on users' application workloads with a simple algorithm that adjusts re-replication according to current resource utilization. Our approach reduces the performance impact on users' application workloads while ensuring the same reliability level as default HDFS.
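
    As a rough illustration of deferring re-replication under load (the paper's actual policy and thresholds are not given in the abstract), one might queue under-replicated blocks and drain the queue only when utilization is low; busy_threshold and batch below are assumed knobs.

        from collections import deque

        class DeferredReReplicator:
            """Shift re-replication work away from busy periods (sketch)."""

            def __init__(self, busy_threshold: float = 0.7):
                self.busy_threshold = busy_threshold  # assumed utilization cutoff
                self.pending: deque = deque()

            def on_datanode_failure(self, lost_blocks) -> None:
                # Queue under-replicated blocks instead of replicating eagerly.
                self.pending.extend(lost_blocks)

            def tick(self, cluster_utilization: float, replicate, batch: int = 10) -> None:
                # Called periodically; re-replicate only when resources are idle,
                # so user workloads keep priority while the cluster is busy.
                if cluster_utilization >= self.busy_threshold:
                    return
                for _ in range(min(batch, len(self.pending))):
                    replicate(self.pending.popleft())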

  • Accordion: An Efficient Gear-Shifting for a Power-Proportional Distributed Data-Placement Method

    Hieu Hanh LE  Satoshi HIKIDA  Haruo YOKOTA  

     
    PAPER

    Publicized: 2015/01/21 | Vol: E98-D No:5 | Page(s): 1013-1026

    Power-aware distributed file systems for efficient Big Data processing are increasingly moving towards power-proportional designs. However, current data placement methods for such systems have not given careful consideration to the effect of gear-shifting during operation. To shift to a higher gear, the system must reallocate the datasets that were updated in a lower gear, while a subset of the nodes was inactive, without disrupting the servicing of client requests. Inefficient gear-shifting that requires a large amount of data reallocation greatly degrades system performance. To address this challenge, this paper proposes a data placement method known as Accordion, which uses data replication to arrange the data layout comprehensively and provide efficient gear-shifting. Compared with current methods, Accordion reduces the amount of data transferred, which significantly shortens the period required to reallocate updated data during gear-shifting and thereby improves system performance. The effect of this reduction is larger in higher gears, so Accordion is suitable for smooth gear-shifting in multigear systems. Moreover, the times at which the active nodes serve requests are well distributed, so Accordion achieves higher scalability than existing methods in terms of I/O throughput. Accordion does not impose any strict constraint on the number of nodes in the system; therefore, our proposed method is expected to work well in practical environments. Extensive empirical experiments using actual machines with an Accordion prototype based on the Hadoop Distributed File System demonstrated that our proposed method significantly reduced the period required to transfer updated data, i.e., by 66% compared with an existing method.
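
    Accordion's layout itself is not described in enough detail in the abstract to reproduce, but the bookkeeping problem it targets can be sketched: blocks written while some nodes are powered off must be propagated to those nodes at gear-up, and the fewer blocks owed, the shorter the reallocation period. Everything below is an assumed illustration, not Accordion's algorithm.

        class GearShiftTracker:
            """Track blocks owed to powered-off nodes in a gear-shifting system."""

            def __init__(self, nodes: list[str]):
                self.nodes = nodes
                self.active = set(nodes)               # nodes in the current gear
                self.owed = {n: set() for n in nodes}  # blocks each node missed

            def write_block(self, block_id: str, replica_nodes: list[str]) -> None:
                # A write in a low gear is recorded for every inactive replica holder.
                for node in replica_nodes:
                    if node not in self.active:
                        self.owed[node].add(block_id)

            def shift_up(self, reactivated: list[str], transfer) -> int:
                # On gear-up, push only the blocks each node missed.
                moved = 0
                for node in reactivated:
                    self.active.add(node)
                    for block_id in self.owed[node]:
                        transfer(node, block_id)
                        moved += 1
                    self.owed[node].clear()
                return moved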

  • A Study of Effective Replica Reconstruction Schemes for the Hadoop Distributed File System

    Asami HIGAI  Atsuko TAKEFUSA  Hidemoto NAKADA  Masato OGUCHI  

     
    PAPER-Data Engineering, Web Information Systems

    Publicized: 2015/01/13 | Vol: E98-D No:4 | Page(s): 872-882

    Distributed file systems, which manage large amounts of data over multiple commodity machines, have attracted attention as management and processing systems for Big Data applications. A distributed file system consists of multiple data nodes and provides reliability and availability by holding multiple replicas of data. Due to system failure or maintenance, a data node may be removed from the system, and the data blocks held by the removed data node are lost. If data blocks are missing, the access load on the other data nodes that hold the lost data blocks increases, and as a result, the performance of data processing over the distributed file system decreases. Therefore, replica reconstruction, which reallocates the missing data blocks, is important for preventing such performance degradation. The Hadoop Distributed File System (HDFS) is a widely used distributed file system. In the HDFS replica reconstruction process, the source and destination data nodes for replication are selected randomly. We find that this scheme is inefficient because the data transfer is biased. Therefore, we propose two more effective replica reconstruction schemes that aim to balance the workloads of the replication processes. Our proposed replication scheduling strategy assumes that the nodes are arranged in a ring, and data blocks are transferred along this one-directional ring structure to minimize the difference in the amount of data transferred by each node. Based on this strategy, we propose two replica reconstruction schemes: an optimization scheme and a heuristic scheme. We have implemented the proposed schemes in HDFS and evaluated them on an actual HDFS cluster; we also conducted simulation experiments in a large-scale environment. The experiments in the actual environment confirm that the replica reconstruction throughputs of the proposed schemes show a 45% improvement over the HDFS default scheme. We also verify that the heuristic scheme is effective, showing performance comparable to the optimization scheme. Furthermore, the results in the large-scale simulation environment show that while the optimization scheme is impractical because finding the optimal solution takes a long time, the heuristic scheme is efficient and scalable, improving replica reconstruction throughput by up to 25% over the default scheme.
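
    The exact optimization and heuristic formulations are not in the abstract, but the one-directional ring idea can be sketched: pick the least-loaded surviving holder of each lost block as the source and send toward its ring successor, which keeps per-node transfer amounts close together. The scheduling below is an assumed simplification, not the paper's scheme.

        class RingReconstructor:
            """Schedule re-replication along a one-directional ring (sketch).

            Assumes each lost block has fewer surviving holders than there
            are nodes in the ring, so a non-holding successor always exists.
            """

            def __init__(self, nodes: list[str]):
                self.ring = nodes                  # surviving nodes in ring order
                self.sent = {n: 0 for n in nodes}  # blocks sent by each node

            def _successor(self, node: str) -> str:
                i = self.ring.index(node)
                return self.ring[(i + 1) % len(self.ring)]

            def schedule(self, lost_blocks: dict[str, list[str]]):
                """lost_blocks maps block_id -> surviving nodes holding a replica."""
                plan = []
                for block_id, holders in lost_blocks.items():
                    # Least-loaded holder becomes the source; the destination is
                    # its next ring successor that lacks the block, so every
                    # transfer flows in the same direction around the ring.
                    src = min(holders, key=lambda n: self.sent[n])
                    dst = self._successor(src)
                    while dst in holders:
                        dst = self._successor(dst)
                    self.sent[src] += 1
                    plan.append((block_id, src, dst))
                return plan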

  • A Distributed and Cooperative NameNode Cluster for a Highly-Available Hadoop Distributed File System

    Yonghwan KIM  Tadashi ARARAGI  Junya NAKAMURA  Toshimitsu MASUZAWA  

     
    PAPER-Computer System

    Publicized: 2014/12/26 | Vol: E98-D No:4 | Page(s): 835-851

    Recently, Hadoop has attracted much attention from engineers and researchers as an emerging and effective framework for Big Data. HDFS (Hadoop Distributed File System) can manage a huge amount of data with high performance and reliability using only commodity hardware. However, HDFS requires a single master node, called a NameNode, to manage the entire namespace (all the i-nodes) of the file system. This creates a SPOF (Single Point Of Failure) problem, because the file system becomes inaccessible when the NameNode fails. It also creates an efficiency bottleneck, since every access request to the file system has to contact the NameNode. Hadoop 2.0 resolves the SPOF problem by introducing manual failover between two NameNodes, Active and Standby. However, the efficiency bottleneck remains, since all access requests must contact the Active NameNode during normal operation. This design may also forfeit the advantage of using commodity hardware, since the two NameNodes have to share highly reliable, sophisticated storage. In this paper, we propose a new HDFS architecture that resolves all the problems mentioned above.

  • NDCouplingHDFS: A Coupling Architecture for a Power-Proportional Hadoop Distributed File System

    Hieu Hanh LE  Satoshi HIKIDA  Haruo YOKOTA  

     
    PAPER-Data Engineering, Web Information Systems

    Vol: E97-D No:2 | Page(s): 213-222

    Energy-aware distributed file systems are increasingly moving toward power-proportional designs. However, prior work has not considered the cost of updating datasets that were modified in a low-power mode, in which a subset of the nodes is powered off. Specifically, when the system moves to a high-power mode, it must internally replicate the updated data to the reactivated nodes. Effectively reflecting the updated data is vital to making a distributed file system, such as the Hadoop Distributed File System (HDFS), power proportional. In the current HDFS design, when the system changes power mode, the block replication process is throttled by the single NameNode because of congested access to block metadata. This paper presents a novel architecture, a NameNode and DataNode Coupling Hadoop Distributed File System (NDCouplingHDFS), which effectively reflects the updated blocks when the system moves into high-power mode. This is achieved by coupling metadata management and data management at each node, so that the range of blocks maintained by each metadata manager is efficiently localized. Experiments using actual machines show that NDCouplingHDFS significantly reduces the execution time required to move updated blocks, by 46% relative to standard HDFS. Moreover, NDCouplingHDFS can increase the throughput of the system supporting MapReduce by applying an index in metadata management.
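
    A minimal sketch of the coupling idea, under the assumption that each node owns the metadata for a contiguous range of block IDs, so replication triggered at gear-up consults local metadata rather than a central NameNode; all structures below are illustrative, not NDCouplingHDFS internals.

        class CoupledNode:
            """One node that manages both data and the metadata for its block range."""

            def __init__(self, node_id: int, block_range: range):
                self.node_id = node_id
                self.block_range = block_range  # block IDs whose metadata lives here
                self.metadata: dict[int, dict] = {}
                self.blocks: dict[int, bytes] = {}

            def owns(self, block_id: int) -> bool:
                return block_id in self.block_range

            def put_block(self, block_id: int, data: bytes) -> None:
                if not self.owns(block_id):
                    raise ValueError("route the request to the owning node")
                self.blocks[block_id] = data
                # The metadata update stays local: no round-trip to a single
                # NameNode, so gear-up replication avoids metadata congestion.
                self.metadata[block_id] = {"size": len(data), "replicas": [self.node_id]}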

  • Design and Analysis on Macro Diversity Scheme for Broadcast Services in Mobile Cellular Systems

    Yang LIU  Hui ZHAO  Yunchuan YANG  Wenbo WANG  Kan ZHENG  

     
    PAPER-Wireless Communication Technologies

    Vol: E93-B No:11 | Page(s): 3113-3120

    Broadcast services have recently been introduced in cellular networks, and macro diversity is an effective way to combat fading. In this paper, we propose a class of distributed space-time block codes (STBCs) for macro diversity that is constructed over the combined antennas of multiple cooperating base stations, all of which form an equivalent multiple-input multiple-output (MIMO) system. The code, termed High-Dimension Full-Rate Quasi-Orthogonal STBC (HDFR-QOSTBC), has the following properties: (1) it can be applied with any number of transmit antennas, and is especially useful when the number of transmit antennas is large; (2) it has a full transmit rate of one; and (3) its maximum-likelihood (ML) decoding complexity is controllable and limited to Nt/2-symbol decoding for Nt transmit antennas in total. We then analyze the structure of the equivalent channel for this class of codes and show that the eigenvectors of the equivalent channel are constant and independent of the channel realization; this property can be exploited in a new transmission structure with a single-symbol linear decoder. Furthermore, we analyze different macro diversity schemes and compare their performance. Simulation results show that the proposed scheme is practical for broadcast systems, with significant performance improvement compared with soft-combining and cyclic delay diversity (CDD) methods.
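
    The abstract does not give the HDFR-QOSTBC construction itself; for reference, the classic rate-one quasi-orthogonal STBC for four transmit antennas (built from two Alamouti blocks) illustrates the code family and its pairwise ML decoding, consistent with the Nt/2-symbol-decodable claim for Nt = 4:

        % Classic rate-one quasi-orthogonal STBC for N_t = 4, shown only to
        % illustrate the code family; not the paper's HDFR-QOSTBC itself.
        \[
        \mathbf{C}(s_1, s_2, s_3, s_4) =
        \begin{pmatrix}
          s_1   &  s_2   &  s_3   &  s_4   \\
         -s_2^* &  s_1^* & -s_4^* &  s_3^* \\
         -s_3^* & -s_4^* &  s_1^* &  s_2^* \\
          s_4^* & -s_3^* & -s_2   &  s_1
        \end{pmatrix}
        \]

    Rows are time slots and columns are transmit antennas; ML detection decouples into the symbol pairs (s1, s4) and (s2, s3), i.e., two-symbol decoding.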

  • Turbo Equalized Double Window Cancellation and Combining Robust to Large Delay Spread Channel

    JunHwan LEE  Tomoaki OHTSUKI  Masao NAKAGAWA  

     
    PAPER-Wireless Communication Technologies

    Vol: E92-B No:2 | Page(s): 517-526

    In orthogonal frequency division multiplexing (OFDM), multipath components that exceed the guard interval (GI) cause inter-symbol interference (ISI) and inter-carrier interference (ICI), making high-data-rate transmission difficult. In this paper, the double window cancellation and combining (DWCC) scheme, introduced in [14], is analyzed by investigating the SINR distribution under channels with different delay spreads. The analysis indicates that extending the processing window in iterative cancellation can adversely affect performance when interference levels are small. In addition, the optimal combination of DWCC and turbo equalization (TE), named TE-DWCC, is investigated by varying the iterative cancellation procedure between the DWCC and the channel decoder, as well as the decision feedback type: hard decision feedback (HDF) or soft decision feedback (SDF). Finally, the performance of TE-DWCC is compared with a conventional canceller that adopts turbo equalization, in an exponentially distributed slow-fading channel, while varying the interference level, code rate, and decision feedback type.
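
    The HDF/SDF distinction is standard in turbo equalization: hard feedback slices the decoder output to the nearest symbol, while soft feedback uses the expected symbol value given the decoder's LLRs. A minimal BPSK sketch (illustrative only, not the TE-DWCC canceller itself):

        import numpy as np

        def feedback_symbols(llrs: np.ndarray, mode: str = "soft") -> np.ndarray:
            """Regenerate BPSK symbols from decoder LLRs for interference cancellation.

            HDF commits to +/-1 regardless of reliability; SDF returns the
            expected symbol E[x] = tanh(LLR/2), shrinking unreliable symbols
            toward zero so their estimated interference contribution is
            cancelled less aggressively.
            """
            if mode == "hard":
                return np.sign(llrs)    # hard decision feedback (HDF)
            return np.tanh(llrs / 2.0)  # soft decision feedback (SDF)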