Author Search Result

[Author] Michihiro KOIBUCHI(25hit)

1-20hit(25hit)

  • Application Mapping and Scheduling of Uncertain Communication Patterns onto Non-Random and Random Network Topologies

    Yao HU  Michihiro KOIBUCHI  

     
    PAPER-Computer System

      Publicized:
    2020/07/20
      Vol:
    E103-D No:12
      Page(s):
    2480-2493

    Due to recent technology progress based on big-data processing, many applications present irregular or unpredictable communication patterns among compute nodes in high-performance computing (HPC) systems. Traditional communication infrastructures, e.g., torus or fat-tree interconnection networks, may not be a good match for these newly emerging applications. Many communication-efficient application mapping algorithms already exist for these typical non-random network topologies; they use nearby compute nodes to reduce network distances. However, for the above unpredictable communication patterns, it is difficult to map applications efficiently onto non-random network topologies. In this context, we recommend using random network topologies as the communication infrastructure; these have drawn increasing attention for use in HPC interconnects due to their small diameter and average shortest path length (ASPL). We make a comparative study to analyze application mapping performance on non-random and random network topologies. We propose using topology embedding metrics, i.e., diameter and ASPL, and list several diameter/ASPL-based application mapping algorithms to compare their job scheduling performance, assuming that the communication pattern of each application is unpredictable to the computing system. Evaluation with a large compound application workload shows that, when compared to non-random topologies, random topologies can reduce the average turnaround time by up to 39.3% with a random connected mapping method and by up to 72.1% with a diameter/ASPL-based mapping algorithm. Moreover, when compared to the baseline topology mapping method, the proposed diameter/ASPL-based topology mapping strategy can reduce makespan by up to 48.0% and average turnaround time by up to 78.1%, and improve system utilization by up to 1.9x over random topologies.
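
    The two metrics named in this abstract, diameter and ASPL, can be illustrated with a minimal sketch. It assumes an unweighted, undirected, connected graph stored as an adjacency dictionary. The helper names (`diameter_and_aspl`, `ring`, `ring_with_random_shortcuts`) and the ring-plus-shortcuts construction are illustrative assumptions, not the topologies or code used in the paper.

```python
from collections import deque
import random


def bfs_distances(adj, src):
    """Hop counts from src to every reachable node (unweighted BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_aspl(adj):
    """Return (diameter, ASPL) of a connected undirected graph."""
    n = len(adj)
    longest, total = 0, 0
    for src in adj:
        dist = bfs_distances(adj, src)
        if len(dist) != n:
            raise ValueError("graph is not connected")
        longest = max(longest, max(dist.values()))
        total += sum(dist.values())
    # Every ordered pair (u, v) with u != v is counted exactly once.
    return longest, total / (n * (n - 1))


def ring(n):
    """Hypothetical non-random baseline: an n-node ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def ring_with_random_shortcuts(n, shortcuts, seed=0):
    """Hypothetical random topology: a ring plus random extra links."""
    rng = random.Random(seed)
    adj = ring(n)
    added = 0
    while added < shortcuts:
        a, b = rng.sample(range(n), 2)
        if b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            added += 1
    return adj


if __name__ == "__main__":
    print(diameter_and_aspl(ring(64)))
    print(diameter_and_aspl(ring_with_random_shortcuts(64, 16)))
```

    Because adding links can never lengthen a shortest path, the second topology has a diameter and ASPL no larger than the plain ring, which is the intuition behind preferring random shortcuts when communication patterns are unknown.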

  • Optimizing Slot Utilization and Network Topology for Communication Pattern on Circuit-Switched Parallel Computing Systems

    Yao HU  Michihiro KOIBUCHI  

     
    PAPER-Fundamentals of Information Systems

      Publicized:
    2018/11/16
      Vol:
    E102-D No:2
      Page(s):
    247-260

    In parallel computing systems, the interconnection network forms the critical infrastructure that enables robust and scalable communication between hundreds of thousands of nodes. The traditional packet-switched network tends to suffer from long communication times when network congestion occurs. In this context, we explore the use of circuit switching (CS) to replace packet switches with custom hardware that supports circuit-based switching efficiently with low latency. In our target CS network, a certain amount of bandwidth is guaranteed for each communication pair so that the network latency is predictable when a limited number of node pairs exchange messages. Because the number of allocated time slots in every switch directly affects the end-to-end latency, we improve the slot utilization and develop a network topology generator that minimizes the number of time slots for target applications whose communication patterns are predictable. Through a quantitative discrete-event simulation, we illustrate that our design methodology can reduce the minimum necessary number of slots to a small number in a generated topology while keeping the network cost 50% lower than that of standard torus topologies.

  • A Fully Optical Ring Network-on-Chip with Static and Dynamic Wavelength Allocation

    Ahmadou Dit Adi CISSE  Michihiro KOIBUCHI  Masato YOSHIMI  Hidetsugu IRIE  Tsutomu YOSHINAGA  

     
    PAPER

      Vol:
    E96-D No:12
      Page(s):
    2545-2554

    Silicon photonics Network-on-Chips (NoCs) have emerged as an attractive solution to alleviate the high power consumption of traditional electronic interconnects. In this paper, we propose a fully optical ring NoC that combines static and dynamic wavelength allocation communication mechanisms. A different wavelength-channel is statically allocated to each destination node for lightweight communication. Contention among simultaneous communication requests from multiple source nodes to the same destination is resolved by token-based arbitration for the particular wavelength-channel. For heavy-load communication, a multiwavelength-channel is available; a source node requests it at execution time from a special node that manages dynamic allocation of the shared multiwavelength-channel among all nodes. We combine these static and dynamic communication mechanisms in the same network, which introduces selection techniques based on message size and congestion information. Using a photonic NoC simulator based on PhoenixSim, we evaluate our architecture under uniform random, neighbor, and hotspot traffic patterns. Simulation results show that our proposed fully optical ring NoC performs well by using the appropriate static and dynamic channels based on the selection techniques. We also show that our architecture can reduce the energy consumption necessary for arbitration by more than half compared to hybrid photonic ring and mesh NoCs. A comparison with several previous works in terms of architecture hardware cost shows that our architecture can be an attractive, cost-performance-efficient interconnection infrastructure for future SoCs and CMPs.

  • Proxy Responses by FPGA-Based Switch for MapReduce Stragglers

    Koya MITSUZUKA  Michihiro KOIBUCHI  Hideharu AMANO  Hiroki MATSUTANI  

     
    PAPER-Computer System

      Publicized:
    2018/06/15
      Vol:
    E101-D No:9
      Page(s):
    2258-2268

    In parallel processing applications, a few worker nodes called “stragglers”, which execute their tasks significantly slower than other tasks, increase the execution time of the job. In this paper, we propose a network-switch-based straggler handling system to reduce the burden on the compute nodes. We also propose how to offload the detection of stragglers and the computation of their results to the network switch with no additional communication between worker nodes. We introduce some approximation techniques for the proxy computation and response at the switch; thus our switch is called “ApproxSW.” In a simulation experiment, the proposed approximation based on task similarity achieves the best accuracy in terms of the quality of generated Map outputs. We also analyze how to suppress unnecessary proxy computation by the ApproxSW. We implement ApproxSW on a NetFPGA-SUME board that has four 10Gbit Ethernet (10GbE) interfaces and a Virtex-7 FPGA. Experimental results show that the ApproxSW functions do not degrade the original 10GbE switch performance.

  • Efficient Two-Opt Collective-Communication Operations on Low-Latency Random Network Topologies

    Ke CUI  Michihiro KOIBUCHI  

     
    PAPER-Fundamentals of Information Systems

      Publicized:
    2020/07/03
      Vol:
    E103-D No:12
      Page(s):
    2435-2443

    Random network topologies have been proposed as low-latency networks for parallel computers. Although multicast is a common collective-communication operation, multicast algorithms, each of which consists of a large number of unicasts, are not well optimized for random network topologies. In this study, we first apply a two-opt algorithm to build efficient multicasts on random network topologies. The two-opt algorithm creates an ordered list of nodes to visit so as to minimize the total path hops or the total possible contention counts of the unicasts that form the target multicast. Second, we extend the two-opt algorithm to other collective-communication operations, e.g., allreduce and allgather. SimGrid discrete-event simulation results show that the two-opt multicast outperforms that of a typical MPI implementation by up to 22% in the execution time of an MPI program that repeats the MPI_Bcast function. The two-opt allreduce and two-opt allgather operations also improve the execution time by up to 15% and 14%, respectively, when compared to those used in typical MPI implementations.
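
    As a rough illustration of the two-opt idea mentioned in this abstract, the sketch below applies the generic two-opt local search (repeatedly reversing a segment of the visiting order whenever that lowers the cost) to a chain of unicasts. The chain-style cost model, the fixed source at position 0, and the function names `path_cost` and `two_opt` are assumptions made for the example; the abstract does not specify the paper's exact formulation, and the contention-count objective is not modeled here.

```python
def path_cost(order, hops):
    """Total hop count of the unicasts between consecutive nodes in order."""
    return sum(hops[a][b] for a, b in zip(order, order[1:]))


def two_opt(order, hops):
    """Two-opt local search over a visiting order.

    The first node (assumed to be the multicast source) stays in place.
    A segment reversal is kept only if it strictly lowers the total cost,
    so the loop terminates at a local optimum.
    """
    best = list(order)
    best_cost = path_cost(best, hops)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cost = path_cost(candidate, hops)
                if cost < best_cost:
                    best, best_cost = candidate, cost
                    improved = True
    return best, best_cost


if __name__ == "__main__":
    # Hypothetical hop-count matrix: five nodes on a line, hops = |a - b|.
    nodes = range(5)
    hops = [[abs(a - b) for b in nodes] for a in nodes]
    start = [0, 3, 1, 4, 2]
    print(path_cost(start, hops), two_opt(start, hops))
```

    In practice the hop-count matrix would come from shortest-path routing on the actual topology; two-opt only finds a local optimum, so the result depends on the starting order.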

  • Traffic-Independent Multi-Path Routing for High-Throughput Data Center Networks

    Ryuta KAWANO  Ryota YASUDO  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Computer System

      Publicized:
    2020/08/06
      Vol:
    E103-D No:12
      Page(s):
    2471-2479

    Network throughput has become an important issue for big-data analysis on Warehouse-Scale Computing (WSC) systems. It has been reported that randomly-connected inter-switch networks can enlarge the network throughput. For irregular networks, a multi-path routing method called k-shortest path routing is conventionally utilized. However, it cannot efficiently exploit longer-than-shortest paths that would be detour paths to avoid bottlenecks. In this work, a novel routing method called k-optimized path routing to achieve high throughput is proposed for irregular networks. We introduce a heuristic to select detour paths that can avoid bottlenecks in the network to improve the average-case network throughput. Experimental results by network simulation show that the proposed k-optimized path routing can improve the saturation throughput by up to 18.2% compared to the conventional k-shortest path routing. Moreover, it can reduce the computation time required for optimization to 1/2760 at a minimum compared to our previously proposed method.

  • The Case for Network Coding for Collective Communication on HPC Interconnection Networks (Open Access)

    Ahmed SHALABY  Ikki FUJIWARA  Michihiro KOIBUCHI  

     
    PAPER-Information Network

      Publicized:
    2014/12/11
      Vol:
    E98-D No:3
      Page(s):
    661-670

    Recently, network bandwidth has become a performance concern, particularly for collective communication, since the bisection bandwidths of supercomputers have become far less than their full bisection bandwidths. In this context, we propose the use of a network coding technique to reduce the number of unicasts and the size of data transferred in latency-sensitive collective communications in supercomputers. Our proposed network coding scheme has a hierarchical multicasting structure with intra-group and inter-group unicasts. Quantitative analysis shows that the aggregate path hop counts with our hierarchical network coding decrease by as much as 94% when compared to conventional unicast-based multicasts. We validate these results by cycle-accurate network simulations. In 1,024-switch networks, the network reduces the execution time of collective communications by as much as 70%. We also show that our hierarchical network coding is beneficial for any packet size.

  • A Port Combination Methodology for Application-Specific Networks-on-Chip on FPGAs

    Daihan WANG  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Reconfigurable System and Applications

      Vol:
    E90-D No:12
      Page(s):
    1914-1922

    A temporal-correlation-based port combination algorithm that customizes the router design in Network-on-Chip (NoC) is proposed for reconfigurable systems in order to minimize the required amount of hardware. Given the traffic characteristics of the target application and the expected hardware reduction rate, the algorithm automatically makes the port combination plan for the networks. Since the port combination technique has the advantage of largely preserving the topology, including the two-surface layout, it does not affect the design of other layers, such as task mapping and scheduling. The algorithm shows much better efficiency than the algorithm without temporal correlation. For the multimedia stream processing application, the algorithm can save 55% of the hardware amount without performance degradation, while the algorithm without temporal correlation suffers a 30% performance loss.

  • FOREWORD (Open Access)

    Michihiro KOIBUCHI  

     
    FOREWORD

      Vol:
    E102-D No:12
      Page(s):
    2280-2280

  • Architectural Design of Next-Generation Science Information Network

    Shigeo URUSHIDANI  Shunji ABE  Kensuke FUKUDA  Jun MATSUKATA  Yusheng JI  Michihiro KOIBUCHI  Shigeki YAMADA  

     
    PAPER

      Vol:
    E90-B No:5
      Page(s):
    1061-1070

    This paper proposes an advanced hybrid network architecture and a comprehensive network design of the next-generation science information network, called SINET3. Effectively combining layer-1 switches and IP/MPLS routers, the network provides layer-1 end-to-end circuit services as well as IP and Ethernet services and enables flexible resource allocation in response to service demands. The detailed network design focuses on the tangible achievement of providing a wide range of network services, such as multiple layer services, multiple virtual private network services, advanced qualities of service, and layer-1 bandwidth on demand services. It also covers high-availability capabilities and effective resource assignment in the hybrid network. The cost reduction effect of our network architecture is also shown in this paper.

  • New Directions for a Japanese Academic Backbone Network (Open Access)

    Shigeo URUSHIDANI  Shunji ABE  Kenjiro YAMANAKA  Kento AIDA  Shigetoshi YOKOYAMA  Hiroshi YAMADA  Motonori NAKAMURA  Kensuke FUKUDA  Michihiro KOIBUCHI  Shigeki YAMADA  

     
    INVITED PAPER

      Publicized:
    2014/12/11
      Vol:
    E98-D No:3
      Page(s):
    546-556

    This paper describes an architectural design and related services of a new Japanese academic backbone network, called SINET5, which will be launched in April 2016. The network will cover all 47 prefectures with 100-Gigabit Ethernet technology and connect each pair of prefectures with a minimized latency. This will enable users to leverage evolving cloud-computing powers as well as draw on a high-performance platform for data-intensive applications. The transmission layer will form a fully meshed, SDN-friendly, and reliable network. The services will evolve to be more dynamic and cloud-oriented in response to user demands. Cyber-security measures for the backbone network and tools for performance acceleration and visualization are also discussed.

  • An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs

    Tomoya ITSUBO  Michihiro KOIBUCHI  Hideharu AMANO  Hiroki MATSUTANI  

     
    PAPER

      Publicized:
    2021/07/01
      Vol:
    E104-D No:12
      Page(s):
    2057-2067

    Since deep learning workloads perform a large number of matrix operations on training data, GPUs (Graphics Processing Units) are especially efficient for the training phase. A cluster of computers, each equipped with multiple GPUs, can significantly accelerate deep learning workloads. More specifically, a back-propagation algorithm following a gradient descent approach is used for the training. Although the gradient computation is still a major bottleneck of the training, gradient aggregation and optimization impose both communication and computation overheads, which should also be reduced to further shorten the training time. To address this issue, in this paper, multiple GPUs are interconnected with a PCI Express (PCIe) over 10Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected with network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between remote GPUs; thus, the gradient aggregation and parameter optimization are completed in the network. The proposed FPGA-based 10GbE switches with the four optimizers are implemented on a NetFPGA-SUME board. Their resource utilization is increased by the PEs for the optimizers, and they consume up to 56% of the resources. Evaluation results using four remote GPUs connected via the proposed FPGA-based switch demonstrate that these optimizers are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. Also, the gradient aggregation throughput of the FPGA-based switch achieves up to 98.3% of the 10GbE line rate.
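
    The two operations that this abstract describes as offloaded to the switch, gradient aggregation and the optimizer update, can be sketched in software as follows. This is a plain-Python illustration of the textbook SGD and AdaGrad update rules under the assumption that aggregation is an element-wise sum; the function names and default hyperparameters are hypothetical, and nothing here reflects the paper's FPGA design.

```python
import math


def aggregate(gradients):
    """Element-wise sum of the per-worker gradient vectors."""
    return [sum(parts) for parts in zip(*gradients)]


def sgd_step(params, grad, lr=0.01):
    """Plain SGD: w <- w - lr * g."""
    return [p - lr * g for p, g in zip(params, grad)]


def adagrad_step(params, grad, accum, lr=0.01, eps=1e-8):
    """AdaGrad: scale each step by the root of the accumulated squared gradient."""
    accum = [a + g * g for a, g in zip(accum, grad)]
    params = [p - lr * g / (math.sqrt(a) + eps)
              for p, g, a in zip(params, grad, accum)]
    return params, accum


if __name__ == "__main__":
    # Two hypothetical workers, each contributing a two-element gradient.
    worker_grads = [[1.0, 2.0], [3.0, 4.0]]
    total = aggregate(worker_grads)
    print(sgd_step([1.0, 1.0], total, lr=0.5))
    print(adagrad_step([0.0, 0.0], total, [0.0, 0.0], lr=1.0))
```

    Performing both steps at one point on the path between workers means only the updated parameters need to travel back, which is the motivation the abstract gives for completing them in the network.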

  • Job Mapping and Scheduling on Free-Space Optical Networks

    Yao HU  Ikki FUJIWARA  Michihiro KOIBUCHI  

     
    PAPER-Computer System

      Publicized:
    2016/08/16
      Vol:
    E99-D No:11
      Page(s):
    2694-2704

    A number of parallel applications run on a high-performance computing (HPC) system simultaneously. Job mapping and scheduling become crucial to improve system utilization, because fragmentation prevents an incoming job from being assigned even if there are enough compute nodes unused. Wireless supercomputers and datacenters with free-space optical (FSO) terminals have been proposed to replace the conventional wired interconnection so that a diverse application workload can be better supported by changing their network topologies. In this study we firstly present an efficient job mapping by swapping the endpoints of FSO links in a wireless HPC system. Our evaluation shows that an FSO-equipped wireless HPC system can achieve shorter average queuing length and queuing time for all the dispatched user jobs. Secondly, we consider the use of a more complicated and enhanced scheduling algorithm, which can further improve the system utilization over different host networks, as well as the average response time for all the dispatched user jobs. Finally, we present the performance advantages of the proposed wireless HPC system under more practical assumptions such as different cabinet capacities and diverse subtopology packings.

  • Boosting the Performance of Interconnection Networks by Selective Data Compression

    Naoya NIWA  Hideharu AMANO  Michihiro KOIBUCHI  

     
    PAPER

      Publicized:
    2022/07/12
      Vol:
    E105-D No:12
      Page(s):
    2057-2065

    This study presents a selective data-compression interconnection network to boost its performance. Data compression virtually increases the effective network bandwidth. One drawback of data compression is the long latency of performing the (de-)compression operation at a compute node. In terms of communication latency, we explore the trade-off between the compression latency overhead and the injection latency saved by shortening the packet length with compression algorithms. As a result, we propose selectively applying a compression technique to a packet. Compression is applied to long packets, and also when network congestion is detected at a source compute node. Through a cycle-accurate network simulation, the selective compression method using the above compression algorithms improves the network throughput by up to 39% with a moderate increase in the communication latency of short packets.
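
    The selection rule stated in this abstract (compress long packets, and also compress when congestion is detected at the source) and the latency trade-off behind it can be sketched as below. The threshold of 8 flits, the compression ratio of 2.0, the 10-cycle codec delay, and the one-flit-per-cycle injection model are all hypothetical values chosen for the example, not figures from the paper.

```python
import math


def should_compress(packet_flits, congested, long_packet_threshold=8):
    """Selective policy: compress long packets, or any packet under congestion."""
    return congested or packet_flits >= long_packet_threshold


def injection_latency(packet_flits, compress, ratio=2.0, codec_cycles=10):
    """Assumed model: one cycle per injected flit, plus a fixed codec delay."""
    if not compress:
        return packet_flits
    return math.ceil(packet_flits / ratio) + codec_cycles


if __name__ == "__main__":
    for flits in (4, 64):
        for congested in (False, True):
            use = should_compress(flits, congested)
            print(flits, congested, use, injection_latency(flits, use))
```

    Under these assumed numbers a 4-flit packet is slower when compressed (12 cycles against 4), while a 64-flit packet is faster (42 cycles against 64), which is why a size threshold is a natural first criterion. Under congestion, compression can still pay off for short packets because it reduces the load on the network rather than the latency of the individual packet.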

  • A Novel Channel Assignment Method to Ensure Deadlock-Freedom for Deterministic Routing

    Ryuta KAWANO  Hiroshi NAKAHARA  Seiichi TADE  Ikki FUJIWARA  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Computer System

      Publicized:
    2017/05/19
      Vol:
    E100-D No:8
      Page(s):
    1798-1806

    Inter-switch networks for HPC systems and data centers can be improved by applying random shortcut topologies with a reduced number of hops. With minimal routing in such networks, however, deadlock-freedom is not guaranteed. Multiple Virtual Channels (VCs) are an efficient way to avoid this problem. However, previous works do not provide good trade-offs between the number of required VCs and the time and memory complexities of an algorithm. In this work, a novel and fast algorithm, named ACRO, is proposed to endow arbitrary routing functions with deadlock-freedom while consuming a small number of VCs. A heuristic approach to reducing VCs is implemented with a hash table, which improves the scalability of the algorithm compared with our previous work. Moreover, experimental results show that ACRO can reduce the average number of VCs by up to 63% when compared with a conventional algorithm that has the same time complexity. Furthermore, ACRO reduces the time complexity by a factor of O(|N|⋅log|N|) when compared with another conventional algorithm that requires almost the same number of VCs.

  • A Layout-Oriented Routing Method for Low-Latency HPC Networks

    Ryuta KAWANO  Hiroshi NAKAHARA  Ikki FUJIWARA  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Interconnection networks

      Publicized:
    2017/07/14
      Vol:
    E100-D No:12
      Page(s):
    2796-2807

    End-to-end network latency has become an important issue for parallel applications on large-scale high-performance computing (HPC) systems. It has been reported that randomly-connected inter-switch networks can lower the end-to-end network latency. This latency reduction comes at the cost of a large amount of routing information. That is, minimal routing on irregular networks is achieved by using routing tables for all destinations in the networks. In this work, a novel distributed routing method called LOREN (Layout-Oriented Routing with Entries for Neighbors), which achieves low latency with a small routing table, is proposed for irregular networks whose link length is limited. The routing tables contain both physically and topologically nearby neighbor nodes to ensure livelock-freedom and a small number of hops between nodes. Experimental results show that LOREN reduces the average latencies by 5.8% and improves the network throughput by up to 62% compared with a conventional compact routing method. Moreover, the number of required routing table entries is reduced by up to 91%, which improves scalability and flexibility for implementation.

  • Enhancing Job Scheduling on Inter-Rackscale Datacenters with Free-Space Optical Links

    Yao HU  Michihiro KOIBUCHI  

     
    PAPER-Information networks

      Publicized:
    2018/09/18
      Vol:
    E101-D No:12
      Page(s):
    2922-2932

    Datacenter growth in traffic and scale is driving innovations in constructing tightly-coupled facilities with low-latency communication for different specific applications. A famous custom design is rackscale (RS) computing by gathering key server resource components into different resource pools. Such a resource-pooling implementation requires a new software stack to manage resource discovery, resource allocation and data communication. The reconfiguration of interconnection networks on their components is potentially needed to support the above demand in RS. In this context as an evolution of the original RS architecture the inter-rackscale (IRS) architecture, which disaggregates hardware components into different racks according to their own areas, has been proposed. The heart of IRS is to use a limited number of free-space optics (FSO) channels for wireless connections between different resource racks, via which selected pairs of racks can communicate directly and thus resource-pooling requirements are met without additional software management. In this study we evaluate the influences of FSO links on IRS networks. Evaluation results show that FSO links reduce average communication hop count for user jobs, which is close to the best possible value of 2 hops and thus provides comparable benchmark performance to that of the counterpart RS architecture. In addition, if four FSO terminals per rack are allowed, the CPU/SSD (GPU) interconnection latency is reduced by 25.99% over Fat-tree and by 67.14% over 2-D Torus. We also present the advantage of an FSO-equipped IRS system in average turnaround time of dispatched jobs for given sets of benchmark workloads.

  • A Compression Router for Low-Latency Network-on-Chip

    Naoya NIWA  Yoshiya SHIKAMA  Hideharu AMANO  Michihiro KOIBUCHI  

     
    PAPER-Computer System

      Publicized:
    2022/11/08
      Vol:
    E106-D No:2
      Page(s):
    170-180

    Network-on-Chips (NoCs) are important components for scalable many-core processors. Because the performance of parallel applications is usually sensitive to the latency of NoCs, reducing it is a primary requirement. In this study, a compression router that hides the (de)compression-operation delay is proposed. The compression router (de)compresses the contents of the incoming packet before the switch arbitration is completed, thus shortening the packet length without latency penalty and reducing the network injection-and-ejection latency. Evaluation results show that the compression router improves parallel application performance (conjugate gradients (CG), fast Fourier transform (FT), integer sort (IS), and traveling salesman problem (TSP)) by up to 33% and the effective network throughput by up to 63% with a compression ratio of 1.8 on the NoC. The cost is an increase in router area of 0.22 mm² and in energy consumption by a factor of 1.6 compared to the conventional virtual-channel router. Another finding is that off-loading the decompressor onto a network interface decreases the compression-router area by 57% at the expense of a moderate increase in communication latency.

  • Layout-Conscious Expandable Topology for Low-Degree Interconnection Networks

    Thao-Nguyen TRUONG  Khanh-Van NGUYEN  Ikki FUJIWARA  Michihiro KOIBUCHI  

     
    PAPER-Computer System

      Publicized:
    2016/02/02
      Vol:
    E99-D No:5
      Page(s):
    1275-1284

    System expandability becomes a major concern for highly parallel computers and data centers, because their number of nodes gradually increases year by year. In this context we propose a low-degree topology and its floor layout in which a cabinet or node set can be newly inserted by connecting short cables to a single existing cabinet. Our graph analysis shows that the proposed topology has low diameter, low average shortest path length and short average cable length comparable to existing topologies with the same degree. When incrementally adding nodes and cabinets to the proposed topology, its diameter and average shortest path length increase modestly. Our discrete-event simulation results show that the proposed topology provides a comparable performance to 2-D Torus for some parallel applications. The network cost and power consumption of DSN-F modestly increase when compared to the counterpart non-random topologies.

  • Novel Chip Stacking Methods to Extend Both Horizontally and Vertically for Many-Core Architectures with ThruChip Interface

    Hiroshi NAKAHARA  Tomoya OZAKI  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Architecture

      Publicized:
    2016/08/24
      Vol:
    E99-D No:12
      Page(s):
    2871-2880

    The recent increase in non-recurring engineering cost (design, mask and test cost) has made large System-on-Chip (SoC) designs difficult to develop, especially with advanced technology. We explore an approach for cheap and flexible chip stacking by using the inductive-coupling ThruChip Interface (TCI). In order to connect a large number of small chips to build a large-scale system, novel chip stacking methods called linear stacking and staggered stacking are proposed. They enable the system to be extended in the x and/or y dimensions, not only in the z dimension. Here, a novel chip stacking layout and its deadlock-free routing design for the cases using single-core chips and multi-core chips are shown. The network with 256 nodes formed by the proposed stacking improves latency by 13.8% and the performance of NAS Parallel Benchmarks by 5.4% on average compared to a 2D mesh.
