The search functionality is under construction.

Keyword Search Result

[Keyword] high-performance computing(10hit)

1-10hit
  • Node-to-Set Disjoint Paths Problem in Cross-Cubes

    Rikuya SASAKI  Hiroyuki ICHIDA  Htoo Htoo Sandi KYAW  Keiichi KANEKO  

     
    PAPER-Fundamentals of Information Systems

      Pubricized:
    2023/10/06
      Vol:
    E107-D No:1
      Page(s):
    53-59

    The increasing demand for high-performance computing in recent years has led to active research on massively parallel systems. The interconnection network in a massively parallel system interconnects hundreds of thousands of processing elements so that they can process large tasks while communicating among others. By regarding the processing elements as nodes and the links between processing elements as edges, respectively, we can discuss various problems of interconnection networks in the framework of the graph theory. Many topologies have been proposed for interconnection networks of massively parallel systems. The hypercube is a very popular topology and it has many variants. The cross-cube is such a topology, which can be obtained by adding one extra edge to each node of the hypercube. The cross-cube reduces the diameter of the hypercube, and allows cycles of odd lengths. Therefore, we focus on the cross-cube and propose an algorithm that constructs disjoint paths from a node to a set of nodes. We give a proof of correctness of the algorithm. Also, we show that the time complexity and the maximum path length of the algorithm are O(n3 log n) and 2n - 3, respectively. Moreover, we estimate that the average execution time of the algorithm is O(n2) based on a computer experiment.

  • Application Mapping and Scheduling of Uncertain Communication Patterns onto Non-Random and Random Network Topologies

    Yao HU  Michihiro KOIBUCHI  

     
    PAPER-Computer System

      Pubricized:
    2020/07/20
      Vol:
    E103-D No:12
      Page(s):
    2480-2493

    Due to recent technology progress based on big-data processing, many applications present irregular or unpredictable communication patterns among compute nodes in high-performance computing (HPC) systems. Traditional communication infrastructures, e.g., torus or fat-tree interconnection networks, may not handle well their matchmaking problems with these newly emerging applications. There are already many communication-efficient application mapping algorithms for these typical non-random network topologies, which use nearby compute nodes to reduce the network distances. However, for the above unpredictable communication patterns, it is difficult to efficiently map their applications onto the non-random network topologies. In this context, we recommend using random network topologies as the communication infrastructures, which have drawn increasing attention for the use of HPC interconnects due to their small diameter and average shortest path length (ASPL). We make a comparative study to analyze the impact of application mapping performance on non-random and random network topologies. We propose using topology embedding metrics, i.e., diameter and ASPL, and list several diameter/ASPL-based application mapping algorithms to compare their job scheduling performances, assuming that the communication pattern of each application is unpredictable to the computing system. Evaluation with a large compound application workload shows that, when compared to non-random topologies, random topologies can reduce the average turnaround time up to 39.3% by a random connected mapping method and up to 72.1% by a diameter/ASPL-based mapping algorithm. Moreover, when compared to the baseline topology mapping method, the proposed diameter/ASPL-based topology mapping strategy can reduce up to 48.0% makespan and up to 78.1% average turnaround time, and improve up to 1.9x system utilization over random topologies.

  • Scalability Analysis of Deeply Pipelined Tsunami Simulation with Multiple FPGAs Open Access

    Antoniette MONDIGO  Tomohiro UENO  Kentaro SANO  Hiroyuki TAKIZAWA  

     
    PAPER-Applications

      Pubricized:
    2019/02/05
      Vol:
    E102-D No:5
      Page(s):
    1029-1036

    Since the hardware resource of a single FPGA is limited, one idea to scale the performance of FPGA-based HPC applications is to expand the design space with multiple FPGAs. This paper presents a scalable architecture of a deeply pipelined stream computing platform, where available parallelism and inter-FPGA link characteristics are investigated to achieve a scaled performance. For a practical exploration of this vast design space, a performance model is presented and verified with the evaluation of a tsunami simulation application implemented on Intel Arria 10 FPGAs. Finally, scalability analysis is performed, where speedup is achieved when increasing the computing pipeline over multiple FPGAs while maintaining the problem size of computation. Performance is scaled with multiple FPGAs; however, performance degradation occurs with insufficient available bandwidth and large pipeline overhead brought by inadequate data stream size. Tsunami simulation results show that the highest scaled performance for 8 cascaded Arria 10 FPGAs is achieved with a single pipeline of 5 stream processing elements (SPEs), which obtained a scaled performance of 2.5 TFlops and a parallel efficiency of 98%, indicating the strong scalability of the multi-FPGA stream computing platform.

  • NEST: Towards Extreme Scale Computing Systems

    Yunfeng LU  Huaxi GU  Xiaoshan YU  Kun WANG  

     
    LETTER-Information Network

      Pubricized:
    2018/08/20
      Vol:
    E101-D No:11
      Page(s):
    2827-2830

    High-performance computing (HPC) has penetrated into various research fields, yet the increase in computing power is limited by conventional electrical interconnections. The proposed architecture, NEST, exploits wavelength routing in arrayed waveguide grating routers (AWGRs) to achieve a scalable, low-latency, and high-throughput network. For the intra pod and inter pod communication, the symmetrical topology of NEST reduces the network diameter, which leads to an increase in latency performance. Moreover, the proposed architecture enables exponential growth of network size. Simulation results demonstrate that NEST shows 36% latency improvement and 30% throughput improvement over the dragonfly on an average.

  • Design Study of Domain Decomposition Operation in Dataflow Architecture FDTD/FIT Dedicated Computer

    Hideki KAWAGUCHI  

     
    PAPER-Electromagnetic Theory

      Vol:
    E101-C No:1
      Page(s):
    20-25

    To aim to achieve a high-performance computation for microwave simulations with low cost, small size machine and low energy consumption, a method of the FDTD dedicated computer has been investigated. It was shown by VHDL logical circuit simulations that the FDTD dedicated computer with a dataflow architecture has much higher performance than that of high-end PC and GPU. Then the remaining task of this work is large scale computations by the dedicated computer, since microwave simulations for only 18×18×Z grid space (Z is the number of girds for z direction) can be executed in a single FPGA at most. To treat much larger numerical model size for practical applications, this paper considers an implementation of a domain decomposition method operation of the FDTD dedicated computer in a single FPGA.

  • Job Mapping and Scheduling on Free-Space Optical Networks

    Yao HU  Ikki FUJIWARA  Michihiro KOIBUCHI  

     
    PAPER-Computer System

      Pubricized:
    2016/08/16
      Vol:
    E99-D No:11
      Page(s):
    2694-2704

    A number of parallel applications run on a high-performance computing (HPC) system simultaneously. Job mapping and scheduling become crucial to improve system utilization, because fragmentation prevents an incoming job from being assigned even if there are enough compute nodes unused. Wireless supercomputers and datacenters with free-space optical (FSO) terminals have been proposed to replace the conventional wired interconnection so that a diverse application workload can be better supported by changing their network topologies. In this study we firstly present an efficient job mapping by swapping the endpoints of FSO links in a wireless HPC system. Our evaluation shows that an FSO-equipped wireless HPC system can achieve shorter average queuing length and queuing time for all the dispatched user jobs. Secondly, we consider the use of a more complicated and enhanced scheduling algorithm, which can further improve the system utilization over different host networks, as well as the average response time for all the dispatched user jobs. Finally, we present the performance advantages of the proposed wireless HPC system under more practical assumptions such as different cabinet capacities and diverse subtopology packings.

  • Layout-Conscious Expandable Topology for Low-Degree Interconnection Networks

    Thao-Nguyen TRUONG  Khanh-Van NGUYEN  Ikki FUJIWARA  Michihiro KOIBUCHI  

     
    PAPER-Computer System

      Pubricized:
    2016/02/02
      Vol:
    E99-D No:5
      Page(s):
    1275-1284

    System expandability becomes a major concern for highly parallel computers and data centers, because their number of nodes gradually increases year by year. In this context we propose a low-degree topology and its floor layout in which a cabinet or node set can be newly inserted by connecting short cables to a single existing cabinet. Our graph analysis shows that the proposed topology has low diameter, low average shortest path length and short average cable length comparable to existing topologies with the same degree. When incrementally adding nodes and cabinets to the proposed topology, its diameter and average shortest path length increase modestly. Our discrete-event simulation results show that the proposed topology provides a comparable performance to 2-D Torus for some parallel applications. The network cost and power consumption of DSN-F modestly increase when compared to the counterpart non-random topologies.

  • The Case for Network Coding for Collective Communication on HPC Interconnection Networks Open Access

    Ahmed SHALABY  Ikki FUJIWARA  Michihiro KOIBUCHI  

     
    PAPER-Information Network

      Pubricized:
    2014/12/11
      Vol:
    E98-D No:3
      Page(s):
    661-670

    Recently network bandwidth becomes a performance concern particularly for collective communication since bisection bandwidths of supercomputers become far less than their full bisection bandwidths. In this context we propose the use of a network coding technique to reduce the number of unicasts and the size of data transferred in latency-sensitive collective communications in supercomputers. Our proposed network coding scheme has a hierarchical multicasting structure with intra-group and inter-group unicasts. Quantitative analysis show that the aggregate path hop counts by our hierarchical network coding decrease as much as 94% when compared to conventional unicast-based multicasts. We validate these results by cycle-accurate network simulations. In 1,024-switch networks, the network reduces the execution time of collective communications as much as 70%. We also show that our hierarchical network coding is beneficial for any packet size.

  • Proposal of a Desk-Side Supercomputer with Reconfigurable Data-Paths Using Rapid Single-Flux-Quantum Circuits

    Naofumi TAKAGI  Kazuaki MURAKAMI  Akira FUJIMAKI  Nobuyuki YOSHIKAWA  Koji INOUE  Hiroaki HONDA  

     
    INVITED PAPER

      Vol:
    E91-C No:3
      Page(s):
    350-355

    We propose a desk-side supercomputer with large-scale reconfigurable data-paths (LSRDPs) using superconducting rapid single-flux-quantum (RSFQ) circuits. It has several sets of computing unit which consists of a general-purpose microprocessor, an LSRDP and a memory. An LSRDP consists of a lot of, e.g., a few thousand, floating-point units (FPUs) and operand routing networks (ORNs) which connect the FPUs. We reconfigure the LSRDP to fit a computation, i.e., a group of floating-point operations, which appears in a 'for' loop of numerical programs by setting the route in ORNs before the execution of the loop. We propose to implement the LSRDPs by RSFQ circuits. The processors and the memories can be implemented by semiconductor technology. We expect that a 10 TFLOPS supercomputer, as well as a refrigerating engine, will be housed in a desk-side rack, using a near-future RSFQ process technology, such as 0.35 µm process.

  • A Grid Portal to Support High-Performance Scientific Computing on Distributed Resources

    Jacobo TARRIO  Juan TOURIÑO  María J. MARTIN  Patricia GONZALEZ  Ramon DOALLO  

     
    PAPER-Distributed, Grid and P2P Computing

      Vol:
    E87-D No:7
      Page(s):
    1843-1849

    Grid computing can help to promote high-performance computing at a low overall cost by encouraging research centers to share their resources. However, research staff usually finds it quite hard to use Grids effectively, due to the need of installing and managing new Grid software. Thus, Grid portals are created, making it easier to take advantage of the full capability of the Grid, favoring in this way its use. The goal of this paper is to describe the process of design and implementation of a Grid Portal with the aim of both supporting distributed high-performance resources and make its use by researchers as transparent as possible. This portal uses standard Grid and Web technologies. We have designed the portal so that it can be adapted to different existing Grid infrastructures, based on the Globus Toolkit, and new functionalities can be easily added. The first prototype of the portal has been tested on an experimental Grid platform, and we present encouraging experiences carried out there.