Author Search Result

[Author] Kanemitsu OOTSU (10 hits)

  • Enhancing Cup-Stacking Method for Collective Communication

    Takashi YOKOTA  Kanemitsu OOTSU  Shun KOJIMA  

     
    PAPER-Computer System

    Publicized: 2023/08/22
    Vol: E106-D No:11
    Page(s): 1808-1821

    An interconnection network is an indispensable component of parallel computers. It connects computation nodes so that they can communicate with each other. Because parallel computation inherently requires inter-node communication as dictated by the parallel algorithm, the interconnection network plays an important role in communication performance. This paper focuses on collective communication, which is frequently performed in parallel computation, and addresses the Cup-Stacking method proposed in our preceding work. The key ideas of the method are splitting a large packet into slices, re-shaping the slices, and stacking the slices, using a genetic algorithm (GA). This paper extends the Cup-Stacking method by introducing additional items (genes) and proposes the extended Cup-Stacking method. Furthermore, this paper comprehensively discusses the drawbacks and further optimization of the method. Evaluation results reveal the effectiveness of the extended method, which achieves up to a seven percent improvement in duration time over the former Cup-Stacking method.

  • Fast Computation with Efficient Object Data Distribution for Large-Scale Hologram Generation on a Multi-GPU Cluster (Open Access)

    Takanobu BABA  Shinpei WATANABE  Boaz JESSIE JACKIN  Kanemitsu OOTSU  Takeshi OHKAWA  Takashi YOKOTA  Yoshio HAYASAKI  Toyohiko YATAGAI  

     
    PAPER-Human-computer Interaction

    Publicized: 2019/03/29
    Vol: E102-D No:7
    Page(s): 1310-1320

    The 3D holographic display has long been anticipated as a future human interface because it does not require users to wear special devices. However, its heavy computational requirements have prevented the realization of such displays. A recent study indicates that objects and holograms of several giga-pixels must be processed in real time to achieve high resolution and a wide viewing angle. To address this problem, we first adapted a conventional FFT algorithm to a GPU cluster environment in order to avoid heavy inter-node communication. We then applied several single-node and multi-node optimization and parallelization techniques. The single-node optimizations include a change in the object decomposition method, reduction of data transfer between the CPU and GPU, kernel integration, stream processing, and utilization of multiple GPUs within a node. The multi-node optimizations include methods for distributing object data from the host node to the other nodes. Experimental results show that the intra-node optimizations attain an 11.52 times speed-up over the original single-node code. Further, the multi-node optimizations using 8 nodes with 2 GPUs per node attain an execution time of 4.28 sec for generating a 1.6 giga-pixel hologram from a 3.2 giga-pixel object. This corresponds to a 237.92 times speed-up over sequential processing on a CPU and a 41.78 times speed-up over multi-threaded execution on a multicore CPU, using a conventional FFT-based algorithm.

  • On a Cup-Stacking Concept in Repetitive Collective Communication

    Takashi YOKOTA  Kanemitsu OOTSU  Shun KOJIMA  

     
    LETTER-Computer System

    Publicized: 2022/04/15
    Vol: E105-D No:7
    Page(s): 1325-1329

    Parallel computing essentially consists of computation and communication, and in many cases communication performance is vital. Many parallel applications use collective communications, which often dominate the performance of parallel execution. This paper focuses on collective communication performance in order to speed up parallel execution. It first presents our experimental result that splitting a session of collective communication into small portions (slices) can enable efficient communication. Then, based on this result, it proposes a new concept, cup-stacking, together with a genetic-algorithm-based methodology. Preliminary evaluation results reveal the effectiveness of the proposed method.
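
The slicing step mentioned in this abstract can be illustrated with a minimal sketch. It covers only the division of one message into near-equal slices; how the slices are re-shaped, scheduled, or "stacked" is not described in this listing and is not modeled here. The function name `split_into_slices` and its parameters are hypothetical, chosen for illustration only.

```python
def split_into_slices(message_size: int, num_slices: int) -> list[int]:
    """Split a message of `message_size` units into `num_slices` slice sizes.

    Illustrative assumption: slices are made as equal as possible, and any
    remainder is spread one unit at a time over the first slices.
    """
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    if message_size < 0:
        raise ValueError("message_size must be non-negative")
    base, extra = divmod(message_size, num_slices)
    # The first `extra` slices carry one additional unit each.
    return [base + 1 if i < extra else base for i in range(num_slices)]


if __name__ == "__main__":
    # Example: a 10-unit message divided into 3 slices.
    print(split_into_slices(10, 3))
```

The slice sizes always sum to the original message size, so the sketch changes only the granularity of a transfer, not its volume.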

  • Accelerating Large-Scale Interconnection Network Simulation by Cellular Automata Concept

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Computer System

    Publicized: 2018/10/05
    Vol: E102-D No:1
    Page(s): 52-74

    State-of-the-art parallel systems employ a huge number of computing nodes connected by an interconnection network. The interconnection network (ICN) plays an important role in a parallel system, since it is responsible for communication capability. In general, an ICN exhibits non-linear phenomena in its communication performance, most of which are caused by congestion. Thus, designing a large-scale parallel system requires thorough study through repeated simulation runs. This raises another problem: simulating large-scale systems at a reasonable cost. This paper presents a promising solution based on the cellular automata concept, which originated in our prior work. Assuming 2D-torus topologies to simplify the discussion, this paper discusses the fundamental design of router functions in terms of cellular automata, the data structure of packets, alternative modeling of a router function, and miscellaneous optimizations. The proposed models have a good affinity with GPGPU technology and, as representative speed-up results, the GPU-based simulator accelerates simulation by up to about 1264 times over sequential execution on a single CPU. Furthermore, since the proposed models are applicable to the shared-memory model, a multithreaded implementation of the proposed methods achieves a speed-up of up to about 162 times.

  • FPGA Components for Integrating FPGAs into Robot Systems

    Takeshi OHKAWA  Kazushi YAMASHINA  Hitomi KIMURA  Kanemitsu OOTSU  Takashi YOKOTA  

     
    PAPER-Emerging Applications

    Publicized: 2017/11/17
    Vol: E101-D No:2
    Page(s): 363-375

    A component-oriented FPGA design platform is proposed for robot system integration. FPGAs are known to be a power-efficient hardware platform, but the development cost of FPGA-based systems is currently too high to integrate them into robot systems. To solve this problem, we propose an FPGA component that allows FPGA devices to be easily integrated into robot systems based on the Robot Operating System (ROS). ROS-compliant FPGA components offer a seamless interface between the FPGA hardware and software running on the CPU. Two experiments were conducted using the proposed components. For the first experiment, the results show that the execution time of an FPGA component for image processing was 1.7 times faster than that of the original software-based component and was 2.51 times more power efficient than an ordinary PC processor, despite substantial communication overhead. The second experiment showed that an FPGA component for sensor fusion was able to process multiple sensor inputs efficiently and with very low latency via parallel processing.

  • A Static Packet Scheduling Approach for Fast Collective Communication by Using PSO

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Interconnection networks

    Publicized: 2017/07/14
    Vol: E100-D No:12
    Page(s): 2781-2795

    The interconnection network is one of the indispensable components of parallel computers, since it is responsible for the communication capabilities of the systems. It affects system-level performance as well as the physical and logical structure of the systems. Although many studies have been reported that enhance interconnection network technology, many issues remain to be discussed. One of the most important is congestion management. In an interconnection network, many packets are transferred simultaneously and interfere with each other in the network. Congestion arises as a result of this interference. It spreads quickly, seriously degrades communication performance, and persists for a long time. Thus, we should control the network appropriately to suppress congestion and maintain maximum performance. Many studies address the problem and present effective methods; however, the maximal performance in an ideal situation has not been sufficiently clarified. Determining the ideal performance is, in general, an NP-hard problem. This paper introduces the particle swarm optimization (PSO) methodology to overcome the problem. In this paper, we first formalize the optimization problem in a form suitable for the PSO method and present a simple PSO application as naive models. Then, we discuss reducing the size of the search space and introduce three practical variations of the PSO computation model: the repetitive model, the expansion model, and the coding model. We furthermore introduce some non-PSO methods for comparison. Our evaluation results reveal the high potential of the PSO method. The repetitive and expansion models significantly accelerate collective communication, performing at most 1.72 times faster than in the bursty communication condition.

  • Genetic Node-Mapping Methods for Rapid Collective Communications

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Computer System

    Publicized: 2019/10/10
    Vol: E103-D No:1
    Page(s): 111-129

    Inter-node communication is essential in parallel computation. The performance of parallel processing depends on the efficiency of both computation and communication; thus, the communication cost is not negligible. A parallel application program involves a logical communication structure that is determined by the interchange of data between computation nodes. Sometimes the logical communication structure does not match that of the real parallel machine. This mismatch results in large communication costs. This paper addresses the node-mapping problem, which rearranges the logical positions of nodes so that the degree of mismatch is decreased. This paper assumes that parallel programs execute one or more collective communications that follow specific traffic patterns. An appropriate node mapping achieves high communication performance. This paper proposes a strong heuristic method for solving the node-mapping problem and adapts the method to a genetic algorithm. Evaluation results reveal that the proposed method achieves considerably high performance; it achieves 8.9 (4.9) times speed-up on average in single-(two-)traffic-pattern cases in 32×32 torus networks. Specifically, for some traffic patterns in small-scale networks, the proposed method finds theoretically optimal solutions. Furthermore, this paper discusses in depth various issues in the proposed GA-based method, such as the population of genes, the number of generations, and traffic patterns. This paper also discusses applicability to large-scale systems for future practical use.

  • Automatic Generation Tool of FPGA Components for Robots (Open Access)

    Takeshi OHKAWA  Kazushi YAMASHINA  Takuya MATSUMOTO  Kanemitsu OOTSU  Takashi YOKOTA  

     
    PAPER-Design Tools

    Publicized: 2019/03/01
    Vol: E102-D No:5
    Page(s): 1012-1019

    To realize an intelligent robot system, a large amount of data input from complex and diverse sensors must be processed in a short time. FPGAs are expected to improve the processing performance of robots because they offer better performance per unit of power consumption than high-performance CPUs, but they have lower development productivity than software. In this paper, we discuss automatic generation of FPGA components for robots. We propose a design tool developed for easy integration of FPGAs into robots. The tool, named cReComp, can automatically convert a circuit written in Verilog HDL into a software component compliant with the robot software framework ROS (Robot Operating System), which is the standard in robot development. To evaluate its productivity, we conducted a subject experiment. As a result, we confirmed that the automatic generation is effective in easing the development of FPGA components for robots.

  • Enhancing Entropy Throttling: New Classes of Injection Control in Interconnection Networks

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Interconnection network

    Publicized: 2016/08/25
    Vol: E99-D No:12
    Page(s): 2911-2922

    State-of-the-art parallel computers, which are growing in parallelism, place many demands on their interconnection networks. Although a wide spectrum of research and development efforts toward effective and practical interconnection networks has been reported, the problem is still open. One of the largest issues is congestion control, which aims to maximize network performance in terms of throughput and latency. Throttling, or injection limitation, is one of the central ideas of congestion control. We have proposed a new class of throttling method, Entropy Throttling, which is founded on the entropy concept of packets. The throttling method is successful in part; however, its potential has not been sufficiently discussed. This paper aims to exploit the capabilities of the Entropy Throttling method via comprehensive evaluation. The major contributions of this paper are to introduce two ideas, a hysteresis function and a guard time, and to clarify a wide range of performance characteristics in steady and unsteady communication situations. By introducing the new ideas, we extend the Entropy Throttling method. The extended methods improve communication performance by at most 3.17 times in the best case and 1.47 times on average compared with non-throttling cases in collective communication, while sustaining steady communication performance.
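
The two ideas named in this abstract, a hysteresis function and a guard time, can be sketched generically as an injection gate. This is not the authors' Entropy Throttling: the entropy measure itself is not defined in this listing, so the sketch takes an abstract congestion metric (assumed here to be higher when the network is more congested). The class name `HysteresisThrottle`, the thresholds, and the cycle-based guard time are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HysteresisThrottle:
    """Injection gate with two thresholds (hysteresis) and a guard time.

    Assumptions (not from the source): `metric` grows with congestion, and
    the guard time is counted in calls to `update` (one call per cycle).
    """
    on_threshold: float   # begin throttling when metric >= this value
    off_threshold: float  # end throttling when metric <= this value
    guard_time: int       # cycles the state is held after each change
    throttled: bool = False
    hold: int = 0         # remaining cycles before the state may change

    def __post_init__(self) -> None:
        if self.off_threshold > self.on_threshold:
            raise ValueError("off_threshold must not exceed on_threshold")
        if self.guard_time < 0:
            raise ValueError("guard_time must be non-negative")

    def update(self, metric: float) -> bool:
        """Advance one cycle; return True if packet injection is allowed."""
        if self.hold > 0:
            # Guard time: ignore the metric until the hold period expires.
            self.hold -= 1
        elif not self.throttled and metric >= self.on_threshold:
            self.throttled = True
            self.hold = self.guard_time
        elif self.throttled and metric <= self.off_threshold:
            self.throttled = False
            self.hold = self.guard_time
        return not self.throttled


if __name__ == "__main__":
    gate = HysteresisThrottle(on_threshold=0.8, off_threshold=0.4, guard_time=2)
    for m in [0.5, 0.9, 0.3, 0.3, 0.3, 0.6, 0.9]:
        print(m, gate.update(m))
```

Using separate on and off thresholds keeps the gate from toggling when the metric hovers near a single threshold, and the guard time bounds how often the state can change at all.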

  • A Genetic Approach for Accelerating Communication Performance by Node Mapping

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    LETTER-Architecture

    Publicized: 2018/09/18
    Vol: E101-D No:12
    Page(s): 2971-2975

    This paper aims to reduce duration times in typical collective communications. We introduce a logical addressing system separate from the physical one and, by rearranging the logical node addresses properly, we aim to reduce communication overheads so that ideal communication is performed. One of the key issues is the rearrangement of the logical addressing system. We introduce a genetic algorithm (GA) as a meta-heuristic solution, as well as a random search strategy. Our GA-based method achieves at most a 2.50 times speedup in three-traffic-pattern cases.
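
The node-mapping idea in the abstracts above (rearranging logical node addresses with a genetic algorithm) can be sketched as a permutation search. This is a toy illustration, not the authors' method: it assumes a 2D torus, uses total hop count as a stand-in for communication cost (the papers report duration times and speed-ups instead), and uses a simple order-based crossover with swap mutation and elitism. All function names and parameters are hypothetical.

```python
import random


def torus_hops(a: int, b: int, width: int, height: int) -> int:
    """Minimal hop count between physical nodes a and b on a 2D torus."""
    dx = abs(a % width - b % width)
    dy = abs(a // width - b // width)
    return min(dx, width - dx) + min(dy, height - dy)


def mapping_cost(mapping, traffic, width, height) -> int:
    """Total hops when logical node i is placed on physical node mapping[i]."""
    return sum(torus_hops(mapping[s], mapping[d], width, height)
               for s, d in traffic)


def order_crossover(p1, p2, rng):
    """Copy a segment of p1 in place and fill the rest in p2's order."""
    lo, hi = sorted(rng.sample(range(len(p1)), 2))
    middle = p1[lo:hi + 1]
    taken = set(middle)
    rest = [g for g in p2 if g not in taken]
    return rest[:lo] + middle + rest[lo:]  # always a valid permutation


def swap_mutation(perm, rng, rate):
    """With probability `rate`, swap two positions in a copy of `perm`."""
    perm = perm[:]
    if rng.random() < rate:
        i, j = rng.sample(range(len(perm)), 2)
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def ga_node_mapping(traffic, width, height, pop_size=40, generations=200,
                    mutation_rate=0.3, seed=0):
    """Search for a low-cost logical-to-physical node mapping."""
    if pop_size < 8:
        raise ValueError("pop_size must be at least 8")
    rng = random.Random(seed)
    n = width * height
    identity = list(range(n))
    # Seeding with the identity mapping plus elitism guarantees that the
    # result is never worse than leaving the nodes unmapped.
    population = [identity] + [rng.sample(identity, n)
                               for _ in range(pop_size - 1)]

    def cost(m):
        return mapping_cost(m, traffic, width, height)

    for _ in range(generations):
        population.sort(key=cost)
        elite = population[:pop_size // 4]  # survivors kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            child = order_crossover(p1, p2, rng)
            children.append(swap_mutation(child, rng, mutation_rate))
        population = elite + children
    return min(population, key=cost)


if __name__ == "__main__":
    w, h = 4, 4
    # Hypothetical traffic pattern: each node sends to one fixed partner.
    pattern = [(i, (i * 5 + 3) % (w * h)) for i in range(w * h)]
    best = ga_node_mapping(pattern, w, h)
    print("identity cost:", mapping_cost(list(range(w * h)), pattern, w, h))
    print("GA mapping cost:", mapping_cost(best, pattern, w, h))
```

Encoding a mapping as a permutation keeps every individual a valid one-to-one placement, so crossover and mutation never need a repair step. Hop count ignores contention between packets, which the abstracts identify as a major factor, so a realistic cost function would need a network simulation.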