The search functionality is under construction.

Keyword Search Result

[Keyword] parallel computer (26 hits)

Hits 1-20 of 26

  • Enhancing Cup-Stacking Method for Collective Communication

    Takashi YOKOTA  Kanemitsu OOTSU  Shun KOJIMA  

     
    PAPER-Computer System

      Publicized:
    2023/08/22
      Vol:
    E106-D No:11
      Page(s):
    1808-1821

    An interconnection network is an essential component of parallel computers. It connects computation nodes so that the nodes can communicate with each other. Because parallel computation inherently requires inter-node communication according to a parallel algorithm, the interconnection network plays an important role in communication performance. This paper focuses on collective communication, which is frequently performed in parallel computation, and addresses the Cup-Stacking method proposed in our preceding work. The key ideas of the method are splitting a large packet into slices, re-shaping the slices, and stacking the slices, in a genetic algorithm (GA) manner. This paper extends the Cup-Stacking method by introducing additional items (genes) and proposes the extended Cup-Stacking method. Furthermore, this paper comprehensively discusses the drawbacks and further optimization of the method. Evaluation results reveal the effectiveness of the extended method, which achieves at most a seven percent improvement in duration time over the former Cup-Stacking method.

  • The Implementation of a Hybrid Router and Dynamic Switching Algorithm on a Multi-FPGA System

    Tomoki SHIMIZU  Kohei ITO  Kensuke IIZUKA  Kazuei HIRONAKA  Hideharu AMANO  

     
    PAPER

      Publicized:
    2022/06/30
      Vol:
    E105-D No:12
      Page(s):
    2008-2018

    The multi-FPGA system known as the Flow-in-Cloud (FiC) system is composed of mid-range FPGAs that are directly interconnected by high-speed serial links. FiC is currently being developed as a server for multi-access edge computing (MEC), which is one of the core technologies of 5G. Because the applications of MEC are sometimes timing-critical, a static time division multiplexing (STDM) network has been used on FiC. However, the STDM network exhibits the disadvantage of decreasing link utilization, especially under light traffic. To solve this problem, we propose a hybrid router that combines packet switching for low-priority communication and STDM for high-priority communication. In our hybrid network, the packet switching uses slots that are unused by the STDM; therefore, best-effort communication by packet switching and QoS-guaranteed communication by the STDM can be used simultaneously. Furthermore, to improve the utilization of each link under a low network traffic load, we propose a dynamic communication switching algorithm. In our algorithm, each router monitors the network load metrics, and, according to these metrics, timing-critical tasks select the STDM only when congestion occurs. This can achieve both a QoS guarantee and efficient utilization of each link with a small resource overhead. In our evaluation on a real multi-FPGA system with 24 boards, the dynamic algorithm was up to 24.6% faster in execution time under a high network load than packet switching.

  • On a Cup-Stacking Concept in Repetitive Collective Communication

    Takashi YOKOTA  Kanemitsu OOTSU  Shun KOJIMA  

     
    LETTER-Computer System

      Publicized:
    2022/04/15
      Vol:
    E105-D No:7
      Page(s):
    1325-1329

    Parallel computing essentially consists of computation and communication and, in many cases, communication performance is vital. Many parallel applications use collective communications, which often dominate the performance of the parallel execution. This paper focuses on collective communication performance to speed up parallel execution. This paper first presents our experimental result that splitting a session of collective communication into small portions (slices) can enable efficient communication. Then, based on the results, this paper proposes a new concept, cup-stacking, with a genetic-algorithm-based methodology. The preliminary evaluation results reveal the effectiveness of the proposed method.

  • A Genetic Approach for Accelerating Communication Performance by Node Mapping

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    LETTER-Architecture

      Publicized:
    2018/09/18
      Vol:
    E101-D No:12
      Page(s):
    2971-2975

    This paper aims to reduce duration times in typical collective communications. We introduce a logical addressing system separate from the physical one and, by rearranging the logical node addresses properly, aim to reduce communication overheads so that ideal communication is performed. One of the key issues is rearrangement of the logical addressing system. We introduce a genetic algorithm (GA) as a meta-heuristic solution, as well as a random search strategy. Our GA-based method achieves at most a 2.50-times speedup in three-traffic-pattern cases.

  • A Static Packet Scheduling Approach for Fast Collective Communication by Using PSO

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Interconnection networks

      Publicized:
    2017/07/14
      Vol:
    E100-D No:12
      Page(s):
    2781-2795

    The interconnection network is one of the essential components of parallel computers, since it is responsible for the communication capabilities of the systems. It affects the system-level performance as well as the physical and logical structure of the systems. Although many studies have been reported that enhance interconnection network technology, many issues remain to be discussed. One of the most important issues is congestion management. In an interconnection network, many packets are transferred simultaneously and the packets interfere with each other in the network. Congestion arises as a result of this interference. It spreads quickly, seriously degrades communication performance, and persists for a long time. Thus, we should appropriately control the network to suppress congestion and maintain the maximum performance. Many studies address the problem and present effective methods; however, the maximal performance in an ideal situation is not sufficiently clarified. Determining the ideal performance is, in general, an NP-hard problem. This paper introduces the particle swarm optimization (PSO) methodology to overcome the problem. In this paper, we first formalize the optimization problem in a form suitable for the PSO method and present a simple PSO application as naive models. Then, we discuss reduction of the size of the search space and introduce three practical variations of the PSO computation models: the repetitive model, the expansion model, and the coding model. We furthermore introduce some non-PSO methods for comparison. Our evaluation results reveal the high potential of the PSO method. The repetitive and expansion models significantly accelerate collective communication, at most 1.72 times faster than under the bursty communication condition.

  • Enhancing Entropy Throttling: New Classes of Injection Control in Interconnection Networks

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Interconnection network

      Publicized:
    2016/08/25
      Vol:
    E99-D No:12
      Page(s):
    2911-2922

    State-of-the-art parallel computers, which are growing in parallelism, place many demands on their interconnection networks. Although a wide spectrum of research and development efforts toward effective and practical interconnection networks has been reported, the problem is still open. One of the largest issues is congestion control, which aims to maximize the network performance in terms of throughput and latency. Throttling, or injection limitation, is one of the central ideas of congestion control. We have proposed a new class of throttling method, Entropy Throttling, which is founded on an entropy concept of packets. The throttling method is partly successful; however, its potential has not been sufficiently discussed. This paper aims to exploit the capabilities of the Entropy Throttling method via comprehensive evaluation. The major contributions of this paper are the introduction of two ideas, a hysteresis function and guard time, and the clarification of a wide range of performance characteristics in steady and unsteady communication situations. By introducing the new ideas, we extend the Entropy Throttling method. The extended methods improve communication performance at most 3.17 times in the best case and 1.47 times on average compared with non-throttling cases in collective communication, while sustaining steady communication performance.

  • MMLRU Selection Function: A Simple and Efficient Output Selection Function in Adaptive Routing

    Michihiro KOIBUCHI  Akiya JOURAKU  Hideharu AMANO  

     
    PAPER-Computer Systems

      Vol:
    E88-D No:1
      Page(s):
    109-118

    Adaptive routing algorithms, which dynamically select the route of a packet, have been widely studied for interconnection networks in massively parallel computers. An output selection function (OSF), which decides the output channel when some legal channels are free, is essential for adaptive routing. In this paper, we propose a simple and efficient OSF called minimal multiplexed and least-recently-used (MMLRU). The MMLRU selection function has the following simple strategies for distributing the traffic: 1) each router locally grasps the congestion information from the utilization ratio of its own physical channels; 2) selection is divided into two steps, the choice from available physical channels and the choice from available virtual channels. The MMLRU selection function can be used with any type of network topology and adaptive routing algorithm. Simulation results show that the MMLRU selection function improves throughput and latency, especially when the number of dimensions or the number of nodes per dimension becomes larger.

  • REX: A Reconfigurable Experimental System for Evaluating Parallel Computer Systems

    Yuetsu KODAMA  Toshihiro KATASHITA  Kenji SAYANO  

     
    PAPER

      Vol:
    E86-D No:10
      Page(s):
    2016-2024

    REX is a reconfigurable experimental system for evaluating and developing parallel computer systems. It consists of large-scale FPGAs, and enables the systems to be reconfigured from their processors to the network topology in order to support their evaluation and development. We evaluated REX using several implementations of parallel computer systems, and showed that it had enough scalability of gates, memory throughput and network throughput. We also showed that REX was an effective tool because of its emulation speed and reconfigurability to develop systems.

  • Concurrency Control and Performance Evaluation of Parallel B-tree Structures

    Jun MIYAZAKI  Haruo YOKOTA  

     
    PAPER-Databases

      Vol:
    E85-D No:8
      Page(s):
    1269-1283

    The Fat-Btree, a new parallel B-tree structure, has been proposed to improve the access performance of shared-nothing parallel database systems. Since the Fat-Btree has only a part of the index nodes on each processing element, it can reduce the synchronization cost in update operations. For these reasons, both retrieval and update operations can be processed at high throughput compared to previously proposed parallel B-tree structures for shared-nothing computers. Though we tried to apply some conventional concurrency control methods to the Fat-Btree, e.g., B-OPT and ARIES/IM, which were designed for shared-everything machines, we found that these methods are not always appropriate for the Fat-Btree. In this paper, it is shown that the conventional methods are not suitable for the Fat-Btree and other parallel B-trees. We propose a new deadlock-free concurrency control protocol, named INC-OPT, to improve the performance of the Fat-Btree more effectively than B-OPT and ARIES/IM. Furthermore, in order to demonstrate the impact of the Fat-Btree on the performance of shared-nothing parallel databases, we compare the real performance of three types of parallel B-tree structures, Fat-Btree, Copy-Whole-Btree, and Single-Index-Btree, on an nCUBE3 machine where INC-OPT is applied.

  • Fault-Tolerant Ring- and Toroidal Mesh-Connected Processor Arrays Able to Enhance Emulation of Hypercubes

    Nobuo TSUDA  

     
    PAPER

      Vol:
    E84-D No:11
      Page(s):
    1452-1461

    An advanced spare-connection scheme for K-out-of-N redundancy is proposed for constructing fault-tolerant ring- or toroidal mesh-connected processing-node arrays able to enhance emulation of binary hypercubes by using bypass networks. With this scheme, a component redundancy configuration for a base array with a fixed number of primary nodes, such as that for 8-node ring or 32-node toroidal mesh, can be constructed by using bypass links with a segmented bus structure to selectively connect the primary nodes to a spare node in parallel. These bypass links are allocated to the primary nodes by graph-node coloring with a minimum inter-node distance of three in order to use the bypass links as the hypercube connections as well as to attain strong fault tolerance for reconfiguring the base array with the primary network topology. An extended redundancy configuration for a large fault-tolerant array can be constructed by connecting the component configurations by using external switches of a hub type provided at the bus nodes of the bypass links. This configuration has a network topology of the parallel star-connections of sub-hypercubes whose diameter is smaller than that of the regular hypercube.

  • High-Availability Scheme for Shared Servers of Cluster Systems Using Commands Transfer

    Yuzuru MAYA  Soichi ISONO  Akira OHTSUJI  

     
    PAPER-Computer Systems

      Vol:
    E83-D No:5
      Page(s):
    1073-1081

    For cluster systems consisting of multiple processing nodes and shared servers which consist of an on-line and a backup shared server, we propose a hot-standby scheme for shared servers. In this scheme for shared servers, when the on-line shared server receives a command from a node, it sends only an update command and its data identifier to the backup shared server. Both the on-line and the backup shared server execute the update command independently, and the command result of the on-line shared server is identical to that of the backup shared server. When the on-line shared server fails, the backup reconstructs the shared data by using its own shared data and the user data from each node. We evaluated the system recovery time and the performance overhead for this hot-standby scheme. It enables the performance overhead to be ignored, and the system recovery time to be shortened to 20 seconds in cluster systems.

  • An Efficient Method for Reconfiguring the 1 1/2 Track-Switch Mesh Array

    Tadayoshi HORITA  Itsuo TAKANAMI  

     
    PAPER-Fault Tolerant Computing

      Vol:
    E82-D No:12
      Page(s):
    1545-1553

    As VLSI technology has developed, the interest in implementing an entire or significant part of a parallel computer system using wafer scale integration is growing. The major problem in this case is the possibility of drastically low yield and/or reliability of the system if there is no strategy for coping with such situations. Various strategies to restructure the faulty physical system into the fault-free target logical system are described in the literature [1]-[5]. In this paper, we propose an efficient approximate method which can reconstruct the 1 1/2 track-switch mesh arrays with faulty PEs using hardware as well as software. A logical circuit added to each PE and a network connecting the circuits are used to select the spare PEs that compensate for faulty PEs. The hardware complexity of each circuit is much less than that of a PE, and the size of each additional circuit is constant and independent of the array size. By using the exclusive hardware scheme, a built-in self-reconfigurable system without using a host computer is realizable and the time for reconfiguring arrays becomes very short. The simulation result of the performance of the method shows that the reconstructing efficiency of our algorithm is a little less than those of the exhaustive and Shigei's ones [6] and [7], but much better than that of the neural one [3]. We also compare the time complexities of reconstructions by hardware as well as software, and the hardware complexity in terms of the number of gates in the logical circuit added to each PE among the other methods.

  • Iterative Methods for Dense Linear Systems on Distributed Memory Parallel Computers

    Muneharu YOKOYAMA  Takaomi SHIGEHARA  Hiroshi MIZOGUCHI  Taketoshi MISHIMA  

     
    PAPER

      Vol:
    E82-A No:3
      Page(s):
    483-486

    The Conjugate Residual method, one of the iterative methods for solving linear systems, is applied to the problems with a dense coefficient matrix on distributed memory parallel computers. Based on an assumption on the computation and communication times of the proposed algorithm for parallel computers, it is shown that the optimal number of processing elements is proportional to the problem size N. The validity of the prediction is confirmed through numerical experiments on Hitachi SR2201.

  • Efficient Implementation of Multi-Dimensional Array Redistribution

    Minyi GUO  Yoshiyuki YAMASHITA  Ikuo NAKATA  

     
    PAPER-Software System

      Vol:
    E81-D No:11
      Page(s):
    1195-1204

    Array redistribution is required very often in programs on distributed memory parallel computers. It is essential to use efficient algorithms for redistribution; otherwise the performance of programs may degrade considerably. In this paper, we focus on automatic generation of communication routines for multi-dimensional redistribution. The principal advantage of this work is the ability to handle redistribution between arbitrary source and destination processor sets and between arbitrary source and destination distribution schemes. We have implemented these algorithms using the Parallelware communication library. Some experimental results show the efficiency and flexibility of our techniques compared to other redistribution works.

  • Processor Pipeline Design for Fast Network Message Handling in RWC-1 Multiprocessor

    Hiroshi MATSUOKA  Kazuaki OKAMOTO  Hideo HIRONO  Mitsuhisa SATO  Takashi YOKOTA  Shuichi SAKAI  

     
    PAPER

      Vol:
    E81-C No:9
      Page(s):
    1391-1397

    In this paper we describe the pipeline design and enhanced hardware for fast message handling in a RICA-1 processor, a processing element (PE) in the RWC-1 multiprocessor. The RWC-1 is based on the reduced inter-processor communication architecture (RICA), in which communications are combined with computation in the processor pipeline. The pipeline is enhanced with hardware mechanisms to support fine-grain parallel execution. The data paths of the RICA-1 super-scalar processor are commonly used for communication as well as instruction execution to minimize its implementation cost. A 128-PE system was built in January 1998, and it is currently used for hardware debugging, software development and performance evaluation.

  • Analytic Modeling of Updating Based Cache Coherent Parallel Computers

    Kazuki JOE  Akira FUKUDA  

     
    PAPER-Computer Systems

      Vol:
    E81-D No:6
      Page(s):
    504-512

    In this paper, we apply the Semi-markov Memory and Cache coherence Interference (SMCI) model, which we had proposed for invalidating based cache coherent parallel computers, to an updating based protocol. The model proposed here, the SMCI/Dragon model, can predict the performance of cache coherent parallel computers with the Dragon protocol as well as the original SMCI model does for the Synapse protocol. Conventional analytic models that use stochastic processes to describe parallel computers have the problem of numerical explosion in the number of states necessary as the system size increases. We have already shown that the SMCI model achieved both a small number of states to describe parallel computers with the Synapse protocol and an inexpensive computation cost to predict their performance. In this paper, we demonstrate the generality of the SMCI model by applying it to another cache coherence protocol, Dragon, which has characteristics opposite to those of Synapse. We show that the number of states required to construct the SMCI/Dragon model is only 21, which is as small as for SMCI/Synapse, and that the computation cost is also on the order of microseconds. Using the SMCI/Dragon model, we conduct several comparative experiments against widely known simulation results. We found that there is only a 5.4% difference between the simulation and the SMCI/Dragon model.

  • Parallel File Access for Implementing Dynamic Load Balancing on a Massively Parallel Computer

    Masahisa SHIMIZU  Yasuhiro OUE  Kazumasa OHNISHI  Toru KITAMURA  

     
    PAPER

      Vol:
    E80-D No:4
      Page(s):
    466-472

    Because a massively parallel computer processes vast amounts of data and generates many access requests from multiple processors simultaneously, parallel secondary storage requires large capacity and high concurrency. One effective method of implementation of such secondary storage is to use disk arrays which have multiple disks connected in parallel. In this paper, we propose a parallel file access method named DECODE (dynamic express changing of data entry) in which load balancing of each disk is achieved by dynamic determination of the write data position. For resolution of the problem of data fragmentation which is caused by the relocation of data during a write process, the concept of "Equivalent Area" is introduced. We have performed a preliminary performance evaluation using software simulation under various access statuses by changing the access pattern, access size and stripe size and confirmed the effectiveness of load balancing with this method.

  • The Effect of Optimizing Compilers on Architecture and Programs

    Michael WOLFE  

     
    INVITED PAPER

      Vol:
    E80-D No:4
      Page(s):
    403-408

    The first optimizing compiler was developed at IBM in order to prove that high level language programming could be as efficient as hand-coded machine language. Computer architecture and compiler optimization interacted through a feedback loop, from the high-level language computer architectures of the 1970s to the RISC machines of the 1980s. In the supercomputing community, the availability of effective vectorizing compilers has delivered easy-to-use performance from the 1980s to the present. These compilers were successful at least in part because they could predict poor performance spots in the program and report these to users. This fostered a feedback loop between programmers and compilers to develop high performance programs. Future optimizing compilers for high performance computers and supercomputers will have to take advantage of both feedback loops.

  • hMDCE: The Hierarchical Multidimensional Directed Cycles Ensemble Network

    Takashi YOKOTA  Hiroshi MATSUOKA  Kazuaki OKAMOTO  Hideo HIRONO  Shuichi SAKAI  

     
    PAPER-Interconnection Networks

      Vol:
    E79-D No:8
      Page(s):
    1099-1106

    This paper discusses a massively parallel interconnection scheme for multithreaded architecture and introduces a new class of direct interconnection networks called the hierarchical Multidimensional Directed Cycles Ensemble (hMDCE). Its suitability for massively parallel systems is discussed. The network is evolved from the Multidimensional Directed Cycles Ensemble (MDCE) network, where each node is substituted by lower-level sub-networks. The new network addresses some serious problems caused by the increasing scale of parallel systems, such as longer latency, limited throughput and high implementation cost. This paper first introduces the MDCE network and then presents and examines in detail the hierarchical MDCE network. The bisection bandwidth of hMDCE is considerably reduced from that of its ancestor MDCE, and the network achieves significantly higher throughput and lower latency under some practical implementation constraints. The gate count and delay time of the compiled circuit for the routing function are insignificant. These results reveal that the hMDCE network is an important candidate for massively parallel systems interconnection.

  • Software Cache Techniques for Memory Nodes in Distributed Memory Parallel Production Systems

    Jun MIYAZAKI  Haruo YOKOTA  

     
    PAPER-Architectures

      Vol:
    E79-D No:8
      Page(s):
    1046-1054

    Because the match phase in OPS5-type production systems requires most of the system's execution time and memory accesses, we proposed hash-based parallel production systems, CPPS (Clustered Parallel Production Systems), based on the RETE algorithm for distributed memory parallel computers, or multicomputers, to reduce this bottleneck. CPPS was effective in speeding up the match phase, but still left room for optimization. In this paper, we introduce software cache techniques to memory nodes in the CPPS as one of the optimizations, and implement them on a multicomputer, nCUBE2. The benchmark results show that the CPPS with the software cache is about 2-fold faster than the original, and more than 7-fold faster than the simple hash method proposed by Acharya et al. for a large scale problem. The speed-up can be attributed to decreased communication costs.
