Takashi YOKOTA Kanemitsu OOTSU Shun KOJIMA
An interconnection network is an inevitable component for constructing parallel computers. It connects computation nodes so that the nodes can communicate with each other. As a parallel computation essentially requires inter-node communication according to a parallel algorithm, the interconnection network plays an important role in terms of communication performance. This paper focuses on the collective communication that is frequently performed in parallel computation and this paper addresses the Cup-Stacking method that is proposed in our preceding work. The key issues of the method are splitting a large packet into slices, re-shaping the slice, and stacking the slices, in a genetic algorithm (GA) manner. This paper discusses extending the Cup-Stacking method by introducing additional items (genes) and proposes the extended Cup-Stacking method. Furthermore, this paper places comprehensive discussions on the drawbacks and further optimization of the method. Evaluation results reveal the effectiveness of the extended method, where the proposed method achieves at most seven percent improvement in duration time over the former Cup-Stacking method.
Tomoki SHIMIZU Kohei ITO Kensuke IIZUKA Kazuei HIRONAKA Hideharu AMANO
The multi-FPGA system known as, the Flow-in-Cloud (FiC) system, is composed of mid-range FPGAs that are directly interconnected by high-speed serial links. FiC is currently being developed as a server for multi-access edge computing (MEC), which is one of the core technologies of 5G. Because the applications of MEC are sometimes timing-critical, a static time division multiplexing (STDM) network has been used on FiC. However, the STDM network exhibits the disadvantage of decreasing link utilization, especially under light traffic. To solve this problem, we propose a hybrid router that combines packet switching for low-priority communication and STDM for high-priority communication. In our hybrid network, the packet switching uses slots that are unused by the STDM; therefore, best-effort communication by packet switching and QoS guarantee communication by the STDM can be used simultaneously. Furthermore, to improve each link utilization under a low network traffic load, we propose a dynamic communication switching algorithm. In our algorithm, each router monitors the network load metrics, and according to the metrics, timing-critical tasks select the STDM according to the metrics only when congestion occurs. This can achieve both QoS guarantee and efficient utilization of each link with a small resource overhead. In our evaluation, the dynamic algorithm was up to 24.6% faster on the execution time with a high network load compared to the packet switching on a real multi-FPGA system with 24 boards.
Takashi YOKOTA Kanemitsu OOTSU Shun KOJIMA
Parallel computing essentially consists of computation and communication and, in many cases, communication performance is vital. Many parallel applications use collective communications, which often dominate the performance of the parallel execution. This paper focuses on collective communication performance to speed-up the parallel execution. This paper firstly offers our experimental result that splitting a session of collective communication to small portions (slices) possibly enables efficient communication. Then, based on the results, this paper proposes a new concept cup-stacking with a genetic algorithm based methodology. The preliminary evaluation results reveal the effectiveness of the proposed method.
Takashi YOKOTA Kanemitsu OOTSU Takeshi OHKAWA
This paper intends to reduce duration times in typical collective communications. We introduce logical addressing system apart from the physical one and, by rearranging the logical node addresses properly, we intend to reduce communication overheads so that ideal communication is performed. One of the key issues is rearrangement of the logical addressing system. We introduce genetic algorithm (GA) as meta-heuristic solution as well as the random search strategy. Our GA-based method achieves at most 2.50 times speedup in three-traffic-pattern cases.
Takashi YOKOTA Kanemitsu OOTSU Takeshi OHKAWA
Interconnection network is one of the inevitable components in parallel computers, since it is responsible to communication capabilities of the systems. It affects the system-level performance as well as the physical and logical structure of the systems. Although many studies are reported to enhance the interconnection network technology, we have to discuss many issues remaining. One of the most important issues is congestion management. In an interconnection network, many packets are transferred simultaneously and the packets interfere to each other in the network. Congestion arises as a result of the interferences. Its fast spreading speed seriously degrades communication performance and it continues for long time. Thus, we should appropriately control the network to suppress the congested situation for maintaining the maximum performance. Many studies address the problem and present effective methods, however, the maximal performance in an ideal situation is not sufficiently clarified. Solving the ideal performance is, in general, an NP-hard problem. This paper introduces particle swarm optimization (PSO) methodology to overcome the problem. In this paper, we first formalize the optimization problem suitable for the PSO method and present a simple PSO application as naive models. Then, we discuss reduction of the size of search space and introduce three practical variations of the PSO computation models as repetitive model, expansion model, and coding model. We furthermore introduce some non-PSO methods for comparison. Our evaluation results reveal high potentials of the PSO method. The repetitive and expansion models achieve significant acceleration of collective communication performance at most 1.72 times faster than that in the bursty communication condition.
Takashi YOKOTA Kanemitsu OOTSU Takeshi OHKAWA
State-of-the-art parallel computers, which are growing in parallelism, require a lot of things in their interconnection networks. Although wide spectrum of efforts in research and development for effective and practical interconnection networks are reported, the problem is still open. One of the largest issues is congestion control that intends to maximize the network performance in terms of throughput and latency. Throttling, or injection limitation, is one of the center ideas of congestion control. We have proposed a new class of throttling method, Entropy Throttling, whose foundation is entropy concept of packets. The throttling method is successful in part, however, its potentials are not sufficiently discussed. This paper aims at exploiting capabilities of the Entropy Throttling method via comprehensive evaluation. Major contributions of this paper are to introduce two ideas of hysteresis function and guard time and also to clarify wide performance characteristics in steady and unsteady communication situations. By introducing the new ideas, we extend the Entropy throttling method. The extended methods improve communication performance at most 3.17 times in the best case and 1.47 times in average compared with non-throttling cases in collective communication, while the method can sustain steady communication performance.
Michihiro KOIBUCHI Akiya JOURAKU Hideharu AMANO
Adaptive routing algorithms, which dynamically select the route of a packet, have been widely studied for interconnection networks in massively parallel computers. An output selection function (OSF), which decides the output channel when some legal channels are free, is essential for an adaptive routing. In this paper, we propose a simple and efficient OSF called minimal multiplexed and least-recently-used (MMLRU). The MMLRU selection function has the following simple strategies for distributing the traffic: 1) each router locally grasps the congestion information by the utilization ratio of its own physical channels; 2) it is divided into the two selection steps, the choice from available physical channels and the choice from available virtual channels. The MMLRU selection function can be used on any type of network topology and adaptive routing algorithm. Simulation results show that the MMLRU selection function improves throughput and latency especially when the number of dimension becomes larger or the number of nodes per dimension become larger.
Yuetsu KODAMA Toshihiro KATASHITA Kenji SAYANO
REX is a reconfigurable experimental system for evaluating and developing parallel computer systems. It consists of large-scale FPGAs, and enables the systems to be reconfigured from their processors to the network topology in order to support their evaluation and development. We evaluated REX using several implementations of parallel computer systems, and showed that it had enough scalability of gates, memory throughput and network throughput. We also showed that REX was an effective tool because of its emulation speed and reconfigurability to develop systems.
The Fat-Btree which is a new parallel B-tree structure has been proposed to improve the access performance of shared-nothing parallel database systems. Since the Fat-Btree has only a part of index nodes on each processing element, it can reduce the synchronization cost in update operations. For these reasons, both retrieval and update operations can be processed at high throughput compared to previously proposed parallel B-tree structures for shared-nothing computers. Though we tried to apply some conventional concurrency control methods to the Fat-Btree, e.g., B-OPT and ARIES/IM, which were designed for shared-everything machines, we found that these methods are not always appropriate for the Fat-Btree. In this paper, it is shown that the conventional methods are not suitable for the Fat-Btree and other parallel B-trees. We propose a new deadlock free concurrency control protocol, named INC-OPT, to improve the performance of the Fat-Btree more effectively than the B-OPT and ARIES/IM. Furthermore, in order to prove that the Fat-Btree provides the impact on the performance of shared-nothing parallel databases, we compare the real performance of three types of parallel B-tree structures, Fat-Btree, Copy-Whole-Btree, and Single-Index-Btree, on an nCUBE3 machine where the INC-OPT is applied.
An advanced spare-connection scheme for K-out-of-N redundancy is proposed for constructing fault-tolerant ring- or toroidal mesh-connected processing-node arrays able to enhance emulation of binary hypercubes by using bypass networks. With this scheme, a component redundancy configuration for a base array with a fixed number of primary nodes, such as that for 8-node ring or 32-node toroidal mesh, can be constructed by using bypass links with a segmented bus structure to selectively connect the primary nodes to a spare node in parallel. These bypass links are allocated to the primary nodes by graph-node coloring with a minimum inter-node distance of three in order to use the bypass links as the hypercube connections as well as to attain strong fault tolerance for reconfiguring the base array with the primary network topology. An extended redundancy configuration for a large fault-tolerant array can be constructed by connecting the component configurations by using external switches of a hub type provided at the bus nodes of the bypass links. This configuration has a network topology of the parallel star-connections of sub-hypercubes whose diameter is smaller than that of the regular hypercube.
Yuzuru MAYA Soichi ISONO Akira OHTSUJI
For cluster systems consisting of multiple processing nodes and shared servers which consist of an on-line and a backup shared server, we propose a hot-standby scheme for shared servers. In this scheme for shared servers, when the on-line shared server receives a command from a node, it sends only an update command and its data identifier to the backup shared server. Both the on-line and the backup shared server execute the update command independently, and the command result of the on-line shared server is identical to that of the backup shared server. When the on-line shared server fails, the backup reconstructs the shared data by using its own shared data and the user data from each node. We evaluated the system recovery time and the performance overhead for this hot-standby scheme. It enables the performance overhead to be ignored, and the system recovery time to be shortened to 20 seconds in cluster systems.
Tadayoshi HORITA Itsuo TAKANAMI
As VLSI technology has developed, the interest in implementing an entire or significant part of a parallel computer system using wafer scale integration is growing. The major problem for the case is the possibility of drastically low yield and/or reliability of the system if there is no strategy for coping with such situations. Various strategies to restructure the faulty physical system into the fault-free target logical system are described in the literature [1]-[5]. In this paper, we propose an efficient approximate method which can reconstruct the 1 1/2 track-switch mesh arrays with faulty PEs using hardware as well as software. A logical circuit added to each PE and a network connecting the circuits are used to decide spare PEs which compensate for faulty PEs. The hardware compexity of each circuit is much less than that of a PE where the size of each additional circuit is independent of array sizes and constant. By using the exclusive hardware scheme, a built-in self-reconfigurable system without using a host computer is realizable and the time for reconfiguring arrays becomes very short. The simulation result of the performance of the method shows that the reconstructing efficiency of our algorithm is a little less than those of the exaustive and Shigei's ones [6] and [7], but much better than that of the neural one [3]. We also compare the time complexities of reconstructions by hardware as well as software, and the hardware complexity in terms of the number of gates in the logical circuit added to each PE among the other methods.
Muneharu YOKOYAMA Takaomi SHIGEHARA Hiroshi MIZOGUCHI Taketoshi MISHIMA
The Conjugate Residual method, one of the iterative methods for solving linear systems, is applied to the problems with a dense coefficient matrix on distributed memory parallel computers. Based on an assumption on the computation and communication times of the proposed algorithm for parallel computers, it is shown that the optimal number of processing elements is proportional to the problem size N. The validity of the prediction is confirmed through numerical experiments on Hitachi SR2201.
Minyi GUO Yoshiyuki YAMASHITA Ikuo NAKATA
Array redistribution is required very often in programs on distributed memory parallel computers. It is essential to use efficient algorithms for redistribution, otherwise the performance of programs may degrade considerably. In this paper, we focus on automatic generation of communication routines for multi-dimensional redistribution. The principal advantage of this work is to gain the ability to handle redistribution between arbitrary source and destination processor sets and between arbitrary source and destination distribution schemes. We have implemented these algorithms using Parallelware communication library. Some experimental results show the efficiency and flexibility of our techniques compared to the other redistribution works.
Hiroshi MATSUOKA Kazuaki OKAMOTO Hideo HIRONO Mitsuhisa SATO Takashi YOKOTA Shuichi SAKAI
In this paper we describe the pipeline design and enhanced hardware for fast message handling in a RICA-1 processor, a processing element (PE) in the RWC-1 multiprocessor. The RWC-1 is based on the reduced inter-processor communication architecture (RICA), in which communications are combined with computation in the processor pipeline. The pipeline is enhanced with hardware mechanisms to support fine-grain parallel execution. The data paths of the RICA-1 super-scalar processor are commonly used for communication as well as instruction execution to minimize its implementation cost. A 128-PE system has been built on January 1998, and it is currently used for hardware debugging, software development and performance evaluation.
In this paper, we apply the Semi-markov Memory and Cache coherence Interference (SMCI) model, which we had proposed for invalidating based cache coherent parallel computers, to an updating based protocol. The model proposed here, the SMCI/Dragon model, can predict performance of cache coherent parallel computers with the Dragon protocol as well as the original SMCI model for the Synapse protocol. Conventional analytic models by stochastic processes to describe parallel computers have the problem of numerical explosion in the number of states necessary as the system size increases. We have already shown that the SMCI model achieved both the small number of states to describe parallel computers with the Synapse protocol and the inexpensive computation cost to predict their performance. In this paper, we demonstrate generality of the SMCI model by applying it to the another cache coherence protocol, Dragon, which has opposite characteristics than Synapse. We show the number of states required by constructing the SMCI/Dragon model is only 21 which is as small as SMCI/Synapse, and the computation cost is also the order of microseconds. Using the SMCI/Dragon model, we investigate several comparative experiments with widely known simulation results. We found that there is only a 5. 4% differences between the simulation and the SMCI/Dragon model.
Masahisa SHIMIZU Yasuhiro OUE Kazumasa OHNISHI Toru KITAMURA
Because a massively parallel computer processes vast amounts of data and generates many access requests from multiple processors simultaneously, parallel secondary storage requires large capacity and high concurrency. One effective method of implementation of such secondary storage is to use disk arrays which have multiple disks connected in parallel. In this paper, we propose a parallel file access method named DECODE (dynamic express changing of data entry) in which load balancing of each disk is achieved by dynamic determination of the write data position. For resolution of the problem of data fragmentation which is caused by the relocation of data during a write process, the concept of "Equivalent Area" is introduced. We have performed a preliminary performance evaluation using software simulation under various access statuses by changing the access pattern, access size and stripe size and confirmed the effectiveness of load balancing with this method.
The first optimizing compiler was developed at IBM in order to prove that high level language programming could be as efficient as hand-coded machine language. Computer architecture and compiler optimization interacted through a feedback loop, from the high-level language computer architectures of the 1970s to the RISC machines of the 1980s. In the supercomputing community, the availability of effective vectorizing compilers delivered easy-to-use performance in the 1980s to the present. These compilers were successful at least in part because they could predict poor performance spots in the program and report these to users. This fostered a feedback loop between programmers and compilers to develop high performance programs. Future optimizing compilers for high performance computers and supercomputers will have to take advantage of both feedback loops.
Takashi YOKOTA Hiroshi MATSUOKA Kazuaki OKAMOTO Hideo HIRONO Shuichi SAKAI
This paper discusses a massively parallel interconnection scheme for multithreaded architecture and introduces a new class of direct interconnection networks called the hierarchical Multidimensional Directed Cycles Ensemble (hMDCE). Its suitability for massively parallel systems is discussed. The network is evolved from the Multidimensional Directed Cycles Ensemble (MDCE) network, where each node is substituted by lower-level sub-networks. The new network addresses some serious problems caused by the increasing scale of parallel systems, such as longer latency, limited throughput and high implementation cost. This paper first introduces the MDCE network and then presents and examines in detail the hierarchical MDCE network. Bisection bandwidth of hMDCE is considerably reduced from its ancestor MDCE and the network performs significantly higher throughput and lower latency under some practical implementation constraints. The gate count and delay time of the compiled circuit for the routing function are insignificant. These results reveal that the hMDCE network is an important candidate for massively parallel systems interconnection.
Because the match phase in OPS5-type production systems requires most of the system's execution time and memory accesses, we proposed hash-based parallel production systems, CPPS (Clustered Parallel Production Systems), based on the RETE algorithm for distributed memory parallel computers, or multicomputers to reduce such a bottleneck. CPPS was effective in speeding up the match phase, but still left room for optimizations. In this paper, we introduce software cache techniques to memory nodes in the CPPS as one of the optimizations, and implement it on a multicomputer, nCUBE2. The benchmark results show that the CPPS with the software cache is about 2-fold faster than the original, and more than 7-fold faster than the simple hash method proposed by Acharya et al. for a large scale problem. The speed-up can be attributed to decreased communication costs.