
Keyword Search Result

[Keyword] parallel computers (11 hits)

  • Enhancing Cup-Stacking Method for Collective Communication

    Takashi YOKOTA  Kanemitsu OOTSU  Shun KOJIMA  

     
    PAPER-Computer System
    Publicized: 2023/08/22
    Vol: E106-D No:11
    Page(s): 1808-1821

    An interconnection network is an indispensable component of a parallel computer: it connects the computation nodes so that they can communicate with each other. Since parallel computation inherently requires inter-node communication dictated by the parallel algorithm, the interconnection network plays a central role in communication performance. This paper focuses on collective communication, which is performed frequently in parallel computation, and addresses the Cup-Stacking method proposed in our preceding work. The key ideas of the method are splitting a large packet into slices, re-shaping each slice, and stacking the slices, all driven in a genetic algorithm (GA) manner. This paper extends the Cup-Stacking method by introducing additional items (genes) and proposes the extended Cup-Stacking method. Furthermore, it provides a comprehensive discussion of the method's drawbacks and further optimizations. Evaluation results reveal the effectiveness of the extension: the proposed method achieves up to a seven percent improvement in duration time over the original Cup-Stacking method.
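
    The extended method, as described above, searches over packet-slicing parameters with a genetic algorithm. The following is a minimal illustrative sketch of that style of search, not the authors' implementation: the individual encoding (slice count, per-slice shape factors, stacking order) and the cost function are assumptions standing in for the paper's network simulation.

        # Hypothetical GA over packet-slicing "genes" (slice count, per-slice
        # shape, stacking order). The cost function is a placeholder, not the
        # paper's collective-communication simulator.
        import random

        def random_individual(max_slices=8):
            n = random.randint(2, max_slices)
            return {"slices": n,
                    "shape": [random.choice((1, 2, 4)) for _ in range(n)],  # slice width factor
                    "order": random.sample(range(n), n)}                    # stacking order

        def cost(ind):
            # Stand-in for the simulated duration of the collective communication.
            spread = max(ind["shape"]) - min(ind["shape"])
            disorder = sum(abs(p - i) for i, p in enumerate(ind["order"]))
            return 0.5 * ind["slices"] + spread + 0.1 * disorder

        def mutate(ind):
            child = {"slices": ind["slices"], "shape": ind["shape"][:], "order": ind["order"][:]}
            child["shape"][random.randrange(child["slices"])] = random.choice((1, 2, 4))
            j, k = random.sample(range(child["slices"]), 2)
            child["order"][j], child["order"][k] = child["order"][k], child["order"][j]
            return child

        def evolve(pop_size=20, generations=50):
            pop = [random_individual() for _ in range(pop_size)]
            for _ in range(generations):
                pop.sort(key=cost)
                survivors = pop[:pop_size // 2]
                pop = survivors + [mutate(random.choice(survivors))
                                   for _ in range(pop_size - len(survivors))]
            return min(pop, key=cost)

        print(evolve())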

  • On a Cup-Stacking Concept in Repetitive Collective Communication

    Takashi YOKOTA  Kanemitsu OOTSU  Shun KOJIMA  

     
    LETTER-Computer System
    Publicized: 2022/04/15
    Vol: E105-D No:7
    Page(s): 1325-1329

    Parallel computing essentially consists of computation and communication, and in many cases communication performance is vital. Many parallel applications use collective communications, which often dominate the performance of the parallel execution. This paper focuses on collective communication performance in order to speed up parallel execution. It first presents experimental results showing that splitting a collective communication session into small portions (slices) can enable more efficient communication. Then, based on these results, it proposes a new concept, cup-stacking, together with a genetic-algorithm-based methodology. Preliminary evaluation results reveal the effectiveness of the proposed method.
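
    As a rough illustration of the slicing idea in this letter, the toy sketch below splits one large transfer into slices and interleaves the slices of several node pairs into a common injection schedule; the sizes and schedule format are assumptions, not the paper's experimental setup.

        # Toy illustration: split one large collective transfer into smaller
        # slices and interleave the slices of several node pairs step by step.
        def slice_message(length, num_slices):
            base, rem = divmod(length, num_slices)
            return [base + (1 if i < rem else 0) for i in range(num_slices)]

        def interleave(per_pair_slices):
            # Round-robin one slice from every node pair per injection step.
            steps = max(len(s) for s in per_pair_slices)
            return [[s[t] for s in per_pair_slices if t < len(s)] for t in range(steps)]

        # Example: three node pairs, each sending 10 flits split into 4 slices.
        print(interleave([slice_message(10, 4) for _ in range(3)]))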

  • A Genetic Approach for Accelerating Communication Performance by Node Mapping

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    LETTER-Architecture
    Publicized: 2018/09/18
    Vol: E101-D No:12
    Page(s): 2971-2975

    This paper aims to reduce duration times of typical collective communications. We introduce a logical addressing system separate from the physical one and, by properly rearranging the logical node addresses, reduce communication overheads so that close-to-ideal communication is performed. One of the key issues is how to rearrange the logical addressing system; we introduce a genetic algorithm (GA) as a meta-heuristic solution, alongside a random search strategy. Our GA-based method achieves up to 2.50 times speedup in three traffic-pattern cases.
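
    The mapping search described above can be pictured as optimizing a permutation of node addresses. The sketch below is a simplified stand-in, not the paper's method: it swaps logical-to-physical assignments on an assumed 4x4 mesh and keeps a candidate when it shortens the total hop count of a fixed, assumed traffic pattern.

        # Simplified mapping search on an assumed 4x4 mesh with an assumed
        # ring-shift traffic pattern; the paper's GA and traffic patterns differ.
        import random

        SIDE = 4
        N = SIDE * SIDE
        TRAFFIC = [(i, (i + 1) % N) for i in range(N)]    # assumed traffic pattern

        def hops(a, b):
            return abs(a % SIDE - b % SIDE) + abs(a // SIDE - b // SIDE)

        def cost(mapping):                                # mapping[logical] = physical
            return sum(hops(mapping[s], mapping[d]) for s, d in TRAFFIC)

        def mutate(mapping):
            child = mapping[:]
            i, j = random.sample(range(N), 2)
            child[i], child[j] = child[j], child[i]       # swap two logical addresses
            return child

        best = list(range(N))
        for _ in range(5000):
            cand = mutate(best)
            if cost(cand) < cost(best):
                best = cand
        print(cost(list(range(N))), "->", cost(best))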

  • A Static Packet Scheduling Approach for Fast Collective Communication by Using PSO

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Interconnection networks
    Publicized: 2017/07/14
    Vol: E100-D No:12
    Page(s): 2781-2795

    The interconnection network is one of the indispensable components of a parallel computer, since it determines the communication capabilities of the system and affects system-level performance as well as the physical and logical structure of the system. Although many studies have been reported on enhancing interconnection network technology, many issues remain open. One of the most important is congestion management. In an interconnection network, many packets are transferred simultaneously and interfere with each other, and congestion arises as a result of these interferences. It spreads quickly, seriously degrades communication performance, and persists for a long time. Thus, the network should be controlled appropriately to suppress congestion and maintain maximum performance. Many studies address the problem and present effective methods; however, the maximal performance achievable in an ideal situation has not been sufficiently clarified, and finding it is, in general, an NP-hard problem. This paper introduces a particle swarm optimization (PSO) methodology to tackle the problem. We first formalize the optimization problem in a form suited to PSO and present a simple PSO application as a naive model. We then discuss reducing the size of the search space and introduce three practical variations of the PSO computation model: the repetitive, expansion, and coding models. We furthermore introduce some non-PSO methods for comparison. Our evaluation results reveal the high potential of the PSO method: the repetitive and expansion models accelerate collective communication significantly, at most 1.72 times faster than under the bursty communication condition.
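
    To make the PSO formulation concrete, the sketch below treats each particle as a vector of per-packet injection delays and evaluates it with a placeholder cost that penalizes packets injected too close together; the paper's repetitive, expansion, and coding models and its network simulator are not reproduced here.

        # Generic PSO over per-packet injection delays. The cost function is a
        # placeholder for the paper's collective-communication simulation.
        import random

        NUM_PACKETS, MAX_DELAY = 16, 8.0

        def cost(delays):
            d = sorted(delays)
            crowding = sum(max(0.0, 1.0 - (d[i + 1] - d[i])) for i in range(len(d) - 1))
            return crowding + max(d)          # interference penalty + schedule length

        def pso(num_particles=20, iters=200, w=0.7, c1=1.4, c2=1.4):
            pos = [[random.uniform(0, MAX_DELAY) for _ in range(NUM_PACKETS)]
                   for _ in range(num_particles)]
            vel = [[0.0] * NUM_PACKETS for _ in range(num_particles)]
            pbest = [p[:] for p in pos]
            gbest = min(pbest, key=cost)
            for _ in range(iters):
                for i in range(num_particles):
                    for j in range(NUM_PACKETS):
                        r1, r2 = random.random(), random.random()
                        vel[i][j] = (w * vel[i][j] + c1 * r1 * (pbest[i][j] - pos[i][j])
                                     + c2 * r2 * (gbest[j] - pos[i][j]))
                        pos[i][j] = min(max(pos[i][j] + vel[i][j], 0.0), MAX_DELAY)
                    if cost(pos[i]) < cost(pbest[i]):
                        pbest[i] = pos[i][:]
                gbest = min(pbest, key=cost)
            return gbest

        print(round(cost(pso()), 3))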

  • Enhancing Entropy Throttling: New Classes of Injection Control in Interconnection Networks

    Takashi YOKOTA  Kanemitsu OOTSU  Takeshi OHKAWA  

     
    PAPER-Interconnection network
    Publicized: 2016/08/25
    Vol: E99-D No:12
    Page(s): 2911-2922

    State-of-the-art parallel computers, whose parallelism continues to grow, place heavy demands on their interconnection networks. Although a wide spectrum of research and development efforts toward effective and practical interconnection networks has been reported, the problem is still open. One of the largest issues is congestion control, which aims to maximize network performance in terms of throughput and latency. Throttling, or injection limitation, is one of the central ideas in congestion control. We have proposed a new class of throttling method, Entropy Throttling, whose foundation is an entropy concept for packets. The method is partly successful; however, its potential has not been sufficiently explored. This paper aims to exploit the capabilities of the Entropy Throttling method through comprehensive evaluation. The major contributions of this paper are to introduce two new ideas, a hysteresis function and a guard time, and to clarify wide-ranging performance characteristics in steady and unsteady communication situations. By introducing these ideas, we extend the Entropy Throttling method. The extended methods improve communication performance by at most 3.17 times in the best case and 1.47 times on average compared with non-throttling cases in collective communication, while sustaining steady communication performance.
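
    The hysteresis function and guard time described above can be illustrated with a simple injection controller: throttle when a congestion metric crosses an upper threshold, release below a lower one, and freeze the decision for a guard interval. The metric, thresholds, and cycle model below are assumptions; the paper's packet-entropy definition is not reproduced.

        # Illustrative throttling controller with hysteresis thresholds and a
        # guard time; the congestion metric stands in for the paper's entropy.
        class ThrottleController:
            def __init__(self, upper=0.8, lower=0.6, guard_time=2):
                self.upper, self.lower, self.guard_time = upper, lower, guard_time
                self.throttled = False
                self.cycles_since_change = 0

            def update(self, congestion):
                self.cycles_since_change += 1
                if self.cycles_since_change < self.guard_time:
                    return self.throttled                 # hold the last decision
                if not self.throttled and congestion > self.upper:
                    self.throttled, self.cycles_since_change = True, 0
                elif self.throttled and congestion < self.lower:
                    self.throttled, self.cycles_since_change = False, 0
                return self.throttled

        ctrl = ThrottleController()
        trace = [0.5, 0.7, 0.85, 0.9, 0.75, 0.65, 0.55, 0.5]
        print([ctrl.update(c) for c in trace])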

  • MMLRU Selection Function: A Simple and Efficient Output Selection Function in Adaptive Routing

    Michihiro KOIBUCHI  Akiya JOURAKU  Hideharu AMANO  

     
    PAPER-Computer Systems
    Vol: E88-D No:1
    Page(s): 109-118

    Adaptive routing algorithms, which dynamically select the route of a packet, have been widely studied for interconnection networks in massively parallel computers. An output selection function (OSF), which decides the output channel when more than one legal channel is free, is essential for adaptive routing. In this paper, we propose a simple and efficient OSF called minimal multiplexed and least-recently-used (MMLRU). The MMLRU selection function distributes traffic with two simple strategies: 1) each router locally estimates congestion from the utilization ratio of its own physical channels; 2) selection is divided into two steps, a choice among the available physical channels followed by a choice among the available virtual channels. The MMLRU selection function can be used with any network topology and adaptive routing algorithm. Simulation results show that it improves throughput and latency, especially as the number of dimensions or the number of nodes per dimension grows.
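
    The two selection steps described in the abstract can be sketched as follows; the data structures, tie-breaking, and utilization bookkeeping are assumptions for illustration, not the paper's router implementation.

        # Sketch of the two-step MMLRU idea: pick the legal physical channel
        # with the lowest utilization ratio, then the least-recently-used free
        # virtual channel on that physical channel.
        class MMLRUSelector:
            def __init__(self, phys_channels, vcs_per_channel):
                self.util = {p: 0.0 for p in phys_channels}   # utilization ratio per physical channel
                self.lru = {p: list(range(vcs_per_channel))   # front = least recently used
                            for p in phys_channels}

            def select(self, legal):
                # legal: {physical_channel: set of free virtual-channel ids}
                candidates = [p for p in legal if legal[p]]
                if not candidates:
                    return None
                phys = min(candidates, key=lambda p: self.util[p])   # step 1
                for vc in self.lru[phys]:                            # step 2
                    if vc in legal[phys]:
                        self.lru[phys].remove(vc)
                        self.lru[phys].append(vc)                    # mark most recently used
                        return phys, vc
                return None

        sel = MMLRUSelector(["x+", "y+"], vcs_per_channel=2)
        sel.util["x+"] = 0.4                                         # "x+" is busier
        print(sel.select({"x+": {0, 1}, "y+": {1}}))                 # -> ('y+', 1)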

  • Iterative Methods for Dense Linear Systems on Distributed Memory Parallel Computers

    Muneharu YOKOYAMA  Takaomi SHIGEHARA  Hiroshi MIZOGUCHI  Taketoshi MISHIMA  

     
    PAPER
    Vol: E82-A No:3
    Page(s): 483-486

    The Conjugate Residual method, one of the iterative methods for solving linear systems, is applied to problems with a dense coefficient matrix on distributed memory parallel computers. Based on an assumption about the computation and communication times of the proposed algorithm on such machines, it is shown that the optimal number of processing elements is proportional to the problem size N. The validity of this prediction is confirmed through numerical experiments on a Hitachi SR2201.
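
    For reference, a serial sketch of the Conjugate Residual iteration for a dense symmetric system is given below; the paper's distributed-memory partitioning of the matrix and of the matrix-vector product is omitted, and the test matrix is an arbitrary assumption.

        # Serial Conjugate Residual iteration for a dense symmetric system Ax = b.
        import numpy as np

        def conjugate_residual(A, b, tol=1e-10, max_iter=1000):
            x = np.zeros_like(b)
            r = b - A @ x
            p = r.copy()
            Ar = A @ r
            Ap = Ar.copy()
            rAr = r @ Ar
            for _ in range(max_iter):
                alpha = rAr / (Ap @ Ap)
                x += alpha * p
                r -= alpha * Ap
                if np.linalg.norm(r) < tol:
                    break
                Ar = A @ r
                rAr_new = r @ Ar
                beta = rAr_new / rAr
                p = r + beta * p
                Ap = Ar + beta * Ap
                rAr = rAr_new
            return x

        rng = np.random.default_rng(0)
        M = rng.standard_normal((50, 50))
        A = M @ M.T + 50 * np.eye(50)            # dense symmetric positive definite test matrix
        b = rng.standard_normal(50)
        x = conjugate_residual(A, b)
        print(np.linalg.norm(A @ x - b))         # residual norm, close to zero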

  • Efficient Implementation of Multi-Dimensional Array Redistribution

    Minyi GUO  Yoshiyuki YAMASHITA  Ikuo NAKATA  

     
    PAPER-Software System
    Vol: E81-D No:11
    Page(s): 1195-1204

    Array redistribution is required very often in programs on distributed memory parallel computers. It is essential to use efficient algorithms for redistribution; otherwise, the performance of programs may degrade considerably. In this paper, we focus on the automatic generation of communication routines for multi-dimensional redistribution. The principal advantage of this work is the ability to handle redistribution between arbitrary source and destination processor sets and between arbitrary source and destination distribution schemes. We have implemented these algorithms using the Parallelware communication library. Experimental results show the efficiency and flexibility of our techniques compared with other redistribution approaches.
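
    As a toy illustration of the index bookkeeping behind redistribution, the sketch below computes which elements each source processor must send to each destination processor when a 1-D BLOCK distribution is changed to a CYCLIC one; the paper's method handles arbitrary multi-dimensional and block-cyclic schemes, which this sketch does not attempt.

        # Toy send-set computation for a 1-D BLOCK -> CYCLIC redistribution.
        def block_owner(i, n, p):
            return i // -(-n // p)               # ceiling-sized blocks

        def cyclic_owner(i, p):
            return i % p

        def send_sets(n, p_src, p_dst):
            sends = {s: {d: [] for d in range(p_dst)} for s in range(p_src)}
            for i in range(n):
                sends[block_owner(i, n, p_src)][cyclic_owner(i, p_dst)].append(i)
            return sends

        # 12-element array, 3 source processors (BLOCK) -> 4 destination processors (CYCLIC)
        for src, targets in send_sets(12, 3, 4).items():
            print(src, {d: idx for d, idx in targets.items() if idx})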

  • The Effect of Optimizing Compilers on Architecture and Programs

    Michael WOLFE  

     
    INVITED PAPER
    Vol: E80-D No:4
    Page(s): 403-408

    The first optimizing compiler was developed at IBM in order to prove that high-level language programming could be as efficient as hand-coded machine language. Computer architecture and compiler optimization have interacted through a feedback loop, from the high-level language computer architectures of the 1970s to the RISC machines of the 1980s. In the supercomputing community, the availability of effective vectorizing compilers has delivered easy-to-use performance from the 1980s to the present. These compilers were successful at least in part because they could predict poor-performance spots in a program and report them to users, fostering a feedback loop between programmers and compilers for developing high-performance programs. Future optimizing compilers for high-performance computers and supercomputers will have to take advantage of both feedback loops.

  • hMDCE: The Hierarchical Multidimensional Directed Cycles Ensemble Network

    Takashi YOKOTA  Hiroshi MATSUOKA  Kazuaki OKAMOTO  Hideo HIRONO  Shuichi SAKAI  

     
    PAPER-Interconnection Networks
    Vol: E79-D No:8
    Page(s): 1099-1106

    This paper discusses a massively parallel interconnection scheme for multithreaded architectures and introduces a new class of direct interconnection networks called the hierarchical Multidimensional Directed Cycles Ensemble (hMDCE), along with its suitability for massively parallel systems. The network evolves from the Multidimensional Directed Cycles Ensemble (MDCE) network by substituting each node with a lower-level sub-network. The new network addresses serious problems caused by the increasing scale of parallel systems, such as longer latency, limited throughput, and high implementation cost. The paper first introduces the MDCE network and then presents and examines the hierarchical MDCE network in detail. Although the bisection bandwidth of hMDCE is considerably reduced compared with its ancestor MDCE, the network achieves significantly higher throughput and lower latency under some practical implementation constraints, and the gate count and delay time of the compiled routing-function circuit are insignificant. These results reveal that the hMDCE network is an important candidate for interconnecting massively parallel systems.
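
    As a very rough picture of the hierarchy, the sketch below models each level as multidimensional unidirectional rings and a node address as an (upper, lower) coordinate pair; this is only an assumption for illustration and does not reproduce the paper's MDCE definition, routing function, or cost analysis.

        # Crude two-level hop estimate for a hierarchical directed-cycles network.
        def ring_hops(src, dst, sizes):
            # hops along one unidirectional ring per dimension
            return sum((d - s) % n for s, d, n in zip(src, dst, sizes))

        def hmdce_hops(src, dst, upper_sizes, lower_sizes):
            (src_up, src_low), (dst_up, dst_low) = src, dst
            hops = ring_hops(src_up, dst_up, upper_sizes)
            if src_up != dst_up:
                src_low = (0,) * len(lower_sizes)   # assumed sub-network entry point
            return hops + ring_hops(src_low, dst_low, lower_sizes)

        # 4x4 upper-level network, each node replaced by a 2x2 sub-network
        print(hmdce_hops(((0, 0), (1, 1)), ((2, 3), (0, 1)), (4, 4), (2, 2)))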

  • Software Cache Techniques for Memory Nodes in Distributed Memory Parallel Production Systems

    Jun MIYAZAKI   Haruo YOKOTA  

     
    PAPER-Architectures
    Vol: E79-D No:8
    Page(s): 1046-1054

    Because the match phase in OPS5-type production systems accounts for most of the system's execution time and memory accesses, we previously proposed hash-based parallel production systems, CPPS (Clustered Parallel Production Systems), based on the RETE algorithm, for distributed memory parallel computers (multicomputers) in order to reduce this bottleneck. CPPS was effective in speeding up the match phase but still left room for optimization. In this paper, we introduce software cache techniques for the memory nodes in CPPS as one such optimization and implement them on a multicomputer, the nCUBE2. Benchmark results show that CPPS with the software cache is about 2-fold faster than the original and more than 7-fold faster than the simple hash method proposed by Acharya et al. for a large-scale problem. The speed-up can be attributed to decreased communication costs.
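
    The software cache idea can be pictured as a small memoization layer in front of a memory node's lookup routine, so repeated match-phase queries for the same key avoid a second expensive remote access. The LRU policy, capacity, and key format below are assumptions, not the CPPS implementation.

        # Illustrative software cache for a memory node's (remote) lookups.
        from collections import OrderedDict

        class SoftwareCache:
            def __init__(self, backing_lookup, capacity=256):
                self.backing_lookup = backing_lookup    # e.g. a remote working-memory query
                self.capacity = capacity
                self.cache = OrderedDict()
                self.hits = self.misses = 0

            def get(self, key):
                if key in self.cache:
                    self.cache.move_to_end(key)         # keep LRU order
                    self.hits += 1
                    return self.cache[key]
                self.misses += 1
                value = self.backing_lookup(key)
                self.cache[key] = value
                if len(self.cache) > self.capacity:
                    self.cache.popitem(last=False)      # evict least recently used
                return value

        store = {("goal", 1): ["fact-a"], ("goal", 2): ["fact-b"]}
        cache = SoftwareCache(lambda k: store.get(k, []))
        for k in [("goal", 1), ("goal", 1), ("goal", 2)]:
            cache.get(k)
        print(cache.hits, cache.misses)                 # -> 1 2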