The search functionality is under construction.

IEICE TRANSACTIONS on Information

  • Impact Factor

    0.72

  • Eigenfactor

    0.002

  • article influence

    0.1

  • Cite Score

    1.4

Advance publication (published online immediately after acceptance)

Volume E80-D No.4  (Publication Date:1997/04/25)

    Special Issue on Parallel and Distributed Supercomputing
  • FOREWORD

    Masaaki SHIMASAKI  

     
    FOREWORD

      Page(s):
    401-402
  • The Effect of Optimizing Compilers on Architecture and Programs

    Michael WOLFE  

     
    INVITED PAPER

      Page(s):
    403-408

    The first optimizing compiler was developed at IBM in order to prove that high level language programming could be as efficient as hand-coded machine language. Computer architecture and compiler optimization interacted through a feedback loop, from the high-level language computer architectures of the 1970s to the RISC machines of the 1980s. In the supercomputing community, the availability of effective vectorizing compilers delivered easy-to-use performance in the 1980s to the present. These compilers were successful at least in part because they could predict poor performance spots in the program and report these to users. This fostered a feedback loop between programmers and compilers to develop high performance programs. Future optimizing compilers for high performance computers and supercomputers will have to take advantage of both feedback loops.

  • Vienna Fortran and the Path Towards a Standard Parallel Language

    Barbara M. CHAPMAN  Piyush MEHROTRA  Hans P. ZIMA  

     
    INVITED PAPER

      Page(s):
    409-416

    Highly parallel scalable multiprocessing systems (HMPs) are powerful tools for solving large-scale scientific and engineering problems. However, these machines are difficult to program since algorithms must exploit locality in order to achieve high performance. Vienna Fortran was the first fully specified data-parallel language for HMPs that provided features for the specification of data distribution and alignment at a high level of abstraction. In this paper we outline the major elements of Vienna Fortran and compare it to High Performance Fortran (HPF), a de-facto standard in this area. A significant weakness of HPF is its lack of support for many advanced applications, which require irregular data distributions and dynamic load balancing. We introduce HPF +, an extension of HPF based on Vienna Fortran, that provides the required functionality.

  • Nonuniform Output Traffic Distributions in the Multipath Crossbar Network

    Byungho KIM  Boseob KWON  Hyunsoo YOON  Jung Wan CHO  

     
    PAPER

      Page(s):
    417-424

    Multipath interconnection networks can support higher bandwidth than those of nonblocking networks by passing multiple packets to the same output simultaneously and these packets are buffered in the output buffer. The delay-throughput performance of the output buffer in multipath networks is closely related to output traffic distribution, packet arrival process at each output link connected to a given output buffer. The output traffic distributions are different according to the various input traffic patterns. Focusing on nonuniform output traffic distributions, this paper develops a new, general analytic model of the output buffer in multipath networks, which enables us to investigate the delay-throughput performance of the output buffer under various input traffic patterns. This paper also introduces Multipath Crossbar network as a representative multipath network which is the base architecture of our analysis. It is shown that the output buffer performances such as packet loss probability and delay improve as nonuniformity of the output traffic distribution becomes larger.

  • Node-to-Set Disjoint Paths with Optimal Length in Star Graphs

    Qian-Ping GU  Shietung PENG  

     
    PAPER

      Page(s):
    425-433

    In this paper, we consider the following node-to-set disjoint paths problem: given a node s and a set T = {t1,...,tk} of k nodes in a k-connected graph G, find k node-disjoint paths s ti, 1 i k. We give an O(n2) time algorithm for the node-to-set disjoint paths problem in n-dimensional star graphs Gn which are (n - 1)-connected. The algorithm finds the n - 1 node-disjoint paths of length at most d(Gn) + 1 for n 4,6 and at most d(Gn) + 2 for n = 4,6, where d(Gn) = 3(n-1)/2 is the diameter of Gn. d(Gn) + 1 and d(Gn) + 2 are also the lower bounds on the length of the paths for the above problem in Gn for n 4,6 and n = 4,6, respectively.

  • Extending SCI on Hierarchical Directory Trees for Large-Scale Multiprocessors

    Ing-Zong LU  Tien-Fu CHEN  

     
    PAPER

      Page(s):
    434-440

    SCI (Scalable Coherent Interface) is pointerbased coherent directory scheme for massively parallel multiprocessors. Large message latency is one of the problems with SCI because of its linked list structure: the searching latency of messages could grow as a linear order of the number of processors. In this paper, we focus on a hierarchical architecture to propose a new schemeEST(Extending SCI-Tree), which may reduce the message traffic and also take the advantages of the topology property. Simulation results show that the EST scheme is effective in reducing message latency and communication cost when compared with other schemes.

  • Intelligent Memory: An Architecture for Lock-Free Synchronization

    Nakun SEONG  Naihoon JUNG  Byungho KIM  Hyunsoo YOON  

     
    PAPER

      Page(s):
    441-447

    This paper presents intelligent memory, a new memory architecture capable of providing efficient lock-free synchronization. In the intelligent memory, a sequence of operations on a shared object associated with that memory module can be processed without any intervention so that an environment for the synchronization can be provided by executing a critical section itself in that memory module. For this, we present a memory architecture for the intelligent memory having minimal instruction set and develop a progtramming model, called Critical Section Procedure (CSP), which consists of shared data structures and operations on them. Intelligent memory is intended to eliminate waste of processing time such as busy waiting in spin lock and the retry due to process contentions in existing lock-free synchronization schemes. Simulation results show that the intelligent memory provides better throughput compared with the spin lock and the existing lock-free synchronization schemes.

  • A High-Performance Cluster Computing Environment Based on Hybrid Shared Memory/Message Passing Model

    Yoshimasa OHNISHI  Yoshinari SUGIMOTO  Toshinori SUEYOSHI  

     
    PAPER

      Page(s):
    448-454

    We conducted research and development of Distributed Supercomputing Environment (DSE) based on distributed shared memory model to serve as a cluster computing environment to provide parallel processing facilities. Shared memory model and message passing model are well-known typical models of parallel processing. It is desired that hybrid programming environment will make the best use of the prominent features of both models. Consequently, we add a new message passing mechanism to present DSE, and create a prototype called Hybrid DSE as a hybrid model based cluster computing environment. In this paper, we describe the implementation of a message passing mechanism on DSE and performance evaluation of Hybrid DSE.

  • Design of Array Processors for 2-D Discrete Fourier Transform

    Shietung PENG  Igor SEDUKHIN  Stanislav SEDUKHIN  

     
    PAPER

      Page(s):
    455-465

    In this paper the design of systolic array processors for computing 2-dimensional Discrete Fourier Transform (2-D DFT) is considered. We investigated three different computational schemes for designing systolic array processors using systematic approach. The systematic approach guarantees to find optimal systolic array processors from a large solution space in terms of the number of processing elements and I/O channels, the processing time, topology, pipeline period, etc. The optimal systolic array processors are scalable, modular and suitable for VLSI implementation. An application of the designed systolic array processors to the prime-factor DFT is also presented.

  • Parallel File Access for Implementing Dynamic Load Balancing on a Massively Parallel Computer

    Masahisa SHIMIZU  Yasuhiro OUE  Kazumasa OHNISHI  Toru KITAMURA  

     
    PAPER

      Page(s):
    466-472

    Because a massively parallel computer processes vast amounts of data and generates many access requests from multiple processors simultaneously, parallel secondary storage requires large capacity and high concurrency. One effective method of implementation of such secondary storage is to use disk arrays which have multiple disks connected in parallel. In this paper, we propose a parallel file access method named DECODE (dynamic express changing of data entry) in which load balancing of each disk is achieved by dynamic determination of the write data position. For resolution of the problem of data fragmentation which is caused by the relocation of data during a write process, the concept of "Equivalent Area" is introduced. We have performed a preliminary performance evaluation using software simulation under various access statuses by changing the access pattern, access size and stripe size and confirmed the effectiveness of load balancing with this method.

  • Data-Localization Scheduling inside Processor-Cluster for Multigrain Parallel Processing

    Akimasa YOSHIDA  Ken'ichi KOSHIZUKA  Wataru OGATA  Hironori KASAHARA  

     
    PAPER

      Page(s):
    473-479

    This paper proposes a data-localization scheduling scheme inside a processor-cluster for multigrain parallel processing, which hierarchically exploits parallelism among coarsegrain tasks like loops, medium-grain tasks like loop iterations and near-fine-grain tasks like statements. The proposed scheme assigns near-fine-grain or medium-grain tasks inside coarse-grain tasks onto processors inside a processor-cluster so that maximum parallelism can be exploited and inter-processor data transfer can be minimum after data-localization for coarse-grain tasks across processor-clusters. Performance evaluation on a multiprocessor system OSCAR shows that multigrain parallel processing with the proposed data-localization scheduling can reduce execution time for application programs by 10% compared with multigrain parallel processing without data-localization.

  • Non-Graph Based Approach on the Analysis of Pointers and Structures

    Dong-Soo HAN  Takao TSUDA  

     
    PAPER

      Page(s):
    480-488

    In high performance compilers to process pointer-handling programs, precise pointer alias analysis is useful for the compilers to generate efficient object code. It is well known that most compiler techniques such as data flow analysis, dependence analysis, side effect analysis and optimizations are related to the alias problem. However, without data structure information, there is a limit on the precision of the alias analysis. Even though the automatic data structure detection problem is complex, when pointer manipulation satisfies some restrictions, some data structures can be detected automatically by compilers with some knowledge of aliases. In this paper, we propose an automatic data structure detection method for Pascal and Fortran 90. Linear list, tree and dag data structures are detected. Detected data structure information can be used not only for raising the precision of alias analysis but also for some optimizing techniques for pointer handling programs directly.

  • A Lookahead Heuristic for Heterogeneous Multiprocessor Scheduling with Communication Costs

    Dingchao LI  Akira MIZUNO  Yuji IWAHORI  Naohiro ISHII  

     
    PAPER

      Page(s):
    489-494

    This paper describes a new approach to the scheduling problem that assigns tasks of a parallel program described as a task graph onto parallel machines. The approach handles interprocessor communication and heterogeneity, based on using both the theoretical results developed so far and a lookahead scheduling strategy. The experimental results on randomly generated task graphs demonstrate the effectiveness of this scheduling heuristic.

  • Reproducing the Behavior of a Parallel Program by Using Dataflow Execution Models

    Naohisa TAKAHASHI  Takeshi MIEI  

     
    PAPER

      Page(s):
    495-503

    We present a general framework with which we can evaluate the flexibility and efficiency of various replay systems for parallel programs. In our approach, program monitoring is modeled by making a virtual dataflow program graph, referred to as a VDG, that includes all the instructions executed by the program. The behavior of the program replay is modeled on the parallel interpretation of a VDG based on two basic parallel execution models for dataflow program graphs: a data-driven model and a demand-driven model. Previous attempts to replay parallel programs, known as Instant Replay and P-Sequence, are also modeled as variations of the data-driven replay, i.e. the datadriven interpretation of a VDG. We show that the demand-driven replay, i.e. the demand-driven interpretation of a VDG, is more flexible in program replay than the data-driven replay since it allows better control of parallelism and a more selective replay. We also show that we can implement a demand-driven replay that requires almost the same amount of data to be saved during program monitoring as does the data-driven replay, and which eliminates any centralized bottleneck during program monitoring by optimizing the demand propagation and using an effective data structure.

  • Implementation of a Parallel Prolog System on a Distributed Memory Parallel Computer

    Hideo MATSUDA  Toru KAWABATA  Yukio KANEDA  

     
    PAPER

      Page(s):
    504-509

    In this paper we propose a new method for parallel execution of Prolog programs and present its implementation on a distributed memory parallel computer, Fujitsu AP1000. In our method a number of processes (named Prolog engines) explore different branches of a search tree (named tasks) in parallel, which is the same as OR-parallelism. Unlike OR-parallelism, the mapping between Prolog engines and tasks is statically determined like data parallelism. Each Prolog engine can decide which task is executed by the engine without communicating with the other engines. In many search problems, however, such static task mapping may cause imbalance on the processing time of each engine since the computational costs to explore branches may vary substantially. To cope with this issue, we devise a method to adjust the task imbalance by periodical exchanging how many tasks were processed for each engine. Also for reducing communication overhead in load balancing, we limit the scope of engines that exchange the load information each other. The effectiveness of our method is evaluated by measuring execution times for N Queens and the Traveling Salesman Problem on the AP1000. Using 512 processors, we obtained 355-fold speedup for N Queens and 343-fold speedup on the Traveling Salesman Problem.

  • An Efficient Implementation of Term Rewriting System on a Distributed Memory Architecture

    Yoshinari HACHISU  Shinichirou YAMAMOTO  Takeshi HAMAGUCHI  Kiyoshi AGUSA  

     
    PAPER

      Page(s):
    510-517

    Term Rewriting System (TRS) is a model of computation and it is used in various application such as algebraic specification. TRS has an inherent concurrency and it is suitable for parallel computing. We have already proposed BOB (Bundle Of Branches), which is a mechanism of data management for parallel rewriting. We have proposed a model of parallel rewriting using BOB and implemented a TRS simulator based on this model on a shared memory parallel computer. Because it fully depends on the feature of a shared memory architecture, that is, a process can access any memory element, it is hard to transport it on a distributed memory parallel computer. In this paper, we propose autonomous BOB model. This model is suitable for a distributed memory architecture since a process uses message passing protocol and the method of load balancing is provided. We implement a TRS simulator using this model on a distributed memory architecture and it runs about 30 times faster on 64 processors than on a single processor.

  • Implementation of the Multicolored SOR Method on a Vector Supercomputer

    Seiji FUJINO  Ryutaro HIMENO  Akira KOJIMA  Kazuo TERADA  

     
    PAPER

      Page(s):
    518-523

    We describe the implementation of an iterative method with the goal of gaining a long vector length. The strategy for vectorization by means of multipoint stencils used for discretization of the partial differential equations is discussed. Numerical experiments show that the strategy that requires certain restrictions on the number of grid points in the x and y directions improves the performance on the vector supercomputer.

  • High-Performance Parallel Computation of Flows Past a Space Plane Using NWT

    Kisa MATSUSHIMA  Susumu TAKANASHI  

     
    PAPER

      Page(s):
    524-530

    Compressible viscous flows past a space plane have been elucidated by parallel computation on the NWT. The NWT is a vector-parallel architecture computer system which achieves remarkably high performance in processing speed and memory storage. We have examined the advantages of the NWT in order to simulate realistic flow problems in engineering, such as the investigation of global and local aerodynamic characteristics of a space plane. The accuracy of the computational results has been verified by comparison with experimental data. The simplified domain-decomposition technique introduced here is easy to apply for parallel implementation to significantly improve the acceleration rate of computations. The larger available memory storage enables us to conduct a grid refinement study through which several points concerning CFD simulation of a space plane are obtained.

  • Parallelized Simulation of Complicated Polymer Structures and lts Efficiency

    Kazuhito SHIDA  Kaoru OHNO  Masayuki KIMURA  Yoshiyuki KAWAZOE  

     
    PAPER

      Page(s):
    531-537

    A large scale simulation for polymer chains in good solvent is performed. The implementation technique for efficient parallel execution, optimization, and load-balancing are discussed on this practical application. Finally, a simple performance model is proposed.