
Keyword Search Result

[Keyword] parallelizing compiler (11 hits)

Results 1-11 of 11
  • Compiler Software Coherent Control for Embedded High Performance Multicore

    Boma A. ADHI  Tomoya KASHIMATA  Ken TAKAHASHI  Keiji KIMURA  Hironori KASAHARA

    PAPER
    Vol: E103-C No:3  Page(s): 85-97

    The advancement of multicore technology has made processors with hundreds or even thousands of cores on a single chip possible. However, at such scales a hardware-based cache coherency mechanism becomes overwhelmingly complicated, hot, and expensive. We therefore propose a software coherence scheme, managed by a parallelizing compiler, for shared-memory multicore systems without a hardware cache coherence mechanism. Our proposed method is simple and efficient, and it is built into the OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes coarse-grain tasks, analyzes stale data and cache-line sharing in the program, and then resolves those problems by simple program restructuring and data synchronization. Using the proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, the NAS Parallel Benchmarks (NPB), and MediaBench II. The compiled binaries were then run on the Renesas RP2, an 8-core SH-4A processor, and on a custom 8-core Altera Nios II system on an Altera Arria 10 FPGA. The cache coherence hardware on the RP2 processor is available for at most 4 cores, and it can also be turned off for a non-coherent cache mode. The Nios II multicore system has no hardware cache coherence mechanism at all, so running a parallel program on it is difficult without compiler support. The proposed method performed as well as or better than the hardware cache coherence scheme while producing the same correct results. It allows a massive array of shared-memory CPU cores in an HPC setting, or a simple non-coherent embedded multicore CPU, to be programmed easily. For example, on the RP2 processor the proposed software-controlled non-coherent-cache (NCC) method achieved a 2.6-fold speedup on 4 cores for SPEC2000 "equake" against sequential execution, whereas 4-core MESI hardware coherence control achieved only a 2.5-fold speedup. The software coherence control also achieved a 4.4-fold speedup on 8 cores, where no hardware coherence mechanism is available.
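
    A minimal sketch of the idea behind compiler-managed coherence, in C: dirty lines are written back before a synchronization point, and potentially stale lines are invalidated after it. The cache-op and barrier functions here are hypothetical stand-ins, not the OSCAR compiler's actual runtime interface.

        /* Minimal sketch of compiler-inserted software coherence control.
           cache_writeback()/cache_invalidate() stand in for whatever
           line-granularity cache operations the target CPU exposes
           (hypothetical names). */
        #include <stddef.h>

        extern void cache_writeback(void *addr, size_t bytes);   /* flush dirty lines   */
        extern void cache_invalidate(void *addr, size_t bytes);  /* discard stale lines */
        extern void barrier_sync(void);                          /* inter-core barrier  */

        #define N 1024
        double shared_buf[N];

        /* Producer core: write data, then push it out to shared memory. */
        void producer_task(void) {
            for (int i = 0; i < N; i++)
                shared_buf[i] = i * 0.5;
            cache_writeback(shared_buf, sizeof shared_buf);  /* make writes visible */
            barrier_sync();
        }

        /* Consumer core: drop any stale cached copy before reading. */
        void consumer_task(double *sum) {
            barrier_sync();
            cache_invalidate(shared_buf, sizeof shared_buf); /* avoid stale data */
            double s = 0.0;
            for (int i = 0; i < N; i++)
                s += shared_buf[i];
            *sum = s;
        }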

  • Power-Aware Compiler Controllable Chip Multiprocessor

    Hiroaki SHIKANO  Jun SHIRAKO  Yasutaka WADA  Keiji KIMURA  Hironori KASAHARA

    PAPER
    Vol: E91-C No:4  Page(s): 432-439

    A power-aware, compiler-controllable chip multiprocessor (CMP) is presented, and its performance and power consumption are evaluated with the optimally scheduled advanced multiprocessor (OSCAR) parallelizing compiler. The CMP is equipped with power control registers that change the clock frequency and power supply voltage of functional units, including processor cores, memories, and an interconnection network. The OSCAR compiler carries out coarse-grain task parallelization of programs and reduces power consumption using the architecture's power control support and the compiler's power saving scheme. The performance evaluation shows that MPEG-2 encoding on the proposed CMP with four CPUs achieves an 82.6% power reduction in real-time execution mode with a deadline constraint set to its sequential execution time. Furthermore, MP3 encoding on a heterogeneous CMP with four CPUs and four accelerators achieves a 53.9% power reduction with a 21.1-fold speedup over sequential execution in the fastest execution mode.
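
    An illustrative sketch of compiler-directed frequency/voltage control through memory-mapped power control registers. The register addresses, bit encodings, and function names below are invented for illustration; the paper's CMP defines its own interface.

        /* Illustrative sketch of compiler-directed DVFS via memory-mapped
           power-control registers. All addresses and encodings are
           hypothetical. */
        #include <stdint.h>

        #define PWR_CTRL_BASE  0xFFFF0000u  /* hypothetical base address */
        #define FREQ_REG(core) ((volatile uint32_t *)(PWR_CTRL_BASE + 0x10u * (core)))

        enum { FULL = 0, HALF = 1, QUARTER = 2, OFF = 3 };  /* clock modes */

        static inline void set_core_clock(int core, uint32_t mode) {
            *FREQ_REG(core) = mode;  /* one register write changes the mode */
        }

        /* The compiler inserts mode changes at coarse-grain task boundaries:
           a core slows down when its next task is off the critical path. */
        void run_noncritical_task(int core) {
            set_core_clock(core, HALF);   /* deadline still met at half clock */
            /* ... task body ... */
            set_core_clock(core, FULL);   /* restore before critical work */
        }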

  • Development and Implementation of an Interactive Parallelization Assistance Tool for OpenMP: iPat/OMP

    Makoto ISHIHARA  Hiroki HONDA  Mitsuhisa SATO

    PAPER-Parallel/Distributed Programming Models, Paradigms and Tools
    Vol: E89-D No:2  Page(s): 399-407

    iPat/OMP is an interactive parallelization assistance tool for OpenMP. In this paper, we describe the design concept of iPat/OMP, the parallelization sequence achieved by the tool, and its current implementation status. In addition, we evaluate the performance of the implemented functionalities. The experimental results show that iPat/OMP can detect parallelism and create appropriate OpenMP directives for several kinds of for-loops.
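
    The kind of transformation iPat/OMP assists with, sketched by hand (not tool output): a for-loop with no loop-carried dependence receives an OpenMP work-sharing directive.

        /* A loop whose iterations are independent, annotated with the
           OpenMP directive a tool like iPat/OMP would help produce. */
        #include <stdio.h>

        #define N 1000000

        int main(void) {
            static double a[N], b[N];
            for (int i = 0; i < N; i++) b[i] = i;

            /* No loop-carried dependence: each iteration writes only a[i]. */
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                a[i] = 2.0 * b[i] + 1.0;

            printf("%f\n", a[N - 1]);
            return 0;
        }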

  • Bit Length Optimization of Fractional Part on Floating to Fixed Point Conversion for High-Level Synthesis

    Nobuhiro DOI  Takashi HORIYAMA  Masaki NAKANISHI  Shinji KIMURA  Katsumasa WATANABE

    PAPER-Logic and High Level Synthesis
    Vol: E86-A No:12  Page(s): 3184-3191

    In hardware synthesis from a high-level language such as C, the bit length of variables is one of the key issues for area and speed optimization. Usually, designers are required to optimize the bit length of each variable manually, using time-consuming simulation over huge data sets. In this paper, we propose a method for optimizing the fractional bit length in the conversion from floating-point to fixed-point variables. The method is based on error propagation and on backward propagation of the accuracy limitation. It is fully analytical and fast compared to simulation-based methods.
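
    A small illustration of what the fractional bit length controls: with f fractional bits the quantization step is 2^-f, so the rounding error per value is bounded by 2^-(f+1), which is the quantity an analytical error-propagation method reasons about. The code is illustrative, not the paper's method.

        /* Float-to-fixed conversion with f fractional bits, and the
           per-value rounding-error bound 2^-(f+1). */
        #include <stdio.h>
        #include <stdint.h>
        #include <math.h>

        static int32_t to_fixed(double x, int f)  { return (int32_t)lround(x * (1 << f)); }
        static double  to_float(int32_t q, int f) { return (double)q / (1 << f); }

        int main(void) {
            double x = 3.14159265;
            for (int f = 4; f <= 16; f += 4) {
                double err = fabs(x - to_float(to_fixed(x, f), f));
                printf("f=%2d  error=%.8f  bound=%.8f\n",
                       f, err, pow(2.0, -(f + 1)));  /* err <= bound */
            }
            return 0;
        }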

  • Efficient Loop Partitioning for Parallel Codes of Irregular Scientific Computations

    Minyi GUO

    PAPER-Software Systems
    Vol: E86-D No:9  Page(s): 1825-1834

    In most distributed-memory computations, node programs are executed on processors according to the owner-computes rule. However, the owner-computes rule is not well suited to irregular application codes: the use of indirection in accessing the left-hand-side array makes it difficult to partition the loop iterations, and because indirection is also used in accessing right-hand-side elements, total communication may be reduced by heuristics other than the owner-computes rule. In this paper, we propose a communication-cost-reducing computes rule for irregular loop partitioning, called the least-communication computes rule. Each loop iteration is assigned to the processor on which its execution incurs the minimal communication cost. After all iterations are partitioned among the processors, we give a global-to-local data transformation rule, indirection-array remapping, and communication optimization methods. The experimental results show that, in most cases, our approach achieves better performance than other loop partitioning rules.
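
    A minimal sketch of the least-communication computes rule for an irregular loop of the form x[ia[i]] += y[ib[i]]: count the remote accesses each candidate processor would incur for iteration i, and assign the iteration to the cheapest one. The block-distribution owner() function is an assumption made for illustration.

        /* Assign iteration i to the processor with minimal communication,
           assuming a block data distribution (illustrative). */
        #define NPROC 4
        #define N     1024

        static int owner(int elem) { return elem / (N / NPROC); }  /* block owner */

        int assign_iteration(int i, const int ia[], const int ib[]) {
            int best_p = 0, best_cost = 1 << 30;
            for (int p = 0; p < NPROC; p++) {
                int cost = 0;
                if (owner(ia[i]) != p) cost++;  /* remote access to x[ia[i]] */
                if (owner(ib[i]) != p) cost++;  /* remote access to y[ib[i]] */
                if (cost < best_cost) { best_cost = cost; best_p = p; }
            }
            return best_p;  /* cheapest processor for iteration i */
        }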

  • Parallel Molecular Dynamics in a Parallelizing SML Compiler

    Norman SCAIFE  Ryoko HAYASHI  Susumu HORIGUCHI

    PAPER-Software Systems and Technologies
    Vol: E86-D No:9  Page(s): 1569-1576

    We have constructed a parallelizing compiler for Standard ML (SML) based upon algorithmic skeletons. We present an implementation of a Parallel Molecular Dynamics (PMD) simulation in order to compare our functional approach with a traditional imperative approach. Although we present performance data, the principal benefits of our approach are the modularity of the code and the ease of programming. Extant FORTRAN90 code for an O(N^2) algorithm is translated, first into imperative SML and then into purely functional SML, which is then parallelized. The ease of programming and the performance of the FORTRAN90 and SML code are compared. Modest parallel performance is obtained from the parallel SML, but with a much slower sequential execution time than the FORTRAN90. We then improve the implementation with a ring-topology version whose performance is much closer to that of the FORTRAN90 implementation.
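
    For context, the shape of the O(N^2) kernel at the heart of a molecular dynamics code, sketched in C (the paper's implementations are in FORTRAN90 and Standard ML): every particle pair contributes a force, with a toy pair potential standing in for the real physics.

        /* O(N^2) pairwise force computation, 1-D and toy physics for brevity. */
        #define N 512

        typedef struct { double x, v, f; } particle;

        void compute_forces(particle p[N]) {
            for (int i = 0; i < N; i++) p[i].f = 0.0;
            for (int i = 0; i < N; i++)
                for (int j = i + 1; j < N; j++) {
                    double r = p[j].x - p[i].x;
                    double f = 1.0 / (r * r + 1e-9);  /* toy pair potential */
                    p[i].f += f;
                    p[j].f -= f;                      /* Newton's third law */
                }
        }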

  • Multigrain Parallel Processing on Compiler Cooperative OSCAR Chip Multiprocessor Architecture

    Keiji KIMURA  Takeshi KODAKA  Motoki OBATA  Hironori KASAHARA

    PAPER-Architecture and Algorithms
    Vol: E86-C No:4  Page(s): 570-579

    This paper describes multigrain parallel processing on the OSCAR (Optimally SCheduled Advanced multiprocessoR) chip multiprocessor architecture. The OSCAR compiler-cooperative chip multiprocessor architecture aims at a scalable, cost-effective chip multiprocessor that delivers high effective performance and ease of use through compiler support. It integrates simple single-issue processors with distributed shared data memory, for exploiting data locality across different loops and for fine-grain data transfer and synchronization; local data memory, for private data recognized by the compiler; and a compiler-controllable data transfer unit that overlaps data transfers with computation to hide the transfer overhead. The OSCAR chip multiprocessor and the OSCAR multigrain parallelizing compiler have been developed together. The performance of multigrain parallel processing on the OSCAR architecture is evaluated using the SPEC fp 2000/95 benchmark suites. When a microSPARC-like single-issue core is used, the architecture achieves speedups with four processors against a single processor of 2.36 in fpppp, 2.64 in su2cor, 2.88 in turb3d, 2.98 in hydro2d, 3.84 in tomcatv, 3.84 in mgrid, and 3.97 in swim.
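
    A sketch of the overlap that a compiler-controllable data transfer unit enables: double-buffered transfers into local data memory proceed while the previous chunk is processed. dtu_start()/dtu_wait() are hypothetical stand-ins for the DTU's control interface.

        /* Double-buffered DTU transfers overlapped with computation. */
        #include <stddef.h>

        extern void dtu_start(void *local, const void *global, size_t bytes); /* async copy  */
        extern void dtu_wait(void);                                           /* completion  */

        #define CHUNK 256
        double local_buf[2][CHUNK];

        void process(double *chunk);  /* compute on data already in local memory */

        void pipelined_loop(const double *global, int nchunks) {
            dtu_start(local_buf[0], global, sizeof local_buf[0]);   /* prefetch first */
            for (int k = 0; k < nchunks; k++) {
                dtu_wait();                                         /* chunk k ready  */
                if (k + 1 < nchunks)                                /* prefetch next  */
                    dtu_start(local_buf[(k + 1) & 1], global + (k + 1) * CHUNK,
                              sizeof local_buf[0]);
                process(local_buf[k & 1]);  /* computation overlaps the transfer */
            }
        }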

  • Enhanced Look-Ahead Scheduling Technique to Overlap Communication with Computation

    Dingchao LI  Yuji IWAHORI  Tatsuya HAYASHI  Naohiro ISHII

    PAPER-Software System
    Vol: E81-D No:11  Page(s): 1205-1212

    Reducing communication overhead is a key goal of program optimization for current scalable multiprocessors. A well-known approach is to map tasks (indivisible units of computation) onto processors so that communication and computation overlap as much as possible. In earlier work, we developed a look-ahead scheduling heuristic for efficiently reducing communication overhead, with the aim of decreasing the completion time of a given parallel program. In this paper, we report an extension of that algorithm which fills the idle time slots created by interprocessor communication without increasing the algorithm's time complexity. Experimental results emphasize the importance of optimally filling idle time slots in processors.
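
    A minimal sketch of the idle-slot-filling idea: when communication delays the next scheduled task on a processor, another ready task whose execution fits entirely inside the gap can be placed there. The data structures are simplified stand-ins, not the paper's algorithm.

        /* Fill an idle gap on a processor's schedule with a fitting task. */
        typedef struct { int start, end; } slot;

        /* Does a task of length `len`, ready at time `ready`, fit the gap? */
        static int fits(slot gap, int ready, int len) {
            int start = ready > gap.start ? ready : gap.start;
            return start + len <= gap.end;
        }

        /* Pick the first ready task that fits the idle slot; -1 if none. */
        int fill_idle_slot(slot gap, int n, const int ready[], const int len[]) {
            for (int t = 0; t < n; t++)
                if (fits(gap, ready[t], len[t]))
                    return t;
            return -1;
        }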

  • Efficient Implementation of Multi-Dimensional Array Redistribution

    Minyi GUO  Yoshiyuki YAMASHITA  Ikuo NAKATA

    PAPER-Software System
    Vol: E81-D No:11  Page(s): 1195-1204

    Array redistribution is required very often in programs on distributed-memory parallel computers. Efficient redistribution algorithms are essential; otherwise program performance may degrade considerably. In this paper, we focus on the automatic generation of communication routines for multi-dimensional redistribution. The principal advantage of this work is the ability to handle redistribution between arbitrary source and destination processor sets and between arbitrary source and destination distribution schemes. We have implemented these algorithms using the Parallelware communication library. Experimental results show the efficiency and flexibility of our techniques compared to other redistribution work.
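
    A small sketch of the index computation underlying redistribution, for a 1-D BLOCK-to-CYCLIC change on P processors: comparing the source and destination owner of each element yields the send sets, and multi-dimensional redistribution applies such maps per dimension. Illustrative only.

        /* Enumerate which elements must move in a BLOCK -> CYCLIC change. */
        #include <stdio.h>

        #define N 16
        #define P 4

        int block_owner(int i)  { return i / (N / P); }  /* BLOCK distribution  */
        int cyclic_owner(int i) { return i % P; }        /* CYCLIC distribution */

        int main(void) {
            for (int i = 0; i < N; i++) {
                int src = block_owner(i), dst = cyclic_owner(i);
                if (src != dst)
                    printf("element %2d: P%d -> P%d\n", i, src, dst);
            }
            return 0;
        }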

  • Data-Localization Scheduling inside Processor-Cluster for Multigrain Parallel Processing

    Akimasa YOSHIDA  Ken'ichi KOSHIZUKA  Wataru OGATA  Hironori KASAHARA

    PAPER
    Vol: E80-D No:4  Page(s): 473-479

    This paper proposes a data-localization scheduling scheme inside a processor-cluster for multigrain parallel processing, which hierarchically exploits parallelism among coarse-grain tasks such as loops, medium-grain tasks such as loop iterations, and near-fine-grain tasks such as statements. The proposed scheme assigns near-fine-grain or medium-grain tasks inside coarse-grain tasks to processors within a processor-cluster so that maximum parallelism is exploited and inter-processor data transfer is minimized after data-localization for coarse-grain tasks across processor-clusters. Performance evaluation on the OSCAR multiprocessor system shows that multigrain parallel processing with the proposed data-localization scheduling reduces execution time for application programs by 10% compared with multigrain parallel processing without data-localization.
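
    A minimal sketch of the data-localization intuition, under the assumption of a simple block decomposition: iterations are scheduled on the processor whose local memory holds the data partition they touch, so inter-processor transfer is avoided. This is an illustration, not the paper's scheduling algorithm.

        /* Iterations follow their data: each one is assigned to the
           processor holding its partition in local memory. */
        #define NPROC 4
        #define NITER 1024

        /* Stand-in for the compiler's data decomposition. */
        static int partition_of(int iter) { return iter / (NITER / NPROC); }

        void schedule(int proc_of_iter[NITER]) {
            for (int i = 0; i < NITER; i++)
                proc_of_iter[i] = partition_of(i);  /* no remote data needed */
        }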

  • PPD: A Practical Parallel Loop Detector for Parallelizing Compilers on Multiprocessor Systems

    Chao-Tung YANG  Cheng-Tien WU  Shian-Shyong TSENG

    PAPER-Software System
    Vol: E79-D No:11  Page(s): 1545-1560

    It is well known that extracting parallel loops plays a significant role in designing parallelizing compilers. The execution efficiency of a loop is enhanced when it can be executed fully or partially in parallel, as a DOALL or DOACROSS loop. This paper reports on the practical parallelism detector (PPD), implemented in PFPC (a portable FORTRAN parallelizing compiler running on OSF/1) at NCTU, which concentrates on finding the parallelism available in loops. PPD can extract the potential DOALL and DOACROSS loops in a program by invoking a combination of the ZIV test and the I test to verify array subscripts. Furthermore, when DOACROSS loops are found, synchronization statements are optimized. Experimental results show that PPD is more reliable and accurate than previous approaches.
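
    A sketch of the ZIV (zero index variable) test, one of the subscript tests PPD combines: when both subscripts of a pair of array references are loop-invariant constants, a dependence exists if and only if the constants are equal.

        /* ZIV test: both subscripts are constants c1 and c2; the references
           can touch the same element only when c1 == c2. */
        #include <stdbool.h>

        /* Returns true if a dependence may exist between a[c1] and a[c2]. */
        bool ziv_test(long c1, long c2) {
            return c1 == c2;   /* distinct constants: provably independent */
        }

        /* Example: in  for(i...){ a[5] = ...; ... = a[9]; }  the ZIV test on
           (5, 9) disproves dependence, so this pair does not block DOALL. */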