Atsushi MATSUO Shigeru YAMASHITA Daniel J. EGGER
Most quantum circuits require SWAP gate insertion to run on quantum hardware with limited qubit connectivity. A promising SWAP gate insertion method for blocks of commuting two-qubit gates is a predetermined swap strategy that applies layers of SWAP gates simultaneously executable on the coupling map. A good initial mapping for the swap strategy reduces the number of required SWAP gates. However, even when a circuit consists of commuting gates, e.g., as in the Quantum Approximate Optimization Algorithm (QAOA) or trotterized simulations of Ising Hamiltonians, finding a good initial mapping is a hard problem. We present a SAT-based approach to find good initial mappings for circuits with commuting gates transpiled to hardware with swap strategies. Our method achieves a 65% reduction in gate count for random three-regular graphs with 500 nodes. In addition, we present a heuristic approach that combines the SAT formulation with a clustering algorithm to reduce large problems to a manageable size. This approach reduces the number of swap layers by 25% compared to both a trivial and a random initial mapping for a random three-regular graph with 1000 nodes. Good initial mappings will therefore enable the study of quantum algorithms, such as QAOA and Ising Hamiltonian simulation applied to sparse problems, on noisy quantum hardware with several hundred qubits.
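As a rough illustration of how a SAT formulation of initial mapping can be set up (a minimal sketch, not the encoding used in the paper; the coupling map, interaction list, and adjacency condition below are invented toy data), the following Python snippet uses the Z3 solver to assign logical qubits to physical qubits one-to-one and asks that interacting qubits start on adjacent physical qubits:

```python
# Minimal sketch of a SAT-style initial-mapping encoding (toy data; not the
# paper's exact formulation). Boolean x[q][p] means logical qubit q starts on
# physical qubit p.
from z3 import Bool, Solver, Or, Not, And, is_true, sat

n = 4                                    # number of qubits (toy size)
coupling = [(0, 1), (1, 2), (2, 3)]      # line-shaped coupling map
interactions = [(0, 2), (1, 3)]          # two-qubit gates in the commuting block

x = [[Bool(f"x_{q}_{p}") for p in range(n)] for q in range(n)]
s = Solver()

for q in range(n):                       # each logical qubit sits on exactly one physical qubit
    s.add(Or(x[q]))
    for p1 in range(n):
        for p2 in range(p1 + 1, n):
            s.add(Or(Not(x[q][p1]), Not(x[q][p2])))
for p in range(n):                       # each physical qubit hosts at most one logical qubit
    for q1 in range(n):
        for q2 in range(q1 + 1, n):
            s.add(Or(Not(x[q1][p]), Not(x[q2][p])))

# Toy stand-in for the real objective: require interacting qubits to start on
# adjacent physical qubits (a real encoding would instead bound swap layers).
for a, b in interactions:
    s.add(Or([And(x[a][p], x[b][r]) for p, r in coupling] +
             [And(x[a][r], x[b][p]) for p, r in coupling]))

if s.check() == sat:
    m = s.model()
    print({q: p for q in range(n) for p in range(n) if is_true(m[x[q][p]])})
```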
High-level synthesis (HLS) is a source-code-driven Register Transfer Level (RTL) design tool, and the performance, power consumption, and area of a generated RTL are limited partly by the description of the HLS input source code. To break through this limitation and obtain a further optimized RTL, optimization of the input source code is indispensable. Routing congestion is one such problem that calls for refinement of the HLS input source code. In this paper, we propose a novel HLS flow that performs code improvements by detecting congested parts directly from the HLS input source code, without using physical logic synthesis, and by regenerating the input source code for HLS. In our approach, the origin of wire congestion is detected by applying pattern matching to a Program Dependence Graph (PDG) constructed from the HLS input source code, and the possibility of wire congestion is reported.
Boma A. ADHI Tomoya KASHIMATA Ken TAKAHASHI Keiji KIMURA Hironori KASAHARA
The advancement of multicore technology has made processors with hundreds or even thousands of cores on a single chip possible. However, at such scales, a hardware-based cache coherency mechanism becomes overwhelmingly complicated, hot, and expensive. Therefore, we propose a software coherence scheme managed by a parallelizing compiler for shared-memory multicore systems without a hardware cache coherence mechanism. Our proposed method is simple and efficient, and is built into the OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes coarse-grain tasks, analyzes stale data and cache line sharing in the program, and then solves those problems by simple program restructuring and data synchronization. Using our proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, NAS Parallel Benchmark (NPB), and MediaBench II. The compiled binaries were then run on the Renesas RP2, an 8-core SH-4A processor, and on a custom 8-core Altera Nios II system on an Altera Arria 10 FPGA. The cache coherence hardware on the RP2 processor is only available for up to 4 cores, and it can also be turned off for a non-coherent cache mode. The Nios II multicore system has no hardware cache coherence mechanism, so running a parallel program on it is difficult without compiler support. The proposed method performed as well as or better than the hardware cache coherence scheme while still providing the same correct results as the hardware coherence mechanism. This method allows a massive array of shared-memory CPU cores in an HPC setting, or a simple non-coherent multicore embedded CPU, to be easily programmed. For example, on the RP2 processor, the proposed software-controlled non-coherent-cache (NCC) method gave a 2.6-times speedup for SPEC 2000 “equake” with 4 cores against sequential execution, whereas 4-core MESI hardware coherence control achieved only a 2.5-times speedup. The software coherence control also gave a 4.4-times speedup with 8 cores, for which no hardware coherence mechanism is available.
Yoshitake OKI Yuto ABE Kazuki YAMAMOTO Kohei YAMAMOTO Tomoya SHIRAKAWA Akimasa YOSHIDA Keiji KIMURA Hironori KASAHARA
Utilization of local memory, from real-time embedded systems to high-performance systems with multi-core processors, has become an important factor for satisfying hard deadline constraints. However, challenges lie in efficiently managing the memory hierarchy, such as decomposing large data into small blocks that fit onto local memory and transferring blocks for reuse and replacement. To address this issue, this paper presents a compiler optimization method that automatically manages the local memory of multi-core processors. The method selects and maps multi-dimensional data onto software-specified memory blocks called Adjustable Blocks. These blocks are hierarchically divisible, with varying sizes defined by the features of the input application. Moreover, the method introduces mapping structures called Template Arrays to maintain the indices of the decomposed multi-dimensional data. The proposed work is implemented in the OSCAR automatic parallelizing compiler, and evaluations were performed on the Renesas RP2 8-core processor. Experimental results from the NAS Parallel Benchmark, SPEC benchmark, and multimedia applications show the effectiveness of the method, obtaining a maximum speed-up of 20.44 with 8 cores utilizing local memory over single-core sequential versions that use off-chip memory.
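As a rough sketch of the block-decomposition idea (hypothetical sizes; not the OSCAR compiler's actual Adjustable Block or Template Array implementation), the following Python snippet halves block dimensions until a block of a 2-D array fits a local-memory budget, then iterates over the resulting blocks:

```python
# Toy sketch of decomposing a 2-D array into hierarchically divisible blocks
# that fit a local-memory budget (hypothetical sizes; illustration only).
import numpy as np

LOCAL_MEM_BYTES = 16 * 1024          # assumed local-memory budget per core

def block_shape(shape, itemsize, budget=LOCAL_MEM_BYTES):
    """Halve block dimensions (largest first) until one block fits the budget."""
    block = list(shape)
    while np.prod(block) * itemsize > budget:
        i = max(range(len(block)), key=lambda d: block[d])
        block[i] = max(1, block[i] // 2)
    return tuple(block)

def blocks(array):
    """Yield (index origin, block view) pairs covering the whole array."""
    bh, bw = block_shape(array.shape, array.itemsize)
    for r in range(0, array.shape[0], bh):
        for c in range(0, array.shape[1], bw):
            yield (r, c), array[r:r + bh, c:c + bw]

a = np.zeros((256, 256), dtype=np.float64)      # 512 KB: too large for local memory
for (r, c), blk in blocks(a):
    blk += 1.0                                   # stand-in for per-block computation
print(block_shape(a.shape, a.itemsize), a.sum() == a.size)
```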
Toru NAKURA Tetsuya IIZUKA Kunihiro ASADA
This paper demonstrates a PLL compiler that generates final GDSII data from a specification of input and output frequencies with PVT corner conditions. A Pulse Width Controlled PLL (PWPLL) is composed of digital blocks and is thus suitable for being designed with a standard cell library and laid out with a commercially available place-and-route (P&R) tool. A PWPLL has 8 design parameters. Our PLL compiler decides these 8 parameters and confirms the PLL operation with the following functions: 1) it calculates rough parameter values based on an analytical model, 2) generates SPICE and gate-level Verilog netlists with given parameter values, 3) runs SPICE simulations and analyzes the waveforms to examine the oscillation frequency or the voltage of specified nodes at a given time, 4) changes the parameter values in an appropriate direction depending on the waveform analyses to obtain optimized parameter values, 5) generates scripts for commercial design tools and invokes the tools with the gate-level Verilog netlist to obtain the final LVS/DRC-verified GDSII data from the P&R and verification tools, and finally 6) generates the necessary characteristic summary sheets from post-layout SPICE simulations extracted from the GDSII. Our compiler was applied to a 0.18 µm standard CMOS technology to design a PLL with a 600 MHz output and 600/16 MHz input frequency, and it confirmed the PLL operation with 1.2 mW power consumption and an 85 µm × 85 µm layout area.
Yusuke HIBINO Hirofumi IKEO Nagisa ISHIURA
This letter presents a test suite, CF3, designed to find bugs in the arithmetic optimizers of C compilers. It consists of 13,720 test programs whose expression patterns cover all permutations of 3 operators chosen from 14 operators. CF3 detected more than 70 errors in GCC 4.2-4.5 within 2 hours.
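A hypothetical sketch of how such operator-permutation patterns could be enumerated (the actual CF3 generator, operator set, and expression shapes are not reproduced here):

```python
# Hypothetical sketch of enumerating operator-permutation test expressions
# (the actual CF3 generator and operator set may differ).
from itertools import permutations

OPERATORS = ["+", "-", "*", "/", "%", "<<", ">>",
             "&", "|", "^", "<", ">", "==", "!="]   # assumed 14-operator set

def emit_test(index, ops):
    """Emit one small C test function for a 3-operator expression pattern."""
    a, b, c = ops
    expr = f"((x {a} y) {b} z) {c} w"
    return (f"/* test {index}: {expr} */\n"
            "int f(int x, int y, int z, int w)\n"
            f"{{ return {expr}; }}\n")

tests = [emit_test(i, ops) for i, ops in enumerate(permutations(OPERATORS, 3))]
print(len(tests))        # 14 * 13 * 12 = 2184 ordered operator triples
print(tests[0])
```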
Hideki TAKASE Gang ZENG Lovic GAUTHIER Hirotaka KAWASHIMA Noritoshi ATSUMI Tomohiro TATEMATSU Yoshitake KOBAYASHI Takenori KOSHIRO Tohru ISHIHARA Hiroyuki TOMIYAMA Hiroaki TAKADA
This paper presents a framework for reducing the energy consumption of embedded real-time systems. We implemented the framework as both an optimization toolchain and an energy-aware real-time operating system. The framework integrates multiple techniques to optimize energy consumption. The main idea behind our approach is to utilize trade-offs between the energy consumption and the performance of different processor configurations at task checkpoints, and to maintain memory allocation during task context switches. In our framework, a target application is statically analyzed at both the intra-task and inter-task levels. Based on these analysis results, runtime optimization is performed in response to the behavior of the application. A case study shows that our toolchain and real-time operating system achieve energy reduction while satisfying real-time performance. The toolchain has also been successfully applied to a practical application.
Agus BEJO Dongju LI Tsuyoshi ISSHIKI Hiroaki KUNIEDA
This paper first presents a processor design based on a Derivative ASIP approach. The processor architecture is designed using a well-known embedded processor's instruction set as a base architecture. To improve its performance, the architecture is enhanced with additional hardware resources such as registers, interfaces, and instruction extensions so as to achieve the target specifications. Second, a new approach to compiler retargeting by means of an assembly converter tool is proposed. Our retargeting approach is practical because it is performed by the assembly converter tool with a simple configuration file and is independent of the base compiler. With our proposed approach, both architectural flexibility and high-quality assembly code can be obtained at once. Compared to other compilers, experiments show that our approach is capable of generating code as efficient as that of its base compiler, and that the developed ASIP achieves better performance than its base processor.
Hiroshi NAKAMURA Weihan WANG Yuya OHTA Kimiyoshi USAMI Hideharu AMANO Masaaki KONDO Mitaro NAMIKI
Power consumption has recently emerged as a first-class design constraint in system LSI designs. In particular, leakage power now occupies a large part of the total power consumption, so reducing leakage power is indispensable for the efficient design of high-performance system LSIs. Since 2006, we have carried out a research project called “Innovative Power Control for Ultra Low-Power and High-Performance System LSIs”, supported by the Japan Science and Technology Agency as a CREST research program. One of the major objectives of this project is to reduce the leakage power consumption of system LSIs through innovative power control, with tight cooperation and co-optimization of circuit technology, architecture, and system software designs. In this project, we focused on power gating as a circuit technique for reducing leakage power. Temporal granularity is one of the most important issues in power gating. Thus, we have developed a series of Geysers as proof-of-concept CPUs that provide several mechanisms for fine-grained run-time power gating. In this paper, we describe their concept and design, and explain why co-optimization of different design layers is important. Then, three kinds of power gating implementations and their evaluations are presented from the viewpoint of power saving and temporal granularity.
Kyongjin JO Seon Wook KIM Jong-Kook KIM
The size and complexity of software in computer systems and even in consumer electronics are drastically and continuously increasing, and compilation times are increasing with them. For example, the compilation of some mobile phones' platform software takes several hours. To reduce compilation time, this paper proposes a Distributed Scalable Compilation Tool, called DiSCo, in which full compilation passes such as preprocessing, compilation, and even linking are performed on remote machines, i.e., in parallel. To the best of our knowledge, DiSCo is the first distributed compiler to support complete distributed processing in all compilation passes. We use an extensive dependency analysis when parsing compilation commands to exploit higher command-level parallelism, and we apply a file-caching method and a network-drive protocol to reduce the remote compilation overhead and simplify the implementation. Lastly, we minimize load imbalance and remote-machine management overhead with a heuristic static scheduling method that predicts compilation times and accounts for the overheads incurred by the compilation process. Our evaluation using four large mobile applications and eight GNU applications shows that DiSCo scales well and that its performance is close to that of profile-based scheduling.
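A minimal sketch of static scheduling by predicted compilation time (a simple longest-job-first assignment with an assumed fixed per-job overhead; the actual DiSCo scheduler models the compilation-process overheads in more detail):

```python
# Minimal sketch of heuristic static scheduling of compile jobs to remote
# machines by predicted compilation time (LPT-style assignment; illustration
# only, not the DiSCo scheduler itself).
import heapq

def schedule(jobs, n_machines, overhead=0.1):
    """jobs: list of (source_file, predicted_seconds). Returns per-machine assignments."""
    machines = [(0.0, m, []) for m in range(n_machines)]   # (load, id, assigned files)
    heapq.heapify(machines)
    for src, t in sorted(jobs, key=lambda j: j[1], reverse=True):
        load, m, assigned = heapq.heappop(machines)        # least-loaded machine
        assigned.append(src)
        heapq.heappush(machines, (load + t + overhead, m, assigned))
    return sorted(machines, key=lambda x: x[1])

jobs = [("a.c", 4.0), ("b.c", 1.5), ("c.c", 3.2), ("d.c", 0.7), ("e.c", 2.1)]
for load, m, files in schedule(jobs, n_machines=2):
    print(f"machine {m}: {files} (predicted {load:.1f}s)")
```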
Hideki TAKASE Hiroyuki TOMIYAMA Hiroaki TAKADA
Energy minimization has become one of the primary goals in the embedded real-time domain. Consequently, scratch-pad memory has been employed as a partial or entire replacement for cache memory due to its better energy efficiency. However, most previous approaches were not applicable to a preemptive multi-task environment. We propose three methods of partitioning and allocating scratch-pad memory for fixed-priority-based preemptive multi-task systems. The three methods, i.e., the spatial, temporal, and hybrid methods, achieve energy reduction in the instruction memory subsystem. With the spatial method, each task occupies its own exclusive space in scratch-pad memory. With the temporal method, the running task uses the entire scratch-pad space, and the content of scratch-pad memory is swapped out as a task executes or gets preempted. The hybrid method is based on the spatial one, but a higher-priority task can temporarily use the space of a lower-priority task; the amount of space is prioritized for higher-priority tasks. We formulate each method as an integer programming problem that simultaneously determines (1) the partitioning of scratch-pad memory space among the tasks, and (2) the allocation of program code to the scratch-pad memory space of each task. Our methods not only support real-time task scheduling but also aggressively exploit the periods and priorities of tasks for energy minimization. Additionally, we implement an RTOS-hardware cooperative support mechanism for runtime code allocation to the scratch-pad memory space. We conducted experiments with a fully functional real-time operating system. The experimental results demonstrate the effectiveness of our techniques: up to 73% energy reduction compared to a conventional method was achieved.
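A toy sketch of the spatial method's integer-programming formulation (hypothetical sizes and energy values, modeled with the PuLP library; the paper's formulation additionally accounts for task periods and priorities):

```python
# Toy sketch of a spatial-partitioning ILP (hypothetical data; illustration
# only): place code blocks of each task into a shared scratch-pad so that the
# total instruction-fetch energy is minimized under a capacity constraint.
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary, PULP_CBC_CMD

SPM_SIZE = 4096  # bytes available in scratch-pad memory
# (task, code block) -> (code size in bytes, energy saved if placed in SPM)
blocks = {
    ("t0", "b0"): (1024, 9.0), ("t0", "b1"): (2048, 4.0),
    ("t1", "b0"): (1536, 7.5), ("t1", "b1"): (512, 3.0),
}

x = {k: LpVariable(f"x_{k[0]}_{k[1]}", cat=LpBinary) for k in blocks}
prob = LpProblem("spm_partitioning", LpMinimize)
# Minimize the energy of blocks left outside the SPM
# (equivalently, maximize the energy saved by SPM placement).
prob += lpSum((1 - x[k]) * blocks[k][1] for k in blocks)
# All tasks share one scratch-pad: their exclusive partitions must fit together.
prob += lpSum(x[k] * blocks[k][0] for k in blocks) <= SPM_SIZE

prob.solve(PULP_CBC_CMD(msg=False))
print({k: int(x[k].value()) for k in blocks})
```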
Min ZHU Leibo LIU Shouyi YIN Chongyong YIN Shaojun WEI
This paper introduces a cycle-accurate Simulator for a dynamically REconfigurable MUlti-media System, called SimREMUS. SimREMUS can be used either at the transaction level, which allows the modeling and simulation of higher-level hardware and embedded software, or at the register-transfer level, if the dynamic system behavior is to be observed at the signal level. Trade-offs among a set of criteria that are frequently used to characterize the design of a reconfigurable computing system, such as granularity, programmability, and configurability, as well as the architecture of processing elements and routing modules, can be quickly evaluated. Moreover, a complete tool chain for SimREMUS, including a compiler and a debugger, has been developed. SimREMUS simulates 270 k cycles per second for a million-gate SoC (System-on-a-Chip) and produced one H.264 1080p frame in 15 minutes, a task that would take days on VCS (platform: E5200 CPU @ 2.5 GHz, 2.0 GB RAM). Simulation showed that 1080p@30 fps H.264 High Profile @ Level 4 can be achieved at a 200 MHz working frequency on the VLSI architecture of REMUS.
Takuji HIEDA Hiroaki TANAKA Keishi SAKANUSHI Yoshinori TAKEUCHI Masaharu IMAI
Partial forwarding is a design method that places forwarding paths on only part of a processor pipeline. The hardware cost of a processor can be reduced without performance loss by partial forwarding. However, a compiler whose instruction scheduler takes the partial forwarding structure of the target processor into account is required, since conventional scheduling algorithms cannot make the most of a partial forwarding structure. In this paper, we propose a heuristic instruction scheduling method for processors with partial forwarding structures. The proposed algorithm uses the available distance to schedule instructions suited to the target partial-forwarding processor. Experimental results show that the proposed method generates near-optimal solutions in practical time and that some of the codes optimized for a partial-forwarding processor run in the shortest time among the target processors. They also show that the proposed method is superior to a hazard detection unit.
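A simplified sketch of why forwarding-aware scheduling matters (hypothetical latencies and forwarding paths; this is plain ASAP list scheduling, not the paper's available-distance algorithm): the producer-to-consumer latency depends on whether a forwarding path connects the two functional units.

```python
# Simplified sketch of scheduling on a partially forwarded pipeline
# (hypothetical latencies; illustration only). If the producer's functional
# unit forwards to the consumer's unit, the dependent instruction can issue
# earlier; otherwise it waits for the register-file writeback.

FORWARD_LATENCY = 1      # cycles when a forwarding path exists
WRITEBACK_LATENCY = 3    # cycles when the value must go through the register file
forwarding_paths = {("ALU", "ALU"), ("ALU", "MEM")}   # assumed partial forwarding

def schedule(instrs, deps):
    """instrs: {name: functional unit}, in topological order.
    deps: list of (producer, consumer). Returns greedy ASAP issue cycles."""
    issue = {}
    for name in instrs:
        earliest = 0
        for prod, cons in deps:
            if cons != name:
                continue
            path = (instrs[prod], instrs[cons])
            lat = FORWARD_LATENCY if path in forwarding_paths else WRITEBACK_LATENCY
            earliest = max(earliest, issue[prod] + lat)
        issue[name] = earliest
    return issue

instrs = {"i0": "ALU", "i1": "MEM", "i2": "ALU", "i3": "MUL"}
deps = [("i0", "i2"), ("i1", "i3")]
print(schedule(instrs, deps))   # i2 issues early (forwarded); i3 waits for writeback
```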
Vasutan TUNBUNHENG Hideharu AMANO
To develop a design environment for various Dynamically Reconfigurable Processor Arrays (DRPAs), the Graph with Configuration Information (GCI) is proposed to represent the configurable resources in the target dynamically reconfigurable architecture. Functional units, constant units, registers, and routing resources can be represented in the graph, as well as the configuration information. Hardware restrictions are also captured in the graph by limiting the possible configurations at a node controlled by another node. A prototype compiler called Black-Diamond, based on the GCI, is now available for three different DRPAs. It translates a data-flow graph from a C-like front-end description, applies placement and routing using the GCI, and generates configuration data for each element of the DRPA. Evaluation results for simple applications show that Black-Diamond can generate reasonable designs for all three architectures. Other target architectures can easily be handled by representing their architectural properties in a GCI.
Youngsun HAN Seok Joong HWANG Seon Wook KIM
In this paper, we present a reconfigurable processor infrastructure for accelerating Java applications, called Jaguar. The Jaguar infrastructure consists of a compiler framework and runtime environment support. The compiler framework selects a group of Java methods to be translated into hardware so as to deliver the best performance under limited resources, and translates the selected Java methods into synthesizable Verilog code modules. The runtime environment support includes the Java virtual machine (JVM) running on a host processor, which provides a Java execution environment to the generated Java accelerator through communication interface units while preserving Java semantics. Our compiler infrastructure is a tightly integrated and solid compiler-aided solution for Java reconfigurable computing: there is no limitation in generating synthesizable Verilog modules from any Java application while preserving Java semantics. In terms of performance, our infrastructure achieves a speedup of 5.4 times on average and up to 9.4 times on the measured benchmarks with respect to JVM-only execution. Furthermore, two optimization schemes, instruction folding and live buffer removal, reduce resource consumption by 24% on average and by up to 39%.
Hiroaki SHIKANO Jun SHIRAKO Yasutaka WADA Keiji KIMURA Hironori KASAHARA
A power-aware, compiler-controllable chip multiprocessor (CMP) is presented, and its performance and power consumption are evaluated with the optimally scheduled advanced multiprocessor (OSCAR) parallelizing compiler. The CMP is equipped with power control registers that change the clock frequency and power supply voltage of functional units including processor cores, memories, and an interconnection network. The OSCAR compiler carries out coarse-grain task parallelization of programs and reduces power consumption using the architectural power control support and the compiler's power saving scheme. The performance evaluation shows that MPEG-2 encoding on the proposed CMP with four CPUs results in 82.6% power reduction in real-time execution mode with a deadline constraint set to its sequential execution time. Furthermore, MP3 encoding on a heterogeneous CMP with four CPUs and four accelerators results in 53.9% power reduction at a 21.1-fold speed-up over its sequential execution in the fastest execution mode.
Yuki KOBAYASHI Murali JAYAPALA Praveen RAGHAVAN Francky CATTHOOR Masaharu IMAI
Clustering L0 buffers is effective for energy reduction in the instruction caches of embedded VLIW processors. However, the efficiency of the clustering depends on the schedule of the target application. To improve the energy efficiency of L0 clusters, operation shuffling has been proposed: it explores the assignment of operations in each cycle, generates various schedules, and evaluates them to find an energy-efficient schedule. This approach can find energy-efficient schedules; however, it takes a long time to obtain the final result. In this paper, we propose a new method that directly generates an energy-efficient schedule without iterations of operation shuffling. In the proposed method, a compiler schedules operations using the result of a single operation shuffling as a constraint. We propose several optimization algorithms to generate an energy-efficient schedule for a given L0 cluster organization. The proposed method drastically reduces the computational effort since it performs operation shuffling only once. The experimental results show that the proposed method achieves comparable energy reduction while reducing the computational effort significantly compared to conventional operation shuffling.
Takefumi MIYOSHI Nobuhiko SUGINO
For a coarse-grained, dynamically reconfigurable processing unit cooperating with a general-purpose processor, a context selection method that can reduce the total execution cycles of a given program is proposed. The method evaluates context candidates from a given program in terms of the reduction in cycles obtained by exploiting the parallel and pipelined execution of the reconfigurable processor. According to this evaluation measure, the method selects appropriate contexts for the dynamically reconfigurable processing unit. The proposed method is implemented within the framework of the COINS project. For several example programs, the generated codes are evaluated by a software simulator in terms of execution cycles, and the results demonstrate the effectiveness of the proposed method.
Hiroaki TANAKA Yoshinori TAKEUCHI Keishi SAKANUSHI Masaharu IMAI Hiroki TAGAWA Yutaka OTA Nobu MATSUMOTO
SIMD instructions are often implemented in modern multimedia-oriented processors. Although SIMD instructions are useful for many digital signal processing applications, most compilers do not exploit them. The difficulty in utilizing SIMD instructions stems from the data parallelism in registers: in assembly code generation, the positions of data within registers must be tracked. A technique for generating pack instructions, which pack or reorder data in registers, is therefore essential for exploiting SIMD instructions. This paper presents a code generation technique for SIMD instructions with pack instructions. SIMD instructions are generated by finding and grouping identical operations in programs. After the SIMD instruction generation, pack instructions are generated. For the pack instruction generation, a Multi-valued Decision Diagram (MDD) is introduced to represent and manipulate sets of packed data. Experimental results show that the proposed code generation technique can generate assembly code with SIMD and pack instructions that repack 8 packed data elements in registers, for a RISC processor with a dual-issue coprocessor supporting SIMD and pack instructions. The proposed method achieved speedup ratios of up to about 8.5 through SIMD instructions and the multiple-issue mechanism of the target processor.
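A toy sketch of the grouping and packing steps (two-way SIMD assumed; the MDD-based pack-instruction generation itself is not reproduced here): identical scalar operations are paired into SIMD operations, and a pack instruction is emitted only when the operands are not already laid out in the required register lanes.

```python
# Toy sketch of SIMD code generation by grouping identical scalar operations
# (two-way SIMD assumed; illustration only). A pack instruction is emitted
# only when operands are not already in the lanes a SIMD instruction expects.

# Scalar ops: (dest, opcode, src1, src2)
scalar_ops = [
    ("t0", "add", "a0", "b0"),
    ("t1", "add", "a1", "b1"),
    ("t2", "mul", "t0", "c0"),
    ("t3", "mul", "t1", "c1"),
]

lanes = {}          # value name -> (simd_register, lane)
code = []
reg_count = 0

def operands_register(v_lo, v_hi):
    """Return a register holding v_lo in lane 0 and v_hi in lane 1, emitting
    a pack instruction only when that layout does not already exist."""
    global reg_count
    r_lo, r_hi = lanes.get(v_lo), lanes.get(v_hi)
    if r_lo and r_hi and r_lo[0] == r_hi[0] and (r_lo[1], r_hi[1]) == (0, 1):
        return r_lo[0]                        # already packed in the right lanes
    reg = f"v{reg_count}"; reg_count += 1
    code.append(f"pack {reg} <- {v_lo}, {v_hi}")
    lanes[v_lo], lanes[v_hi] = (reg, 0), (reg, 1)
    return reg

# Pair consecutive scalar ops with the same opcode into one two-way SIMD op.
for (d0, op0, a0, b0), (d1, op1, a1, b1) in zip(scalar_ops[::2], scalar_ops[1::2]):
    assert op0 == op1, "sketch assumes adjacent ops share an opcode"
    rs1 = operands_register(a0, a1)
    rs2 = operands_register(b0, b1)
    rd = f"v{reg_count}"; reg_count += 1      # fresh register for the paired results
    lanes[d0], lanes[d1] = (rd, 0), (rd, 1)
    code.append(f"v{op0} {rd} <- {rs1}, {rs2}")

print("\n".join(code))   # the mul pair reuses the add results without a new pack
```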
Masahiko OMURA Toshiki KANAMOTO Michiko TSUKAMOTO Mitsutoshi SHIROTA Takashi NAKAJIMA Masayuki TERAI
This paper proposes a new, efficient method of characterizing a memory compiler in order to reduce computation time and remove human error. The new features that make our method highly efficient are the following three points: (1) high-speed circuit simulation of the whole memory module using a hierarchical LPE (Layout Parasitic Extractor) and a hierarchical circuit simulator, (2) automatic generation of circuit-simulation input data from a corresponding parameterized description termed the template file, and (3) carefully selected environmental conditions for the circuit-level simulator that minimize the number of its runs. We demonstrate the effectiveness of the proposed method by applying it to single-port SRAM generators using 90 nm CMOS technology.