Chenxu WANG Hideki KAWAGUCHI Kota WATANABE
An approach to dedicated computers is discussed in this study as a possibility for portable, low-cost, and low-power consumption high-performance computing technologies. Particularly, dataflow architecture dedicated computer of the finite integration technique (FIT) for 2D magnetostatic field simulation is considered for use in industrial applications. The dataflow architecture circuit of the BiCG-Stab matrix solver of the FIT matrix calculation is designed by the very high-speed integrated circuit hardware description language (VHDL). The operation of the dedicated computer's designed circuit is considered by VHDL logic circuit simulation.
Leibo LIU Dong WANG Yingjie CHEN Min ZHU Shouyi YIN Shaojun WEI
This paper presents the design of a multiple-standard 1080 high definition (HD) video decoder on a mixed-grained reconfigurable computing platform integrating coarse-grained reconfigurable processing units (RPUs) and FPGAs. The proposed RPU, including 16×16 multi-functional processing elements (PEs), is used to accelerate compute-intensive tasks in the video decoding. A soft-core-based microprocessor array is implemented on the FPGA and adopted to speed-up the dynamic reconfiguration of the RPU. Furthermore, a mail-box-based communication scheme is utilized to improve the communication efficiency between RPUs and FPGAs. By exploiting dynamic reconfiguration of the RPUs and static reconfiguration of the FPGAs, the proposed platform achieves scalable performances and cost trade-offs to support a variety of video coding standards, including MPEG-2, AVS, H.264, and HEVC. The measured results show that the proposed platform can support H.264 1080 HD video streams at up to 57 frames per second (fps) and HEVC 1080 HD video streams at up to 52fps under 250MHz, at the same time, it achieves a 3.6× performance gain over an industrial coarse-grained reconfigurable processor for H.264 decoding, and a 6.43× performance boosts over a general purpose processor based implementation for HEVC decoding.
Atef IBRAHIM Hamed ELSIMARY Abdullah ALJUMAH
This paper presents novel reconfigurable semi-systolic array architecture for the Smith-Waterman with an affine gap penalty algorithm to align protein sequences optimized for shorter database sequences. This architecture has been modified to enable hardware reuse rather than replicating processing elements of the semi-systolic array in multiple FPGAs. The proposed hardware architecture and the previously published conventional one are described at the Register Transfer Level (RTL) using VHDL language and implemented using the FPGA technology. The results show that the proposed design has significant higher normalized speedup (up to 125%) over the conventional one for query sequence lengths less than 512 residues. According to the UniProtKB/TrEMBL protein database (release 2015_05) statistics, the largest number of sequences (about 80%) have sequence length less than 512 residues that makes the proposed design outperforms the conventional one in terms of speed and area in this sequence lengths range.
Yu PENG Shouyi YIN Leibo LIU Shaojun WEI
Coarse-grained Reconfigurable Architecture (CGRA) is a promising mobile computing platform that provides both high performance and high energy efficiency. In an application, loop nests are usually mapped onto CGRA for further acceleration, so optimizing the mapping is an important goal for design of CGRAs. Moreover, obviously almost all of mobile devices are powered by batteries, how to reduce energy consumption also becomes one of primary concerns in using CGRAs. This paper makes three contributions: a) Proposing an energy consumption model for CGRA; b) Formulating loop nests mapping problem to minimize the battery charge loss; c) Extract an efficient heuristic algorithm called BPMap. Experiment results on most kernels of the benchmarks and real-life applications show that our methods can improve the performance of the kernels and lower the energy consumption.
Shouyi YIN Rui SHI Leibo LIU Shaojun WEI
Coarse-grained Reconfigurable Architecture (CGRA) is a parallel computing platform that provides both high performance of hardware and high flexibility of software. It is becoming a promising platform for embedded and mobile applications. Since the embedded and mobile devices are usually battery-powered, improving battery lifetime becomes one of the primary design issues in using CGRAs. In this paper, we propose a battery-aware task-mapping method to optimize energy consumption and improve battery lifetime. The proposed method mainly addresses two problems: task partitioning and task scheduling when mapping applications onto CGRA. The task partitioning and scheduling are formulated as a joint optimization problem of minimizing the energy consumption. The nonlinear effects of real battery are taken into account in problem formulation. Using the insights from the problem formulation, we design the task-mapping algorithm. We have used several real-world benchmarks to test the effectiveness of the proposed method. Experiment results show that our method can dramatically lower the energy consumption and prolong the battery-life.
Shouyi YIN Dajiang LIU Leibo LIU Shaojun WEI
A coarse-grained reconfigurable architecture (CGRA) is typically hybrid architecture, which is composed of a reconfigurable processing unit (RPU) and a host microprocessor. Many computation-intensive kernels (e.g., loop nests) are often mapped onto RPUs to speed up the execution of programs. Thus, mapping optimization of loop nests is very important to improve the performance of CGRA. Processing element (PE) utilization rate, communication volume and reconfiguration cost are three crucial factors for the performance of RPUs. Loop transformations can affect these three performance influencing factors greatly, and would be of much significance when mapping loops onto RPUs. In this paper, a joint loop transformation approach for RPUs is proposed, where the PE utilization rate, communication cost and reconfiguration cost are under a joint consideration. Our approach could be integrated into compilers for CGRAs to improve the operating performance. Compared with the communication-minimal approach, experimental results show that our scheme can improve 5.8% and 13.6% of execution time on motion estimation (ME) and partial differential equation (PDE) solvers kernels, respectively. Also, run-time complexity is acceptable for the practical cases.
Cyclic Redundancy Check (CRC) is a well known error detection scheme used to detect corruption of digital content in digital networks and storage devices. Since it is a compute-intensive process which adversely affects performance, hardware acceleration using FPGAs has been tried and satisfactory performance has been achieved. However, recent extended usage of networks and storage systems require various correction capabilities for various CRC standards. Traditional hardware designs based on the LFSR (Linear Feedback Shift Register) tend to have fixed structure without such flexibility. Here, fully-adaptable CRC accelerator based on a table-based algorithm is proposed. The table-based algorithm is a flexible method commonly used in software implementations. It has been rarely implemented with the hardware, since it is believed that the operational speed is not enough. However, by using pipelined structure and efficient use of memory modules in FPGAs, it appeared that the table-based fixed CRC accelerators achieved better performance than traditional implementation. Based on the implementation, fully-adaptable CRC accelerator which eliminate the need for many non-adaptable CRC implementations is proposed. The accelerator has ability to process arbitrary number of input data and generates CRC for any known CRC standard, up to 65 bits of generator polynomial, during run-time. Further, we modify Table generation algorithm in order to decrease its space complexity from O(nm) to O(n). On Xilinx Virtex 6 LX550T board, the fully-adaptable accelerators occupy between 1 to 2% area to produce maximum of 289.8 Gbps at 283.1 MHz if BRAM is deployed, or between 1.6 - 14% of area for 418 Gbps at 408.9 MHz if tables are implemented in logic. Proposed architecture enables further expansion of throughput by increasing a number of input bits M processed at a time.
Hung K. NGUYEN Peng CAO Xue-Xiang WANG Jun YANG Longxing SHI Min ZHU Leibo LIU Shaojun WEI
REMUS-II (REconfigurable MUltimedia System 2) is a coarse-grained dynamically reconfigurable computing system for multimedia and communication baseband processing. This paper proposes a real-time H.264 baseline profile encoder on REMUS-II. First, we propose an overall mapping flow for mapping algorithms onto the platform of REMUS-II system and then illustrate it by implementing the H.264 encoder. Second, parallel and pipelining techniques are considered for fully exploiting the abundant computing resources of REMUS-II, thus increasing total computing throughput and solving high computational complexity of H.264 encoder. Besides, some data-reuse schemes are also used to increase data-reuse ratio and therefore reduce the required data bandwidth. Third, we propose a scheduling scheme to manage run-time reconfiguration of the system. The scheduling is also responsible for synchronizing the data communication between tasks and handling conflict between hardware resources. Experimental results prove that the REMUS-MB (REMUS-II version for mobile applications) system can perform a real-time H.264/AVC baseline profile encoder. The encoder can encode CIF@30 fps video sequences with two reference frames and maximum search range of [-16,15]. The implementation, thereby, can be applied to handheld devices targeted at mobile multimedia applications. The platform of REMUS-MB system is designed and synthesized by using TSMC 65 nm low power technology. The die size of REMUS-MB is 13.97 mm2. REMUS-MB consumes, on average, about 100 mW while working at 166 MHz. To my knowledge, in the literature this is the first implementation of H.264 encoding algorithm on a coarse-grained dynamically reconfigurable computing system.
Dajiang LIU Shouyi YIN Chongyong YIN Leibo LIU Shaojun WEI
Reconfigurable computing system is a class of parallel architecture with the ability of computing in hardware to increase performance, while remaining much of flexibility of a software solution. This architecture is particularly suitable for running regular and compute-intensive tasks, nevertheless, most compute-intensive tasks spend most of their running time in nested loops. Polyhedron model is a powerful tool to give a reasonable transformation on such nested loops. In this paper, a number of issues are addressed towards the goal of optimization of affine loop nests for reconfigurable cell array (RCA), such as approach to make the most use of processing elements (PE) while minimizing the communication volume by loop transformation in polyhedron model, determination of tilling form by the intra-statement dependence analysis and determination of tilling size by the tilling form and the RCA size. Experimental results on a number of kernels demonstrate the effectiveness of the mapping optimization approaches developed. Compared with DFG-based optimization approach, the execution performances of 1-d jacobi and matrix multiplication are improved by 28% and 48.47%. Lastly, the run-time complexity is acceptable for the practical cases.
Antoine TROUVE Kazuaki MURAKAMI
This article introduces some improvements to the already proposed custom instruction candidates selection for the automatic ISA customisation problem targeting reconfigurable processors. It introduces new opportunities to prune the search space, and a technique based on dynamic programming to check the independence between groups. The proposed new algorithm yields one order less measured number of convexity checks than the related work for the same inputs and outputs.
Shouyi YIN Chongyong YIN Leibo LIU Min ZHU Shaojun WEI
Coarse-grained reconfigurable architecture (CGRA) combines the performance of application-specific integrated circuits (ASICs) and the flexibility of general-purpose processors (GPPs), which is a promising solution for embedded systems. With the increasing complexity of reconfigurable resources (processing elements, routing cells, I/O blocks, etc.), the reconfiguration cost is becoming the performance bottleneck. The major reconfiguration cost comes from the frequent memory-read/write operations for transferring the configuration context from main memory to context buffer. To improve the overall performance, it is critical to reduce the amount of configuration context. In this paper, we propose a configuration context reduction method for CGRA. The proposed method exploits the structure correlation of computation tasks that are mapped onto CGRA and reduce the redundancies in configuration context. Experimental results show that the proposed method can averagely reduce the configuration context size up to 71% and speed up the execution up to 68%. The proposed method does not depend on any architectural feature and can be applied to CGRA with an arbitrary architecture.
Chongyong YIN Shouyi YIN Leibo LIU Shaojun WEI
Compiler is the most important supporting tool to facilitate the use of reconfigurable computing architecture (RCA). In this paper, a template-based compiler framework is proposed. This compiler can synthesize the executables for RCA from native high-level programming language source code directly. It supports to generate run-time dynamic configuration context. And it is capable to generate both full configuration context and partial configuration context. Experimental results show that the executables generated by the proposed compiler can achieve better execution performance and smaller configuration context size than previous compilers. Moreover, this compiler does not require the programmer to have any extra knowledge about the hardware architecture of RCA.
Panan POTIPANTONG Phaophak SIRISUK Soontorn ORAINTARA Apisak WORAPISHET
This paper presents an FPGA implementation of highly modular universal discrete transforms. The implementation relies upon the unified discrete Fourier Hartley transform (UDFHT), based on which essential sinusoidal transforms including discrete Fourier transform (DFT), discrete Hartley transform (DHT), discrete cosine transform (DCT) and discrete sine transform (DST) can be realized. It employs a reconfigurable, scalable and modular architecture that consists of a memory-based FFT processor equipped with pre- and post-processing units. Besides, a pipelining technique is exploited to seamlessly harmonize the operation between each sub-module. Experimental results based on Xilinx Virtex-II Pro are given to examine the performance of the proposed UDFHT implementation. Two practical applications are also shown to demonstrate the flexibility and modularity of the proposed work.
Mitsuru TOMONO Masaki NAKANISHI Shigeru YAMASHITA Kazuo NAKAJIMA Katsumasa WATANABE
In a partially reconfigurable FPGA of the future, arbitrary portions of its logic resources and interconnection networks will be reconfigured without affecting the other parts. Multiple tasks will be mapped and executed concurrently in such an FPGA. Efficient execution of the tasks using the limited resources of the FPGA will necessitate effective resource management. A number of online FPGA placement methods have recently been proposed for such an FPGA. However, they cannot handle I/O communications of the tasks. Taking such I/O communications into consideration, we introduce a new approach to online FPGA placement. We present an algorithm for placing each arriving task in an empty area so as to complete all the tasks efficiently. We develop two fitting strategies to effectively handle I/O communications of the tasks. Our experimental results show that properly weighted combinations of these and two other previously proposed strategies enable this algorithm to run very fast and make an effective placement of the tasks. In fact, we show that the overhead associated with the use of this algorithm is negligible as compared to the total execution time of the tasks.
Yeong-Kang LAI Lien-Fei CHEN Jian-Chou CHEN Chun-Wei CHIU
In this paper, a novel cost effective interconnection network for two-way pipelined SIMD-based reconfigurable computing processor is proposed. Our reconfigurable computing engine is composed of the SIMD-based function units, flexible interconnection networks, and two-bank on-chip memories. In order to connect the function units, the reconfigurable network is proposed to connect all neighbors of each function unit. The proposed interconnection network is a kind of full and bidirectional connection with the data duplication to perform the data-parallelism applications efficiently. Moreover, it is a multistage network to accomplish the high flexibility and low hardware cost.
Kenji KUDO Yoshihiro MYOKAN Winh Chan THAN Shinji AKIMOTO Takashi KANAMARU Masatoshi SEKINE
To realize the hardware object which facilitates the application development in the reconfigurable computing system, a hardware module (HwModule) is proposed and implemented. To access the circuit in the HwModule from the standard PC without detailed knowledge of the hardware, an object manager (ObjectManager) is also implemented. With the help of the ObjectManager, the programmers can use the hardware objects like the usual software objects. The HwModule is applied to the image matching, and the easiness of the application development for the HwModule is confirmed.
Katsumi TAKAHASHI Hiroai ASAMI Katsuto NAKAJIMA Masahiro IIDA
We designed an FPGA-based parallel machine called "RASH"(Reconfigurable Architecture based on Scalable Hardware) for high speed and flexible signal/data processing. Cryptanalysis is one of the killer applications for FPGA-based machines because huge amounts of logical and/or simple arithmetic operations are required and FPGA is suitable for this. One of the well-known activities in cryptanalysis is the DES (Data Encryption Standard) cracking contest conducted by RSA Data Security. TMTO (Time-Memory Trade-Off) Cryptanalysis is a practical method to dramatically shorten the time for key search when plaintext is given in advance. A string of ASCII characters is used as the key much like a password. The ASCII character is 7-bit character and is changed to 96 kinds of value. The 56-bit DES key is given with a string of 8 ASCII characters. Although the DES key has 64 trillion(=256) possibilities, the key that is given with a string has only 6.4 trillion(=968) possibilities. Therefore, we improve TMTO cryptanalysis so that we search only the limited key by ASCII characters and reduce the quantity of computation. In this paper, we demonstrate how TMTO cryptanalysis for limited key is well suited to our FPGA-based RASH machine. By limiting the key to a string, DES key will be found at 80% probability within 45 minutes after ciphertext is given on 10 units of RASH. The precomputation before starting key search takes 3 weeks on the same RASH configuration.
Takanori HAYASHIDA Kazuaki MURAKAMI
Online profiling methodologies are studied for exploiting dynamic optimization. On a dynamic optimizable system with online profilers, it has to get accurate profile in early step of the program execution for effective execution. However, for getting more effective profile by online profiling, it has to satisfy "Rapidness" and "Accuracy". They are conflicted requirements. Therefore, it has to choose trade-off point at implementation. We focused into online Hot Instruction Sequence (HIS) profiler to exploit reconfigurable functional units. To circumstantiate the effectiveness of online HIS profiling, we build some evaluation models for experimental evaluation. Our profiler models are SC/DM, SC/FA and JC/DM. These models have different policy of event counting and table lookup. Our event counting policies are simple-counting or jumble-counting. On the other hand, table lookup policies are direct-map or full-associative. In our experimental evaluation, SC/FA and JC/DM models scored higher accuracy than SC/DM. The JC/DM model is able to implement by lower cost for table lookup, but it scored high accuracy comparable to SC/FA.
Jacir L. BORDIM Yasuaki ITO Koji NAKANO
The main contribution of this paper is to present an FPGA-based implementation of an instance-specific hardware which accelerates the CKY (Cocke-Kasami-Younger) parsing for context-free grammars. Given a context-free grammar G and a string x, the CKY parsing determines whether G derives x. We have developed a hardware generator that creates a Verilog HDL source to perform the CKY parsing for any given context-free grammar G. The generated source is embedded in an FPGA using the design software provided by the FPGA vendor. We evaluated the instance-specific hardware, generated by our hardware generator, using a timing analyzer and tested it using the Altera FPGAs. The generated hardware attains a speed-up factor of approximately 750 over the software CKY parsing algorithm.
Hideyuki ITO Ryusuke KONISHI Hiroshi NAKADA Kiyoshi OGURI Minoru INAMORI Akira NAGOYA
This paper describes the realization of a dynamically reconfigurable logic LSI based on a novel parallel computer architecture. The key point of the architecture is its dual-structured cell array which enables dynamic and autonomous reconfiguration of the logic circuits. The LSI was completed by successfully introducing two specific features: fully asynchronous logic circuits and a homogeneous structure, only LUTs are used.