
Keyword Search Result

[Keyword] reconfigurable computing(22hit)

1-20hit(22hit)

  • Study of FIT Dedicated Computer with Dataflow Architecture for High Performance 2-D Magneto-Static Field Simulation

    Chenxu WANG  Hideki KAWAGUCHI  Kota WATANABE  

     
    PAPER

      Publicized:
    2022/08/23
      Vol:
    E106-C No:4
      Page(s):
    136-143

    This study discusses dedicated computers as a possible route to portable, low-cost, low-power high-performance computing. In particular, a dedicated computer with a dataflow architecture for the finite integration technique (FIT) is considered for 2D magnetostatic field simulation in industrial applications. The dataflow circuit of the BiCG-Stab matrix solver for the FIT matrix calculation is designed in the very high-speed integrated circuit hardware description language (VHDL). The operation of the designed circuit is examined by VHDL logic circuit simulation.

  • An Implementation of Multiple-Standard Video Decoder on a Mixed-Grained Reconfigurable Computing Platform

    Leibo LIU  Dong WANG  Yingjie CHEN  Min ZHU  Shouyi YIN  Shaojun WEI  

     
    PAPER-Computer System

      Publicized:
    2016/02/02
      Vol:
    E99-D No:5
      Page(s):
    1285-1295

    This paper presents the design of a multiple-standard 1080 high definition (HD) video decoder on a mixed-grained reconfigurable computing platform integrating coarse-grained reconfigurable processing units (RPUs) and FPGAs. The proposed RPU, including 16×16 multi-functional processing elements (PEs), is used to accelerate compute-intensive tasks in the video decoding. A soft-core-based microprocessor array is implemented on the FPGA and adopted to speed up the dynamic reconfiguration of the RPU. Furthermore, a mail-box-based communication scheme is utilized to improve the communication efficiency between RPUs and FPGAs. By exploiting dynamic reconfiguration of the RPUs and static reconfiguration of the FPGAs, the proposed platform achieves scalable performance and cost trade-offs to support a variety of video coding standards, including MPEG-2, AVS, H.264, and HEVC. The measured results show that the proposed platform can support H.264 1080 HD video streams at up to 57 frames per second (fps) and HEVC 1080 HD video streams at up to 52 fps at 250 MHz. It also achieves a 3.6× performance gain over an industrial coarse-grained reconfigurable processor for H.264 decoding and a 6.43× performance boost over a general-purpose-processor-based implementation for HEVC decoding.

  • Novel Reconfigurable Hardware Accelerator for Protein Sequence Alignment Using Smith-Waterman Algorithm

    Atef IBRAHIM  Hamed ELSIMARY  Abdullah ALJUMAH  

     
    PAPER-Digital Signal Processing

      Vol:
    E99-A No:3
      Page(s):
    683-690

    This paper presents a novel reconfigurable semi-systolic array architecture for the Smith-Waterman algorithm with an affine gap penalty, used to align protein sequences and optimized for shorter database sequences. The architecture has been modified to enable hardware reuse rather than replicating processing elements of the semi-systolic array across multiple FPGAs. The proposed hardware architecture and the previously published conventional one are described at the Register Transfer Level (RTL) in VHDL and implemented using FPGA technology. The results show that the proposed design has a significantly higher normalized speedup (up to 125%) over the conventional one for query sequence lengths of less than 512 residues. According to the UniProtKB/TrEMBL protein database (release 2015_05) statistics, the majority of sequences (about 80%) are shorter than 512 residues, so the proposed design outperforms the conventional one in terms of speed and area in this range of sequence lengths.
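
    For readers unfamiliar with the algorithm named in this abstract, the following is a minimal software sketch of Smith-Waterman local alignment scoring with an affine gap penalty (the Gotoh recurrence). It is not the authors' semi-systolic hardware design. The function name, the simple match/mismatch scoring (a real protein aligner would use a substitution matrix such as BLOSUM62), and the gap costs are all illustrative assumptions.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Return the best local alignment score of strings a and b.

    Assumed scoring (hypothetical, for illustration only):
    gap_open is the score of the first residue in a gap,
    gap_extend the score of every further residue in the same gap.
    """
    neg_inf = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    # H: best score of an alignment ending at (i, j)
    # E: best score ending at (i, j) with a gap that consumes b[j-1]
    # F: best score ending at (i, j) with a gap that consumes a[i-1]
    H = [[0] * cols for _ in range(rows)]
    E = [[neg_inf] * cols for _ in range(rows)]
    F = [[neg_inf] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Either extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            score = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 term is what makes the alignment local.
            H[i][j] = max(0, H[i - 1][j - 1] + score, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

    Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal can be computed in parallel. That dependency pattern is what systolic-array implementations in general exploit.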

  • Battery-Aware Loop Nests Mapping for CGRAs

    Yu PENG  Shouyi YIN  Leibo LIU  Shaojun WEI  

     
    PAPER-Architecture

      Vol:
    E98-D No:2
      Page(s):
    230-242

    Coarse-grained Reconfigurable Architecture (CGRA) is a promising mobile computing platform that provides both high performance and high energy efficiency. In an application, loop nests are usually mapped onto the CGRA for further acceleration, so optimizing the mapping is an important goal in the design of CGRAs. Moreover, because almost all mobile devices are powered by batteries, reducing energy consumption has also become one of the primary concerns in using CGRAs. This paper makes three contributions: a) proposing an energy consumption model for CGRAs; b) formulating the loop nest mapping problem to minimize the battery charge loss; c) deriving an efficient heuristic algorithm called BPMap. Experimental results on most kernels of the benchmarks and real-life applications show that our methods can improve the performance of the kernels and lower the energy consumption.

  • Battery-Aware Task Mapping for Coarse-Grained Reconfigurable Architecture

    Shouyi YIN  Rui SHI  Leibo LIU  Shaojun WEI  

     
    PAPER

      Vol:
    E96-D No:12
      Page(s):
    2524-2535

    Coarse-grained Reconfigurable Architecture (CGRA) is a parallel computing platform that provides both the high performance of hardware and the high flexibility of software. It is becoming a promising platform for embedded and mobile applications. Since embedded and mobile devices are usually battery-powered, improving battery lifetime has become one of the primary design issues in using CGRAs. In this paper, we propose a battery-aware task-mapping method to optimize energy consumption and improve battery lifetime. The proposed method mainly addresses two problems: task partitioning and task scheduling when mapping applications onto a CGRA. Task partitioning and scheduling are formulated as a joint optimization problem of minimizing the energy consumption. The nonlinear effects of real batteries are taken into account in the problem formulation. Using the insights from the problem formulation, we design the task-mapping algorithm. We have used several real-world benchmarks to test the effectiveness of the proposed method. Experimental results show that our method can dramatically lower the energy consumption and prolong the battery life.

  • Affine Transformations for Communication and Reconfiguration Optimization of Mapping Loop Nests on CGRAs

    Shouyi YIN  Dajiang LIU  Leibo LIU  Shaojun WEI  

     
    PAPER-Design Methodology

      Vol:
    E96-D No:8
      Page(s):
    1582-1591

    A coarse-grained reconfigurable architecture (CGRA) is typically a hybrid architecture composed of a reconfigurable processing unit (RPU) and a host microprocessor. Computation-intensive kernels (e.g., loop nests) are often mapped onto RPUs to speed up the execution of programs. Thus, mapping optimization of loop nests is very important for improving the performance of a CGRA. Processing element (PE) utilization rate, communication volume and reconfiguration cost are three crucial factors for the performance of RPUs. Loop transformations can greatly affect these three factors and are therefore significant when mapping loops onto RPUs. In this paper, a joint loop transformation approach for RPUs is proposed, in which the PE utilization rate, communication cost and reconfiguration cost are considered jointly. Our approach could be integrated into compilers for CGRAs to improve the operating performance. Compared with the communication-minimal approach, experimental results show that our scheme improves execution time by 5.8% and 13.6% on motion estimation (ME) and partial differential equation (PDE) solver kernels, respectively. Also, the run-time complexity is acceptable for practical cases.

  • High-Speed Fully-Adaptable CRC Accelerators

    Amila AKAGIC  Hideharu AMANO  

     
    PAPER-Computer System

      Vol:
    E96-D No:6
      Page(s):
    1299-1308

    Cyclic Redundancy Check (CRC) is a well-known error detection scheme used to detect corruption of digital content in digital networks and storage devices. Since it is a compute-intensive process that adversely affects performance, hardware acceleration using FPGAs has been tried and satisfactory performance has been achieved. However, the recent extended usage of networks and storage systems requires various correction capabilities for various CRC standards. Traditional hardware designs based on the LFSR (Linear Feedback Shift Register) tend to have a fixed structure without such flexibility. Here, a fully-adaptable CRC accelerator based on a table-based algorithm is proposed. The table-based algorithm is a flexible method commonly used in software implementations. It has rarely been implemented in hardware, since its operational speed is believed to be insufficient. However, by using a pipelined structure and making efficient use of memory modules in FPGAs, the table-based fixed CRC accelerators were found to achieve better performance than the traditional implementation. Based on this implementation, a fully-adaptable CRC accelerator that eliminates the need for many non-adaptable CRC implementations is proposed. The accelerator can process an arbitrary number of input data and generates the CRC for any known CRC standard, up to 65 bits of generator polynomial, at run-time. Further, we modify the table generation algorithm in order to decrease its space complexity from O(nm) to O(n). On a Xilinx Virtex 6 LX550T board, the fully-adaptable accelerators occupy between 1 and 2% of the area to produce a maximum of 289.8 Gbps at 283.1 MHz if BRAM is deployed, or between 1.6 and 14% of the area for 418 Gbps at 408.9 MHz if tables are implemented in logic. The proposed architecture enables further expansion of throughput by increasing the number of input bits M processed at a time.
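
    The table-based method mentioned in this abstract can be illustrated in software. The sketch below is not the authors' accelerator or their modified table generation algorithm. It is the common byte-at-a-time, table-driven CRC-32 (reflected polynomial 0xEDB88320, the variant used by zlib), and the function names are hypothetical. Passing a different polynomial to the table generator is a rough software analogue of adapting to another 32-bit reflected CRC standard.

```python
import zlib


def make_crc32_table(poly=0xEDB88320):
    """Precompute the CRC of every possible byte value (256 entries)."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            # Shift out one bit; apply the polynomial if the bit was set.
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


CRC32_TABLE = make_crc32_table()


def crc32_table_driven(data, table=CRC32_TABLE):
    """Compute CRC-32 over a bytes object, one table lookup per input byte."""
    crc = 0xFFFFFFFF  # standard initial value for this CRC-32 variant
    for b in data:
        crc = (crc >> 8) ^ table[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF  # standard final XOR


if __name__ == "__main__":
    message = b"123456789"
    # Cross-check against the standard library implementation.
    print(hex(crc32_table_driven(message)), hex(zlib.crc32(message)))
```

    The table replaces eight bit-serial shift steps with one lookup per byte. Processing more input bits per step needs a larger or additional table, which is the kind of throughput-versus-memory trade-off the abstract refers to.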

  • Hardware Software Co-design of H.264 Baseline Encoder on Coarse-Grained Dynamically Reconfigurable Computing System-on-Chip

    Hung K. NGUYEN  Peng CAO  Xue-Xiang WANG  Jun YANG  Longxing SHI  Min ZHU  Leibo LIU  Shaojun WEI  

     
    PAPER-Computer System

      Vol:
    E96-D No:3
      Page(s):
    601-615

    REMUS-II (REconfigurable MUltimedia System 2) is a coarse-grained dynamically reconfigurable computing system for multimedia and communication baseband processing. This paper proposes a real-time H.264 baseline profile encoder on REMUS-II. First, we propose an overall flow for mapping algorithms onto the REMUS-II platform and then illustrate it by implementing the H.264 encoder. Second, parallel and pipelining techniques are considered for fully exploiting the abundant computing resources of REMUS-II, thus increasing total computing throughput and addressing the high computational complexity of the H.264 encoder. In addition, data-reuse schemes are used to increase the data-reuse ratio and therefore reduce the required data bandwidth. Third, we propose a scheduling scheme to manage run-time reconfiguration of the system. The scheduler is also responsible for synchronizing the data communication between tasks and handling conflicts between hardware resources. Experimental results show that the REMUS-MB (REMUS-II version for mobile applications) system can run a real-time H.264/AVC baseline profile encoder. The encoder can encode CIF@30 fps video sequences with two reference frames and a maximum search range of [-16,15]. The implementation can therefore be applied to handheld devices targeted at mobile multimedia applications. The REMUS-MB platform is designed and synthesized using TSMC 65 nm low-power technology. The die size of REMUS-MB is 13.97 mm². REMUS-MB consumes, on average, about 100 mW while working at 166 MHz. To our knowledge, this is the first implementation in the literature of the H.264 encoding algorithm on a coarse-grained dynamically reconfigurable computing system.

  • Mapping Optimization of Affine Loop Nests for Reconfigurable Computing Architecture

    Dajiang LIU  Shouyi YIN  Chongyong YIN  Leibo LIU  Shaojun WEI  

     
    PAPER-Computer Architecture

      Vol:
    E95-D No:12
      Page(s):
    2898-2907

    A reconfigurable computing system is a class of parallel architecture that computes in hardware to increase performance while retaining much of the flexibility of a software solution. This architecture is particularly suitable for running regular, compute-intensive tasks, and most compute-intensive tasks spend most of their running time in nested loops. The polyhedron model is a powerful tool for transforming such nested loops. In this paper, a number of issues are addressed towards the goal of optimizing affine loop nests for a reconfigurable cell array (RCA): how to make the most use of processing elements (PEs) while minimizing the communication volume by loop transformation in the polyhedron model, how to determine the tiling form by intra-statement dependence analysis, and how to determine the tiling size from the tiling form and the RCA size. Experimental results on a number of kernels demonstrate the effectiveness of the mapping optimization approaches developed. Compared with the DFG-based optimization approach, the execution performance of 1-D Jacobi and matrix multiplication is improved by 28% and 48.47%, respectively. Lastly, the run-time complexity is acceptable for practical cases.

  • Optimisations Techniques for the Automatic ISA Customisation Algorithm

    Antoine TROUVE  Kazuaki MURAKAMI  

     
    LETTER-Design Optimisation

      Vol:
    E95-D No:2
      Page(s):
    437-440

    This article introduces some improvements to the previously proposed selection of custom instruction candidates for the automatic ISA customisation problem targeting reconfigurable processors. It introduces new opportunities to prune the search space, and a technique based on dynamic programming to check the independence between groups. For the same inputs and outputs, the proposed algorithm performs an order of magnitude fewer convexity checks, as measured, than the related work.

  • Configuration Context Reduction for Coarse-Grained Reconfigurable Architecture

    Shouyi YIN  Chongyong YIN  Leibo LIU  Min ZHU  Shaojun WEI  

     
    PAPER-Design Methodology

      Vol:
    E95-D No:2
      Page(s):
    335-344

    Coarse-grained reconfigurable architecture (CGRA) combines the performance of application-specific integrated circuits (ASICs) and the flexibility of general-purpose processors (GPPs), which makes it a promising solution for embedded systems. With the increasing complexity of reconfigurable resources (processing elements, routing cells, I/O blocks, etc.), the reconfiguration cost is becoming the performance bottleneck. The major reconfiguration cost comes from the frequent memory read/write operations for transferring the configuration context from main memory to the context buffer. To improve the overall performance, it is critical to reduce the amount of configuration context. In this paper, we propose a configuration context reduction method for CGRA. The proposed method exploits the structural correlation of computation tasks that are mapped onto the CGRA and reduces the redundancies in the configuration context. Experimental results show that the proposed method can reduce the configuration context size by up to 71% and speed up the execution by up to 68%. The proposed method does not depend on any architectural feature and can be applied to a CGRA with an arbitrary architecture.

  • Compiler Framework for Reconfigurable Computing Architecture

    Chongyong YIN  Shouyi YIN  Leibo LIU  Shaojun WEI  

     
    BRIEF PAPER

      Vol:
    E92-C No:10
      Page(s):
    1284-1290

    A compiler is the most important supporting tool to facilitate the use of reconfigurable computing architecture (RCA). In this paper, a template-based compiler framework is proposed. This compiler can synthesize executables for RCA directly from source code in a native high-level programming language. It supports generating run-time dynamic configuration contexts, and it can generate both full and partial configuration contexts. Experimental results show that the executables generated by the proposed compiler achieve better execution performance and smaller configuration context size than those of previous compilers. Moreover, this compiler does not require the programmer to have any extra knowledge of the hardware architecture of the RCA.

  • FPGA Implementation of Highly Modular Fast Universal Discrete Transforms

    Panan POTIPANTONG  Phaophak SIRISUK  Soontorn ORAINTARA  Apisak WORAPISHET  

     
    PAPER-Integrated Electronics

      Vol:
    E92-C No:4
      Page(s):
    576-586

    This paper presents an FPGA implementation of highly modular universal discrete transforms. The implementation relies upon the unified discrete Fourier Hartley transform (UDFHT), based on which essential sinusoidal transforms including discrete Fourier transform (DFT), discrete Hartley transform (DHT), discrete cosine transform (DCT) and discrete sine transform (DST) can be realized. It employs a reconfigurable, scalable and modular architecture that consists of a memory-based FFT processor equipped with pre- and post-processing units. Besides, a pipelining technique is exploited to seamlessly harmonize the operation between each sub-module. Experimental results based on Xilinx Virtex-II Pro are given to examine the performance of the proposed UDFHT implementation. Two practical applications are also shown to demonstrate the flexibility and modularity of the proposed work.

  • An Efficient and Effective Algorithm for Online Task Placement with I/O Communications in Partially Reconfigurable FPGAs

    Mitsuru TOMONO  Masaki NAKANISHI  Shigeru YAMASHITA  Kazuo NAKAJIMA  Katsumasa WATANABE  

     
    PAPER-System Level Design

      Vol:
    E89-A No:12
      Page(s):
    3416-3426

    In a partially reconfigurable FPGA of the future, arbitrary portions of its logic resources and interconnection networks will be reconfigured without affecting the other parts. Multiple tasks will be mapped and executed concurrently in such an FPGA. Efficient execution of the tasks using the limited resources of the FPGA will necessitate effective resource management. A number of online FPGA placement methods have recently been proposed for such an FPGA. However, they cannot handle I/O communications of the tasks. Taking such I/O communications into consideration, we introduce a new approach to online FPGA placement. We present an algorithm for placing each arriving task in an empty area so as to complete all the tasks efficiently. We develop two fitting strategies to effectively handle I/O communications of the tasks. Our experimental results show that properly weighted combinations of these and two other previously proposed strategies enable this algorithm to run very fast and make an effective placement of the tasks. In fact, we show that the overhead associated with the use of this algorithm is negligible as compared to the total execution time of the tasks.

  • A Cost Effective Interconnection Network for Reconfigurable Computing Processor in Digital Signal Processing Applications

    Yeong-Kang LAI  Lien-Fei CHEN  Jian-Chou CHEN  Chun-Wei CHIU  

     
    LETTER

      Vol:
    E89-C No:11
      Page(s):
    1674-1675

    In this paper, a novel cost-effective interconnection network for a two-way pipelined SIMD-based reconfigurable computing processor is proposed. Our reconfigurable computing engine is composed of SIMD-based function units, flexible interconnection networks, and two-bank on-chip memories. To connect the function units, the proposed reconfigurable network links each function unit to all of its neighbors. The proposed interconnection network is a full, bidirectional connection with data duplication, which lets it run data-parallel applications efficiently. Moreover, it is a multistage network, which gives high flexibility at low hardware cost.

  • Hardware Object Model and Its Application to the Image Processing

    Kenji KUDO  Yoshihiro MYOKAN  Winh Chan THAN  Shinji AKIMOTO  Takashi KANAMARU  Masatoshi SEKINE  

     
    PAPER-Video/Image Coding

      Vol:
    E87-A No:3
      Page(s):
    547-558

    To realize a hardware object that facilitates application development in a reconfigurable computing system, a hardware module (HwModule) is proposed and implemented. To access the circuit in the HwModule from a standard PC without detailed knowledge of the hardware, an object manager (ObjectManager) is also implemented. With the help of the ObjectManager, programmers can use hardware objects like ordinary software objects. The HwModule is applied to image matching, and the ease of application development for the HwModule is confirmed.

  • Time-Memory Trade-off Cryptanalysis for Limited Key on FPGA-Based Parallel Machine RASH

    Katsumi TAKAHASHI  Hiroai ASAMI  Katsuto NAKAJIMA  Masahiro IIDA  

     
    PAPER

      Vol:
    E86-D No:5
      Page(s):
    781-788

    We designed an FPGA-based parallel machine called "RASH" (Reconfigurable Architecture based on Scalable Hardware) for high-speed and flexible signal/data processing. Cryptanalysis is one of the killer applications for FPGA-based machines because it requires huge amounts of logical and/or simple arithmetic operations, for which FPGAs are well suited. One of the well-known activities in cryptanalysis is the DES (Data Encryption Standard) cracking contest conducted by RSA Data Security. TMTO (Time-Memory Trade-Off) cryptanalysis is a practical method to dramatically shorten the time for key search when the plaintext is given in advance. A string of ASCII characters is used as the key, much like a password. An ASCII character is a 7-bit character that can take 96 different values. The 56-bit DES key is given as a string of 8 ASCII characters. Although the DES key has 64 trillion (=2^56) possibilities, a key given as a string has only 6.4 trillion (=96^8) possibilities. Therefore, we improve TMTO cryptanalysis so that only the keys limited to ASCII characters are searched, which reduces the quantity of computation. In this paper, we demonstrate how TMTO cryptanalysis for limited keys is well suited to our FPGA-based RASH machine. By limiting the key to a string, the DES key will be found with 80% probability within 45 minutes after the ciphertext is given on 10 units of RASH. The precomputation before starting the key search takes 3 weeks on the same RASH configuration.

  • Evaluating Online Hot Instruction Sequence Profilers for Dynamically Reconfigurable Functional Units

    Takanori HAYASHIDA  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E86-D No:5
      Page(s):
    901-909

    Online profiling methodologies are studied for exploiting dynamic optimization. A dynamically optimizable system with online profilers has to obtain an accurate profile early in program execution to run effectively. However, to obtain a more effective profile, online profiling has to satisfy both "Rapidness" and "Accuracy", which are conflicting requirements, so a trade-off point has to be chosen at implementation. We focus on an online Hot Instruction Sequence (HIS) profiler to exploit reconfigurable functional units. To substantiate the effectiveness of online HIS profiling, we build several evaluation models for experimental evaluation. Our profiler models are SC/DM, SC/FA and JC/DM. These models differ in their policies for event counting and table lookup. The event counting policies are simple-counting and jumble-counting; the table lookup policies are direct-map and full-associative. In our experimental evaluation, the SC/FA and JC/DM models scored higher accuracy than SC/DM. The JC/DM model can be implemented with a lower table lookup cost, yet it scored accuracy comparable to SC/FA.

  • Accelerating the CKY Parsing Using FPGAs

    Jacir L. BORDIM  Yasuaki ITO  Koji NAKANO  

     
    PAPER

      Vol:
    E86-D No:5
      Page(s):
    803-810

    The main contribution of this paper is to present an FPGA-based implementation of an instance-specific hardware which accelerates the CKY (Cocke-Kasami-Younger) parsing for context-free grammars. Given a context-free grammar G and a string x, the CKY parsing determines whether G derives x. We have developed a hardware generator that creates a Verilog HDL source to perform the CKY parsing for any given context-free grammar G. The generated source is embedded in an FPGA using the design software provided by the FPGA vendor. We evaluated the instance-specific hardware, generated by our hardware generator, using a timing analyzer and tested it using the Altera FPGAs. The generated hardware attains a speed-up factor of approximately 750 over the software CKY parsing algorithm.
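
    As background for this abstract, the following is a minimal software sketch of CKY recognition, i.e. the kind of software baseline such hardware is compared against. It is not the authors' generated Verilog. It assumes the grammar is already in Chomsky normal form, and the function name, the dictionary-based grammar encoding, and the toy grammar are illustrative assumptions.

```python
def cky_recognize(word, terminal_rules, binary_rules, start="S"):
    """Return True if the CNF grammar derives `word`.

    terminal_rules: maps a terminal symbol to the set of nonterminals A with A -> terminal
    binary_rules:   maps a pair (B, C) to the set of nonterminals A with A -> B C
    """
    n = len(word)
    if n == 0:
        return False  # a CNF grammar without an epsilon rule derives no empty string
    # table[i][k] holds the nonterminals that derive word[i : i + k + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(word):
        table[i][0] = set(terminal_rules.get(symbol, ()))
    for length in range(2, n + 1):          # substring length
        for i in range(n - length + 1):     # substring start
            for split in range(1, length):  # length of the left part
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for b in left:
                    for c in right:
                        table[i][length - 1] |= binary_rules.get((b, c), set())
    return start in table[0][n - 1]


# Toy grammar for the language a^n b^n (n >= 1):
#   S -> A B | A X,  X -> S B,  A -> 'a',  B -> 'b'
TERMINAL_RULES = {"a": {"A"}, "b": {"B"}}
BINARY_RULES = {("A", "B"): {"S"}, ("A", "X"): {"S"}, ("S", "B"): {"X"}}

if __name__ == "__main__":
    for candidate in ["ab", "aabb", "aab", "ba"]:
        print(candidate, cky_recognize(candidate, TERMINAL_RULES, BINARY_RULES))
```

    The three nested loops give the cubic running time in the string length. Table entries for substrings of the same length do not depend on each other, which leaves room for parallel evaluation in hardware.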

  • Dynamically Reconfigurable Logic LSI--PCA-1: The First Realization of the Plastic Cell Architecture

    Hideyuki ITO  Ryusuke KONISHI  Hiroshi NAKADA  Kiyoshi OGURI  Minoru INAMORI  Akira NAGOYA  

     
    PAPER

      Vol:
    E86-D No:5
      Page(s):
    859-867

    This paper describes the realization of a dynamically reconfigurable logic LSI based on a novel parallel computer architecture. The key point of the architecture is its dual-structured cell array, which enables dynamic and autonomous reconfiguration of the logic circuits. The LSI was completed by successfully introducing two specific features: fully asynchronous logic circuits and a homogeneous structure in which only LUTs are used.
