The search functionality is under construction.
The search functionality is under construction.

Keyword Search Result

[Keyword] CGRA(8hit)

1-8hit
  • Flexible and Energy-Efficient Crypto-Processor for Arbitrary Input Length Processing in Blockchain-Based IoT Applications

    Vu-Trung-Duong LE  Hoai-Luan PHAM  Thi-Hong TRAN  Yasuhiko NAKASHIMA  

     
    PAPER

      Pubricized:
    2023/09/04
      Vol:
    E107-A No:3
      Page(s):
    319-330

    Blockchain-based Internet of Things (IoT) applications require flexible, fast, and low-power hashing hardware to ensure IoT data integrity and maintain blockchain network confidentiality. However, existing hashing hardware poses challenges in achieving high performance and low power and limits flexibility to compute multiple hash functions with different message lengths. This paper introduces the flexible and energy-efficient crypto-processor (FECP) to achieve high flexibility, high speed, and low power with high hardware efficiency for blockchain-based IoT applications. To achieve these goals, three new techniques are proposed, namely the crypto arithmetic logic unit (Crypto-ALU), dual buffering extension (DBE), and local data memory (LDM) scheduler. The experiments on ASIC show that the FECP can perform various hash functions with a power consumption of 0.239-0.676W, a throughput of 10.2-3.35Gbps, energy efficiency of 4.44-14.01Gbps/W, and support up to 8916-bit message input. Compared to state-of-art works, the proposed FECP is 1.65-4.49 times, 1.73-21.19 times, and 1.48-17.58 times better in throughput, energy efficiency, and energy-delay product (EDP), respectively.

  • Recovering Faulty Non-Volatile Flip Flops for Coarse-Grained Reconfigurable Architectures

    Takeharu IKEZOE  Takuya KOJIMA  Hideharu AMANO  

     
    PAPER

      Pubricized:
    2020/12/14
      Vol:
    E104-C No:6
      Page(s):
    215-225

    Recent IoT devices require extremely low standby power consumption, while a certain performance is needed during the active time, and Coarse-Grained Reconfigurable Arrays (CGRAs) have received attention because of their high energy efficiency. For further reduction of the standby energy consumption of CGRAs, the leakage power for their configuration memory must be reduced. Although the power gating is a common technique, the lost data in flip-flops and memory must be retrieved after the wake-up. Recovering everything requires numerous state transitions and considerable overhead both on its execution time and energy. To address the problem, Non-volatile Cool Mega Array (NVCMA), a CGRA providing non-volatile flip-flops (NVFFs) with spin transfer torque type non-volatile memory (NVM) technology has been developed. However, in general, non-volatile memory technologies have problems with reliability. Some NVFFs are stacked-at-0/1, and cannot store the data in a certain possibility. To improve the chip yield, we propose a mapping algorithm to avoid faulty processing elements of the CGRA caused by the erroneous configuration data. Next, we also propose a method to add an error-correcting code (ECC) mechanism to NVFFs for the configuration and constant memory. The proposed method was applied to NVCMA to evaluate the availability rate and reduction of write time. By using both methods, the average availability ratio of 94.2% was achieved, while the average availability ratio of the nine applications was 0.056% when the probability of failure of the FF was 0.01. The energy for storing data becomes about 2.3 times because of the hardware overhead of ECC but the proposed method can save 8.6% of the writing power on average.

  • A Fine-Grained Multicasting of Configuration Data for Coarse-Grained Reconfigurable Architectures

    Takuya KOJIMA  Hideharu AMANO  

     
    PAPER-Computer System

      Pubricized:
    2019/04/05
      Vol:
    E102-D No:7
      Page(s):
    1247-1256

    A novel configuration data compression technique for coarse-grained reconfigurable architectures (CGRAs) is proposed. Reducing the size of configuration data of CGRAs shortens the reconfiguration time especially when the communication bandwidth between a CGRA and a host CPU is limited. In addition, it saves energy consumption of configuration cache and controller. The proposed technique is based on a multicast configuration technique called RoMultiC, which reduces the configuration time by multicasting the same data to multiple PEs (Processing Elements) with two bit-maps. Scheduling algorithms for an optimizing the order of multicasting have been proposed. However, the multicasting is possible only if each PE has completely the same configuration. In general, configuration data for CGRAs can be divided into some fields like machine code formats of general perpose CPUs. The proposed scheme confines a part of fields for multicasting so that the possibility of multicasting more PEs can be increased. This paper analyzes algorithms to find a configuration pattern which maximizes the number of multicasted PEs. We implemented the proposed scheme to CMA (Cool Mega Array), a straight forward CGRA as a case study. Experimental results show that the proposed method achieves 40.0% smaller configuration than a previous method for an image processing application at maximum. The exploration of the multicasted grain size reveals the effective grain size for each algorithm. Furthermore, since both a dynamic power consumption of the configuration controller and a configuration time are improved, it achieves 50.1% reduction of the energy consumption for the configuration with a negligible area overhead.

  • Optimization of Body Biasing for Variable Pipelined Coarse-Grained Reconfigurable Architectures

    Takuya KOJIMA  Naoki ANDO  Hayate OKUHARA  Ng. Anh Vu DOAN  Hideharu AMANO  

     
    PAPER-Computer System

      Pubricized:
    2018/03/09
      Vol:
    E101-D No:6
      Page(s):
    1532-1540

    Variable Pipeline Cool Mega Array (VPCMA) is a low power Coarse Grained Reconfigurable Architecture (CGRA) based on the concept of CMA (Cool Mega Array). It provides a pipeline structure in the PE array that can be configured so as to fit target algorithms and required performance. Also, VPCMA uses the Silicon On Thin Buried oxide (SOTB) technology, a type of Fully Depleted Silicon On Insulator (FDSOI), so it is possible to control its body bias voltage to provide a balance between performance and leakage power. In this paper, we study the optimization of the VPCMA body bias while considering simultaneously its variable pipeline structure. Through evaluations, we can observe that it is possible to achieve an average reduction of energy consumption, for the studied applications, of 17.75% and 10.49% when compared to respectively the zero bias (without body bias control) and the uniform (control of the whole PE array) cases, while respecting performance constraints. Besides, it is observed that, with appropriate body bias control, it is possible to extend the possible performance, hence enabling broader trade-off analyzes between consumption and performance. Considering the dynamic power as well as the static power, more appropriate pipeline structure and body bias voltage can be obtained. In addition, when the control of VDD is integrated, higher performance can be achieved with a steady increase of the power. These promising results show that applying an adequate optimization technique for the body bias control while simultaneously considering pipeline structures can not only enable further power reduction than previous methods, but also allow more trade-off analysis possibilities.

  • Body Bias Domain Partitioning Size Exploration for a Coarse Grained Reconfigurable Accelerator

    Yusuke MATSUSHITA  Hayate OKUHARA  Koichiro MASUYAMA  Yu FUJITA  Ryuta KAWANO  Hideharu AMANO  

     
    PAPER-Architecture

      Pubricized:
    2017/07/14
      Vol:
    E100-D No:12
      Page(s):
    2828-2836

    Body biasing can be used to control the leakage power and performance by changing the threshold voltage of transistors after fabrication. Especially, a new process called Silicon-On-Thin Box (SOTB) CMOS can control their balance widely. When it is applied to a Coarse Grained Reconfigurable Array (CGRA), the leakage power can be much reduced by precise bias control with small domain size including a small number of PEs. On the other hand, the area overhead for separating power domain and delivering a lot of wires for body bias voltage supply increases. This paper explores the grain of domain size of an energy efficient CGRA called CMA (Cool Mega Array). By using Genetic Algorithm based body bias assignment method, the leakage reduction of various grain size was evaluated. As a result, a domain with 2x1 PEs achieved about 40% power reduction with a 6% area overhead. It has appeared that a combination of three body bias voltages; zero bias, weak reverse bias and strong reverse bias can achieve the optimal leakage reduction and area overhead balance in most cases.

  • Performance Evaluation of a 3D-Stencil Library for Distributed Memory Array Accelerators

    Yoshikazu INAGAKI  Shinya TAKAMAEDA-YAMAZAKI  Jun YAO  Yasuhiko NAKASHIMA  

     
    PAPER-Architecture

      Pubricized:
    2015/09/15
      Vol:
    E98-D No:12
      Page(s):
    2141-2149

    The Energy-aware Multi-mode Accelerator eXtension [24],[25] (EMAX) is equipped with distributed single-port local memories and ring-formed interconnections. The accelerator is designed to achieve extremely high throughput for scientific computations, big data, and image processing as well as low-power consumption. However, before mapping algorithms on the accelerator, application developers require sufficient knowledge of the hardware organization and specially designed instructions. They also need significant effort to tune the code for improving execution efficiency when no well-designed compiler or library is available. To address this problem, we focus on library support for stencil (nearest-neighbor) computations that represent a class of algorithms commonly used in many partial differential equation (PDE) solvers. In this research, we address the following topics: (1) system configuration, features, and mnemonics of EMAX; (2) instruction mapping techniques that reduce the amount of data to be read from the main memory; (3) performance evaluation of the library for PDE solvers. With the features of a library that can reuse the local data across the outer loop iterations and map many instructions by unrolling the outer loops, the amount of data to be read from the main memory is significantly reduced to a minimum of 1/7 compared with a hand-tuned code. In addition, the stencil library reduced the execution time 23% more than a general-purpose processor.

  • Fast AdaBoost-Based Face Detection System on a Dynamically Coarse Grain Reconfigurable Architecture

    Jian XIAO  Jinguo ZHANG  Min ZHU  Jun YANG  Longxing SHI  

     
    PAPER-Application

      Vol:
    E95-D No:2
      Page(s):
    392-402

    An AdaBoost-based face detection system is proposed, on a Coarse Grain Reconfigurable Architecture (CGRA) named “REMUS-II”. Our work is quite distinguished from previous ones in three aspects. First, a new hardware-software partition method is proposed and the whole face detection system is divided into several parallel tasks implemented on two Reconfigurable Processing Units (RPU) and one micro Processors Unit (µPU) according to their relationships. These tasks communicate with each other by a mailbox mechanism. Second, a strong classifier is treated as a smallest phase of the detection system, and every phase needs to be executed by these tasks in order. A phase of Haar classifier is dynamically mapped onto a Reconfigurable Cell Array (RCA) only when needed, and it's quite different from traditional Field Programmable Gate Array (FPGA) methods in which all the classifiers are fabricated statically. Third, optimized data and configuration word pre-fetch mechanisms are employed to improve the whole system performance. Implementation results show that our approach under 200 MHz clock rate can process up-to 17 frames per second on VGA size images, and the detection rate is over 95%. Our system consumes 194 mW, and the die size of fabricated chip is 23 mm2 using TSMC 65 nm standard cell based technology. To the best of our knowledge, this work is the first implementation of the cascade Haar classifier algorithm on a dynamically CGRA platform presented in the literature.

  • Configuration Context Reduction for Coarse-Grained Reconfigurable Architecture

    Shouyi YIN  Chongyong YIN  Leibo LIU  Min ZHU  Shaojun WEI  

     
    PAPER-Design Methodology

      Vol:
    E95-D No:2
      Page(s):
    335-344

    Coarse-grained reconfigurable architecture (CGRA) combines the performance of application-specific integrated circuits (ASICs) and the flexibility of general-purpose processors (GPPs), which is a promising solution for embedded systems. With the increasing complexity of reconfigurable resources (processing elements, routing cells, I/O blocks, etc.), the reconfiguration cost is becoming the performance bottleneck. The major reconfiguration cost comes from the frequent memory-read/write operations for transferring the configuration context from main memory to context buffer. To improve the overall performance, it is critical to reduce the amount of configuration context. In this paper, we propose a configuration context reduction method for CGRA. The proposed method exploits the structure correlation of computation tasks that are mapped onto CGRA and reduce the redundancies in configuration context. Experimental results show that the proposed method can averagely reduce the configuration context size up to 71% and speed up the execution up to 68%. The proposed method does not depend on any architectural feature and can be applied to CGRA with an arbitrary architecture.