The search functionality is under construction.
The search functionality is under construction.

Keyword Search Result

[Keyword] fpga(330hit)

81-100hit(330hit)

  • A Tree-Based Checkpointing Architecture for the Dependability of FPGA Computing

    Hoang-Gia VU  Shinya TAKAMAEDA-YAMAZAKI  Takashi NAKADA  Yasuhiko NAKASHIMA  

     
    PAPER-Device and Architecture

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    288-302

    Modern FPGAs have been integrated in computing systems as accelerators for long running applications. This integration puts more pressure on the fault tolerance of computing systems, and the requirement for dependability becomes essential. As in the case of CPU-based system, checkpoint/restart techniques are also expected to improve the dependability of FPGA-based computing. Three issues arise in this situation: how to checkpoint and restart FPGAs, how well this checkpoint/restart model works with the checkpoint/restart model of the whole computing system, and how to build the model by a software tool. In this paper, we first present a new checkpoint/restart architecture along with a checkpointing mechanism on FPGAs. We then propose a method to capture consistent snapshots of FPGA and the rest of the computing system. Third, we provide “fine-grained” management for checkpointing to reduce performance degradation. For the host CPU, we also provide a stack which includes API functions to manage checkpoint/restart procedures on FPGAs. Fourth, we present a Python-based tool to insert checkpointing infrastructure. Experimental results show that the checkpointing architecture causes less than 10% maximum clock frequency degradation, low checkpointing latencies, small memory footprints, and small increases in power consumption, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).

  • FPGA Components for Integrating FPGAs into Robot Systems

    Takeshi OHKAWA  Kazushi YAMASHINA  Hitomi KIMURA  Kanemitsu OOTSU  Takashi YOKOTA  

     
    PAPER-Emerging Applications

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    363-375

    A component-oriented FPGA design platform is proposed for robot system integration. FPGAs are known to be a power-efficient hardware platform, but the development cost of FPGA-based systems is currently too high to integrate them into robot systems. To solve this problem, we propose an FPGA component that allows FPGA devices to be easily integrated into robot systems based on the Robot Operating System (ROS). ROS-compliant FPGA components offer a seamless interface between the FPGA hardware and software running on the CPU. Two experiments were conducted using the proposed components. For the first experiment, the results show that the execution time of an FPGA component for image processing was 1.7 times faster than that of the original software-based component and was 2.51 times more power efficient than an ordinary PC processor, despite substantial communication overhead. The second experiment showed that an FPGA component for sensor fusion was able to process multiple sensor inputs efficiently and with very low latency via parallel processing.

  • An FPGA Realization of a Random Forest with k-Means Clustering Using a High-Level Synthesis Design

    Akira JINGUJI  Shimpei SATO  Hiroki NAKAHARA  

     
    PAPER-Emerging Applications

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    354-362

    A random forest (RF) is a kind of ensemble machine learning algorithm used for a classification and a regression. It consists of multiple decision trees that are built from randomly sampled data. The RF has a simple, fast learning, and identification capability compared with other machine learning algorithms. It is widely used for application to various recognition systems. Since it is necessary to un-balanced trace for each tree and requires communication for all the ones, the random forest is not suitable in SIMD architectures such as GPUs. Although the accelerators using the FPGA have been proposed, such implementations were based on HDL design. Thus, they required longer design time than the soft-ware based realizations. In the previous work, we showed the high-level synthesis design of the RF including the fully pipelined architecture and the all-to-all communication. In this paper, to further reduce the amount of hardware, we use k-means clustering to share comparators of the branch nodes on the decision tree. Also, we develop the krange tool flow, which generates the bitstream with a few number of hyper parameters. Since the proposed tool flow is based on the high-level synthesis design, we can obtain the high performance RF with short design time compared with the conventional HDL design. We implemented the RF on the Xilinx Inc. ZC702 evaluation board. Compared with the CPU (Intel Xeon (R) E5607 Processor) and the GPU (NVidia Geforce Titan) implementations, as for the performance, the FPGA realization was 8.4 times faster than the CPU one, and it was 62.8 times faster than the GPU one. As for the power consumption efficiency, the FPGA realization was 7.8 times better than the CPU one, and it was 385.9 times better than the GPU one.

  • Area Efficient Annealing Processor for Ising Model without Random Number Generator

    Hidenori GYOTEN  Masayuki HIROMOTO  Takashi SATO  

     
    PAPER-Device and Architecture

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    314-323

    An area-efficient FPGA-based annealing processor that is based on Ising model is proposed. The proposed processor eliminates random number generators (RNGs) and temperature schedulers, which are the key components in the conventional annealing processors and occupying a large portion of the design. Instead, a shift-register-based spin flipping scheme successfully helps the Ising model from stucking in the local optimum solutions. An FPGA implementation and software-based evaluation on max-cut problems of 2D-grid torus structure demonstrate that our annealing processor solves the problems 10-104 times faster than conventional optimization algorithms to obtain the solution of equal accuracy.

  • A Threshold Neuron Pruning for a Binarized Deep Neural Network on an FPGA

    Tomoya FUJII  Shimpei SATO  Hiroki NAKAHARA  

     
    PAPER-Emerging Applications

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    376-386

    For a pre-trained deep convolutional neural network (CNN) for an embedded system, a high-speed and a low power consumption are required. In the former of the CNN, it consists of convolutional layers, while in the latter, it consists of fully connection layers. In the convolutional layer, the multiply accumulation operation is a bottleneck, while the fully connection layer, the memory access is a bottleneck. The binarized CNN has been proposed to realize many multiply accumulation circuit on the FPGA, thus, the convolutional layer can be done with a high-seed operation. However, even if we apply the binarization to the fully connection layer, the amount of memory was still a bottleneck. In this paper, we propose a neuron pruning technique which eliminates almost part of the weight memory, and we apply it to the fully connection layer on the binarized CNN. In that case, since the weight memory is realized by an on-chip memory on the FPGA, it achieves a high-speed memory access. To further reduce the memory size, we apply the retraining the CNN after neuron pruning. In this paper, we propose a sequential-input parallel-output fully connection layer circuit for the binarized fully connection layer, while proposing a streaming circuit for the binarized 2D convolutional layer. The experimental results showed that, by the neuron pruning, as for the fully connected layer on the VGG-11 CNN, the number of neurons was reduced by 39.8% with keeping the 99% baseline accuracy. We implemented the neuron pruning CNN on the Xilinx Inc. Zynq Zedboard. Compared with the ARM Cortex-A57, it was 1773.0 times faster, it dissipated 3.1 times lower power, and its performance per power efficiency was 5781.3 times better. Also, compared with the Maxwell GPU, it was 11.1 times faster, it dissipated 7.7 times lower power, and its performance per power efficiency was 84.1 times better. Thus, the binarized CNN on the FPGA is suitable for the embedded system.

  • A Describing Method of an Image Processing Software in C for a High-Level Synthesis Considering a Function Chaining

    Akira YAMAWAKI  Seiichi SERIKAWA  

     
    PAPER-Design Methodology and Platform

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    324-334

    This paper shows a describing method of an image processing software in C for high-level synthesis (HLS) technology considering function chaining to realize an efficient hardware. A sophisticated image processing would be built on the sequence of several primitives represented as sub-functions like the gray scaling, filtering, binarization, thinning, and so on. Conventionally, generic describing methods for each sub-function so that HLS technology can generate an efficient hardware module have been shown. However, few studies have focused on a systematic describing method of the single top function consisting of the sub-functions chained. According to the proposed method, any number of sub-functions can be chained, maintaining the pipeline structure. Thus, the image processing can achieve the near ideal performance of 1 pixel per clock even when the processing chain is long. In addition, implicitly, the deadlock due to the mismatch of the number of pushes and pops on the FIFO connecting the functions is eliminated and the interpolation of the border pixels is done. The case study on a canny edge detection including the chain of some sub-functions demonstrates that our proposal can easily realize the expected hardware mentioned above. The experimental results on ZYNQ FPGA show that our proposal can be converted to the pipelined hardware with moderate size and achieve the performance gain of more than 70 times compared to the software execution. Moreover, the reconstructed C software program following our proposed method shows the small performance degradation of 8% compared with the pure C software through a comparative evaluation preformed on the Cortex A9 embedded processor in ZYNQ FPGA. This fact indicates that a unified image processing library using HLS software which can be executed on CPU or hardware module for HW/SW co-design can be established by using our proposed describing method.

  • Three Dimensional FPGA Architecture with Fewer TSVs

    Motoki AMAGASAKI  Masato IKEBE  Qian ZHAO  Masahiro IIDA  Toshinori SUEYOSHI  

     
    PAPER-Device and Architecture

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    278-287

    Three-dimensional (3D) field-programmable gate arrays (FPGAs) are expected to offer higher logic density as well as improved delay and power performance by utilizing 3D integrated circuit technology. However, because through-silicon-vias (TSVs) for conventional 3D FPGA interlayer connections have a large area overhead, there is an inherent tradeoff between connectivity and small size. To find a balance between cost and performance, and to explore 3D FPGAs with realistic 3D integration processes, we propose two types of 3D FPGA and construct design tool sets for architecture exploration. In previous research, we created a TSV-free 3D FPGA with a face-down integration method; however, this was limited to two layers. In this paper, we discuss the face-up stacking of several face-down stacked FPGAs. To minimize the number of TSVs, we placed TSVs peripheral to the FPGAs for 3D-FPGA with 4 layers. According to our results, a 2-layer 3D FPGA has reasonable performance when limiting the design to two layers, but a 4-layer 3D FPGA is a better choice when area is emphasized.

  • A Compact Matched Filter Bank for an Optical ZCZ Sequence Set with Zero-Correlation Zone 2z

    Yasuaki OHIRA  Takahiro MATSUMOTO  Hideyuki TORII  Yuta IDA  Shinya MATSUFUJI  

     
    LETTER

      Vol:
    E101-A No:1
      Page(s):
    195-198

    In this paper, we propose a new structure for a compact matched filter bank (MFB) for an optical zero-correlation zone (ZCZ) sequence set with Zcz=2z. The proposed MFB can reduces operation elements such as 2-input adders and delay elements. The number of 2-input adders decrease from O(N2) to O(N log2 N), delay elements decrease from O(N2) to O(N). In addition, the proposed MFBs for the sequence of length 32, 64, 128 and 256 with Zcz=2,4 and 8 are implemented on a field programmable gate array (FPGA). As a result, the numbers of logic elements (LEs) of the proposed MFBs for the sequences with Zcz=2 of length 32, 64, 128 and 256 are suppressed to about 76.2%, 84.2%, 89.7% and 93.4% compared to that of the conventional MFBs, respectively.

  • An Analysis of Time Domain Reed Solomon Decoder with FPGA Implementation

    Kentaro KATO  Somsak CHOOMCHUAY  

     
    PAPER-Computer System

      Pubricized:
    2017/08/23
      Vol:
    E100-D No:12
      Page(s):
    2953-2961

    This paper analyzes the time domain Reed Solomon Decoder with FPGA implementation. Data throughput and area is carefully evaluated compared with typical frequency domain Reed Solomon Decoder. In this analysis, three hardware architecture to enhance the data throughput, namely, the pipelined architecture, the parallel architecture, and the truncated arrays, is evaluated, too. The evaluation reveals that the number of the consumed resources of RS(255, 239) is about 20% smaller than those of the frequency domain decoder although data throughput is less than 10% of the frequency domain decoder. The number of the consumed resources of the pipelined architecture is 28% smaller than that of the parallel architecture when data throughput is same. It is because the pipeline architecture requires less extra logics than the parallel architecture. To get higher data throughput, the pipelined architecture is better than the parallel architecture from the viewpoint of consumed resources.

  • Design and Implementation of 176-MHz WXGA 30-fps Real-Time Optical Flow Processor

    Yu SUZUKI  Masato ITO  Satoshi KANDA  Kousuke IMAMURA  Yoshio MATSUDA  Tetsuya MATSUMURA  

     
    PAPER

      Vol:
    E100-A No:12
      Page(s):
    2888-2900

    This paper describes the design and implementation of a real-time optical flow processor using a single field-programmable gate array (FPGA) chip. By introducing the modified initial flow generation method, the successive over-relaxation (SOR) method for both layers, the optimization of the reciprocal operation method, and the image division method, it is now possible to both reduce hardware requirements and improve flow accuracy. Additionally, by introducing a pipeline structure to this processor, high-throughput hardware implementation could be achieved. Total logic cell (LC) amounts and processer memory capacity are reduced by about 8% and 16%, respectively, compared to our previous hierarchical optical flow estimation (HOE) processor. The results of our evaluation confirm that this processor can perform 30 fps wide extended graphics array (WXGA) 175.7MHz real-time optical flow processing with a single FPGA.

  • A 10 Gbps D-PHY Transmitter Bridge Chip for FPGA-Based Frame Generator Supporting MIPI DSI of Mobile Display

    Ho-Seong KIM  Pil-Ho LEE  Jin-Wook HAN  Seung-Hun SHIN  Seung-Wuk BAEK  Doo-Ill PARK  Yongkyu SEO  Young-Chan JANG  

     
    BRIEF PAPER

      Vol:
    E100-C No:11
      Page(s):
    1035-1038

    A 10 Gbps transmitter bridge chip including four data lanes, which increases the bandwidth using an 8-to-1 serialization, is proposed for a field-programmable gate array (FPGA)-based frame generator to support the protocol of the D-PHY version 1.2 for the mobile industry processor interface (MIPI) display serial interface (DSI).

  • Accelerating Weeder: A DNA Motif Search Tool Using the Micron Automata Processor and FPGA

    Qiong WANG  Mohamed EL-HADEDY  Kevin SKADRON  Ke WANG  

     
    PAPER-Computer System

      Pubricized:
    2017/06/29
      Vol:
    E100-D No:10
      Page(s):
    2470-2477

    Motif searching, i.e., identifying meaningful patterns from biological data, has been studied extensively due to its importance in the biomedical sciences. In this work, we seek to improve the performance of Weeder, a widely-used tool for automatic de novo motif searching. Weeder consists of several functions, among which we find that the function oligo_scan, which handles the pattern matching, is the bottleneck, especially when dealing with large datasets. Motivated by this observation, we adopt the Micron Automata Processor (AP) to accelerate the pattern-matching stage of Weeder. The AP is a massively-parallel, non-von-Neumann semiconductor architecture that is purpose-built for symbolic pattern matching. Relying on the fact that AP is capable of performing matching for thousands of patterns in parallel, we develop an AP-accelerated Weeder implementation in this work. In particular, we describe how to map Weeder's pattern matching to the AP chip and use the high-end FPGA on the AP board to postprocess the output from AP. Our experiment shows that the AP-accelerated Weeder achieves 751x speedup on pattern matching, compared to a single-threaded CPU implementation.

  • Energy-Efficient and Highly-Reliable Nonvolatile FPGA Using Self-Terminated Power-Gating Scheme

    Daisuke SUZUKI  Takahiro HANYU  

     
    PAPER-VLSI Architecture

      Pubricized:
    2017/05/19
      Vol:
    E100-D No:8
      Page(s):
    1618-1624

    An energy-efficient nonvolatile FPGA with assuring highly-reliable backup operation using a self-terminated power-gating scheme is proposed. Since the write current is automatically cut off just after the temporal data in the flip-flop is successfully backed up in the nonvolatile device, the amount of write energy can be minimized with no write failure. Moreover, when the backup operation in a particular cluster is completed, power supply of the cluster is immediately turned off, which minimizes standby energy due to leakage current. In fact, the total amount of energy consumption during the backup operation is reduced by 66% in comparison with that of a conventional worst-case-based approach where the long time write current pulse is used for the reliable write.

  • An Approach for Solving SAT/MaxSAT-Encoded Formal Verification Problems on FPGA

    Kenji KANAZAWA  Tsutomu MARUYAMA  

     
    PAPER-Computer System

      Pubricized:
    2017/05/12
      Vol:
    E100-D No:8
      Page(s):
    1807-1818

    WalkSAT (WSAT) is one of the best performing stochastic local search algorithms for the Boolean Satisfiability (SAT) and the Maximum Boolean Satisfiability (MaxSAT). WSAT is very suitable for hardware acceleration because of its high inherent parallelism. Formal verification of digital circuits is one of the most important applications of SAT and MaxSAT. Structural knowledge such as logic gates and their dependencies can be derived from SAT/MaxSAT instances generated from formal verification of digital circuits. Such that knowledge is useful to solve these instances efficiently. In this paper, we first discuss a heuristic to utilize the structural knowledge for solving these problems by using WSAT. Then, we show its implementation on FPGA. The problem size of the formal verification is typically very large, and most data have to be placed in off-chip DRAMs. In this situation, the acceleration by FPGA is limited by the throughput and access latency of the DRAMs. In our implementation, data are carefully mapped on the on-chip memory banks and off-chip DRAMs so that most data in the off-chip DRAMs can be continuously accessed using burst-read. Furthermore, a variable-way cache memory comprised of the on-chip memory banks is used in order to hide the DRAM access latency by caching the head portion of the continuous read from the DRAMs and giving them to the circuit till the rest portion is started to be given by the burst-read. We evaluate the performance of our proposed method by changing configuration of the variable-way cache and the processing parallelism, and discuss how much acceleration can be achieved.

  • Comparative Evaluation of FPGA Implementation Alternatives for Real-Time Robust Ellipse Estimation based on RANSAC Algorithm

    Theint Theint THU  Jimpei HAMAMURA  Rie SOEJIMA  Yuichiro SHIBATA  Kiyoshi OGURI  

     
    PAPER

      Vol:
    E100-A No:7
      Page(s):
    1409-1417

    Field Programmable Gate Array (FPGA) based robust model fitting enjoys immense popularity in image processing because of its high efficiency. This paper focuses on the tradeoff analysis of real-time FPGA implementation of robust circle and ellipse estimations based on the random sample consensus (RANSAC) algorithm, which estimates parameters of a statistical model from a data set of feature points which contains outliers. In particular, this paper mainly highlights implementation alternatives for solvers of simultaneous equations and compares Gauss-Jordan elimination and Cramer's rule by changing matrix size and arithmetic processes. Experimental evaluation shows a Cramer's rule approach coupled with long integer arithmetic can reduce most hardware resources without unacceptable degradation of estimation accuracy compared to floating point versions.

  • Soft-Error-Tolerant Dual-Modular-Redundancy Architecture with Repair and Retry Scheme for Memory-Control Circuit on FPGA

    Makoto SAEN  Tadanobu TOBA  Yusuke KANNO  

     
    PAPER

      Vol:
    E100-C No:4
      Page(s):
    382-390

    This paper presents a soft-error-tolerant memory-control circuit for SRAM-based field programmable gate arrays (FPGAs). A potential obstacle to applying such FPGAs to safety-critical industrial control systems is their low tolerance. The main reason is that soft errors damage circuit-configuration data stored in SRAM-based configuration memory. To overcome this obstacle, the soft-error tolerance must thus be improved while suppressing the circuit area overhead, and data stored in external memory must be protected when a fault occurs on the FPGA. Therefore, a memory-control circuit was developed on the basis of a dual-modular-redundancy (DMR) architecture. This memory controller has a repair and retry scheme that repairs damaged circuit-configuration data and re-executes unfinished accesses after the repair. The developed architecture reduces circuit redundancy below that of a commonly used triple-modular-redundancy (TMR) architecture. Moreover, a write-invalidation circuit was developed to protect data in external memory, and an external-memory-state recovery circuit was developed to enable resumption of memory access after fault repair. The developed memory controller was implemented in a prototype circuit on an FPGA and evaluated using the prototype. The evaluation results demonstrated that the developed memory controller can operate successfully for 1.03×109 hours (at sea level). In addition, its circuit area overhead was found to be sufficiently smaller than that of the TMR architecture.

  • Physical Fault Detection and Recovery Methods for System-LSI Loaded FPGA-IP Core Open Access

    Motoki AMAGASAKI  Yuki NISHITANI  Kazuki INOUE  Masahiro IIDA  Morihiro KUGA  Toshinori SUEYOSHI  

     
    INVITED PAPER

      Pubricized:
    2017/01/13
      Vol:
    E100-D No:4
      Page(s):
    633-644

    Fault tolerance is an important feature for the system LSIs used in reliability-critical systems. Although redundancy techniques are generally used to provide fault tolerance, these techniques have significantly hardware costs. However, FPGAs can easily provide high reliability due to their reconfiguration ability. Even if faults occur, the implemented circuit can perform correctly by reconfiguring to a fault-free region of the FPGA. In this paper, we examine an FPGA-IP core loaded in SoC and introduce a fault-tolerant technology based on fault detection and recovery as a CAD-level approach. To detect fault position, we add a route to the manufacturing test method proposed in earlier research and identify fault areas. Furthermore, we perform fault recovery at the logic tile and multiplexer levels using reconfiguration. The evaluation results for the FPGA-IP core loaded in the system LSI demonstrate that it was able to completely identify and avoid fault areas relative to the faults in the routing area.

  • FPGA Hardware Acceleration of a Phylogenetic Tree Reconstruction with Maximum Parsimony Algorithm

    Henry BLOCK  Tsutomu MARUYAMA  

     
    PAPER-Computer System

      Pubricized:
    2016/11/14
      Vol:
    E100-D No:2
      Page(s):
    256-264

    In this paper, we present an FPGA hardware implementation for a phylogenetic tree reconstruction with a maximum parsimony algorithm. We base our approach on a particular stochastic local search algorithm that uses the Progressive Neighborhood and the Indirect Calculation of Tree Lengths method. This method is widely used for the acceleration of the phylogenetic tree reconstruction algorithm in software. In our implementation, we define a tree structure and accelerate the search by parallel and pipeline processing. We show results for eight real-world biological datasets. We compare execution times against our previous hardware approach, and TNT, the fastest available parsimony program, which is also accelerated by the Indirect Calculation of Tree Lengths method. Acceleration rates between 34 to 45 per rearrangement, and 2 to 6 for the whole search, are obtained against our previous hardware approach. Acceleration rates between 2 to 36 per rearrangement, and 18 to 112 for the whole search, are obtained against TNT.

  • A Fast MER Enumeration Algorithm for Online Task Placement on Reconfigurable FPGAs

    Tieyuan PAN  Lian ZENG  Yasuhiro TAKASHIMA  Takahiro WATANABE  

     
    PAPER

      Vol:
    E99-A No:12
      Page(s):
    2412-2424

    In this paper, we propose a fast Maximal Empty Rectangle (MER) enumeration algorithm for online task placement on reconfigurable Field-Programmable Gate Arrays (FPGAs). On the assumption that each task utilizes rectangle-shaped resources, the proposed algorithm can manage the free space on FPGAs by an MER list. When assigning or removing a task, a series of MERs are selected and cut into segments according to the task and its assignment location. By processing these segments, the MER list can be updated quickly with low memory consumption. Under the proof of the upper limit of the number of the MERs on the FPGA, we analyze both the time and space complexity of the proposed algorithm. The efficiency of the proposed algorithm is verified by experiments.

  • Low Overhead Design of Power Reconfigurable FPGA with Fine-Grained Body Biasing on 65-nm SOTB CMOS Technology

    Masakazu HIOKI  Hanpei KOIKE  

     
    PAPER-Computer System

      Pubricized:
    2016/09/13
      Vol:
    E99-D No:12
      Page(s):
    3082-3089

    A Field Programmable Gate Array (FPGA) with fine-grained body biasing shows satisfactory static power reduction. Contrarily, the FPGA incurs high overhead because additional body bias selectors and electrical isolation regions are needed to program the threshold voltage (Vt) of elemental circuits such as MUX, buffer and LUT in the FPGA. In this paper, low overhead design of FPGA with fine-grained body biasing is described. The FPGA is designed and fabricated on 65-nm SOTB CMOS technology. By not only adopting a customized design rule specifying that reliability is verified by TEGs but downsizing a body bias selector, the FPGA tile area becomes small by 39% compared with the conventional design, resulting in 900 FPGA tiles with 4,4000 programmable Vt regions. In addition, the chip performance is evaluated by implementing 32-bit binary counter in the supply voltage range of 0.5V from 1.2V. The counter circuit operates at a frequency of 72MHz and 14MHz with the supply voltage of 1.2V and 0.5V respectively. The static power saving of 80% in elemental circuits of the FPGA at 0.5-V supply voltage and 0.5-V reverse body bias voltage is achieved in the best case. In the whole chip including configuration memory and body bias selector in addition to elemental circuits, effective static power reduction around 30% is maintained by applying 0.3-V reverse body bias voltage at each supply voltage.

81-100hit(330hit)