IEICE global.ieice.org Site

Keyword Search Result

[Keyword] high-level synthesis(66hit)

1-20hit(66hit)

A 0.13 mJ/Prediction CIFAR-100 Fully Synthesizable Raster-Scan-Based Wired-Logic Processor in 16-nm FPGA Open Access
Dongzhu LI Zhijie ZHAN Rei SUMIKAWA Mototsugu HAMADA Atsutake KOSUGE Tadahiro KURODA

PAPER

Pubricized:
2023/11/24
Vol:
E107-C No:6
Page(s):
155-162
A 0.13mJ/prediction with 68.6% accuracy wired-logic deep neural network (DNN) processor is developed in a single 16-nm field-programmable gate array (FPGA) chip. Compared with conventional von-Neumann architecture DNN processors, the energy efficiency is greatly improved by eliminating DRAM/BRAM access. A technical challenge for conventional wired-logic processors is the large amount of hardware resources required for implementing large-scale neural networks. To implement a large-scale convolutional neural network (CNN) into a single FPGA chip, two technologies are introduced: (1) a sparse neural network known as a non-linear neural network (NNN), and (2) a newly developed raster-scan wired-logic architecture. Furthermore, a novel high-level synthesis (HLS) technique for wired-logic processor is proposed. The proposed HLS technique enables the automatic generation of two key components: (1) Verilog-hardware description language (HDL) code for a raster-scan-based wired-logic processor and (2) test bench code for conducting equivalence checking. The automated process significantly mitigates the time and effort required for implementation and debugging. Compared with the state-of-the-art FPGA-based processor, 238 times better energy efficiency is achieved with only a slight decrease in accuracy on the CIFAR-100 task. In addition, 7 times better energy efficiency is achieved compared with the state-of-the-art network-optimized application-specific integrated circuit (ASIC).
CLAHE Implementation and Evaluation on a Low-End FPGA Board by High-Level Synthesis
Koki HONDA Kaijie WEI Masatoshi ARAI Hideharu AMANO

PAPER

Pubricized:
2021/07/12
Vol:
E104-D No:12
Page(s):
2048-2056
Automobile companies have been trying to replace side mirrors of cars with small cameras for reducing air resistance. It enables us to apply some image processing to improve the quality of the image. Contrast Limited Adaptive Histogram Equalization (CLAHE) is one of such techniques to improve the quality of the image for the side mirror camera, which requires a large computation performance. Here, an implementation method of CLAHE on a low-end FPGA board by high-level synthesis is proposed. CLAHE has two main processing parts: cumulative distribution function (CDF) generation, and bilinear interpolation. During the CDF generation, the effect of increasing loop initiation interval can be greatly reduced by placing multiple Processing Elements (PEs). and during the interpolation, latency and BRAM usage were reduced by revising how to hold CDF and calculation method. Finally, by connecting each module with streaming interfaces, using data flow pragmas, overlapping processing, and hiding data transfer, our HLS implementation achieved a comparable result to that of HDL. We parameterized the components of the algorithm so that the number of tiles and the size of the image can be easily changed. The source code for this research can be downloaded from https://github.com/kokihonda/fpga_clahe.
Reconfigurable 3D Sound Processor and Its Automatic Design Environment Using High-Level Synthesis
Saya OHIRA Naoki TSUCHIYA Tetsuya MATSUMURA

PAPER

Vol:
E102-A No:12
Page(s):
1804-1812
We propose a three-dimensional (3D) sound processor architecture that includes super-directional modulation intellectual property (IP) and 3D sound processing IP and for consumer applications. In addition, we also propose an automatic design environment for 3D sound processing IP. This processor can generate realistic small sound fields in arbitrary spaces using ultrasound. In particular, in the 3D sound processing IP, in order to reproduce 3D audio, it is necessary to reproduce the personal frequency characteristics of complex head related transfer functions. For this reason, we have constructed an automatic design environment with high reconfigurability. This automatic design environment is based on high-level synthesis, and it is possible to automatically generate a C-based algorithm simulator and automatically synthesize the IP hardware by inputting a parameter description file for filter design. This automatic design environment can reduce the design period to approximately 1/5 as compared with conventional manual design. Applying the automatic design environment, a 3D sound processing IP was designed experimentally. The designed IP can be sufficiently applied to consumer applications from the viewpoints of hardware amount and power consumption.
VHDL vs. SystemC: Design of Highly Parameterizable Artificial Neural Networks
David ALEDO Benjamin CARRION SCHAFER Félix MORENO

PAPER-Computer System

Pubricized:
2018/11/29
Vol:
E102-D No:3
Page(s):
512-521
This paper describes the advantages and disadvantages observed when describing complex parameterizable Artificial Neural Networks (ANNs) at the behavioral level using SystemC and at the Register Transfer Level (RTL) using VHDL. ANNs are complex to parameterize because they have a configurable number of layers, and each one of them has a unique configuration. This kind of structure makes ANNs, a priori, challenging to parameterize using Hardware Description Languages (HDL). Thus, it seems intuitively that ANNs would benefit from the raise in level of abstraction from RTL to behavioral level. This paper presents the results of implementing an ANN using both levels of abstractions. Results surprisingly show that VHDL leads to better results and allows a much higher degree of parameterization than SystemC. The implementation of these parameterizable ANNs are made open source and are freely available online. Finally, at the end of the paper we make some recommendation for future HLS tools to improve their parameterization capabilities.
A Describing Method of an Image Processing Software in C for a High-Level Synthesis Considering a Function Chaining
Akira YAMAWAKI Seiichi SERIKAWA

PAPER-Design Methodology and Platform

Pubricized:
2017/11/17
Vol:
E101-D No:2
Page(s):
324-334
This paper shows a describing method of an image processing software in C for high-level synthesis (HLS) technology considering function chaining to realize an efficient hardware. A sophisticated image processing would be built on the sequence of several primitives represented as sub-functions like the gray scaling, filtering, binarization, thinning, and so on. Conventionally, generic describing methods for each sub-function so that HLS technology can generate an efficient hardware module have been shown. However, few studies have focused on a systematic describing method of the single top function consisting of the sub-functions chained. According to the proposed method, any number of sub-functions can be chained, maintaining the pipeline structure. Thus, the image processing can achieve the near ideal performance of 1 pixel per clock even when the processing chain is long. In addition, implicitly, the deadlock due to the mismatch of the number of pushes and pops on the FIFO connecting the functions is eliminated and the interpolation of the border pixels is done. The case study on a canny edge detection including the chain of some sub-functions demonstrates that our proposal can easily realize the expected hardware mentioned above. The experimental results on ZYNQ FPGA show that our proposal can be converted to the pipelined hardware with moderate size and achieve the performance gain of more than 70 times compared to the software execution. Moreover, the reconstructed C software program following our proposed method shows the small performance degradation of 8% compared with the pure C software through a comparative evaluation preformed on the Cortex A9 embedded processor in ZYNQ FPGA. This fact indicates that a unified image processing library using HLS software which can be executed on CPU or hardware module for HW/SW co-design can be established by using our proposed describing method.
A Bitwidth-Aware High-Level Synthesis Algorithm Using Operation Chainings for Tiled-DR Architectures
Kotaro TERADA Masao YANAGISAWA Nozomu TOGAWA

PAPER

Vol:
E100-A No:12
Page(s):
2911-2924
As application hardware designs and implementations in a short term are required, high-level synthesis is more and more essential EDA technique nowadays. In deep-submicron era, interconnection delays are not negligible even in high-level synthesis thus distributed-register and -controller architectures (DR architectures) have been proposed in order to cope with this problem. It is also profitable to take data-bitwidth into account in high-level synthesis. In this paper, we propose a bitwidth-aware high-level synthesis algorithm using operation chainings targeting Tiled-DR architectures. Our proposed algorithm optimizes bitwidths of functional units and utilizes the vacant tiles by adding some extra functional units to realize effective operation chainings to generate high performance circuits without increasing the total area. Experimental results show that our proposed algorithm reduces the overall latency by up to 47% compared to the conventional approach without area overheads by eliminating unnecessary bitwidths and adding efficient extra FUs for Tiled-DR architectures.
A Floorplan Aware High-Level Synthesis Algorithm with Body Biasing for Delay Variation Compensation
Koki IGAWA Masao YANAGISAWA Nozomu TOGAWA

PAPER

Vol:
E100-A No:7
Page(s):
1439-1451
In this paper, we propose a floorplan aware high-level synthesis algorithm with body biasing for delay variation compensation, which minimizes the average leakage energy of manufactured chips. In order to realize floorplan-aware high-level synthesis, we utilize huddle-based distributed register architecture (HDR architecture). HDR architecture divides the chip area into small partitions called a huddle and we can control a body bias voltage for every huddle. During high-level synthesis, we iteratively obtain expected leakage energy for every huddle when applying a body bias voltage. A huddle with smaller expected leakage energy contributes to reducing expected leakage energy of the entire circuit more but can increase the latency. We assign control-data flow graph (CDFG) nodes in non-critical paths to the huddles with larger expected leakage energy and those in critical paths to the huddles with smaller expected leakage energy. We expect to minimize the entire leakage energy in a manufactured chip without increasing its latency. Experimental results show that our algorithm reduces the average leakage energy by up to 39.7% without latency and yield degradation compared with typical-case design with body biasing.
Area-Efficient Soft-Error Tolerant Datapath Synthesis Based on Speculative Resource Sharing
Junghoon OH Mineo KANEKO

PAPER

Vol:
E99-A No:7
Page(s):
1311-1322
As semiconductor technologies have advanced, the reliability problem caused by soft-errors is becoming one of the serious issues in LSIs. Moreover, multiple component errors due to single soft-errors also have become a serious problem. In this paper, we propose a method to synthesize multiple component soft-error tolerant application-specific datapaths via high-level synthesis. The novel feature of our method is speculative resource sharing between the retry parts and the secondary parts for time overhead mitigation. A scheduling algorithm using a special priority function to maximize speculative resource sharing is also an important feature of this study. Our approach can reduce the latency (schedule length) in many applications without deterioration of reliability and chip area compared with conventional datapaths without speculative resource sharing. We also found that our method is more effective when a computation algorithm possesses higher parallelism and a smaller number of resources is available.
A Multi-Scenario High-Level Synthesis Algorithm for Variation-Tolerant Floorplan-Driven Design
Koki IGAWA Masao YANAGISAWA Nozomu TOGAWA

PAPER

Vol:
E99-A No:7
Page(s):
1278-1293
In order to tackle a process-variation problem, we can define several scenarios, each of which corresponds to a particular LSI behavior, such as a typical-case scenario and a worst-case scenario. By designing a single LSI chip which realizes multiple scenarios simultaneously, we can have a process-variation-tolerant LSI chip. In this paper, we propose a multi-scenario high-level synthesis algorithm for variation-tolerant floorplan-driven design targeting new distributed-register architectures, called HDR architectures. We assume two scenarios, a typical-case scenario and a worst-case scenario, and realize them onto a single chip. We first schedule/bind each of the scenarios independently. After that, we commonize the scheduling/binding results for the typical-case and worst-case scenarios and thus generate a commonized area-minimized floorplan result. At that time, we can explicitly take into account interconnection delays by using distributed-register architectures. Experimental results show that our algorithm reduces the latency of the typical-case scenario by up to 50% without increasing the latency of the worst-case scenario, compared with several existing methods.
Interconnection-Delay and Clock-Skew Estimate Modelings for Floorplan-Driven High-Level Synthesis Targeting FPGA Designs
Koichi FUJIWARA Kazushi KAWAMURA Masao YANAGISAWA Nozomu TOGAWA

PAPER

Vol:
E99-A No:7
Page(s):
1294-1310
Recently, high-level synthesis techniques for FPGA designs (FPGA-HLS techniques) are strongly required in various applications. Both interconnection delays and clock skews have a large impact on circuit performance implemented onto FPGA, which indicates the need for floorplan-driven FPGA-HLS algorithms considering them. To appropriately estimate interconnection delays and clock skews at HLS phase, a reasonable model to estimate them becomes essential. In this paper, we demonstrate several experiments to characterize interconnection delays and clock skews in FPGA and propose novel estimate models called “IDEF” and “CSEF”. In order to evaluate our models, we integrate them into a conventional floorplan-driven FPGA-HLS algorithm. Experimental results demonstrate that our algorithm can realize FPGA designs which reduce the latency by up to 22% compared with conventional approaches.
A High-Level Synthesis Algorithm with Inter-Island Distance Based Operation Chainings for RDR Architectures
Kotaro TERADA Masao YANAGISAWA Nozomu TOGAWA

PAPER

Vol:
E98-A No:7
Page(s):
1366-1375
In deep-submicron era, interconnection delays are not negligible even in high-level synthesis and regular-distributed-register architectures (RDR architectures) have been proposed to cope with this problem. In this paper, we propose a high-level synthesis algorithm using operation chainings which reduces the overall latency targeting RDR architectures. Our algorithm consists of three steps: The first step enumerates candidate operations for chaining. The second step introduces maximal chaining distance (MCD), which gives the maximal allowable inter-island distance on RDR architecture between chaining candidate operations. The last step performs list-scheduling and binding simultaneously based on the results of the two preceding steps. Our algorithm enumerates feasible chaining candidates and selects the best ones for RDR architecture. Experimental results show that our proposed algorithm reduces the latency by up to 40.0% compared to the original approach, and by up to 25.0% compared to a conventional approach. Our algorithm also reduces the number of registers and the number of multiplexers compared to the conventional approaches in some cases.
A Floorplan-Driven High-Level Synthesis Algorithm for Multiplexer Reduction Targeting FPGA Designs
Koichi FUJIWARA Kazushi KAWAMURA Shin-ya ABE Masao YANAGISAWA Nozomu TOGAWA

PAPER

Vol:
E98-A No:7
Page(s):
1392-1405
Recently, high-level synthesis (HLS) techniques for FPGA designs are required in various applications such as computerized stock tradings and reconfigurable network processings. In HLS for FPGA designs, we need to consider module floorplan and reduce multiplexer's cost concurrently. In this paper, we propose a floorplan-driven HLS algorithm for multiplexer reduction targeting FPGA designs. By utilizing distributed-register architectures called HDR, we can easily consider module floorplan in HLS. In order to reduce multiplexer's cost, we propose two novel binding methods called datapath-oriented scheduling/FU binding and datapath-oriented register binding. Experimental results demonstrate that our algorithm can realize FPGA designs which reduce the number of slices by up to 47% and latency by up to 22% compared with conventional approaches while the number of required control steps is almost the same.
An Energy-Efficient Floorplan Driven High-Level Synthesis Algorithm for Multiple Clock Domains Design
Shin-ya ABE Youhua SHI Kimiyoshi USAMI Masao YANAGISAWA Nozomu TOGAWA

PAPER

Vol:
E98-A No:7
Page(s):
1376-1391
In this paper, we first propose an HDR-mcd architecture, which integrates periodically all-in-phase based multiple clock domains and multi-cycle interconnect communication into high-level synthesis. In HDR-mcd, an entire chip is divided into several huddles. Huddles can realize synchronization between different clock domains in which interconnection delay should be considered during high-level synthesis. Next, we propose a high-level synthesis algorithm for HDR-mcd, which can reduce energy consumption by optimizing configuration and placement of huddles. Experimental results show that the proposed method achieves 32.5% energy-saving compared with the existing single clock domain based methods.
Performance Modeling of Stencil Computing on a Stream-Based FPGA Accelerator for Efficient Design Space Exploration
Keisuke DOHI Koji OKINA Rie SOEJIMA Yuichiro SHIBATA Kiyoshi OGURI

PAPER-Application

Pubricized:
2014/11/19
Vol:
E98-D No:2
Page(s):
298-308
In this paper, we discuss performance modeling of 3-D stencil computing on an FPGA accelerator with a high-level synthesis environment, aiming for efficient exploration of user-space design parameters. First, we analyze resource utilization and performance to formulate these relationships as mathematical models. Then, in order to evaluate our proposed models, we implement heat conduction simulations as a benchmark application, by using MaxCompiler, which is a high-level synthesis tool for FPGAs, and MaxGenFD, which is a domain specific framework of the MaxCompiler for finite-difference equation solvers. The experimental results with various settings of architectural design parameters show the best combination of design parameters for pipeline structure can be systematically found by using our models. The effects of changing arithmetic accuracy and using data stream compression are also discussed.
An Energy-Efficient Patchable Accelerator and Its Design Methods
Hiroaki YOSHIDA Masayuki WAKIZAKA Shigeru YAMASHITA Masahiro FUJITA

PAPER-High-Level Synthesis and System-Level Design

Vol:
E97-A No:12
Page(s):
2507-2517
With the shorter time-to-market and the rising cost in SoC development, the demand for post-silicon programmability has been increasing. Recently, programmable accelerators have attracted more attention as an enabling solution for post-silicon engineering change. However, programmable accelerators suffers from 5∼10X less energy efficiency than fixed-function accelerators mainly due to their extensive use of memories. This paper proposes a highly energy-efficient accelerator which enables post-silicon engineering change by a control patching mechanism. Then, we propose a patch compilation method from a given pair of an original design and a modified design. We also propose a design method to add redundant wires in advance to decrease the necessary amount of patch memory for post-silicon engineering change. Experimental results demonstrate that the proposed accelerators offer high energy efficiency competitive to fixed-function accelerators and can achieve about 5X higher efficiency than the existing programmable accelerators. We also show the trade-off between redundant wires and the necessary amount of patch memory.
Nested Loop Parallelization Using Polyhedral Optimization in High-Level Synthesis
Akihiro SUDA Hideki TAKASE Kazuyoshi TAKAGI Naofumi TAKAGI

PAPER-High-Level Synthesis and System-Level Design

Vol:
E97-A No:12
Page(s):
2498-2506
We propose a synthesis method of nested loops into parallelized circuits by integrating the polyhedral optimization, which is a state-of-the-art technique in the field of software, into high-level synthesis. Our method constructs circuits equipped with multiple processing elements (PEs), using information generated by the polyhedral optimizing compiler. Since multiple PEs cannot concurrently access the off-chip RAM, a method for constructing on-chip buffers is also proposed. Our buffering method reduces the off-chip RAM access conflicts and further enables burst accesses and data reuses. In our experimental result, the buffered circuits generated by our method are 8.2 times on average and 26.5 times at maximum faster than the sequential non-buffered ones, when each of the parallelized circuits is configured with eight PEs.
Mobility Overlap-Removal-Based Leakage Power and Register-Aware Scheduling in High-Level Synthesis
Nan WANG Song CHEN Wei ZHONG Nan LIU Takeshi YOSHIMURA

PAPER-VLSI Design Technology and CAD

Vol:
E97-A No:8
Page(s):
1709-1719
Scheduling is a key problem in high level synthesis, as the scheduling results affect most of the important design metrics. In this paper, we propose a novel scheduling method to simultaneously optimize the leakage power of functional units with dual-Vth techniques and the number of registers under given timing and resource constraints. The mobility overlaps between operations are removed to eliminate data dependencies, and a simulated-annealing-based method is introduced to explore the mobility overlap removal solution space. Given the overlap-free mobilities, the resource usage and register usage in each control step can be accurately estimated. Meanwhile, operations are scheduled so as to optimize the leakage power of functional units with minimal number of registers. Then, a set of operations is iteratively selected, reassigned as low-Vth, and rescheduled until the resource constraints are all satisfied. Experimental results show the efficiency of the proposed algorithm.
Floorplan Driven Architecture and High-Level Synthesis Algorithm for Dynamic Multiple Supply Voltages
Shin-ya ABE Youhua SHI Kimiyoshi USAMI Masao YANAGISAWA Nozomu TOGAWA

PAPER-High-Level Synthesis and System-Level Design

Vol:
E96-A No:12
Page(s):
2597-2611
In this paper, we propose an adaptive voltage huddle-based distributed-register architecture (AVHDR architecture), which integrates dynamic multiple supply voltages and interconnection delay into high-level synthesis. In AVHDR architecture, voltages can be dynamically assigned for energy reduction. In other words, low supply voltages are assigned to non-critical operations, and leakage power is cut off by turning off the power supply to the sleeping functional units. Next, an AVHDR-based high-level synthesis algorithm is proposed. Our algorithm is based on iterative improvement of scheduling/binding and floorplanning. In the iteration process, the modules in each huddle can be placed close to each other and the corresponding AVHDR architecture can be generated and optimized with floorplanning information. Experimental results show that on average our algorithm achieves 43.9% energy-saving compared with conventional algorithms.
Dual-Edge-Triggered Flip-Flop-Based High-Level Synthesis with Programmable Duty Cycle
Keisuke INOUE Mineo KANEKO

PAPER-VLSI Design Technology and CAD

Vol:
E96-A No:12
Page(s):
2689-2697
This paper addresses a high-level synthesis (HLS) using dual-edge-triggered flip-flops (DETFFs) as memory elements. In DETFF-based HLS, the duty cycle becomes a manageable resource to improve the timing performance. To utilize the duty cycle radically, a programmable duty cycle (PDC) mechanism is built into this HLS, and captured by a new HLS task named PDC scheduling. As a first step toward DETFF-based HLS with PDC, the execution time minimization problem is formulated for given results of operation scheduling. A linear program is presented to solve this problem in polynomial time. As a next step, simultaneous operation scheduling and PDC scheduling problem for the same objective is tackled. A mixed integer linear programming-based (MILP) approach is presented to solve this problem. The experimental results show that the MILP can reduce the execution time for several benchmarks.
Heuristic and Exact Resource Binding Algorithms for Storage Optimization Using Flip-Flops and Latches
Keisuke INOUE Mineo KANEKO

PAPER-VLSI Design Technology and CAD

Vol:
E96-A No:8
Page(s):
1712-1722
A mixed storage-type design using flip-flops and latches (FF/latch-based design) has advantages on such as area and power compared to single storage-type design (only flip-flops or latches). Considering FF/latch-based design at high-level synthesis is necessary, because resource binding process significantly affects the quality of resulting circuits. One of the fundamental aspects in FF/latch-based design is that different resource binding solutions could lead to the different numbers of latch-replacable registers. Therefore, as a first step, this paper addresses a datapath design problem in which resource binding and selecting storage-types of registers are simultaneously optimized for datapath area minimization (i.e., latch replacement maximization). An efficient algorithm based on the compatibility path decomposition and an integer linear programming-based exact approach are presented. Experiments confirm the effectiveness of the proposed approaches.

1-20hit(66hit)

Keyword Search Result

[Keyword] high-level synthesis(66hit)

A 0.13 mJ/Prediction CIFAR-100 Fully Synthesizable Raster-Scan-Based Wired-Logic Processor in 16-nm FPGA Open Access

CLAHE Implementation and Evaluation on a Low-End FPGA Board by High-Level Synthesis

Reconfigurable 3D Sound Processor and Its Automatic Design Environment Using High-Level Synthesis

VHDL vs. SystemC: Design of Highly Parameterizable Artificial Neural Networks

A Describing Method of an Image Processing Software in C for a High-Level Synthesis Considering a Function Chaining

A Bitwidth-Aware High-Level Synthesis Algorithm Using Operation Chainings for Tiled-DR Architectures

A Floorplan Aware High-Level Synthesis Algorithm with Body Biasing for Delay Variation Compensation

Area-Efficient Soft-Error Tolerant Datapath Synthesis Based on Speculative Resource Sharing

A Multi-Scenario High-Level Synthesis Algorithm for Variation-Tolerant Floorplan-Driven Design

Interconnection-Delay and Clock-Skew Estimate Modelings for Floorplan-Driven High-Level Synthesis Targeting FPGA Designs

A High-Level Synthesis Algorithm with Inter-Island Distance Based Operation Chainings for RDR Architectures

A Floorplan-Driven High-Level Synthesis Algorithm for Multiplexer Reduction Targeting FPGA Designs

An Energy-Efficient Floorplan Driven High-Level Synthesis Algorithm for Multiple Clock Domains Design

Performance Modeling of Stencil Computing on a Stream-Based FPGA Accelerator for Efficient Design Space Exploration

An Energy-Efficient Patchable Accelerator and Its Design Methods

Nested Loop Parallelization Using Polyhedral Optimization in High-Level Synthesis

Mobility Overlap-Removal-Based Leakage Power and Register-Aware Scheduling in High-Level Synthesis

Floorplan Driven Architecture and High-Level Synthesis Algorithm for Dynamic Multiple Supply Voltages

Dual-Edge-Triggered Flip-Flop-Based High-Level Synthesis with Programmable Duty Cycle

Heuristic and Exact Resource Binding Algorithms for Storage Optimization Using Flip-Flops and Latches

Latest Issue

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles