Hoang-Gia VU Shinya TAKAMAEDA-YAMAZAKI Takashi NAKADA Yasuhiko NAKASHIMA
Modern FPGAs have been integrated in computing systems as accelerators for long running applications. This integration puts more pressure on the fault tolerance of computing systems, and the requirement for dependability becomes essential. As in the case of CPU-based system, checkpoint/restart techniques are also expected to improve the dependability of FPGA-based computing. Three issues arise in this situation: how to checkpoint and restart FPGAs, how well this checkpoint/restart model works with the checkpoint/restart model of the whole computing system, and how to build the model by a software tool. In this paper, we first present a new checkpoint/restart architecture along with a checkpointing mechanism on FPGAs. We then propose a method to capture consistent snapshots of FPGA and the rest of the computing system. Third, we provide “fine-grained” management for checkpointing to reduce performance degradation. For the host CPU, we also provide a stack which includes API functions to manage checkpoint/restart procedures on FPGAs. Fourth, we present a Python-based tool to insert checkpointing infrastructure. Experimental results show that the checkpointing architecture causes less than 10% maximum clock frequency degradation, low checkpointing latencies, small memory footprints, and small increases in power consumption, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).
Takeshi OHKAWA Kazushi YAMASHINA Hitomi KIMURA Kanemitsu OOTSU Takashi YOKOTA
A component-oriented FPGA design platform is proposed for robot system integration. FPGAs are known to be a power-efficient hardware platform, but the development cost of FPGA-based systems is currently too high to integrate them into robot systems. To solve this problem, we propose an FPGA component that allows FPGA devices to be easily integrated into robot systems based on the Robot Operating System (ROS). ROS-compliant FPGA components offer a seamless interface between the FPGA hardware and software running on the CPU. Two experiments were conducted using the proposed components. For the first experiment, the results show that the execution time of an FPGA component for image processing was 1.7 times faster than that of the original software-based component and was 2.51 times more power efficient than an ordinary PC processor, despite substantial communication overhead. The second experiment showed that an FPGA component for sensor fusion was able to process multiple sensor inputs efficiently and with very low latency via parallel processing.
Akira JINGUJI Shimpei SATO Hiroki NAKAHARA
A random forest (RF) is a kind of ensemble machine learning algorithm used for a classification and a regression. It consists of multiple decision trees that are built from randomly sampled data. The RF has a simple, fast learning, and identification capability compared with other machine learning algorithms. It is widely used for application to various recognition systems. Since it is necessary to un-balanced trace for each tree and requires communication for all the ones, the random forest is not suitable in SIMD architectures such as GPUs. Although the accelerators using the FPGA have been proposed, such implementations were based on HDL design. Thus, they required longer design time than the soft-ware based realizations. In the previous work, we showed the high-level synthesis design of the RF including the fully pipelined architecture and the all-to-all communication. In this paper, to further reduce the amount of hardware, we use k-means clustering to share comparators of the branch nodes on the decision tree. Also, we develop the krange tool flow, which generates the bitstream with a few number of hyper parameters. Since the proposed tool flow is based on the high-level synthesis design, we can obtain the high performance RF with short design time compared with the conventional HDL design. We implemented the RF on the Xilinx Inc. ZC702 evaluation board. Compared with the CPU (Intel Xeon (R) E5607 Processor) and the GPU (NVidia Geforce Titan) implementations, as for the performance, the FPGA realization was 8.4 times faster than the CPU one, and it was 62.8 times faster than the GPU one. As for the power consumption efficiency, the FPGA realization was 7.8 times better than the CPU one, and it was 385.9 times better than the GPU one.
Hidenori GYOTEN Masayuki HIROMOTO Takashi SATO
An area-efficient FPGA-based annealing processor that is based on Ising model is proposed. The proposed processor eliminates random number generators (RNGs) and temperature schedulers, which are the key components in the conventional annealing processors and occupying a large portion of the design. Instead, a shift-register-based spin flipping scheme successfully helps the Ising model from stucking in the local optimum solutions. An FPGA implementation and software-based evaluation on max-cut problems of 2D-grid torus structure demonstrate that our annealing processor solves the problems 10-104 times faster than conventional optimization algorithms to obtain the solution of equal accuracy.
Tomoya FUJII Shimpei SATO Hiroki NAKAHARA
For a pre-trained deep convolutional neural network (CNN) for an embedded system, a high-speed and a low power consumption are required. In the former of the CNN, it consists of convolutional layers, while in the latter, it consists of fully connection layers. In the convolutional layer, the multiply accumulation operation is a bottleneck, while the fully connection layer, the memory access is a bottleneck. The binarized CNN has been proposed to realize many multiply accumulation circuit on the FPGA, thus, the convolutional layer can be done with a high-seed operation. However, even if we apply the binarization to the fully connection layer, the amount of memory was still a bottleneck. In this paper, we propose a neuron pruning technique which eliminates almost part of the weight memory, and we apply it to the fully connection layer on the binarized CNN. In that case, since the weight memory is realized by an on-chip memory on the FPGA, it achieves a high-speed memory access. To further reduce the memory size, we apply the retraining the CNN after neuron pruning. In this paper, we propose a sequential-input parallel-output fully connection layer circuit for the binarized fully connection layer, while proposing a streaming circuit for the binarized 2D convolutional layer. The experimental results showed that, by the neuron pruning, as for the fully connected layer on the VGG-11 CNN, the number of neurons was reduced by 39.8% with keeping the 99% baseline accuracy. We implemented the neuron pruning CNN on the Xilinx Inc. Zynq Zedboard. Compared with the ARM Cortex-A57, it was 1773.0 times faster, it dissipated 3.1 times lower power, and its performance per power efficiency was 5781.3 times better. Also, compared with the Maxwell GPU, it was 11.1 times faster, it dissipated 7.7 times lower power, and its performance per power efficiency was 84.1 times better. Thus, the binarized CNN on the FPGA is suitable for the embedded system.
Akira YAMAWAKI Seiichi SERIKAWA
This paper shows a describing method of an image processing software in C for high-level synthesis (HLS) technology considering function chaining to realize an efficient hardware. A sophisticated image processing would be built on the sequence of several primitives represented as sub-functions like the gray scaling, filtering, binarization, thinning, and so on. Conventionally, generic describing methods for each sub-function so that HLS technology can generate an efficient hardware module have been shown. However, few studies have focused on a systematic describing method of the single top function consisting of the sub-functions chained. According to the proposed method, any number of sub-functions can be chained, maintaining the pipeline structure. Thus, the image processing can achieve the near ideal performance of 1 pixel per clock even when the processing chain is long. In addition, implicitly, the deadlock due to the mismatch of the number of pushes and pops on the FIFO connecting the functions is eliminated and the interpolation of the border pixels is done. The case study on a canny edge detection including the chain of some sub-functions demonstrates that our proposal can easily realize the expected hardware mentioned above. The experimental results on ZYNQ FPGA show that our proposal can be converted to the pipelined hardware with moderate size and achieve the performance gain of more than 70 times compared to the software execution. Moreover, the reconstructed C software program following our proposed method shows the small performance degradation of 8% compared with the pure C software through a comparative evaluation preformed on the Cortex A9 embedded processor in ZYNQ FPGA. This fact indicates that a unified image processing library using HLS software which can be executed on CPU or hardware module for HW/SW co-design can be established by using our proposed describing method.
Motoki AMAGASAKI Masato IKEBE Qian ZHAO Masahiro IIDA Toshinori SUEYOSHI
Three-dimensional (3D) field-programmable gate arrays (FPGAs) are expected to offer higher logic density as well as improved delay and power performance by utilizing 3D integrated circuit technology. However, because through-silicon-vias (TSVs) for conventional 3D FPGA interlayer connections have a large area overhead, there is an inherent tradeoff between connectivity and small size. To find a balance between cost and performance, and to explore 3D FPGAs with realistic 3D integration processes, we propose two types of 3D FPGA and construct design tool sets for architecture exploration. In previous research, we created a TSV-free 3D FPGA with a face-down integration method; however, this was limited to two layers. In this paper, we discuss the face-up stacking of several face-down stacked FPGAs. To minimize the number of TSVs, we placed TSVs peripheral to the FPGAs for 3D-FPGA with 4 layers. According to our results, a 2-layer 3D FPGA has reasonable performance when limiting the design to two layers, but a 4-layer 3D FPGA is a better choice when area is emphasized.
Yasuaki OHIRA Takahiro MATSUMOTO Hideyuki TORII Yuta IDA Shinya MATSUFUJI
In this paper, we propose a new structure for a compact matched filter bank (MFB) for an optical zero-correlation zone (ZCZ) sequence set with Zcz=2z. The proposed MFB can reduces operation elements such as 2-input adders and delay elements. The number of 2-input adders decrease from O(N2) to O(N log2 N), delay elements decrease from O(N2) to O(N). In addition, the proposed MFBs for the sequence of length 32, 64, 128 and 256 with Zcz=2,4 and 8 are implemented on a field programmable gate array (FPGA). As a result, the numbers of logic elements (LEs) of the proposed MFBs for the sequences with Zcz=2 of length 32, 64, 128 and 256 are suppressed to about 76.2%, 84.2%, 89.7% and 93.4% compared to that of the conventional MFBs, respectively.
Kentaro KATO Somsak CHOOMCHUAY
This paper analyzes the time domain Reed Solomon Decoder with FPGA implementation. Data throughput and area is carefully evaluated compared with typical frequency domain Reed Solomon Decoder. In this analysis, three hardware architecture to enhance the data throughput, namely, the pipelined architecture, the parallel architecture, and the truncated arrays, is evaluated, too. The evaluation reveals that the number of the consumed resources of RS(255, 239) is about 20% smaller than those of the frequency domain decoder although data throughput is less than 10% of the frequency domain decoder. The number of the consumed resources of the pipelined architecture is 28% smaller than that of the parallel architecture when data throughput is same. It is because the pipeline architecture requires less extra logics than the parallel architecture. To get higher data throughput, the pipelined architecture is better than the parallel architecture from the viewpoint of consumed resources.
Yu SUZUKI Masato ITO Satoshi KANDA Kousuke IMAMURA Yoshio MATSUDA Tetsuya MATSUMURA
This paper describes the design and implementation of a real-time optical flow processor using a single field-programmable gate array (FPGA) chip. By introducing the modified initial flow generation method, the successive over-relaxation (SOR) method for both layers, the optimization of the reciprocal operation method, and the image division method, it is now possible to both reduce hardware requirements and improve flow accuracy. Additionally, by introducing a pipeline structure to this processor, high-throughput hardware implementation could be achieved. Total logic cell (LC) amounts and processer memory capacity are reduced by about 8% and 16%, respectively, compared to our previous hierarchical optical flow estimation (HOE) processor. The results of our evaluation confirm that this processor can perform 30 fps wide extended graphics array (WXGA) 175.7MHz real-time optical flow processing with a single FPGA.
Ho-Seong KIM Pil-Ho LEE Jin-Wook HAN Seung-Hun SHIN Seung-Wuk BAEK Doo-Ill PARK Yongkyu SEO Young-Chan JANG
A 10 Gbps transmitter bridge chip including four data lanes, which increases the bandwidth using an 8-to-1 serialization, is proposed for a field-programmable gate array (FPGA)-based frame generator to support the protocol of the D-PHY version 1.2 for the mobile industry processor interface (MIPI) display serial interface (DSI).
Qiong WANG Mohamed EL-HADEDY Kevin SKADRON Ke WANG
Motif searching, i.e., identifying meaningful patterns from biological data, has been studied extensively due to its importance in the biomedical sciences. In this work, we seek to improve the performance of Weeder, a widely-used tool for automatic de novo motif searching. Weeder consists of several functions, among which we find that the function oligo_scan, which handles the pattern matching, is the bottleneck, especially when dealing with large datasets. Motivated by this observation, we adopt the Micron Automata Processor (AP) to accelerate the pattern-matching stage of Weeder. The AP is a massively-parallel, non-von-Neumann semiconductor architecture that is purpose-built for symbolic pattern matching. Relying on the fact that AP is capable of performing matching for thousands of patterns in parallel, we develop an AP-accelerated Weeder implementation in this work. In particular, we describe how to map Weeder's pattern matching to the AP chip and use the high-end FPGA on the AP board to postprocess the output from AP. Our experiment shows that the AP-accelerated Weeder achieves 751x speedup on pattern matching, compared to a single-threaded CPU implementation.
An energy-efficient nonvolatile FPGA with assuring highly-reliable backup operation using a self-terminated power-gating scheme is proposed. Since the write current is automatically cut off just after the temporal data in the flip-flop is successfully backed up in the nonvolatile device, the amount of write energy can be minimized with no write failure. Moreover, when the backup operation in a particular cluster is completed, power supply of the cluster is immediately turned off, which minimizes standby energy due to leakage current. In fact, the total amount of energy consumption during the backup operation is reduced by 66% in comparison with that of a conventional worst-case-based approach where the long time write current pulse is used for the reliable write.
Kenji KANAZAWA Tsutomu MARUYAMA
WalkSAT (WSAT) is one of the best performing stochastic local search algorithms for the Boolean Satisfiability (SAT) and the Maximum Boolean Satisfiability (MaxSAT). WSAT is very suitable for hardware acceleration because of its high inherent parallelism. Formal verification of digital circuits is one of the most important applications of SAT and MaxSAT. Structural knowledge such as logic gates and their dependencies can be derived from SAT/MaxSAT instances generated from formal verification of digital circuits. Such that knowledge is useful to solve these instances efficiently. In this paper, we first discuss a heuristic to utilize the structural knowledge for solving these problems by using WSAT. Then, we show its implementation on FPGA. The problem size of the formal verification is typically very large, and most data have to be placed in off-chip DRAMs. In this situation, the acceleration by FPGA is limited by the throughput and access latency of the DRAMs. In our implementation, data are carefully mapped on the on-chip memory banks and off-chip DRAMs so that most data in the off-chip DRAMs can be continuously accessed using burst-read. Furthermore, a variable-way cache memory comprised of the on-chip memory banks is used in order to hide the DRAM access latency by caching the head portion of the continuous read from the DRAMs and giving them to the circuit till the rest portion is started to be given by the burst-read. We evaluate the performance of our proposed method by changing configuration of the variable-way cache and the processing parallelism, and discuss how much acceleration can be achieved.
Theint Theint THU Jimpei HAMAMURA Rie SOEJIMA Yuichiro SHIBATA Kiyoshi OGURI
Field Programmable Gate Array (FPGA) based robust model fitting enjoys immense popularity in image processing because of its high efficiency. This paper focuses on the tradeoff analysis of real-time FPGA implementation of robust circle and ellipse estimations based on the random sample consensus (RANSAC) algorithm, which estimates parameters of a statistical model from a data set of feature points which contains outliers. In particular, this paper mainly highlights implementation alternatives for solvers of simultaneous equations and compares Gauss-Jordan elimination and Cramer's rule by changing matrix size and arithmetic processes. Experimental evaluation shows a Cramer's rule approach coupled with long integer arithmetic can reduce most hardware resources without unacceptable degradation of estimation accuracy compared to floating point versions.
Makoto SAEN Tadanobu TOBA Yusuke KANNO
This paper presents a soft-error-tolerant memory-control circuit for SRAM-based field programmable gate arrays (FPGAs). A potential obstacle to applying such FPGAs to safety-critical industrial control systems is their low tolerance. The main reason is that soft errors damage circuit-configuration data stored in SRAM-based configuration memory. To overcome this obstacle, the soft-error tolerance must thus be improved while suppressing the circuit area overhead, and data stored in external memory must be protected when a fault occurs on the FPGA. Therefore, a memory-control circuit was developed on the basis of a dual-modular-redundancy (DMR) architecture. This memory controller has a repair and retry scheme that repairs damaged circuit-configuration data and re-executes unfinished accesses after the repair. The developed architecture reduces circuit redundancy below that of a commonly used triple-modular-redundancy (TMR) architecture. Moreover, a write-invalidation circuit was developed to protect data in external memory, and an external-memory-state recovery circuit was developed to enable resumption of memory access after fault repair. The developed memory controller was implemented in a prototype circuit on an FPGA and evaluated using the prototype. The evaluation results demonstrated that the developed memory controller can operate successfully for 1.03×109 hours (at sea level). In addition, its circuit area overhead was found to be sufficiently smaller than that of the TMR architecture.
Motoki AMAGASAKI Yuki NISHITANI Kazuki INOUE Masahiro IIDA Morihiro KUGA Toshinori SUEYOSHI
Fault tolerance is an important feature for the system LSIs used in reliability-critical systems. Although redundancy techniques are generally used to provide fault tolerance, these techniques have significantly hardware costs. However, FPGAs can easily provide high reliability due to their reconfiguration ability. Even if faults occur, the implemented circuit can perform correctly by reconfiguring to a fault-free region of the FPGA. In this paper, we examine an FPGA-IP core loaded in SoC and introduce a fault-tolerant technology based on fault detection and recovery as a CAD-level approach. To detect fault position, we add a route to the manufacturing test method proposed in earlier research and identify fault areas. Furthermore, we perform fault recovery at the logic tile and multiplexer levels using reconfiguration. The evaluation results for the FPGA-IP core loaded in the system LSI demonstrate that it was able to completely identify and avoid fault areas relative to the faults in the routing area.
In this paper, we present an FPGA hardware implementation for a phylogenetic tree reconstruction with a maximum parsimony algorithm. We base our approach on a particular stochastic local search algorithm that uses the Progressive Neighborhood and the Indirect Calculation of Tree Lengths method. This method is widely used for the acceleration of the phylogenetic tree reconstruction algorithm in software. In our implementation, we define a tree structure and accelerate the search by parallel and pipeline processing. We show results for eight real-world biological datasets. We compare execution times against our previous hardware approach, and TNT, the fastest available parsimony program, which is also accelerated by the Indirect Calculation of Tree Lengths method. Acceleration rates between 34 to 45 per rearrangement, and 2 to 6 for the whole search, are obtained against our previous hardware approach. Acceleration rates between 2 to 36 per rearrangement, and 18 to 112 for the whole search, are obtained against TNT.
Tieyuan PAN Lian ZENG Yasuhiro TAKASHIMA Takahiro WATANABE
In this paper, we propose a fast Maximal Empty Rectangle (MER) enumeration algorithm for online task placement on reconfigurable Field-Programmable Gate Arrays (FPGAs). On the assumption that each task utilizes rectangle-shaped resources, the proposed algorithm can manage the free space on FPGAs by an MER list. When assigning or removing a task, a series of MERs are selected and cut into segments according to the task and its assignment location. By processing these segments, the MER list can be updated quickly with low memory consumption. Under the proof of the upper limit of the number of the MERs on the FPGA, we analyze both the time and space complexity of the proposed algorithm. The efficiency of the proposed algorithm is verified by experiments.
A Field Programmable Gate Array (FPGA) with fine-grained body biasing shows satisfactory static power reduction. Contrarily, the FPGA incurs high overhead because additional body bias selectors and electrical isolation regions are needed to program the threshold voltage (Vt) of elemental circuits such as MUX, buffer and LUT in the FPGA. In this paper, low overhead design of FPGA with fine-grained body biasing is described. The FPGA is designed and fabricated on 65-nm SOTB CMOS technology. By not only adopting a customized design rule specifying that reliability is verified by TEGs but downsizing a body bias selector, the FPGA tile area becomes small by 39% compared with the conventional design, resulting in 900 FPGA tiles with 4,4000 programmable Vt regions. In addition, the chip performance is evaluated by implementing 32-bit binary counter in the supply voltage range of 0.5V from 1.2V. The counter circuit operates at a frequency of 72MHz and 14MHz with the supply voltage of 1.2V and 0.5V respectively. The static power saving of 80% in elemental circuits of the FPGA at 0.5-V supply voltage and 0.5-V reverse body bias voltage is achieved in the best case. In the whole chip including configuration memory and body bias selector in addition to elemental circuits, effective static power reduction around 30% is maintained by applying 0.3-V reverse body bias voltage at each supply voltage.