The search functionality is under construction.

Keyword Search Result

[Keyword] FPGA(329hit)

21-40hit(329hit)

  • RVCar: An FPGA-Based Simple and Open-Source Mini Motor Car System with a RISC-V Soft Processor

    Takuto KANAMORI  Takashi ODAN  Kazuki HIROHATA  Kenji KISE  

     
    PAPER

      Pubricized:
    2022/08/09
      Vol:
    E105-D No:12
      Page(s):
    1999-2007

    Deep Neural Network (DNN) is widely used for computer vision tasks, such as image classification, object detection, and segmentation. DNN accelerator on FPGA and especially Convolutional Neural Network (CNN) is a hot topic. More research and education should be conducted to boost this field. A starting point is required to make it easy for new entrants to join this field. We believe that FPGA-based Autonomous Driving (AD) motor cars are suitable for this because DNN accelerators can be used for image processing with low latency. In this paper, we propose an FPGA-based simple and open-source mini motor car system named RVCar with a RISC-V soft processor and a CNN accelerator. RVCar is suitable for the new entrants who want to learn the implementation of a CNN accelerator and the surrounding system. The motor car consists of Xilinx Nexys A7 board and simple parts. All modules except the CNN accelerator are implemented in Verilog HDL and SystemVerilog. The CNN accelerator is converted from a PyTorch model by our tool. The accelerator is written in C++, synthesizable by Vitis HLS, and an easy-to-customize baseline for the new entrants. FreeRTOS is used to implement AD algorithms and executed on the RISC-V soft processor. It helps the users to develop the AD algorithms efficiently. We conduct a case study of the simple AD task we define. Although the task is simple, it is difficult to achieve without image recognition. We confirm that RVCar can recognize objects and make correct decisions based on the results.

  • 28nm Atom-Switch FPGA: Static Timing Analysis and Evaluation

    Xu BAI  Ryusuke NEBASHI  Makoto MIYAMURA  Kazunori FUNAHASHI  Naoki BANNO  Koichiro OKAMOTO  Hideaki NUMATA  Noriyuki IGUCHI  Tadahiko SUGIBAYASHI  Toshitsugu SAKAMOTO  Munehiro TADA  

     
    BRIEF PAPER

      Pubricized:
    2022/06/27
      Vol:
    E105-C No:10
      Page(s):
    627-630

    A static timing analysis (STA) tool for a 28nm atom-switch FPGA (AS-FPGA) is introduced to validate the signal delay of an application circuit before implementation. High accuracy of the STA tool is confirmed by implementing a practical application circuit on the 28nm AS-FPGA. Moreover, dramatic improvement of delay and power is demonstrated in comparison with a previous 40nm AS-FPGA.

  • Resource Efficient Top-K Sorter on FPGA

    Binhao HE  Meiting XUE  Shubiao LIU  Feng YU  Weijie CHEN  

     
    LETTER-Digital Signal Processing

      Pubricized:
    2022/03/02
      Vol:
    E105-A No:9
      Page(s):
    1372-1376

    The top-K sorting is a variant of sorting used heavily in applications such as database management systems. Recently, the use of field programmable gate arrays (FPGAs) to accelerate sorting operation has attracted the interest of researchers. However, existing hardware top-K sorting algorithms are either resource-intensive or of low throughput. In this paper, we present a resource-efficient top-K sorting architecture that is composed of L cascading sorting units, and each sorting unit is composed of P sorting cells. K=PL largest elements are produced when a variable length input sequence is processed. This architecture can operate at a high frequency while consuming fewer resources. The experimental results show that our architecture achieved a maximum 1.2x throughput-to-resource improvement compared to previous studies.

  • An Efficient Resource Shared RISC-V Multicore Architecture

    Md Ashraful ISLAM  Kenji KISE  

     
    PAPER-Computer System

      Pubricized:
    2022/05/27
      Vol:
    E105-D No:9
      Page(s):
    1506-1515

    For the increasing demands of computation, heterogeneous multicore architecture is believed to be a promising solution to fulfill the edge computational requirement. In FPGAs, the heterogeneous multicore is realized as multiple soft processor cores with custom processing elements. Since FPGA is a resource-constrained device, sharing the hardware resources among the soft processor cores can be advantageous. A few research works have focused on the resource sharing between soft processors, but they do not study how much FPGA logic is minimized for a different pipeline processor. This paper proposes the microarchitecture of four, and five stage pipeline processors that enables the sharing of functional units for execution among the multiple cores as well as sharing the BRAM ports. We then investigate the performance and hardware resource utilization for a four-core processor. We find that sharing different functional units can save the LUT usage to 31.7% and DSP usage to 75%. We analyze the performance impact of sharing from the simulation of the Embench benchmark program. Our simulation results indicate that for some cases the sharing improves the performance and for other configurations worst-case performance drop is 16.7%.

  • Fast xFlow Proxy: Exploring and Visualizing Deep Inside of Carrier Traffic

    Shohei KAMAMURA  Yuhei HAYASHI  Yuki MIYOSHI  Takeaki NISHIOKA  Chiharu MORIOKA  Hiroyuki OHNISHI  

     
    PAPER-Network System

      Pubricized:
    2021/11/09
      Vol:
    E105-B No:5
      Page(s):
    512-521

    This paper proposes a fast and scalable traffic monitoring system called Fast xFlow Proxy. For efficiently provisioning and operating networks, xFlow such as IPFIX and NetFlow is a promising technology for visualizing the detailed traffic matrix in a network. However, internet protocol (IP) packets in a large carrier network are encapsulated with various outer headers, e.g., layer 2 tunneling protocol (L2TP) or multi-protocol label switching (MPLS) labels. As native xFlow technologies are applied to the outer header, the desired inner information cannot be visualized. From this motivation, we propose Fast xFlow Proxy, which explores the complicated carrier's packet, extracts inner information properly, and relays the inner information to a general flow collector. Fast xFlow Proxy should be able to handle various packet processing operations possible (e.g., header analysis, header elimination, and statistics) at a wire rate. To realize the processing speed needed, we implement Fast xFlow Proxy using the data plane development kit (DPDK) and field-programmable gate array (FPGA). By optimizing deployment of processes between DPDK and FPGA, Fast xFlow Proxy achieves wire rate processing. From evaluations, we can achieve over 20 Gbps performance by using a single server and 100 Gbps performance by using scale-out architecture. We also show that this performance is sufficiently practical for monitoring a nationwide carrier network.

  • High Temporal Resolution-Based Temporal Iterative Tracking for High Framerate and Ultra-Low Delay Dynamic Tracking System

    Tingting HU  Ryuji FUCHIKAMI  Takeshi IKENAGA  

     
    PAPER-Image Processing and Video Processing

      Pubricized:
    2022/02/22
      Vol:
    E105-D No:5
      Page(s):
    1064-1074

    High frame rate and ultra-low delay vision system, which can finish reading and processing of 1000fps sequence within 1ms/frame, draws increasing attention in the field of robotics that requires immediate feedback from image process core. Meanwhile, tracking task plays an important role in many computer vision applications. Among various tracking algorithms, Lucas Kanade (LK)-based template tracking, which tracks targets with high accuracy over the sub-pixel level, is one of the keys for robotic applications, such as factory automation (FA). However, the substantial spatial iterative processing and complex computation in the LK algorithm, make it difficult to achieve a high frame rate and ultra-low delay tracking with limited resources. Aiming at an LK-based template tracking system that reads and processes 1000fps sequences within 1ms/frame with small resource costs, this paper proposes: 1) High temporal resolution-based temporal iterative tracking, which maps the spatial iterations into the temporal domain, efficiently reduces resource cost and delay caused by spatial iterative processing. 2) Label scanner-based multi-stream spatial processing, which maps the local spatial processing into the labeled input pixel stream and aggregates them with a label scanner, makes the local spatial processing in the LK algorithm possible be implemented with a small resource cost. Algorithm evaluation shows that the proposed temporal iterative tracking performs dynamic tracking, which tracks object with coarse accuracy when it's moving fast and achieves higher accuracy when it slows down. Hardware evaluation shows that the proposed label scanner-based multi-stream architecture makes the system implemented on FPGA (zcu102) with resource cost less than 20%, and the designed tracking system supports to read and process 1000fps sequence within 1ms/frame.

  • FPGA Implementation of a Stream-Based Real-Time Hardware Line Segment Detector

    Taito MANABE  Taichi KATAYAMA  Yuichiro SHIBATA  

     
    PAPER

      Pubricized:
    2021/09/02
      Vol:
    E105-A No:3
      Page(s):
    468-477

    Line detection is the fundamental image processing technique which has various applications in the field of computer vision. For example, lane keeping required to realize autonomous vehicles can be implemented based on line detection technique. For such purposes, however, low detection latency and power consumption are essential. Using hardware-based stream processing is considered as an effective way to achieve such properties since it eliminates the need of storing the whole frame into energy-consuming external memory. In addition, adopting FPGAs enables us to keep flexibility of software processing. The line segment detector (LSD) is the algorithm based on intensity gradient, and performs better than the well-known Hough transform in terms of processing speed and accuracy. However, implementing the original LSD on FPGAs as a pipeline structure is difficult mainly because of its iterative region growing approach. Therefore, we propose a simple and stream-friendly line segment detection algorithm based on the concept of LSD. The whole system is implemented on a Xilinx Zynq-7000 XC7Z020-1CLG400C FPGA without any external memory. Evaluation results reveal that the implemented system is able to detect line segments successfully and is compact with 7.5% of Block RAM and less than 7.0% of the other resources used, while maintaining 60 fps throughput for VGA videos. It is also shown that the system is power-efficient compared to software processing on CPUs.

  • A Hardware Oriented Approximate Convex Hull Algorithm and its FPGA Implementation Open Access

    Tatsuma MORI  Taito MANABE  Yuichiro SHIBATA  

     
    PAPER

      Pubricized:
    2021/09/02
      Vol:
    E105-A No:3
      Page(s):
    459-467

    The convex hull is the minimum convex surrounding a given set of points. Since the process of finding convex hulls has various practical application fields including embedded real-time systems, efficient acceleration of convex hull algorithms is an important problem in computer geometry. In this paper, we discuss an FPGA acceleration approach to address this problem. In order to compute the convex hull of an unsorted point set, it is necessary to store all the points during the computation, and thus the capacity of a on-chip memory is likely to be a major constraint for efficient FPGA implementation. On the other hand, approximate convex hulls are often sufficient for practical applications. Therefore, we propose a hardware oriented approximate convex hull algorithm, which can process the input points as a stream without storing all the points in the memory. We also propose some computation reduction techniques for efficient FPGA implementation. Then, we present FPGA implementation of the proposed algorithm, which is parallelized both in temporal and spatial domains, and evaluate its effectiveness in terms of performance and accuracy. As a result, we demonstrated 11 to 30 times faster performance compared to the widely-used convex hull software library Qhull. In addition, accuracy assessment revealed that the maximum approximation error normalized to the diameters of point sets was 0.038%, which was reasonably small for practical use cases.

  • FPGA Implementation of 3-Bit Quantized Multi-Task CNN for Contour Detection and Disparity Estimation

    Masayuki MIYAMA  

     
    PAPER-Image Recognition, Computer Vision

      Pubricized:
    2021/10/26
      Vol:
    E105-D No:2
      Page(s):
    406-414

    Object contour detection is a task of extracting the shape created by the boundaries between objects in an image. Conventional methods limit the detection targets to specific categories, or miss-detect edges of patterns inside an object. We propose a new method to represent a contour image where the pixel value is the distance to the boundary. Contour detection becomes a regression problem that estimates this contour image. A deep convolutional network for contour estimation is combined with stereo vision to detect unspecified object contours. Furthermore, thanks to similar inference targets and common network structure, we propose a network that simultaneously estimates both contour and disparity with fully shared weights. As a result of experiments, the multi-tasking network drew a good precision-recall curve, and F-measure was about 0.833 for FlyingThings3D dataset. L1 loss of disparity estimation for the dataset was 2.571. This network reduces the amount of calculation and memory capacity by half, and accuracy drop compared to the dedicated networks is slight. Then we quantize both weights and activations of the network to 3-bit. We devise a dedicated hardware architecture for the quantized CNN and implement it on an FPGA. This circuit uses only internal memory to perform forward propagation calculations, that eliminates high-power external memory accesses. This circuit is a stall-free pixel-by-pixel pipeline, and performs 8 rows, 16 input channels, 16 output channels, 3 by 3 pixels convolution calculations in parallel. The convolution calculation performance at the operating frequency of 250 MHz is 9 TOPs/s.

  • Improving the Performance of Circuit-Switched Interconnection Network for a Multi-FPGA System

    Kohei ITO  Kensuke IIZUKA  Kazuei HIRONAKA  Yao HU  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER

      Pubricized:
    2021/08/05
      Vol:
    E104-D No:12
      Page(s):
    2029-2039

    Multi-FPGA systems have gained attention because of their high performance and power efficiency. A multi-FPGA system called Flow-in-Cloud (FiC) is currently being developed as an accelerator of multi-access edge computing (MEC). FiC consists of multiple mid-range FPGAs tightly connected by high-speed serial links. Since time-critical jobs are assumed in MEC, a circuit-switched network with static time-division multiplexing (STDM) switches has been implemented on FiC. This paper investigates techniques of enhancing the interconnection performance of FiC. Unlike switching fabrics for Network on Chips or parallel machines, economical multi-FPGA systems, such as FiC, use Xilinx Aurora IP and FireFly cables with multiple lanes. We adopted the link aggregation and the slot distribution for using multiple lanes. To mitigate the bottleneck between an STDM switch and user logic, we also propose a multi-ejection STDM switch. We evaluated various combinations of our techniques by using three practical applications on an FiC prototype with 24 boards. When the number of slots is large and transferred data size is small, the slot distribution was sometimes more effective, while the link aggregation was superior for other most cases. Our multi-ejection STDM switch mitigated the bottleneck in ejection ports and successfully reduced the number of time slots. As a result, by combining the link aggregation and multi-ejection STDM switch, communication performance improved up to 7.50 times with few additional resources. Although the performance of the fast Fourier transform with the highest communication ratio could not be enhanced by using multiple boards when a lane was used, 1.99 times performance improvement was achieved by using 8 boards with four lanes and our multi-ejection switch compared with a board.

  • An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs

    Tomoya ITSUBO  Michihiro KOIBUCHI  Hideharu AMANO  Hiroki MATSUTANI  

     
    PAPER

      Pubricized:
    2021/07/01
      Vol:
    E104-D No:12
      Page(s):
    2057-2067

    Since deep learning workloads perform a large number of matrix operations on training data, GPUs (Graphics Processing Units) are efficient especially for the training phase. A cluster of computers each of which equips multiple GPUs can significantly accelerate the deep learning workloads. More specifically, a back-propagation algorithm following a gradient descent approach is used for the training. Although the gradient computation is still a major bottleneck of the training, gradient aggregation and optimization impose both communication and computation overheads, which should also be reduced for further shortening the training time. To address this issue, in this paper, multiple GPUs are interconnected with a PCI Express (PCIe) over 10Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected with network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between remote GPUs; thus, the gradient aggregation and parameter optimization are completed in the network. The proposed FPGA-based 10GbE switches with the four optimizers are implemented on NetFPGA-SUME board. Their resource utilizations are increased by PEs for the optimizers, and they consume up to 56% of the resources. Evaluation results using four remote GPUs connected via the proposed FPGA-based switch demonstrate that these optimizers are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. Also, the gradient aggregation throughput by the FPGA-based switch achieves up to 98.3% of the 10GbE line rate.

  • A Low-Latency Inference of Randomly Wired Convolutional Neural Networks on an FPGA

    Ryosuke KURAMOCHI  Hiroki NAKAHARA  

     
    PAPER

      Pubricized:
    2021/06/24
      Vol:
    E104-D No:12
      Page(s):
    2068-2077

    Convolutional neural networks (CNNs) are widely used for image processing tasks in both embedded systems and data centers. In data centers, high accuracy and low latency are desired for various tasks such as image processing of streaming videos. We propose an FPGA-based low-latency CNN inference for randomly wired convolutional neural networks (RWCNNs), whose layer structures are based on random graph models. Because RWCNNs have several convolution layers that have no direct dependencies between them, our architecture can process them efficiently using a pipeline method. At each layer, we need to use the calculation results of multiple layers as the input. We use an FPGA with HBM2 to enable parallel access to the input data with multiple HBM2 channels. We schedule the order of execution of the layers to improve the pipeline efficiency. We build a conflict graph using the scheduling results. Then, we allocate the calculation results of each layer to the HBM2 channels by coloring the graph. Because the pipeline execution needs to be properly controlled, we developed an automatic generation tool for hardware functions. We implemented the proposed architecture on the Alveo U50 FPGA. We investigated a trade-off between latency and recognition accuracy for the ImageNet classification task by comparing the inference performances for different input image sizes. We compared our accelerator with a conventional accelerator for ResNet-50. The results show that our accelerator reduces the latency by 2.21 times. We also obtained 12.6 and 4.93 times better efficiency than CPU and GPU, respectively. Thus, our accelerator for RWCNNs is suitable for low-latency inference.

  • A Multi-Tenant Resource Management System for Multi-FPGA Systems

    Miho YAMAKURA  Ryousei TAKANO  Akram BEN AHMED  Midori SUGAYA  Hideharu AMANO  

     
    PAPER

      Pubricized:
    2021/10/08
      Vol:
    E104-D No:12
      Page(s):
    2078-2088

    FPGA (Field Programmable Gate Array) based accelerators are attracting significant interest in cloud computing systems. Combining multi-FPGA systems with cloud computing brings a new perspective to the reconfigurable computing research. However, the multi-tenancy of a multi-FPGA system has not been fully discussed in the previous researches. In this paper, we propose a multi-tenant resource management system, named FiC-RM, for a multi-FPGA cloud system. FiC-RM provides users with a set of FPGA resources according to their requirements and allows them to exclusively access FPGA boards and the interconnection network. To achieve this, we propose a placement algorithm which is a key to efficiently share the limited resources. We demonstrate FiC-RM controls a practical scale multi-FPGA system. Moreover, Our simulation study shows that our placement algorithm achieved 3 to 4% improvement in the average resource usage and a 20-second reduction in the response time, compared to other existing naive algorithms.

  • CLAHE Implementation and Evaluation on a Low-End FPGA Board by High-Level Synthesis

    Koki HONDA  Kaijie WEI  Masatoshi ARAI  Hideharu AMANO  

     
    PAPER

      Pubricized:
    2021/07/12
      Vol:
    E104-D No:12
      Page(s):
    2048-2056

    Automobile companies have been trying to replace side mirrors of cars with small cameras for reducing air resistance. It enables us to apply some image processing to improve the quality of the image. Contrast Limited Adaptive Histogram Equalization (CLAHE) is one of such techniques to improve the quality of the image for the side mirror camera, which requires a large computation performance. Here, an implementation method of CLAHE on a low-end FPGA board by high-level synthesis is proposed. CLAHE has two main processing parts: cumulative distribution function (CDF) generation, and bilinear interpolation. During the CDF generation, the effect of increasing loop initiation interval can be greatly reduced by placing multiple Processing Elements (PEs). and during the interpolation, latency and BRAM usage were reduced by revising how to hold CDF and calculation method. Finally, by connecting each module with streaming interfaces, using data flow pragmas, overlapping processing, and hiding data transfer, our HLS implementation achieved a comparable result to that of HDL. We parameterized the components of the algorithm so that the number of tiles and the size of the image can be easily changed. The source code for this research can be downloaded from https://github.com/kokihonda/fpga_clahe.

  • Weight Sparseness for a Feature-Map-Split-CNN Toward Low-Cost Embedded FPGAs

    Akira JINGUJI  Shimpei SATO  Hiroki NAKAHARA  

     
    PAPER

      Pubricized:
    2021/09/27
      Vol:
    E104-D No:12
      Page(s):
    2040-2047

    Convolutional neural network (CNN) has a high recognition rate in image recognition and are used in embedded systems such as smartphones, robots and self-driving cars. Low-end FPGAs are candidates for embedded image recognition platforms because they achieve real-time performance at a low cost. However, CNN has significant parameters called weights and internal data called feature maps, which pose a challenge for FPGAs for performance and memory capacity. To solve these problems, we exploit a split-CNN and weight sparseness. The split-CNN reduces the memory footprint by splitting the feature map into smaller patches and allows the feature map to be stored in the FPGA's high-throughput on-chip memory. Weight sparseness reduces computational costs and achieves even higher performance. We designed a dedicated architecture of a sparse CNN and a memory buffering scheduling for a split-CNN and implemented this on the PYNQ-Z1 FPGA board with a low-end FPGA. An experiment on classification using VGG16 shows that our implementation is 3.1 times faster than the GPU, and 5.4 times faster than an existing FPGA implementation.

  • HBDCA: A Toolchain for High-Accuracy BRAM-Defined CNN Accelerator on FPGA with Flexible Structure

    Zhengjie LI  Jiabao GAO  Jinmei LAI  

     
    PAPER-Biocybernetics, Neurocomputing

      Pubricized:
    2021/07/26
      Vol:
    E104-D No:10
      Page(s):
    1724-1733

    In recent years FPGA has become popular in CNN acceleration, and many CNN-to-FPGA toolchains are proposed to fast deploy CNN on FPGA. However, for these toolchains, updating CNN network means regeneration of RTL code and re-implementation which is time-consuming and may suffer timing-closure problems. So, we propose HBDCA: a toolchain and corresponding accelerator. The CNN on HBDCA is defined by the content of BRAM. The toolchain integrates UpdateMEM utility of Xilinx, which updates content of BRAM without re-synthesis and re-implementation process. The toolchain also integrates TensorFlow Lite which provides high-accuracy quantization. HBDCA supports 8-bits per-channel quantization of weights and 8-bits per-layer quantization of activations. Upgrading CNN on accelerator means the kernel size of CNN may change. Flexible structure of HBDCA supports kernel-level parallelism with three different sizes (3×3, 5×5, 7×7). HBDCA implements four types of parallelism in convolution layer and two types of parallelism in fully-connected layer. In order to reduce access number to memory, both spatial and temporal data-reuse techniques were applied on convolution layer and fully-connect layer. Especially, temporal reuse is adopted at both row and column level of an Input Feature Map of convolution layer. Data can be just read once from BRAM and reused for the following clock. Experiments show by updating BRAM content with single UpdateMEM command, three CNNs with different kernel size (3×3, 5×5, 7×7) are implemented on HBDCA. Compared with traditional design flow, UpdateMEM reduces development time by 7.6X-9.1X for different synthesis or implementation strategy. For similar CNN which is created by toolchain, HBDCA has smaller latency (9.97µs-50.73µs), and eliminates re-implementation when update CNN. For similar CNN which is created by dedicated design, HBDCA also has the smallest latency 9.97µs, the highest accuracy 99.14% and the lowest power 1.391W. For different CNN which is created by similar toolchain which eliminate re-implementation process, HBDCA achieves higher speedup 120.28X.

  • Nonvolatile Field-Programmable Gate Array Using a Standard-Cell-Based Design Flow

    Daisuke SUZUKI  Takahiro HANYU  

     
    PAPER-Logic Design

      Pubricized:
    2021/04/16
      Vol:
    E104-D No:8
      Page(s):
    1111-1120

    A nonvolatile field-programmable gate array (NV-FPGA), where the circuit-configuration information still remains without power supply, offers a powerful solution against the standby power issue. In this paper, an NV-FPGA is proposed where the programmable logic and interconnect function blocks are described in a hardware description language and are pushed through a standard-cell-based design flow with nonvolatile flip-flops. The use of the standard-cell-based design flow makes it possible to migrate any arbitrary process technology and to perform architecture-level simulation with physical information. As a typical example, the proposed NV-FPGA is designed under 55nm CMOS/100nm magnetic tunnel junction (MTJ) technologies, and the performance of the proposed NV-FPGA is evaluated in comparison with that of a CMOS-only volatile FPGA.

  • Energy-Efficient ECG Signals Outlier Detection Hardware Using a Sparse Robust Deep Autoencoder

    Naoto SOGA  Shimpei SATO  Hiroki NAKAHARA  

     
    PAPER-Logic Design

      Pubricized:
    2021/05/17
      Vol:
    E104-D No:8
      Page(s):
    1121-1129

    Advancements in portable electrocardiographs have allowed electrocardiogram (ECG) signals to be recorded in everyday life. Machine-learning techniques, including deep learning, have been used in numerous studies to analyze ECG signals because they exhibit superior performance to conventional methods. A mobile ECG analysis device is needed so that abnormal ECG waves can be detected anywhere. Such mobile device requires a real-time performance and low power consumption, however, deep-learning based models often have too many parameters to implement on mobile hardware, its amount of hardware is too large and dissipates much power consumption. We propose a design flow to implement the outlier detector using an autoencoder on a low-end FPGA. To shorten the preparation time of ECG data used in training an autoencoder, an unsupervised learning technique is applied. Additionally, to minimize the volume of the weight parameters, a weight sparseness technique is applied, and all the parameters are converted into fixed-point values. We show that even if the parameters are reduced converted into fixed-point values, the outlier detection performance degradation is only 0.83 points. By reducing the volume of the weight parameters, all the parameters can be stored in on-chip memory. We design the architecture according to the CRS format, which is the well-known data structure of a sparse matrix, minimizing the hardware size and reducing the power consumption. We use weight sharing to further reduce the weight-parameter volumes. By using weight sharing, we could reduce the bit width of the memories by 60% while maintaining the outlier detection performance. We implemented the autoencoder on a Digilent Inc. ZedBoard and compared the results with those for the ARM mobile CPU for a built-in device. The results indicated that our FPGA implementation of the outlier detector was 12 times faster and 106 times more energy-efficient.

  • Remote Dynamic Reconfiguration of a Multi-FPGA System FiC (Flow-in-Cloud)

    Kazuei HIRONAKA  Kensuke IIZUKA  Miho YAMAKURA  Akram BEN AHMED  Hideharu AMANO  

     
    PAPER-Computer System

      Pubricized:
    2021/05/12
      Vol:
    E104-D No:8
      Page(s):
    1321-1331

    Multi-FPGA systems have been receiving a lot of attention as a low cost and energy efficient system for Multi-access Edge Computing (MEC). For such purpose, a bare-metal multi-FPGA system called FiC (Flow-in-Cloud) is under development. In this paper, we introduce the FiC multi FPGA cluster which is applied partial reconfiguration (PR) FPGA design flow to support online user defined accelerator replacement while executing FPGA interconnection network and its low-level multiple FPGA management software called remote PR manager. With the remote PR manager, the user can define the FiC FPGA cluster setup by JSON and control the cluster from user application with the cooperation of simple cluster management tool / library called ficmgr on the client host and REST API service provider called ficwww on Raspberry Pi 3 (RPi3) on each node. According to the evaluation results with a prototype FiC FPGA cluster system with 12 nodes, using with online application replacement by PR and on-the-fly FPGA bitstream compression, the time for FPGA bitstream distribution was reduced to 1/17 and the total cluster setup time was reduced by 21∼57% than compared to cluster setup with full configuration FPGA bitstream.

  • FCA-BNN: Flexible and Configurable Accelerator for Binarized Neural Networks on FPGA

    Jiabao GAO  Yuchen YAO  Zhengjie LI  Jinmei LAI  

     
    PAPER-Biocybernetics, Neurocomputing

      Pubricized:
    2021/05/19
      Vol:
    E104-D No:8
      Page(s):
    1367-1377

    A series of Binarized Neural Networks (BNNs) show the accepted accuracy in image classification tasks and achieve the excellent performance on field programmable gate array (FPGA). Nevertheless, we observe existing designs of BNNs are quite time-consuming in change of the target BNN and acceleration of a new BNN. Therefore, this paper presents FCA-BNN, a flexible and configurable accelerator, which employs the layer-level configurable technique to execute seamlessly each layer of target BNN. Initially, to save resource and improve energy efficiency, the hardware-oriented optimal formulas are introduced to design energy-efficient computing array for different sizes of padded-convolution and fully-connected layers. Moreover, to accelerate the target BNNs efficiently, we exploit the analytical model to explore the optimal design parameters for FCA-BNN. Finally, our proposed mapping flow changes the target network by entering order, and accelerates a new network by compiling and loading corresponding instructions, while without loading and generating bitstream. The evaluations on three major structures of BNNs show the differences between inference accuracy of FCA-BNN and that of GPU are just 0.07%, 0.31% and 0.4% for LFC, VGG-like and Cifar-10 AlexNet. Furthermore, our energy-efficiency results achieve the results of existing customized FPGA accelerators by 0.8× for LFC and 2.6× for VGG-like. For Cifar-10 AlexNet, FCA-BNN achieves 188.2× and 60.6× better than CPU and GPU in energy efficiency, respectively. To the best of our knowledge, FCA-BNN is the most efficient design for change of the target BNN and acceleration of a new BNN, while keeps the competitive performance.

21-40hit(329hit)