The search functionality is under construction.
The search functionality is under construction.

Author Search Result

[Author] Shimpei SATO(11hit)

1-11hit
  • GUINNESS: A GUI Based Binarized Deep Neural Network Framework for Software Programmers

    Hiroki NAKAHARA  Haruyoshi YONEKAWA  Tomoya FUJII  Masayuki SHIMODA  Shimpei SATO  

     
    PAPER-Design Tools

      Pubricized:
    2019/02/27
      Vol:
    E102-D No:5
      Page(s):
    1003-1011

    The GUINNESS (GUI based binarized neural network synthesizer) is an open-source tool flow for a binarized deep neural network toward FPGA implementation based on the GUI including both the training on the GPU and inference on the FPGA. Since all the operation is done on the GUI, the software designer is not necessary to write any scripts to design the neural network structure, training behavior, only specify the values for hyperparameters. After finishing the training, it automatically generates C++ codes to synthesis the bit-stream using the Xilinx SDSoC system design tool flow. Thus, our tool flow is suitable for the software programmers who are not familiar with the FPGA design. In our tool flow, we modify the training algorithms both the training and the inference for a binarized CNN hardware. Since the hardware has a limited number of bit precision, it lacks minimal bias in training. Also, for the inference on the hardware, the conventional batch normalization technique requires additional hardware. Our modifications solve these problems. We implemented the VGG-11 benchmark CNN on the Digilent Inc. Zedboard. Compared with the conventional binarized implementations on an FPGA, the classification accuracy was almost the same, the performance per power efficiency is 5.1 times better, as for the performance per area efficiency, it is 8.0 times better, and as for the performance per memory, it is 8.2 times better. We compare the proposed FPGA design with the CPU and the GPU designs. Compared with the ARM Cortex-A57, it was 1776.3 times faster, it dissipated 3.0 times lower power, and its performance per power efficiency was 5706.3 times better. Also, compared with the Maxwell GPU, it was 11.5 times faster, it dissipated 7.3 times lower power, and its performance per power efficiency was 83.0 times better. The disadvantage of our FPGA based design requires additional time to synthesize the FPGA executable codes. From the experiment, it consumed more three hours, and the total FPGA design took 75 hours. Since the training of the CNN is dominant, it is considerable.

  • A Threshold Neuron Pruning for a Binarized Deep Neural Network on an FPGA

    Tomoya FUJII  Shimpei SATO  Hiroki NAKAHARA  

     
    PAPER-Emerging Applications

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    376-386

    For a pre-trained deep convolutional neural network (CNN) for an embedded system, a high-speed and a low power consumption are required. In the former of the CNN, it consists of convolutional layers, while in the latter, it consists of fully connection layers. In the convolutional layer, the multiply accumulation operation is a bottleneck, while the fully connection layer, the memory access is a bottleneck. The binarized CNN has been proposed to realize many multiply accumulation circuit on the FPGA, thus, the convolutional layer can be done with a high-seed operation. However, even if we apply the binarization to the fully connection layer, the amount of memory was still a bottleneck. In this paper, we propose a neuron pruning technique which eliminates almost part of the weight memory, and we apply it to the fully connection layer on the binarized CNN. In that case, since the weight memory is realized by an on-chip memory on the FPGA, it achieves a high-speed memory access. To further reduce the memory size, we apply the retraining the CNN after neuron pruning. In this paper, we propose a sequential-input parallel-output fully connection layer circuit for the binarized fully connection layer, while proposing a streaming circuit for the binarized 2D convolutional layer. The experimental results showed that, by the neuron pruning, as for the fully connected layer on the VGG-11 CNN, the number of neurons was reduced by 39.8% with keeping the 99% baseline accuracy. We implemented the neuron pruning CNN on the Xilinx Inc. Zynq Zedboard. Compared with the ARM Cortex-A57, it was 1773.0 times faster, it dissipated 3.1 times lower power, and its performance per power efficiency was 5781.3 times better. Also, compared with the Maxwell GPU, it was 11.1 times faster, it dissipated 7.7 times lower power, and its performance per power efficiency was 84.1 times better. Thus, the binarized CNN on the FPGA is suitable for the embedded system.

  • Design Method of Variable-Latency Circuit with Tunable Approximate Completion-Detection Mechanism

    Yuta UKON  Shimpei SATO  Atsushi TAKAHASHI  

     
    PAPER

      Pubricized:
    2020/12/21
      Vol:
    E104-C No:7
      Page(s):
    309-318

    Advanced information-processing services such as computer vision require a high-performance digital circuit to perform high-load processing at high speed. To achieve high-speed processing, several image-processing applications use an approximate computing technique to reduce idle time of the circuit. However, it is difficult to design the high-speed image-processing circuit while controlling the error rate so as not to degrade service quality, and this technique is used for only a few applications. In this paper, we propose a method that achieves high-speed processing effectively in which processing time for each task is changed by roughly detecting its completion. Using this method, a high-speed processing circuit with a low error rate can be designed. The error rate is controllable, and a circuit design method to minimize the error rate is also presented in this paper. To confirm the effectiveness of our proposal, a ripple-carry adder (RCA), 2-dimensional discrete cosine transform (2D-DCT) circuit, and histogram of oriented gradients (HOG) feature calculation circuit are evaluated. Effective clock periods of these circuits obtained by our method with around 1% error rate are improved about 64%, 6%, and 12%, respectively, compared with circuits without error. Furthermore, the impact of the miscalculation on a video monitoring service using an object detection application is investigated. As a result, more than 99% of detection points required to be obtained are detected, and it is confirmed the miscalculation hardly degrades the service quality.

  • Energy-Efficient ECG Signals Outlier Detection Hardware Using a Sparse Robust Deep Autoencoder

    Naoto SOGA  Shimpei SATO  Hiroki NAKAHARA  

     
    PAPER-Logic Design

      Pubricized:
    2021/05/17
      Vol:
    E104-D No:8
      Page(s):
    1121-1129

    Advancements in portable electrocardiographs have allowed electrocardiogram (ECG) signals to be recorded in everyday life. Machine-learning techniques, including deep learning, have been used in numerous studies to analyze ECG signals because they exhibit superior performance to conventional methods. A mobile ECG analysis device is needed so that abnormal ECG waves can be detected anywhere. Such mobile device requires a real-time performance and low power consumption, however, deep-learning based models often have too many parameters to implement on mobile hardware, its amount of hardware is too large and dissipates much power consumption. We propose a design flow to implement the outlier detector using an autoencoder on a low-end FPGA. To shorten the preparation time of ECG data used in training an autoencoder, an unsupervised learning technique is applied. Additionally, to minimize the volume of the weight parameters, a weight sparseness technique is applied, and all the parameters are converted into fixed-point values. We show that even if the parameters are reduced converted into fixed-point values, the outlier detection performance degradation is only 0.83 points. By reducing the volume of the weight parameters, all the parameters can be stored in on-chip memory. We design the architecture according to the CRS format, which is the well-known data structure of a sparse matrix, minimizing the hardware size and reducing the power consumption. We use weight sharing to further reduce the weight-parameter volumes. By using weight sharing, we could reduce the bit width of the memories by 60% while maintaining the outlier detection performance. We implemented the autoencoder on a Digilent Inc. ZedBoard and compared the results with those for the ARM mobile CPU for a built-in device. The results indicated that our FPGA implementation of the outlier detector was 12 times faster and 106 times more energy-efficient.

  • Power Efficient Object Detector with an Event-Driven Camera for Moving Object Surveillance on an FPGA

    Masayuki SHIMODA  Shimpei SATO  Hiroki NAKAHARA  

     
    PAPER-Applications

      Pubricized:
    2019/02/27
      Vol:
    E102-D No:5
      Page(s):
    1020-1028

    We propose an object detector using a sliding window method for an event-driven camera which outputs a subtracted frame (usually a binary value) when changes are detected in captured images. Since sliding window skips unchanged portions of the output, the number of target object area candidates decreases dramatically, which means that our system operates faster and with lower power consumption than a system using a straightforward sliding window approach. Since the event-driven camera output consists of binary precision frames, an all binarized convolutional neural network (ABCNN) can be available, which means that it allows all convolutional layers to share the same binarized convolutional circuit, thereby reducing the area requirement. We implemented our proposed method on the Xilinx Inc. Zedboard and then evaluated it using the PETS 2009 dataset. The results showed that our system outperformed BCNN system from the viewpoint of detection performance, hardware requirement, and computation time. Also, we showed that FPGA is an ideal method for our system than mobile GPU. From these results, our proposed system is more suitable for the embedded systems based on stationary cameras (such as security cameras).

  • Weight Sparseness for a Feature-Map-Split-CNN Toward Low-Cost Embedded FPGAs

    Akira JINGUJI  Shimpei SATO  Hiroki NAKAHARA  

     
    PAPER

      Pubricized:
    2021/09/27
      Vol:
    E104-D No:12
      Page(s):
    2040-2047

    Convolutional neural network (CNN) has a high recognition rate in image recognition and are used in embedded systems such as smartphones, robots and self-driving cars. Low-end FPGAs are candidates for embedded image recognition platforms because they achieve real-time performance at a low cost. However, CNN has significant parameters called weights and internal data called feature maps, which pose a challenge for FPGAs for performance and memory capacity. To solve these problems, we exploit a split-CNN and weight sparseness. The split-CNN reduces the memory footprint by splitting the feature map into smaller patches and allows the feature map to be stored in the FPGA's high-throughput on-chip memory. Weight sparseness reduces computational costs and achieves even higher performance. We designed a dedicated architecture of a sparse CNN and a memory buffering scheduling for a split-CNN and implemented this on the PYNQ-Z1 FPGA board with a low-end FPGA. An experiment on classification using VGG16 shows that our implementation is 3.1 times faster than the GPU, and 5.4 times faster than an existing FPGA implementation.

  • ArchHDL: A Novel Hardware RTL Modeling and High-Speed Simulation Environment

    Shimpei SATO  Ryohei KOBAYASHI  Kenji KISE  

     
    PAPER-Design Methodology and Platform

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    344-353

    LSIs are generally designed through four stages including architectural design, logic design, circuit design, and physical design. In architectural design and logic design, designers describe their target hardware in RTL. However, they generally use different languages for each phase. Typically a general purpose programming language such as C or C++ and a hardware description language such as Verilog HDL or VHDL are used for architectural design and logic design, respectively. That is time-consuming way for designing a hardware and more efficient design environment is required. In this paper, we propose a new hardware modeling and high-speed simulation environment for architectural design and logic design. Our environment realizes writing and verifying hardware by one language. The environment consists of (1) a new hardware description language called ArchHDL, which enables to simulate hardware faster than Verilog HDL simulation, and (2) a source code translation tool from ArchHDL code to Verilog HDL code. ArchHDL is a new language for hardware RTL modeling based on C++. The key features of this language are that (1) designers describe a combinational circuit as a function and (2) the ArchHDL library realizes non-blocking assignment in C++. Using these features, designers are able to write a hardware transparently from abstracted level description to RTL description in Verilog HDL-like style. Source codes in ArchHDL is converted to Verilog HDL codes by the translation tool and they are used to synthesize for FPGAs or ASICs. As the evaluation of our environment, we implemented a practical many-core processor in ArchHDL and measured the simulation speed on an Intel CPU and an Intel Xeon Phi processor. The simulation speed for the Intel CPU by ArchHDL achieves about 4.5 times faster than the simulation speed by Synopsys VCS. We also confirmed that the RTL simulation by ArchHDL is efficiently parallelized on the Intel Xeon Phi processor. We convert the ArchHDL code to a Verilog HDL code and estimated the hardware utilization on an FPGA. To implement a 48-node many-core processor, 71% of entire resources of a Virtex-7 FPGA are consumed.

  • A Low Area Overhead Design Method for High-Performance General-Synchronous Circuits with Speculative Execution

    Shimpei SATO  Eijiro SASSA  Yuta UKON  Atsushi TAKAHASHI  

     
    PAPER

      Vol:
    E102-A No:12
      Page(s):
    1760-1769

    In order to obtain high-performance circuits in advanced technology nodes, design methodology has to take the existence of large delay variations into account. Clock scheduling and speculative execution have overheads to realize them, but have potential to improve the performance by averaging the imbalance of maximum delay among paths and by utilizing valid data available earlier than worst-case scenarios, respectively. In this paper, we propose a high-performance digital circuit design method with speculative executions with less overhead by utilizing clock scheduling with delay insertions effectively. The necessity of speculations that cause overheads is effectively reduced by clock scheduling with delay insertion. Experiments show that a generated circuit achieves 26% performance improvement with 1.3% area overhead compared to a circuit without clock scheduling and without speculative execution.

  • An FPGA Realization of a Random Forest with k-Means Clustering Using a High-Level Synthesis Design

    Akira JINGUJI  Shimpei SATO  Hiroki NAKAHARA  

     
    PAPER-Emerging Applications

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    354-362

    A random forest (RF) is a kind of ensemble machine learning algorithm used for a classification and a regression. It consists of multiple decision trees that are built from randomly sampled data. The RF has a simple, fast learning, and identification capability compared with other machine learning algorithms. It is widely used for application to various recognition systems. Since it is necessary to un-balanced trace for each tree and requires communication for all the ones, the random forest is not suitable in SIMD architectures such as GPUs. Although the accelerators using the FPGA have been proposed, such implementations were based on HDL design. Thus, they required longer design time than the soft-ware based realizations. In the previous work, we showed the high-level synthesis design of the RF including the fully pipelined architecture and the all-to-all communication. In this paper, to further reduce the amount of hardware, we use k-means clustering to share comparators of the branch nodes on the decision tree. Also, we develop the krange tool flow, which generates the bitstream with a few number of hyper parameters. Since the proposed tool flow is based on the high-level synthesis design, we can obtain the high performance RF with short design time compared with the conventional HDL design. We implemented the RF on the Xilinx Inc. ZC702 evaluation board. Compared with the CPU (Intel Xeon (R) E5607 Processor) and the GPU (NVidia Geforce Titan) implementations, as for the performance, the FPGA realization was 8.4 times faster than the CPU one, and it was 62.8 times faster than the GPU one. As for the power consumption efficiency, the FPGA realization was 7.8 times better than the CPU one, and it was 385.9 times better than the GPU one.

  • A Fast Length Matching Routing Pattern Generation Method for Set-Pair Routing Problem Using Selective Pin-Pair Connections Open Access

    Shimpei SATO  Kano AKAGI  Atsushi TAKAHASHI  

     
    PAPER

      Vol:
    E103-A No:9
      Page(s):
    1037-1044

    Routing problems derived from silicon-interposer and etc. are often formulated as a set-pair routing problem where the combination of pin-pairs to be connected is flexible. In this routing problem, a length matching routing pattern is often required due to the requirement of the signal propagation delays be the same. We propose a fast length matching routing method for the set-pair routing problem. The existing algorithm generates a good length matching routing pattern in practical time. However, due to the limited searching range, there are length matching routing patterns that cannot find due to the limited searching range of the algorithm. Also, it needs heavy iterative steps to improve a solution, and the computation time is practical but not fast. In the set-pair routing, although pin-pairs to be connected is flexible, it is expected that combinations of pin-pairs which realize length matching are restricted. In our method, such a combination of pin-pairs is selected in advance, then routing is performed to realize the connection of the selected pin-pairs. Heavy iterative steps are not used for both the selection and the routing, then a routing pattern is generated in a short time. In the experiments, we confirm that the quality of routing patterns generated by our method is almost equivalent to the existing algorithm. Furthermore, our method finds length matching routing patterns that the existing algorithm cannot find. The computation time is about 360 times faster than the existing algorithm.

  • SENTEI: Filter-Wise Pruning with Distillation towards Efficient Sparse Convolutional Neural Network Accelerators

    Masayuki SHIMODA  Youki SADA  Ryosuke KURAMOCHI  Shimpei SATO  Hiroki NAKAHARA  

     
    PAPER-Computer System

      Pubricized:
    2020/08/03
      Vol:
    E103-D No:12
      Page(s):
    2463-2470

    In the realization of convolutional neural networks (CNNs) in resource-constrained embedded hardware, the memory footprint of weights is one of the primary problems. Pruning techniques are often used to reduce the number of weights. However, the distribution of nonzero weights is highly skewed, which makes it more difficult to utilize the underlying parallelism. To address this problem, we present SENTEI*, filter-wise pruning with distillation, to realize hardware-aware network architecture with comparable accuracy. The filter-wise pruning eliminates weights such that each filter has the same number of nonzero weights, and retraining with distillation retains the accuracy. Further, we develop a zero-weight skipping inter-layer pipelined accelerator on an FPGA. The equalization enables inter-filter parallelism, where a processing block for a layer executes filters concurrently with straightforward architecture. Our evaluation of semantic-segmentation tasks indicates that the resulting mIoU only decreased by 0.4 points. Additionally, the speedup and power efficiency of our FPGA implementation were 33.2× and 87.9× higher than those of the mobile GPU. Therefore, our technique realizes hardware-aware network with comparable accuracy.