Author Search Result

[Author] Tomoya FUJII (3 hits)

  • GUINNESS: A GUI Based Binarized Deep Neural Network Framework for Software Programmers

    Hiroki NAKAHARA  Haruyoshi YONEKAWA  Tomoya FUJII  Masayuki SHIMODA  Shimpei SATO  

     
    PAPER-Design Tools

      Publicized:
    2019/02/27
      Vol:
    E102-D No:5
      Page(s):
    1003-1011

    GUINNESS (GUI based binarized neural network synthesizer) is an open-source, GUI-based tool flow for implementing a binarized deep neural network on an FPGA, covering both training on the GPU and inference on the FPGA. Since all operations are performed on the GUI, the software designer does not need to write any scripts to define the neural network structure or training behavior, and only has to specify the values of the hyperparameters. After training finishes, the tool automatically generates C++ code, from which the bitstream is synthesized using the Xilinx SDSoC system design tool flow. Thus, our tool flow is suitable for software programmers who are not familiar with FPGA design. In our tool flow, we modify both the training and the inference algorithms for binarized CNN hardware. Since the hardware has limited bit precision, it lacks minimal bias in training. Also, for inference on the hardware, the conventional batch normalization technique requires additional hardware. Our modifications solve these problems. We implemented the VGG-11 benchmark CNN on the Digilent Inc. Zedboard. Compared with conventional binarized implementations on an FPGA, the classification accuracy was almost the same, while the performance per power efficiency was 5.1 times better, the performance per area efficiency was 8.0 times better, and the performance per memory was 8.2 times better. We also compared the proposed FPGA design with CPU and GPU designs. Compared with the ARM Cortex-A57, it was 1776.3 times faster, it dissipated 3.0 times lower power, and its performance per power efficiency was 5706.3 times better. Compared with the Maxwell GPU, it was 11.5 times faster, it dissipated 7.3 times lower power, and its performance per power efficiency was 83.0 times better. The disadvantage of our FPGA-based design is that it requires additional time to synthesize the FPGA executable code. In our experiment, synthesis took more than three hours, and the total FPGA design took 75 hours. Since the training of the CNN dominates the total design time, this is considerable.
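
    The abstract notes that conventional batch normalization needs extra hardware at inference time, but it does not say how the authors remove it. As an illustration only, the sketch below shows one widely used alternative for binarized networks: when batch normalization is followed by a sign activation, the two can be folded into a single threshold comparison. All function and parameter names here are hypothetical, and this is not necessarily the modification used in GUINNESS.

```python
import math


def bn_then_sign(x, gamma, beta, mu, var, eps=1e-5):
    """Reference path: batch normalization followed by a sign activation."""
    y = gamma * (x - mu) / math.sqrt(var + eps) + beta
    return 1 if y >= 0 else -1


def fold_batch_norm(gamma, beta, mu, var, eps=1e-5):
    """Fold the batch-norm parameters into (threshold, flip).

    sign(gamma * (x - mu) / sigma + beta) only depends on which side of
    tau = mu - beta * sigma / gamma the input x falls. A negative gamma
    reverses the direction of the comparison.
    """
    if gamma == 0:
        raise ValueError("gamma must be non-zero to fold into a threshold")
    sigma = math.sqrt(var + eps)
    tau = mu - beta * sigma / gamma
    return tau, gamma < 0


def threshold_sign(x, tau, flip):
    """Folded path: one comparison, no multiply or divide at inference."""
    fires = (x <= tau) if flip else (x >= tau)
    return 1 if fires else -1


# Hypothetical parameters for a single neuron.
params = (0.8, 0.3, 1.5, 4.0)  # gamma, beta, mu, var
tau, flip = fold_batch_norm(*params)
for x in range(-3, 4):
    assert threshold_sign(x, tau, flip) == bn_then_sign(x, *params)
```

    The threshold is computed once after training, so at inference each neuron needs only a comparator on its accumulated sum.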

  • A Threshold Neuron Pruning for a Binarized Deep Neural Network on an FPGA

    Tomoya FUJII  Shimpei SATO  Hiroki NAKAHARA  

     
    PAPER-Emerging Applications

      Publicized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    376-386

    A pre-trained deep convolutional neural network (CNN) for an embedded system requires both high speed and low power consumption. The former part of the CNN consists of convolutional layers, while the latter part consists of fully connected layers. In the convolutional layers, the multiply-accumulate operation is the bottleneck, while in the fully connected layers, memory access is the bottleneck. The binarized CNN has been proposed to realize many multiply-accumulate circuits on the FPGA, so that the convolutional layers can be computed at high speed. However, even when binarization is applied to the fully connected layers, the amount of memory is still a bottleneck. In this paper, we propose a neuron pruning technique that eliminates most of the weight memory, and we apply it to the fully connected layers of the binarized CNN. In that case, since the weight memory is realized by on-chip memory on the FPGA, it achieves high-speed memory access. To further reduce the memory size, we retrain the CNN after neuron pruning. We also propose a sequential-input parallel-output circuit for the binarized fully connected layer and a streaming circuit for the binarized 2D convolutional layer. The experimental results showed that, for the fully connected layers of the VGG-11 CNN, neuron pruning reduced the number of neurons by 39.8% while keeping 99% of the baseline accuracy. We implemented the neuron-pruned CNN on the Xilinx Inc. Zynq Zedboard. Compared with the ARM Cortex-A57, it was 1773.0 times faster, it dissipated 3.1 times lower power, and its performance per power efficiency was 5781.3 times better. Compared with the Maxwell GPU, it was 11.1 times faster, it dissipated 7.7 times lower power, and its performance per power efficiency was 84.1 times better. Thus, the binarized CNN on the FPGA is suitable for embedded systems.
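
    To illustrate why binarization makes multiply-accumulate circuits cheap, the sketch below computes a dot product of +1/-1 vectors using only XNOR and a population count, which is the standard formulation for binarized networks. The bit encoding (bit set means +1) and the function names are assumptions made for this example; the abstract does not describe the authors' circuit at this level of detail.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer (bit i is 1 when values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("binarized values must be +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def binarized_dot(a_bits, w_bits, n):
    """Dot product of two n-element +1/-1 vectors stored as packed bits.

    XNOR marks the positions where activation and weight agree (product +1).
    With m agreeing positions out of n, the sum is m - (n - m) = 2 * m - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ w_bits) & mask).count("1")
    return 2 * matches - n


activations = [1, -1, 1, 1, -1, -1, 1, -1]
weights = [1, 1, -1, 1, -1, 1, 1, -1]
packed = binarized_dot(pack_bits(activations), pack_bits(weights), len(weights))
reference = sum(a * w for a, w in zip(activations, weights))
assert packed == reference
```

    Each weight occupies a single bit in this encoding, so removing a neuron from a fully connected layer removes one bit of weight memory per input to that neuron.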

  • Practical Improvement and Performance Evaluation of Road Damage Detection Model using Machine Learning

    Tomoya FUJII  Rie JINKI  Yuukou HORITA  

     
    LETTER-Image

      Publicized:
    2023/06/13
      Vol:
    E106-A No:9
      Page(s):
    1216-1219

    The social infrastructure built during Japan's period of rapid economic growth, including roads and bridges, is now aging, and it needs to be strategically maintained and renewed. At the same time, road maintenance in rural areas faces serious problems, such as reduced maintenance budgets and a shortage of engineers due to the declining birthrate and aging population. It is therefore difficult for maintenance engineers to visually inspect all roads in rural areas, and a system that automatically detects road damage is required. This paper reports practical improvements to a road damage detection model based on YOLOv5, an object detection model capable of real-time operation, focusing on the features of road images.