
Keyword Search Results

[Keyword] hardware acceleration (6 hits)

1-6 of 6 hits
  • Resource Efficient Top-K Sorter on FPGA

    Binhao HE  Meiting XUE  Shubiao LIU  Feng YU  Weijie CHEN

    LETTER-Digital Signal Processing

    Publicized: 2022/03/02
    Vol: E105-A No:9
    Page(s): 1372-1376

    Top-K sorting is a variant of sorting used heavily in applications such as database management systems. Recently, the use of field-programmable gate arrays (FPGAs) to accelerate sorting operations has attracted the interest of researchers. However, existing hardware top-K sorters are either resource-intensive or deliver low throughput. In this paper, we present a resource-efficient top-K sorting architecture composed of L cascaded sorting units, each built from P sorting cells. The K = P×L largest elements are produced as a variable-length input sequence is processed. The architecture operates at a high frequency while consuming few resources. Experimental results show that it achieves up to a 1.2x throughput-to-resource improvement over previous studies.
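
    As a rough illustration of the cascade idea (a behavioral software sketch, not the authors' RTL; the parameters P and L follow the abstract), each unit retains the P largest values it has received and forwards displaced values to the next unit, so the L units together hold the K = P×L largest elements:

        import heapq

        def cascade_top_k(stream, P, L):
            # Behavioral sketch of a cascaded top-K sorter: L units of P cells each.
            # Unit i keeps the P largest values it has received; a value displaced
            # from unit i flows on to unit i+1, so the cascade collectively retains
            # the K = P*L largest elements of the stream.
            units = [[] for _ in range(L)]       # each unit modeled as a min-heap of size <= P
            for x in stream:
                v = x
                for unit in units:
                    if len(unit) < P:            # unit still has a free sorting cell
                        heapq.heappush(unit, v)
                        v = None
                        break
                    if v > unit[0]:              # displace this unit's current minimum
                        v = heapq.heappushpop(unit, v)
                    # v (original value or displaced minimum) flows on to the next unit
            return sorted((y for u in units for y in u), reverse=True)

        print(cascade_top_k([5, 1, 9, 3, 7, 8, 2, 6], P=2, L=2))   # -> [9, 8, 7, 6]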

  • High Performance Application Specific Stream Architecture for Hardware Acceleration of HOG-SVM on FPGA

    Piyumal RANAWAKA  Mongkol EKPANYAPONG  Adriano TAVARES  Mathew DAILEY  Krit ATHIKULWONGSE  Vitor SILVA

    PAPER

    Vol: E102-A No:12
    Page(s): 1792-1803

    Conventional sequential processing in software on a general-purpose CPU has become insufficient for certain heavy computations, given the processing power required to deliver adequate throughput and performance. There is strong interest in high-performance real-time video processing on embedded systems, yet embedded platforms with limited performance can hardly meet the demands of such intensive computer-vision workloads. Hardware acceleration is therefore an attractive solution: compute-intensive kernels are accelerated by application-specific hardware integrated with a general-purpose CPU. In this research we build a parallelized, high-performance application-specific architecture for a HOG-SVM hardware accelerator implemented on a Zynq 7000 FPGA. The Histogram of Oriented Gradients (HOG) technique combined with a Support Vector Machine (SVM) classifier is versatile and extremely popular in computer vision despite its high demand for processing power, and various previous studies have attempted to obtain adequate HOG-SVM throughput. Our design reaches 240 FPS at a single scale on VGA frames of 640x480 pixels, outperforming the best previously reported single-scale results by roughly a factor of 3-4, and it provides an approximately 15x speedup over a GPU-accelerated software version with the same accuracy. The work explores a novel architecture based on deep pipelining, parallel processing and BRAM structures to achieve high performance on the HOG-SVM computation. The resulting video processing unit (VPU) is integrated as a co-processing peripheral with a host CPU through a novel custom accelerator structure using on-chip buses, in a system-on-chip (SoC) fashion, so that heavy, repetitive video-stream computations are offloaded to the VPU while the CPU's processing power is preserved for lightweight applications. The paper focuses on the architectural techniques used to reach this performance and on the accelerator structure that couples the VPU to the host CPU.
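
    For orientation, a minimal software sketch of the HOG + linear-SVM computation that such an accelerator pipelines in hardware (gradients, per-cell orientation histograms, block normalization, then a dot product with the SVM weights). The function and parameter names below are illustrative, not the paper's interface:

        import numpy as np

        def hog_svm_score(gray, weights, bias, cell=8, bins=9):
            # gradients (the hardware would compute these with simple [-1, 0, 1] taps)
            gy, gx = np.gradient(gray.astype(np.float32))
            mag = np.hypot(gx, gy)
            ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # unsigned orientation

            # orientation histogram per cell
            H, W = gray.shape
            ch, cw = H // cell, W // cell
            hist = np.zeros((ch, cw, bins), dtype=np.float32)
            bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
            for i in range(ch * cell):
                for j in range(cw * cell):
                    hist[i // cell, j // cell, bin_idx[i, j]] += mag[i, j]

            # 2x2-cell block normalization, concatenated into one descriptor
            blocks = []
            for i in range(ch - 1):
                for j in range(cw - 1):
                    b = hist[i:i + 2, j:j + 2, :].ravel()
                    blocks.append(b / (np.linalg.norm(b) + 1e-6))
            desc = np.concatenate(blocks)

            # linear SVM decision value; its sign gives the detection result
            return float(np.dot(weights, desc) + bias)

        # 64x128 detection window: (7*15) blocks * 36 values = 3780-entry descriptor
        win = np.random.rand(128, 64).astype(np.float32)
        w = np.zeros(3780, dtype=np.float32)                  # placeholder SVM model
        print(hog_svm_score(win, w, bias=0.0))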

  • Hardware Accelerated Marking for Mark & Sweep Garbage Collection

    Shinji KAWAMURA  Tomoaki TSUMURA

    PAPER-Computer System

    Publicized: 2018/01/15
    Vol: E101-D No:4
    Page(s): 1107-1115

    Many mobile systems need to achieve both high performance and low memory usage, and the total performance of such systems can be strongly affected by the effectiveness of garbage collection (GC). The recent popularization of mobile devices therefore makes GC performance important across a wide range of platforms. The response-time degradation caused by suspending all processes for GC is a well-known problem, and although GC algorithms have been actively studied and improved, no fundamental solution has been reached. In this paper, we focus on the observation that the same objects are redundantly marked during the GC procedure of DalvikVM, one of the widely used runtime environments for mobile devices, and we propose a hardware support technique for improving the marking routine of GC. We add a set of tables to the processor for managing marked objects, so that redundant marking of already-marked objects can be omitted by referring to these tables. Simulation results show that the percentage of redundant marking is reduced by more than 50%.
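
    The idea can be pictured with a small software sketch of a mark phase that consults a table of already-marked objects. Here the table is just a Python set; in the proposal it sits in hardware beside the processor, and the names below are illustrative:

        def mark_phase(roots):
            # Trace the object graph from the roots, recording marks in a table;
            # a table hit means the object was already marked, so re-marking it
            # (and re-tracing its children) is skipped.
            marked = set()                        # hardware: per-object mark table
            worklist = list(roots)
            while worklist:
                obj = worklist.pop()
                if id(obj) in marked:             # table hit -> redundant mark skipped
                    continue
                marked.add(id(obj))               # record the mark in the table
                worklist.extend(obj.references)   # push children for tracing
            return marked

        class Obj:
            def __init__(self, *references):
                self.references = list(references)

        # tiny object graph with a shared child to show a redundant mark being skipped
        leaf = Obj()
        a, b = Obj(leaf), Obj(leaf)
        root = Obj(a, b)
        print(len(mark_phase([root])))            # -> 4 objects marked; 'leaf' traced once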

  • Enabling FPGA-as-a-Service in the Cloud with hCODE Platform

    Qian ZHAO  Motoki AMAGASAKI  Masahiro IIDA  Morihiro KUGA  Toshinori SUEYOSHI

    PAPER-Design Methodology and Platform

    Publicized: 2017/11/17
    Vol: E101-D No:2
    Page(s): 335-343

    Major cloud service providers, including Amazon and Microsoft, have started employing field-programmable gate arrays (FPGAs) to build high-performance, low-power cloud capabilities. However, utilizing an FPGA-enabled cloud is still challenging for two main reasons. First, the introduction of software and hardware co-design leads to high development complexity. Second, FPGA virtualization and accelerator scheduling techniques for cluster deployment are not yet fully researched. In this paper, we propose an open-source FPGA-as-a-Service (FaaS) platform, hCODE, to simplify the design, management and deployment of FPGA accelerators at cluster scale. The platform implements a Shell-and-IP design pattern and an open accelerator repository to reduce the design and management costs of FPGA projects, and it provides efficient FPGA virtualization and accelerator scheduling techniques so that accelerators can be deployed easily on an FPGA-enabled cluster. With hCODE, hardware designers and accelerator users can be brought together on one platform to efficiently build an open-hardware ecosystem.
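
    As a toy illustration of accelerator scheduling on an FPGA-enabled cluster (not the actual hCODE interface; all names below are made up), a scheduler can prefer an idle board that already holds the requested IP so that no reconfiguration is needed:

        from collections import namedtuple

        Fpga = namedtuple("Fpga", "name loaded_ip busy")
        Job  = namedtuple("Job", "name ip")

        def schedule(job, fpgas):
            # Prefer an idle board already configured with the job's IP;
            # otherwise reconfigure any idle board; otherwise queue the job.
            idle = [f for f in fpgas if not f.busy]
            ready = [f for f in idle if f.loaded_ip == job.ip]
            if ready:
                return ready[0], "reuse loaded bitstream"
            if idle:
                return idle[0], f"reconfigure with {job.ip}"
            return None, "queue until a board frees up"

        cluster = [Fpga("fpga0", "aes", True), Fpga("fpga1", "top-k", False)]
        print(schedule(Job("sort-run", "top-k"), cluster))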

  • FPGA Hardware Acceleration of a Phylogenetic Tree Reconstruction with Maximum Parsimony Algorithm

    Henry BLOCK  Tsutomu MARUYAMA

    PAPER-Computer System

    Publicized: 2016/11/14
    Vol: E100-D No:2
    Page(s): 256-264

    In this paper, we present an FPGA hardware implementation of phylogenetic tree reconstruction with a maximum parsimony algorithm. We base our approach on a stochastic local search algorithm that uses the Progressive Neighborhood and the Indirect Calculation of Tree Lengths method, which is widely used to accelerate phylogenetic tree reconstruction in software. In our implementation, we define a tree structure and accelerate the search by parallel and pipelined processing. We show results for eight real-world biological datasets, comparing execution times against our previous hardware approach and against TNT, the fastest available parsimony program, which is also accelerated by the Indirect Calculation of Tree Lengths method. Speedups of 34 to 45 per rearrangement, and of 2 to 6 for the whole search, are obtained over our previous hardware approach; speedups of 2 to 36 per rearrangement, and of 18 to 112 for the whole search, are obtained over TNT.
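
    For reference, the tree-length score that a parsimony search evaluates at every rearrangement is the classic Fitch small-parsimony count, sketched below as plain recomputation in Python; the Indirect Calculation of Tree Lengths method instead updates such scores incrementally, and the FPGA design evaluates them with parallel, pipelined hardware:

        def fitch_length(tree, leaf_states):
            # Fitch count for one character on a rooted binary tree:
            # the number of state changes the tree implies.
            def visit(node):
                if isinstance(node, str):               # leaf: singleton state set, 0 changes
                    return {leaf_states[node]}, 0
                (ls, lc), (rs, rc) = visit(node[0]), visit(node[1])
                common = ls & rs
                if common:                              # children agree: no extra change
                    return common, lc + rc
                return ls | rs, lc + rc + 1             # disagreement costs one change
            return visit(tree)[1]

        # quartet ((A,B),(C,D)) scored as a rooted binary tree
        tree = (("A", "B"), ("C", "D"))
        states = {"A": "G", "B": "G", "C": "T", "D": "T"}
        print(fitch_length(tree, states))               # -> 1 change needed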

  • Synthesis of Minimum-Cost Multilevel Logic Networks via Genetic Algorithm

    Barry SHACKLEFORD  Etsuko OKUSHI  Mitsuhiro YASUDA  Hisao KOIZUMI  Katsuhiko SEO  Hiroto YASUURA

    PAPER-Logic Synthesis

    Vol: E83-A No:12
    Page(s): 2528-2537

    The problem of synthesizing a minimum-cost logic network is formulated for a genetic algorithm (GA). When benchmarked against a commercial logic synthesis tool, the GA produced an odd-parity circuit requiring 24 basic cells (BCs) versus 28 BCs for the commercial design, and a magnitude comparator requiring 20 BCs versus 21 BCs. Poor temporal performance, however, is the main disadvantage of the GA-based approach. The design of a hardware-based cost function that would accelerate the GA by several thousand times is described.
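
    The kind of cost function such a GA (and its proposed hardware version) must evaluate for every candidate network can be sketched as follows; the gate encoding and penalty value are illustrative, not the paper's exact formulation:

        def cost(chromosome, truth_table, wrong_output_penalty=100):
            # A chromosome is a list of 2-input gates (op, in_a, in_b) whose outputs
            # are appended after the primary inputs; the last gate drives the output.
            # Cost = basic cells used + a heavy penalty per wrong truth-table row.
            ops = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b,
                   "NAND": lambda a, b: 1 - (a & b), "NOR": lambda a, b: 1 - (a | b)}
            errors = 0
            for inputs, expected in truth_table:
                signals = list(inputs)
                for op, a, b in chromosome:            # evaluate the network gate by gate
                    signals.append(ops[op](signals[a], signals[b]))
                errors += int(signals[-1] != expected)
            return len(chromosome) + wrong_output_penalty * errors

        # 2-input XOR (odd parity of two bits) built from four NAND cells
        xor_tt = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
        nand_xor = [("NAND", 0, 1), ("NAND", 0, 2), ("NAND", 1, 2), ("NAND", 3, 4)]
        print(cost(nand_xor, xor_tt))                  # -> 4 (functionally correct, four cells)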