1-6hit |
Binhao HE Meiting XUE Shubiao LIU Feng YU Weijie CHEN
The top-K sorting is a variant of sorting used heavily in applications such as database management systems. Recently, the use of field programmable gate arrays (FPGAs) to accelerate sorting operation has attracted the interest of researchers. However, existing hardware top-K sorting algorithms are either resource-intensive or of low throughput. In this paper, we present a resource-efficient top-K sorting architecture that is composed of L cascading sorting units, and each sorting unit is composed of P sorting cells. K=PL largest elements are produced when a variable length input sequence is processed. This architecture can operate at a high frequency while consuming fewer resources. The experimental results show that our architecture achieved a maximum 1.2x throughput-to-resource improvement compared to previous studies.
Piyumal RANAWAKA Mongkol EKPANYAPONG Adriano TAVARES Mathew DAILEY Krit ATHIKULWONGSE Vitor SILVA
Conventional sequential processing on software with a general purpose CPU has become significantly insufficient for certain heavy computations due to the high demand of processing power to deliver adequate throughput and performance. Due to many reasons a high degree of interest could be noted for high performance real time video processing on embedded systems. However, embedded processing platforms with limited performance could least cater the processing demand of several such intensive computations in computer vision domain. Therefore, hardware acceleration could be noted as an ideal solution where process intensive computations could be accelerated using application specific hardware integrated with a general purpose CPU. In this research we have focused on building a parallelized high performance application specific architecture for such a hardware accelerator for HOG-SVM computation implemented on Zynq 7000 FPGA. Histogram of Oriented Gradients (HOG) technique combined with a Support Vector Machine (SVM) based classifier is versatile and extremely popular in computer vision domain in contrast to high demand for processing power. Due to the popularity and versatility, various previous research have attempted on obtaining adequate throughput on HOG-SVM. This research with a high throughput of 240FPS on single scale on VGA frames of size 640x480 out performs the best case performance on a single scale of previous research by approximately a factor of 3-4. Further it's an approximately 15x speed up over the GPU accelerated software version with the same accuracy. This research has explored the possibility of using a novel architecture based on deep pipelining, parallel processing and BRAM structures for achieving high performance on the HOG-SVM computation. Further the above developed (video processing unit) VPU which acts as a hardware accelerator will be integrated as a co-processing peripheral to a host CPU using a novel custom accelerator structure with on chip buses in a System-On-Chip (SoC) fashion. This could be used to offload the heavy video stream processing redundant computations to the VPU whereas the processing power of the CPU could be preserved for running light weight applications. This research mainly focuses on the architectural techniques used to achieve higher performance on the hardware accelerator and on the novel accelerator structure used to integrate the accelerator with the host CPU.
Shinji KAWAMURA Tomoaki TSUMURA
Many mobile systems need to achieve both high performance and low memory usage, and the total performance of such the systems can be largely affected by the effectiveness of GC. Hence, the recent popularization of mobile devices makes the GC performance play one of the important roles on the wide range of platforms. The response performance degradation caused by suspending all processes for GC has been a well-known potential problem. Therefore, GC algorithms have been actively studied and improved, but they still have not reached any fundamental solution. In this paper, we focus on the point that the same objects are redundantly marked during the GC procedure implemented on DalvikVM, which is one of the famous runtime environments for the mobile devices. Then we propose a hardware support technique for improving marking routine of GC. We installed a set of tables to a processor for managing marked objects, and redundant marking for marked objects can be omitted by referring these tables. The result of the simulation experiment shows that the percentage of redundant marking is reduced by more than 50%.
Qian ZHAO Motoki AMAGASAKI Masahiro IIDA Morihiro KUGA Toshinori SUEYOSHI
Major cloud service providers, including Amazon and Microsoft, have started employing field-programmable gate arrays (FPGAs) to build high-performance and low-power-consumption cloud capability. However, utilizing an FPGA-enabled cloud is still challenging because of two main reasons. First, the introduction of software and hardware co-design leads to high development complexity. Second, FPGA virtualization and accelerator scheduling techniques are not fully researched for cluster deployment. In this paper, we propose an open-source FPGA-as-a-service (FaaS) platform, the hCODE, to simplify the design, management and deployment of FPGA accelerators at cluster scale. The proposed platform implements a Shell-and-IP design pattern and an open accelerator repository to reduce design and management costs of FPGA projects. Efficient FPGA virtualization and accelerator scheduling techniques are proposed to deploy accelerators on the FPGA-enabled cluster easily. With the proposed hCODE, hardware designers and accelerator users can be organized on one platform to efficiently build open-hardware ecosystem.
In this paper, we present an FPGA hardware implementation for a phylogenetic tree reconstruction with a maximum parsimony algorithm. We base our approach on a particular stochastic local search algorithm that uses the Progressive Neighborhood and the Indirect Calculation of Tree Lengths method. This method is widely used for the acceleration of the phylogenetic tree reconstruction algorithm in software. In our implementation, we define a tree structure and accelerate the search by parallel and pipeline processing. We show results for eight real-world biological datasets. We compare execution times against our previous hardware approach, and TNT, the fastest available parsimony program, which is also accelerated by the Indirect Calculation of Tree Lengths method. Acceleration rates between 34 to 45 per rearrangement, and 2 to 6 for the whole search, are obtained against our previous hardware approach. Acceleration rates between 2 to 36 per rearrangement, and 18 to 112 for the whole search, are obtained against TNT.
Barry SHACKLEFORD Etsuko OKUSHI Mitsuhiro YASUDA Hisao KOIZUMI Katsuhiko SEO Hiroto YASUURA
The problem of synthesizing a minimum-cost logic network is formulated for a genetic algorithm (GA). When benchmarked against a commercial logic synthesis tool, an odd parity circuit required 24 basic cells (BCs) versus 28 BCs for the design produced by the commercial system. A magnitude comparator required 20 BCs versus 21 BCs for the commercial system's design. Poor temporal performance, however, is the main disadvantage of the GA-based approach. The design of a hardware-based cost function that would accelerate the GA by several thousand times is described.