
Keyword Search Result

[Keyword] hardware architecture(20hit)

1-20hit
  • SLIT: An Energy-Efficient Reconfigurable Hardware Architecture for Deep Convolutional Neural Networks Open Access

    Thi Diem TRAN  Yasuhiko NAKASHIMA  

     
    PAPER

      Publicized:
    2020/12/18
      Vol:
    E104-C No:7
      Page(s):
    319-329

    Convolutional neural networks (CNNs) have dominated a range of applications, from advanced manufacturing to autonomous cars. For energy cost-efficiency, developing low-power hardware for CNNs is a research trend. Due to the large input size, the first few convolutional layers generally account for most of the latency and hardware resources in a hardware design. To address these challenges, this paper proposes an innovative architecture named SLIT to extract feature maps and reconstruct the first few layers of CNNs. In this reconstruction approach, total multiply-accumulate operations are eliminated on the first layers. We evaluate the new topology with the MNIST, CIFAR, SVHN, and ImageNet datasets on an image classification application. Latency and hardware resources of the inference step are evaluated on the chip ZC7Z020-1CLG484C FPGA with Lenet-5 and VGG schemes. On the Lenet-5 scheme, our architecture reduces latency by 39% and hardware resources by 70%, with a 0.456 W power consumption, compared to previous works. Even though the VGG models achieve a 10% reduction in hardware resources and latency, we hope our overall results will give a new impetus for future studies to reach a higher optimization in hardware design. Notably, the SLIT architecture merges efficiently with most popular CNNs at a slight cost in accuracy: 0.27% on MNIST, 0.5% to 1.5% on CIFAR, and approximately 2.2% on ImageNet, with no change on SVHN.

  • Full-HD 60fps FPGA Implementation of Spatio-Temporal Keypoint Extraction Based on Gradient Histogram and Parallelization of Keypoint Connectivity

    Takahiro SUZUKI  Takeshi IKENAGA  

     
    PAPER-Vision

      Vol:
    E99-A No:11
      Page(s):
    1937-1946

    Recently, cloud systems have started to be utilized for services which analyze user's data in the field of computer vision. In these services, keypoints are extracted from images or videos, and the data is identified by machine learning with a large database in the cloud. To reduce the number of keypoints which are sent to the cloud, Keypoints of Interest (KOI) extraction has been proposed. However, since its computational complexity is large, hardware implementation is required for real-time processing. Moreover, the hardware resource must be low because it is embedded in devices of users. This paper proposes a hardware-friendly KOI algorithm with low amount of computations and its real-time hardware implementation based on dual threshold keypoint detection by gradient histogram and parallelization of connectivity of adjacent keypoint-utilizing register counters. The algorithm utilizes dual-histogram based detection and keypoint-matching based calculation of motion information and dense-clustering based keypoint smoothing. The hardware architecture is composed of a detection module utilizing descriptor, and grid-region-parallelization based density clustering. Finally, the evaluation results of hardware implementation show that the implemented hardware achieves Full-HD (1920x1080)-60 fps spatio-temporal keypoint extraction. Further, it is 47 times faster than low complexity keypoint extraction on software and 12 times faster than spatio-temporal keypoint extraction on software, and the hardware resources are almost the same as SIFT hardware implementation, maintaining accuracy.

  • Design and Evaluation of a Configurable Query Processing Hardware for Data Streams

    Yasin OGE  Masato YOSHIMI  Takefumi MIYOSHI  Hideyuki KAWASHIMA  Hidetsugu IRIE  Tsutomu YOSHINAGA  

     
    PAPER-Computer System

      Publicized:
    2015/09/14
      Vol:
    E98-D No:12
      Page(s):
    2207-2217

    In this paper, we propose Configurable Query Processing Hardware (CQPH), an FPGA-based accelerator for continuous query processing over data streams. CQPH is a highly optimized and minimal-overhead execution engine designed to deliver real-time response for high-volume data streams. Unlike most of the other FPGA-based approaches, CQPH provides on-the-fly configurability for multiple queries with its own dynamic configuration mechanism. With a dedicated query compiler, SQL-like queries can be easily configured into CQPH at run time. CQPH supports continuous queries including selection, group-by operation and sliding-window aggregation with a large number of overlapping sliding windows. As a proof of concept, a prototype of CQPH is implemented on an FPGA platform for a case study. Evaluation results indicate that a given query can be configured within just a few microseconds, and the prototype implementation of CQPH can process over 150 million tuples per second with a latency of less than a microsecond. Results also indicate that CQPH provides linear scalability to increase its flexibility (i.e., on-the-fly configurability) without sacrificing performance (i.e., maximum allowable clock speed).
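
    As a software illustration of the query class named in the abstract above (selection, group-by, and sliding-window aggregation over overlapping windows), the following is a minimal sketch only. It is not the CQPH design or its query compiler; the function name, the count-based window definition, and the (key, value) tuple layout are assumptions made for this example.

```python
from collections import defaultdict, deque


def sliding_window_sums(stream, window, slide, predicate=lambda t: True):
    """Yield (tuples_seen, {key: sum}) for overlapping count-based windows.

    stream    : iterable of (key, value) tuples (assumed layout)
    window    : number of tuples per window
    slide     : number of new tuples between consecutive window results
    predicate : selection filter applied before windowing
    """
    buf = deque(maxlen=window)  # keeps only the most recent `window` tuples
    seen = 0
    for tup in stream:
        if not predicate(tup):  # selection
            continue
        buf.append(tup)
        seen += 1
        # Emit once the first window is full, then every `slide` tuples.
        if seen >= window and (seen - window) % slide == 0:
            sums = defaultdict(int)
            for key, value in buf:  # group-by key, aggregate with SUM
                sums[key] += value
            yield seen, dict(sums)


if __name__ == "__main__":
    data = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
    for end, result in sliding_window_sums(data, window=3, slide=1):
        print(end, result)
```

    This version recomputes every window from scratch; a hardware engine handling many overlapping windows would more plausibly share partial aggregates between them, which this sketch does not attempt.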

  • Scalable Hardware Winner-Take-All Neural Network with DPLL

    Masaki AZUMA  Hiroomi HIKAWA  

     
    PAPER-Biocybernetics, Neurocomputing

      Publicized:
    2015/07/21
      Vol:
    E98-D No:10
      Page(s):
    1838-1846

    Neural networks are widely used in various fields due to their superior learning abilities. This paper proposes a hardware winner-take-all neural network (WTANN) that employs a new winner-take-all (WTA) circuit with phase-modulated pulse signals and digital phase-locked loops (DPLLs). The system uses DPLL as a computing element, so all input values are expressed by phases of rectangular signals. The proposed WTA circuit employs a simple winner search circuit. The proposed WTANN architecture is described by very high speed integrated circuit (VHSIC) hardware description language (VHDL), and its feasibility was tested and verified through simulations and experiments. Conventional WTA takes a global winner search approach, in which vector distances are collected from all neurons and compared. In contrast, the WTA in the proposed system is carried out locally by a distributed winner search circuit among neurons. Therefore, no global communication channels with a wide bandwidth between the winner search module and each neuron are required. Furthermore, the proposed WTANN can easily extend the system scale, merely by increasing the number of neurons. The circuit size and speed were then evaluated by applying the VHDL description to a logic synthesis tool and experiments using a field programmable gate array (FPGA). Vector classifications with WTANN using two kinds of data sets, Iris and Wine, were carried out in VHDL simulations. The results revealed that the proposed WTANN achieved valid learning.

  • Hardware Architecture of the Fast Mode Decision Algorithm for H.265/HEVC

    Wenjun ZHAO  Takao ONOYE  Tian SONG  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E98-A No:8
      Page(s):
    1787-1795

    In this paper, a specified hardware architecture of the Fast Mode Decision (FMD) algorithms presented in our previous work is proposed. This architecture is designed as an embedded mode dispatch module. On the basis of this module, some unnecessary modes can be skipped or the mode decision process can be terminated in advance. In order to maintain a higher compatibility, the FMD algorithms are unified into a single module that can be easily embedded into a common video codec for H.265/HEVC. The input and output interfaces between the proposed module and other parts of the codec are designed based on a simple but effective protocol. Hardware synthesis results on FPGA demonstrate that the proposed architecture achieves a maximum frequency of about 193 MHz with less than 1% of the total resources consumed. Moreover, the proposed module can improve the overall throughput.

  • Energy-Efficient IDCT Design for DS-CDMA Watermarking Systems

    Shan-Chun KUO  Hong-Yuan JHENG  Fan-Chieh CHENG  Shanq-Jang RUAN  

     
    LETTER-VLSI Design Technology and CAD

      Vol:
    E96-A No:5
      Page(s):
    995-996

    In this letter, a design of inverse discrete cosine transform for energy-efficient watermarking mechanism based on DS-CDMA with significant energy and area reduction is presented. Taking advantage of converged input data value set as a precomputation concept, the proposed one-dimensional IDCT is a multiplierless hardware which differs from Loeffler architecture and has benefits of low complexity and low power consumption. The experimental results show that our design can reduce 85.2% energy consumption and 58.6% area. Various spectrum and spatial attacks are also tested to corroborate the robustness.

  • Highly Parallel and Fully Reused H.264/AVC High Profile Intra Predictor Generation Engine for Super Hi-Vision 4k×4k@60 fps

    Yiqing HUANG  Xiaocong JIN  Jin ZHOU  Jia SU  Takeshi IKENAGA  

     
    PAPER

      Vol:
    E94-C No:4
      Page(s):
    428-438

    One high profile intra predictor generation engine is proposed in this paper. Firstly, hardware level algorithm optimization for intra 8×8 (I8MB) mode is introduced. The original candidate pixels for generating prediction samples of I8MB are replaced with boundary pixels of intra 4×4 (I4MB) blocks. Based on this adoption, full data reuse between predictors of I4MB and filtered samples of I8MB can be achieved with almost no quality loss. Secondly, one lossless two-4×4-block based parallel predictor generation flow is proposed. The original predictor generation flow is optimized from 16 stages to 10 stages for I4MB and Intra 16×16 (I16MB), which saves 37.5% processing cycles. For I8MB, similar methodology with different processing order of 4×4 scaled blocks is introduced. Thirdly, fully utilized hardwired engines for I4MB, I16MB and I8MB are proposed in this paper. Except DC (direct current) and plane modes, full data reuse among all intra modes of high profile can be achieved. Fourthly, for DC mode, one combined predictor generation process is introduced and predictor generation of I16MB's DC mode is merged into the process of I4MB's DC mode. Moreover, by configuring proposed hardwired engines, predictor generation of I16MB's plane mode and chrominance plane mode can be accomplished with only 50% cycles of original design. Totally, when compared with original full-mode design and latest dynamic mode reused design, the proposed predictor generation engine can achieve 89.5% and 73.2% saving of processing cycles, respectively. Synthesized by TSMC 0.18 µm technology under worst work conditions (1.62 V, 125°C), with 380 MHz and 37.2 k gates, the proposed design can handle real-time high profile intra predictor generation of Super Hi-Vision 4k×4k@60 fps. The maximum work frequency of our design under worst condition is 468 MHz.

  • A 530 Mpixels/s Intra Prediction Architecture for Ultra High Definition H.264/AVC Encoder

    Gang HE  Dajiang ZHOU  Jinjia ZHOU  Tianruo ZHANG  Satoshi GOTO  

     
    PAPER

      Vol:
    E94-C No:4
      Page(s):
    419-427

    Intra coding in H.264/AVC significantly enhances video compression efficiency. However, due to the high data dependency of intra prediction in H.264, both pipelining and parallel processing techniques are limited to be applied. Moreover, it is difficult to get high hardware utilization and throughput because of the long block/MB-level reconstruction loops. This paper proposes a high-performance intra prediction architecture that can support H.264/AVC high profile. The proposed MB/block co-reordering can avoid data dependency and improve pipeline utilization. Therefore, the timing constraint of real-time 4096×2160 encoding can be achieved with negligible quality loss. 16×16 prediction engine and 8×8 prediction engine work parallel for prediction and coefficients generating. A reordering interlaced reconstruction is also designed for fully pipelined architecture. It takes only 160 cycles to process one macroblock (MB). Hardware utilization of prediction and reconstruction modules is almost 100%. Furthermore, PE-reusable 8×8 intra predictor and hybrid SAD & SATD mode decision are proposed to save hardware cost. The design is implemented by 90 nm CMOS technology with 113.2 k gates and can encode 4096×2160 video sequences at 60 fps with operation frequency of 332 MHz.

  • How to Maximize the Potential of FPGA-Based DSPs for Modular Exponentiation

    Daisuke SUZUKI  Tsutomu MATSUMOTO  

     
    PAPER-Implementation

      Vol:
    E94-A No:1
      Page(s):
    211-222

    This paper describes a modular exponentiation processing method and circuit architecture that can exhibit the maximum performance of FPGA resources. The modular exponentiation architecture proposed by us comprises three main techniques. The first one is to improve the Montgomery multiplication algorithm in order to maximize the performance of the multiplication unit in an FPGA. The second one is to balance and improve the circuit delay. The third one is to ensure scalability of the circuit. Our architecture can perform fast operations using small-scale resources; in particular, it can complete a 512-bit modular exponentiation as fast as in 0.26 ms with the smallest Virtex-4 FPGA, XC4VF12-10SF363. In fact the number of SLICEs used is approx. 4200, which proves the compactness of our design. Moreover, the scalability of our design also allows 1024-, 1536-, and 2048-bit modular exponentiations to be processed in the same circuit.
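
    For readers unfamiliar with the building block mentioned in the abstract above, the following is a minimal software sketch of textbook Montgomery multiplication (REDC) driving a left-to-right square-and-multiply exponentiation. It does not reproduce the authors' improved algorithm or their DSP mapping, which the abstract does not detail; all function names and the choice of R = 2^bits are assumptions for illustration. It relies on Python 3.8+ for the modular inverse via pow.

```python
def montgomery_setup(n, bits):
    """Precompute constants for an odd modulus n with R = 2**bits > n."""
    if n % 2 == 0 or n >= (1 << bits):
        raise ValueError("n must be odd and smaller than 2**bits")
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r  # satisfies n * n_prime == -1 (mod R)
    r2 = (r * r) % n                # used to convert into Montgomery form
    return n_prime, r2


def mont_mul(a, b, n, bits, n_prime):
    """Return a * b * R^-1 mod n for 0 <= a, b < n (REDC)."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask  # makes t + m*n divisible by R
    u = (t + m * n) >> bits            # exact division by R; u < 2n
    return u - n if u >= n else u


def mont_pow(base, exp, n, bits):
    """Compute base**exp mod n using only Montgomery multiplications."""
    n_prime, r2 = montgomery_setup(n, bits)
    x = mont_mul(base % n, r2, n, bits, n_prime)  # base in Montgomery form
    acc = mont_mul(1, r2, n, bits, n_prime)       # 1 in Montgomery form
    for bit in bin(exp)[2:]:                      # most significant bit first
        acc = mont_mul(acc, acc, n, bits, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, bits, n_prime)
    return mont_mul(acc, 1, n, bits, n_prime)     # leave Montgomery form


if __name__ == "__main__":
    n = (1 << 512) - 569  # an arbitrary odd 512-bit modulus for the demo
    print(mont_pow(3, 65537, n, 512) == pow(3, 65537, n))
```

    The point of the technique is that the reduction step uses only multiplications, masks, and shifts, with no division by the modulus, which is what makes it a natural fit for hardware multiplier blocks.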

  • How to Decide Selection Functions for Power Analysis: From the Viewpoint of Hardware Architecture of Block Ciphers

    Daisuke SUZUKI  Minoru SAEKI  Koichi SHIMIZU  Tsutomu MATSUMOTO  

     
    PAPER-Implementation

      Vol:
    E94-A No:1
      Page(s):
    200-210

    In this paper we first demonstrate that effective selection functions in power analysis attacks change depending on circuit architectures of a block cipher. We then conclude that the most resistant architecture on its own, in the case of the loop architecture, has two data registers with separate roles: one for storing the plaintext and ciphertext, and the other for storing intermediate values. There, the pre-whitening operation is placed at the output of the former register. The architecture allows the narrowest range of selection functions and thereby has resistance against ordinary CPA. Thus, we can easily defend against attacks by ordinary CPA at the architectural level, whereas we cannot against DPA. Secondly, we propose a new technique called "self-templates" in order to raise the accuracy of evaluation of DPA-based attacks. Self-templates make it possible to differentiate meaningful selection functions for DPA-based attacks without any strong assumption as in the template attack. We also present the results of attacks on an AES co-processor on an ASIC and demonstrate the effectiveness of the proposed technique.

  • Design and Implementation of a Non-pipelined MD5 Hardware Architecture Using a New Functional Description

    Ignacio ALGREDO-BADILLO  Claudia FEREGRINO-URIBE  Rene CUMPLIDO  Miguel MORALES-SANDOVAL  

     
    LETTER-VLSI Systems

      Vol:
    E91-D No:10
      Page(s):
    2519-2523

    MD5 is a cryptographic algorithm used for authentication. When implemented in hardware, the performance is affected by the data dependency of the iterative compression function. In this paper, a new functional description is proposed with the aim of achieving higher throughput by means of reducing the critical path and latency. This description can be used in similar structures of other hash algorithms, such as SHA-1, SHA-2 and RIPEMD-160, which have comparable data dependence. The proposed MD5 hardware architecture achieves a high throughput/area ratio. Results of an implementation in an FPGA are presented and discussed, along with comparisons against related works.

  • A Performance Optimized Architecture of Deblocking Filter in H.264/AVC

    Kyeong-Yuk MIN  Jong-Wha CHONG  

     
    PAPER

      Vol:
    E91-A No:4
      Page(s):
    1038-1043

    In this paper, we propose memory and performance optimized architecture to accelerate the operation speed of adaptive deblocking filter for H.264/JVT/AVC video coding. The proposed deblocking filter executes loading/storing and filtering operations with only 192 cycles for 1 macroblock. Only 244 internal buffers and 3216 internal SRAM are adopted for the buffering operation of deblocking filter with I/O bandwidth of 32 bit. The proposed architecture can process the filtering operation for 1 macroblock with less filtering cycles and lower memory sizes than some conventional approaches of realizing deblocking filter. The efficient hardware architecture is implemented with novel data arrangement, hybrid filter scheduling and minimum number of buffer. The proposed architecture is suitable for low cost and real-time applications, and the real-time decoding with 1080HD (1920×1088@30 fps) can be easily achieved when working frequency is 70 MHz.

  • Parallel Improved HDTV720p Targeted Propagate Partial SAD Architecture for Variable Block Size Motion Estimation in H.264/AVC

    Yiqing HUANG  Zhenyu LIU  Yang SONG  Satoshi GOTO  Takeshi IKENAGA  

     
    PAPER

      Vol:
    E91-A No:4
      Page(s):
    987-997

    One hardware efficient and high speed architecture for variable block size motion estimation (VBSME) in H.264 is presented in this paper. By improving the pipeline structure and processing element (PE) circuits, the system latency and hardware cost is reduced, which makes this structure more hardware efficient than the original Propagate Partial SAD architecture. For small and middle frame size picture's coding, the proposed structure can save 12.1% hardware cost compared with original Propagate Partial SAD structure. In the case of HDTV, since small inter modes trivially contribute to the coding quality, we remove modes below 8×8 in our design. By adopting mode reduction technique, when the set number of PE array is less than 8, the proposed mode reduction based Propagate Partial SAD structure can work at faster clock speed and consume less hardware cost than widely used SAD Tree architecture. It is more robust to the high speed timing constraint when parallel processing is considered. With TSMC 0.18 µm technology in worst work conditions (1.62 V, 125°C), its peak throughput of 8-set PE array structure is 720p@30 Hz with 128×64 search range and 5 reference frames. 12 k gates hardware cost can be reduced by our design compared with the parallel SAD Tree architecture.

  • Hardware Architecture for Fast Motion Estimation in H.264/AVC Video Coding

    Myung-Suk BYEON  Yil-Mi SHIN  Yong-Beom CHO  

     
    LETTER

      Vol:
    E89-A No:6
      Page(s):
    1744-1745

    This paper describes the efficiency of VLSI architecture for UMHexagonS (hybrid Unsymmetrical cross Multi Hexagon grid Search) matching algorithm. This algorithm is used for ME (Motion Estimation) of H.264/AVC video compression standard. The UMHexagonS is called a hybrid algorithm since it uses different kinds of searching patterns. VLSI architecture based on UMHexagonS is designed to provide a good tradeoff between gate sizes and high throughput. We implemented this architecture with about 309 K gates and 1/1792 throughput [block/cycle] for a search range of 16 and 4×4 macro blocks using synthesizable Verilog HDL.

  • Fast Learning Algorithms for Self-Organizing Map Employing Rough Comparison WTA and its Digital Hardware Implementation

    Hakaru TAMUKOH  Keiichi HORIO  Takeshi YAMAKAWA  

     
    PAPER

      Vol:
    E87-C No:11
      Page(s):
    1787-1794

    This paper describes a new fast learning algorithm for Self-Organizing Map employing a "rough comparison winner-take-all" and its digital hardware architecture. In rough comparison winner-take-all algorithm, the winner unit is roughly and strictly assigned in early and later learning stage, respectively. It realizes both of high accuracy and fast learning. The digital hardware of the self-organizing map with proposed WTA algorithm is implemented using FPGA. Experimental results show that the designed hardware is superior to other hardware with respect to calculation speed.

  • Adaptive Tessellation of PN Triangles Using Minimum-Artifact Edge Linking

    Yun-Seok CHOI  Kyu-Sik CHUNG  Lee-Sup KIM  

     
    LETTER-Computer Graphics

      Vol:
    E87-A No:10
      Page(s):
    2821-2828

    The PN triangle method has a great significance in processing tessellation at the hardware level without software assistance. Despite its significance, however, the conventional PN triangle method has certain defects such as inefficient GE operation and degradation of visual quality, because the method tessellates a curved surface according to a user-defined fixed LOD (Level Of Detail). In this paper, we propose adaptive tessellation of PN triangles using minimum-artifact edge linking. Through this method, higher efficiency of tessellation and better quality of scene are obtained by adaptivity and minimum-artifact edge linking, respectively. This paper also presents a hardware architecture of a PN triangle method using adaptive LOD, which is not a burden for overall 3D graphics hardware.

  • Reduction of Background Computations in Block-Matching Motion Estimation

    Vasily G. MOSHNYAGA  Koichi MASUNAGA  

     
    PAPER-Video/Image Coding

      Vol:
    E87-A No:3
      Page(s):
    539-546

    A new algorithm and architecture to eliminate redundant operations in block-matching (BM) motion estimation is proposed. The key step of this work is to use binary-matching to define image regions with the static background content and then exclude these regions from the actual motion estimation. According to experiments, the approach maintains the highest PSNR, while requiring half as many computations as the adaptive BM, or 1/8 of the computations required by the full-search BM. An implementation scheme is outlined.
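
    The idea of excluding static regions before searching can be sketched in software as follows. This is an illustration under stated assumptions, not the paper's method: the abstract's binary-matching step is replaced here by a plain co-located SAD threshold as a stand-in static-block test, and the function names, block size, and search range are all hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def estimate_motion(cur, ref, n=4, search=2, static_threshold=0):
    """Full-search block matching that skips blocks judged static.

    Frames are lists of rows of integer pixel values. Returns a dict
    mapping each block origin (bx, by) to its motion vector (dx, dy).
    """
    h, w = len(cur), len(cur[0])
    vectors = {}
    for by in range(0, h - n + 1, n):
        for bx in range(0, w - n + 1, n):
            # Stand-in static test (assumption): unchanged co-located block.
            if sad(cur, ref, bx, by, 0, 0, n) <= static_threshold:
                vectors[(bx, by)] = (0, 0)  # no search performed
                continue
            best_cost, best_mv = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    # Skip candidates that fall outside the reference frame.
                    if not (0 <= bx + dx <= w - n and 0 <= by + dy <= h - n):
                        continue
                    cost = sad(cur, ref, bx, by, dx, dy, n)
                    if best_cost is None or cost < best_cost:
                        best_cost, best_mv = cost, (dx, dy)
            vectors[(bx, by)] = best_mv
    return vectors
```

    Every block that passes the static test costs one SAD evaluation instead of up to (2*search + 1)^2, which is where the savings in a mostly static scene would come from.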

  • Motion Estimation and Compensation Hardware Architecture for a Scene-Adaptive Algorithm on a Single-Chip MPEG-2 Video Encoder

    Koyo NITTA  Toshihiro MINAMI  Toshio KONDO  Takeshi OGURA  

     
    PAPER-VLSI Systems

      Vol:
    E84-D No:3
      Page(s):
    317-325

    This paper describes a unique motion estimation and compensation (ME/MC) hardware architecture for a scene-adaptive algorithm. By statistically analyzing the characteristics of the scene being encoded and controlling the encoding parameters according to the scene, the quality of the decoded image can be enhanced. The most significant feature of the architecture is that the two modules for ME/MC can work independently. Since a time interval can be inserted between the operations of the two modules, a scene-adaptive algorithm can be implemented in the architecture. The ME/MC architecture is loaded on a single-chip MPEG-2 video encoder.

  • Hardware Framework for Accelerating the Execution Speed of a Genetic Algorithm

    Barry SHACKLEFORD  Etsuko OKUSHI  Mitsuhiro YASUDA  Hisao KOIZUMI  Katsuhiko SEO  Takashi IWAMOTO  

     
    PAPER-Multi Processors

      Vol:
    E80-C No:7
      Page(s):
    962-969

    Genetic algorithms were introduced by Holland in 1975 as a method of solving difficult optimization problems by means of simulated evolution. A major drawback of genetic algorithms is their slowness when emulated by software on conventional computers. Described is an adaptation of the original genetic algorithm that is advantageous to hardware implementation along with the architecture of a hardware framework that performs the functions of population storage, selection, crossover, mutation, fitness evaluation, and survival determination. Programming of the framework is illustrated with the set coverage problem that exhibits a 6,000× speed-up over software emulation on a 100 MHz workstation.

  • Parallel Move Generation System for Computer Chess

    Yi-Fan KE   Tai-Ming PARNG  

     
    PAPER-Computer Hardware and Design

      Vol:
    E79-D No:4
      Page(s):
    290-296

    This paper presents a parallel move generation of a Chess machine system for achieving the purpose of reducing the number of move generation cycles. The parallel system is composed of five move generation modules which share the move generating cycles to reduce the time of building a game tree. Simulation results show that the proposed parallel move generation architecture takes about half of the number of move generation cycles to build a game tree that is the same as the one built by a sequential move generation module.