The search functionality is under construction.
The search functionality is under construction.

Keyword Search Result

[Keyword] fpga(330hit)

121-140hit(330hit)

  • Network-Level FPGA Acceleration of Low Latency Market Data Feed Arbitration

    Stewart DENHOLM  Hiroaki INOUE  Takashi TAKENAKA  Tobias BECKER  Wayne LUK  

     
    PAPER-Application

      Pubricized:
    2014/11/19
      Vol:
    E98-D No:2
      Page(s):
    288-297

    Financial exchanges provide market data feeds to update their members about changes in the market. Feed messages are often used in time-critical automated trading applications, and two identical feeds (A and B feeds) are provided in order to reduce message loss. A key challenge is to support A/B line arbitration efficiently to compensate for missing packets, while offering flexibility for various operational modes such as prioritising for low latency or for high data reliability. This paper presents a reconfigurable acceleration approach for A/B arbitration operating at the network level, capable of supporting any messaging protocol. Two modes of operation are provided simultaneously: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods. We also present a model for message feed processing latencies that is useful for evaluating scalability in future applications. We outline a new low latency, high throughput architecture and demonstrate a cycle-accurate testing framework to measure the actual latency of packets within the FPGA. We implement and compare the performance of the NASDAQ TotalView-ITCH, OPRA and ARCA market data feed protocols using a Xilinx Virtex-6 FPGA. For high reliability messages we achieve latencies of 42ns for TotalView-ITCH and 36.75ns for OPRA and ARCA. 6ns and 5.25ns are obtained for low latency messages. The most resource intensive protocol, TotalView-ITCH, is also implemented in a Xilinx Virtex-5 FPGA within a network interface card; it is used to validate our approach with real market data. We offer latencies 10 times lower than an FPGA-based commercial design and 4.1 times lower than the hardware-accelerated IBM PowerEN processor, with throughputs more than double the required 10Gbps line rate.

  • Performance Modeling of Stencil Computing on a Stream-Based FPGA Accelerator for Efficient Design Space Exploration

    Keisuke DOHI  Koji OKINA  Rie SOEJIMA  Yuichiro SHIBATA  Kiyoshi OGURI  

     
    PAPER-Application

      Pubricized:
    2014/11/19
      Vol:
    E98-D No:2
      Page(s):
    298-308

    In this paper, we discuss performance modeling of 3-D stencil computing on an FPGA accelerator with a high-level synthesis environment, aiming for efficient exploration of user-space design parameters. First, we analyze resource utilization and performance to formulate these relationships as mathematical models. Then, in order to evaluate our proposed models, we implement heat conduction simulations as a benchmark application, by using MaxCompiler, which is a high-level synthesis tool for FPGAs, and MaxGenFD, which is a domain specific framework of the MaxCompiler for finite-difference equation solvers. The experimental results with various settings of architectural design parameters show the best combination of design parameters for pipeline structure can be systematically found by using our models. The effects of changing arithmetic accuracy and using data stream compression are also discussed.

  • A Multidimensional Configurable Processor Array — Vocalise

    Jiang LI  Yusuke ATSUMARI  Hiromasa KUBO  Yuichi OGISHIMA  Satoru YOKOTA  Hakaru TAMUKOH  Masatoshi SEKINE  

     
    PAPER-Computer System

      Pubricized:
    2014/10/27
      Vol:
    E98-D No:2
      Page(s):
    313-324

    A processing system with multiple field programmable gate array (FPGA) cards is described. Each FPGA card can interconnect using six I/O (up, down, left, right, front, and back) terminals. The communication network among FPGAs is scalable according to user design. When the system operates multi-dimensional applications, transmission efficiency among FPGA improved through user-adjusted dimensionality and network topologies for different applications. We provide a fast and flexible circuit configuration method for FPGAs of a multi-dimensional FPGA array. To demonstrate the effectiveness of the proposed method, we assess performance and power consumption of a circuit that calculated 3D Poisson equations using the finite difference method.

  • Real-Time Touch Controller with High-Speed Touch Accelerator for Large-Sized Touch Screens

    SangHyuck BAE  DoYoung JUNG  CheolSe KIM  KyoungMoon LIM  Yong-Surk LEE  

     
    LETTER-Human-computer Interaction

      Pubricized:
    2014/10/17
      Vol:
    E98-D No:1
      Page(s):
    193-196

    For a large-sized touch screen, we designed and evaluated a real-time touch microarchitecture using a field-programmable gate array (FPGA). A high-speed hardware accelerator based on a parallel touch algorithm is suggested and implemented in this letter. The touch controller also has a timing control unit and an analog digital convert (ADC) control unit for analog touch sensing circuits. Measurement results of processing time showed that the touch controller with its proposed microarchitecture is five times faster than the 32-bit reduced instruction set computer (RISC) processor without the touch accelerator.

  • A Compact Matched Filter Bank for a Mutually Orthogonal ZCZ Sequence Set Consisting of Ternary Sequence Pairs

    Takahiro MATSUMOTO  Hideyuki TORII  Yuta IDA  Shinya MATSUFUJI  

     
    LETTER-Sequences

      Vol:
    E97-A No:12
      Page(s):
    2595-2600

    In this paper, we propose a new structure for a compact matched filter bank for a mutually orthogonal zero-correlation zone (MO-ZCZ) sequence set consisting of ternary sequence pairs obtained by Hadamard and binary ZCZ sequence sets; this construction reduces the number of two-input adders and delay elements. The matched filter banks are implemented on a field-programmable gate array (FPGA) with 51,840 logic elements (LEs). The proposed matched filter bank for an MO-ZCZ sequence set of length 160 can be constructed by a circuit size that is about 8.6% that of a conventional matched filter bank.

  • An FPGA Implementation of the Two-Dimensional FDTD Method and Its Performance Comparison with GPGPU

    Ryota TAKASU  Yoichi TOMIOKA  Yutaro ISHIGAKI  Ning LI  Tsugimichi SHIBATA  Mamoru NAKANISHI  Hitoshi KITAZAWA  

     
    PAPER

      Vol:
    E97-C No:7
      Page(s):
    697-706

    Electromagnetic field analysis is a time-consuming process, and a method involving the use of an FPGA accelerator is one of the attractive ways to accelerate the analysis; the other method involve the use of CPU and GPU. In this paper, we propose an FPGA accelerator dedicated for a two-dimensional finite-difference time-domain (FDTD) method. This accelerator is based on a two-dimensional single instruction multiple data (SIMD) array architecture. Each processing element (PE) is composed of a six-stage pipeline that is optimized for the FDTD method. Moreover, driving signal generation and impedance termination are also implemented in the hardware. We demonstrate that our accelerator is 11 times faster than existing FPGA accelerators and 9 times faster than parallel computing on the NVIDIA Tesla C2075. As an application of the high-speed FDTD accelerator, the design optimization of a waveguide is shown.

  • Reconfigurable Out-of-Order System for Fluid Dynamics Computation Using Unstructured Mesh

    Takayuki AKAMINE  Mohamad Sofian ABU TALIP  Yasunori OSANA  Naoyuki FUJITA  Hideharu AMANO  

     
    PAPER-Computer System

      Vol:
    E97-D No:5
      Page(s):
    1225-1234

    Computational fluid dynamics (CFD) is an important tool for designing aircraft components. FaSTAR (Fast Aerodynamics Routines) is one of the most recent CFD packages and has various subroutines. However, its irregular and complicated data structure makes it difficult to execute FaSTAR on parallel machines due to memory access problem. The use of a reconfigurable platform based on field programmable gate arrays (FPGAs) is a promising approach to accelerating memory-bottlenecked applications like FaSTAR. However, even with hardware execution, a large number of pipeline stalls can occur due to read-after-write (RAW) data hazards. Moreover, it is difficult to predict when such stalls will occur because of the unstructured mesh used in FaSTAR. To eliminate this problem, we developed an out-of-order mechanism for permuting the data order so as to prevent RAW hazards. It uses an execution monitor and a wait buffer. The former identifies the state of the computation units, and the latter temporarily stores data to be processed in the computation units. This out-of-order mechanism can be applied to various types of computations with data dependency by changing the number of execution monitors and wait buffers in accordance with the equations used in the target computation. An out-of-order system can be reconfigured by automatic changing of the parameters. Application of the proposed mechanism to five subroutines in FaSTAR showed that its use reduces the number of stalls to less than 1% compared to without the mechanism. In-order execution was speeded up 2.6-fold and software execution was speeded up 2.9-fold using an Intel Core 2 Duo processor with a reasonable amount of overhead.

  • Asynchronous Circuit Designs on an FPGA for Targeting a Power/Energy Efficient SoC

    Jeong-Gun LEE  Myeong-Hoon OH  

     
    PAPER

      Vol:
    E97-C No:4
      Page(s):
    253-263

    A modern system-on-chip (SoC) includes many heterogeneous IP components. Generally, a few embedded processors are integrated into SoCs. An asynchronous circuit design technique is employed to achieve low power/energy consumption. In this paper, we design an asynchronous embedded processor on FPGAs and analyze its possible benefits on commercial FPGAs. We use commercially available 65nm high-performance Virtex-5 and 45nm low-power Spartan-6 Xilinx FPGAs to show the impact on power consumption for the two different extreme cases. For the high performance Virtex-5, our asynchronous processor shows 36.8% lower power consumption when compared with its synchronous counterpart. On the other hand, the asynchronous processor consumes 25.6% more power in a low power Spartan-6 FPGA. However, through simple analysis and power simulation, we show that the event-driven nature of asynchronous circuits can further save power/energy even in the Spartan-6 FPGA.

  • FPGA Implementation of Exclusive Block Matching for Robust Moving Object Extraction and Tracking

    Yoichi TOMIOKA  Ryota TAKASU  Takashi AOKI  Eiichi HOSOYA  Hitoshi KITAZAWA  

     
    PAPER-Image Processing and Video Processing

      Vol:
    E97-D No:3
      Page(s):
    573-582

    Hardware acceleration is an essential technique for extracting and tracking moving objects in real time. It is desirable to design tracking algorithms such that they are applicable for parallel computations on hardware. Exclusive block matching methods are designed for hardware implementation, and they can realize detailed motion extraction as well as robust moving object tracking. In this study, we develop tracking hardware based on an exclusive block matching method on FPGA. This tracking hardware is based on a two-dimensional systolic array architecture, and can realize robust moving object extraction and tracking at more than 100 fps for QVGA images using the high parallelism of an exclusive block matching method, synchronous shift data transfer, and special circuits to accelerate searching the exclusive correspondence of blocks.

  • A CAM-Based Information Detection Hardware System for Fast Image Matching on FPGA

    Duc-Hung LE  Tran-Bao-Thuong CAO  Katsumi INOUE  Cong-Kha PHAM  

     
    PAPER-Electronic Circuits

      Vol:
    E97-C No:1
      Page(s):
    65-76

    In this paper, the authors present a CAM-based Information Detection Hardware System for fast, exact and approximate image matching on 2-D data, using FPGA. The proposed system can be potentially applied to fast image matching with various required search patterns, without using search principles. In designing the system, we take advantage of Content Addressable Memory (CAM) which has parallel multi-match mode capability and has been designed, using dual-port RAM blocks. The system has a simple structure, and does not employ any Central Processor Unit (CPU) or complicated computations.

  • A Digital TRNG Based on Cross Feedback Ring Oscillators

    Lijuan LI  Shuguo LI  

     
    PAPER-Hardware Based Security

      Vol:
    E97-A No:1
      Page(s):
    284-291

    In this paper, a new digital true random number generator based on Cross Feedback Ring Oscillators (CFRO) is proposed. The random sources of CFRO lie in delay variations (jitter), unpredictable transition behaviors as well as metastability. The CFRO is proved to be truly random by restarting from the same initial states. Compared with the so-called Fibonacci Ring Oscillator (FIRO) and Galois Ring Oscillator (GARO), the CFRO needs less than half of their time to accumulate relatively high entropy and enable extraction of one random bit. Only a simple XOR corrector is used to reduce the bias of output sequences. TRNG based on CFRO can be run continuously at a constant high speed of 150Mbps. For higher security, the TRNG can be set in stateless mode at a cost of slower speed of 10Mbps. The total logical resources used are relatively small and no special placement and routing is needed. The TRNG both in continuous mode and in stateless mode can pass the NIST tests and the DIEHARD tests.

  • Evaluation of an FPGA-Based Heterogeneous Multicore Platform with SIMD/MIMD Custom Accelerators

    Yasuhiro TAKEI  Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E96-A No:12
      Page(s):
    2576-2586

    Heterogeneous multi-core architectures with CPUs and accelerators attract many attentions since they can achieve power-efficient computing in various areas from low-power embedded processing to high-performance computing. Since the optimal architecture is different from application to application, finding the most suitable accelerator is very important. In this paper, we propose an FPGA-based heterogeneous multi-core platform with custom accelerators for power-efficient computing. Using the proposed platform, we evaluate several applications and accelerators to identify many key requirements of the applications and properties of the accelerators. Such an evaluation is very important to select and optimize the most suitable accelerator according to the requirements of an application to achieve the best performance.

  • Window Memory Layout Scheme for Alternate Row-Wise/Column-Wise Matrix Access

    Lei GUO  Yuhua TANG  Yong DOU  Yuanwu LEI  Meng MA  Jie ZHOU  

     
    PAPER-Computer System

      Vol:
    E96-D No:12
      Page(s):
    2765-2775

    The effective bandwidth of the dynamic random-access memory (DRAM) for the alternate row-wise/column-wise matrix access (AR/CMA) mode, which is a basic characteristic in scientific and engineering applications, is very low. Therefore, we propose the window memory layout scheme (WMLS), which is a matrix layout scheme that does not require transposition, for AR/CMA applications. This scheme maps one row of a logical matrix into a rectangular memory window of the DRAM to balance the bandwidth of the row- and column-wise matrix access and to increase the DRAM IO bandwidth. The optimal window configuration is theoretically analyzed to minimize the total number of no-data-visit operations of the DRAM. Different WMLS implementationsare presented according to the memory structure of field-programmable gata array (FPGA), CPU, and GPU platforms. Experimental results show that the proposed WMLS can significantly improve DRAM bandwidth for AR/CMA applications. achieved speedup factors of 1.6× and 2.0× are achieved for the general-purpose CPU and GPU platforms, respectively. For the FPGA platform, the WMLS DRAM controller is custom. The maximum bandwidth for the AR/CMA mode reaches 5.94 GB/s, which is a 73.6% improvement compared with that of the traditional row-wise access mode. Finally, we apply WMLS scheme for Chirp Scaling SAR application, comparing with the traditional access approach, the maximum speedup factors of 4.73X, 1.33X and 1.56X can be achieved for FPGA, CPU and GPU platform, respectively.

  • Bitstream Protection in Dynamic Partial Reconfiguration Systems Using Authenticated Encryption

    Yohei HORI  Toshihiro KATASHITA  Hirofumi SAKANE  Kenji TODA  Akashi SATOH  

     
    PAPER-Computer System

      Vol:
    E96-D No:11
      Page(s):
    2333-2343

    Protecting the confidentiality and integrity of a configuration bitstream is essential for the dynamic partial reconfiguration (DPR) of field-programmable gate arrays (FPGAs). This is because erroneous or falsified bitstreams can cause fatal damage to FPGAs. In this paper, we present a high-speed and area-efficient bitstream protection scheme for DPR systems using the Advanced Encryption Standard with Galois/Counter Mode (AES-GCM), which is an authenticated encryption algorithm. Unlike many previous studies, our bitstream protection scheme also provides a mechanism for error recovery and tamper resistance against configuration block deletion, insertion, and disorder. The implementation and evaluation results show that our DPR scheme achieves a higher performance, in terms of speed and area, than previous methods.

  • Design a Fast CAM-Based Exact Pattern Matching System on FPGA and 0.18µm CMOS Process

    Duc-Hung LE  Katsumi INOUE  Cong-Kha PHAM  

     
    LETTER-VLSI Design Technology and CAD

      Vol:
    E96-A No:9
      Page(s):
    1883-1888

    A CAM-based matching system for fast exact pattern matching is implemented on a hardware system with FPGA and ASIC. The system has a simple structure, and does not employ any Central Processor Unit (CPU) as well as complicated computations. We take advantage of Content Addressable Memory (CAM) which has an ability of parallel multi-match mode for designing the system. The system is applied to fast pattern matching with various required search patterns without using search principles. In this paper, the authors present a CAM-based system for fast exact pattern matching on 2-D data.

  • Architecture of an Asynchronous FPGA for Handshake-Component-Based Design

    Yoshiya KOMATSU  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-Architecture

      Vol:
    E96-D No:8
      Page(s):
    1632-1644

    This paper presents a novel architecture of an asynchronous FPGA for handshake-component-based design. The handshake-component-based design is suitable for large-scale, complex asynchronous circuit because of its understandability. This paper proposes an area-efficient architecture of an FPGA that is suitable for handshake-component-based asynchronous circuit. Moreover, the Four-Phase Dual-Rail encoding is employed to construct circuits robust to delay variation because the data paths are programmable in FPGA. The FPGA based on the proposed architecture is implemented in a 65 nm process. Its evaluation results show that the proposed FPGA can implement handshake components efficiently.

  • FPGA Implementation of Human Detection by HOG Features with AdaBoost

    Keisuke DOHI  Kazuhiro NEGI  Yuichiro SHIBATA  Kiyoshi OGURI  

     
    PAPER-Application

      Vol:
    E96-D No:8
      Page(s):
    1676-1684

    We implement external memory-free deep pipelined FPGA implementation including HOG feature extraction and AdaBoost classification. To construct our design by compact FPGA, we introduce some simplifications of the algorithm and aggressive use of stream oriented architectures. We present comparison results between our simplified fixed-point scheme and an original floating-point scheme in terms of quality of results, and the results suggest the negative impact of the simplified scheme for hardware implementation is limited. We empirically show that, our system is able to detect human from 640480 VGA images at up to 112 FPS on a Xilinx Virtex-5 XC5VLX50 FPGA.

  • FPGA Design Framework Combined with Commercial VLSI CAD

    Qian ZHAO  Kazuki INOUE  Motoki AMAGASAKI  Masahiro IIDA  Morihiro KUGA  Toshinori SUEYOSHI  

     
    PAPER-Design Methodology

      Vol:
    E96-D No:8
      Page(s):
    1602-1612

    The most widely used open-source field programmable gate array (FPGA) placement and routing tool is the Versatile Packing, Placement and Routing (VPR) software developed at the University of Toronto, Canada. VPR calculates area and timing using target FPGA architecture and physical information. However, it cannot be used in FPGA IP design efficiently for two reasons. First, VPR cannot directly support most newly developed FPGA architectures, and modifying the C-coded VPR so that it can be used to evaluate a number of new architectures is time consuming. Second, the accuracy of the VPR performance results is inadequate for the evaluation of a complete FPGA IP in a design that targets the production of LSI. We propose an FPGA design framework that is focused on improving FPGA IP design efficiency. A novel FPGA routing tool is developed in this framework, namely the EasyRouter which uses the C# language. When an object-oriented programming method is used, there is less source code and it is easier to manage compared to VPR, thus shortening the development time. By using simple HDL code templates, EasyRouter can automatically generate the entire HDL code for a chip and the configuration bitstream. With these files, the FPGA IP can be evaluated with commercial VLSI CAD systems with high accuracy and reliability.

  • A Prototype System for Many-Core Architecture SMYLEref with FPGA Evaluation Boards

    Son-Truong NGUYEN  Masaaki KONDO  Tomoya HIRAO  Koji INOUE  

     
    PAPER-Architecture

      Vol:
    E96-D No:8
      Page(s):
    1645-1653

    Nowadays, the trend of developing micro-processor with hundreds of cores brings a promising prospect for embedded systems. Realizing a high performance and low power many-core processor is becoming a primary technical challenge. Generally, three major issues required to be resolved includes: 1) realizing efficient massively parallel processing, 2) reducing dynamic power consumption, and 3) improving software productivity. To deal with these issues, we propose a solution to use many low-performance but small and very low-power cores to obtain very high performance, and develop a referential many-core architecture and a program development environment. This paper introduces a many-core architecture named SMYLEref and its prototype system with off-the-shelf FPGA evaluation boards. The initial evaluation results of several SPLASH2 benchmark programs conducted on our developed 128-core platform are also presented and discussed in this paper.

  • Design for Delay Measurement Aimed at Detecting Small Delay Defects on Global Routing Resources in FPGA

    Kazuteru NAMBA  Nobuhide TAKASHINA  Hideo ITO  

     
    PAPER-Test and Verification

      Vol:
    E96-D No:8
      Page(s):
    1613-1623

    Small delay defects can cause serious issues such as very short lifetime in the recent VLSI devices. Delay measurement is useful to detect small delay defects in manufacturing testing. This paper presents a design for delay measurement to detect small delay defects on global routing resources, such as double, hex and long lines, in a Xilinx Virtex 4 based FPGA. This paper also shows a measurement method using the proposed design. The proposed measurement method is based on an existing one for SoC using delay value measurement circuit (DVMC). The proposed measurement modifies the construction of configurable logic blocks (CLBs) and utilizes an on-chip DVMC newly added. The number of configurations required by the proposed measurement is 60, which is comparable to that required by stuck-at fault testing for global routing resources in FPGAs. The area overhead is low for general FPGAs, in which the area of routing resources is much larger than that of the other elements such as CLBs. The area of every modified CLB is 7% larger than an original CLB, and the area of the on-chip DVMC is 22% as large as that of an original CLB. For recent FPGAs, we can estimate that the area overhead is approximately 2% or less of the FPGAs.

121-140hit(330hit)