The search functionality is under construction.

Keyword Search Result

[Keyword] FPGA(329hit)

101-120hit(329hit)

  • An FPGA Implementation for a Flexible-Length-Arithmetic Processor Employing the FDFM Processor Core Approach

    Tatsuya KAWAMOTO  Xin ZHOU  Jacir L. BORDIM  Yasuaki ITO  Koji NAKANO  

     
    PAPER-Architecture

      Pubricized:
    2016/08/24
      Vol:
    E99-D No:12
      Page(s):
    2901-2910

    Algorithms requiring fast manipulation of multiple-length numbers are usually implemented in hardware. However, hardware implementation, using HDL (Hardware Description Language) for instance, is a laborious task and the quality of the solution relies heavily on the designer expertise. The main contribution of this work is to present a flexible-length-arithmetic processor based on FDFM (Few DSP slices and Few Memory blocks) approach that supports arithmetic operations on multiple-length numbers using FPGAs (Field Programmable Gate Array). The proposed processor has been implement on the Xilinx Virtex-6 FPGA. Arithmetic instructions of the proposed processor architecture include addition, subtraction, and multiplication of integer numbers exceeding 64-bits. To reduce the burden of implementing algorithm directly on the FPGA, applications requiring multiple-length arithmetic operations are written in a C-like language and translated into a machine program. The machine program is then transferred and executed on the proposed architecture. A 2048-bit RSA encryption/decryption implementation has been used to assess the goodness of the proposed approach. Experimental results shows that the computing time, using the proposed architecture, of a 2048-bit RSA encryption takes only 2.2 times longer than a direct FPGA implementation. Furthermore, by employing multiple FDFM cores for the same task, the computing time reduces considerably.

  • Multi-User MIMO Channel Emulator with Automatic Channel Sounding Feedback

    Tran Thi Thao NGUYEN  Leonardo LANANTE  Yuhei NAGAO  Hiroshi OCHI  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E99-A No:11
      Page(s):
    1918-1927

    Wireless channel emulators are used for the performance evaluation of wireless systems when actual wireless environment test is infeasible. The main contribution of this paper is the design of a MU-MIMO channel emulator capable of sending channel feedback automatically to the access point from the generated channel coefficients after the programmable time duration. This function is used for MU beamforming features of IEEE 802.11ac. The second contribution is the low complexity design of MIMO channel emulator with a single path implementation for all MIMO channel taps. A single path design allows all elements of the MIMO channel matrix to use only one Gaussian noise generator, Doppler filter, spatial correlation channel and Rician fading emulator to minimize the hardware complexity. In addition, single path implementation allows the addition of the feedback channel output with only a few additional non-sequential elements which would otherwise double in a parallel implementation. To demonstrate the functionality of our MU-MIMO channel emulator, we present actual hardware emulator results of MU-BF receive signal constellation on oscilloscope.

  • The Reliability Analysis of the 1-out-of-2 System in Which Two Modules Do Mutual Cooperation in Recovery Mode

    Aromhack SAYSANASONGKHAM  Satoshi FUKUMOTO  

     
    LETTER-Reliability, Maintainability and Safety Analysis

      Vol:
    E99-A No:9
      Page(s):
    1730-1734

    In this research, we investigated the reliability of a 1-out-of-2 system with two-stage repair comprising hardware restoration and data reconstruction modes. Hardware restoration is normally independently executed by two modules. In contrast, we assumed that one of the modules could omit data reconstruction by replicating the data from the module during normal operation. In this 1-out-of-2 system, the two modules mutually cooperated in the recovery mode. As a first step, an evaluation model using Markov chains was constructed to derive a reliability measure: “unavailability in steady state.” Numerical examples confirmed that the reliability of the system was improved by the use of two cooperating modules. As the data reconstruction time increased, the gains in terms of system reliability also increased.

  • Accelerating SAT-Based Boolean Matching for Heterogeneous FPGAs Using One-Hot Encoding and CEGAR Technique

    Yusuke MATSUNAGA  

     
    PAPER

      Vol:
    E99-A No:7
      Page(s):
    1374-1380

    This paper describes two speed-up techniques for Boolean matching of LUT-based circuits. One is one-hot encoding technique for variables representing input assignments. Though it requires more variables than existing binary encoding technique, almost all added clauses using one-hot encoding are binary clauses, which are suitable for efficient Boolean constraint propagation. The other is CEGAR (counter example guided abstraction refinement) technique which reduces the CPU time significantly. With both techniques, we can solve Boolean matching problem with 9 input function in 20 milliseconds on average, which is faster than the existing algorithms more than one order of magnitude.

  • Interconnection-Delay and Clock-Skew Estimate Modelings for Floorplan-Driven High-Level Synthesis Targeting FPGA Designs

    Koichi FUJIWARA  Kazushi KAWAMURA  Masao YANAGISAWA  Nozomu TOGAWA  

     
    PAPER

      Vol:
    E99-A No:7
      Page(s):
    1294-1310

    Recently, high-level synthesis techniques for FPGA designs (FPGA-HLS techniques) are strongly required in various applications. Both interconnection delays and clock skews have a large impact on circuit performance implemented onto FPGA, which indicates the need for floorplan-driven FPGA-HLS algorithms considering them. To appropriately estimate interconnection delays and clock skews at HLS phase, a reasonable model to estimate them becomes essential. In this paper, we demonstrate several experiments to characterize interconnection delays and clock skews in FPGA and propose novel estimate models called “IDEF” and “CSEF”. In order to evaluate our models, we integrate them into a conventional floorplan-driven FPGA-HLS algorithm. Experimental results demonstrate that our algorithm can realize FPGA designs which reduce the latency by up to 22% compared with conventional approaches.

  • A Field Programmable Sequencer and Memory with Middle Grained Programmability Optimized for MCU Peripherals

    Yoshifumi KAWAMURA  Naoya OKADA  Yoshio MATSUDA  Tetsuya MATSUMURA  Hiroshi MAKINO  Kazutami ARIMOTO  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E99-A No:5
      Page(s):
    917-928

    A Field Programmable Sequencer and Memory (FPSM), which is a programmable unit exclusively optimized for peripherals on a micro controller unit, is proposed. The FPSM functions as not only the peripherals but also the standard built-in memory. The FPSM provides easier programmability with a smaller area overhead, especially when compared with the FPGA. The FPSM is implemented on the FPGA and the programmability and performance for basic peripherals such as the 8 bit counter and 8 bit accuracy Pulse Width Modulation are emulated on the FPGA. Furthermore, the FPSM core with a 4K bit SRAM is fabricated in 0.18µm 5 metal CMOS process technology. The FPSM is an half the area of FPGA, its power consumption is less than one-fifth.

  • FPGA Implementation of Various Elliptic Curve Pairings over Odd Characteristic Field with Non Supersingular Curves

    Yasuyuki NOGAMI  Hiroto KAGOTANI  Kengo IOKIBE  Hiroyuki MIYATAKE  Takashi NARITA  

     
    PAPER-Cryptography and cryptographic protocols

      Pubricized:
    2016/01/13
      Vol:
    E99-D No:4
      Page(s):
    805-815

    Pairing-based cryptography has realized a lot of innovative cryptographic applications such as attribute-based cryptography and semi homomorphic encryption. Pairing is a bilinear map constructed on a torsion group structure that is defined on a special class of elliptic curves, namely pairing-friendly curve. Pairing-friendly curves are roughly classified into supersingular and non supersingular curves. In these years, non supersingular pairing-friendly curves have been focused on from a security reason. Although non supersingular pairing-friendly curves have an ability to bridge various security levels with various parameter settings, most of software and hardware implementations tightly restrict them to achieve calculation efficiencies and avoid implementation difficulties. This paper shows an FPGA implementation that supports various parameter settings of pairings on non supersingular pairing-friendly curves for which Montgomery reduction, cyclic vector multiplication algorithm, projective coordinates, and Tate pairing have been combinatorially applied. Then, some experimental results with resource usages are shown.

  • Design and Evaluation of a Configurable Query Processing Hardware for Data Streams

    Yasin OGE  Masato YOSHIMI  Takefumi MIYOSHI  Hideyuki KAWASHIMA  Hidetsugu IRIE  Tsutomu YOSHINAGA  

     
    PAPER-Computer System

      Pubricized:
    2015/09/14
      Vol:
    E98-D No:12
      Page(s):
    2207-2217

    In this paper, we propose Configurable Query Processing Hardware (CQPH), an FPGA-based accelerator for continuous query processing over data streams. CQPH is a highly optimized and minimal-overhead execution engine designed to deliver real-time response for high-volume data streams. Unlike most of the other FPGA-based approaches, CQPH provides on-the-fly configurability for multiple queries with its own dynamic configuration mechanism. With a dedicated query compiler, SQL-like queries can be easily configured into CQPH at run time. CQPH supports continuous queries including selection, group-by operation and sliding-window aggregation with a large number of overlapping sliding windows. As a proof of concept, a prototype of CQPH is implemented on an FPGA platform for a case study. Evaluation results indicate that a given query can be configured within just a few microseconds, and the prototype implementation of CQPH can process over 150 million tuples per second with a latency of less than a microsecond. Results also indicate that CQPH provides linear scalability to increase its flexibility (i.e., on-the-fly configurability) without sacrificing performance (i.e., maximum allowable clock speed).

  • Inter-FPGA Routing for Partially Time-Multiplexing Inter-FPGA Signals on Multi-FPGA Systems with Various Topologies

    Masato INAGI  Yuichi NAKAMURA  Yasuhiro TAKASHIMA  Shin'ichi WAKABAYASHI  

     
    PAPER-Physical Level Design

      Vol:
    E98-A No:12
      Page(s):
    2572-2583

    Multi-FPGA systems, which consist of multiple FPGAs and a printed circuit board connecting them, are useful and important tools for prototyping large scale circuits, including SoCs. In this paper, we propose a method for optimizing inter-FPGA signal transmission to accelerate the system frequency of multi-FPGA prototyping systems and shorten prototyping time. Compared with the number of I/O pins of an FPGA, the number of I/O signals between FPGAs usually becomes very large. Thus, time-multiplexed I/Os are used to resolve the problem. On the other hand, they introduce large delays to inter-FPGA I/O signals, and much lower the system frequency. To reduce the degradation of the system frequency, we have proposed a method for optimally selecting signals to be time-multiplexed and signals not to be time-multiplexed. However, this method assumes that there exist physical connections (i.e., wires on the printed circuit board) between every pair of FPGAs, and cannot handle I/O signals between a pair of FPGAs that have no physical connections between them. Thus, in this paper, we propose a method for obtaining indirect inter-FPGA routes for such I/O signals, and then combine the indirect routing method and the time-multiplexed signal selection method to realize effective time-multiplexing of inter-FPGA I/O signals on systems with various topologies.

  • Ultrasmall: A Tiny Soft Processor Architecture with Multi-Bit Serial Datapaths for FPGAs

    Shinya TAKAMAEDA-YAMAZAKI  Hiroshi NAKATSUKA  Yuichiro TANAKA  Kenji KISE  

     
    PAPER-Architecture

      Pubricized:
    2015/09/15
      Vol:
    E98-D No:12
      Page(s):
    2150-2158

    Soft processors are widely used in FPGA-based embedded computing systems. For such purposes, efficiency in resource utilization is as important as high performance. This paper proposes Ultrasmall, a new soft processor architecture for FPGAs. Ultrasmall supports a subset of the MIPS-I instruction set architecture and employs an area efficient microarchitecture to reduce the use of FPGA resources. While supporting the original 32-bit ISA, Ultrasmall uses a 2-bit serial ALU for all of its operations. This approach significantly reduces the resource utilization instead of increasing the performance overheads. In addition to these device-independent optimizations, we applied several device-dependent optimizations for Xilinx Spartan-3E FPGAs using 4-input lookup tables (LUTs). Optimizations using specific primitives aggressively reduce the number of occupied slices. Our evaluation result shows that Ultrasmall occupies only 84% of the previous small soft processor. In addition to the utilized resource reduction, Ultrasmall achieves 2.9 times higher performance than the previous approach.

  • Data-Transfer-Aware Design of an FPGA-Based Heterogeneous Multicore Platform with Custom Accelerators

    Yasuhiro TAKEI  Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E98-A No:12
      Page(s):
    2658-2669

    For an FPGA-based heterogeneous multicore platform, we present the design methodology to reduce the total processing time considering data-transfer. The reconfigurability of recent FPGAs with hard CPU cores allows us to realize a single-chip heterogeneous processor optimized for a given application. The major problem in designing such heterogeneous processors is data-transfer between CPU cores and accelerator cores. The total processing time with data-transfers is modeled considering the overlap of computation time and data-transfer time, and optimal design parameters are searched for.

  • FPGA Hardware with Target-Reconfigurable Object Detector

    Yoshifumi YAZAWA  Tsutomu YOSHIMI  Teruyasu TSUZUKI  Tomomi DOHI  Yuji YAMAUCHI  Takayoshi YAMASHITA  Hironobu FUJIYOSHI  

     
    PAPER

      Pubricized:
    2015/06/22
      Vol:
    E98-D No:9
      Page(s):
    1637-1645

    Much effort has been applied to research on object detection by statistical learning methods in recent years, and the results of that work are expected to find use in fields such as ITS and security. Up to now, the research has included optimization of computational algorithms for real-time processing on hardware such as GPU's and FPGAs. Such optimization most often works only with particular parameters, which often forfeits the flexibility that comes with dynamic changing of the target object. We propose a hardware architecture for faster detection and flexible target reconfiguration while maintaining detection accuracy. Tests confirm operation in a practical time when implemented in an FPGA board.

  • TherWare: Thermal-Aware Placement and Routing Framework for 3D FPGAs with Location-Based Heat Balance

    Ya-Shih HUANG  Han-Yuan CHANG  Juinn-Dar HUANG  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E98-A No:8
      Page(s):
    1796-1805

    The emerging three-dimensional (3D) technology is considered as a promising solution for achieving better performance and easier heterogeneous integration. However, the thermal issue becomes exacerbated primarily due to larger power density and longer heat dissipation paths. The thermal issue would also be critical once FPGAs step into the 3D arena. In this article, we first construct a fine-grained thermal resistive model for 3D FPGAs. We show that merely reducing the total power consumption and/or minimizing the power density in vertical direction is not enough for a thermal-aware 3D FPGA backend (placement and routing) flow. Then, we propose our thermal-aware backend flow named TherWare considering location-based heat balance. In the placement stage, TherWare not only considers power distribution of logic tiles in both lateral and vertical directions but also minimizes the interconnect power. In the routing stage, TherWare concentrates on overall power minimization and evenness of power distribution at the same time. Experimental results show that TherWare can significantly reduce the maximum temperature, the maximum temperature gradient, and the temperature deviation only at the cost of a minor increase in delay and runtime as compared with present arts.

  • A Floorplan-Driven High-Level Synthesis Algorithm for Multiplexer Reduction Targeting FPGA Designs

    Koichi FUJIWARA  Kazushi KAWAMURA  Shin-ya ABE  Masao YANAGISAWA  Nozomu TOGAWA  

     
    PAPER

      Vol:
    E98-A No:7
      Page(s):
    1392-1405

    Recently, high-level synthesis (HLS) techniques for FPGA designs are required in various applications such as computerized stock tradings and reconfigurable network processings. In HLS for FPGA designs, we need to consider module floorplan and reduce multiplexer's cost concurrently. In this paper, we propose a floorplan-driven HLS algorithm for multiplexer reduction targeting FPGA designs. By utilizing distributed-register architectures called HDR, we can easily consider module floorplan in HLS. In order to reduce multiplexer's cost, we propose two novel binding methods called datapath-oriented scheduling/FU binding and datapath-oriented register binding. Experimental results demonstrate that our algorithm can realize FPGA designs which reduce the number of slices by up to 47% and latency by up to 22% compared with conventional approaches while the number of required control steps is almost the same.

  • Through Chip Interface Based Three-Dimensional FPGA Architecture Exploration

    Li-Chung HSU  Masato MOTOMURA  Yasuhiro TAKE  Tadahiro KURODA  

     
    PAPER

      Vol:
    E98-C No:4
      Page(s):
    288-297

    This paper presents work on integrating wireless 3-D interconnection interface, namely ThruChip Interface (TCI), in three-dimensional field-programmable gate array (3-D FPGA) exploration tool (TPR). TCI is an emerging 3-D IC integration solution because of its advantages over cost, flexibility, reliability, comparable performance, and energy dissipation in comparison to through-silicon-via (TSV). Since the communication bandwidth of TCI is much higher than FPGA internal logic signals, in order to fully utilize its bandwidth, the time-division multiplexing (TDM) scheme is adopted. The experimental results show 25% on average and 58% at maximum path delay reduction over 2-D FPGA when five layers are used in TCI based 3-D FPGA architecture. Although the performance of TCI based 3-D FPGA architecture is 8% below that of TSV based 3-D FPGA on average, TCI based architecture can reduce active area consumed by vertical communication channels by 42% on average in comparison to TSV based architecture and hence leads to better delay and area product.

  • A Memory-Based IPv6 Lookup Architecture Using Parallel Index Generation Units

    Hiroki NAKAHARA  Tsutomu SASAO  Munehiro MATSUURA  Hisashi IWAMOTO  Yasuhiro TERAO  

     
    PAPER-Architecture

      Pubricized:
    2014/11/19
      Vol:
    E98-D No:2
      Page(s):
    262-271

    In the era of IPv6, since the number of IPv6 addresses rapidly increases and the required speed is more than Giga lookups per second (GLPS), an area-efficient and high-speed IP lookup architecture is desired. This paper shows a parallel index generation unit (IGU) for memory-based IPv6 lookup architecture. To reduce the size of memory in the IGU, we use a linear transformation and a row-shift decomposition. A single-memory realization requires O(2l log k) memory size, where l denotes the length of prefix, while the realization using IGU requires O(kl) memory size, where k denotes the number of prefixes. In IPv6 prefix lookup, since l is at most 64 and k is about 340 K, the IGU drastically reduces the memory size. Also, to reduce the cost, we realize the parallel IGU by using both on-chip and off-chip memories. We show a design algorithm for the parallel IGU to store given off-chip and on-chip memories. The parallel IGU has a simple architecture and performs lookup by using complete pipelines those insert the pipeline registers in all the paths. We loaded more than 340 K IPv6 pseudo prefixes on the Xilinx Virtex 6 FPGA with off-chip DDRII+ Static RAMs (SRAMs). Its lookup speed is 1.100 giga lookups per second (GLPS) which is sufficient for the required speed for a next generation 400 Gbps link throughput. As for the normalized area and lookup speed, our implementation outperforms existing FPGA implementations.

  • Acceleration of the Fast Multipole Method on FPGA Devices

    Hitoshi UKAWA  Tetsu NARUMI  

     
    LETTER-Application

      Pubricized:
    2014/11/19
      Vol:
    E98-D No:2
      Page(s):
    309-312

    The fast multipole method (FMM) for N-body simulations is attracting much attention since it requires minimal communication between computing nodes. We implemented hardware pipelines specialized for the FMM on an FPGA device, the GRAPE-9. An N-body simulation with 1.6×107 particles ran 16 times faster than that on a CPU. Moreover the particle-to-particle stage of the FMM on the GRAPE-9 executed 2.5 times faster than on a GPU in a limited case.

  • Asynchronous Cellular Automaton Model of Spiral Ganglion Cell in the Mammalian Cochlea: Theoretical Analyses and FPGA Implementation

    Masato IZAWA  Hiroyuki TORIKAI  

     
    PAPER-Nonlinear Problems

      Vol:
    E98-A No:2
      Page(s):
    684-699

    The mammalian cochlear consists of highly nonlinear components: lymph (viscous fluid), a basilar membrane (vibrating membrane in the viscous fluid), outer hair cells (active dumpers for the basilar membrane), inner hair cells (neural transducers), and spiral ganglion cells (parallel spikes density modulators). In this paper, a novel spiral ganglion cell model, the dynamics of which is described by an asynchronous cellular automaton, is presented. It is shown that the model can reproduce typical nonlinear responses of the spiral ganglion cell in the mammalian cochlea, e.g., spontaneous spiking, parallel spike density modulation, and adaptation. Also, FPGA experiments validate reproductions of these nonlinear responses.

  • Fault-Tolerant FPGA: Architectures and Design for Programmable Logic Intellectual Property Core in SoC

    Motoki AMAGASAKI  Qian ZHAO  Masahiro IIDA  Morihiro KUGA  Toshinori SUEYOSHI  

     
    PAPER-Architecture

      Pubricized:
    2014/11/19
      Vol:
    E98-D No:2
      Page(s):
    252-261

    In this paper, we propose fault-tolerant field-programmable gate array (FPGA) architectures and their design framework for intellectual property (IP) cores in system-on-chip (SoC). Unlike discrete FPGAs, in which the integration scale can be made relatively large, programmable IP cores must correspond to arrays of various sizes. The key features of our architectures are a regular tile structure, spare modules and bypass wires for fault avoidance, and a configuration mechanism for single-cycle reconfiguration. In addition, we utilize routing tools, namely EasyRouter for proposed architecture. This tool can handle various array sizes corresponding to developed programmable IP cores. In this evaluation, we compared the performances of conventional FPGAs and the proposed fault-tolerant FPGA architectures. On average, our architectures have less than 1.82 times the area and 1.11 times the delay compared with traditional island-style FPGAs. At the same time, our FPGA shows a higher fault tolerant performance.

  • Network-Level FPGA Acceleration of Low Latency Market Data Feed Arbitration

    Stewart DENHOLM  Hiroaki INOUE  Takashi TAKENAKA  Tobias BECKER  Wayne LUK  

     
    PAPER-Application

      Pubricized:
    2014/11/19
      Vol:
    E98-D No:2
      Page(s):
    288-297

    Financial exchanges provide market data feeds to update their members about changes in the market. Feed messages are often used in time-critical automated trading applications, and two identical feeds (A and B feeds) are provided in order to reduce message loss. A key challenge is to support A/B line arbitration efficiently to compensate for missing packets, while offering flexibility for various operational modes such as prioritising for low latency or for high data reliability. This paper presents a reconfigurable acceleration approach for A/B arbitration operating at the network level, capable of supporting any messaging protocol. Two modes of operation are provided simultaneously: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods. We also present a model for message feed processing latencies that is useful for evaluating scalability in future applications. We outline a new low latency, high throughput architecture and demonstrate a cycle-accurate testing framework to measure the actual latency of packets within the FPGA. We implement and compare the performance of the NASDAQ TotalView-ITCH, OPRA and ARCA market data feed protocols using a Xilinx Virtex-6 FPGA. For high reliability messages we achieve latencies of 42ns for TotalView-ITCH and 36.75ns for OPRA and ARCA. 6ns and 5.25ns are obtained for low latency messages. The most resource intensive protocol, TotalView-ITCH, is also implemented in a Xilinx Virtex-5 FPGA within a network interface card; it is used to validate our approach with real market data. We offer latencies 10 times lower than an FPGA-based commercial design and 4.1 times lower than the hardware-accelerated IBM PowerEN processor, with throughputs more than double the required 10Gbps line rate.

101-120hit(329hit)