In this paper, we propose a method for designing asynchronous circuits with a bundled-data implementation on commercial Field Programmable Gate Arrays using placement constraints. The proposed method uses two types of placement constraints to reduce the number of delay adjustments needed to fix timing violations and to improve the performance of the bundled-data implementation. We also propose a floorplan algorithm to reduce the control-path delays specific to the bundled-data implementation. Using the proposed method, we could design asynchronous circuits whose performance is close to, and whose energy consumption is smaller than, that of their synchronous counterparts, with less delay adjustment.
Antoniette MONDIGO Tomohiro UENO Kentaro SANO Hiroyuki TAKIZAWA
Since the hardware resource of a single FPGA is limited, one idea to scale the performance of FPGA-based HPC applications is to expand the design space with multiple FPGAs. This paper presents a scalable architecture of a deeply pipelined stream computing platform, where available parallelism and inter-FPGA link characteristics are investigated to achieve a scaled performance. For a practical exploration of this vast design space, a performance model is presented and verified with the evaluation of a tsunami simulation application implemented on Intel Arria 10 FPGAs. Finally, scalability analysis is performed, where speedup is achieved when increasing the computing pipeline over multiple FPGAs while maintaining the problem size of computation. Performance is scaled with multiple FPGAs; however, performance degradation occurs with insufficient available bandwidth and large pipeline overhead brought by inadequate data stream size. Tsunami simulation results show that the highest scaled performance for 8 cascaded Arria 10 FPGAs is achieved with a single pipeline of 5 stream processing elements (SPEs), which obtained a scaled performance of 2.5 TFlops and a parallel efficiency of 98%, indicating the strong scalability of the multi-FPGA stream computing platform.
Tieyuan PAN Lian ZENG Yasuhiro TAKASHIMA Takahiro WATANABE
In this paper, we propose a fast Maximal Empty Rectangle (MER) enumeration algorithm for online task placement on reconfigurable Field-Programmable Gate Arrays (FPGAs). On the assumption that each task utilizes rectangle-shaped resources, the proposed algorithm manages the free space on FPGAs with an MER list. When assigning or removing a task, a series of MERs are selected and cut into segments according to the task and its assignment location. By processing these segments, the MER list can be updated quickly with low memory consumption. Having proved an upper bound on the number of MERs on the FPGA, we analyze both the time and space complexity of the proposed algorithm. The efficiency of the proposed algorithm is verified by experiments.
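To make the notion of an MER concrete, the following is a minimal brute-force sketch in Python. It is not the incremental, segment-based algorithm of the abstract; it only illustrates what the MER list contains for a given occupancy grid. The function name `maximal_empty_rectangles` and the grid encoding (1 = occupied, 0 = free) are assumptions made for illustration.

```python
def maximal_empty_rectangles(grid):
    """Enumerate MERs of a 0/1 occupancy grid by exhaustive search.

    Rectangles are (top, left, bottom, right) with inclusive bounds.
    Illustration only: cost grows very quickly with the grid size.
    """
    rows, cols = len(grid), len(grid[0])

    def inside(r0, c0, r1, c1):
        return 0 <= r0 and r1 < rows and 0 <= c0 and c1 < cols

    def empty(r0, c0, r1, c1):
        return all(grid[r][c] == 0
                   for r in range(r0, r1 + 1)
                   for c in range(c0, c1 + 1))

    result = []
    for r0 in range(rows):
        for c0 in range(cols):
            for r1 in range(r0, rows):
                for c1 in range(c0, cols):
                    if not empty(r0, c0, r1, c1):
                        continue
                    # An empty rectangle is maximal if it cannot grow by
                    # one cell in any of the four directions.
                    grown = [(r0 - 1, c0, r1, c1), (r0, c0 - 1, r1, c1),
                             (r0, c0, r1 + 1, c1), (r0, c0, r1, c1 + 1)]
                    if not any(inside(*g) and empty(*g) for g in grown):
                        result.append((r0, c0, r1, c1))
    return result


# A 3x3 device with one task occupying the center cell.
mers = maximal_empty_rectangles([[0, 0, 0],
                                 [0, 1, 0],
                                 [0, 0, 0]])
```

For this grid the MERs are the two outer rows and the two outer columns. An online placer would keep such a list up to date as tasks arrive and leave rather than recomputing it from scratch.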
Ya-Shih HUANG Han-Yuan CHANG Juinn-Dar HUANG
The emerging three-dimensional (3D) technology is considered a promising solution for achieving better performance and easier heterogeneous integration. However, the thermal issue becomes exacerbated primarily due to larger power density and longer heat dissipation paths. The thermal issue would also be critical once FPGAs step into the 3D arena. In this article, we first construct a fine-grained thermal resistive model for 3D FPGAs. We show that merely reducing the total power consumption and/or minimizing the power density in the vertical direction is not enough for a thermal-aware 3D FPGA backend (placement and routing) flow. Then, we propose our thermal-aware backend flow named TherWare, which considers location-based heat balance. In the placement stage, TherWare not only considers the power distribution of logic tiles in both lateral and vertical directions but also minimizes the interconnect power. In the routing stage, TherWare concentrates on overall power minimization and evenness of power distribution at the same time. Experimental results show that TherWare can significantly reduce the maximum temperature, the maximum temperature gradient, and the temperature deviation at the cost of only a minor increase in delay and runtime compared with existing approaches.
Cyclic Redundancy Check (CRC) is a well-known error detection scheme used to detect corruption of digital content in digital networks and storage devices. Since it is a compute-intensive process that adversely affects performance, hardware acceleration using FPGAs has been tried and satisfactory performance has been achieved. However, the recent expansion in the use of networks and storage systems requires various correction capabilities for various CRC standards. Traditional hardware designs based on the LFSR (Linear Feedback Shift Register) tend to have a fixed structure without such flexibility. Here, a fully adaptable CRC accelerator based on a table-based algorithm is proposed. The table-based algorithm is a flexible method commonly used in software implementations. It has rarely been implemented in hardware, since its operational speed is believed to be insufficient. However, by using a pipelined structure and making efficient use of the memory modules in FPGAs, the table-based fixed CRC accelerators were found to achieve better performance than the traditional implementation. Based on this implementation, a fully adaptable CRC accelerator that eliminates the need for many non-adaptable CRC implementations is proposed. The accelerator can process an arbitrary number of input data and generates the CRC for any known CRC standard, with a generator polynomial of up to 65 bits, at run time. Further, we modify the table generation algorithm to decrease its space complexity from O(nm) to O(n). On a Xilinx Virtex 6 LX550T board, the fully adaptable accelerators occupy between 1% and 2% of the area to produce a maximum of 289.8 Gbps at 283.1 MHz if BRAM is deployed, or between 1.6% and 14% of the area for 418 Gbps at 408.9 MHz if the tables are implemented in logic. The proposed architecture enables further expansion of throughput by increasing the number of input bits M processed at a time.
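As a point of reference for the table-based approach mentioned above, here is a minimal software sketch in Python. It shows the common byte-wise table-driven CRC, parameterized by the generator polynomial so that a different standard only needs a different table. It is not the pipelined hardware architecture of the abstract; the function names and the choice of the reflected CRC-32 polynomial 0xEDB88320 as the default are assumptions made for illustration.

```python
def make_table(poly_reflected=0xEDB88320):
    """Build the 256-entry lookup table for a reflected 32-bit CRC."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            # Shift out one bit; apply the polynomial if that bit was 1.
            crc = (crc >> 1) ^ poly_reflected if crc & 1 else crc >> 1
        table.append(crc)
    return table


def crc32_table(data, table, init=0xFFFFFFFF, xor_out=0xFFFFFFFF):
    """Process one input byte per table lookup."""
    crc = init
    for b in data:
        crc = (crc >> 8) ^ table[(crc ^ b) & 0xFF]
    return crc ^ xor_out


TABLE = make_table()
check = crc32_table(b"123456789", TABLE)  # standard CRC-32 check value 0xCBF43926
```

Switching to another 32-bit reflected CRC would only require rebuilding the table with a different polynomial, which is the flexibility the table-based method offers over a fixed LFSR structure.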
Lihong SHANG Mi ZHOU Yu HU Erfu YANG
Field programmable gate arrays (FPGAs) are widely used in reliability-critical systems due to their reconfiguration ability. However, with shrinking device feature sizes and increasing die area, today's FPGAs can be deeply affected by errors induced by electromigration and radiation. To improve the reliability of FPGA-based reconfigurable systems, a permanent fault recovery approach using a domain partition model is proposed in this paper. In the proposed approach, recovery from faults is realized by reloading a proper configuration from a pool of multiple alternative configurations with overlaps. The overlaps are represented as a set of vectors in the domain partition model. To enhance the reliability, a procedure is also presented in which the set of vectors is heuristically filtered so that the corresponding small overlaps can be merged into larger ones. Experimental results on several benchmark circuits demonstrate the effectiveness of the proposed approach. Compared with previous approaches, the proposed approach increased MTTF by up to 18.87%.
Masanori HARIYAMA Shota ISHIHARA Michitaka KAMEYAMA
This paper presents a novel asynchronous architecture for field-programmable gate arrays (FPGAs) that reduces power consumption. In the dynamic power consumption of conventional FPGAs, the power consumed by the switch blocks and clock distribution is dominant, since FPGAs have complex switch blocks and a large number of registers for high programmability. To reduce the power consumption of the switch blocks and clock distribution, an asynchronous bit-serial architecture is proposed. To ensure correct operation independent of data-path lengths, we use level-encoded dual-rail encoding and propose an area-efficient implementation of it. The proposed field-programmable VLSI (FPVLSI) is implemented in a 90 nm CMOS technology. The delay and the power consumption of the proposed FPVLSI are respectively 61% and 58% of those of 4-phase dual-rail encoding, which is the most common delay-insensitive encoding.
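The level-encoded dual-rail (LEDR) rule referred to above can be sketched in a few lines of Python. This shows only the generic encoding rule as it is commonly described, not the area-efficient circuit of the abstract. The rail names `v` (value) and `t` (timing), the initial rail state (0, 0), and the function names are assumptions made for illustration.

```python
def ledr_encode(bits):
    """Encode a bit sequence as LEDR codewords (v, t).

    v carries the data bit; the phase v XOR t alternates with every token,
    so exactly one rail toggles per token and no return-to-zero spacer
    is needed.
    """
    phase = 1  # rails are assumed to start at (0, 0), i.e. phase 0
    words = []
    for b in bits:
        words.append((b, b ^ phase))
        phase ^= 1
    return words


def ledr_decode(words):
    """Recover the data bits: the value rail holds the data directly."""
    return [v for v, _ in words]


data = [0, 0, 1, 1, 0]
encoded = ledr_encode(data)
states = [(0, 0)] + encoded
# Number of rails that change between consecutive states (always 1).
toggles = [(a[0] ^ b[0]) + (a[1] ^ b[1]) for a, b in zip(states, states[1:])]
```

Because each token changes a single rail, a receiver can detect a new token from the phase flip alone, which is what makes the scheme independent of data-path length.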
Mitsuru TOMONO Masaki NAKANISHI Shigeru YAMASHITA Kazuo NAKAJIMA Katsumasa WATANABE
In a partially reconfigurable FPGA of the future, arbitrary portions of its logic resources and interconnection networks will be reconfigured without affecting the other parts. Multiple tasks will be mapped and executed concurrently in such an FPGA. Efficient execution of the tasks using the limited resources of the FPGA will necessitate effective resource management. A number of online FPGA placement methods have recently been proposed for such an FPGA. However, they cannot handle I/O communications of the tasks. Taking such I/O communications into consideration, we introduce a new approach to online FPGA placement. We present an algorithm for placing each arriving task in an empty area so as to complete all the tasks efficiently. We develop two fitting strategies to effectively handle I/O communications of the tasks. Our experimental results show that properly weighted combinations of these and two other previously proposed strategies enable this algorithm to run very fast and make an effective placement of the tasks. In fact, we show that the overhead associated with the use of this algorithm is negligible as compared to the total execution time of the tasks.
Katsuhiko DEGAWA Takafumi AOKI Tatsuo HIGUCHI
This paper presents a Field-Programmable Digital Filter (FPDF) IC that employs carry-propagation-free redundant arithmetic algorithms for faster computation and multiple-valued current-mode circuit technology for high-density low-power implementation. The original contribution of this paper is to evaluate, through actual chip fabrication, the potential impact of multiple-valued current-mode circuit technology on the reduction of hardware complexity required for DSP-oriented programmable ICs. The prototype FPDF fabrication with 0.6 µm CMOS technology demonstrates that the chip area and power consumption can be reduced to 41% and 71%, respectively, compared with the standard binary logic implementation.
Jacir L. BORDIM Yasuaki ITO Koji NAKANO
The main contribution of this paper is to present an FPGA-based implementation of an instance-specific hardware which accelerates the CKY (Cocke-Kasami-Younger) parsing for context-free grammars. Given a context-free grammar G and a string x, the CKY parsing determines whether G derives x. We have developed a hardware generator that creates a Verilog HDL source to perform the CKY parsing for any given context-free grammar G. The generated source is embedded in an FPGA using the design software provided by the FPGA vendor. We evaluated the instance-specific hardware, generated by our hardware generator, using a timing analyzer and tested it using the Altera FPGAs. The generated hardware attains a speed-up factor of approximately 750 over the software CKY parsing algorithm.
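For reference, the software form of CKY parsing can be sketched as follows. This is a minimal Python version of the textbook algorithm for a grammar in Chomsky normal form, not the generated hardware of the abstract. The grammar representation (lists of unary and binary rules) and the example grammar for strings of the form a^n b^n are assumptions made for illustration.

```python
def cky(words, unary, binary, start="S"):
    """Return True if the CNF grammar derives the sequence `words`.

    unary:  list of (A, terminal) rules, meaning A -> terminal
    binary: list of (A, B, C) rules, meaning A -> B C
    """
    n = len(words)
    if n == 0:
        return False
    # table[i][l] holds the nonterminals that derive words[i:i+l].
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][1] = {A for A, t in unary if t == w}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for k in range(1, length):  # split point
                left, right = table[i][k], table[i + k][length - k]
                for A, B, C in binary:
                    if B in left and C in right:
                        table[i][length].add(A)
    return start in table[0][n]


# Example grammar: S -> A B | A X, X -> S B, A -> a, B -> b
UNARY = [("A", "a"), ("B", "b")]
BINARY = [("S", "A", "B"), ("S", "A", "X"), ("X", "S", "B")]
accepted = cky(list("aabb"), UNARY, BINARY)
rejected = cky(list("aab"), UNARY, BINARY)
```

The triple loop over length, start position, and split point is the part that an instance-specific circuit can parallelize, since the rule checks for a fixed grammar are known in advance.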
In this paper, we deal with the problem of compatibility class encoding, and propose a novel algorithm for finding a good functional decomposition with application to LUT-based FPGA synthesis. Based on exploration of the design space, we concentrate on extracting a set of components that can be merged into the minimum number of multiple-output CLBs or LUTs, such that the decomposition constructed from these components is also minimal. In particular, to explore more degrees of freedom, we introduce pliable encoding to replace the conventional rigid encoding when the latter fails to find a satisfactory decomposition. Experimental results on a large set of MCNC91 logic synthesis benchmarks show that our method is quite promising.
This paper presents a new approach to the simulation of Dynamically Reconfigurable Logic (DRL) systems, which models dynamic reconfiguration more accurately than previously reported techniques. Our method, named Clock Morphing (CM), is based on modelling dynamic reconfiguration via the clock signal of the reconfigured module, while using a dedicated signal value to indicate dynamic reconfiguration. We discuss problems associated with the other approaches to DRL simulation and describe the main principles behind the proposed technique. We further demonstrate the feasibility of CM-based DRL simulation with an example implementation in VHDL.
Takafumi AOKI Yoshiki SAWADA Tatsuo HIGUCHI
This paper presents a new number representation called the Signed-Weight (SW) number system, which is useful for designing configurable counter-tree architectures for digital signal processing applications. The SW number system allows the unified manipulation of positive and negative numbers in arithmetic circuits by adjusting the signs assigned to individual digit positions. This makes possible the construction of highly regular arithmetic circuits without introducing irregular arithmetic operations, such as negation and sign extension in the two's complement representation. This paper also presents the design of a Field-Programmable Digital Filter (FPDF) architecture--a special-purpose FPGA architecture for high-speed FIR filtering--using the proposed SW arithmetic system.
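The idea of assigning a sign to each digit position can be illustrated with a short Python sketch. It assumes, based only on the description above, that an SW number is a string of binary digits in which position i carries weight +2^i or -2^i; the function name `sw_value` and the LSB-first digit order are hypothetical and chosen for illustration.

```python
def sw_value(digits, signs):
    """Evaluate a signed-weight number.

    digits: binary digits, least significant first
    signs:  +1 or -1 per position, the sign of that position's weight
    """
    if len(digits) != len(signs):
        raise ValueError("digits and signs must have the same length")
    return sum(s * d * (1 << i) for i, (d, s) in enumerate(zip(digits, signs)))


# 4-bit two's complement is the special case where only the most
# significant position has a negative weight: 1101 -> 1 + 4 - 8.
twos_complement_signs = [+1, +1, +1, -1]
value = sw_value([1, 0, 1, 1], twos_complement_signs)

# Flipping every sign negates the value without changing any digit.
negated = sw_value([1, 0, 1, 1], [-s for s in twos_complement_signs])
```

Under this assumption, negation becomes a change of the sign assignment rather than an arithmetic operation on the digits, which matches the abstract's point about avoiding negation and sign extension.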
Takenori KOUDA Shigeru YAMASHITA Yahiko KAMBAYASHI
In this paper, we discuss circuit minimization techniques based on the multiple-output capability of FPGA blocks. Since previous methods only consider two independent output functions, we discuss the more complicated case in which the two functions are mutually related. We also discuss a method to maximize the flexibility of a specified cell output in a given FPGA block. If the set of possible functions for a cell that do not change the FPGA output function is large, we say that the flexibility of this cell is high. The concept of Sets of Pairs of Functions to be Distinguished (SPFDs) introduced by Yamashita et al. is a powerful tool for minimizing a given FPGA circuit. In this paper, an extension of the concept, Priority-based SPFDs (PSPFDs), is introduced to maximize the flexibility of the output functions realized by such internal cells. By using PSPFDs in our new method, we can exploit the multiple-output capability effectively. Combining PSPFDs with the previous methods is also shown to be important. We have implemented these methods and applied them to MCNC benchmarks mapped into 5-variable function blocks. To make a comparison with other methods, we have implemented methods using well-known merging algorithms utilizing the same multiple-output capability. Experimental results show that our methods can reduce the number of blocks in the initial circuits by 40% on average. This reduction ratio is 16% higher than that of previous methods.
Kouichi NAGAMI Kiyoshi OGURI Tsunemichi SHIOZAWA Hideyuki ITO Ryusuke KONISHI
We propose an architectural reference for programmable devices that we call Plastic Cell Architecture (PCA). PCA is a reference for implementing a device with autonomous reconfigurability, which we also introduce in this paper. This reconfigurability is a further step toward new reconfigurable computing, which introduces variable- and programmable-grained parallelism to wired logic computing. This computing follows the Object-Oriented paradigm: it regards configured circuits as objects. These objects will be described in a new hardware description language dealing with the semantics of dynamic module instantiation. PCA is the fusion of SRAM-based FPGAs and cellular automata (CA), where the CA are dedicated to supporting run-time activities of objects. This paper mainly focuses on autonomous reconfigurability and PCA. The discussion that follows examines a research direction towards general-purpose reconfigurable computing.
Kenneth Carless SMITH P.Glenn GULAK
The evolution of Multiple-Valued Logic (MVL) circuits has been inexorably tied to the rapid technological changes induced by evolving needs and emerging developments in computing methodologies. Unfortunately for MVL, the number of designers of technologies and circuits whose lives are dedicated to the improvement of binary techniques is large and overwhelming. Correspondingly, technological developments in MVL typically await the appearance of a problem or technique in the larger binary world to motivate and/or make possible some new advance. Such opportunities are inevitably quite transient since each such problem is simultaneously attacked by many others of a more conventional bent, and, as well, each technological change begets yet another, quickly. It is in the sensing of this reality that the present paper is written. Correspondingly, its thrust is two-fold: One target is the possibility of encouraging a leap ahead through modest technological projection. The other is the possibility of identifying application areas that already exist in this unbalanced competition, but which are specially suited to multiple-valued solutions. For example, it has been clear for decades that one such area is that of arithmetic. Correspondingly, we in MVL must strive quickly to concentrate our efforts on applications that exploit such demonstrable strengths. Some such applications are included here; others are visible historically; many probably remain to be found: Search on!