Kazuya KATSUKI Manabu KOTANI Kazutoshi KOBAYASHI Hidetoshi ONODERA
In this paper, we show that speed and yield of reconfigurable devices can be enhanced by utilizing within-die (WID) delay variations. An LUT Array LSI is fabricated to confirm whether FPGAs have clear WID variations to be utilized. We can measure delay variations by counting the number of LUTs a signal propagates within a certain time. Clear die-to-die (D2D) and WID variations are observed. We propose a variation model from the measurement results. Adequacy of the model is discussed from randomness of the random component. Effect of the speed and yield enhancement is confirmed using the proposed model. Yield increases from 80.0% to 100.0% by optimizing configurations.
In this paper, we present an efficient architecture for connected word recognition that can be implemented on a field-programmable gate array (FPGA). The architecture consists of a newly derived two-level dynamic programming (TLDP) scheme that uses only bit addition and shift operations. Its advantages are spatial efficiency, which accommodates more words in limited space, and the absence of multiplications, which increases computational speed by reducing propagation delays. The architecture is highly regular, consisting of identical, simple processing elements with only nearest-neighbor communication; external communication occurs only at the end processing elements. To verify the proposed architecture, we designed and implemented it, prototyping with Xilinx FPGAs running at 33 MHz.
Dang Hai PHAM Takanobu TABATA Hirokazu ASATO Satoshi HORI Tomohisa WADA
In this paper, an adaptive array antenna is implemented to enhance the performance of digital TV ISDB-T reception. Issues in realizing the proposed array antenna and its implementation as a joint hardware-software solution are also presented. Instead of using known reference signals, the proposed method utilizes the GI (Guard Interval) and the periodic property of the OFDM signal as a constraint to realize MRC (Maximum Ratio Combining) and SMI (Sample Matrix Inversion) adaptive beam-forming algorithms. Experimental results show that the proposed system drastically improves the quality of reception. Moreover, the proposed system achieves excellent performance under strong interference.
Yuichi NAKAMURA Kouhei HOSOKAWA
This paper describes a new method for building a simulation environment for a custom processor. It is generally very hard to develop an accurate simulator for a custom processor rapidly, even a simple instruction-set-level simulator (ISS). The proposed method uses a field-programmable-gate-array emulator with a PCI interface and debugging GUI software on a PC. Since the emulator implements the processor design at the register-transfer or net-list level, the emulation results are almost the same as the results obtained with the actual processor. To support rich debugging functions like those provided by a conventional software simulator, we use a debugging buffer and break-control circuits. Experimental results show that a simulator can be constructed by the proposed method within several hours and that it can break the processor operation at any specified point and observe the internal signals while the emulated system is running at 1-30 MHz. The accuracy of the constructed simulator is the same as that of RTL simulation and much higher than that of software ISS simulation. We show that we can provide a fast, accurate, and useful simulator for any processor design specified at the register-transfer level.
Shinobu NAGAYAMA Tsutomu SASAO Jon T. BUTLER
This paper presents an architecture and a synthesis method for compact numerical function generators (NFGs) for trigonometric, logarithmic, square root, and reciprocal functions, and combinations of these. Our NFG partitions the given domain of the function into non-uniform segments using an LUT cascade, and approximates the function by a quadratic polynomial on each segment. This allows fast and compact NFGs to be implemented for a wide range of functions. Experimental results show that: 1) our NFGs require, on average, only 4% of the memory needed by NFGs based on linear approximation with non-uniform segmentation; 2) our NFG for 2^x-1 requires only 22% of the memory needed by the NFG based on a 5th-order approximation with uniform segmentation; and 3) our NFGs achieve about 70% of the throughput of the existing table-based NFGs using only a few percent of the memory. Thus, our NFGs can be implemented on more compact FPGAs than the existing NFGs require. Our automatic synthesis system generates such compact NFGs quickly.
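The idea of non-uniform segmentation with a quadratic polynomial per segment can be illustrated in software. The following is a minimal sketch, not the authors' synthesis method: it assumes recursive bisection as the segmentation rule, a quadratic interpolating each segment at its endpoints and midpoint, and a linear search in place of the LUT cascade that selects the segment in hardware. All function names are hypothetical.

```python
import math

def quad_through(f, a, b):
    """Quadratic c0 + c1*t + c2*t^2 (t = x - a) interpolating f at a, the midpoint, and b."""
    h = b - a
    fa, fm, fb = f(a), f((a + b) / 2), f(b)
    c2 = 2 * (fa - 2 * fm + fb) / (h * h)
    c1 = (-3 * fa + 4 * fm - fb) / h
    return fa, c1, c2

def max_error(f, a, b, coeffs, samples=33):
    """Largest sampled absolute error of the quadratic on [a, b]."""
    c0, c1, c2 = coeffs
    worst = 0.0
    for i in range(samples):
        t = (b - a) * i / (samples - 1)
        worst = max(worst, abs(f(a + t) - (c0 + t * (c1 + t * c2))))
    return worst

def segment(f, a, b, tol):
    """Split [a, b] by bisection until each segment meets the tolerance."""
    coeffs = quad_through(f, a, b)
    if max_error(f, a, b, coeffs) <= tol:
        return [(a, b, coeffs)]
    m = (a + b) / 2
    return segment(f, a, m, tol) + segment(f, m, b, tol)

def evaluate(table, x):
    """Find the segment containing x and evaluate its quadratic (Horner form)."""
    for a, b, (c0, c1, c2) in table:
        if a <= x <= b:
            t = x - a
            return c0 + t * (c1 + t * c2)
    raise ValueError("x outside the approximated domain")

# Square root on [1/64, 1]: segments come out narrow near the steep left end.
table = segment(math.sqrt, 1 / 64, 1.0, 1e-5)
widths = sorted({round(b - a, 12) for a, b, _ in table})
print(len(table), "segments; narrowest", widths[0], "widest", widths[-1])
```

Because the curvature of the square root changes sharply near the left end of the domain, the bisection produces narrow segments there and wide ones elsewhere, which is the effect non-uniform segmentation exploits to save table memory.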
Akihisa YOKOYAMA Hiroshi HARADA
We previously proposed an architecture for software defined radio called the reconfigurable packet routing-oriented signal processing platform (RPPP). This architecture is suited to wireless signal processing applications, which require radio functions to be selected in real time depending on the transmitted signal. A number of radio standards are used in DSRC systems for vehicle communication, and vehicle equipment is required to transmit and receive the radio signals used on each particular occasion. This paper describes an implementation of RPPP that enables the dynamic handling of two ARIB standards for DSRC. After an explanation of the basic architecture and an analysis of RPPP, the implementation of a reconfigurable DSRC transceiver for ASK and π/4 shift-QPSK is described. The implementation is then discussed and evaluated in terms of the number of logic units needed. We conclude that our platform utilizes logic 27.6% more efficiently than a fixed design.
Mitsuru TOMONO Masaki NAKANISHI Shigeru YAMASHITA Kazuo NAKAJIMA Katsumasa WATANABE
In a partially reconfigurable FPGA of the future, arbitrary portions of its logic resources and interconnection networks will be reconfigured without affecting the other parts. Multiple tasks will be mapped and executed concurrently in such an FPGA. Efficient execution of the tasks using the limited resources of the FPGA will necessitate effective resource management. A number of online FPGA placement methods have recently been proposed for such an FPGA. However, they cannot handle I/O communications of the tasks. Taking such I/O communications into consideration, we introduce a new approach to online FPGA placement. We present an algorithm for placing each arriving task in an empty area so as to complete all the tasks efficiently. We develop two fitting strategies to effectively handle I/O communications of the tasks. Our experimental results show that properly weighted combinations of these and two other previously proposed strategies enable this algorithm to run very fast and make an effective placement of the tasks. In fact, we show that the overhead associated with the use of this algorithm is negligible as compared to the total execution time of the tasks.
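To make the online placement problem concrete, the sketch below models the FPGA as a grid of logic cells and places each arriving rectangular task in the first empty area that fits. It is a generic first-fit baseline under assumed names (`place`, `release`, a 0/1-style occupancy grid); it does not implement the I/O-aware fitting strategies or the weighted combinations proposed in the paper.

```python
def find_first_fit(grid, h, w):
    """Return the top-left (row, col) of the first free h-by-w area, or None."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows - h + 1):
        for c in range(cols - w + 1):
            if all(not grid[r + i][c + j] for i in range(h) for j in range(w)):
                return r, c
    return None

def place(grid, task_id, h, w):
    """Mark an h-by-w area as occupied by task_id; return its position or None."""
    pos = find_first_fit(grid, h, w)
    if pos is None:
        return None  # the task must wait until enough area is released
    r, c = pos
    for i in range(h):
        for j in range(w):
            grid[r + i][c + j] = task_id
    return pos

def release(grid, task_id):
    """Free every cell held by a finished task."""
    for row in grid:
        for j, owner in enumerate(row):
            if owner == task_id:
                row[j] = 0

# A 4x6 device: tasks arrive, one is rejected, one finishes and frees its area.
grid = [[0] * 6 for _ in range(4)]
p1 = place(grid, 1, 2, 3)
p2 = place(grid, 2, 2, 3)
p3 = place(grid, 3, 3, 4)   # does not fit: only two free rows remain
p4 = place(grid, 3, 2, 4)
release(grid, 1)
p5 = place(grid, 4, 2, 2)   # reuses the area freed by task 1
```

A fitting strategy in the sense of the abstract would replace the "first free area" rule with a scoring function over candidate positions, for example one that accounts for the distance to I/O resources.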
Masanori HARIYAMA Sho OGATA Michitaka KAMEYAMA
Multi-context FPGAs (MC-FPGAs) have multiple memory bits per configuration bit, forming configuration planes for fast switching between contexts. The additional memory planes cause a large area overhead when many contexts are used. To overcome the overhead, a fine-grained MC-FPGA architecture using a floating-gate-MOS functional pass gate (FGFP) is presented, which merges the threshold operation and the storage function on a single floating-gate MOS transistor. The test chip is designed using a 0.35 µm CMOS-EPROM technology. The transistor count of the proposed multi-context switch (MC-switch) is reduced to 13% of that of an SRAM-based one. The total area of the proposed MC-FPGA is reduced to about 56% of that of a conventional SRAM-based MC-FPGA.
Hongge LI Yoshihiro HAYAKAWA Shigeo SATO Koji NAKAJIMA
In this paper, the authors present a new digital neuron circuit implemented on a field programmable gate array (FPGA). A new Inverse function Delayed (ID) neuron model is implemented. The ID model, which includes the BVP model, has superior associative properties thanks to negative resistance. An associative memory based on the ID model with self-connections can potentially improve basin size and memory capacity. In order to decrease circuit area, we employ stochastic logic. The proposed neuron circuit completes the stimulus response output, and its retrieval property with negative resistance is superior to that of a conventional nonlinear model in terms of the basin size of an associative memory.
Ryoichiro ATONO Shuichi ICHIKAWA
If a logic circuit is specialized to a specific input, the derived circuit can be faster and smaller than the original. This study presents various designs of a key-specific AES encryption circuit. In our iterative design, logic gates were reduced by 41% and RAM by 20%, while performance improved by 24%. In our pipelined design, logic gates were reduced by 54% and RAM by 20%, while performance improved by 74%. Results for DES encryption circuits are also presented for comparison.
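Specializing a circuit to a fixed key is a form of partial evaluation: any logic that depends only on the key can be folded into constants. The toy sketch below illustrates this on a single byte-wide stage, assuming a key XOR followed by a substitution table. The table is a hypothetical stand-in, not the AES S-box, and the functions are illustrative only; they do not reproduce the designs evaluated in the study.

```python
# Toy 8-bit substitution table (a permutation of 0..255); NOT the AES S-box.
SBOX = [(7 * x + 3) % 256 for x in range(256)]

def generic_stage(x, key_byte):
    """General stage: key mixing (XOR) followed by a table lookup."""
    return SBOX[x ^ key_byte]

def specialize(key_byte):
    """Fold a fixed key byte into the table, removing the XOR stage entirely."""
    table = [SBOX[x ^ key_byte] for x in range(256)]

    def specialized_stage(x):
        return table[x]  # one lookup, no key input

    return specialized_stage

stage_5a = specialize(0x5A)
print(stage_5a(0x10) == generic_stage(0x10, 0x5A))
```

The specialized stage needs no key input and no XOR logic, at the cost of having to be regenerated whenever the key changes, which is the trade-off a reconfigurable device makes acceptable.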
Multi-context FPGAs allow very quick reconfiguration by storing multiple sets of configuration data at the same time. While testing of FPGAs with single-context memories has already been studied by many researchers, testing of multi-context FPGAs has not yet been proposed. This paper presents an architecture for testable multi-context FPGAs. In the proposed multi-context FPGA, configuration data stored in one context can be copied into another context. This paper also presents a test method for the proposed multi-context FPGA, which reuses the tests for traditional single-context FPGAs. The method is capable of detecting single stuck-at faults and single open faults that affect normal operation. The number of test configurations is at most two more than that for testing FPGAs with single-context memories. The area overhead of the proposed architecture is 7% and 4% of the area of a multi-context FPGA without the proposed architecture when the number of contexts in a configuration memory is 8 and 16, respectively.
Kazunori SHIMIZU Tatsuyuki ISHIKAWA Nozomu TOGAWA Takeshi IKENAGA Satoshi GOTO
In this paper, we propose a partially-parallel LDPC decoder which achieves a high-efficiency message-passing schedule. The proposed LDPC decoder is characterized as follows: (i) The column operations follow the row operations in a pipelined architecture to ensure that the row and column operations are performed concurrently. (ii) The proposed parallel pipelined bit functional unit enables the column operation module to compute every message in each bit node which is updated by the row operations. These column operations can be performed without extending the single iterative decoding delay when the row and column operations are performed concurrently. Therefore, the proposed decoder performs the column operations more frequently in a single iterative decoding, and achieves a high-efficiency message-passing schedule within the limited decoding delay time. Hardware implementation on an FPGA and simulation results show that the proposed partially-parallel LDPC decoder improves the decoding throughput and bit error performance with a small hardware overhead.
Canh Quang TRAN Hiroshi KAWAGUCHI Takayasu SAKURAI
A low-power FPGA design approach is proposed based on a fine-grain VDD control scheme called micro-VDD-hopping. Four configurable logic blocks (CLBs) are grouped into one block in which VDD is shared. In the micro-VDD-hopping scheme, VDD in each block is switched between VDDH (high VDD) and VDDL (low VDD) spatially and temporally in order to achieve lower power without degrading performance. A low-power level shifter with less contention is also proposed for low-swing inter-block signals. The FPGA incorporates the Zigzag power-gating scheme, in which special care has been taken to cope with a sneak leakage-path problem. A test chip was fabricated using a 0.35-µm CMOS technology, together with a conventional fixed-VDD FPGA for comparison. Measurement results show that dynamic power in the proposed scheme can be reduced by 86% when the operating frequency is half of the maximum. Simulation using a 90-nm CMOS technology shows that leakage power can be reduced by 97% when the proposed method is used. The area overhead of the proposed FPGA is 2%.
Hui QIN Tsutomu SASAO Yukihiro IGUCHI
This paper presents a pipelined partial rolling (PPR) architecture for AES encryption. With the proposed architecture on the Altera Stratix FPGA, two PPR implementations achieve throughputs of 6.45 Gbps and 12.78 Gbps, respectively. Compared with the unrolling implementation, which achieves a throughput of 22.75 Gbps on the same FPGA, the two PPR implementations improve the memory efficiency (i.e., throughput divided by the size of memory for the core) by 13.4% and 12.3%, respectively, and reduce the amount of memory by 75% and 50%, respectively. The PPR implementation also achieves up to 9.83% higher memory efficiency than the fastest previous FPGA implementation known to date. In terms of resource efficiency (i.e., throughput divided by the equivalent logic elements or slices), one PPR implementation is almost the same as the rolling implementation, and the other lies between the rolling implementation and the unrolling implementation, which has the highest resource efficiency. However, the two PPR implementations can be implemented on the minimum-sized Stratix FPGA, while the unrolling implementation cannot. The PPR architecture fills the gap between unrolling and rolling architectures and is suitable for small and medium-sized FPGAs.
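The memory-efficiency figures follow from the definition given in the abstract (throughput divided by memory size). The short check below recomputes them from the quoted throughputs and memory reductions, normalizing the unrolling implementation's memory to 1; the function name is hypothetical, and small deviations from the published percentages are expected because the inputs are rounded.

```python
def memory_efficiency_gain(throughput_gbps, memory_fraction, base_throughput_gbps):
    """Relative gain in (throughput / memory) over a baseline whose memory is 1.0."""
    efficiency = throughput_gbps / memory_fraction
    baseline = base_throughput_gbps / 1.0
    return efficiency / baseline - 1.0

UNROLLED_GBPS = 22.75

# PPR implementation with 75% less memory (0.25 of the baseline).
gain_small = memory_efficiency_gain(6.45, 0.25, UNROLLED_GBPS)
# PPR implementation with 50% less memory (0.50 of the baseline).
gain_large = memory_efficiency_gain(12.78, 0.50, UNROLLED_GBPS)

print(f"{gain_small:.2%}  {gain_large:.2%}")
```

Both results land within a fraction of a percentage point of the reported 13.4% and 12.3%, which shows that each PPR design gives up throughput roughly in proportion to the memory it saves while coming out slightly ahead.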
Toshiaki KOIKE Yukinaga SEKI Hidekazu MURATA Susumu YOSHIDA Kiyomichi ARAKI
We developed two types of practical maximum-likelihood detectors (MLD) for multiple-input multiple-output (MIMO) systems, using a field programmable gate array (FPGA) device. For the implementations, we introduced two simplified metrics called the Manhattan metric and the correlation metric. Using the Manhattan metric, the detector needs no multiplication operations, at the cost of a slight performance degradation within 1 dB. Using the correlation metric, the MIMO-MLD significantly reduces the complexity in both multiplications and additions without any performance degradation. This paper demonstrates the bit-error-rate performance of these MLD prototypes at a real-time processing speed on the order of 1 Gbps, using an all-digital baseband 4×4 MIMO testbed integrated on the same FPGA chip.
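The difference between a squared-Euclidean metric and a multiplication-free metric can be sketched with an exhaustive detector. The example below assumes that the Manhattan metric is the sum of absolute real and imaginary parts of the residual, and it uses a 2×2 QPSK system for brevity; the abstract does not give the exact definitions, so treat the function names and the metric formula as assumptions. The channel product is computed with ordinary multiplications here, whereas with QPSK symbols hardware can form it from sign changes and additions.

```python
from itertools import product

QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]

def channel_output(H, s):
    """Noise-free received vector H @ s for a candidate symbol vector s."""
    return [sum(h * x for h, x in zip(row, s)) for row in H]

def euclidean_metric(residual):
    """Squared Euclidean distance; needs a squaring per component."""
    return sum(v.real * v.real + v.imag * v.imag for v in residual)

def manhattan_metric(residual):
    """Sum of absolute real and imaginary parts; additions only."""
    return sum(abs(v.real) + abs(v.imag) for v in residual)

def mld(H, y, metric):
    """Exhaustive maximum-likelihood search over all candidate symbol vectors."""
    n_tx = len(H[0])
    best_score, best_s = None, None
    for s in product(QPSK, repeat=n_tx):
        residual = [a - b for a, b in zip(y, channel_output(H, s))]
        score = metric(residual)
        if best_score is None or score < best_score:
            best_score, best_s = score, s
    return best_s

H = [[1 + 0.5j, 0.3 - 0.2j],
     [0.2 + 0.1j, 0.9 - 0.4j]]
sent = (1 - 1j, -1 + 1j)
y = channel_output(H, sent)  # noise-free reception for illustration

detected_euclidean = mld(H, y, euclidean_metric)
detected_manhattan = mld(H, y, manhattan_metric)
```

Without noise both metrics recover the transmitted vector. With noise the two can occasionally rank candidates differently, which is the source of the small performance loss the abstract attributes to the Manhattan metric.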
FPGAs (Field-Programmable Gate Arrays) have been widely used as coprocessors to boost the performance of data-intensive applications [1],[2]. However, there are several challenges to further boost FPGA performance: the communication overhead between the host workstation and the FPGAs can be substantial; large-scale applications cannot fit in a single FPGA because of its limited capacity; mapping an application algorithm to FPGAs still remains a daunting job in configurable system design. To circumvent these problems, we propose in this paper the FPGA-based Hierarchical-SIMD (H-SIMD) machine with its codesign of the Pyramidal Instruction Set Architecture (PISA). PISA comprises high-level instructions implemented as FPGA functions of coarse-grain SIMD (Single-Instruction, Multiple-Data) tasks to facilitate ease of program development, code portability across different H-SIMD implementations and high performance. We assume a multi-FPGA board where each FPGA is configured as a separate SIMD machine. Multiple FPGA chips can work in unison at a higher SIMD level, if needed, controlled by the host. Additionally, by using a memory switching scheme and the high-level PISA to partition applications into coarse-grain tasks, host-FPGA communication overheads can be hidden. We enlist the two-dimensional Fast Fourier Transform (2D FFT) to test the effectiveness of H-SIMD. The test results show sustained high performance for this problem. The H-SIMD machine even outperforms a Xeon processor for this problem.
Weisheng CHONG Masanori HARIYAMA Michitaka KAMEYAMA
A low-power field-programmable VLSI (FPVLSI) is presented to overcome the problem of large power consumption in field-programmable gate arrays (FPGAs). To reduce power consumption in routing networks, the FPVLSI consists of cells that are based on a bit-serial pipeline architecture which reduces routing block complexity. Moreover, a level-converter-less multiple-supply-voltage scheme using dynamic circuits is proposed, where the cells in non-critical paths use a low supply voltage for low power under a speed constraint. The FPVLSI is evaluated based on a 0.18-µm CMOS design rule. The power consumption of the FPVLSI using multiple supply voltages is reduced to 17% or less compared to that of the static-circuit-based FPVLSI using multiple supply voltages.
Masanori HARIYAMA Yasuhiro KOBAYASHI Haruka SASAKI Michitaka KAMEYAMA
This paper presents a processor architecture for high-speed and reliable stereo matching based on adaptive window-size control of SAD (Sum of Absolute Differences) computation. To reduce computational complexity, SADs are computed using images divided into non-overlapping regions, and the matching result is iteratively refined by reducing the window size. A window-parallel-and-pixel-parallel architecture is also proposed to fully exploit the potential parallelism of the algorithm. The architecture also reduces the complexity of the interconnection network between memory and functional units based on the regularity of reference pixels. The stereo matching processor is implemented on an FPGA. Its performance is 80 times higher than that of a microprocessor (Pentium 4 at 2 GHz), and is enough to generate a 3-D depth image at the video rate of 33 MHz.
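For readers unfamiliar with SAD-based matching, the sketch below shows the basic operation for a single pixel with a fixed square window: the disparity is the horizontal shift that minimizes the sum of absolute differences between the left and right windows. It is a plain software illustration with hypothetical function names and a synthetic image pair; the adaptive window-size control and the parallel architecture described in the abstract are not modeled.

```python
def sad(left, right, y, x, d, half):
    """Sum of absolute differences between a window in the left image centered
    at (y, x) and the window in the right image shifted left by d pixels."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
    return total

def best_disparity(left, right, y, x, max_d, half):
    """Disparity in 0..max_d with the smallest SAD (window must stay in the image)."""
    candidates = [d for d in range(max_d + 1) if x - half - d >= 0]
    return min(candidates, key=lambda d: sad(left, right, y, x, d, half))

# Synthetic pair: the left image is the right image shifted by two pixels.
ROWS, COLS, SHIFT = 8, 12, 2
right = [[(7 * x * x + 13 * y + x * y) % 251 for x in range(COLS)] for y in range(ROWS)]
left = [[right[y][x - SHIFT] if x >= SHIFT else 0 for x in range(COLS)] for y in range(ROWS)]

disparity = best_disparity(left, right, y=4, x=6, max_d=3, half=1)
print(disparity)
```

Each window comparison is independent of the others, across both pixels and candidate disparities, which is the parallelism a window-parallel and pixel-parallel architecture can exploit.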
MT19937 is a variant of the Mersenne Twister, a pseudo-random number generator. This study presents new designs for an MT19937 circuit suitable for custom computing machinery for high-performance scientific simulations. Our designs can generate multiple random numbers per cycle (multi-port design). The estimated throughput of a 52-port design was 262 Gbps, which is 115 times higher than that of software on a Pentium 4 (2.53 GHz) processor. Multi-port designs proved to be more cost-effective than using multiple single-port designs. The initialization circuit can be included without performance loss in exchange for a slight increase in logic scale.
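For reference, the standard 32-bit MT19937 algorithm that such a circuit computes can be written compactly in software. The sketch below follows the published reference algorithm (state of 624 words, offset 397, standard tempering constants); the class and method names are arbitrary, and it is a sequential model only, not a description of the multi-port hardware.

```python
class MT19937:
    """Software model of the 32-bit Mersenne Twister MT19937."""

    N, M = 624, 397

    def __init__(self, seed=5489):
        # Standard initialization recurrence from the reference implementation.
        self.mt = [0] * self.N
        self.mt[0] = seed & 0xFFFFFFFF
        for i in range(1, self.N):
            prev = self.mt[i - 1]
            self.mt[i] = (1812433253 * (prev ^ (prev >> 30)) + i) & 0xFFFFFFFF
        self.index = self.N  # force a state update before the first output

    def _twist(self):
        """Regenerate all 624 state words."""
        for i in range(self.N):
            # Upper bit of word i joined with the lower 31 bits of word i+1.
            y = (self.mt[i] & 0x80000000) | (self.mt[(i + 1) % self.N] & 0x7FFFFFFF)
            nxt = self.mt[(i + self.M) % self.N] ^ (y >> 1)
            if y & 1:
                nxt ^= 0x9908B0DF
            self.mt[i] = nxt
        self.index = 0

    def next_u32(self):
        """Return the next tempered 32-bit output."""
        if self.index >= self.N:
            self._twist()
        y = self.mt[self.index]
        self.index += 1
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        y ^= y >> 18
        return y & 0xFFFFFFFF

rng = MT19937()          # default seed 5489
first = rng.next_u32()   # 3499211612
```

Each new state word depends only on the words at offsets 0, 1, and 397, so several consecutive words can be updated in the same cycle without waiting on each other. This is one plausible reading of why multi-port generation is feasible in hardware.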
Chun-Lung HSU Wen-Tso WANG Ying-Fu HONG
This work presents a frequency-scaling low-power (FSLP) design methodology for managing power consumption of cores in the tile-based network-on-chip (NOC) architecture. A moving picture experts group (MPEG) core is tested using the field-programmable gate array (FPGA) implementation to verify the feasibility of the proposed method. Measurement results show that about 30% power consumption can be saved in the MPEG core and reveal that the proposed FSLP design method can be suitable for cores in the tile-based NOC applications.