Yi GUO Heming SUN Ping LEI Shinji KIMURA
Approximate computing has emerged as a promising approach for error-tolerant applications to improve hardware performance at the cost of some loss of accuracy. Multiplication is a key arithmetic operation in these applications. In this paper, we propose a low-cost approximate multiplier design that employs new probability-driven inexact compressors. This compressor design is introduced to reduce the height of the partial product matrix to two rows, based on the probability distribution of the sum of the partial products. To compensate for the accuracy loss of the multiplier, a grouped error recovery scheme is proposed, which achieves different levels of accuracy. In terms of mean relative error distance (MRED), the accuracy losses of the proposed multipliers range from 1.07% to 7.86%. Compared with the Wallace multiplier in a 40 nm process, the most accurate variant of the proposed multipliers reduces power by 59.75% and area by 42.47%. The critical path delay is reduced by more than 12.78%. The proposed multiplier design has a better accuracy-performance trade-off than other designs with comparable accuracy. In addition, the efficiency of the proposed multiplier design is assessed in an image processing application.
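The MRED figure quoted above can be reproduced for any small multiplier by exhaustive enumeration. The following is a minimal sketch, assuming the usual definition of MRED as the mean of |approximate − exact| / exact over all operand pairs with a nonzero exact product. The function `truncated_multiply` is a hypothetical stand-in (it simply zeroes the low bits of each operand) and is not the probability-driven compressor design of the paper.

```python
def truncated_multiply(a: int, b: int, k: int = 2) -> int:
    """Hypothetical approximate multiplier: zero the k low bits of each operand."""
    mask = ~((1 << k) - 1)
    return (a & mask) * (b & mask)


def mred(approx_mul, bits: int = 8) -> float:
    """Mean relative error distance over all operand pairs with a nonzero exact product."""
    total = 0.0
    count = 0
    for a in range(1, 1 << bits):
        for b in range(1, 1 << bits):
            exact = a * b
            total += abs(approx_mul(a, b) - exact) / exact
            count += 1
    return total / count


if __name__ == "__main__":
    print(f"MRED of the stand-in multiplier: {mred(truncated_multiply):.4f}")
```

Any callable taking two unsigned integers can be passed as `approx_mul`, so the same harness could score a bit-accurate model of a real approximate multiplier.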
Tongxin YANG Tomoaki UKEZONO Toshinori SATO
Many applications, such as image signal processing, have an inherent tolerance for insignificant inaccuracies. Multiplication is a key arithmetic function for many applications. Approximate multipliers are considered an efficient technique to trade off energy against performance and accuracy in error-tolerant applications. Here, we design and analyze four approximate multipliers that demonstrate lower power consumption and shorter critical path delay than the conventional multiplier. They employ an approximate tree compressor that halves the height of the partial product tree and generates a vector to compensate for the accuracy loss. Compared with the conventional Wallace tree multiplier, one of the evaluated 8-bit approximate multipliers reduces power consumption and critical path delay by 36.9% and 38.9%, respectively. With a 0.25% normalized mean error distance, the silicon area required to implement the multiplier is reduced by 50.3%. Our multipliers outperform the previously proposed approximate multipliers in power consumption, critical path delay, and design area. Results from two image processing applications also demonstrate that the images processed by our multipliers are of sufficient quality for such error-tolerant applications.
Mona MORADI Reza FAGHIH MIRZAEE Keivan NAVI
This paper presents new Binary Converters (or current-mode compressors) using carbon nanotube field effect transistors. The new designs are made of three parts: 1) the input currents, which are converted to voltage; 2) threshold detectors; and 3) the output current flow paths. In addition, an 8×8-bit multiplier is used as a benchmark to estimate their efficiency. The first approach is based on high-order Binary Converters, and the second one is composed only of 4BCs and Half Adders.
Taeko MATSUNAGA Shinji KIMURA Yusuke MATSUNAGA
Multi-operand adders that calculate the summation of more than two operands usually consist, in ASIC implementations, of compressor trees, which reduce the number of operands to two without any carry propagation, and carry-propagate adders for the two remaining operands. Compressor trees that consist of full adders and half adders cannot be implemented efficiently on LUT-based FPGAs, and carry-chains or dedicated structures have been utilized to produce multi-operand adders on FPGAs. Recent studies indicate that compressor trees can be implemented efficiently on LUTs using Generalized Parallel Counters (GPCs) as the building blocks of compressor trees. This paper addresses the problem of synthesizing compressor trees based on GPCs. Based on the observation that characteristics such as area, power, and delay correlate roughly with the total number and the maximum level of GPCs, the target problem can be regarded as minimizing the total number of GPCs and the maximum level of the GPCs, for which an ILP-based approach is proposed. The key point of our formulation is to model the compression process itself, rather than the structure of the compressor tree as in the existing approach, which reduces the number of variables and constraints in the ILP formulation. The experimental results demonstrate the advantage of our formulation in terms of quality and runtime.
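To make the building block concrete, a GPC can be modeled behaviorally as a counter that accepts bits from several adjacent columns and emits their weighted sum in binary. The sketch below is only an illustration of that definition, not the ILP formulation of the paper. The function name `gpc` and the list-of-columns input layout are assumptions chosen for this example.

```python
def gpc(columns, out_bits):
    """Behavioral model of a generalized parallel counter.

    columns[i] is the list of input bits that carry weight 2**i.
    Returns the out_bits-bit binary count, least significant bit first.
    """
    value = sum(sum(bits) << i for i, bits in enumerate(columns))
    assert value < (1 << out_bits), "output width too small for this GPC shape"
    return [(value >> j) & 1 for j in range(out_bits)]


if __name__ == "__main__":
    # A (1,5;3) GPC: five bits of weight 1 and one bit of weight 2, 3 output bits.
    print(gpc([[1, 1, 1, 1, 1], [1]], 3))
    # A (6;3) GPC: six bits of weight 1, 3 output bits.
    print(gpc([[1, 0, 1, 1, 0, 1]], 3))
```

A compressor tree then amounts to repeatedly covering the bits of each column with such counters until at most two rows remain, which is the process the ILP formulation optimizes.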
Amir FATHI Sarkis AZIZIAN Khayrollah HADIDI Abdollah KHOEI
A novel high-speed 4-2 compressor using static and pass-transistor logic has been designed in a 0.35 µm CMOS technology. In order to reduce gate-level delay and increase speed, some changes are made to the truth table of the conventional 4-2 compressor, which led to the simplification of the logic functions for all parameters. Therefore, power dissipation is decreased. In addition, because the paths from all inputs to the outputs are similar, the delays are the same. Hence, there is no need for extra buffers in low-latency paths to equalize the delays.
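For reference, the conventional 4-2 compressor that serves as the starting point can be modeled as two cascaded full adders. The sketch below shows this textbook structure and checks its defining identity, x1 + x2 + x3 + x4 + cin = sum + 2·(carry + cout), over all 32 input combinations. It does not reproduce the modified truth table of the paper, and the function names are chosen for this example.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """Conventional 4-2 compressor built from two cascaded full adders."""
    s1, cout = full_adder(x1, x2, x3)      # cout does not depend on cin
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


def check_compressor():
    """Verify the arithmetic identity for every input combination."""
    for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(x1, x2, x3, x4, cin)
        if x1 + x2 + x3 + x4 + cin != s + 2 * (carry + cout):
            return False
    return True


if __name__ == "__main__":
    print(check_compressor())
```

Because `cout` is independent of `cin`, a row of such cells passes `cout` sideways without creating a carry-propagation chain.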
This paper proposes a new watermarking algorithm and its hardware implementation, by which the watermarking process and an image compression process can operate in conjunction and in parallel, without degrading the performance of the compression process. The goal of the proposed watermarking scheme is to provide a basis for asserting ownership and for authenticating the integrity of the watermark-embedded image by detecting errors and their positions without the original image (blind watermarking). Our watermarking scheme replaces one or several bit-plane(s) of the DC subband with the watermark after 2DDWT (2-Dimensional Discrete Wavelet Transform) decomposition, which is the basic transformation in DWT-based image compression such as JPEG2000. If more than one bit-plane is involved, the position in which to embed each watermark bit is selected among the bit-planes by a random number generated with an LFSR (Linear Feedback Shift Register). Experimental results showed that, for all the considered attacks except high-ratio JPEG compression, the error ratios in the watermarks extracted by our algorithm were below 3% and the extracted watermarks were unambiguously recognizable in all cases. The FPGA implementation operated stably at a clock frequency of 82 MHz. This hardware was merged into a DWT-based image compression codec that runs in real time at a clock frequency of 66 MHz. As a result, the codec and the watermarking operate together in real time at 66 MHz. The watermarking scheme used 4,037 LABs (24%) of the hardware resources of an APEX20KC EP20K400CF672-7 from Altera.
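The LFSR-driven choice of bit-plane can be sketched in software as follows. This is an illustrative model only: the 16-bit LFSR with taps 16, 14, 13, 11, the seed value, the candidate plane list, and the one-bit-per-coefficient layout are all assumptions made for this example, since the abstract does not specify them. Coefficients are assumed to be non-negative integers, and `coeffs` and `wm_bits` are assumed to have equal length.

```python
def lfsr16(seed=0xACE1):
    """16-bit Fibonacci LFSR (taps 16, 14, 13, 11); yields successive states."""
    state = seed
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state


def embed(coeffs, wm_bits, planes, seed=0xACE1):
    """Overwrite one LFSR-selected bit-plane of each coefficient with a watermark bit."""
    rng = lfsr16(seed)
    marked = []
    for c, w in zip(coeffs, wm_bits):
        p = planes[next(rng) % len(planes)]      # pseudo-random plane choice
        marked.append((c & ~(1 << p)) | (w << p))
    return marked


def extract(coeffs, planes, seed=0xACE1):
    """Blind extraction: replay the same LFSR sequence and read the chosen planes."""
    rng = lfsr16(seed)
    return [(c >> planes[next(rng) % len(planes)]) & 1 for c in coeffs]


if __name__ == "__main__":
    dc = [200, 13, 77, 1024]          # hypothetical DC-subband coefficients
    wm = [1, 0, 1, 1]
    marked = embed(dc, wm, planes=[0, 1, 2])
    print(extract(marked, planes=[0, 1, 2]) == wm)
```

Extraction needs only the seed and the plane list, not the original image, which is what makes the scheme blind.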
In this paper, a high-performance 32×32-bit multiplier for a DSP core is proposed. The multiplier is composed of a Booth encoder block, a data compression block, and a 64-bit adder block. In the Booth encoder block, a conditional sign decision Booth encoder that reduces the gate delay and power consumption is proposed. In the data compression block, 4-2 and 9-2 data compressors based on a novel compound logic are used for efficient compression of the extra sign bit. In the 64-bit adder block, an adaptive MUX-based conditional select adder with a separated carry generation block is proposed. The proposed 32×32-bit multiplier is designed by a full-custom method and contains about 28,000 transistors in an active area of 900 µm × 500 µm in 0.25 µm CMOS technology. According to the experimental results, the multiplication time of the multiplier is about 3.2 ns at a 2.5 V power supply, and it consumes about 50 mW at 100 MHz.
Currently, a typical 54×54-bit multiplier is composed of a parallel structured architecture with an encoder block to implement the Modified Booth's algorithm, a block to implement the data compression, and a 108-bit Carry Look-Ahead (CLA) adder. The key idea in the present paper is a power optimization for the data compressors based on a Window Detector. The role of the Window Detector is to detect the input data, activate a selected operation unit, choose the optimized output data, and drive the next stage. It can reduce the power consumption drastically because only one selected operation unit (a Window) is activated. The power consumption of the proposed data compressors is reduced by about 33% compared with that of the conventional multiplier, while the propagation delay is nearly the same as that of the conventional one. Furthermore, the dependence of the power consumption on the input data transitions is shown for both static CMOS logic and nMOS pass-transistor logic.
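The encoder stage of such a multiplier can be illustrated with the textbook radix-4 form of the Modified Booth's algorithm, which recodes an n-bit two's-complement multiplier into n/2 digits in {−2, −1, 0, 1, 2} and so halves the number of partial products. The sketch below is a behavioral model under that standard definition. It does not model the Window Detector or any circuit-level detail, and the function names are chosen for this example.

```python
def booth_radix4_digits(y: int, bits: int = 54):
    """Radix-4 Booth digits of a signed `bits`-bit integer, least significant first."""
    assert bits % 2 == 0
    assert -(1 << (bits - 1)) <= y < (1 << (bits - 1))
    u = y & ((1 << bits) - 1)          # two's-complement bit pattern of y
    digits = []
    prev = 0                           # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)   # digit = y[i] + y[i-1] - 2*y[i+1]
        prev = b1
    return digits


def booth_multiply(x: int, y: int, bits: int = 54) -> int:
    """Sum the Booth-selected partial products d_i * x * 4**i."""
    return sum((d * x) << (2 * i)
               for i, d in enumerate(booth_radix4_digits(y, bits)))


if __name__ == "__main__":
    print(booth_radix4_digits(7, 8))
    print(booth_multiply(123, -45, 8) == 123 * -45)
```

For 54-bit operands this yields 27 partial products, which the compression block then reduces to two rows for the final 108-bit adder.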
Hiroshi MAKINO Hiroaki SUZUKI Hiroyuki MORINAKA Yasunobu NAKASE Hirofumi SHINOHARA Koichiro MASHIKO Tadashi SUMI Yasutaka HORIBA
This paper describes the design of a high-speed 4-2 compressor for fast multipliers. Through a survey of six kinds of representative conventional 4-2 compressors (RBA 1-3 and NBA 1-3) in both the redundant binary (RB) and the normal binary (NB) schemes, we identified two problems that degrade the operating speed. The first is the use of multi-input complex gates, and the second is the existence of transmission gates (TGs) at the input and/or output stages. To solve these problems, we propose high-speed 4-2 compressors using the RB scheme, which we call high-speed redundant binary adders (HSRBAs). Six kinds of HSRBAs, HSRBA 1-6, were derived by making the Boolean equations suitable for high-speed CMOS circuits. Among them, HSRBA2, HSRBA4 and HSRBA6 have no multi-input complex gates and no input/output TGs, and operate with a delay time of 0.89 ns, which is the fastest of all 4-2 compressors. We investigated the logical relation between HSRBAs and conventional 4-2 compressors by analyzing the Boolean equations for each circuit. This investigation shows that all the conventional redundant binary adders RBA1-3 have the same logic structure as HSRBA2. We also show that the conventional normal binary adders NBA1-3 have the same logic structures as HSRBA1, HSRBA3 and HSRBA5, respectively. This implies that all 4-2 compressors can be derived from the same equation regardless of RB or NB. We applied HSRBA2 to a 54×54-bit multiplier using 0.5-µm CMOS technology. The multiplication time at a supply voltage of 3.3 V was 8.8 ns. This is the fastest 54×54-bit multiplier in 0.5-µm CMOS so far, and 83% of the speed improvement is due to the high-speed 4-2 compressor.
This paper proposes a distributed built-in self-test (BIST) technique and its test design platform for VLSIs. This BIST uses pattern generators, compressors, and a controller with lower hardware overhead. The platform cuts down on the number of complicated operations needed for BIST insertion and evaluation, so the BIST implementation turn-around time (TAT) is dramatically reduced. Experimental results for the 110 k-gate arithmetic execution blocks of an image-processing LSI show that using this BIST structure and platform enables the entire BIST to be implemented within five days. The implemented BIST has a 1% hardware overhead and 96% fault coverage. This platform will significantly reduce testing costs for time-to-market and mass-produced LSIs.
Youji KANIE Yasushi KUBOTA Shinji TOYOYAMA Yasuaki IWASE Shuhei TSUCHIMOTO
This report describes 4-2 compressors composed of Complementary Pass-Transistor Logic (CPL). We show that the circuit designs of the 4-2 compressors can be optimized for high speed and small size using only exclusive-ORs and multiplexers. According to a circuit simulation with 0.8 µm CMOS device parameters, the maximum propagation delay and the average power consumption per unit adder are 1.32 ns and 11.6 pJ, respectively.
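A 4-2 compressor can indeed be expressed with exclusive-ORs and 2-to-1 multiplexers alone. The sketch below models the widely used XOR/MUX formulation at the logic level and verifies it exhaustively. It is not taken from the report, it says nothing about the CPL transistor-level circuit, and the helper names `mux` and `compressor_4_2_xor_mux` are chosen for this example.

```python
from itertools import product


def mux(sel, a, b):
    """2-to-1 multiplexer: returns a when sel is 1, otherwise b."""
    return a if sel else b


def compressor_4_2_xor_mux(x1, x2, x3, x4, cin):
    """4-2 compressor using only XOR gates and multiplexers."""
    x12 = x1 ^ x2
    x34 = x3 ^ x4
    x1234 = x12 ^ x34
    total = x1234 ^ cin
    cout = mux(x12, x3, x1)        # majority of x1, x2, x3
    carry = mux(x1234, cin, x4)    # majority of (x1^x2^x3), x4, cin
    return total, carry, cout


def check_xor_mux():
    """Verify x1+x2+x3+x4+cin == sum + 2*(carry + cout) for all inputs."""
    for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2_xor_mux(x1, x2, x3, x4, cin)
        if x1 + x2 + x3 + x4 + cin != s + 2 * (carry + cout):
            return False
    return True


if __name__ == "__main__":
    print(check_xor_mux())
```

In this formulation the longest path from any input to `total` passes through three XOR stages, which is why the structure is popular for fast partial product reduction.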