Yi GUO Heming SUN Ping LEI Shinji KIMURA
Approximate multiplier design is an effective technique to improve hardware performance at the cost of accuracy loss. Current approximate multipliers are mostly ASIC-based and are dedicated to one particular application. In contrast, FPGAs have been an attractive choice for many applications because of their high performance, reconfigurability, and fast development cycle. This paper presents a novel methodology for designing approximate multipliers by employing FPGA-based fabrics (primarily look-up tables and carry chains). The area and latency are significantly reduced by applying approximation to carry results and cutting the carry propagation path in the multiplier. Moreover, we explore the architectural space of higher-order multipliers by using our proposed small-size approximate multipliers as elementary modules. For different accuracy-hardware requirements, eight configurations of the approximate 8×8 multiplier are discussed. In terms of mean relative error distance (MRED), the error of the proposed 8×8 multiplier is as low as 1.06%. Compared with the exact multiplier, our proposed design can reduce area by 43.66% and power by 24.24%. The critical path latency reduction is up to 29.50%. The proposed multiplier design has a better accuracy-hardware tradeoff than other designs with comparable accuracy. Moreover, image sharpening is used to assess the efficiency of the approximate multipliers in an application.
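As an illustration of the MRED metric quoted above, the following minimal Python sketch evaluates it exhaustively for an 8×8 multiplier. The function `approx_mul8` is a hypothetical stand-in that simply truncates the low bits of each partial product; it is not the LUT and carry-chain design of the paper, and the names and the `dropped_bits` parameter are assumptions made only for this example.

```python
def approx_mul8(a: int, b: int, dropped_bits: int = 4) -> int:
    """Hypothetical approximate 8x8 multiplier (NOT the paper's design).

    Each shifted partial product has its lowest `dropped_bits` bits cleared
    before accumulation, so the result never exceeds the exact product.
    """
    keep_mask = ~((1 << dropped_bits) - 1)
    total = 0
    for i in range(8):
        if (b >> i) & 1:
            total += (a << i) & keep_mask
    return total


def mred(mul, bits: int = 8) -> float:
    """Mean relative error distance over all input pairs with a nonzero product.

    The relative error distance of one pair is |approx - exact| / exact.
    """
    total = 0.0
    count = 0
    for a in range(1, 1 << bits):
        for b in range(1, 1 << bits):
            exact = a * b
            total += abs(mul(a, b) - exact) / exact
            count += 1
    return total / count


if __name__ == "__main__":
    print(f"MRED of the truncating stand-in: {mred(approx_mul8):.4%}")
```

Setting `dropped_bits=0` makes the stand-in exact, which gives a quick sanity check that the metric returns zero for an exact multiplier.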
An analytical performance evaluation model is presented in this paper. A time-splitting transmitter circuit employing a selectively activated flip-driver (SAFD) is presented and its performance is estimated by the new model. The optimal partitioning method which maximizes the performance of a given bus-invert (BI) coding circuit is also presented. When a bus is optimally partitioned, an ordinary BI circuit can reduce the number of bus transitions by about 25%, while an SAFD circuit can remove about 35% of them. The newly developed method is verified by simulations whose results correspond very well to the values predicted by the model.
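To make the role of bus partitioning concrete, here is a minimal Python sketch of bus-invert coding on a bus split into equal sub-buses, each with its own invert line. It uses a greedy per-word choice (send the word plain or inverted, whichever toggles fewer wires, counting the invert line), which is one common formulation of BI coding. It does not model the SAFD circuit or the analytical model of the paper, and all function names are assumptions for illustration.

```python
def hamming(x: int, y: int) -> int:
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")


def raw_transitions(words, width: int) -> int:
    """Wire transitions when words are sent unencoded (bus starts at 0)."""
    mask = (1 << width) - 1
    prev, total = 0, 0
    for w in words:
        w &= mask
        total += hamming(prev, w)
        prev = w
    return total


def bus_invert_transitions(words, width: int, groups: int = 1) -> int:
    """Wire transitions (data lines plus invert lines) with bus-invert coding.

    The bus is split into `groups` equal partitions, each with its own
    invert line. All lines are assumed to start at 0.
    """
    if width % groups != 0:
        raise ValueError("width must be divisible by groups")
    sub = width // groups
    mask = (1 << sub) - 1
    prev_out = [0] * groups   # value last driven on each sub-bus
    prev_inv = [0] * groups   # state of each invert line
    total = 0
    for w in words:
        for g in range(groups):
            data = (w >> (g * sub)) & mask
            cost_plain = hamming(prev_out[g], data) + prev_inv[g]
            cost_inverted = hamming(prev_out[g], data ^ mask) + (1 - prev_inv[g])
            if cost_inverted < cost_plain:
                prev_out[g], prev_inv[g] = data ^ mask, 1
                total += cost_inverted
            else:
                prev_out[g], prev_inv[g] = data, 0
                total += cost_plain
    return total


if __name__ == "__main__":
    pattern = [0x00, 0xFF, 0x00, 0xFF]
    print(raw_transitions(pattern, 8))            # unencoded
    print(bus_invert_transitions(pattern, 8, 1))  # one 8-bit group
    print(bus_invert_transitions(pattern, 8, 2))  # two 4-bit groups
```

The alternating pattern is a best case for a single group. On less regular data, smaller partitions can invert more selectively at the price of extra invert lines, which is the tradeoff an optimal partitioning has to balance.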
This paper proposes an energy-efficient processor which can be used as a design alternative to dynamic voltage scaling (DVS) processors in embedded system design. The processor consists of multiple PE (processing element) cores and a selective set-associative cache memory. The PE-cores have the same instruction set architecture but differ in their clock speeds and energy consumption. Only a single PE-core is activated at a time, and the other PE-cores are deactivated using clock gating and signal gating techniques. The major advantage over DVS processors is the small overhead for changing performance. Gate-level simulation demonstrates that our processor can change its performance within 1.5 microseconds while dissipating about 10 nanojoules, whereas conventional DVS processors need hundreds of microseconds and dissipate a few microjoules for the performance transition. This makes it possible to apply our multi-performance processor to many real-time systems and to perform finer-grained and more sophisticated dynamic voltage control.
Giuseppe CARUSO Alessio MACCHIARELLA
In this paper, a design methodology for the minimization of various performance metrics of MOS Current-Mode Logic (MCML) circuits is described. In particular, it allows the designer to minimize the delay under a given power consumption, the power consumption under a given delay, and the power-delay product. Design solutions can be evaluated graphically or by simple and effective automatic procedures implemented within the MATLAB environment. The methodology exploits the novel concepts of crossing-point current and crossing-point capacitance. It also gives the designer useful insights into the dependence of the performance metrics on design variables and fan-out capacitance. The methodology was validated by designing several MCML circuits in an IBM 130 nm CMOS process.
In this paper, we present a new fast Fourier transform (FFT) algorithm to reduce the table size of twiddle factors required in pipelined FFT processing. The table size is large enough to occupy significant area and power consumption in long-point FFT processing. The proposed algorithm can reduce the table size to half, compared to the radix-2² algorithm, while retaining the simple structure. To verify the proposed algorithm, a 2048-point pipelined FFT processor is designed using a 0.18 µm CMOS process. By combining the proposed algorithm and the radix-2² algorithm, the table size is reduced to 34% and 51% compared to the radix-2 and radix-2² algorithms, respectively. The FFT processor occupies 1.28 mm² and achieves a signal-to-quantization-noise ratio (SQNR) of more than 50 dB.
As semiconductor processing technology advances, complex, high density circuits can be integrated in a chip. However, increasing energy consumption is becoming one of the most important limiting factors. Power estimation at the early stage of design is essential since design changes at later stages may significantly lengthen the design period and increase the costs. For efficient power estimation, we analyze the "key" control signals of a digital circuit and develop power models for several operational modes. The trade-off between accuracy and complexity can be made by choosing the number and the complexity of the power models. When compared with those of logic simulation based estimation, experimental results show that 13 to 15 times faster power estimation with an estimation error of about 5% is possible. We have also developed new logic-level power modeling techniques in which logic gates are levelized and several levels are selected to build power model tables. This table based method shows significant improvement in estimation accuracy and a slight improvement in efficiency when compared to a well-known previous method. The average estimation error has been reduced from 13.3% to 3.8%.
Ji-Hoon LIM Jong-Chan HA Won-Young JUNG Yong-Ju KIM Jae-Kyung WEE
A novel high-speed and low-voltage CMOS level shifter circuit is proposed. The proposed circuit is suitable for block-level dynamic voltage and frequency scaling (DVFS) environments or multiple-clock and multiple-power-domain logic blocks. In order to achieve high performance in a chip consisting of logic blocks with different VDD voltages, the proposed circuit uses circuit techniques that reduce the capacitive loading of input signals and minimize the contention between pull-up and pull-down transistors through a positive feedback loop. The techniques improve the slew rate of output signals, so that the level transient delay and duty distortions can be reduced. The proposed level up/down shifters are designed to operate over a wide range of voltage and frequency and are verified with Berkeley's 65 nm CMOS model parameters, covering a voltage range from 0.6 to 1.6 V and frequencies up to at least 1000 MHz with duty errors within 3%. In simulation with Berkeley's 65 nm CMOS model parameters, the level shifter circuits overcome the duty distortion that would otherwise prevent high-speed operation, keeping the duty ratio error within 3% at 1 GHz. For performance comparison with previously reported level shifters, simulations are carried out with 0.35 µm CMOS technology, 0.13 µm IBM CMOS technology and Berkeley's 65 nm CMOS model parameters. The comparison shows that delay time and duty ratio distortion are improved by about 68% and 75%, respectively.
Hye-Mi CHOI Ji-Hoon KIM In-Cheol PARK
As turbo decoding is a highly memory-intensive algorithm that consumes considerable power, a major issue in practical implementation is reducing power consumption. This paper presents an efficient reverse calculation method that lowers the power consumption by reducing the number of memory accesses required in turbo decoding. The reverse calculation method is proposed for the Max-log-MAP algorithm, and it is combined with a scaling technique to obtain a new decoding algorithm, called hybrid log-MAP, that achieves a BER performance similar to that of the log-MAP algorithm. For the W-CDMA standard, experimental results show that the proposed reverse calculation method reduces memory accesses by 80%. A hybrid log-MAP turbo decoder based on the proposed reverse calculation reduces power consumption and memory size by 34.4% and 39.2%, respectively.
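For readers unfamiliar with the two baseline algorithms named above, the following Python sketch contrasts the core operation of log-MAP (the Jacobian logarithm, often written max*) with its Max-log-MAP approximation (a plain maximum). This is textbook background only; the reverse calculation method and the scaling technique of the paper are not shown, and the function names are assumptions for illustration.

```python
import math


def max_star(a: float, b: float) -> float:
    """Jacobian logarithm ln(e^a + e^b), the core operation of log-MAP."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a: float, b: float) -> float:
    """Max-log-MAP approximation: the correction term is dropped."""
    return max(a, b)


if __name__ == "__main__":
    for a, b in [(0.0, 0.0), (1.0, 2.0), (-3.0, 4.0)]:
        print(a, b, max_star(a, b), max_log(a, b))
```

The dropped correction term is largest, ln 2, when the two arguments are equal and vanishes as they move apart, which is why Max-log-MAP is cheaper but slightly less accurate than log-MAP.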
Shoji KAWAHITO Kazutaka HONDA Masanori FURUTA Nobuhiro KAWAI Daisuke MIYAZAKI
In this paper, low-power design techniques for high-speed A/D converters are reviewed and discussed. Pipeline and parallel-pipeline architectures are treated, as these are the dominant architectures when a high sampling rate and high resolution are required with reasonable power dissipation. A systematic approach to the power optimization of pipeline and parallel-pipeline ADCs is introduced, based on models of the noise and response time of a building block in the multiple-stage pipeline ADC. Finally, the theoretical minimum of required power as a function of the sampling rate, resolution and SNR is discussed. The analysis shows that, with the development of new circuits and systems that approach this minimum, the power can be further reduced to less than one tenth without changing the basic architectures.
Sung Woo CHUNG Gi Ho PARK Sung Bae PARK
Even in embedded processors, branch prediction accuracy significantly affects performance. In designing a branch predictor, microarchitects should consider area, delay and power consumption in addition to accuracy. We propose two techniques to reduce the power consumption; these techniques do not require any additional storage arrays, do not incur additional delay (except for one MUX delay) and never degrade accuracy. One is to look up two predictions at a time by increasing the width (decreasing the depth) of the PHT (Prediction History Table). The other is to reduce unnecessary accesses to the BTB (Branch Target Buffer) by accessing the PHT in advance. Analysis results with Samsung Memory Compiler show that the proposed techniques reduce the power consumption of the branch predictor by 15-52%.
Sung Woo CHUNG Gi Ho PARK Sung Bae PARK
This letter proposes a low-power tournament branch predictor, in which the number of accesses to the branch predictors (local predictor or global predictor) is reduced. Analysis results with Samsung Memory Compiler show that the proposed branch predictor reduces the power consumption by 24-45%, compared to the conventional tournament branch predictor, not requiring any additional storage arrays, not incurring any additional delay and never harming accuracy.
The design of the analog part of a mixed analog-digital IC for a commercial wireless burglar alarm system is presented as an example of a very low-power VLSI design for battery-operated systems. The main constraint is battery life, which must be at least five years (with a standard camera battery). An operational amplifier, a power supply monitor and an oscillator are the core of the design. The operational amplifier draws 1.5 µA, while the entire analog part draws 4 µA. Measurements of each individual part show compliance with the specification. Tests in the working environment show full functionality. Even though the example is application specific, the design solutions and each individual element can also be used in many other battery-operated low-frequency devices (e.g. environmental parameter monitoring).
Daisuke MIYAZAKI Shoji KAWAHITO
In this paper, we present a low-power and area-efficient design method of embedded high-speed A/D converters for mixed analog-digital system LSI's. As the A/D converter topology, a 1.5 bit/stage interleaved pipeline A/D converter is employed, because the basic topology covers a wide range of specifications on the conversion frequency and the resolution. The design method determines the minimum DC supply current, the minimum device sizes and the minimum number of channels to meet the precision given by the specification. This paper also points out that the interleaved pipeline structure is very effective for low-power design of high-speed A/D converters whose sampling frequency is over 100 MHz.
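The 1.5 bit/stage topology mentioned above can be illustrated with an idealized behavioral model. The Python sketch below assumes the textbook form of the stage (comparator thresholds at ±Vref/4, a three-level decision d in {-1, 0, +1}, and residue 2·Vin − d·Vref) with ideal components. It models neither interleaving nor the current and device sizing that the design method determines, and the function names are assumptions for illustration.

```python
def stage_1p5bit(vin: float, vref: float = 1.0):
    """One ideal 1.5-bit stage: returns (decision, residue).

    Thresholds at +/- vref/4 and residue = 2*vin - d*vref (textbook form,
    assumed here; not taken from the paper).
    """
    if vin > vref / 4:
        d = 1
    elif vin < -vref / 4:
        d = -1
    else:
        d = 0
    return d, 2 * vin - d * vref


def pipeline_adc(vin: float, stages: int = 8, vref: float = 1.0):
    """Cascade of ideal 1.5-bit stages.

    Returns (code, estimate), where code = sum(d_i * 2**(stages - i)) and
    estimate = code * vref / 2**stages approximates vin.
    """
    code = 0
    v = vin
    for _ in range(stages):
        d, v = stage_1p5bit(v, vref)
        code = 2 * code + d  # overlapping add of the redundant digits
    return code, code * vref / (1 << stages)


if __name__ == "__main__":
    for vin in (-0.8, -0.1, 0.0, 0.33, 0.6):
        code, est = pipeline_adc(vin)
        print(f"vin={vin:+.3f} code={code:+d} estimate={est:+.5f}")
```

In this ideal model the residue stays within ±Vref for inputs within ±Vref, so after N stages the estimate is within Vref/2^N of the input. In a real converter, the redundancy of the three-level decision is what lets the stage tolerate comparator offsets.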
This paper proposes a new method for dynamically controlling the clock speed of a processor in order to reduce power consumption without decreasing system performance. It automatically tunes the processor's speed by monitoring its activities and avoiding useless work so as not to exhaust the battery energy. Experiments with performance bottlenecks caused by disk activities show that the proposed method is very effective in comparison with the traditional one, in which the processor's speed is fixed.