Tadayoshi ENOMOTO Nobuaki KOBAYASHI Tomomi EI
To drastically reduce the power dissipation (P) of an absolute difference accumulation (ADA) circuit for H.26x/MPEG4 motion estimation, a fast block-matching (BM) algorithm called the Multiple Block-matching Step (MBS) algorithm has been developed. The MBS algorithm can drastically improve the block matching speed, while achieving the same visual quality as that of a full search (FS) BM algorithm. Power dissipation (P) of a 0.18-µm CMOS absolute difference accumulator (ADA) circuit employing the MBS algorithm is significantly reduced to the range of about 0.3% to 12% that of the same ADA circuit adopting FS.
Nobuaki KOBAYASHI Tomomi EI Tadayoshi ENOMOTO
To drastically reduce the dynamic power (PAT) and the leakage power (PST) of the CMOS MPEG4/H.264 motion estimation (ME) circuits, several power reduction techniques were developed. They were circuit architectures, which were able to reduce the supply voltages (VDD) and numbers of logic gates of not only the whole circuit but the critical path, a fast motion estimation algorithm, and a leakage current reduction circuit. A 0.18-µm CMOS ME circuit has been fabricated by adopting those techniques. At a clock frequency of 160 MHz and VDD of 1.25 V, PAT decreased to 75.9 µW, which was 5.35% that of a conventional ME circuit. PST also decreased to 0.82 nW, which was 3.93% that of the conventional ME circuit.
Discussed here is reduction of power dissipation for multi-media LSIs. First, both active power dissipation Pat and stand-by power dissipation Pst for both CMOS LSIs and GaAs LSIs are summarized. Then, general technologies for reducing Pat are discussed. Also reviewed are a wide variety of approaches (i.e., parallel and pipeline schemes, Chen's fast DCT algorithms, hierarchical search scheme for motion vectors, etc.) for reduction of Pat. The last part of the paper focuses on reduction of Pst. Reducing both Pat and Pst requires that both throughput and active chip areas be either maintained or improved.
Tadayoshi ENOMOTO Nobuaki KOBAYASHI
A motion estimation (ME) multimedia processor was developed by employing dynamic voltage and frequency scaling (DVFS) technique to greatly reduce the power dissipation. To make full use of the advantages of DVFS technique, a fast motion estimation (ME) algorithm was also developed. It can adaptively predict the optimum supply voltage and the optimum clock frequency before ME process starts for each macro-block for encoding. Power dissipation of the 90-nm CMOS DVFS controlled multimedia processor, which contained an absolute difference accumulator as well as a small on-chip DC/DC level converter, a minimum value detector and DVFS controller, was reduced to 38.48 µW, which was only 3.261% that of a conventional multimedia processor.
A fast, low-power 16-bit adder, 32-word register file and 512-bit cache SRAM have been developed using 0.25-µm GaAs HEMT technology for future multi-GHz processors. The 16-bit adder, which uses a negative logic binary look-ahead carry structure based on NOR gates, operates at the maximum clock frequency of 1.67 GHz and consumes 134.4 mW at a supply voltage of 0.6 V. The active area is 1.6 mm2 and there are about 1,230 FETs. A new DC/DC level converter has been developed for use in high-speed, low-power storage circuits such as SRAMs and register files. The level converter can increase the DC voltage, which is supplied to an active-load circuit on request, or supply a minimal DC voltage to a load circuit in the stand-by mode. The power dissipation (P) of the 32-word register file with on-chip DC/DC level converters is 459 mW, a reduction to 25.2% of that of an equivalent conventional register file, while the operating frequency (fc) was 5.17 GHz that is 74.8% of fc for the conventional register file. P for the 512-bit cache SRAM with the new DC/DC level converters is 34.3 mW, 89.7% of the value for an equivalent conventional cache SRAM, with the read-access time of 455 psec, only 1.1% longer than that of the conventional cache SRAM.
Nobuaki KOBAYASHI Tadayoshi ENOMOTO
To completely utilize the advantages of dynamic voltage and frequency scaling (DVFS) techniques, a quantized decoder (QNT-D) was developed. The QNT-D generates a quantized signal processing quantity (Q) using a predicted signal processing quantity (M). Q is used to produce the optimum frequency (opt.fc) and the optimum supply voltage (opt.VD) that are proportional to Q. To develop a DVFS controlled motion estimation (ME) processor, we used both the QNT-D and a fast ME algorithm called A2BC (Adaptively Assigned Breaking-off Condition) to predict M for each macro-block (MB). A DVFS controlled ME processor was fabricated using 90-nm CMOS technology. The total power dissipation (PT) of the processor was significantly reduced and varied from 38.65 to 99.5 µW, only 3.27 to 8.41 % of PT of a conventional ME processor, depending on the test video picture.
Masahide TAKADA Tadayoshi ENOMOTO
This paper reviews the progresses of static random access memories (SRAMs) for the last past 10 years and shows how much memory cell sizes have been reduced and how the chip sizes have increased to increase memory capacities. After a basic SRAM chip organization is briefly described for readers' convenience, various latest and important key technologies for CMOS, bipolar transistor and BiCMOS SRAMs are reviewed. Chip organizations and circuit technologies to improve performance, which have been used in recently reported high performance SRAMs, are also introduced. Future large scale and high speed SRAMs are also forecasted.
Hajime SASAKI Hiroyuki ABE Tadayoshi ENOMOTO Yoichi YANO
In the mid-1990s, ULSI technology will reach Sub-Half-Micron Era and Si chips with 0.30.4 µm design rules will be in the market. The shrinked device size makes it possible to realize a device count of more than 107 and a logic speed of 100200 picosecond per gate. By taking full advantage of these advanced process and device technologies, three basic trends; i.e., (a) higher integration and higher performance, (b) "system-on-chip" or system incorporation onto a single chip, and (c) customization, will be accelerated. Memories will include more logic fuctions on a chip and will become system memories. Microprocessors and ASICs will find a wider variety of applications in data processing, in signal processing and in control. It is prospected definitely that ULSI architecture technology, including system technology and circuit technology, will become more and more important for the progress aiming at the sub-half-micron ULSIs. This paper describes the technology issues that are specific for ULSI memories, microprocessors and ASICs. It overviews technology issues that are common to various ULSI chips, covering design, test, fault-tolerant technique, packaging, and high-speed device circuit. A few of future technological issues, such as Cryo-CMOS/BiCMOS and neural network, are briefly discussed with regard to their potentials as the new elements in ULSI architecture.
Tadayoshi ENOMOTO Nobuaki KOBAYASHI
We developed a self-controllable voltage level (SVL) circuit and applied this circuit to a single-power-supply, six-transistor complementary metal-oxide-semiconductor static random-access memory (SRAM) to not only improve both write and read performances but also to achieve low standby power and data retention (holding) capability. The SVL circuit comprises only three MOSFETs (i.e., pull-up, pull-down and bypass MOSFETs). The SVL circuit is able to adaptively generate both optimal memory cell voltages and word line voltages depending on which mode of operation (i.e., write, read or hold operation) was used. The write margin (VWM) and read margin (VRM) of the developed (dvlp) SRAM at a supply voltage (VDD) of 1V were 0.470 and 0.1923V, respectively. These values were 1.309 and 2.093 times VWM and VRM of the conventional (conv) SRAM, respectively. At a large threshold voltage (Vt) variability (=+6σ), the minimum power supply voltage (VMin) for the write operation of the conv SRAM was 0.37V, whereas it decreased to 0.22V for the dvlp SRAM. VMin for the read operation of the conv SRAM was 1.05V when the Vt variability (=-6σ) was large, but the dvlp SRAM lowered it to 0.41V. These results show that the SVL circuit expands the operating voltage range for both write and read operations to lower voltages. The dvlp SRAM reduces the standby power consumption (PST) while retaining data. The measured PST of the 2k-bit, 90-nm dvlp SRAM was only 0.957µW at VDD=1.0V, which was 9.46% of PST of the conv SRAM (10.12µW). The Si area overhead of the SVL circuits was only 1.383% of the dvlp SRAM.
Tadayoshi ENOMOTO Suguru NAGAYAMA Hiroaki SHIKANO Yousuke HAGIWARA
The delay time (tdT), power dissipation (PT) and circuit volume of a CMOS register array were minimized. Seven test circuits, each of which had a register array and a single clock tree that generated a pair of complement clock pulses, and a conventional register were fabricated using 90-nm CMOS technology. The register array was constructed with M delay flip-flops (FFs) and the clock tree, which consisted of 2 driver stages. Each driver stage had m inverters, each of which drove M/m FFs where M was fixed at 40 and m varied from 1 to 40. The minimum values of tdT and PT were 0.25 ns and 17.88 µW, respectively, and were both obtained when m was 10. These values were 71.4% and 70.4% of tdT and PT for the conventional register, for which m is 40, respectively. The number of inverters in the clock tree when m was 10 was 21 which was only 25.9% that for the conventional register. The measured results agreed well with SPICE-simulated results. Furthermore, for values of M from 20 to 320, both the minimum tdT and the minimum PT were obtained when m was approximately 1.5 times the square root of M.
Tadayoshi ENOMOTO Nobuaki KOBAYASHI
A square-root (SR) algorithm, an SR architecture and a leakage current reduction circuit were developed to reduce dynamic power (PAT) and leakage power (PST), while maintaining the speed of a CMOS SR circuit. Using these techniques, a 90-nm CMOS LSI was fabricated. The PAT of the new SR circuit at a clock frequency (fc) of 490 MHz and a supply voltage (VDD) of 0.75 V was 104.1 µW, i.e., 21.6% that (482.3 µW) of a conventional SR circuit. The PST of the new SR circuit was markedly reduced to 19.51 nW, which was only 1.69% that (1,153 nW) of the conventional SR circuit.
Tadayoshi ENOMOTO Masaaki YASUMOTO Shigeo FUSHIMI
A floating gate tapped CCD with 32 taps for programmable transversal filter applications is presented. A 32-tap minimum phase lowpass filter designed and implemented on this device is discussed in detail. A novel design procedure of this filter is also explained.
Nobuaki KOBAYASHI Tadayoshi ENOMOTO
We developed and applied a new circuit, called the “Self-controllable Voltage Level (SVL)” circuit, not only to expand both “write” and “read” stabilities, but also to achieve a low stand-by power and data holding capability in a single low power supply, 90-nm, 2-kbit, six-transistor CMOS SRAM. The SVL circuit can adaptively lower and higher the word-line voltages for a “read” and “write” operation, respectively. It can also adaptively lower and higher the memory cell supply voltages for the “write” and “hold” operations, and “read” operation, respectively. This paper focuses on the “hold” characteristics and the standby power dissipations (PST) of the developed SRAM. The average PST of the developed SRAM is only 0.984µW, namely, 9.57% of that (10.28µW) of the conventional SRAM at a supply voltage (VDD) of 1.0V. The data hold margin of the developed SRAM is 0.1839V and that of the conventional SRAM is 0.343V at the supply voltage of 1.0V. An area overhead of the SVL circuit is only 1.383% of the conventional SRAM.
Discussed here is progress achieved in the development of video codec LSIs.First, the amount of computation for various standards, and signal handling capability (throughput) and power dissipation for video codec LSIs are described. Then, general technologies for improving throughtput are briefly summarized. The paper also reviews three approaches (i.e., video signal processor, building block and monolithic codes) for implementing video codes standards. The second half of the paper discusses various high-throughput technologies developed for programmable Video Signal Processor (VSP) LSIs. A number of VSP LSIs are introduced, including the world's first programmable VSP, developed in February 1987 and a monolithic codec ship, built in February 1993 that is sufficient in itself for the construction of a video encoder for encoding full-CIF data at 30 frames per second. Technologies for reduction of power dissipation while keeping maintaining throughput are also discussed.
Tadayoshi ENOMOTO Atsunori HIROBE Masahiro FUJII Nobuhide YOSHIDA Shuji ASAI
Four different types of GaAs HEMT DCFL static delay latches based on NOR gates were developed. Eight different types of low-voltage, low-power, high-speed delay flip-flops (D-FFs) were constructed using two delay latches of different types. These delay latches and D-FFs were designed using 0.25-µm n-AlGaAs/i-InGaAs HEMT technology, and their characteristics were evaluated by SPICE simulation. A positive-edge D-FF, called "1P0P," was fabricated and tested. Its operating clock frequency, power dissipation, power delay product, and phase margin were measured as a function of supply voltage VD. The dissipation was almost proportional to VD for VD up to 1.2 V. The "1P0P" D-FF consumed only 2.03 mW at a clock frequency of 5.17 GHz (i.e., at a data rate of 5.17 Gbps) and a VD of 0.6 V, so the power delay product was 0.196 pJ. For a (29-1) pseudo-random signal, a maximum frequency of 7.15 GHz was obtained at a VD of 1.1 V, with dissipation of 6.02 mW and an error rate of less than 10-9. Clear, wide eye openings were obtained at frequencies up to 7.15 GHz. A sufficiently high phase margin of 180 was obtained with a data rate of 5 Gbps at a VD of 0.6 V.
Junichi GOTO Masakazu YAMASHINA Toshiaki INOUE Benjamin S. SHIH Youichi KOSEKI Tadahiko HORIUCHI Nobuhisa HAMATAKE Kouichi KUMAGAI Tadayoshi ENOMOTO Hachiro YAMADA
A programmable clock generator, based on a phase-locked loop (PLL) circuit, has been developed with 0.5 µm CMOS triple-layer Al interconnection technology for use as an on-chip clock generator in a 300-MHz video signal processor. The PLL-based clock generator generates a clock signal whose frequency ranges from 50 to 350 MHz which is an integral multiple, from 2 to 16, of an external clock frequency. In order to achieve stable operation within this wide range, a voltage controlled oscillator (VCO) with selectable low VCO gain characteristics has been developed. Experimental results show that the clock generator generates a 297-MHz clock with a 27-MHz external clock, with jitter of 180 ps and power dissipation of 120 mW at 3.3-V power supply, and it can also oscillate up to 348 MHz with a 31.7-MHz external clock.
Tadayoshi ENOMOTO Toshiyuki OKUYAMA
A 3.2 GHz, 50 mW, 1 V, GaAs clock pulse generator (CG) based on a phase-locked loop (PLL) circuit has been designed for use as an on-chip clock generator in future high speed processor LSIs. 0.5 µm GaAs MESFET and DCFL circuit technologies have been used for the CG, which consists of 224 MESFETs. An "enhanced charge-up current" inverter has been specially designed for a low power and high speed voltage controlled oscillator (VCO). In this new inverter, a voltage controlled dMESFET is combined in parallel with the load dMESFET of a conventional DCFL inverter. This voltage controlled dMESFET produces an additional charge-up current resulting in the new VCO obtaining a much higher oscillation frequency than that of a ring oscillator produced with a conventional inverter. With a single 1 V power supply (Vdd), SPICE calculation results showed that the VCO tuning range was 2.25 GHz to 3.65 GHz and that the average VCO gain was approximately 1.4 GHz/V in the range of a control voltage (Vc) from 0 to 1 V. Simulation also indicated that at a Vdd of 1 V the CG locked on a 50 MHz external clock and generated a 3.2 GHz internal clock (=50 MHz64). The jitter and power dissipation of the CG at 3.2 GHz oscillation and a Vdd of 1 V were less than 8.75 psec and 50 mW, respectively. The typical lock range was 2.90 GHz to 3.59 GHz which corresponded to a pull-in range of 45.3 MHz to 56.2 MHz.