1-12hit |
Hiroyuki KAWAI Yoshitsugu INOUE Junko KOBARA Robert STREITENBERGER Hiroaki SUZUKI Hiroyasu NEGISHI Masatoshi KAMEYAMA Kazunari INOUE Yasutaka HORIBA Kazuyasu FUJISHIMA
This paper describes a kind of 3D graphics geometry processor architecture for high performance/cost 3D graphics, its application to a real chip, and the results of performance evaluation. In order to establish the high speed geometry processing, dedicated hardware is introduced for accelerating special operations, such as power calculations, clip tests, and program address generation. The dedicated hardware consists of a modified floating-point multiplier in a four-parallel SIMD processing core, a clip test unit, and an internal program address generation scheme optimized to geometry processing mode. Special instructions corresponding to the dedicated schemes are also defined and added. The parallelism of the SIMD core is adjusted to a geometry data structure. Employing dedicated hardware and software significantly accelerates these complicated operations deriving from geometry algorithms. The collaboration of the hardware design and the software design considerably reduces instruction step counts for complex processing. Two kinds of program are dealt with in the proposed architecture. One is a special case program containing few conditional jump instructions, and the other is a general case program combining many program routines. The proposed program address generation scheme provides the automatic selection of a program optimized to each geometry processing mode. By this program address generation scheme and the program types, the frequency of the conditional jump operations, that usually disturb a pipeline operation, are minimized under practical use. Additionally, the programmable design and this program address generation scheme facilitate the load balancing of the geometry calculations with the CPU. A programmable geometry processor was fabricated by using 0.35 µm CMOS process as an application of this architecture. One point three million transistors are integrated in a 11.84 12.07 mm2 die. The increase of the gate counts for all the dedicated hardware is a total of 24 K gates and is approximately only a 7.4% increase of the total gate count. This chip operates at 150 MHz, and achieves the processing performance of 5.8 M vertex/sec. The result shows that the proposed programmable architecture (ESIMD: Enhanced SIMD) is 2.3 times more cost effective than a programmable geometry LSI reported previously.
Hiroaki SUZUKI Tadashi DOHI Hiroyuki OKAMURA
In this paper, we consider the similar software cost models with periodic rejuvenation to Garg, Puliafito, Telek and Trivedi (1995) under the cost effectiveness criteria. First, an alternative model as well as the original one are analyzed by Markov regenerative processes. We derive analytically the optimal periodic software rejuvenation policies which maximize the cost-effectiveness in the steady state for two models. Further, we develop statistical non-parametric algorithms to estimate the optimal software rejuvenation policies, provided that the sample data to characterize the system failure times are given. Then, the total time on test (TTT) concept is used. In numerical examples, we compare the periodic software rejuvenation policy with the non-periodic one, and investigate the asymptotic properties of the non-parametric estimators for the optimal software rejuvenation policies through a simulation experiment.
Hiroaki SUZUKI Toshichika SAKAI Hisao HARIGAI Yoichi YANO
A 32-bit RISC microprocessor "V810" that has 5-stage pipeline structure and a 1 Kbyte, direct-mapped instruction cache realizes 2.5 MHz operation at 0.9 V with 2.0 mW power consumption. The supply voltage can be reduced to 0.75 V. To overcome narrow noise margin, all the signals are set to have rail-to-rail swing by pseudo-static circuit technique. The chip is fabricated by a 0.8 µm double metal-layer CMOS process technology to integrate 240,000 transistors on a 7.4 mm7.1 mm die.
Akira YAMADA Yasuhiro NUNOMURA Hiroaki SUZUKI Hisakazu SATO Niichi ITOH Tetsuya KAGEMOTO Hironobu ITO Takashi KURAFUJI Nobuharu YOSHIOKA Jingo NAKANISHI Hiromi NOTANI Rei AKIYAMA Atsushi IWABU Tadao YAMANAKA Hidehiro TAKATA Takeshi SHIBAGAKI Takahiko ARAKAWA Hiroshi MAKINO Osamu TOMISAWA Shuhei IWADE
A high-speed 32-bit RISC microcontroller has been developed. In order to realize high-speed operation with minimum hardware resource, we have developed new design and analysis methods such as a clock distribution, a bus-line layout, and an IR drop analysis. As a result, high-speed operation of 400 MHz has been achieved with power dissipation of 0.96 W at 1.8 V.
Hiroaki SUZUKI Hiroshi MAKINO Koichiro MASHIKO
This paper describes a new floating-point divider (FDIV), in which the key features of redundant binary circuits and an asynchronous clock scheme reduce the delay time and area penalty. The redundant binary representation of +1 = (1, 0), 0 = (0, 0), -1 = (0,1) is applied to the all mantissa division circuits. The simple and unified representation reduces circuit delay for the quotient determination. Additionally, the local clock generator circuit for the asynchronous clock scheme eliminates clock margin overhead. The generator circuit guarantees the worst delay-time operation by the feedback loop of the replica delay paths via a C-element. The internal iterative operation by the asynchronous scheme and the modified redundant-binary addition/subtraction circuit keep the area small. The architecture design avoids extra calculation time for the post processes, whose main role is to produce the floating-point status flags. The FDIV core using proposed technologies operates at 42. 1 ns with 0.35 µm CMOS technology and triple metal interconnections. The small core of 13.5 k transistors is laid-out in a 730µm 910 µm area.
Hiroshi MAKINO Hiroaki SUZUKI Hiroyuki MORINAKA Yasunobu NAKASE Hirofumi SHINOHARA Koichiro MASHIKO Tadashi SUMI Yasutaka HORIBA
This paper describes the design of a high-speed 4-2 compressor for fast multipliers. Through the survey of the six kinds of representative conventional 4-2 compressor (RBA 1-3 and NBA 1-3) in both the redundant binary (RB) and the normal binary (NB) scheme, we extracted two problems that degrades the operating speed. The first is the use of multi-input complex gates and the second is the existence of transmission gates (TG) at the input and/or output stages. To solve these problems, we propose high-speed 4-2 compressors using the RB scheme, which we call the high-speed redundant binary adders (HSRBAs). Six kinds of HSRBAs, HSRBA 1-6, were derived by making the Boolean equations suitable for high-speed CMOS circuits. Among them, HSRBA2, HSRBA4 and HSRBA6 have no multi-input complex gate and input/output TG, and perform at a delay time of 0.89 ns which is the fastest of all 4-2 compressors. We investigated the logical relation between HSRBAs and conventional 4-2 compressors by analyzing the Boolean equations for each circuit. This investigation shows that all the conventional redundant binary adders RBA1-3 have the same logic structures as HSRBA2. We also showed the conventional normal binary adders NBA1-3 have the same logic structures as HSRBA1, HSRBA3 and HSRBA5, respectively. This implies all 4-2 compressors can be derived from the same equation regardless of RB or NB. We applied the HSRBA2 to a 5454-bit multiplier using 0.5-µm CMOS technology. The multiplication time at the supply voltage of 3.3 V was 8.8 ns. This is the fastest 5454-bit multiplier with 0.5-µm CMOS so far, and 83% of the speed improvement is due to the high speed 4-2 compressor.
Hiroshi MAKINO Hiroaki SUZUKI Hiroyuki MORINAKA Yasunobu NAKASE Koichiro MASHIKO Tadashi SUMI
This paper presents a high speed 64-b floating point (FP) multiplier that has a useful function for computer graphics(CG). The critical path delay is minimized by using high speed logic gates and limiting the stage number of series transmission gates (TG's). The high speed redundant binary architecture is applied to the multiplication of significands. This FP multiplier has a special function of "CG multiplication" that directly multiplies a pixel data by an FP data. This multiplier was fabricated by 0.5 µm CMOS technology with triple-level metal of interconnection. The active area size is 4.25.1mm2.The operating cycle time is 3.5 ns at the supply voltage of 3.3 V, which corresponds to the frequency of 286 MHz, Implementation of CG multiplication increases the transistor count only 4%. Also, CG multiplication has no effect on the delay in the critical path.
Hiroyuki MORINAKA Hiroshi MAKINO Yasunobu NAKASE Hiroaki SUZUKI Koichiro MASHIKO Tadashi SUMI
We present a 64-b adder having a 2.6-ns delay time at 3.3 V power supply within 0.27 mm2 using 0.5-µm CMOS technology. We derived our adder design from architectural level considerations. The considerations include not only the gate intrinsic delay but also the wiring delay and the gate capacitance delay. As a result, a 64-b adder, (56-b Carry Look-ahead Adder(CLA) +8-b Carry Select Adder (CSA)), was designed. In this design, a new carry select scheme called Modified Carry Select (MCS) is also proposed.
Tadashi DOHI Hiroaki SUZUKI Kishor S. TRIVEDI
Software rejuvenation is a preventive and proactive solution that is particularly useful for counteracting the phenomenon of software aging. In this paper, we consider both the periodic and non-periodic software rejuvenation policies under different dependability measures. As is well known, the steady-state system availability is the probability that the software system is operating in the steady state and, at the same time, is often regarded as the mean up rate in the system operation period. We show that the mean up rate should be defined as the mean value of up rate, but not as the mean up time per mean operation time. We derive numerically the optimal software rejuvenation policies which maximize the steady-state system availability and the mean up rate, respectively, for each periodic or non-periodic model. Numerical examples show that the real mean up rate is always smaller than the system availability in the steady state and that the availability overestimates the ratio of operative time of the software system.
Hiroaki SUZUKI Woopyo JEONG Kaushik ROY
Demands for the low power VLSI have been pushing the development of aggressive design methodologies to reduce the power consumption drastically. To meet the growing demand, we propose low power adders that adaptively select supply voltages based on the input vector patterns. First, we apply the proposed scheme to the Ripple Carry Adder (RCA). A prototype design by a 0.18 µm CMOS technology shows that the Adaptive VDD 32-bit RCA achieves 25% power improvement over the conventional RCA with similar speed. The proposed adder cancels out the delay penalty, utilizing two innovative techniques: carry-skip techniques on the checking operands, and the use of Complementary Pass Transistor Logic (CPL) with dual supply voltage for level conversion. As an expansion to faster adder architectures, we extend the proposal to the Carry-Select Adders (CSA) composed of the RCA sub-blocks. We achieved 24% power improvement on the 128-bit CSA prototype over a conventional design. The proposed scheme also achieves stand-by leakage power reduction--for 32-bit and 128-bit Adaptive RCA and CSA, respectively, 62% and 54% leakage reduction was possible.
Hiroaki SUZUKI Hiroyuki KAWAI Hiroshi MAKINO Yoshio MATSUDA
A VLIW (Very Long Instruction Word) architecture with a new code compaction method has been proposed. For a 3D-geometry processor, we consider two types of 2-issue VLIW architectures, the floating-point execution accelerating VLIW (FP-VLIW) and the data-move enhancing VLIW (MV-VLIW) architectures, as expansions of a Single-Streaming Single Instruction, Multiple Data (SS-SIMD) architecture. To solve the code bloat problem which is common to VLIW architectures, the proposed method makes it possible to compact original codes into the VLIW codes by software tools and decompact the VLIW codes by a simple hardware decompactor composed of an instruction swap circuit on a chip. Speeds and code densities of the two VLIWs with the code compaction are compared to the SS-SIMD with the same instruction set and the same building blocks. The FP-VLIW shows the fastest speed performance in the evaluation results of the viewperf CDRS-03 benchmark programs. It is 36% faster than the SS-SIMD used as reference. The proposed compaction method keeps the 95% code density of the SS-SIMD. One test program shows that the code density of the MV-VLIW is higher than that of the SS-SIMD. This result demonstrates that the merit of compacting nops can be greater than the VLIW penalty. The FP-VLIW architecture with the code compaction achieves 1.36 times the speed performance without significant code-density deterioration.
Tadahiro KURODA Toshiyuki FUKUNAGA Kenji MATSUO Kazuhiko KASAI Ayako HIRATA Shinji FUJII Masahiro KIMURA Hiroaki SUZUKI
This paper describes a new biasing scheme for sensing circuits, namely an automated bias control (ABC) circuit, for high-performance VLSI's. The ABC circuit can automatically gear the output level of sensing circuits to the input threshold voltage of the succeeding CMOS converters. The sensing performance can be accelerated with the ABC circuit either by reducing excessive signal level margin between the sensing circuits and the CMOS converters or by reducing extra stage of signal amplification. Since feedback control of the ABC circuit ensures a correct dc biasing even under large process deviation and circuit condition changes, wider operation margin can also be obtained. Three successful applications of the ABC circuit are reported: a sense amplifier, an address transition detector (ATD), and an ECL-CMOS input buffer. A 64-kb BiCMOS SRAM employing the proposed sense amplifier and the ATD has been fabricated with a 0.8-µm 9-GHz BiCMOS technology. The SRAM has an address access time of 4.5 ns.