Baoquan ZHONG Zhiqun CHENG Minshi JIA Bingxin LI Kun WANG Zhenghao YANG Zheming ZHU
Kazuya TADA
Suguru KURATOMI Satoshi USUI Yoko TATEWAKI Hiroaki USUI
Yoshihiro NAKA Masahiko NISHIMOTO Mitsuhiro YOKOTA
Hiroki Hoshino Kentaro Kusama Takayuki Arai
Tsuneki YAMASAKI
Kengo SUGAHARA
Cuong Manh BUI Hiroshi SHIRAI
Hiroyuki DEGUCHI Masataka OHIRA Mikio TSUJI
Hiroto Tochigi Masakazu Nakatani Ken-ichi Aoshima Mayumi Kawana Yuta Yamaguchi Kenji Machida Nobuhiko Funabashi Hideo Fujikake
Yuki Imamura Daiki Fujii Yuki Enomoto Yuichi Ueno Yosei Shibata Munehiro Kimura
Keiya IMORI Junya SEKIKAWA
Naoki KANDA Junya SEKIKAWA
Yongzhe Wei Zhongyuan Zhou Zhicheng Xue Shunyu Yao Haichun Wang
Mio TANIGUCHI Akito IGUCHI Yasuhide TSUJI
Kouji SHIBATA Masaki KOBAYASHI
Zhi Earn TAN Kenjiro MATSUMOTO Masaya TAKAGI Hiromasa SAEKI Masaya TAMURA
Misato ONISHI Kazuhiro YAMAGUCHI Yuji SAKAMOTO
Koya TANIKAWA Shun FUJII Soma KOGURE Shuya TANAKA Shun TASAKA Koshiro WADA Satoki KAWANISHI Takasumi TANABE
Shotaro SUGITANI Ryuichi NAKAJIMA Keita YOSHIDA Jun FURUTA Kazutoshi KOBAYASHI
Ryosuke Ichikawa Takumi Watanabe Hiroki Takatsuka Shiro Suyama Hirotsugu Yamamoto
Chan-Liang Wu Chih-Wen Lu
Umer FAROOQ Masayuki MORI Koichi MAEZAWA
Ryo ITO Sumio SUGISAKI Toshiyuki KAWAHARAMURA Tokiyoshi MATSUDA Hidenori KAWANISHI Mutsumi KIMURA
Paul Cain
Arie SETIAWAN Shu SATO Naruto YONEMOTO Hitoshi NOHMI Hiroshi MURATA
Seiichiro Izawa
Hang Liu Fei Wu
Keiji GOTO Toru KAWANO Ryohei NAKAMURA
Takahiro SASAKI Yukihiro KAMIYA
Xiang XIONG Wen LI Xiaohua TAN Yusheng HU
Tohgo HOSODA Kazuyuki SAITO
Yihan ZHU Takashi OHSAWA
Shengbao YU Fanze MENG Yihan SHEN Yuzhu HAO Haigen ZHOU
Kazuaki MURAKAMI Hidetaka MAGOSHI
This paper briefly surveys architectural technologies of recent or future high-performance, low-power processors for improving the performance and power/energy consumption simultaneously. Achieving both high performance and low power at the same time imposes a lot of challenges on processor design, and therefore gives us a lot of opportunities for devising new technologies. The paper also tries to provide some insights into the technology direction in future.
Kunio UCHIYAMA Fumio ARAKAWA Yasuhiko SAITO Koki NOGUCHI Atsushi HASEGAWA Shinichi YOSHIOKA Naohiko IRIE Takeshi KITAHARA Mark DEBBAGE Andy STURGES
A 64-bit architecture for an embedded processor targeted for next-generation digital consumer products has been developed. It has dual-mode instruction sets and is optimized for high multimedia performance, provided by SIMD/floating-point vector instructions in 32-bit length ISA, and small code size, provided by a conventional 16-bit length ISA. Large register files, (64
Hiroshi OKANO Atsuhiro SUGA Hideo MIYAKE Yoshimasa TAKEBE Yasuki NAKAMURA Hiromasa TAKAHASHI
A 5.6 GOPS/1.4 GFLOPS 350 MHz, four-way very long instruction word (VLIW) microprocessor is developed for embedded applications in a 0.18 µm five-layer-metal CMOS process. This processor features a two-way integer pipeline and two-way floating/media pipelines. Each floating pipeline and media pipeline has two-parallel and four-parallel single instruction multiple-data (SIMD) mechanisms, respectively. The processor has separate instruction and data caches, each of 16 KB in size and having four-way set associative. The data cache employs a non-blocking technique and can process two load instructions in parallel. The processor had about a 50% clock net power reduction compared with one without power optimization. 6.7 million transistors are integrated in an area of 7.5 mm
Hideo OHIRA Toshihisa KAMEMARU Hirokazu SUZUKI Ken-ichi ASANO Masahiko YOSHIMOTO
An architectural design of a media processor core optimized for MPEG4/H26x video codec targeted for use in mobile multimedia terminals is presented. The architecture consists of a maximum 6.4 GOPS SIMD (Single Instruction Multiple Data) processor, RISC-processor, VLC-processor, and intelligent DMA controller. The unique SIMD processor completes 2-D DCT processing in 132 clock cycles, or block matching (16 by 16 pixels) in 24 clock-cycles. VLC-processor allows the completion of 8 by 8 block run-level coding in average 10 clock cycles in the case of low bit-rates. The functions of transpose-registers in the SIMD processor, data sub-sampling technique in the DMA, or data-sliding technique between PEs (Processor Elements) in the SIMD processor eliminate a large amount of cycle loss for data handling, and extract the highest level of performance. Through the use of the above architecture and the lower power approach, CIF 30 frames/s MPEG4 Simple Profile video codec @ 100 MHz can be achieved. Estimated dissipation is as low as 280 mW. 300 kgates and 16 kBytes four port SRAM are contained on a 12 mm2 area by using 0.18 µm process technology. The combination of the RISC-processor and SIMD-processor can also operate MPEG4 core profile (shape coding) that requires flexibility and performance.
Eiji ARITA Takashi FUJIWARA Kin-ichiro NISHIYAMA Akiko MAENO Yasuo MATSUNAMI Masahiko NAKAMURA Hirohisa MACHIDA Shuji MURAKAMI Hiroyuki NAKAYAMA Masahiko YOSHIMOTO
A complete single chip multi-format Phase Shift Keying (PSK) demodulator ULSI for Japanese BS digital broadcasting is reported. The carrier recovery system shows the pull-in range up to +/-5 MHz. The clock recovery system cancels the poor group delay characteristic and the orthogonality degradation caused by the analog front end, and improves the BER performance by 0.2 dB. Thus the requirement to the analog front end is relaxed. A digital PLL ensures minimum program clock reference jitter in the output data stream, which simplifies jitter management in the succeeding MPEG2 system decoder. It integrates two 8-bit 60 MHz ADCs, 58 MHz VCO, 1 Mbit SRAM and the 450 K-gate FEC-demodulator core. Implementation of 1 Mbit de-interleaver RAM facilitates the use of a low cost receiver. The 8.8 milion transistor chip occupies the 72 mm2 in a 0.25 µm triple-metal CMOS technology.
Seung-Geun KIM Wooncheol HWANG Youngsun KIM Youngkou LEE Sungsoo CHOI Kiseon KIM
We present a case of design and implementation of a high-speed burst QPSK (Quaternary Phase Shift Keying) receiver. Since the PSK modulation carries its information through the phase, the baseband digital receiver can recover transmitted symbol from the received phase. The implemented receiver estimates symbol time and frequency offset using sampled data over 32 symbols without transmitted symbol information, and embedded RAM is used for received phase delay over estimation time. The receiver is implemented using about 92,000 gates of Samsung KG75 SOG library which uses 0.65 µm CMOS technology. The fabricated chip test result shows that the receiver operates at 40 MHz clock rate on 5.6 V, which is equivalent to the 40 Mbps data rate.
Jeong-Min KIM Yun-Su SHIN In-Gu HWANG Kwang-Sun LEE Sang-Il HAN Sang-Gyu PARK Soo-Ik CHAE
A chip is described that integrates two multimedia VLIW processor cores with a hardware streaming engine. It can implement a real-time videophone, or an MPEG4 codec. Each processor core has identical resources, and shares the memory and system I/O interface units. With its symmetric structure, applications can be executed on either processor without constraints. To accelerate multimedia-specific applications, the architecture of this processor has several features. It merges the features of a RISC and a DSP, its instruction set is extended to accelerate both video and audio applications, and it supports an efficient embedded memory system, to reduce both the bandwidth and the latency for multimedia applications needing frequent memory accesses. The chip size will be 100 mm2 die that contains 700 K logic gates, 60 KB RAM, and 16 KB ROM, in a 0.25-µm CMOS standard cell technology. At 65 MHz operating frequency, it can process H.263 video coding at CIF 15 frames/sec, and G.723.1 audio coding with an 80% processing time allocation.
Kazutoshi KOBAYASHI Makoto EGUCHI Takuya IWAHASHI Takehide SHIBAYAMA Xiang LI Kosuke TAKAI Hidetoshi ONODERA
We propose a vector-pipeline processor VP-DSP for low-rate videophones which can encode and decode 10 frames/sec. of QCIF through a 29.2 kbps low-rate line. We have already fabricated a VP-DSP LSI by a 0.35 µm CMOS process. The area of the VP-DSP core is 4.26 mm2. It works properly at 25 MHz/1.6 V with a power consumption of 49 mW. Its peak performance is up to 400 MOPS, 8.2 GOPS/W.
Hiroshi SEGAWA Yoshinori MATSUURA Satoshi KUMAKI Tetsuya MATSUMURA Stefan SCOTZNIOVSKY Shu MURAYAMA Tetsuro WADA Ayako HARADA Eiji OHARA Ken-ichi ASANO Toyohiko YOSHIDA Yasutaka HORIBA
This paper describes an embedded software scheme for a single-chip MPEG-2 encoder that executes concurrent video, audio, and system encoding in real-time. The software features a scalable module structure, which is hierarchically composed and has expandable plug-in modules. For increased applicability, several task-modules are prepared for the respective video, audio, and system processing. In addition, an effective task management scheme that features polling and interrupt-based task switching has been proposed in order to achieve real-time operation. The software having these features and including all task-modules is implemented on a single media-processor D30V on a single chip MPEG-2 video, audio, and system encoder. This encoder realizes real-time MPEG-2 video encoding, Dolby Digital or MPEG-1 audio encoding, and system encoding that generates TS or PS over 50 Mbps for various applications. Assuming a DVD or DTV encoder system, the software is reconstructed with less than 56.6-kbytes of instruction and 145.6 MIPS performance. The single media-processor with 64-kbytes of instruction RAM and 162 MIPS performance, running at a clock rate of 162 MHz, can successfully accomplish a real-time operation with the proposed embedded software.
Kenji TOGURA Hiroyuki NAKASE Koji KUBOTA Kazuya MASU Kazuo TSUBOUCHI
We have proposed a current-cut switched-current matched filter (CC-SIMF) for direct-sequence code-division multiple-access (DS-CDMA). The 256-chip CC-SIMF can achieve low power consumption of less than 10 mW under high-speed operation of more than 16 Mcps. To reduce the current transfer error accumulation, we propose a parallel SIMF configuration. A 128-chip SIMF using 0.8-µm Complementally Metal Oxide Semiconductor (CMOS) process has been designed and fabricated. Optimization of the current memory cell structure has been described. The correlation operation at 16 Mcps has been obtained using a 128-chip orthogonal m-sequence. The code phase separation performance for path diversity has been clearly observed. The power consumption has been significantly reduced using the current-cut method.
Koichiro MINAMI Masayuki MIZUNO Hiroshi YAMAGUCHI Toshihiko NAKANO Yusuke MATSUSHIMA Yoshikazu SUMI Takanori SATO Hisashi YAMASHIDA Masakazu YAMASHINA
This paper describes a 1-GHz portable digital delay-locked loop (DLL) with 0.15-µm CMOS technology. There are three factors contributing to jitter in digital DLLs. One is supply-noise induced jitter, another is jitter caused by delay time resolution and phase step in the delay line, and the third is jitter caused by the sensitivity of the phase detector. In order to achieve a low jitter digital DLL, we have developed a master-slave architecture that achieves infinite phase capture ranges and low latency, a delay line that improves the delay time resolution, a phase step suppression technique and a dynamic phase detector with increased sensitivity. These techniques were used to fabricate a digital DLL with improved jitter performance. Measured results showed that the DLL successfully achieves 29-ps peak-to-peak jitter with a quiet supply and 0.2-ps/ mV supply sensitivity.
Motokazu OZAWA Masashi IMAI Yoichiro UENO Hiroshi NAKAMURA Takashi NANYA
Wire delays, instead of gate delays, are moving into dominance in modern VLSI design. Current synchronous processors have the critical path not in the ALU function but in the cache access. Since the cache performance enhancement is limited by the memory access delay which mainly consists of wire delays, a reduction in gate delays may no longer imply any enhancement in processor performance. To solve this problem, this paper presents a novel architecture, called the Cascade ALU. The Cascade ALU allows super-scalar processors with future technologies to move the critical path into the ALU part. Therefore the Cascade ALU can enjoy the expected progress in future device speed. Since the delay of the Cascade ALU varies depending on the executed instructions, an asynchronous system is shown to be suitable for implementing the Cascade ALU. However an asynchronous system may have a large handshake overhead, this paper also presents an asynchronous Fine Grain Pipeline technique that hides the handshake overhead. Finally, this paper presents results of performance and area evaluation for an asynchronous implementation of the cascade ALU. The results show that the cascade ALU architecture has a good performance scalability on the reduction of the ALU latency and imposes little area penalty compared with current synchronous processors.
Maria MIRIANASHVILI Kazuo ONO Masashi HOTTA
Loss analysis in bent graded-index optical slab waveguides is given using the modal-matching method. The conformal mapping replaces curved structure by an equivalent straight waveguide with a modified index profile. For this planar waveguide structure, the normal modes are calculated using a multilayer approximation method. The wave incident on the bend is expanded initially into a finite set of normal modes of the equivalent straight structure, and the transverse fields are matched across the junction. The numerical results show the loss formation in the graded-index waveguides and its dependence of the effective index of the corresponding straight waveguide.
The features of the negative resistance in common source and common gate FET configurations for wideband VCO are studied. They are also explained by the simplified three-capacitor model. A design procedure is then developed. The results are applied to a design of wide band oscillator at the several gigahertz region.
Takehiko KATO Yasunori BITO Naotaka IWATA
This paper describes 1.0 V operation power performance of a double doped AlGaAs/InGaAs/AlGaAs heterojunction FET for personal digital cellular phones. The developed FET with a multilayer cap consisting of a highly Si-doped GaAs, an undoped GaAs and a highly Si-doped AlGaAs exhibited an on-resistance of 1.3 Ω
Fukashi MORISHITA Kazutami ARIMOTO Kazuyasu FUJISHIMA Hideyuki OZAKI Tsutomu YOSHIHARA
A novel body potential-controlling technique for floating SOI CMOS circuits is proposed and verified in this study. High-speed operation is realized with a small chip size by using body-floating SOI transistors. The use of this technique allows the threshold voltage of the body-floating transistors to be varied transitionally. Therefore, the standby current of SOI CMOS logic is reduced to less than 1/50th of that required by the non-controlled operation of the body potential, and the logic operates at a high speed during the active period. There is no speed penalty for the recovery operation from the standby mode. This technique supports sub-1 V operation, which will be required by future battery-operated devices with wide-range covering.
Kazunori KAWAMOTO Hitoshi YAMAGUCHI Hiroaki HIMI Seiji FUJINO Isao SHIRAKAWA
EL (Electroluminescent) displays have been applied to automobiles, as their images are very clear and bright. High voltage, high integration and low power dissipation ICs are needed to drive these devices. To meet this, high voltage CMOS ICs using SOI (Silicon On Insulator) substrates are chosen as the driving devices. In this paper, an isolation structure between the output CMOS devices, of high density and high voltage is proposed. Conventional trench dielectric isolation shows degradation of a break down voltage with short distance from trench to source. In this work, the authors make clear the electric field distribution near the isolation, and offer a novel structure of "Field-plate Trench Isolation," which enables to relax the electric field on the silicon surface by shifting a part of electric field into surface oxide. Finally, operation of high voltage and high density, a 200-volt and 32-channel, EL display driver for automotive display panel is confirmed.
Kwang-Yeol YOON Mitsuo TATEIBA Kazunori UCHIDA
We have discussed a ray tracing method to estimate the scattering characteristics from random rough surface. It has been shown from the traced rays that the diffracted rays dominate over the reflected rays. For the field evaluation, we have used the Fresnel function for the diffracted coefficient and the Fresnel's reflection coefficients. Numerical examples have been carried out for the scattering characteristics of an ocean wave-like rough surface and the delay spared characteristics of a building-like surface. In the present work we have demonstrated that the ray tracing method is effective to numerical analysis of a rough surface scattering.
Seiji FUKUSHIMA Hideki FUKANO Kaoru YOSHINO Yutaka MATSUOKA Seiko MITACHI Kiyoto TAKAHATA
A compact optically-fed radio access point module was developed that consists of a uni-traveling-carrier refracting-facet photodiode, a patch antenna, and an optical input interface. An output power from the photodiode was 1.4 dBm at a frequency of 5.88 GHz without any bias voltage.
Shi-Ho KIM Jorgo TSOUHLARAKIS Jan Van HOUDT Herman MAES
A new CMOS DC voltage doubler with nonoverlapping switching control is proposed, in order to eliminate the dynamic current loss during switching as well as the threshold voltage drop of the serial switches. The simulated results at 1.5 V show that the maximum power efficiency is improved with about 30%, whereas the efficiency in the low output current region is larger than 5 times compared to the conventional voltage doublers. This proposed CMOS DC voltage doubler can be used as a VPP generator of low voltage DRAM's.