Akira UTAGAWA Tetsuya ASAI Tetsuya HIROSE Yoshihito AMEMIYA
We present on-chip oscillator arrays synchronized by random noises, aiming at skew-free clock distribution on synchronous digital systems. Nakao et al. recently reported that independent neural oscillators can be synchronized by applying temporal random impulses to the oscillators [1],[2]. We regard neural oscillators as independent clock sources on LSIs; i.e., clock sources are distributed on LSIs, and they are forced to synchronize through the use of random noises. We designed neuron-based clock generators operating at sub-RF region (< 1 GHz) by modifying the original neuron model to a new model that is suitable for CMOS implementation with 0.25-µm CMOS parameters. Through circuit simulations, we demonstrate that i) the clock generators are certainly synchronized by pseudo-random noises and ii) clock generators exhibited phase-locked oscillations even if they had small device mismatches.
Yuko HASHIZUME Yasuhiro TAKASHIMA Yuichi NAKAMURA
In deep-submicron technologies, process variations can significantly affect the performance and yield of VLSI chips. As a countermeasure to the variations, post-silicon tuning has been proposed. Deskew, where the clock timing of flip-flops (FFs) is tuned by inserted programmable delay elements (PDEs) into the clock tree, is classified into this method. We propose a novel deskew method that decides the delay values of the elements by measuring a small amount of FFs' clock timing and presuming the rest of FFs' clock timings based on a statistical model. In addition, our proposed method can determine the discrete PDE delay value because the rewriting constraint satisfies the condition of total unimodularity.
Takefumi YOSHIKAWA Yoshihide KOMATSU Tsuyoshi EBUCHI Takashi HIRATA
A transceiver macro for high-speed data transmission via cable in vehicles is proposed. The transceiver uses ac coupling and bi-directional interface topology for protecting LSIs against unexpected short of cable and harness/chassis and has a spread-spectrum-clocking (SSC) generator that reduces noise due to electromagnetic interference. A driver current control has been used for fast switching of data direction on ac-coupled interfaces. An adaptive bandwidth control has been used in a Δ ∑ PLL to improve SCC significantly. A test chip has been fabricated and shows stable and bi-directional data communication with data rate of 162 to 972 Mbps through 20-m cable. Thanks to an optimum calibration of the SSC-PLL bandwidth, it reduces peak power at 33 kHz by -23 dB and provides 2% modulation at a data rate of 810 Mbps.
Gyu-Ho LIM Sung-Young SONG Jeong-Hun PARK Long-Zhen LI Cheon-Hyo LEE Tae-Yeong LEE Gyu-Sam CHO Mu-Hun PARK Pan-Bong HA Young-Hee KIM
A cross-coupled charge pump with internal pumping capacitor, which is advantageous from a point of minimizing TFT-LCD driver IC module, is newly proposed in this paper. By using NMOS and PMOS diodes connected to boosting nodes from VIN nodes, the pumping node is precharged to the same value at the pumping node in starting pumping. Since the first-stage charge pump is designed differently from the other stage pumps, a back current of pumped charge from charge pumping node to input stage is prevented. As a pumping clock driver is located in front of pumping capacitor, the driving capacity is improved by reducing a voltage drop of the pumping clock line from parasitic resistor. Finally, a layout area is decreased more compared with the conventional cross-coupled charge pump by using a stack-MIM capacitor. A proposed charge pump for TFT-LCD driver IC is designed with 0.13 µm triple-well DDI process, fabricated, and tested.
Chang-Kyung SEONG Seung-Woo LEE Woo-Young CHOI
We propose a new Clock and Data Recovery (CDR) circuit for burst-mode applications. It can recover clock signals after two data transitions and endure long sequence of consecutive identical digits. Two Digital Phase Aligners (DPAs), triggered by rising or falling edges of input data, recover clock signals, which are then combined by a phase interpolator. This configuration reduces the RMS jitters of the recovered clock by 30% and doubles the maximum run length compared to a previously reported DPA CDR. A prototype chip is demonstrated with 0.18-µm CMOS technology. Measurement results show that the chip operates without any bit error for 1.25-Gb/s 231-1 PRBS with 200-ppm frequency offset and recovers clock and data after two clock cycles.
Tadayoshi ENOMOTO Suguru NAGAYAMA Hiroaki SHIKANO Yousuke HAGIWARA
The delay time (tdT), power dissipation (PT) and circuit volume of a CMOS register array were minimized. Seven test circuits, each of which had a register array and a single clock tree that generated a pair of complement clock pulses, and a conventional register were fabricated using 90-nm CMOS technology. The register array was constructed with M delay flip-flops (FFs) and the clock tree, which consisted of 2 driver stages. Each driver stage had m inverters, each of which drove M/m FFs where M was fixed at 40 and m varied from 1 to 40. The minimum values of tdT and PT were 0.25 ns and 17.88 µW, respectively, and were both obtained when m was 10. These values were 71.4% and 70.4% of tdT and PT for the conventional register, for which m is 40, respectively. The number of inverters in the clock tree when m was 10 was 21 which was only 25.9% that for the conventional register. The measured results agreed well with SPICE-simulated results. Furthermore, for values of M from 20 to 320, both the minimum tdT and the minimum PT were obtained when m was approximately 1.5 times the square root of M.
Kazunori SHIMIZU Nozomu TOGAWA Takeshi IKENAGA Satoshi GOTO
Reducing the power dissipation for LDPC code decoder is a major challenging task to apply it to the practical digital communication systems. In this paper, we propose a low power LDPC code decoder architecture based on an intermediate message-compression technique which features as follows: (i) An intermediate message compression technique enables the decoder to reduce the required memory capacity and write power dissipation. (ii) A clock gated shift register based intermediate message memory architecture enables the decoder to decompress the compressed messages in a single clock cycle while reducing the read power dissipation. The combination of the above two techniques enables the decoder to reduce the power dissipation while keeping the decoding throughput. The simulation results show that the proposed architecture improves the power efficiency up to 52% and 18% compared to that of the decoder based on the overlapped schedule and the rapid convergence schedule without the proposed techniques respectively.
Thomas Edison YU Tomokazu YONEDA Danella ZHAO Hideo FUJIWARA
The rapid advancement of VLSI technology has made it possible for chip designers and manufacturers to embed the components of a whole system onto a single chip, called System-on-Chip or SoC. SoCs make use of pre-designed modules, called IP-cores, which provide faster design time and quicker time-to-market. Furthermore, SoCs that operate at multiple clock domains and very low power requirements are being utilized in the latest communications, networking and signal processing devices. As a result, the testing of SoCs and multi-clock domain embedded cores under power constraints has been rapidly gaining importance. In this research, a novel method for designing power-aware test wrappers for embedded cores with multiple clock domains is presented. By effectively partitioning the various clock domains, we are able to increase the solution space of possible test schedules for the core. Since previous methods were limited to concurrently testing all the clock domains, we effectively remove this limitation by making use of bandwidth conversion, multiple shift frequencies and properly gating the clock signals to control the shift activity of various core logic elements. The combination of the above techniques gains us greater flexibility when determining an optimal test schedule under very tight power constraints. Furthermore, since it is computationally intensive to search the entire expanded solution space for the possible test schedules, we propose a heuristic 3-D bin packing algorithm to determine the optimal wrapper architecture and test schedule while minimizing the test time under power and bandwidth constraints.
Tomokazu YONEDA Kimihiko MASUDA Hideo FUJIWARA
This paper presents a power-constrained test scheduling method for multi-clock domain SoCs that consist of cores operating at different clock frequencies during test. In the proposed method, we utilize virtual TAM to solve the frequency gaps between cores and the ATE. Moreover, we present a technique to reduce power consumption of cores during test while the test time of the cores remain the same or increase a little by using virtual TAM. Experimental results show the effectiveness of the proposed method.
Weixiang SHEN Yici CAI Xianlong HONG Jiang HU
As power consumption of the clock tree dominates over 40% of the total power in modern high performance VLSI designs, measures must be taken to keep it under control. One of the most effective methods is based on clock gating to shut off the clock when the modules are idle. However, previous works on gated clock tree power minimization are mostly focused on clock routing and the improvements are often limited by the given registers placement. The purpose of this work is to navigate the registers during placement to further reduce the clock tree power based on clock gating. Our method performs activity-aware register clustering that reduces the clock tree power not only by clumping the registers into a smaller area, but also by pulling the registers with the similar activity patterns closely to shut off the clock more time for the resultant subtrees. In order to reduce the impact of signal nets wirelength and power due to register clustering, we apply the timing and activity based net weighting in [14], which reduces the nets switching power by assigning a combination of activity and timing weights to the nets with higher switching rates or more critical timing. To tradeoff the power dissipated by the clock tree and the control signal, we extend the idea of local ungating in [6] and propose an algorithm of gate control signal optimization, which still sets the gate enable signal high if a register is active for a number of consecutive clock cycles. Experimental results on a set of MCNC benchmarks show that our approach is able to reduce the power and total wirelength of clock tree greatly with minimal overheads.
Ching-Yuan YANG Chih-Hsiang CHANG Wen-Ger WONG
A high-speed triangular-modulated spread-spectrum clock generator using a fractional phase-locked loop is presented. The fractional division is implemented by a nested fractional topology, which is constructed from a dual-modulus divide-by-(N-1/16)/N divider to divide the VCO outputs as a first division period and a fractional control circuit to establish a second division period to cause the overall fractional division. The dual-modulus divider introduces a delay-locked-loop network to achieve phase compensation. Operating at the frequency of 3.2 GHz, the measured peak power reduction is around 16 dB for a deviation of 0.37% and a frequency modulation of 33 kHz. The circuit occupies 1.41.4 mm2 in a 0.18-µm CMOS process and consumes 52 mW.
Akihide SAI Daisuke KUROSE Takafumi YAMAJI Tetsuro ITAKURA
Sampling clock jitter degrades the dynamic range of an analog-to-digital converter (ADC). In this letter, a low-power low-noise clock signal generator for ADCs is described. As a clock signal generator, a ring-VCO-based charge pump PLL is used to reduce power dissipation within a given jitter specification. The clock signal generator is fabricated on a CMOS chip with 200-MSPS 10-bit ADC. The measured results show that the ADC keeps a 60-MHz input bandwidth and 53-dB dynamic range and a next-generation mobile wireless terminal can be realized with the ADCs and the on-chip low-power clock generator.
Junichi FUJIKATA Kenichi NISHI Akiko GOMYO Jun USHIDA Tsutomu ISHI Hiroaki YUKAWA Daisuke OKAMOTO Masafumi NAKADA Takanori SHIMIZU Masao KINOSHITA Koichi NOSE Masayuki MIZUNO Tai TSUCHIZAWA Toshifumi WATANABE Koji YAMADA Seiichi ITABASHI Keishi OHASHI
LSI on-chip optical interconnections are discussed from the viewpoint of a comparison between optical and electrical interconnections. Based on a practical prediction of our optical device development, optical interconnects will have an advantage over electrical interconnects within a chip that has an interconnect length less than about 10 mm at the hp32-22 nm technology node. Fundamental optical devices and components used in interconnections have also been introduced that are small enough to be placed on top of a Si LSI and that can be fabricated using methods compatible with CMOS processes. A SiON waveguide showed a low propagation loss around 0.3 dB/cm at a wavelength of 850 nm, and excellent branching characteristics were achieved for MMI (multimode interference) branch structures. A Si nano-photodiode showed highly enhanced speed and efficiency with a surface plasmon antenna. By combining our Si nano-photonic devices with the advanced TIA-less optical clock distribution circuits, clock distribution above 10 GHz can be achieved with a small footprint on an LSI chip.
Chia-Chun TSAI Jan-Ou WU Trong-Yen LEE
This study has demonstrated that the clock tree construction in an SoC should be expanded to consider the intrinsic delay and skew of each IP's clock sink. A novel algorithm, called GDME, is proposed to combine grey relational clustering and DME approach for solving the problem of clock tree construction. Grey relational analysis can cluster the best pair of clock sinks and that guide a tapping point search for a DME algorithm for constructing a clock tree with zero skew and minimal delay. Experimentally, the proposed algorithm always obtains an RC- or RLC-based clock tree with zero skew and minimal delay for all the test cases and benchmarks. Experimental results demonstrate that the GDME improves up to 3.74% for total average in terms of total wire length compared with other DME algorithms. Furthermore, our results for the zero-skew RLC-based clock trees compared with Hspice are 0.017% and 0.2% lower for absolute average in terms of skew and delay, respectively.
An injection-locked clock recovery circuit (CRC) with quadrature outputs based on multiplexed oscillator is presented. The CRC can operate at a half-rate speed to provide an adequate locking range with reasonable jitter and power consumption because both clock edges sample the data waveforms. Implemented by 0.18-µm CMOS technique, experimental results demonstrate that it can achieve the phase noise of the recovered clock about -121.55 dBc/Hz at 100-kHz offset and -129.58 dBc/Hz at 1-MMz offset with 25 MHz lock range, while operating at the input data rate of 1.55 Gb/s.
Yow-Tyng NIEH Shih-Hsu HUANG Sheng-Yu HSU
Although much research effort has been devoted to the minimization of total power consumption caused by the clock tree, no attention has been paid to the minimization of the peak current caused by it. In this paper, we propose an opposite-phase clock scheme to reduce the peak current incurred by the clock tree. Our basic idea is to balance the charging and discharging activities. According to the output operation, the clock buffers that transit simultaneously are divided into two groups: half of the clock buffers transit at the same phase of the clock source, while the other half transit at the opposite phase of the clock source. As a consequence, the opposite-phase clock scheme significantly reduces the peak current caused by the clock tree. Experimental data show that our approach can be applied at different design stages in the existing design flow.
Bakhtiar Affendi ROSDI Atsushi TAKAHASHI
A new algorithm is proposed to reduce the area of a pipelined circuit using a combination of multi-clock cycle paths, clock scheduling and delay balancing. The algorithm analyzes the circuit and replaces intermediate registers with delay elements under the condition that the circuit works correctly at given target clock-period range with the smaller area. Experiments with pipelined multipliers verify that the proposed algorithm can reduce the area of a pipelined circuit without degrading performance.
Kohei HOSOKAWA Katsunori TANAKA Yuichi NAKAMURA
FPGA-based hardware emulators are often used for the verification of LSI functions. They generally have dedicated external memories, such as SDRAMs, to compensate for the lack of memory capacity in FPGAs. In such a case, access between the FPGAs and the dedicated external memory may represent a major bottleneck with respect to emulation speed since the dedicated external memory may have to emulate a large number of memory blocks. In this paper, we propose three methods, "Dynamic Clock Control (DCC)," "Memory Mapping Optimization (MMO)," and "Efficient Access Scheduling (EAS)," to avoid this bottleneck. DCC controls an emulation clock dynamically in accord with the number of memory accesses within one emulation clock cycle. EAS optimizes the ordering of memory access to the dedicated external memory, and MMO optimizes the arrangement of the dedicated external memory addresses to which respective memories will be emulated. With them, emulation speed can be made 29.0 times faster, as evaluated in actual LSI emulations.
Xu ZHANG Xiaohong JIANG Susumu HORIGUCHI
The evolution of VLSI chips towards larger die size, smaller feature size and faster clock speed makes the clock distribution an increasingly important issue. In this paper, we propose a new clock distribution network (CDN), namely Variant X-Tree, based on the idea of X-Architecture proposed recently for efficient wiring within VLSI chips. The Variant X-Tree CDN keeps the nice properties of equal-clock-path and symmetric structure of the typical H-Tree CDN, but results in both a lower maximal clock delay and a lower clock skew than its H-Tree counterpart, as verified by an extensive simulation study that incorporates simultaneously the effects of process variations and on-chip inductance. We also propose a closed-form statistical models for evaluating the skew and delay of the Variant X-Tree CDN. The comparison between the theoretical results and the simulation results indicates that the proposed statistical models can be used to efficiently and rapidly evaluate the performance of the variant X-Tree CDNs.
Shinsaku KIYOMOTO Kazuhide FUKUSHIMA Toshiaki TANAKA Kouichi SAKURAI
In this paper, we examine the effectiveness of clock control in protecting stream ciphers from a distinguishing attack, and show that this form of control is effective against such attacks. We model two typical clock-controlled stream ciphers and analyze the increase in computational complexity for these attacks due to clock control. We then analyze parameters for the design of clock-controlled stream ciphers, such as the length of the LFSR used for clock control. By adopting the design criteria described in this paper, a designer can find the optimal length of the clock-control sequence LFSR.