Shinya ABE Masanori HASHIMOTO Takao ONOYE
Influence of manufacturing variability on circuit performance has been increasing because of finer manufacturing process and lowered supply voltage. In this paper, we focus on mesh-style clock distribution which is believed to be effective for reducing clock skew, and we evaluate clock skew considering manufacturing and design variabilities. Considering MOS transistor variation -- random and spatially-correlated variation -- and non-uniform flip-flop (FF) placement, we demonstrate that spatially-correlated variation and severe non-uniform FF distribution can be major sources of clock skew. We also examine the dependency of clock skew on design parameters, and reveal that finer clock mesh does not necessarily reduce clock skew.
As well as the schedule affects system performance, the control skew, i.e., the arrival time difference of control signals between registers, can be utilized for improving the system performance, enhancing robustness against delay variations, etc. The simultaneous optimization of the control step assignment and the control skew assignment is more powerful technique in improving performance. In this paper, firstly, we prove that, even if the execution sequence of operations which are assigned to the same resource is fixed, the simultaneous optimization problem under a fixed clock period is NP-hard. Secondly, we propose a heuristic algorithm for the simultaneous control step and skew optimization under given clock period, and we show how much the simultaneous optimization improves system performance. This paper is the first one that uses the intentional skew to shorten control steps under a specified clock period. The proposed algorithm has the potential to play a central role in various scenarios of skew-aware high level synthesis.
Gunok JUNG Chunghee KIM Kyoungkuk CHAE Giho PARK Sung Bae PARK
This letter presents point diffusion clock network (PDCN) with local clock tree synthesis (CTS) scheme. The clock network is implemented with ten times wider metal line space than typical mesh networks for low power and utilized to nine times smaller area CTS execution for minimized clock skew amount. The measurement results show that skew amount of PDCN with local CTS is reduced to 36% and latency is shrunk to 45% of the amount in a 4.81 mm2 CortexA-8 core with 65 nm Samsung process.
Yuko HASHIZUME Yasuhiro TAKASHIMA Yuichi NAKAMURA
In deep-submicron technologies, process variations can significantly affect the performance and yield of VLSI chips. As a countermeasure to the variations, post-silicon tuning has been proposed. Deskew, where the clock timing of flip-flops (FFs) is tuned by inserted programmable delay elements (PDEs) into the clock tree, is classified into this method. We propose a novel deskew method that decides the delay values of the elements by measuring a small amount of FFs' clock timing and presuming the rest of FFs' clock timings based on a statistical model. In addition, our proposed method can determine the discrete PDE delay value because the rewriting constraint satisfies the condition of total unimodularity.
Akira UTAGAWA Tetsuya ASAI Tetsuya HIROSE Yoshihito AMEMIYA
We present on-chip oscillator arrays synchronized by random noises, aiming at skew-free clock distribution on synchronous digital systems. Nakao et al. recently reported that independent neural oscillators can be synchronized by applying temporal random impulses to the oscillators [1],[2]. We regard neural oscillators as independent clock sources on LSIs; i.e., clock sources are distributed on LSIs, and they are forced to synchronize through the use of random noises. We designed neuron-based clock generators operating at sub-RF region (< 1 GHz) by modifying the original neuron model to a new model that is suitable for CMOS implementation with 0.25-µm CMOS parameters. Through circuit simulations, we demonstrate that i) the clock generators are certainly synchronized by pseudo-random noises and ii) clock generators exhibited phase-locked oscillations even if they had small device mismatches.
Shunji KOZAKI Kazuto MATSUO Yasutomo SHIMBARA
Scalar multiplication methods using the Frobenius maps are known for efficient methods to speed up (hyper)elliptic curve cryptosystems. However, those methods are not efficient for the cryptosystems constructed on fields of small extension degrees due to costs of the field operations. Iijima et al. showed that one can use certain automorphisms on the quadratic twists of elliptic curves for fast scalar multiplications without the drawback of the Frobenius maps. This paper shows an extension of the automorphisms on the Jacobians of hyperelliptic curves of arbitrary genus.
Chia-Chun TSAI Jan-Ou WU Trong-Yen LEE
This study has demonstrated that the clock tree construction in an SoC should be expanded to consider the intrinsic delay and skew of each IP's clock sink. A novel algorithm, called GDME, is proposed to combine grey relational clustering and DME approach for solving the problem of clock tree construction. Grey relational analysis can cluster the best pair of clock sinks and that guide a tapping point search for a DME algorithm for constructing a clock tree with zero skew and minimal delay. Experimentally, the proposed algorithm always obtains an RC- or RLC-based clock tree with zero skew and minimal delay for all the test cases and benchmarks. Experimental results demonstrate that the GDME improves up to 3.74% for total average in terms of total wire length compared with other DME algorithms. Furthermore, our results for the zero-skew RLC-based clock trees compared with Hspice are 0.017% and 0.2% lower for absolute average in terms of skew and delay, respectively.
Xu ZHANG Xiaohong JIANG Susumu HORIGUCHI
The evolution of VLSI chips towards larger die size, smaller feature size and faster clock speed makes the clock distribution an increasingly important issue. In this paper, we propose a new clock distribution network (CDN), namely Variant X-Tree, based on the idea of X-Architecture proposed recently for efficient wiring within VLSI chips. The Variant X-Tree CDN keeps the nice properties of equal-clock-path and symmetric structure of the typical H-Tree CDN, but results in both a lower maximal clock delay and a lower clock skew than its H-Tree counterpart, as verified by an extensive simulation study that incorporates simultaneously the effects of process variations and on-chip inductance. We also propose a closed-form statistical models for evaluating the skew and delay of the Variant X-Tree CDN. The comparison between the theoretical results and the simulation results indicates that the proposed statistical models can be used to efficiently and rapidly evaluate the performance of the variant X-Tree CDNs.
Hidehiro TOYODA Shinji NISHIMURA Michitaka OKUNO Matsuaki TERADA
A high-speed physical-layer architecture for next-generation higher-speed Ethernet for VSR and backplane applications was developed. VSR and backplane networks provide 100-Gb/s data transmission in "mega data centers" and blade servers, which have new and broad potential markets of LAN technologies. It supports 100-Gb/s-throughput, high-reliability, and low-latency data transmission, making it well suited to VSR and backplane applications for intra-building and intra-cabinet networks. Its links comprise ten 10-Gb/s high-speed serial lanes. Payload data are transmitted by ribbon fiber cables for very short reach and by copper channels for the backplane board. Ten lanes convey 320-bit data synchronously (32 bits10 lanes) and parity data of forward-error correction code (newly developed (544, 512) code FEC), providing highly reliable (BER<1E-22) data transmission with a burst-error correction with low latency (31.0 ns on the transmitter (Tx) side and 111.6 ns on the receiver (Rx) side). A 64B/66B code-sequence-based skew compensation mechanism, which provides low-latency compensation for the lane-to-lane skew (less than 51 ns), is used for parallel transmission. Testing this physical-layer architecture in an ASIC showed that it can provide 100-Gb/s data transmission with a 772-kgate circuit, which is small enough for implementation in a single LSI.
Jan-Ou WU Chia-Chun TSAI Chung-Chieh KUO Trong-Yen LEE
In nature an unbalanced clock tree exists in a SoC because the clock sinks of IPs have distinct input capacitive loads and internal delays. The construction of a bottom-up RLC clock tree with minimal clock delay and zero skew is crucial to ensure good SoC performance. This study proves that an RLC clock tree construction always has no zero skew owing to skew upward propagation. Specifically, this study proposes the insertion of two unit-size buffers associated with the binary search for a tapping point into each pair of subtrees to interrupt the non-zero skew upward propagation. This technique enables reliable construction of a buffered RLC clock tree with zero skew. The effectiveness of the proposed approach is demonstrated by assessing benchmarks.
An all-digital clock deskew buffer with variable duty cycles is presented. The proposed circuit aligns the input and output clocks with two cycles. A pulsewidth detector using the sequential time-to-digital conversion is employed to detect the duty cycle. The output clock with adjustable duty cycles can be generated. The proposed circuit has been fabricated in a 0.35 µm CMOS technology. The measured duty cycle of the output clock can be adjusted from 30% to 70% in steps of 10%. The operation frequency range is from 400 MHz to 600 MHz.
Zheng LIU Masanori FURUTA Shoji KAWAHITO
The RC mismatch among S/H stages for time-interleaved ADCs causes a phase error and a gain error and the phase error is dominant. The paper points out that clock skew and the phase error caused by the RC mismatch have similar effects on the sampling error and then can be compensated with the clock skew compensation. Simulation results agree well with the theoretical analysis. With the phase error compensation of RC mismatch, the SNDR in 14b ADC can be improved by more than 15 dB in the case that the bandwidth of S/H circuits is 3 times the sampling frequency. This paper also proposes a method of clock skew and RC mismatch compensation in time-interleaved sample-and-hold (S/H) circuits by sampling clock phase adjusting.
Hidehiro TOYODA Shinji NISHIMURA Michitaka OKUNO Kouji FUKUDA Kouji NAKAHARA Hiroaki NISHI
A high-speed physical-layer architecture for Ethernet is described that supports 100-Gb/s throughput and 40-km transmission, making it well suited for next-generation metro-area and intrabuilding networks. Its links comprise 1210-Gb/s synchronized parallel optical lanes. Ethernet data frames are transmitted by coarse wavelength division multiplexing link and bundled optical fibers. Ten of the lanes convey 640-bit data synchronously (64 bits10 lanes). One conveys forward error correction code ((132 b, 140 b) Hamming code), providing highly reliable (BER < 10-12) data transmission, and the other conveys parity data, enabling fault-lane recovery. A newly developed 64B/66B code-sequence-based deskewing mechanism is used that provides low-latency compensation for the lane-to-lane skew, which is less than 88 ns. Testing of this physical-layer architecture in a field programmable gate array circuit demonstrated that it can provide 100-Gb/s data communication with a 590 k gate circuit, which is small enough for implementation in a single LSI circuit.
Takashi SATO Junji ICHIMIYA Nobuto ONO Koutaro HACHIYA Masanori HASHIMOTO
This paper quantitatively analyzes thermal gradient of SoC and proposes a thermal flattening procedure. First, the impact of dominant parameters, such as area occupancy of memory/logic block, power density, and floorplan on thermal gradient are studied quantitatively. Temperature difference is also evaluated from timing and reliability standpoints. Important results obtained here are 1) the maximum temperature difference increases with higher memory area occupancy and 2) the difference is very floorplan sensitive. Then, we propose a procedure to amend thermal gradient. A slight floorplan modification using the proposed procedure improves on-chip thermal gradient significantly.
Yongqiang LU Chin-Ngai SZE Xianlong HONG Qiang ZHOU Yici CAI Liang HUANG Jiang HU
With VLSI design development, the increasingly severe power problem requests to minimize clock routing wirelength so that both power consumption and power supply noise can be alleviated. In contrast to most of traditional works that handle this problem only in clock routing, we propose to navigate standard cell register placement to locations that enable further less clock routing wirelength and power. To minimize adverse impacts to conventional cell placement goals such as signal net wirelength and critical path delay, the register placement is carried out in the context of a quadratic placement. The proposed technique is particularly effective for the recently popular prescribed skew clock routing. Experiments on benchmark circuits show encouraging results.
Masanori HASHIMOTO Tomonori YAMAMOTO Hidetoshi ONODERA
This paper discusses clock skew due to manufacturing variability and environmental change. In clock tree design, transition time constraint is an important design parameter that controls clock skew and power dissipation. In this paper, we evaluate clock skew under several variability models, and demonstrate relationship among clock skew, transition time constraint and power dissipation. Experimental results show that constraint of small transition time reduces clock skew under manufacturing and supply voltage variabilities, whereas there is an optimum constraint value for temperature gradient. Our experiments in a 0.18 µm technology indicate that clock skew is minimized when clock buffer is sized such that the ratio of output and input capacitance is four.
Eero AHO Jarno VANNE Kimmo KUUSILINNA Timo D. HAMALAINEN
Parallel memories increase memory bandwidth with several memory modules working in parallel and can be used to feed a processor with only necessary data. The Configurable Parallel Memory Architecture (CPMA) enables a multitude of access formats and module assignment functions to be used within a single hardware implementation, which has not been possible in prior embedded parallel memory systems. This paper focuses on address computation in CPMA, which is implemented using several configurable computation units in parallel. One unit is dedicated for each type of access formats and module assignment functions that the implementation supports. Timing and area estimates are given for a 0.25-micron CMOS process. The utilized resources are shown to be linearly proportional to the number of memory modules.
Young-Soo SOHN Seung-Jun BAE Hong-June PARK Soo-In CHO
A CMOS DFE (decision feedback equalization) receiver with a clock-data skew compensation was implemented for the SSTL (stub-series terminated logic) SDRAM interface. The receiver consists of a 2 way interleaving DFE input buffer for ISI reduction and a X2 over-sampling phase detector for finding the optimum sampling clock position. The measurement results at 1.2 Gbps operation showed the increase of voltage margin by about 20% and the decrease of time jitter in the recovered sampling clock by about 40% by equalization in an SSTL channel with 2 pF 4 stub load. Active chip area and power consumption are 3001000 µm2 and 142 mW, respectively, with a 2.5 V, 0.25 µm CMOS process.
Hidehiro TOYODA Hiroaki NISHI Shinji NISHIMURA Hisaaki KANAI Katsuyoshi HARASAWA
The first practical approach to 100-Gigabit Ethernet, i.e., Ethernet with a throughput of 100-Gb/s, is proposed for use in the next generation of LANs for GRID computing and large-capacity data centers. New structures, including a coding architecture, de-skewing method and high-speed packaging techniques, are introduced to the PHY layer to obtain the required data rate. Our form of 100-Gigabit Ethernet uses 10-Gb/s 10-channel CWDM or parallel-optical links. The coding architecture is formed of 64B/66B codes, modified for the CWDM and parallel links. In the de-skewing of the parallel signals, specially designed IDLE characters are used to compensate for skewing of data in the respective signal lanes. Advanced packaging techniques, which suppress the propagation loss and reflection of the 10-Gb/s lanes to obtain high-speed, good integrity and low-noise signaling, are proposed and evaluated. The proposed architectural features make this 100-Gigabit Ethernet concept practical for next-generation LANs.
Hidehiro TAKATA Rei AKIYAMA Tadao YAMANAKA Haruyuki OHKUMA Yasue SUETSUGU Toshihiro KANAOKA Satoshi KUMAKI Kazuya ISHIHARA Atsuo HANAMI Tetsuya MATSUMURA Tetsuya WATANABE Yoshihide AJIOKA Yoshio MATSUDA Syuhei IWADE
An on-chip, 64-Mb, embedded, DRAM MPEG-2 encoder LSI with a multimedia processor has been developed. To implement this large-scale and high-speed LSI, we have developed the hierarchical skew control of multi-clocks, with timing verification, in which cross-talk noise is considered, and simple measures taken against the IR drop in the power lines through decoupling capacitors. As a result, the target performance of 263 MHz at 1.5 V has been successfully attained and verified, the cross-talk noise has been considered, and, in addition, it has become possible to restrain the IR drop to 166 mV in the 162 MHz operation block.