Shinichi NISHIZAWA Toru NAKURA
We propose an open source cell library characterizer. Recently, free and open-sourced silicon design communities are attracted by hobby designers, academies and industries. These open-sourced silicon designs are supported by free and open sourced EDAs, however, in our knowledge, tool-chain lacks cell library characterizer to use original standard cells into digital circuit design. This paper proposes an open source cell library characterizer which can generate timing models and power models of standard cell library.
Massive amounts of computation involved in real-time evaluation of deep neural networks pose a serious challenge in battery-powered systems, and neuromorphic systems specialized in neural networks have been developed. This paper first shows the portion of active neurons at a time dwindles as going toward the output layer in recent large-scale deep convolutional neural networks. Spike-based, asynchronous neuromorphic systems take advantage of the sparse activation and reduce dynamic power consumption, while synchronous systems may waste much dynamic power even for the sparse activation due to clocks. We thus propose a clock gating-based dynamic power reduction method that exploits the sparse activation for synchronous neuromorphic systems. We apply the proposed method to a building block of a recently proposed synchronous neuromorphic computing system and demonstrate up to 79% dynamic power saving at a negligible overhead.
Youngjoo LEE Jaehwan JUNG In-Cheol PARK
This paper presents a novel low-power decoder architecture for the (36420, 32778) binary LDPC code targeting energy-efficient NAND-flash-based mobile devices. The proposed energy-scalable decoding algorithm reduces the operating bit-width of decoding function units at the early-use stage where the channel condition is good enough to lower the precision of computation. Based on a flexible adder structure, the decoding energy of the proposed LDPC decoder can be reduced by freezing the unnecessary parts of hardware resources. A prototype 4KB LDPC decoder is designed in a 65nm CMOS technology, which achieves an average decoding throughput of 8.13Gb/s with 1.2M equivalent gates. The power consumption of the decoder ranges from 397mW to 563mW depending on operating conditions.
An analytical performance evaluation model is presented in this paper. A time-splitting transmitter circuit employing a selectively activated flip-driver (SAFD) is presented and its performance is estimated by the new model. The optimal partitioning method which maximizes the performance of a given bus-invert (BI) coding circuit is also presented. When a bus is optimally partitioned, an ordinary BI circuit can reduce the number of bus transitions by about 25%, while an SAFD circuit can remove about 35% of them. The newly developed method is verified by simulations whose results correspond very well to the values predicted by the model.
Ming-Der SHIEH Shih-Hao FANG Shing-Chung TANG Der-Wei YANG
Partially parallel decoding architectures are widely used in the design of low-density parity-check (LDPC) decoders, especially for quasi-cyclic (QC) LDPC codes. To comply with the code structure of parity-check matrices of QC-LDPC codes, many small memory blocks are conventionally employed in this architecture. The total memory area usually dominates the area requirement of LDPC decoders. This paper proposes a low-complexity memory access architecture that merges small memory blocks into memory groups to relax the effect of peripherals in small memory blocks. A simple but efficient algorithm is also presented to handle the additional delay elements introduced in the memory merging method. Experiment results on a rate-1/2 parity-check matrix defined in the IEEE 802.16e standard show that the LDPC decoder designed using the proposed memory access architecture has the lowest area complexity among related studies. Compared to a design with the same specifications, the decoder implemented using the proposed architecture requires 33% fewer gates and is more power-efficient. The proposed new memory access architecture is thus suitable for the design of low-complexity LDPC decoders.
Su-Hon LIN Ming-Hwa SHEU Chao-Hsiang WANG
The moduli set (2n, 2n+1-1, 2n-1) which is free of (2n+1)-type modulus is profitable to construct a high-performance residue number system (RNS). In this paper, we derive a reduced-complexity residue-to-binary conversion algorithm for the moduli set (2n, 2n+1-1, 2n-1) by using New Chinese Remainder Theorem (CRT). The resulting converter architecture mainly consists of simple adder and multiplexer (MUX) which is suitable to realize an efficient VLSI implementation. For the various dynamic range (DR) requirements, the experimental results show that the proposed converter can significantly achieve at least 23.3% average Area-Time (AT) saving when comparing with the latest designs. Based on UMC 0.18 µm CMOS cell-based technology, the chip area for 16-bit residue-to-binary converter is 931931 µm2 and its working frequency is about 135 MHz including I/O pad.
This paper proposes a novel boundary scan test scheme for intellectual property (IP) core identification via watermarking. The core concept is embedding a watermark identification circuit (WIC) and a test circuit into the IP core at the behavior design level. The procedure depends on current IP-based design flow. This scheme can detect the identification of the IP provider without the need to examine the microphotograph after the chip has been manufactured and packaged. This scheme can successfully survive synthesis, placement, and routing and identify the IP core at various design levels. Experimental results have demonstrated that the proposed approach has the potential to solve the IP identification problem.
Jianping HU Tiefeng XU Hong LI
This paper presents a novel low-power register file based on adiabatic logic. The register file consists of a storage-cell array, address decoders, read/write control circuits, sense amplifiers, and read/write drivers. The storage-cell array is based on the conventional memory cell. All the circuits except the storage-cell array employ CPAL (complementary pass-transistor adiabatic logic) to recover the charge of large node capacitance on address decoders, bit-lines and word-lines in fully adiabatic manner. The minimization of energy consumption was investigated by choosing the optimal size of CPAL circuits for large load capacitance. The power consumption of the proposed adiabatic register file is significantly reduced because the energy transferred to the large capacitance buses is mostly recovered. The energy and functional simulations are performed using the net-list extracted from the layout. HSPICE simulation results indicate that the proposed register file attains energy savings of 65% to 85% as compared to the conventional CMOS implementation for clock rates ranging from 25 to 200 MHz.
In this paper, we investigate a low-power architecture for designs modeled as an Extended Finite State Machine (EFSM). It is based on the general dynamic power management concept, in which the redundant computation can be dynamically disabled to reduce the overall power dissipation. The contribution of this paper is mainly a systematic procedure to identify almost maximal amount of redundant computation in a design given as an EFSM. There are two levels of redundant computation to be exploited--one is based on the machine state information, while the other is based on the transition information. After the extraction of the redundant computation, a low-power architecture using input gating is proposed to synthesize the final circuit. We tested the technique on a design computing a number's modulo inverse. Experimental results show that 31% power reduction can be achieved at the costs of 2% timing penalty and 16% area overhead.
Yeu-Horng SHIAU Jer Min JOU Chin-Chi LIU
In this paper, two efficient VLSI architectures for biorthogonal wavelet transform are proposed. One is constructed by the filter bank implementation and another is constructed by the lifting scheme. In the filter bank implementation, due to the symmetric property of biorthogonal wavelet transform, the proposed architecture uses fewer multipliers than the orthogonal wavelet transform. Besides, the polyphase decomposition is adopted to speed up the processing by a factor of 2. In the lifting scheme implementation, the pipeline-scheduling technique is employed to optimize the architecture. Both two architectures are with advantages of lower implementation complexity and higher throughput rate. Moreover, they can also be applied to realize the inverse DWT efficiently. Based on the above properties, the two architectures can be applied to time-critical image compressions, such as JPEG2000. Finally, the architecture constructed by the lifting scheme is implemented into a single chip on 0.35 µm 1P4M CMOS technology, and its area and working performance are 5.005 5.005 mm2 and 50 MHz, respectively.
In this paper, a high-performance pipelining architecture for 2-D inverse discrete wavelet transform (IDWT) is proposed. We use a tree-block pipeline-scheduling scheme to increase computation performance and reduce temporary buffers. The scheme divides the input subbands into several wavelet blocks and processes these blocks one by one, so the size of buffers for storing temporal subbands is greatly reduced. After scheduling the data flow, we fold the computations of all wavelet blocks into the same low-pass and high-pass filters to achieve higher hardware utilization and minimize hardware cost, and pipeline these two filters efficiently to reach higher throughput rate. For the computations of N N-sample 2-D IDWT with filter length of size K, our architecture takes at most (2/3)N2 cycles and requires 2N(K-2) registers. In addition, each filter is designed regularly and modularly, so it is easily scalable for different filter lengths and different levels. Because of its small storage, regularity, and high performance, the architecture can be applied to time-critical image decompression.
The design of the analog part of a mixed analog-digital IC for a commercial wireless burglar alarm system is presented as an example of a very low-power VLSI design for battery-operated systems. The main constraint is battery life, which must be at least five years (with standard camera-battery). An operational amplifier, a power supply monitor and an oscillator are the core of the design. The operational amplifier absorbs 1.5 µA while the entire analog part absorbs 4 µA. Measures on each single part show compliance with specification. Test on working environment show its full functionality. Even though the example is application specific, the design solutions and each single element can also be utilized in many other battery-operated low-frequency devices (e.g. environmental parameter monitoring).
In this paper, a VLSI architecture for lifting-based discrete wavelet transform (LDWT) is presented. Our architecture folds the computations of all resolution levels into the same low-pass and high-pass units to achieve higher hardware utilization. Due to the regular and flexible structure of the design, its area is independent of the length of the 1-D input sequence, and its latency is independent of the number of resolution levels. For the computations of analysis process of N-sample 1-D 3-level LDWT, our design takes about N clock cycles and requires 2 multipliers, 4 adders, and 22 registers. It is fabricated with TSMC 0.35-µm cell library and has a die size of 1.21.2 mm2. The power dissipation of the chip is about 0.4 W at the clock rate of 80 MHz.
This paper presents an event-driven approach to the timing verification of latch-synchronized systems. The proposed method performs critical path extraction and timing error detection at the same time, and extracts the critical path only if necessary. By doing so, the complexity of analysis is reduced and efficiency is greatly improved over the conventional approaches which detect timing errors after extracting the complete critical paths of the system. Experimental results show that, compared to the existing methods, it provides a more than 12-fold improvement in speed on the average for ISCAS benchmark circuits, and the relative efficiency of analysis improves as the circuit size grows.
We give a tutorial on high-level synthesis of VLSI. The evolution of digital system synthesis techniques and the need for higher level design automation tools are first discussed. We then point out essential issues to the successful development and acceptance by the designers of a high-level synthesis system. Techniques that have been proposed for various subtasks of high-level synthesis are surveyed. Possible applications of the high level synthesis in area other than chip design are forecast. Finally, we point out several directions for possible future research.
Kalman filter is an essential tool in signal processing, modern control and communications. The filter estimates the states of a given system from noisy measurements, using a mean-square error criterion. Although Kalman filter has been shown to be very versatile, it has always been computationally intensive since a great number of matrix computations must be performed at each iteration. Thus the exploitation of this technique in broadband real time applications is restricted. The solution to these limitations appears to be in VLSI (very large scale integration) architectures for the parallel processing of data, in the form of systolic architectures. Systolic arrays are networks of simple processing cells connected only to their nearest neighbors. Each cell consists of some simple logic and has a small amount of local memory. Overall data flows through the array are synchronously controlled by a single main clock pulse. In parallel with the development of Kalman filter, the square root covariance and the square root information methods have been studied in the past. These square root methods are reported to be more accurate, stable and efficient than the original algorithm presented by Kalman. However it is known that standard SRIF is less efficient than the other algorithms, simply because standard SRIF has additional matrix inversion computation and matrix multiplication which are difficult to implement in terms of speed and accuracy. To solve this problem, we use the modified Faddeeva algorithm in computing matrix inversion and matrix multiplication. The proposed algorithm avoids the direct matrix inversion computation and matrix multiplication, and performs these matrix manipulations by Gauss elimination. To evaluate the proposed method, we constructed an efficient systolic architecture for standard SRIF using the COMPASS design tools. Actual VLSI design and its simulation are done on the circuits of four type processors that perform Gauss elimination and the modified Givens rotation.
Shoichiro YAMADA Masahiro KASAI
This paper deals with the wire length expressions using differentiable nonlinear functions, as a result they can be used in analytical placement methods. These expressions can be applicable to clique, bipartite-graph, and half-perimeter net models, and quadratic and Manhattan metrics to estimate the wire lengths.
This paper studies the problem of embedding a graph into a book with nodes on a line along the spine of the book and edges on the pages in such a way that no edge crosses another. Atneosen as well as Bernhart and Kainen has shown that every graph can be embedded into a 3-page book when each edge can be embedded in more than one page. The time complexity of Bernhart and Kainen's method is Ω(ν(G)), where ν(G) is the crossing number of a graph G. A new 0(mn) algorithm is derived in this paper for embedding a graph G=(V, E), where m=│E│ and n= │V│ . The number of points at which edges cross over the spine in embedding a complete graph into a 3-page book is also investigated.
Yoichi SHIRAISHI Jun'ya SAKEMI Kazuyuki FUKUDA
A global routing problem is formulated as a multi-commodity network flow problem. The formulation gives no restriction over the shape of a routing pattern and makes it possible to obtain the optimal solution by using a mathematical programming method. Moreover, it can be naturally extended to the problem even optimizing routing length objectives for net delay and clock skew perfomances by using the goal programming method. An approximation algorithm solving the multi-commodity network flow problem is proposed by adding a merge step of wires whose source-sink pairs are exactly the same and a step restricting an area for searching routes. Experimental results show that this global routing algorithm connected with a line-search detailed router can generate a complete routing for interblock routing problems with more than 2400 wires in two industrial chips. The total amount of procassing time for both problems is about 90 minutes on a mainframe computer.
Ritsu KUSABA Hiroshi MIYASHITA Takumi WATANABE
This paper describes a new automated approach to generating the patterns of CMOS leaf cells from transistor-level connectivity data. This method can generate CMOS leaf cells that are configurable to a macro cell satisfying user-specified constraints. The user-specified constraints include the aspect ratio and port positions of the macro cell. We propose a top-down method for converting the macro cell level constratints to leaf cell level ones. Using this method, a variety of customized macro cells can be designed in a short turn-around time. The method consists of four processes--diffusion sharing, initial placement, placement improvement and routing--which culminate in the automatic generation of symbolic representations. Using a compactor, those symbolic representations can be converted to physical patterns which are gathered into a macro cell by a macro generator. We define various objective functions to improve unit pair placement. We also introduce five ways to optimize leaf cell area: 1) multi-row division, 2) gate division 3) rotation, 4) power line and diffusion overlapping and 5) reconstruction of hierarchical structure. The proposed approach has been applied to various kinds of CMOS leaf cells. Experimental results show that the generated cells have almost the same areas as those generated by conventional bottom-up approaches in leaf and macro cell layouts. This approach offers a further advantage in that the various-sized macro cells required by layout disigners can also be generated.