Hirofumi HAMAMURA Hiroaki KOMATSU
This paper describes special-purpose hardware for large-scale logic simulation, called SP2, which executes an event-driven algorithm and can simulate up to sixteen million gates. SP2 was developed in 1992 for system verification of large-scale computer designs as a successor to SP1, which was developed in 1987. SP2 provides enhanced performance, throughput, and delay accuracy over SP1. Since 1992, SP2 has been widely used for system-level simulation of mainframes, supercomputers, UNIX servers, and microprocessors. It is used as a powerful simulator either in all stages of design verification or in the early stages, before regression testing with emulators.
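To illustrate what an event-driven algorithm means for logic simulation, here is a minimal software sketch. It is not the SP2 design: the gate table format, the two-input gate set, and the names `simulate` and `GATE_FUNCS` are all assumptions made for this example. The key idea it shows is that a gate is re-evaluated only when one of its input nets actually changes value.

```python
import heapq

# Hypothetical two-input gate library for the sketch.
GATE_FUNCS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
}

def simulate(gates, initial, stimuli):
    """Event-driven simulation of a small gate network.

    gates:   {output_net: (kind, input_a, input_b, delay)}
    initial: {net: 0 or 1} for every net, assumed consistent with the gates
    stimuli: [(time, net, value)] events applied to primary inputs
    Returns the final net values and the number of gate evaluations.
    """
    values = dict(initial)
    fanout = {}
    for out, (_, a, b, _) in gates.items():
        fanout.setdefault(a, []).append(out)
        fanout.setdefault(b, []).append(out)

    queue = list(stimuli)
    heapq.heapify(queue)          # events are processed in time order
    evaluations = 0
    while queue:
        time, net, value = heapq.heappop(queue)
        if values[net] == value:
            continue              # no change, so nothing propagates
        values[net] = value
        for out in fanout.get(net, []):
            kind, a, b, delay = gates[out]
            evaluations += 1
            new_value = GATE_FUNCS[kind](values[a], values[b])
            heapq.heappush(queue, (time + delay, out, new_value))
    return values, evaluations

# Half adder: sum = a XOR b, carry = a AND b, each with unit delay.
half_adder = {"sum": ("XOR", "a", "b", 1), "carry": ("AND", "a", "b", 1)}
start = {"a": 0, "b": 0, "sum": 0, "carry": 0}
final, evals = simulate(half_adder, start, [(0, "a", 1), (5, "b", 1)])
```

Dedicated hardware would typically replace the software priority queue with parallel event and evaluation pipelines; the sketch only shows the scheduling principle.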
In this paper a low-complexity and high-resolution algorithm to estimate the magnitude of complex numbers is presented. Starting from a review of previous art, the new algorithm has been derived to improve precision performance without any penalty in hardware complexity. As a case example, a semi-custom VLSI implementation for 10-bit 2's complement input data has been performed. A mean square error and mean error performance improvement of nearly one order of magnitude has been demonstrated for a hardware complexity increase of roughly 34% with respect to previously presented solutions.
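For context, a classic low-complexity magnitude estimator is the "alpha max plus beta min" family, which avoids squaring and square roots. The sketch below is one well-known shift-and-add variant, max(hi, 7/8 hi + 1/2 lo). It is shown only as a representative of earlier approaches, not as the algorithm proposed in this paper; the function name and the coefficient choice are assumptions for illustration.

```python
def magnitude_estimate(i, q):
    """Approximate sqrt(i*i + q*q) for integer inputs using shifts and adds.

    Uses max(hi, 7/8*hi + 1/2*lo), where hi and lo are the larger and
    smaller of |i| and |q|. The coefficients are an assumed, commonly
    used choice, not taken from the paper.
    """
    a, b = abs(i), abs(q)
    hi, lo = max(a, b), min(a, b)
    return max(hi, hi - (hi >> 3) + (lo >> 1))

# Inputs in the 10-bit two's complement range [-512, 511].
est_345 = magnitude_estimate(300, 400)     # exact magnitude is 500
est_axis = magnitude_estimate(-512, 0)     # exact magnitude is 512
est_diag = magnitude_estimate(-512, -512)  # exact magnitude is about 724.1
```

Because the shifts truncate, the relative error grows for very small inputs; a hardware design would usually add guard bits to limit this.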
Yasuaki WATANABE Naofumi TAKAGI Kazuyoshi TAKAGI
A VLSI algorithm for division in GF(2^m) with the canonical basis representation is proposed. It is based on the extended Binary GCD algorithm for GF(2^m), and performs division through iteration of simple operations, such as shifts and bitwise exclusive-OR operations. A divider in GF(2^m) based on the algorithm has a linear array structure with a bit-slice feature and carries out division in 2m clock cycles. The amount of hardware of the divider is proportional to m and the depth is a constant independent of m.
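The idea of dividing with only shifts and exclusive-OR operations can be sketched in software with the textbook binary (GCD-style) division algorithm over a polynomial basis. This is a generic illustration, not the paper's bit-slice VLSI formulation. Field elements are held as Python integers whose bits are polynomial coefficients, and `gf_div`, `gf_mul`, and the choice of the AES polynomial for the example are assumptions.

```python
def gf_div(b, a, f):
    """Return b / a in GF(2^m) defined by the irreducible polynomial f.

    Invariants: a*g1 == b*u and a*g2 == b*v (mod f). Only shifts and
    XORs are used. f must have a nonzero constant term.
    """
    if a == 0:
        raise ZeroDivisionError("division by zero in GF(2^m)")
    u, v, g1, g2 = a, f, b, 0
    while u != 1 and v != 1:
        while u & 1 == 0:                      # x divides u
            u >>= 1
            g1 = g1 >> 1 if g1 & 1 == 0 else (g1 ^ f) >> 1
        while v & 1 == 0:                      # x divides v
            v >>= 1
            g2 = g2 >> 1 if g2 & 1 == 0 else (g2 ^ f) >> 1
        if u.bit_length() > v.bit_length():    # deg(u) > deg(v)
            u ^= v
            g1 ^= g2
        else:
            v ^= u
            g2 ^= g1
    return g1 if u == 1 else g2

def gf_mul(a, b, f, m):
    """Shift-and-add multiplication in GF(2^m), used here only as a check."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return result

AES_POLY = 0x11B                       # x^8 + x^4 + x^3 + x + 1, so m = 8
inverse = gf_div(1, 0x53, AES_POLY)    # multiplicative inverse of 0x53
quotient = gf_div(0xB7, 0x53, AES_POLY)
```

Multiplying the quotient back by the divisor with `gf_mul` recovers the dividend, which is a convenient self-check.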
Kei SAKAGUCHI Jun-ichi TAKADA Kiyomichi ARAKI
Implementation of a Multi-Input Multi-Output (MIMO) channel sounder is considered, taking hardware cost and realtime measurement into account. A remarkable difference between MIMO and conventional Single-Input Multi-Output (SIMO) channel sounding is that the MIMO sounder needs some kind of multiplexing to distinguish transmitting antennas. We compared three types of multiplexing (TDM, FDM, and CDM) for the sounding purpose, and chose an FDM-based technique to achieve cost effectiveness and realtime measurement. In the framework of FDM, we have proposed an algorithm to estimate MIMO channel parameters. Furthermore, the proposed algorithm was implemented in hardware, and its validity was evaluated through measurements in an anechoic chamber.
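The role of FDM in distinguishing transmit antennas can be sketched as follows. Assume, for illustration only, that each transmit antenna sends a single tone on its own DFT bin and that the channel is a single complex gain per antenna; the receiver then recovers each gain by correlating with the corresponding tone. The function names and the signal model are assumptions, and the paper's actual parameter estimation algorithm is not reproduced here.

```python
import cmath

def fdm_received(gains, tone_bins, n):
    """Simulate one receive antenna: the sum of all transmit tones,
    each scaled by the (hypothetical) complex gain of its channel."""
    return [
        sum(h * cmath.exp(2j * cmath.pi * k * t / n)
            for h, k in zip(gains, tone_bins))
        for t in range(n)
    ]

def estimate_gains(rx, tone_bins):
    """Recover one complex gain per transmit antenna by evaluating the
    DFT of the received block at that antenna's tone bin."""
    n = len(rx)
    return [
        sum(rx[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / n
        for k in tone_bins
    ]

true_gains = [0.5 + 0.5j, -0.25j]   # two transmit antennas
bins = [1, 3]                       # distinct tone per antenna
rx_block = fdm_received(true_gains, bins, 16)
estimated = estimate_gains(rx_block, bins)
```

Because the tones sit on distinct integer bins of the same block length, they are orthogonal, so all transmit antennas can be measured simultaneously.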
Jun MATSUOKA Yoshifumi SEKINE Katsutoshi SAEKI Kazuyuki AIHARA
A number of studies have recently been published concerning chaotic neuron models and asynchronous neural networks having chaotic neuron models. In the case of large-scale neural networks having chaotic neuron models, the neural network should be constructed using analog hardware, rather than by computer simulation via software, due to the high speed and high integration of analog circuits. In the present study, we discuss the circuit structure of a chaotic neuron model, which is constructed on the basis of the mathematical model of an asynchronous chaotic neuron. We show that the pulse-type hardware chaotic neuron model can be constructed on the basis of the mathematical model of an asynchronous chaotic neuron. The proposed model is an effective model for the cell body section of the pulse-type hardware chaotic neuron model for ICs. In addition, we show the bifurcation structure of our composed model, and discuss the bifurcation routes and return maps thereof.
Nozomu TOGAWA Takashi SAKURAI Masao YANAGISAWA Tatsuo OHTSUKI
This letter proposes a hardware/software partitioning algorithm for digital signal processor cores with two register files. Given compiled assembly code and a timing constraint on execution time, the proposed algorithm generates a processor core configuration together with new assembly code that runs on the generated processor core. The proposed algorithm considers two register files and determines the number of registers in each of the register files. Moreover, the algorithm considers two or more types of functional units for each arithmetic or logical operation and assigns functional units with small area to a processor core without incurring a performance penalty. A generated processor core will have a small area compared with processor cores which have a single register file or those which consider only one type of functional unit for each operation. The experimental results demonstrate the effectiveness and efficiency of the proposed algorithm.
Nozomu TOGAWA Yoshiharu KATAOKA Yuichiro MIYAOKA Masao YANAGISAWA Tatsuo OHTSUKI
Hardware/software partitioning is one of the key processes in a hardware/software cosynthesis system for digital signal processor cores. In hardware/software partitioning, area and delay estimation of a processor core plays an important role, since the partitioning process must determine which parts of a processor core should be realized by hardware units and which parts by a sequence of instructions, based on the execution time of an input application program and the area of the synthesized processor core. This paper proposes area and delay estimation equations for digital signal processor cores. For area estimation, we show that the total area of a processor core can be derived as the sum of the area of the processor kernel and the area of the additional hardware units. The area of a processor kernel is mainly determined by its minimum area and the overheads of adding hardware units and registers. The area of a hardware unit is mainly determined by its type and operation bit width. For delay estimation, we show that the critical path delay of a processor core can be derived from the delay of the hardware unit on the critical path in the processor core. Experimental results demonstrate that area estimation errors are less than 2% and delay estimation errors are less than 2 ns when estimated area and delay are compared with logic-synthesized area and delay.
A system-level approach to memory power reduction is proposed in this paper. The basic idea is to allocate frequently executed object code to a small subprogram memory and to optimize the supply voltage and threshold voltage of the subprogram memory. Since large-scale memory contains many direct paths from power supply to ground, power dissipation caused by subthreshold leakage current is more serious than dynamic power dissipation. Our approach optimizes the size of the subprogram memory, the supply voltage, and the threshold voltage so as to minimize memory power dissipation, including static power dissipation caused by leakage current. A heuristic algorithm which determines code allocation, supply voltage, and threshold voltage simultaneously so as to minimize the power dissipation of memories is proposed as well. Our experiments with some benchmark programs demonstrate significant energy reductions of up to 80% over a program memory which does not employ our approach.
We present a scalable parallel rasterizer based on our interleaved scanline rasterization. The sorting overhead of a conventional scanline-based parallel rendering approach has been studied and removed by implementing scanline assignment hardware. All advantages of scanline-based parallel rendering are kept, so that good scalability and small memory usage are achieved. Our architecture is evaluated precisely by a discrete event-based simulation, and the rendering performance and utilization are shown for various numbers of rasterizers. The simulation results show more than 8 Mtriangles/s of performance with 64 rasterization engines running at 10 MHz.
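One plausible reading of interleaved scanline assignment is a round-robin mapping in which scanline y is handled by rasterizer y mod N. The sketch below shows only that mapping. The abstract does not spell out the actual assignment rule used by the hardware, so the modulo rule and the function name `assign_scanlines` are assumptions.

```python
def assign_scanlines(y_min, y_max, num_rasterizers):
    """Distribute the scanlines covered by one triangle (y_min..y_max,
    inclusive) across rasterizers in an interleaved, round-robin fashion."""
    work = {r: [] for r in range(num_rasterizers)}
    for y in range(y_min, y_max + 1):
        work[y % num_rasterizers].append(y)
    return work

# A triangle spanning scanlines 3..10, rendered by 4 rasterizers.
plan = assign_scanlines(3, 10, 4)
```

With this kind of mapping, neighbouring scanlines go to different rasterizers, so even a small triangle spreads its work across several engines without a separate sorting pass.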
Nozomu TOGAWA Masayuki IENAGA Masao YANAGISAWA Tatsuo OHTSUKI
This paper proposes an area/time optimizing algorithm in a high-level synthesis system for control-based hardware. Given a call graph whose nodes correspond to control flows of an application program, the algorithm generates a set of state-transition graphs which represents the input call graph under area and timing constraints. In the algorithm, state-transition graphs which satisfy only the timing constraint are generated first, and they are then transformed so that they satisfy the area constraint. Since the algorithm is directly applied to control-flow graphs, it can deal with control flows such as bit-wise processes and conditional branches. Further, the algorithm synthesizes multiple hardware architecture candidates from a single call graph for an application program. Designers of an application program can select several good hardware architectures among the candidates depending on multiple design criteria. Experimental results for several control-based hardware designs demonstrate the effectiveness and efficiency of the algorithm.
A digit-recurrence algorithm for cube rooting is proposed. In cube rooting, the digit-recurrence equation of the residual includes the square of the partial result of the cube root. In the proposed algorithm, the square of the partial result is kept, and the square, as well as the residual, is updated by addition/subtraction, shift, and multiplication by one or two digits. Different specific versions of the algorithm are possible, depending on the radix, the digit set of the cube root, and so on. Any version of the algorithm can be implemented as a sequential (folded) circuit or a combinational (unfolded) circuit, which is suitable for VLSI realization.
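A minimal integer sketch of the recurrence, assuming radix 2 and the digit set {0, 1}, may make the role of the stored square concrete. If q is the partial root and s = q^2, appending a digit 1 requires subtracting (2q+1)^3 - 8q^3 = 12s + 6q + 1 from the shifted residual, and the square becomes 4s + 4q + 1. Both are computed below with shifts and additions only. The function name and the restoring-style comparison are assumptions; the paper covers more general radices and digit sets.

```python
def cube_root_digit_recurrence(x, digits):
    """Integer cube root of x, one radix-2 digit per iteration.

    x must satisfy 0 <= x < 8**digits. Returns (q, r) with q**3 + r == x
    and q the largest integer whose cube does not exceed x.
    """
    q = 0   # partial cube root
    s = 0   # square of the partial cube root, kept and updated each step
    r = 0   # residual
    for j in range(digits - 1, -1, -1):
        # Shift the residual by one radix-8 digit and bring in 3 bits of x.
        r = (r << 3) | ((x >> (3 * j)) & 7)
        # Trial subtrahend for digit 1: 12*s + 6*q + 1.
        t = (s << 3) + (s << 2) + (q << 2) + (q << 1) + 1
        if r >= t:
            r -= t
            s = (s << 2) + (q << 2) + 1   # (2q + 1)^2 = 4s + 4q + 1
            q = (q << 1) | 1
        else:
            s <<= 2                       # (2q)^2 = 4s
            q <<= 1
    return q, r

root_27 = cube_root_digit_recurrence(27, 2)
root_100 = cube_root_digit_recurrence(100, 3)
```

Unrolling the loop gives the combinational (unfolded) form, while reusing one iteration stage per clock cycle corresponds to the sequential (folded) form mentioned above.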
A combination of software and a systolic hardware implementation for the Quasi Arithmetic compression algorithm is presented. The hardware is implemented as a pipeline. The implementation doesn't change the algorithm; it just splits it into two parts. The combination of parallel software and pipelined hardware can give very fast compression without any decline in compression efficiency.
Koyo NITTA Toshihiro MINAMI Toshio KONDO Takeshi OGURA
This paper describes a unique motion estimation and compensation (ME/MC) hardware architecture for a scene-adaptive algorithm. By statistically analyzing the characteristics of the scene being encoded and controlling the encoding parameters according to the scene, the quality of the decoded image can be enhanced. The most significant feature of the architecture is that the two modules for ME/MC can work independently. Since a time interval can be inserted between the operations of the two modules, a scene-adaptive algorithm can be implemented in the architecture. The ME/MC architecture is loaded on a single-chip MPEG-2 video encoder.
Dae-Hyun LEE In-Cheol PARK Chong-Min KYUNG
This paper presents an efficient approach for a hardware/software partitioning problem: synthesis of an application-specific coprocessor which accelerates embedded software running on a main processor. Given a set of data flow graphs (DFGs), most previous hardware/software partitioning approaches have focused on mapping DFGs to hardware or software. Their common weaknesses are that 1) they ignore various implementation alternatives in realizing DFGs as hardware, based on the assumption that only a single hardware implementation exists for a DFG, and that 2) they don't consider the effect of merging on hardware area when synthesizing a coprocessor by merging DFGs. To deal with the first issue, we formulate both the mapping of DFGs to hardware or software and the selection of the appropriate hardware implementation for each DFG as a single integer programming problem, and then apply an iterative algorithm based on Kernighan and Lin's heuristic to solve the problem. To reduce the CPU time, we have devised data structures that quickly calculate the costs of hardware implementations. To deal with the second issue, our method links DFGs with dummy nodes to produce a single large DFG, and then synthesizes a target coprocessor by globally scheduling the DFG and allocating its datapath. Experimental results demonstrate that our approach outperforms the previous approach based on a genetic algorithm (GA) in both the coprocessor area and the CPU time.
Hidehisa NAGANO Akihiro MATSUURA Akira NAGOYA
This paper proposes a method for implementing a metric computation accelerator for fractal image compression using reconfigurable hardware. The most time-consuming part of the encoding in this compression is the computation of metrics among image blocks. In our method, each processing element (PE) configured for an image block accelerates these computations by pipeline processing. Furthermore, by configuring the PE for a specific image block, we can reduce the number of adders, which are the main computing elements, by half even in the worst case.
In the beginning of the new century, many information appliance (IA) products will replace traditional electronic appliances to help people in smart, efficient, and low-cost ways. These successful products must be capable of communicating multimedia information, which is embedded into the electronic appliances with high integration, innovation, and power-throughput tradeoff. In this paper, we develop a codesign procedure to analyze, compare, and emulate the multimedia communication applications to find the candidate implementations under different criteria. The experimental results demonstrate that in general, memory technology dominates the optimal tradeoff and ALU improvements impact greatly on particular applications. The results also show that the proposed procedure is effective and quite efficient.
Barry SHACKLEFORD Etsuko OKUSHI Mitsuhiro YASUDA Hisao KOIZUMI Katsuhiko SEO Hiroto YASUURA
The problem of synthesizing a minimum-cost logic network is formulated for a genetic algorithm (GA). When benchmarked against a commercial logic synthesis tool, an odd parity circuit required 24 basic cells (BCs) versus 28 BCs for the design produced by the commercial system. A magnitude comparator required 20 BCs versus 21 BCs for the commercial system's design. Poor temporal performance, however, is the main disadvantage of the GA-based approach. The design of a hardware-based cost function that would accelerate the GA by several thousand times is described.
Jih-Ming FU Trong-Yen LEE Pao-Ann HSIUNG Sao-Jie CHEN
Most current codesign tools or methodologies only support validation in the form of cosimulation and testing of design alternatives. The results of hardware-software codesign of a distributed system are often not verified, because they are not easily verifiable. In this paper, we propose a new formal coverification approach based on linear hybrid automata, and an algorithm for automatically converting codesign results to the linear hybrid automata framework. Our coverification approach allows automatic verification of real-time constraints such as hard deadlines. Another advantage is that the proposed approach is suitable for verifying distributed systems with arbitrary communication patterns and system architectures. The feasibility of our approach is demonstrated through several application examples. The proposed approach has also been successfully used in verifying deadline violations when there are inter-task communications between tasks with different period lengths.
Naofumi HOMMA Takafumi AOKI Tatsuo HIGUCHI
This paper presents an efficient graph-based evolutionary optimization technique called Evolutionary Graph Generation (EGG), and its application to the design of fast constant-coefficient multipliers using a parallel counter-tree architecture. An important feature of EGG is its capability to handle general graph structures directly in the evolution process instead of encoding the graph structures into indirect representations, such as bit strings and trees. This paper also addresses the major problem of EGG, namely the significant computation time required for verifying the function of generated circuits. To solve this problem, a new functional verification technique for arithmetic circuits is proposed. It is demonstrated that the EGG system can create efficient multiplier structures which are comparable or superior to known conventional designs.
Ji-Bing WANG Ming ZHAO Xi-Bin XU Yan YAO
In recent years, the concept of the software radio has been put forward by the international communication society. It is well known that software radio will play an important role in third generation wireless communication systems. Until now, however, there has been no generally accepted concept of software radio. How can software radio be made truly practical, and how can its advantages be exploited? This paper introduces some new ideas about the key issues of software radio, including software radio architecture and its hardware platform, and it focuses on the design considerations of the hardware platform. Conventional software radio systems use a pipeline architecture, which is not scalable and cannot fulfill the inherent requirements of software radios. In this paper a new layer structure of the hardware platform is proposed. It is an open architecture with flexibility and scalability. Then three schemes for hardware platform realization are introduced: bus architecture, switched network architecture, and fat tree architecture. An extensive analysis of the advantages and disadvantages of each architecture is given. Then an application example is proposed, in which the switched network architecture is applied to cellular wireless communication systems. The base station is divided into four components according to their functions: antenna, IF, baseband, and control, which are connected by the ATM network. We call this virtualization of wireless communication systems. This will bring great benefits such as fast handoff and easy realization of different macrodiversity algorithms.