Hiroshi TAKAHASHI Shigeshi ABIKO Kenichi TASHIRO Kaoru AWAKA Yutaka TOYONOH Rimon IKENO Shigetoshi MURAMATSU Yasumasa IKEZAKI Tsuyoshi TANAKA Akihiro TAKEGAMA Hiroshi KIMIZUKA Hidehiko NITTA Miki KOJIMA Masaharu SUZUKI James Lowell LARIMER
A new high-speed, low-power digital signal processor (DSP) core, the C55x, was developed for next-generation applications such as 3G cellular phones, PDAs, digital still cameras (DSC), audio, video, embedded modems, and DVD. To support such MIPS-rich applications, the instruction fetch packet size was increased from 16 bits to 32 bits compared with the world's most popular C54x DSP core, while maintaining complete software compatibility with legacy DSP code. An on-chip instruction buffer queue (IBQ) automatically unpacks the packets and issues multiple instructions in parallel for efficient use of circuit resources. The efficiency of the parallelism has been further improved by additional hardware such as a second 17×17-bit MAC, a 16-bit ALU, and three temporary registers that can be used for simple computations. Four 40-bit accumulators make it possible to execute more operations per cycle with dramatically reduced overall power consumption. This new architecture achieves twice the instructions per cycle (IPC) of the previous DSP core on typical applications at the same clock frequency. The new DSP core was designed for TI's two 130 nm technologies, one with high VT for low-leakage, middle-performance operation at 1.5 V, and the other with low VT for high-performance, low-VDD operation at 1.2 V, to provide the best choice for any application from a single layout database. With the low-leakage process, the DSP core operates at over 200 MHz with 188 µA/MHz active power (at 75% Dual MAC + 25% ADD) and less than 1.63 µA standby current. With the high-performance process, it operates at 300 MHz with 169 µA/MHz active power and less than 680 µA standby current. The new core was designed with a semi-custom approach (ASIC + custom library) using a 5-level Cu metal system with a low-k dielectric material of fluorosilicate glass (FSG), and it contains about one million transistors. 
The overall balance of its power, performance, area, and leakage current (PPAL) is well suited to most next-generation applications. In this paper, we discuss features of the new DSP core, including circuit design techniques for high speed and low power, and present an example product.
This paper presents a novel algorithm that generates a beam pattern with maximum gain toward the target direction. The new technique uses a Generalized Conjugate Gradient Method (CGM), based on the conventional CGM, to obtain the optimal weight vector. The proposed method finds a weight vector that maximizes the SINR (Signal to Interference plus Noise Ratio). Based on an analysis of the results of various computer simulations, the proposed algorithm is observed to be suitable for IS2000 1X mobile communication environments.
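As background for the abstract above, the following is a minimal sketch of the conventional CGM for solving A x = b with a symmetric positive-definite matrix A. It is not the generalized variant proposed in the paper, whose details the abstract does not give; the function name, tolerance, and real-valued data are assumptions made for illustration only.

```python
def conjugate_gradient(A, b, tol=1e-12, max_iter=100):
    """Conventional CGM for A x = b, A symmetric positive-definite.

    Illustrative only: real-valued lists, hypothetical defaults.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old < tol:             # squared residual norm small enough
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        # New direction is conjugate to the previous ones with respect to A.
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x
```

In exact arithmetic the method reaches the solution of an n-by-n system in at most n iterations, which is why it is attractive for weight-vector computation on small antenna arrays.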
Nozomu TOGAWA Koichi TACHIKAKE Yuichiro MIYAOKA Masao YANAGISAWA Tatsuo OHTSUKI
This letter proposes a new hardware/software partitioning algorithm for processor cores with SIMD instructions. Given compiled assembly code that includes SIMD instructions and a timing constraint, the proposed algorithm synthesizes an area-optimized processor core together with new assembly code. First, we assume for each operation type a super SIMD functional unit that can execute all the SIMD instructions. Second, we remove SIMD instructions, or "sub-functions," from each super functional unit one by one while the timing constraint is satisfied. At the same time, we update the assembly code so that it can run on the new processor configuration. By repeating this process, we finally obtain a SIMD functional unit configuration as well as a processor core architecture. Promising experimental results are also shown.
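The reduction loop described above can be sketched roughly as follows. This is an illustrative greedy version under a strongly simplified cost model, not the letter's actual algorithm: the function name, the per-sub-function area and cycle-penalty figures, and the "largest area first" selection rule are all assumptions.

```python
def reduce_sub_functions(sub_functions, base_cycles, max_cycles):
    """Greedily drop sub-functions from a 'super' functional unit.

    sub_functions maps a name to (area, extra_cycles_if_removed); the
    extra cycles model the software sequence that replaces the removed
    sub-function. All numbers are hypothetical.
    """
    kept = dict(sub_functions)
    cycles = base_cycles
    while True:
        # Sub-functions whose removal still meets the timing constraint.
        candidates = [
            (area, name)
            for name, (area, penalty) in kept.items()
            if cycles + penalty <= max_cycles
        ]
        if not candidates:
            break
        _, name = max(candidates)        # remove the largest-area one first
        cycles += kept.pop(name)[1]      # account for the software fallback
    return sorted(kept), cycles
```

For example, with `{"add8x4": (10, 50), "mul16x2": (40, 500), "sat_add": (5, 0)}`, a base of 1000 cycles, and a limit of 1100 cycles, only the multiplier sub-function remains in hardware, because removing it would violate the timing constraint.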
Nozomu TOGAWA Kyosuke KASAHARA Yuichiro MIYAOKA Jinku CHOI Masao YANAGISAWA Tatsuo OHTSUKI
A packed SIMD type operation, or SIMD operation, consists of n-parallel b/n-bit sub-operations executed by the modified n-bit functional unit. Such a functional unit is called a SIMD functional unit, and a processor core that can execute SIMD operations is called a SIMD processor core. SIMD operations can be effectively applied to image processing applications. This paper focuses on hardware/software cosynthesis of SIMD processor cores and, in particular, proposes a new simulator generator that simulates pipelined instructions for a SIMD processor. In general, a SIMD functional unit has many options, so a very large number of different SIMD functional unit instances is possible. However, since our hardware/software cosynthesis system synthesizes a special-purpose processor core for an input application program, it uses only a very limited set of SIMD functional unit instances. In the proposed approach, we consider a SIMD operation to be a set of SIMD sub-operations. By combining the appropriate SIMD sub-operations, we construct a single SIMD operation. The behavior of a SIMD functional unit can then be characterized by a collection of SIMD operations. The advantage of this approach is that a small number of behavior libraries for SIMD sub-operations suffices to instantiate a particular SIMD functional unit behavior. Experimental results demonstrate the effectiveness of the proposed approach.
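To make the notion of a packed SIMD operation concrete, here is a minimal sketch of a packed addition, assuming a b-bit word divided into n lanes of b/n bits with wrap-around arithmetic in each lane. The function name and default widths are illustrative and do not come from the paper.

```python
def packed_add(x, y, b=32, n=4):
    """Add two b-bit words as n independent (b // n)-bit lanes.

    Carries do not propagate between lanes; each lane wraps modulo
    2 ** (b // n). Illustrative sketch, not the paper's simulator.
    """
    lane_bits = b // n
    mask = (1 << lane_bits) - 1
    result = 0
    for lane in range(n):
        shift = lane * lane_bits
        lane_sum = ((x >> shift) & mask) + ((y >> shift) & mask)
        result |= (lane_sum & mask) << shift   # discard the lane's carry-out
    return result
```

Changing `n` switches the same word between, for example, four 8-bit and two 16-bit sub-operations, which is the kind of option that multiplies the number of possible SIMD functional unit instances.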
Hiroshi TAKAHASHI Rimon IKENO Yutaka TOYONOH Akihiro TAKEGAMA Yasumasa IKEZAKI Tohru URASAKI Hitoshi SATOH Masayasu ITOIGAWA Yoshinari MATSUMOTO
High-speed and low-power DSPs have been developed for versatile handset applications. The DSP contains a 16-bit fixed-point DSP core with multiple buses, highly tuned instruction sets, and a low-power architecture. It features CPU power of 404.5 µW/MHz, chip power of 2.08 mW/MHz at peak, 200 µA standby current, and 160 MHz/160 MIPS performance from a single DSP core. It also operates at 0.68 V over the temperature range from -40°C to 125°C in the worst case (weak corner), even though it uses a process with much higher I-off current than a conventional process in order to obtain a faster operating frequency. In this paper, we discuss circuit design techniques for continuing to scale down valuable IP cores while keeping the same functionality and achieving better speed, lower power dissipation, and operation at much lower voltage. For further power reduction by DSP software, Run-time Power Control (RPC) has been demonstrated experimentally in an MP3 player, a real-time application running on an Internet audio evaluation module, using a 100 MHz/100 MIPS DSP at 1.8 V; we obtained 32-60% power reduction on various music source data.
Hirohisa GAMBE Teruo ISHIHARA Yasuji OTA Norichika KUMAMOTO Yoshio KUNIYASU
Progress in the large-scale integration of baseband circuits for digital cellular phones now makes it possible to implement a voice CODEC and its related functions in the baseband LSI rather than in a general-purpose digital signal processor. This paper describes an improved hardware solution that enables efficient application of the PSI-CELP CODEC (the most complex CODEC for mobile systems) to the PDC half-rate system through its implementation as a DSP macro in a low-voltage, large-scale LSI. Specific circuit blocks are added as hardware engines to a general-purpose DSP-oriented core. These engines were implemented as peripheral circuits for a DSP macro that can be used as a single DSP with an added I/O circuit and is suitable for use in future highly integrated mobile baseband chips. With the assistance of these hardware engines and some additional ALU instructions for efficient programming, the machine speed required for the CODEC can be relatively slow. This allows the same architecture to be reused without setting the transistor threshold voltage too low, even when deeper sub-micron technologies require a chip to run at a lower supply voltage. We evaluated this DSP-macro architecture using a 0.35 µm CMOS technology test chip. We then developed a commercial version using 0.25 µm technology and verified that it operates at 1.2 V and that the PSI-CELP CODEC runs at 40 MIPS with a power consumption of 11 mW. We also verified that the circuit design can be applied up to 0.18 µm technology with a single threshold voltage of 0.3 V. Thus, the design of the DSP macro incorporating the hardware engines provides a great deal of flexibility that should allow its use in chips based on future technologies, and the voice CODEC firmware can be effectively reused. 
Although the DSP macro architecture was designed mainly through PSI-CELP application analysis, it can process other voice CODECs such as the AMR CODEC for third-generation mobile applications as well as some other mobile baseband functions such as channel CODECs. This approach can also be refined to permit its application to, for example, high-quality audio CODECs.
Hiroshi HARADA Masayuki FUJISE
In this paper, we describe a newly developed small-size software radio terminal. By downloading the software for each system from wired and wireless networks, the terminal can realize a global positioning system (GPS) navigation system, the vehicle information and communication system (VICS), the electronic toll collection system (ETC), AM/FM radio broadcasting services on the middle wave (MW) and very high frequency (VHF) bands, the FM multiplex broadcasting system, and several modulation schemes such as BPSK, ASK, QPSK, GMSK, and π/4QPSK. Using our proposed multitask algorithm, the terminal provides multiple services simultaneously when users want to use several radio communication services while driving. The terminal is 17.5 cm wide, 19.0 cm deep, and 5 cm high, and operates at 12.0 V DC and around 2 A. Its size and power consumption are small enough to be acceptable for consumers such as car drivers. We introduce the configuration and the proposed key technologies of the developed terminal and measure the software configuration time.
This paper presents a new DSP-oriented code optimization method to enhance performance by exploiting the specific architectural features of digital signal processors. In the proposed method, a source code is translated into the static single assignment form while preserving the high-level information related to loops and the address computation of array accesses. The information is used in generating hardware loop instructions and parallel instructions provided by most digital signal processors. In addition to the conventional control-data flow graph, a new graph is employed to make it easy to find auto-modification addressing modes efficiently. Experimental results on benchmark programs show that the proposed method is effective in improving performance.
Tetsuya YAMADA Makoto ISHIKAWA Yuji OGATA Takanobu TSUNODA Takahiro IRITA Saneaki TAMAKI Kunihiko NISHIYAMA Tatsuya KAMEI Ken TATEZAWA Fumio ARAKAWA Takuichiro NAKAZAWA Toshihiro HATTORI Kunio UCHIYAMA
A 32-bit embedded RISC microprocessor core integrating a DSP has been developed using a 0.18-µm five-layer-metal CMOS technology. The integrated DSP has a single MAC and exploits CPU resources to reduce hardware. The DSP occupies only 0.5 mm². The processor core includes a large on-chip 128 kB SRAM called U-memory. A large-capacity on-chip memory decreases the amount of traffic to external memory, which is effective for low-power and high-performance operation. To achieve low power dissipation for U-memory access, the active ratio of U-memory access is reduced. The critical path is a load path from the U-memory, and we optimized this path through the whole chip. The chip achieves 0.79 mA/MHz executing Dhrystone 1.1 at 108 MHz, which is suitable for mobile applications.
The C166S V2 is Infineon Technologies' latest-generation 16-bit microcontroller core and a member of the C166 family. This new core architecture is a huge step forward in performance and DSP capabilities: with its single-cycle engine and enhanced MAC unit running at up to 200 MHz, it more than doubles the performance of the fastest C166-based controllers (C166S V1) running at the same speed. Furthermore, the instruction set is fully compatible with the previous C166 cores. This architecture is specifically suited for real-time embedded systems with high requirements for performance and signal processing functionality under tight cost and power budgets. As a fully synthesizable core with a large selection of available peripherals, the C166S V2 provides a straightforward path to the specific systems-on-chip required.
Johannes KNEIP Matthias WEISS Wolfram DRESCHER Volker AUE Jurgen STROBEL Thomas OBERTHUR Michael BOLLE Gerhard FETTWEIS
This paper presents the HiperSonic 1, a multi-standard, application-specific signal processor, designed to execute the baseband conversion algorithms in IEEE802.11a- and HIPERLAN/2-based 5 GHz wireless LAN applications. In contrast to widely existing, dedicated implementations, most of the computational effort here was mapped onto a configurable, data- and instruction-parallel DSP core. The core is supplemented by mixed signal A/D, D/A converters and hardware accelerators. Memory and register architecture, instruction set and peripheral interfaces of the chip were carefully optimized for the targeted applications, leading to a sound combination of flexibility, die area and power consumption. The 120 MHz, 7.6 million-transistor solution was implemented in 0.18 µm CMOS and performs IEEE802.11a or HiperLAN/2 compliant baseband processing at data rates up to 60 Mbit/s.
A novel scheduling method for asynchronous multirate/multi-task processing by programmable digital signal processors (DSPs) has been developed. This mixed scheduling method combines static and dynamic scheduling, and avoids the runtime overhead of interrupt-driven context switching in order to realize asynchronous multirate systems. The processing delay introduced when using static scheduling with static buffering is avoided by introducing deadline scheduling in the static schedule design. In the developed software design system, a block-diagram description language is extended to describe asynchronous multi-task processing. The scheduling method enables asynchronous multirate processing, such as arbitrary-sampling-ratio rate conversion, asynchronous interfaces, and multimedia applications, to be efficiently realized by programmable DSPs.
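The abstract does not spell out how deadline scheduling is folded into the static schedule. As a generic illustration of the underlying idea only, the sketch below builds a time-slot table offline with an earliest-deadline-first rule; the task tuple layout, the unit time slots, and the function name are assumptions, not the paper's method.

```python
def edf_schedule(tasks, horizon):
    """Build a static slot table using earliest-deadline-first.

    tasks: list of (name, release, deadline, duration) in unit time slots.
    Returns one entry per slot: the task name that runs, or None if idle.
    Hypothetical sketch; deadline misses are not detected here.
    """
    remaining = {name: duration for name, _, _, duration in tasks}
    timeline = []
    for t in range(horizon):
        ready = [
            (deadline, name)
            for name, release, deadline, _ in tasks
            if release <= t and remaining[name] > 0
        ]
        if ready:
            _, name = min(ready)         # earliest deadline wins the slot
            remaining[name] -= 1
            timeline.append(name)
        else:
            timeline.append(None)
    return timeline
```

Because the table is computed at design time, the runtime only has to step through it, which is one way to avoid interrupt-driven context switches while still respecting deadlines.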
Nozomu TOGAWA Takashi SAKURAI Masao YANAGISAWA Tatsuo OHTSUKI
This letter proposes a hardware/software partitioning algorithm for digital signal processor cores with two register files. Given compiled assembly code and a timing constraint on execution time, the proposed algorithm generates a processor core configuration together with new assembly code that runs on the generated processor core. The proposed algorithm considers two register files and determines the number of registers in each register file. Moreover, the algorithm considers two or more types of functional units for each arithmetic or logical operation and assigns functional units with small area to a processor core without causing a performance penalty. A generated processor core has a smaller area than processor cores that have a single register file or that consider only one type of functional unit for each operation. The experimental results demonstrate the effectiveness and efficiency of the proposed algorithm.
Nozomu TOGAWA Yoshiharu KATAOKA Yuichiro MIYAOKA Masao YANAGISAWA Tatsuo OHTSUKI
Hardware/software partitioning is one of the key processes in a hardware/software cosynthesis system for digital signal processor cores. In hardware/software partitioning, area and delay estimation of a processor core plays an important role since the hardware/software partitioning process must determine which part of a processor core should be realized by hardware units and which part should be realized by a sequence of instructions based on execution time of an input application program and area of a synthesized processor core. This paper proposes area and delay estimation equations for digital signal processor cores. For area estimation, we show that total area for a processor core can be derived from the sum of area for a processor kernel and area for additional hardware units. Area for a processor kernel can be mainly obtained by minimum area for a processor kernel and overheads for adding hardware units and registers. Area for a hardware unit can be mainly obtained by its type and operation bit width. For delay estimation, we show that critical path delay for a processor core can be derived from the delay of a hardware unit which is on the critical path in the processor core. Experimental results demonstrate that errors of area estimation are less than 2% and errors of delay estimation are less than 2 ns when comparing estimated area and delay with logic-synthesized area and delay.
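The structure of the area estimate described above (kernel area plus hardware-unit areas, with the kernel growing by an overhead per added unit and register) can be sketched as follows. Every coefficient, the linear dependence on bit width, and all names are hypothetical placeholders; the abstract does not give the actual equations or constants.

```python
# Hypothetical area per bit of operand width for each unit type.
UNIT_AREA_PER_BIT = {"alu": 30.0, "multiplier": 400.0, "shifter": 25.0}

def estimate_core_area(units, num_registers,
                       min_kernel_area=20000.0,
                       overhead_per_unit=500.0,
                       overhead_per_register=150.0):
    """Estimate total core area = kernel area + sum of hardware unit areas.

    units: list of (unit_type, bit_width). All constants are assumed
    values for illustration, in arbitrary area units.
    """
    kernel_area = (min_kernel_area
                   + overhead_per_unit * len(units)
                   + overhead_per_register * num_registers)
    hardware_area = sum(UNIT_AREA_PER_BIT[kind] * width
                        for kind, width in units)
    return kernel_area + hardware_area
```

A partitioner can call such an estimator for each candidate configuration instead of running logic synthesis, which is the role the paper assigns to its estimation equations.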
Naoki MIZUTANI Shogo MURAMATSU Hisakazu KIKUCHI
A unified polyphase representation of analysis and synthesis filter banks is introduced in this paper, and its efficient implementation on digital signal processors (DSPs) is investigated. In particular, the number of memory accesses, power consumption, processing accuracy, and the required instruction cycles are discussed. First, a unified representation is given, and then two types of procedures, SIMO system-based and MISO system-based, are shown, where SIMO and MISO are abbreviations for single-input/multiple-output and multiple-input/single-output, respectively. These procedures are compared with each other. It is shown that the number of data loads in the SIMO system-based procedure is half of that in the MISO system-based procedure for two-channel filter banks. The implementation of M-channel filter banks is also discussed.
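As a reminder of what a polyphase structure buys, the sketch below compares direct filtering followed by downsampling by 2 with the equivalent two-phase polyphase form for one analysis channel. It is a textbook illustration only and does not implement the paper's SIMO or MISO procedures; all function names are assumptions.

```python
def fir(h, x):
    """Causal FIR filter; output has the same length as x."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def decimate_direct(h, x):
    """Filter at the full rate, then keep every second output sample."""
    return fir(h, x)[::2]

def decimate_polyphase(h, x):
    """Same result, computed at the low rate from two polyphase branches."""
    h_even, h_odd = h[0::2], h[1::2]
    x_even = x[0::2]                           # x[2m]
    x_odd = ([0] + x[1::2])[:len(x_even)]      # x[2m - 1], with x[-1] = 0
    y_even = fir(h_even, x_even)
    y_odd = fir(h_odd, x_odd)
    return [a + b for a, b in zip(y_even, y_odd)]
```

The polyphase form never computes the output samples that decimation would discard, so it needs roughly half the multiply-accumulate operations of the direct form for a two-channel bank.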
Nobuhiko SUGINO Akinori NISHIHARA
Digital signal processors (DSPs) usually employ indirect addressing using address registers (ARs) to indicate their memory addresses, which often introduces overhead code for updating the ARs before the next memory access. Reducing such overhead code is one of the important issues in the automatic generation of highly efficient DSP code. In this paper, a new automatic address allocation method incorporating computational order rearrangement at local commutative parts is proposed. The method formulates a given memory access sequence as a graph representation, in which several strategies for handling the freedom in memory access order at the computationally commutative parts are introduced and examined. A compiler scheme is also extended so that the computational order at the commutative parts is rearranged according to the derived memory allocation. The proposed methods are applied to an existing DSP compiler for the µPD77230 (NEC), and the codes generated for several examples are compared with those obtained from memory allocations by conventional methods.
Wataru KOBAYASHI Noriaki SAKAMOTO Takao ONOYE Isao SHIRAKAWA
This paper describes a realtime 3D sound localization algorithm to be implemented with the use of a low power embedded DSP. A distinctive feature of this implementation approach is that the audible frequency band is divided into three, in accordance with the analysis of the sound reflection and diffraction effects through different media from a certain sound source to human ears. In the low, intermediate, and high frequency subbands, different schemes of the 3D sound localization are devised by means of an IIR filter, parametric equalizers, and a comb filter, respectively, so as to be run realtime on a low power embedded DSP. This algorithm aims at providing a listener with the 3D sound effects through headphones at low cost and low power consumption.
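The abstract does not specify the form of the comb filter used in the high-frequency subband. As a generic illustration, a feedforward comb filter y[n] = x[n] + g·x[n − D] can be written as below; the function name, the feedforward form, and the parameters are assumptions.

```python
def feedforward_comb(x, delay, gain):
    """Feedforward comb filter: y[n] = x[n] + gain * x[n - delay].

    Samples before the start of x are treated as zero.
    Illustrative form only; the paper's filter may differ.
    """
    return [
        x[n] + gain * (x[n - delay] if n >= delay else 0.0)
        for n in range(len(x))
    ]
```

A filter of this kind costs one multiply and one add per sample plus a delay line, which fits the low-cost, low-power goal stated in the abstract.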
Tatsuo WATANABE Nagisa ISHIURA
This letter presents a method that attempts to minimize the number of spill codes needed to resolve usage conflicts of distributed registers in application-specific DSPs. It searches for a set of ordering restrictions among operations that sequentializes, as much as possible, the lifetimes of the values residing in the same register. Experimental results show that the proposed method reduces the number of register spills to 28%.
Byung In MOON Dong Ryul RYU Jong Wook HONG Tae Young LEE Sangook MOON Yong Surk LEE
We have designed a 32-bit RISC microprocessor with 16-/32-bit fixed-point DSP functionality. This processor, called YD-RISC, combines both general-purpose microprocessor and digital signal processor (DSP) functionality using the reduced instruction set computer (RISC) design principles. It has functional units for arithmetic operation, digital signal processing (DSP) and memory access. They operate in parallel in order to remove stall cycles after DSP or load/store instructions, which usually need one or more issue latency cycles in addition to the first issue cycle. High performance was achieved with these parallel functional units while adopting a sophisticated five-stage pipeline structure. The pipelined DSP unit can execute one 32-bit multiply-accumulate (MAC) or 16-bit complex multiply instruction every one or two cycles through two 17-b × 17-b multipliers and an operand examination logic circuit. Power-saving techniques such as power-down mode and disabling execution blocks allow low power consumption. In the design of this processor, we use logic synthesis and automatic place-and-route. This top-down approach shortens design time, while a high clock frequency is achieved by refining the processor architecture.
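A complex multiply needs four real multiplications, so with two multipliers it naturally takes two multiplier passes. The sketch below shows that arithmetic only; how YD-RISC actually schedules the products is not stated in the abstract, so the two-pass grouping and the function name are assumptions.

```python
def complex_multiply(a_re, a_im, b_re, b_im):
    """(a_re + j*a_im) * (b_re + j*b_im) using two multiplies per pass.

    The pass grouping is an assumed schedule, for illustration.
    """
    # Pass 1: both multipliers produce the terms of the real part.
    p0 = a_re * b_re
    p1 = a_im * b_im
    # Pass 2: both multipliers produce the terms of the imaginary part.
    p2 = a_re * b_im
    p3 = a_im * b_re
    return p0 - p1, p2 + p3
```

For 16-bit signed operands each product fits in 32 bits, so a 17-b × 17-b multiplier, which also accommodates unsigned 16-bit operands, is sufficient for every term.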
Boon-Keat TAN Ryuji YOSHIMURA Toshimasa MATSUOKA Kenji TANIGUCHI
A new architecture-based Dynamically Programmable Arithmetic Array processor (DPAA) is proposed for general-purpose digital signal processing applications. Parallelism and pipelining are achieved by using DPAA, which consists of various basic arithmetic blocks connected through a code-division multiple access bus interface. The proposed architecture provides 100% interconnection flexibility because connections are made virtually through code matching instead of physical wire connections. Compared with conventional multiplexing architectures, the proposed interconnection topology consumes less chip area, and thus more arithmetic blocks can be incorporated. A 16-bit prototype chip incorporating 10 multipliers and 40 other arithmetic blocks was implemented on a 4.5 mm × 4.5 mm chip in a 0.6 µm CMOS process. DPAA also features simple programmability, as numerical formulas can be used to configure the processor without programming languages or specialized CAD tools.