Daisuke SUZUKI Tsutomu MATSUMOTO
This paper describes a modular exponentiation processing method and circuit architecture that can exhibit the maximum performance of FPGA resources. The proposed modular exponentiation architecture combines three main techniques. The first is to improve the Montgomery multiplication algorithm in order to maximize the performance of the multiplication unit in an FPGA. The second is to balance and improve the circuit delay. The third is to ensure scalability of the circuit. Our architecture can perform fast operations using small-scale resources; in particular, it can complete a 512-bit modular exponentiation in as little as 0.26 ms with the smallest Virtex-4 FPGA, XC4VF12-10SF363. The design uses approximately 4200 SLICEs, which demonstrates its compactness. Moreover, the scalability of our design also allows 1024-, 1536-, and 2048-bit modular exponentiations to be processed in the same circuit.
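For orientation, the textbook form of Montgomery multiplication and a left-to-right square-and-multiply exponentiation built on it can be sketched as follows. This is a minimal software sketch of the standard algorithm only, not the improved FPGA-oriented variant the abstract refers to; the function names (`montgomery_setup`, `mont_mul`, `mont_pow`) are hypothetical, and the modulus is assumed to be odd.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n < 2**k (assumed odd)."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r   # n * n_prime == -1 (mod r); needs Python 3.8+
    r2 = (r * r) % n                 # used to convert into Montgomery form
    return n_prime, r2


def mont_mul(a, b, n, n_prime, k):
    """Return a * b * R^-1 mod n with R = 2**k, for 0 <= a, b < n."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # makes t + m*n divisible by R
    u = (t + m * n) >> k                # exact division by R
    return u - n if u >= n else u       # single conditional subtraction


def mont_pow(base, exp, n):
    """Compute base**exp mod n by square-and-multiply in Montgomery form."""
    k = n.bit_length()
    n_prime, r2 = montgomery_setup(n, k)
    x = mont_mul(base % n, r2, n, n_prime, k)   # base in Montgomery form
    acc = (1 << k) % n                          # 1 in Montgomery form
    for bit in bin(exp)[2:]:                    # most significant bit first
        acc = mont_mul(acc, acc, n, n_prime, k)
        if bit == "1":
            acc = mont_mul(acc, x, n, n_prime, k)
    return mont_mul(acc, 1, n, n_prime, k)      # convert back to normal form
```

The point of the Montgomery form is that the reduction step needs only masking and shifting by a power of two instead of division by the modulus, which is what makes it a natural fit for hardware multipliers.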
Bing-Fei WU Hao-Yu HUANG Yen-Lin CHEN Hsin-Yuan PENG Jia-Hsiung HUANG
This study presents several optimization approaches for the MPEG-2/4 Advanced Audio Coding (AAC) Low Complexity (LC) encoding and decoding processes. Considering the power consumption and the peripherals required for consumer electronics, this study adopts the TI OMAP5912 platform for portable devices. An important optimization issue for implementing an AAC codec on embedded and mobile devices is to reduce computational complexity and memory consumption. Because of power-saving constraints, most embedded and mobile systems can provide only very limited computational power and memory resources for the coding process. As a result, modifying and simplifying only one or two blocks is insufficient for optimizing the AAC encoder and enabling it to work well on embedded systems. It is therefore necessary to enhance the computational efficiency of other important modules in the encoding algorithm. This study focuses on optimizing the Temporal Noise Shaping (TNS), Mid/Side (M/S) Stereo, Modified Discrete Cosine Transform (MDCT) and Inverse Quantization (IQ) modules in the encoder and decoder. Furthermore, we also propose an efficient memory reduction approach that provides a satisfactory balance between the reduction of memory usage and the expansion of the encoded files. In the proposed design, both the AAC encoder and decoder are built with fixed-point arithmetic operations and implemented on a DSP processor combined with an ARM core for peripheral control. Experimental results demonstrate that the proposed AAC codec is computationally effective, has low memory consumption, and is suitable for low-cost embedded and mobile applications.
The Fast Fourier Transform (FFT) is an important algorithm in many digital signal processing applications, and it often requires parallel implementation for high throughput. In this paper, we first present the SmartCell coarse-grained reconfigurable architecture targeted for stream processing. A SmartCell prototype integrates 64 processing elements, configurable interconnections, and dedicated instruction and data memories into a single chip, which provides high-performance parallel processing while maintaining post-fabrication flexibility. Subsequently, we present a parallel FFT algorithm targeted for multi-core computing systems. This algorithm provides an optimized data flow pattern that reduces both communication and configuration overheads. The proposed parallel FFT algorithm is then mapped onto the SmartCell prototype device. Results show that the parallel FFT implementation on SmartCell is about 14.9 and 2.7 times faster than network-on-chip (NoC) and MorphoSys implementations, respectively. SmartCell also achieves energy efficiency gains of 2.1 and 28.9 times when compared with FPGA and DSP implementations, respectively.
Koichi ISHIHARA Takayuki KOBAYASHI Riichi KUDO Yasushi TAKATORI Akihide SANO Yutaka MIYAMOTO
In this paper, we use frequency-domain equalization (FDE) to create coherent optical single-carrier (CO-SC) transmission systems that are very tolerant of chromatic dispersion (CD) and polarization mode dispersion (PMD). The efficient transmission of a 25-Gb/s NRZ-QPSK signal by using the proposed FDE is demonstrated under severe CD and PMD conditions. We also discuss the principle of FDE and some techniques suitable for implementing CO-SC-FDE. The results show that a CO-SC-FDE system is very tolerant of CD and PMD and can achieve high transmission rates over single mode fiber without optical dispersion compensation.
The ultra-fast switching speed of superconducting digital circuits enables the realization of Digital Signal Processors with performance unattainable by any other technology. Based on rapid single flux quantum (RSFQ) logic, these integrated circuits are capable of delivering high computation capacity of up to 30 GOPS on a single processor and a very short latency of 0.1 ns. There are two main applications of such hardware for practical telecommunication systems: filters for superconducting ADCs operating with digital RF data and recursive filters at baseband. The latter of these allows functions such as multiuser detection for 3G WCDMA, equalization and channel precoding for 4G OFDM MIMO, and general blind detection. The performance gain is an increase in the cell capacity, quality of service, and transmitted data rate. The current status of the development of the RSFQ baseband DSP is discussed. Major components with an operating speed of 30 GHz have been developed. Designs, test results, and future development of the complete systems, including cryopackaging and the CMOS interface, are reviewed.
In this letter, an efficient hardware platform for the digital signal processing for OFDM communication systems is presented. The hardware platform consists of a single FPGA having 900 K gates, two DSPs with maximum 8,000 MIPS at 1 GHz clock, 2-channel ADC and DAC supporting maximum 125 MHz sampling rate, and flexible data bus architecture, so that a wide variety of baseband signal processing algorithms for practical OFDM communication systems may be implemented and tested. The IEEE 802.16d software modem is also presented in order to verify the effectiveness and usefulness of the designed platform.
Dang Hai PHAM Jing GAO Takanobu TABATA Hirokazu ASATO Satoshi HORI Tomohisa WADA
In the application targeted here, four on-glass antenna elements are set in an automobile to improve the reception quality of a mobile ISDB-T receiver. With regard to the directional characteristics of each antenna, we propose and implement a joint Pre-FFT adaptive array antenna and Post-FFT space diversity combining (AAA-SDC) scheme for a mobile ISDB-T receiver. By applying a joint hardware and software approach, a flexible platform is realized in which several system configuration schemes can be supported; the receiver can be reconfigured on the fly. Simulation results show that the AAA-SDC scheme drastically improves the performance of a mobile ISDB-T receiver, especially in the region of large Doppler shift. The experimental results from a field test also confirm that the proposed AAA-SDC scheme achieves a reception rate of up to 100% while moving at a speed of 80 km/h.
Tetsuya OSHIKATA Hirofumi MATSUO
This paper presents a partially resonant active filter based on a digital PWM control circuit with a DSP that can improve the power factor and input current harmonic distortion factor of distributed power supply systems in communications buildings. The steady-state and dynamic characteristics of this active filter are analyzed experimentally, and the relationship between the control variables of the digital control circuit with the DSP and performance characteristics such as regulation of the output voltage, input power factor, input current harmonic distortion factor, stability boundaries, and transient response is defined. Using the partially resonant circuit, the efficiency is over 91%, which is 0.9 points higher than that of the non-resonant circuit, and the high-frequency switching noise is suppressed. Furthermore, the digital control strategy with the DSP proposed in this paper achieves a superior transient response of input current and output voltage for a step change of load, a power factor over 0.99, and a total harmonic distortion factor of less than 1.1%.
Hideyuki FURUHASHI Yoshinobu KAJIKAWA Yasuo NOMURA
In this paper, we propose a low complexity realization method for compensating for nonlinear distortion. Generally, nonlinear distortion is compensated for by a linearization system using a Volterra kernel. However, this method has a problem of requiring a huge computational complexity for the convolution needed between an input signal and the 2nd-order Volterra kernel. The Simplified Volterra Filter (SVF), which removes the lines along the main diagonal of the 2nd-order Volterra kernel, has been previously proposed as a way to reduce the computational complexity while maintaining the compensation performance for the nonlinear distortion. However, this method cannot greatly reduce the computational complexity. Hence, we propose a subband linearization system which consists of a subband parallel cascade realization method for the 2nd-order Volterra kernel and subband linear inverse filter. Experimental results show that this proposed linearization system can produce the same compensation ability as the conventional method while reducing the computational complexity.
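To make the complexity issue concrete, a direct (non-subband) 2nd-order Volterra filter can be sketched as below. This illustrates only the conventional full-kernel computation that the abstract describes as costly; it is not the proposed subband parallel cascade realization. The function name `volterra2` and the zero-padded delay line are assumptions made for this example.

```python
def volterra2(x, h1, h2):
    """Direct 2nd-order Volterra filter (illustrative sketch).

    y[n] = sum_i h1[i] x[n-i] + sum_i sum_j h2[i][j] x[n-i] x[n-j]
    Samples before the start of x are assumed to be zero.
    """
    n_taps = len(h1)
    y = []
    for n in range(len(x)):
        # Delay line holding x[n], x[n-1], ..., x[n-n_taps+1]
        d = [x[n - i] if n - i >= 0 else 0.0 for i in range(n_taps)]
        linear = sum(h1[i] * d[i] for i in range(n_taps))
        # The double loop costs n_taps**2 multiply-accumulates per output sample
        quadratic = sum(h2[i][j] * d[i] * d[j]
                        for i in range(n_taps) for j in range(n_taps))
        y.append(linear + quadratic)
    return y
```

With a kernel memory of N taps, the quadratic term needs on the order of N squared products per output sample, which is the cost that simplified and subband realizations aim to reduce.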
Kazutami ARIMOTO Toshihiro HATTORI Hidehiro TAKATA Atsushi HASEGAWA Toru SHIMIZU
Many embedded system applications in ubiquitous networks strongly require high-performance SoCs that overcome the physical limitations of advanced CMOS. Continuous design efforts have been made to develop these SoCs. The initial efforts were primitive-level circuit techniques and power switching control methods for suppressing standby currents. However, as additional physical limitations and system enhancements have become the main factors, new design efforts have been proposed. These are application-oriented technologies that span from the system level to the device level. This paper introduces a self voltage control technique to cancel PVT (process, voltage, and temperature) variation, power distribution and power management for cellular phone applications, a DSP with a parallel algorithm and optimized layout, and a massively parallel fine-grained SIMD processor for next-generation multimedia applications. High-performance SoCs for embedded systems are achieved by providing system-level IP components and building an application-oriented SoC platform.
This paper presents a new approach that employs digital signal processing capabilities in the design of multi-bit continuous-time (CT) Delta Sigma modulators (DSMs). It proposes discrete-time (DT) pre-filtering before the DAC to solve the known problems of CT DSMs.
Dang Hai PHAM Takanobu TABATA Hirokazu ASATO Satoshi HORI Tomohisa WADA
In this paper, an adaptive array antenna is implemented to enhance the performance of digital TV ISDB-T reception. Issues in realizing the proposed array antenna and its implementation by a joint hardware-software solution are also presented. Instead of using known reference signals, the proposed method utilizes the GI (Guard Interval) and the periodic property of the OFDM signal as a constraint to realize MRC (Maximum Ratio Combining) and SMI (Sample Matrix Inversion) adaptive beam-forming algorithms. Experimental results show that the proposed system drastically improves the quality of reception. Moreover, the proposed system can achieve excellent performance under strong interference.
Weon Heum PARK Myung Hoon SUNWOO Seong Keun OH
This paper proposes efficient DSP instructions and their hardware architecture for the Viterbi algorithm. The implementation of the Viterbi algorithm on a DSP chip has been attracting more interest because of its flexibility and programmability. The proposed architecture can reduce the Trace Back (TB) latency and can support various wireless communication standards. The proposed instructions perform the Add Compare Select (ACS) and TB operations in parallel, and the architecture has special hardware, called the Offset Calculation Unit (OCU), which automatically calculates data addresses to accelerate the trellis butterfly computations. When the constraint length K is 5, the proposed architecture can reduce the decoding cycles by about 17% compared with the Carmel DSP and by about 45% compared with the TMS320C55x.
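The two operations named above, Add Compare Select and Trace Back, can be sketched in software as a plain hard-decision Viterbi decoder. This is an illustrative sequential version, not the proposed instructions or the OCU; the generator polynomials (23, 35 in octal) for K = 5, the function names, and the assumption that the encoder is flushed with K - 1 zero bits so that trace-back starts from state 0 are all choices made for this example.

```python
def parity_outputs(reg, gens):
    """Output bits for a K-bit shift register value: parity of (reg AND g)."""
    return [bin(reg & g).count("1") & 1 for g in gens]


def conv_encode(bits, gens, K):
    """Rate 1/len(gens) convolutional encoder; newest bit enters at the MSB."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state
        out.extend(parity_outputs(reg, gens))
        state = reg >> 1
    return out


def viterbi_decode(rx, gens, K):
    """Hard-decision Viterbi decoder; assumes the trellis ends in state 0."""
    n_states, n_out = 1 << (K - 1), len(gens)
    inf = float("inf")
    metric = [0] + [inf] * (n_states - 1)   # encoder starts in state 0
    survivors = []
    for t in range(0, len(rx), n_out):
        sym = rx[t:t + n_out]
        new_metric = [inf] * n_states
        prev = [0] * n_states
        for state in range(n_states):
            if metric[state] == inf:
                continue
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                branch = sum(e != r for e, r in zip(parity_outputs(reg, gens), sym))
                nxt = reg >> 1
                cand = metric[state] + branch      # add
                if cand < new_metric[nxt]:         # compare
                    new_metric[nxt] = cand         # select
                    prev[nxt] = state              # remember the survivor
        metric = new_metric
        survivors.append(prev)
    # Trace back from state 0 (zero-tail assumption) to recover the input bits
    state, bits = 0, []
    for prev in reversed(survivors):
        bits.append(state >> (K - 2))   # input bit is the MSB of the state reached
        state = prev[state]
    bits.reverse()
    return bits
```

In this sketch the trace-back runs only after all ACS steps have finished, which is the latency that performing ACS and TB in parallel is meant to reduce.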
Ioannis PAPAIOANNOU Chrissavgi DRE
In this paper the development of the control plane for the frame decoding functionality of an IEEE 802.16 Wireless MAN system is described. It is implemented in two ways. The first implementation is based on a general-purpose microprocessor, specifically the one provided in the Texas Instruments TMS320C64xx family devices. The second implementation is based on Intel's IXP2400 Network Processor chip, and the preceding functions are implemented by writing embedded software for that part. The two implementations are compared, and the comparison leads to some very useful results. The development of time-critical tasks of a MAC protocol stack in software, mainly based on a Network Processor, opens paths for very effective system architectures in which the Network Processor runs the full networking and MAC/DLC processing of such telecom systems. The main question is: can the lower MAC be executed on a Network Processor or not? This manuscript attempts to answer this question.
Ryota KIMURA Ryuhei FUNADA Hiroshi HARADA Manabu SAWADA Shoji SHINODA
This paper proposes a simple timing synchronization method for designing a timing synchronization circuit with low-complexity, low-volume digital signal processing (DSP) for orthogonal frequency division multiplexing (OFDM) packet transmission systems. The proposed method uses subtraction to acquire the timing metric of the fast Fourier transform (FFT) window, whereas conventional methods use multiplication. This paper applies the proposed method to a standardized OFDM format, IEEE 802.11a, and shows by computer simulation that it achieves transmission performance as good as that of the conventional method in fast time-variant multipath Rayleigh fading channels.
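As a rough illustration of how a subtraction-based timing metric can work, the sketch below looks for the first position at which the received signal repeats with a known period, using only differences and absolute values. The exact metric of the proposed method is not given in the abstract, so the metric form, the threshold test, and the names `diff_metric` and `detect_start` are assumptions; the period of 16 samples in the usage below corresponds to the repetition of the IEEE 802.11a short training symbols.

```python
def diff_metric(r, d, period, window):
    """Sum of |Re| + |Im| of r[d+k] - r[d+k+period]; small when r is periodic.

    Only subtractions, additions and absolute values are used (no multiplies).
    """
    total = 0.0
    for k in range(window):
        diff = r[d + k] - r[d + k + period]
        total += abs(diff.real) + abs(diff.imag)
    return total


def detect_start(r, period, window, threshold):
    """Return the first index whose metric falls below threshold, else None."""
    for d in range(len(r) - period - window + 1):
        if diff_metric(r, d, period, window) < threshold:
            return d
    return None
```

A conventional correlation metric would instead multiply r[d+k] by the conjugate of r[d+k+period]; replacing that product with a difference is what removes the multipliers from the circuit. In a noisy channel the threshold would have to be chosen relative to the signal level, which this sketch does not attempt.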
Takahiro KUMURA Norio KAYAMA Shinichi SHIONOYA Kazuo KUMAGIRI Takao KUSANO Makoto YOSHIDA Masao IKEKAWA Ichiro KURODA Takao NISHITANI
This paper provides a performance evaluation of our audio and video CODEC by using a method for rapidly verifying and evaluating overall performance on real-time workloads of system LSIs integrated with SPXK5SC DSP cores. The SPXK5SC has been developed as a DSP core well suited to system LSIs. Although it is very important to evaluate the overall performance of target LSIs on real workloads before actual LSI fabrication, software simulators are too slow to deal with real workloads, and full hardware prototyping cannot respond well to design improvements. Therefore, we have developed a hardware emulation approach to be used on system LSIs integrated with an SPXK5SC DSP core in order to evaluate the overall performance of an audio/video CODEC on a target system. Our emulation system, which uses a DSP core TEG with a bus interface and an FPGA, is suitable for overall system evaluation on real-time workloads as well as for architectural investigation. In this paper, we discuss the use of the emulation system in evaluating performance during AV CODEC execution. In addition, an architecture design based on our emulation system is also described.
Takefumi MIYOSHI Nobuhiko SUGINO
A novel unified phase compiler framework for embedded VLIWs and DSPs is shown. In this compiler, a given program is represented in a 3-D representation space, which enables quantitative estimation of the required resources and elapsed time. Transformation of a 3-D representation graph that corresponds to a code optimization method for a specific processor architecture is also proposed. The proposed compiler and the code optimization methods are compared with an ordinary compiler in terms of their generated codes. The results demonstrate their effectiveness.
Gweon-Do JO Min-Joung SHEEN Seung-Hwan LEE Kyoung-Rok CHO
As the code division multiple access (CDMA) based third generation cellular infrastructure requires high performance signal processing in a baseband modem, an application-specific integrated circuit or a field-programmable gate array has commonly been used for chip rate processing. In this paper, the use of digital signal processors (DSP) is explored for a cdma2000 and a wideband CDMA channel modem with the goal of increasing flexibility. The design concepts of the prototype software-defined radio platform we implemented to estimate the potential and feasibility of commercial SDR platforms are presented. We discuss the hardware and software architecture of the platform, considerations for reconfigurability, and the test results. We also address practical issues for real-time chip rate processing and optimization schemes of DSP software, and provide detailed measurement results of DSP performance.
This paper presents a VLSI design methodology for the MAC-level DWT/IDWT processor based on a novel limited-resource scheduling algorithm. The r-split Fully-specified Signal Flow Graph (FSFG) of limited-resource FIR filtering has been developed for the scheduling of the MAC-level DWT/IDWT signal processing. Given a set of architecture constraints and DWT parameters, the scheduling algorithm can generate four scheduling matrices that drive the data path to perform the DWT computation. Because the inter-octave memory is shared with the registers of the FIR filter, the memory size is smaller than that of the traditional architecture. In addition, based on the limited-resource scheduling algorithm, an automated DWT processor synthesizer has been developed that generates constrained DWT processors in the form of silicon intellectual property (SIP). The DWT SIP can be embedded into an SoC or mapped to program code for commercial off-the-shelf (COTS) DSP processors with programmable devices. As a result, it has been shown that a variety of DWT SIPs can be efficiently realized by tuning the parameters and applied to signal processing applications.
A very long instruction word (VLIW) digital signal processor (DSP), called ODiN, which can execute six instructions simultaneously in a single cycle, is designed and fabricated using a 0.25 µm 1-poly 5-metal standard cell static CMOS process. The ODiN core delivers a maximum of 600 MIPS with a 100 MHz system clock. In order to achieve high-performance operation, the designed core includes compact register files, an orthogonal instruction set, single-cycle operations for most instructions, and parallel processing based on software scheduling. In addition, an embedded Viterbi decoder processor and an embedded FFT processor make it possible to implement software defined radio (SDR) applications efficiently.