Masaki TSUKAMOTO Yoshiko HANADA Masahiro NAKAO Keiji YAMAMOTO
The Order/Radix Problem (ORP) is an optimization problem that can be solved to find an optimal network topology in distributed memory systems. It is important to find the optimum number of switches in the ORP. In the case of a regular graph, a good estimation of the preferred number of switches has been proposed, and it has been shown that simulated annealing (SA) finds a good solution given a fixed number of switches. However, generally the optimal graph does not necessarily satisfy the regular condition, which greatly increases the computational costs required to find a good solution with a suitable number of switches for each case. This study improved the new method based on SA to find a suitable number of switches. By introducing neighborhood searches in which the number of switches is increased or decreased, our method can optimize a graph by changing the number of switches adaptively during the search. In numerical experiments, we verified that our method shows a good approximation for the best setting for the number of switches, and can simultaneously generate a graph with a small host-to-host average shortest path length, using instances presented by Graph Golf, an international ORP competition.
Hao SAN Rompei SUGAWARA Masao HOTTA Tatsuji MATSUURA Kazuyuki AIHARA
A 12-bit 1.25MS/s cyclic analog-to-digital converter (ADC) is designed and fabricated in 90nm CMOS technology, and only occupies an active area as small as 0.037mm2. The proposed ADC is composed of a non-binary AD convertion stage, and a on-chip non-binary-to-binary digital block includes a built-in radix-value self-estimation scheme. Therefore, althouh a non-binary convertion architechture is adopted, the proposed ADC is the same as other stand-alone binary ADCs. The redundancy of non-binary 1-bit/step architecture relaxes the accuracy requirement on analog components of ADC. As a result, the implementation of analog circuits such as amplifier and comparator becomes simple, and high-density Metal-Oxide-Metal (MOM) capacitors can be used to achieve a small chip area. Furthermore, the novel radix-value self-estimation technique can be realized by only simple logic circuits without any extra analog input, so that the total active area of ADC is dramatically reduced. The prototype ADC achieves a measured peak signal-to-noise-and-distortion-ratio (SNDR) of 62.3dB using a poor DC gain amplifier as low as 45dB and MOM capacitors without any careful layout techniques to improve the capacitor matching. The proposed ADC dissipated 490µW in analog circuits at 1.4V power supply and 1.25Msps (20MHz clocking). The measured DNL is +0.94/-0.71LSB and INL is +1.9/-1.2LSB at 30kHz sinusoidal input.
Rompei SUGAWARA Hao SAN Kazuyuki AIHARA Masao HOTTA
Proof-of-concept cyclic analog-to-digital converters (ADCs) have been designed and fabricated in 90-nm CMOS technology. The measurement results of an experimental prototype demonstrate the effectiveness of the proposed switched-capacitor (SC) architecture to realize a non-binary ADC based on β expansion. Different from the conventional binary ADC, a simple 1-bit/step structure for an SC multiplying digital-to-analog converter (MDAC) is proposed to present residue amplification by β (1 < β < 2). The redundancy of non-binary ADCs with radix β tolerates the non-linear conversion errors caused by the offsets of comparators, the mismatches of capacitors, and the finite DC gains of amplifiers, which are used in the MDAC. We also employed a radix value estimation algorithm to obtain an effective value of β for non-binary encoding; it can be realized by merely adding a simple conversion sequence and digital circuits. As a result, the power penalty of a high-gain wideband amplifier and the required accuracy of the circuit elements for a high-resolution ADC were largely relaxed so that the circuit design was greatly simplified. The implemented ADC achieves a measured peak signal-to-noise-and-distortion-ratio (SNDR) of 60.44dB, even with an op-amp with a poor DC gain (< 50dB) while dissipating 780µW in analog circuits at 1.4V and occupying an active area of 0.25 × 0.26mm2.
Rie SUZUKI Tsubasa MARUYAMA Hao SAN Kazuyuki AIHARA Masao HOTTA
In this paper, a robust cyclic ADC architecture with β-encoder is proposed and circuit scheme using switched-capacitor (SC) circuit is introduced. Different from the conventional binary ADC, the redundancy of proposed cyclic ADC outputs β-expansion code and has an advantage of error correction. This feature makes ADC robust against the offset of comparator capacitor mismatch and finite DC gain of amplifier in multiplying-DAC (MDAC). Because the power penalty of high-gain wideband amplifier and the required accuracy of circuit elements for high resolution ADC can be relaxed, the proposed architecture is suitable for deep submicron CMOS technologies beyond 90 nm. We also propose a β-value estimation algorithm to realize high accuracy ADC based on β-expansion. The simulation results show the effectiveness of proposed architecture and robustness of β-encoder.
Dongpei LIU Hengzhu LIU Botao ZHANG Jianfeng ZHANG Shixian WANG Zhengfa LIANG
High-performance FFT processor is indispensable for real-time OFDM communication systems. This paper presents a CORDIC based design of variable-length FFT processor which can perform various FFT lengths of 64/128/256/512/1024/2048/4096/8192-point. The proposed FFT processor employs memory based architecture in which mixed radix 4/2 algorithm, pipelined CORDIC, and conflict-free parallel memory access scheme are exploited. Besides, the CORDIC rotation angles are generated internally based on the transform of butterfly counter, which eliminates the need of ROM making it memory-efficient. The proposed architecture has a lower hardware complexity because it is ROM-free and with no dedicated complex multiplier. We implemented the proposed FFT processor and verified it on FPGA development platform. Additionally, the processor is also synthesized in 0.18 µm technology, the core area of the processor is 3.47 mm2 and the maximum operating frequency can be up to 500 MHz. The proposed FFT processor is better trade off performance and hardware overhead, and it can meet the speed requirement of most modern OFDM system, such as IEEE 802.11n, WiMax, 3GPP-LTE and DVB-T/H.
Rongchun LI Yong DOU Yuanwu LEI Shice NI Song GUO
This paper presents a parameterized multi-standard adaptive radix-4 Viterbi decoder with high throughput and low complexity. The proposed Viterbi decoder supports constraint lengths ranging from 3-9, code rates in the range of 1/2-1/3, and arbitrary truncation lengths. We present a novel fabric of Add-Compare-Select Unit (ACSU) and methods of unsigned quantization and efficient normalization that shorten the critical path. The decoder achieves a low bit error ratio in multiple standards, such as GPRS, WiMax, LTE, CDMA, and 3G. The proposed decoder is implemented on Xilinx XC5VLX330 device and the frequency achieved is 181.7 MHz. The throughput of the proposed decoder can reach 363 Mbps, which is superior to the other current multi-standard Viterbi decoders or radix-4 Viterbi decoders on the FPGA platform.
In this paper, a new DA (Distributed Arithmetic) filter implementation technique is presented. Contrary to the conventional DA technique using ROM, the proposed implementation technique avoids the use of ROM since it does not need the combinations of filter coefficients. Furthermore, by using the Radix-16 modified Booth algorithm, implementation complexity of the proposed structure can be reduced. Through the HDL coding and synthesis, it was shown that 41.6% of implementation area can be reduced.
Kazuyuki TANIMURA Ryuta NARA Shunitsu KOHARA Youhua SHI Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI
Modular multiplication is the most dominant arithmetic operation in elliptic curve cryptography (ECC), that is a type of public-key cryptography. Montgomery multiplier is commonly used to compute the modular multiplications and requires scalability because the bit length of operands varies depending on its security level. In addition, ECC is performed in GF(P) or GF(2n), and unified architecture for multipliers in GF(P) and GF(2n) is required. However, in previous works, changing frequency is necessary to deal with delay-time difference between GF(P) and GF(2n) multipliers because the critical path of the GF(P) multiplier is longer. This paper proposes unified dual-radix architecture for scalable Montgomery multiplications in GF(P) and GF(2n). This proposed architecture unifies four parallel radix-216 multipliers in GF(P) and a radix-264 multiplier in GF(2n) into a single unit. Applying lower radix to GF(P) multiplier shortens its critical path and makes it possible to compute the operands in the two fields using the same multiplier at the same frequency so that clock dividers to deal with the delay-time difference are not required. Moreover, parallel architecture in GF(P) reduces the clock cycles increased by dual-radix approach. Consequently, the proposed architecture achieves to compute a GF(P) 256-bit Montgomery multiplication in 0.28 µs. The implementation result shows that the area of the proposal is almost the same as that of previous works: 39 kgates.
Amir Sabbagh MOLAHOSSEINI Chitra DADKHAH Keivan NAVI Mohammad ESHGHI
In this paper, the new residue number system (RNS) moduli sets {22n, 2n -1, 2n+1 -1} and {22n, 2n -1, 2n-1 -1} are introduced. These moduli sets have 4n-bit dynamic range and well-formed moduli which can result in high-performance residue to binary converters as well as efficient RNS arithmetic unit. Next, efficient residue to binary converters for the proposed moduli sets based on mixed-radix conversion (MRC) algorithm are presented. The converters are ROM-free and they are realized using carry-save adders and modulo adders. Comparison with the other residue to binary converters for 4n-bit dynamic range moduli sets shown that the presented designs based on new moduli sets {22n, 2n -1, 2n+1 -1} and {22n, 2n -1, 2n-1 -1} are improved the conversion delay and result in hardware savings. Also, the proposed moduli sets can lead to efficient binary to residue converters, and they can speed-up internal RNS arithmetic processing, compared with the other 4n-bit dynamic range moduli sets.
This paper proposes a new sieving algorithm that employs a bucket sort as a part of a factoring algorithm such as the number field sieve. The sieving step requires an enormous number of memory updates; however, these updates usually cause cache hit misses. The proposed algorithm significantly reduces the number of cache hit misses when the size of the sieving region is roughly less than the square of the cache size, and the memory updates are several times faster than the straightforward implementation according to the PC experiments.
This paper presents a novel high-speed, low-complexity two-parallel 128-point radix-24 FFT/IFFT processor for MB-OFDM ultrawideband (UWB) systems. The proposed high-speed, low-complexity FFT architecture can provide a higher throughput rate and low hardware complexity by using a two-parallel data-path scheme and a single-path delay-feedback (SDF) structure. The radix-24 FFT algorithm is also realized in our processor to reduce the number of complex multiplications. The proposed FFT/IFFT processor has been designed and implemented with 0.18-µm CMOS technology in a supply voltage of 1.8 V. The proposed two-parallel FFT/IFFT processor has a throughput rate of up to 900 Msample/s at 450 MHz while requiring much smaller hardware complexity and low power consumption.
Yibo FAN Takeshi IKENAGA Satoshi GOTO
With the increase of key length used in public cryptographic algorithms such as RSA and ECC, the speed of Montgomery multiplication becomes a bottleneck. This paper proposes a high speed design of Montgomery multiplier. Firstly, a modified scalable high-radix Montgomery algorithm is proposed to reduce critical path. Secondly, a high-radix clock-saving dataflow is proposed to support high-radix operation and one clock cycle delay in dataflow. Finally, a hardware-reused architecture is proposed to reduce the hardware cost and a parallel radix-16 design of data path is proposed to accelerate the speed. By using HHNEC 0.25 µm standard cell library, the implementation results show that the total cost of Montgomery multiplier is 130 KGates, the clock frequency is 180 MHz and the throughput of 1024-bit RSA encryption is 352 kbps. This design is suitable to be used in high speed RSA or ECC encryption/decryption. As a scalable design, it supports any key-length encryption/decryption up to the size of on-chip memory.
Yukihiro IGUCHI Tsutomu SASAO Munehiro MATSUURA
In arithmetic circuits for digital signal processing, radixes other than two are often used to make circuits faster. In such cases, radix converters are necessary. However, in general, radix converters tend to be complex. This paper considers design methods for p-nary to binary converters. First, it considers Look-Up Table (LUT) cascade realizations. Then, it introduces a new design technique called arithmetic decomposition by using LUTs and adders. Finally, it compares the amount of hardware and performance of radix converters implemented by FPGAs. 12-digit ternary to binary converters on Cyclone II FPGAs designed by the proposed method are faster than ones by conventional methods.
In this paper, we propose CRadix sort, a new string sorting algorithm based on MSD radix sort. CRadix sort causes fewer cache misses than MSD radix sort by uniquely associating a small block of main memory called the key buffer to each key and temporarily storing a portion of each key into its corresponding key buffer. Experimental results in running time comparisons with other string sorting algorithms are provided for showing the effectiveness of CRadix sort.
This paper proposes the new radix-24 FFT algorithm and an efficient pipeline FFT architecture based on the algorithm for wideband OFDM systems. The proposed pipeline architecture has the same number of multipliers as that of the radix-22 algorithm. However, the multiplication complexity is reduced more than 30% by using the newly proposed CSD constant multipliers instead of the programmable multipliers. From the synthesis simulations of a standard 0.35 µm CMOS SAMSUNG process, the proposed CSD constant complex multiplier achieved a reduction of more than 60% of the power consumption/area when compared with the conventional programmable complex multiplier.
Tsuyoshi TAKAGI David REIS, Jr. Sung-Ming YEN Bo-Ching WU
Recently, the radix-3 representation of integers is used for the efficient implementation of pairing based cryptosystems. In this paper, we propose non-adjacent form of radix-r representation (rNAF) and efficient algorithms for generating rNAF. The number of non-trivial digits is (r-2)(r+1)/2 and its average density of non-zero digit is asymptotically (r-1)/(2r-1). For r=3, the non-trivial digits are {2, 4} and the non-zero density is 0.4. We then investigate the width-w version of rNAF for the general radix-r representation, which is a natural extension of the width-w NAF. Finally we compare the proposed algorithms with the generalized NAF (gNAF) discussed by Joye and Yen. The proposed scheme requires a larger table but its non-zero density is smaller even for large radix. We explain that gNAF is a simple degeneration of rNAF--we can consider that rNAF is a canonical form for the radix-r representation. Therefore, rNAF is a good alternative to gNAF.
This paper proposes a new modified radix-24 FFT algorithm and an efficient pipeline FFT architecture based on this algorithm for OFDM systems. This pipeline FFT architecture has the same number of multipliers as that of the radix-22 algorithm. However, the multiplication complexity could be reduced by more than 30% by replacing one half of the programmable multipliers by the newly proposed CSD constant multipliers. From the synthesis simulations of a standard 0.35 µm CMOS SAMSUNG process, a proposed CSD constant complex multiplier achieved more than 60% area efficiency when compared to the conventional programmable complex multiplier. This promoted efficiency could be used to the design of a long length FFT processor in wireless OFDM applications, which needs more power and area efficiency.
Youn-Seog CHANG Sin-Chong PARK
In this study, we analyze the memory-based architecture of the FFT processor using the radix-4, and propose a novel mechanism that improves the throughput while simultaneously decreasing the area using single-port memories with several banks.
Algorithms are presented for the four elementary arithmetic operations, to perform reliable floating-point arithmetic operations. These arithmetic operations can be achieved by applying residue techniques to the weighted number systems and performed with no accuracy lost in the process of the computing. The arithmetic operations presented can be used as elementary tools (on many existing architectures) to ensure the reliability of numerical computations. Simulation results especially for the solutions of ill-conditioned problems are given with emphasis on the practical usability of the tools.
Sang-Hoon LEE Seung-Jun BAE Hong-June PARK
The radix-64 encoding scheme was used to reduce the number of partial products in the 5454 CMOS parallel multiplier. The transistor counts, the chip area and the power-delay product were reduced by 28%, 22%, and 17%, respectively, compared to any of the published 5454 CMOS parallel multipliers. A redundant binary (RB) number system was used to represent any of the 65 multiplying coefficients as a RB number which consists of two of 9 fundamental multiplying coefficients and their complements. The resultant RB partial products were added by using optimized RB adders. The total transistor count of the proposed multiplier was 43,579. The chip area in 0.25 µm CMOS process with 5 metal layers was 0.99 mm2. The power consumption and the multiplication time were 111 mW and 6.9 ns, respectively.