In this comment we point out that the mapping from carry-propagation adders to carry-save adders in the context of shift-and-add multiplication is inconsistent. Based on this it is shown that the implementation in Ref.[1] does not achieve any complexity reduction in practice.
Ultra fast switching speed of superconducting digital circuits enable realization of Digital Signal Processors with performance unattainable by any other technology. Based on rapid-single-flux technology (RSFQ) logic, these integrated circuits are capable of delivering high computation capacity up to 30 GOPS on a single processor and very short latency of 0.1 ns. There are two main applications of such hardware for practical telecommunication systems: filters for superconducting ADCs operating with digital RF data and recursive filters at baseband. The later of these allows functions such as multiuser detection for 3G WCDMA, equalization and channel precoding for 4G OFDM MIMO, and general blind detection. The performance gain is an increase in the cell capacity, quality of service, and transmitted data rate. The current status of the development of the RSFQ baseband DSP is discussed. Major components with operating speed of 30 GHz have been developed. Designs, test results, and future development of the complete systems including cryopackaging and CMOS interface are reviewed.
Masayuki YOSHINO Katsuyuki OKEYA Camille VUILLAUME
We present a novel approach for computing 2n-bit Montgomery multiplications with n-bit hardware Montgomery multipliers. Smartcards are usually equipped with such hardware Montgomery multipliers; however, due to progresses in factoring algorithms, the recommended bit length of public-key schemes such as RSA is steadily increasing, making the hardware quickly obsolete. Thanks to our double-size technique, one can re-use the existing hardware while keeping pace with the latest security requirements. Unlike the other double-size techniques which rely on classical n-bit modular multipliers, our idea is tailored to take advantage of n-bit Montgomery multipliers. Thus, our technique increases the perenniality of existing products without compromises in terms of security.
Yasuhiro TAKAHASHI Toshikazu SEKINE Michio YOKOYAMA
This paper presents the implementation of a 31-tap FIR Hilbert transform digital filter chip used in the digital-IF receivers, to confirm the effectiveness of our new design method. Our design method that we previously reported is based on a computation sharing multiplier using a new horizontal and vertical common subexpression techniques. A 31-tap FIR Hilbert transform digital filter was implemented and fabricated in 0.35 µm CMOS standard cell library. The chip's core contains approximately 33k transistors and occupies 0.86 mm2. The chip also has an operating speed of 70 MHz over. The implementation results show that the proposed Hilbert transformer has a smallest cost factor and so that is a high performance filter.
Erik DAHMEN Katsuyuki OKEYA Tsuyoshi TAKAGI
The most time consuming operation to verify a signature with the Elliptic Curve Digital Signature Algorithm is a multi-scalar multiplication with two scalars. Efficient methods for its computation are the Shamir method and the Interleave method, whereas the performance of those methods can be improved by using general base-2 representations of the scalars. In exchange for the speed-up, those representations require the precomputation of several points that must be stored. In the case of two precomputed points, the Interleave method and the Shamir method provide the same, optimal efficiency. In the case of more precomputed points, only the Interleave method can be sped-up in an optimal way and is currently more efficient than the Shamir method. This paper proposes a new general base-2 representation of the scalars that can be used to speed up the Shamir method. It requires the precomputation of ten points and is more efficient than any other representation that also requires ten precomputed points. Therefore, the proposed method is the first to improve the Shamir method such that it is faster than the Interleave method.
Che-Wun CHIOU Chiou-Yng LEE An-Wen DENG Jim-Min LIN
Because fault-based attacks on cryptosystems have been proven effective, fault diagnosis and tolerance in cryptography have started a new surge of research and development activity in the field of applied cryptography. Without magnitude comparisons, the Montgomery multiplication algorithm is very attractive and popular for Elliptic Curve Cryptosystems. This paper will design a Montgomery multiplier array with a bit-parallel architecture in GF(2m) with concurrent error detection capability to protect it against fault-based attacks. The robust Montgomery multiplier array with concurrent error detection requires only about 0.2% extra space overhead (if m=512 is as an example) and requires four extra clock cycles compared to the original Montgomery multiplier array without concurrent error detection.
Lei PAN Wenhui ZHANG Arthur ASUNCION Ming Kin LAI Michael B. DILLENCOURT Lubomir F. BIC Laurence T. YANG
The Navigational Programming (NavP) methodology is based on the principle of self-migrating computations. It is a truly incremental methodology for developing parallel programs: each step represents a functioning program, and each intermediate program is an improvement over its predecessor. The transformations are mechanical and straightforward to apply. We illustrate our methodology in the context of matrix multiplication, showing how the transformations lead from a sequential program to a fully parallel program. The NavP methodology is conducive to new ways of thinking that lead to ease of programming and high performance. Even though our parallel algorithm was derived using a sequence of mechanical transformations, it displays certain performance advantages over the classical handcrafted Gentleman's Algorithm.
Masanobu KATAGI Toru AKISHITA Izuru KITAMURA Tsuyoshi TAKAGI
It has recently been reported that the performance of hyperelliptic curve cryptosystems (HECC) is competitive to that of elliptic curve cryptosystems (ECC). Concerning the security of HECC, the theta divisors play an important role. The scalar multiplication using a random base point is vulnerable to an exceptional procedure attack, which is a kind of side-channel attacks, using theta divisors. In the case of cryptographic protocols of the scalar multiplication using fixed base point, however, the exceptional procedure attack is not applicable. First, we present novel efficient scalar multiplication using theta divisors, which is the positive application of theta divisors on HECC. Second, we develop a window-based method using theta divisors that is secure against side-channel attacks. It is not obvious how to construct a base point D such that all pre-computed points are theta divisors. We present an explicit algorithm for generating such divisors.
This paper focuses on algorithms for an efficient scalar multiplication. It proposes two algorithms for computing points of the form 2kP in affine coordinates. One works for k=2, and the other works for an arbitrary natural number k. The efficiency of these algorithms is based on a trade-off between a field inversion and several field multiplications. Montgomery trick is used to implement this trade-off. Since a field inversion is usually more expensive than 10 field multiplications, the proposed algorithms are efficient in comparison with existing ones.
The Gallant-Lambert-Vanstone method [14](GLV method for short) is a scalar multiplication method for elliptic curve cryptography (ECC). In WAP WTLS [49], SEC 2 [44], ANSI X9.62 [1] and X9.63 [2], several domain parameters for applications of the GLV method are described. Curves with those parameters have efficiently-computable endomorphisms. Recently the GLV method for Jacobians of hyperelliptic curve (HEC) has also been studied. In this paper, we discuss applications of the GLV method to curves with real multiplication (RM). It is the first time to use RM for efficient scalar multiplication as far as we know. We describe the general algorithm for using such RM, and we show that some genus 2 curves with RM have enough effciency to be used in the GLV method as in the previous CM case. Moreover, we will see that such RM curves can be obtained abundantly unlike the previously proposed CM curves of genus 2.
Marcelo E. KAIHARA Naofumi TAKAGI
A hardware algorithm for modular multiplication/division which performs modular division, Montgomery multiplication, and ordinary modular multiplication is proposed. The modular division in our algorithm is based on the extended Euclidean algorithm. We employ our newly proposed computation method that consists of processing the multiplier from the most significant digit first to calculate Montgomery multiplication. Finally, the ordinary modular multiplication is based on shift-and-add multiplication. Each of these three operations is carried out through the iteration of simple operations such as shifts and additions/subtractions. To avoid carry propagation in all additions and subtractions, the radix-2 signed-digit representation is employed. A modular multiplier/divider based on the algorithm has a linear array structure with a bit-slice feature and carries out n-bit modular multiplication/division in O(n) clock cycles, where the length of the clock cycle is constant and independent of n. This multiplier/divider can be implemented using a hardware amount only slightly larger than that of the modular divider.
Shoichi MASUI Kenji MUKAIDA Masahiko TAKENAKA Naoya TORII
High-speed, area-efficient, and low-power Montgomery modular multipliers for RSA algorithm have been developed for digital signature and user authentication in high-speed network systems and smart card LSIs. The multiplier-accumulators (MAC) in the developed Montgomery modular multipliers have a non-identical multiplicand/multiplier word length organization. This organization can eliminate the bandwidth bottleneck associated with a data memory, and enables to use a single-port memory for area and power reductions. The developed MAC is faster than the conventional identical word length organization due to the shortened critical path. For smart card applications, an area-efficient architecture with 42 kgates can produce 1.2 digital signatures in a second for 2,048-bit key length with the power consumption of 6.8 mW.
Masaki GONDA Kazuto MATSUO Kazumaro AOKI Jinhui CHAO Shigeo TSUJII
Genus 3 hyperelliptic curve cryptosystems are capable of fast-encryption on a 64-bit CPU, because a 56-bit field is enough for their definition fields. Recently, Kuroki et al. proposed an extension of the Harley algorithm, which had been known as the fastest addition algorithm of divisor classes on genus 2 hyperelliptic curves, on genus 3 hyperelliptic curves and Pelzl et al. improved the algorithm. This paper shows an improvement of the Harley algorithm on genus 3 hyperelliptic curves using Toom's multiplication. The proposed algorithm takes only I + 70M for an addition and I + 71M for a doubling instead of I + 76M and I + 74M respectively, which are the best possible of the previous works, where I and M denote the required time for an inversion and a multiplication over the definition field respectively. This paper also shows 2 variations of the proposed algorithm in order to adapt the algorithm to various platforms. Moreover this paper discusses finite field arithmetic suitable for genus 3 hyperelliptic curve cryptosystems and shows implementation results of the proposed algorithms on a 64-bit CPU. The implementation results show a 160-bit scalar multiplication can be done within 172 µs on a 64-bit CPU Alpha EV68 1.25 GHz.
This paper proposes fast elliptic curve multiplication algorithms resistant against side channel attacks, based on the Montgomery-type scalar multiplication. The proposed scalar multiplications can be applied to all curves over prime fields, e.g., any standardized curves over finite fields with characteristic larger than 3. The method utilizes the addition formulas xECDBL and xECADD assembled by only x-coordinates of points, and is applicable for any types of curves over finite fields. Then, we encapsulate two addition formulas into one formula xECADDDBL, which accomplishes a faster computation because several auxiliary variables of two formulas can be shared. We also develop a novel addition chain for the new formula, with which we can compute scalar multiplications. The improvement of our scalar multiplications over previous Coron's dummy operation method is about 18% for a 160-bit scalar multiplication. Our method requires no table-up of precomputed points and it is suitable for the implementation on memory constraint computing architectures, e.g., smart cards. Moreover, we optimize the proposed algorithms for parallelized implementations with SIMD operations. Compared with the similar scheme proposed by Fischer et al., our scheme is about 16% faster.
Yuuki TOMOEDA Hideyuki MIYAKE Atsushi SHIMBO Shinichi KAWAMURA
At CHES 2000, Schindler introduced a timing attack that enables the factorization of an RSA-modulus if RSA implementations use the Chinese Remainder Theorem and Montgomery multiplication. In this paper we introduce another approach for deriving the secret prime factor by focusing on the conditional branch Schindler used in his attack. One of the countermeasures against Schindler's attack is the blinding method. If input data are blinded with a fixed value or short-period random numbers, Schindler's attack does not work but our method can still factorize the RSA-modulus.
Montgomery-form elliptic curves have the advantage of faster arithmetic than Weierstrass-form elliptic curves. The dominant operation of the Elliptic Curve Cryptosystem (ECC) is scalar multiplication of points on an elliptic curve, and it usually includes scalar multiplication of a fixed base point of ECC. For Weierstrass-form elliptic curves, accelerating methods of scalar multiplication by using a pre-computed table of the fixed point have been widely studied. However, such methods cannot naturally expand to Montgomery-form elliptic curves. In this paper, we propose a fast scalar multiplication method on Montgomery-form elliptic curves by using a pre-computed table for the first time. Our method is 1.6 times as fast as the known method for Montgomery-form elliptic curves under the practical conditions that the size of the definition field is 160 bits and the memory size used for the pre-computed table is 3.2 KB.
Dejiang JIN Sotirios G. ZIAVRAS
The multiplication of large spare matrices is a basic operation in many scientific and engineering applications. There exist some high-performance library routines for this operation. They are often optimized based on the target architecture. For a parallel environment, it is essential to partition the entire operation into well balanced tasks and assign them to individual processing elements. Most of the existing techniques partition the given matrices based on some kind of workload estimation. For irregular sparse matrices on PC clusters, however, the workloads may not be well estimated in advance. Any approach other than run-time dynamic partitioning may degrade performance. In this paper, we apply our super-programming approach to parallel large matrix multiplication on PC clusters. In our approach, tasks are partitioned into super-instructions that are dynamically assigned to member computer nodes. Thus, the load balancing logic is separated from the computing logic; the former is taken over by the runtime environment. Our super-programming approach facilitates ease of program development and targets high efficiency in dynamic load balancing. Workloads can be balanced effectively and the optimization overhead is small. The results prove the viability of our approach.
Jiin-Chiou CHENG Wen-Chung KUO Chi-Sung LAIH
In, Okeya and Sakurai proposed the recovery of the y-coordinate on a Montgomery-form elliptic curve. With their method, it can calculate efficiently coordinates of scalar multiplication of point, in which we need only x-coordinate and finally, (x,y) of the terminal point can be recovered. The method is very suitable for some applications such as ECDSA-V and MQV, etc. Unfortunately, there is a significant fault in that paper. Thus, many results about computation amount are wrong due to the significant fault. First, we will show this fault, and then raise the correction of the significant fault. Finally, Table A
Wu-Chuan YANG Peng-Yueh HSEIH Chi-Sung LAIH
The efficient squaring algorithm is an important role in large integer arithmetic. All multiplication algorithms can be used for squaring large integers, but their performance can be greatly improved by using the standard squaring algorithm. The standard squaring algorithm is quite well-known, but unfortunately there is an improper carry handling bug in it. Recently, Guajardo and Paar proposed a modified squaring algorithm to fix the bug in the standard squaring algorithm. In this paper, we first point out that there is still an error-indexing bug in the Guajardo-Paar squaring algorithm. Then, we propose a new efficient squaring algorithm that not only avoids the bugs in both the standard squaring algorithm and the Guajardo-Paar squaring algorithm but also improves the performance in squaring computation. Our analyses and our simulations indicate that the proposed squaring algorithm is about 2.5 times faster in comparison with the standard multiplication algorithm in Pentium Series CPU. The performance of 1024-bit RSA cryptosystem can be saved 34.3% by using the proposed squaring algorithm to replace the standard multiplication.
The Single Instruction, Multiple Data (SIMD) architecture enables computation in parallel on a single processor. The SIMD operations are implemented on some processors such as Pentium 3/4, Athlon, SPARC, or even on smart cards. This paper proposes efficient algorithms for assembling an elliptic curve addition (ECADD), doubling (ECDBL), and k-iterated ECDBL (k-ECDBL) with SIMD operations. We optimize the number of auxiliary variables and the order of basic field operations used for these addition formulas. If an addition chain has k-bit zero run, we can replace k-time ECDBLs to the proposed faster k-ECDBL and the total efficiency of the scalar multiplication can be improved. Using the singed binary chain, we can compute a scalar multiplication about 10% faster than the previously fastest algorithm proposed by Aoki et al. Combined with the sliding window method or the width-w NAF window method, we also achieve about 10% faster parallelized scalar multiplication algorithms with SIMD operations. For the implementation on smart cards, we establish two fast parallelized scalar multiplication algorithms with SIMD resistant against side channel attacks.