In this paper, we present a scheme to compute either AB or AB2 multiplications over GF(2m) and propose a bit-parallel systolic architecture based on the proposed algorithm. The AB multiplication algorithm is derived in the same form as the formula of AB2 multiplication algorithm, and an architecture that can perform AB multiplication by adding very little extra hardware to AB2 multiplier is designed. Therefore, the proposed architecture can be effectively applied to hardware constrained applications that cannot deploy AB2 multiplier and AB multiplier separately.
Xianghong HU Hongmin HUANG Xin ZHENG Yuan LIU Xiaoming XIONG
Elliptic curve cryptography (ECC), one of the asymmetric cryptography, is widely used in practical security applications, especially in the Internet of Things (IoT) applications. This paper presents a low-power reconfigurable architecture for ECC, which is capable of resisting simple power analysis attacks (SPA) and can be configured to support all of point operations and modular operations on 160/192/224/256-bit field orders over GF(p). Point multiplication (PM) is the most complex and time-consuming operation of ECC, while modular multiplication (MM) and modular division (MD) have high computational complexity among modular operations. For decreasing power dissipation and increasing reconfigurable capability, a Reconfigurable Modular Multiplication Algorithm and Reconfigurable Modular Division Algorithm are proposed, and MM and MD are implemented by two adder units. Combining with the optimization of operation scheduling of PM, on 55 nm CMOS ASIC platform, the proposed architecture takes 0.96, 1.37, 1.87, 2.44 ms and consumes 8.29, 11.86, 16.20, 21.13 uJ to perform one PM on 160-bit, 192-bit, 224-bit, 256-bit field orders. It occupies 56.03 k gate area and has a power of 8.66 mW. The implementation results demonstrate that the proposed architecture outperforms the other contemporary designs reported in the literature in terms of area and configurability.
Yusuke AIKAWA Koji NUIDA Masaaki SHIRASE
In 2017, Shirase proposed a variant of Elliptic Curve Method combined with Complex Multiplication method for generating certain special kinds of elliptic curves. His algorithm can efficiently factorize a given composite integer when it has a prime factor p of the form 4p=1+Dv2 for some integer v, where -D is an auxiliary input integer called a discriminant. However, there is a disadvantage that the previous method works only for restricted cases where the class polynomial associated to -D has degree at most two. In this paper, we propose a generalization of the previous algorithm to the cases of class polynomials having arbitrary degrees, which enlarges the class of composite integers factorizable by our algorithm. We also extend the algorithm to more various cases where we have 4p=t2+Dv2 and p+1-t is a smooth integer.
Shotaro SUGIYAMA Hiromitsu AWANO Makoto IKEDA
A 256-bit $mathbb{F}_p$ ECDSA crypto processor featuring low latency, low energy consumption and capability of changing the Elliptic curve parameters is designed and fabricated in SOTB 65nm CMOS process. We have demonstrated the lowest ever reported signature generation time of 31.3 μs at 238MHz clock frequency. Energy consumption is 3.28 μJ/signature-generation, which is same as the lowest reported till date. We have also derived addition formulae on Elliptic curve useful for reduce the number of registers and operation cycles.
Kotaro MATSUMOTO Kazuyoshi TAKAGI Naofumi TAKAGI
The problem of evaluating the matrix polynomial I+A+A2+…+AN-1 with a reduced number of matrix multiplications has long been considered. Several algorithms have been proposed for this problem, which find a procedure requiring O(log N) matrix multiplications for a given N. Among them, the hybrid algorithm based on the double-base representation of N, i.e., using mixed radices 2 and 3, proposed by Dimitrov and Cooklev is most efficient. It has been suggested by them that the use of higher radices would not bring any more efficient algorithms. In this paper, we show that we can derive more efficient algorithms by using higher radices, and propose several efficient algorithms.
Sun-Mi PARK Ku-Young CHANG Dowon HONG Changho SEO
In this paper, we present a new three-way split formula for binary polynomial multiplication (PM) with five recursive multiplications. The scheme is based on a recently proposed multievaluation and interpolation approach using field extension. The proposed PM formula achieves the smallest space complexity. Moreover, it has about 40% reduced time complexity compared to best known results. In addition, using developed techniques for PM formulas, we propose a three-way split formula for Toeplitz matrix vector product with five recursive products which has a considerably improved complexity compared to previous known one.
Rong CHEN Cunqian FENG Sisan HE Yi RAO
The extraction of micro-motion parameters is deeply influenced by the precision of estimation on translational motion parameters. Based on the periodicity of micro-motion, the quadratic polynomial fitting is carried out among range delays to align envelope. The micro-motion component of phase information is eliminated by conjugate multiplication after which the translational motion parameters are estimated. Then the translational motion is precisely compensated through the third order polynomial fitting. Results of simulation demonstrate that the algorithm put forward here can realize the precise compensation for translational motion parameters even under an environment with low signal noise ratio (SNR).
Md. Al-Amin KHANDAKER Yasuyuki NOGAMI
Scalar multiplication over higher degree rational point groups is often regarded as the bottleneck for faster pairing based cryptography. This paper has presented a skew Frobenius mapping technique in the sub-field isomorphic sextic twisted curve of Kachisa-Schaefer-Scott (KSS) pairing friendly curve of embedding degree 18 in the context of Ate based pairing. Utilizing the skew Frobenius map along with multi-scalar multiplication procedure, an efficient scalar multiplication method for KSS curve is proposed in the paper. In addition to the theoretic proposal, this paper has also presented a comparative simulation of the proposed approach with plain binary method, sliding window method and non-adjacent form (NAF) for scalar multiplication. The simulation shows that the proposed method is about 60 times faster than plain implementation of other compared methods.
Yang LI Jinlin WANG Xuewen ZENG Xiaozhou YE
Montgomery modular multiplication is one of the most efficient algorithms for modular multiplication of large integers. On resource-constraint embedded processors, memory-access operations play an important role as arithmetic operations in the modular multiplication. To improve the efficiency of Montgomery modular multiplication on embedded processors, this paper concentrates on reducing the memory-access operations through adding a few working registers. We first revisit previous popular Montgomery modular multiplication algorithms, and then present improved algorithms for Montgomery modular multiplication and squaring for arbitrary prime fields. The algorithms adopt the general ideas of hybrid multiplication algorithm proposed by Gura and lazy doubling algorithm proposed by Lee. By careful optimization and redesign, we propose novel implementations for Montgomery multiplication and squaring called coarsely integrated product and operand hybrid scanning algorithm (CIPOHS) and coarsely integrated lazy doubling algorithm (CILD). Then, we implement the algorithms on general MIPS64 processor and OCTEON CN6645 processor equipped with specific multiply-add instructions. Experiments show that CIPOHS and CILD offer the best performance both on the general MIPS64 and OCTEON CN6645 processors. But the proposed algorithms have obvious advantages for the processors with specific multiply-add instructions such as OCTEON CN6645. When the modulus is 2048 bits, the CIPOHS and CILD outperform the CIOS algorithm by a factor of 47% and 58%, respectively.
Takumi HONDA Yasuaki ITO Koji NAKANO
In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique. We assign each multiple-length multiplication to one warp that consists of 32 threads. In parallel processing using multiple threads, usually, it is costly to synchronize execution of threads and communicate within threads. In warp-synchronous programming technique, however, execution of threads in a warp can be synchronized instruction by instruction without any barrier synchronous operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication for larger size of bits as a sub-routine. The GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
Tatsuya KAWAMOTO Xin ZHOU Jacir L. BORDIM Yasuaki ITO Koji NAKANO
Algorithms requiring fast manipulation of multiple-length numbers are usually implemented in hardware. However, hardware implementation, using HDL (Hardware Description Language) for instance, is a laborious task and the quality of the solution relies heavily on the designer expertise. The main contribution of this work is to present a flexible-length-arithmetic processor based on FDFM (Few DSP slices and Few Memory blocks) approach that supports arithmetic operations on multiple-length numbers using FPGAs (Field Programmable Gate Array). The proposed processor has been implement on the Xilinx Virtex-6 FPGA. Arithmetic instructions of the proposed processor architecture include addition, subtraction, and multiplication of integer numbers exceeding 64-bits. To reduce the burden of implementing algorithm directly on the FPGA, applications requiring multiple-length arithmetic operations are written in a C-like language and translated into a machine program. The machine program is then transferred and executed on the proposed architecture. A 2048-bit RSA encryption/decryption implementation has been used to assess the goodness of the proposed approach. Experimental results shows that the computing time, using the proposed architecture, of a 2048-bit RSA encryption takes only 2.2 times longer than a direct FPGA implementation. Furthermore, by employing multiple FDFM cores for the same task, the computing time reduces considerably.
Takeshi SUGAWARA Daisuke SUZUKI Minoru SAEKI
The single-shot collision attack on RSA proposed by Hanley et al. is studied focusing on the difference between two operands of multiplier. It is shown that how leakage from integer multiplier and long-integer multiplication algorithm can be asymmetric between two operands. The asymmetric leakage is verified with experiments on FPGA and micro-controller platforms. Moreover, we show an experimental result in which success and failure of the attack is determined by the order of operands. Therefore, designing operand order can be a cost-effective countermeasure. Meanwhile we also show a case in which a particular countermeasure becomes ineffective when the asymmetric leakage is considered. In addition to the above main contribution, an extension of the attack by Hanley et al. using the signal-processing technique of Big Mac Attack is presented.
Due to its on-demand and pay-as-you-go properties, cloud computing has become an attractive alternative for HPC applications. However, communication-intensive applications with complex communication patterns still cannot be performed efficiently on cloud platforms, which are equipped with MapReduce technologies, such as Hadoop and Spark. In particular, one major obstacle is that MapReduce's simple programming model cannot explicitly manipulate data transfers between compute nodes. Another obstacle is cloud's relatively poor network performance compared with traditional HPC platforms. The traditional Strassen's algorithm of square matrix multiplication has a recursive and complex pattern on the HPC platform. Therefore, it cannot be directly applied to the cloud platform. In this paper, we demonstrate how to make Strassen's algorithm with complex communication patterns “cloud-friendly”. By reorganizing Strassen's algorithm in an iterative pattern, we completely separate its computations and communications, making it fit to MapReduce programming model. By adopting a novel data/task parallel strategy, we solve Strassen's data dependency problems, making it well balanced. This is the first instance of Strassen's algorithm in MapReduce-style systems, which also matches Strassen's communication lower bound. Further experimental results show that it achieves a speedup ranging from 1.03× to 2.50× over the classical Θ(n3) algorithm. We believe the principle can be applied to many other complex scientific applications.
Hiroshi IMAI Vorapong SUPPAKITPAISARN
In this paper, we improve a width-3 joint sparse form proposed by Okeya, Katoh, and Nogami. After the improvement, the representation can attain an asymtotically optimal complexity found in our previous work. Although claimed as optimal by the authors, the average computation time of multi-scalar multiplication obtained by the representation is 563/1574n+o(n)≈0.3577n+o(n). That number is larger than the optimal complexity 281/786n+o(n)≈0.3575n+o(n) found in our previous work. To optimize the width-3 joint sparse form, we add more cases to the representation. After the addition, we can show that the complexity is updated to 281/786n+o(n)≈0.3575n+o(n), which implies that the modified representation is asymptotically optimal. Compared to our optimal algorithm in the previous work, the modified width-3 joint sparse form uses less dynamic memory, but it consumes more static memory.
Field Programmable Gate Array (FPGA) implementation of Elliptic Curve Cryptography (ECC) over GF(p) is commonly not fast enough to meet the request of high-performance applications. There are three critical factors to determine the performance of ECC processor over GF(p): multiplication structure, modular multiplication algorithm, and scalar point multiplication scheduling. This work proposes a novel multiplication structure which is a two-stage pipeline on the basis of Karatsuba-Ofman algorithm. With the proposed multiplication structure, we design a 256-bit modular multiplier based on Improved Barret Modular Multiplication algorithm. Upon the modular multiplier, we finish the scalar point multiplication scheduling and implement a high-performance ECC processor on FPGA. Compared with the previous modular multipliers, our modular multiplier reduces the 256-bit modular multiplication time by 28% at least. Synthesis result on Altera Stratix II shows that our ECC processor can complete a 256-bit ECC scalar point multiplication in 0.51ms, which is at least 1.3 times faster than the currently reported FPGA ECC processors over GF(p).
Bu-Ching LIN Juinn-Dar HUANG Jing-Yang JOU
The notion of multiple constant multiplication (MCM) is extensively adopted in digital signal processing (DSP) applications such as finite impulse filter (FIR) designs. A set of adders is utilized to replace regular multipliers for the multiplications between input data and constant filter coefficients. Though many algorithms have been proposed to reduce the total number of adders in an MCM block for area minimization, they do not consider the actual bitwidth of each adder, which may not estimate the hardware cost well enough. Therefore, in this article we propose a bitwidth-aware MCM optimization algorithm that focuses on minimizing the total number of adder bits rather than the adder count. It first builds a subexpression graph based on the given coefficients, derives a set of constraints for adder bitwidth minimization, and then optimally solves the problem through integer linear programming (ILP). Experimental results show that the proposed algorithm can effectively reduce the required adder bit count and outperforms the existing state-of-the-art techniques.
Naoki KANAYAMA Yang LIU Eiji OKAMOTO Kazutaka SAITO Tadanori TERUYA Shigenori UCHIYAMA
We implemented a scalar multiplication method over elliptic curves using division polynomials. We adapt an algorithm for computing elliptic nets proposed by Stange. According to our experimental results, the scalar multiplication method using division polynomials is faster than the binary method in an affine coordinate system.
Jinwei WANG Xirong MA Yuanping ZHU Jizhou SUN
Modern GPUs have evolved to become a more general processor capable of executing scientific and engineering computations. It provides a highly parallel computing environment due to its large number of computing cores, which are suitable for numerous data parallel arithmetic computations, particularly linear algebra operations. The matrix-vector multiplication is one of the most important dense linear algebraic operations. It is applied to a diverse set of applications in many fields and must therefore be fully optimized to achieve a high-performance. In this paper, we proposed a novel auto-tuning method for matrix-vector multiplication on GPUs, where the number of assigned threads that are used to compute one element of the result vector can be auto-tuned according to the size of matrix. On the Nvidia's GPU GTX 650 with the most recent Kepler architecture, we developed an auto-tuner that can automatically select the optimal number of assigned threads for calculation. Based on the auto-tuner's result, we developed a versatile generic matrix-vector multiplication kernel with the CUDA programming model. A series of experiments on different shapes and sizes of matrices were conducted for comparing the performance of our kernel with that of the kernels from CUBLAS 5.0, MAGMA 1.3 and a warp method. The experiments results show that the performance of our matrix-vector multiplication kernel is close to the optimal behavior with increasing of the size of the matrix and has very little dependency on the shape of the matrix, which is a significant improvement compared to the other three kernels that exhibit unstable performance behavior for different shapes of matrices.
Ferruh ÖZBUDAK Sedat AKLEYLEK Murat CENK
In this paper, Hermite polynomial representation is proposed as an alternative way to represent finite fields of characteristic two. We show that multiplication in Hermite polynomial representation can be achieved with subquadratic space complexity. This representation enables us to find binomial or trinomial irreducible polynomials which allows us faster modular reduction over binary fields when there is no desirable such low weight irreducible polynomial in other representations. We then show that the product of two elements in Hermite polynomial representation can be performed as Toeplitz matrix-vector product. This representation is very interesting for NIST recommended binary field GF(2571) since there is no ONB for the corresponding extension. This representation can be used to obtain more efficient finite field arithmetic.
The main benefit of HECC is that it has much smaller parameter sizes and offers equivalent security as ECC and RSA. However, there are still more researches on ECC than on HECC. One of the reasons is that the computation of scalar multiplication cannot catch up. The Kummer surface can speed up the scalar multiplication in genus two curves. In this paper, we find that the scalar multiplication formulas of Duquesne in characteristic p > 3 on the Kummer surface are not correct. We verify and revise the formulas with mathematical software. The operation counts become 29M + 2S for new pseudo-addition formulas and 30M + 10S for doubling ones. And then we speed up the scalar multiplication on the Kummer surface with Euclidean addition chains.