We have realized a design automation platform of hardware accelerator for pairing operation over multiple elliptic curve parameters. Pairing operation is one of the fundamental operations to realize functional encryption. However, known as a computational complexity-heavy algorithm. Also because there have been not yet identified standard parameters, we need to choose curve parameters based on the required security level and affordable hardware resources. To explore this design optimization for each curve parameter is essential. In this research, we have realized an automated design platform for pairing hardware for such purposes. Optimization results show almost equivalent to those prior-art designs by hand.
Vu-Trung-Duong LE Hoai-Luan PHAM Thi-Hong TRAN Yasuhiko NAKASHIMA
Blockchain-based Internet of Things (IoT) applications require flexible, fast, and low-power hashing hardware to ensure IoT data integrity and maintain blockchain network confidentiality. However, existing hashing hardware poses challenges in achieving high performance and low power and limits flexibility to compute multiple hash functions with different message lengths. This paper introduces the flexible and energy-efficient crypto-processor (FECP) to achieve high flexibility, high speed, and low power with high hardware efficiency for blockchain-based IoT applications. To achieve these goals, three new techniques are proposed, namely the crypto arithmetic logic unit (Crypto-ALU), dual buffering extension (DBE), and local data memory (LDM) scheduler. The experiments on ASIC show that the FECP can perform various hash functions with a power consumption of 0.239-0.676W, a throughput of 10.2-3.35Gbps, energy efficiency of 4.44-14.01Gbps/W, and support up to 8916-bit message input. Compared to state-of-art works, the proposed FECP is 1.65-4.49 times, 1.73-21.19 times, and 1.48-17.58 times better in throughput, energy efficiency, and energy-delay product (EDP), respectively.
Kaoru MASADA Ryohei NAKAYAMA Makoto IKEDA
BLS signature is an elliptic curve cryptography with an attractive feature that signatures can be aggregated and shortened. We have designed two ASIC architectures for hashing to the elliptic curve and pairing to minimize the latency. Also, the designs are optimized for BLS12-381, a relatively new and safe curve.
Song BIAN Masayuki HIROMOTO Takashi SATO
In this work, we provide the first practical secure email filtering scheme based on homomorphic encryption. Specifically, we construct a secure naïve Bayesian filter (SNBF) using the Paillier scheme, a partially homomorphic encryption (PHE) scheme. We first show that SNBF can be implemented with only the additive homomorphism, thus eliminating the need to employ expensive fully homomorphic schemes. In addition, the design space for specialized hardware architecture realizing SNBF is explored. We utilize a recursive Karatsuba Montgomery structure to accelerate the homomorphic operations, where multiplication of 2048-bit integers are carried out. Through the experiment, both software and hardware versions of the SNBF are implemented. On software, 104-105x runtime and 103x storage reduction are achieved by SNBF, when compared to existing fully homomorphic approaches. By instantiating the designed hardware for SNBF, a further 33x runtime and 1919x power reduction are achieved. The proposed hardware implementation classifies an average-length email in under 0.5s, which is much more practical than existing solutions.
Hiromitsu AWANO Tadayuki ICHIHASHI Makoto IKEDA
An ASIC crypto processor optimized for the 254-bit prime-field optimal-ate pairing over Barreto-Naehrig (BN) curve is proposed. The data path of the proposed crypto processor is designed to compute five Fp2 operations, a multiplication, three addition/subtractions, and an inversion, simultaneously. We further propose a design methodology to automate the instruction scheduling by using a combinatorial optimization solver, with which the total cycle count is reduced to 1/2 compared with ever reported. The proposed crypto processor is designed and fabricated by using a 65nm silicon-on-thin-box (SOTB) CMOS process. The chip measurement result shows that the fabricated chip successfully computes a pairing in 0.185ms when a typical operating voltage of 1.20V is applied, which corresponds to 2.8× speed up compared to the current state-of-the-art pairing implementation on ASIC platform.
Shotaro SUGIYAMA Hiromitsu AWANO Makoto IKEDA
A 256-bit $mathbb{F}_p$ ECDSA crypto processor featuring low latency, low energy consumption and capability of changing the Elliptic curve parameters is designed and fabricated in SOTB 65nm CMOS process. We have demonstrated the lowest ever reported signature generation time of 31.3 μs at 238MHz clock frequency. Energy consumption is 3.28 μJ/signature-generation, which is same as the lowest reported till date. We have also derived addition formulae on Elliptic curve useful for reduce the number of registers and operation cycles.
Jinguang HAO Gang WANG Lili WANG Honggang WANG
In this paper, an optimal method is proposed to design sparse-coefficient notch filters with principal basic vectors in the column space of a matrix constituted with frequency samples. The proposed scheme can perform in two stages. At the first stage, the principal vectors can be determined in the least-squares sense. At the second stage, with some components of the principal vectors, the notch filter design is formulated as a linear optimization problem according to the desired specifications. Optimal results can form sparse coefficients of the notch filter by solving the linear optimization problem. The simulation results show that the proposed scheme can achieve better performance in designing a sparse-coefficient notch filter of small order compared with other methods such as the equiripple method, the orthogonal matching pursuit based scheme and the L1-norm based method.
SM3 is a hash function standard defined by China. Unlike SHA-1 and SHA-2, it is hard for SM3 to speed up the throughput because it has more complicated compression function than other hash algorithm. In this paper, we propose a 4-round-in-1 structure to reduce the number of rounds, and a logical simplifying to move 3 adders and 3 XOR gates from critical path to the non-critical path. Based in SMIC 65nm CMOS technology, the throughput of SM3 can achieve 6.54Gbps which is higher than that of the reported designs.
ThienLuan HO Seung-Rohk OH HyunJin KIM
A parallel Aho-Corasick (AC) approach, named PAC-k, is proposed for string matching in deep packet inspection (DPI). The proposed approach adopts graphic processing units (GPUs) to perform the string matching in parallel for high throughput. In parallel string matching, the boundary detection problem happens when a pattern is matched across chunks. The PAC-k approach solves the boundary detection problem because the number of characters to be scanned by a thread can reach the longest pattern length. An input string is divided into multiple sub-chunks with k characters. By adopting the new starting position in each sub-chunk for the failure transition, the required number of threads is reduced by a factor of k. Therefore, the overhead of terminating and reassigning threads is also decreased. In order to avoid the unnecessary overlapped scanning with multiple threads, a checking procedure is proposed that decides whether a new starting position is in the sub-chunk. In the experiments with target patterns from Snort and realistic input strings from DEFCON, throughputs are enhanced greatly compared to those of previous AC-based string matching approaches.
Tetsuya MANABE Takaaki HASEGAWA
This paper presents a design methodology for positioning sub-platform from the viewpoint of positioning for smartphone-based location-based services (LBS). To achieve this, we analyze a mechanism of positioning error generation including principles of positioning sub-systems and structure of smartphones. Specifically, we carry out the experiments of smartphone positioning performance evaluation by the smartphone basic API (Application Programming Interface) and by the wireless LAN in various environments. Then, we describe the importance of considering three layers as follows: 1) the lower layer that caused by positioning sub-systems, e.g., GPS, wireless LAN, mobile base stations, and so on; 2) the middle layer that caused by functions provided from the platform such as Android and iOS; 3) the upper layer that caused by operation algorithm of applications on the platform.
Yoshikazu MIYANAGA Wataru TAKAHASHI Shingo YOSHIZAWA
This paper introduces our developed noise robust speech communication techniques and describes its implementation to a smart info-media system, i.e., a small robot. Our designed speech communication system consists of automatic speech detection, recognition, and rejection. By using automatic speech detection and recognition, an observed speech waveform can be recognized without a manual trigger. In addition, using speech rejection, this system only accepts registered speech phrases and rejects any other words. In other words, although an arbitrary input speech waveform can be fed into this system and recognized, the system responds only to the registered speech phrases. The developed noise robust speech processing can reduce various noises in many environments. In addition to the design of noise robust speech recognition, the LSI design of this system has been introduced. By using the design of speech recognition application specific IC (ASIC), we can simultaneously realize low power consumption and real-time processing. This paper describes the LSI architecture of this system and its performances in some field experiments. In terms of current speech recognition accuracy, the system can realize 85-99% under 0-20dB SNR and echo environments.
Duc-Hung LE Katsumi INOUE Cong-Kha PHAM
A CAM-based matching system for fast exact pattern matching is implemented on a hardware system with FPGA and ASIC. The system has a simple structure, and does not employ any Central Processor Unit (CPU) as well as complicated computations. We take advantage of Content Addressable Memory (CAM) which has an ability of parallel multi-match mode for designing the system. The system is applied to fast pattern matching with various required search patterns without using search principles. In this paper, the authors present a CAM-based system for fast exact pattern matching on 2-D data.
Minoru IIZUKA Naohiro HAMADA Hiroshi SAITO
This paper proposes an ASIC design support tool set for non-pipelined asynchronous circuits with bundled-data implementation. This tool set consists of seven tools to automate design processes of bundled-data implementation such as the generation of design constraints, timing verification, and delay adjustment considering a given latency constraint. With the proposed design flow which combines the proposed tool set and commercial CAD tools, most of design processes from an RTL model is fully automated. In the experiments, to show the effectiveness of energy consumption in bundled-data implementation compared to synchronous counterpart, this paper synthesizes several circuits with a latency constraint which is generated from the synchronous counterpart with the minimum clock cycle time.
Ryohei HORI Taisuke UEOKA Taku OTANI Masaya YOSHIKAWA Takeshi FUJINO
A low-cost and low-power via-programmable structured ASIC architecture named “VPEX3” and a VPEX3-specific CAD system are developed. In the VPEX3 architecture, which is an improved version of the old VPEX and VPEX2 architectures, an arbitrary logic function including sequential logic can be programmed by three via layers. The logic elements (LEs) of VPEX3 are 60% smaller than those of the previous VPEX2, which can be programmed by two via layers. In this paper, we describe a global architecture named Logic Array Block (LAB) composed of LE matrices. The clock lines are buffered in the buffering region on the left and right sides of LAB. Next, a VPEX3-specific CAD system utilizing an academic placement tool named “CAPO” and the “FGR” global router is developed. Since these tools are originally designed for ASICs, we developed CAD tools for supporting a structured ASIC architecture. In particular, we developed a detailed router that assigns via positions on the via-programmable routing fabric. Our CAD system successfully converts the HDL design to GDS-II data format including via-1, 2, 3 layouts, and the successful verification of LVS and DRC on GDSII is achieved. The performance of the VPEX3 architecture and the CAD system is evaluated using ISCAS benchmark circuits. The developed CAD system is used to successfully design a test chip composed of 130110 LEs.
Ryohei HORI Tatsuya KITAMORI Taisuke UEOKA Masaya YOSHIKAWA Takeshi FUJINO
Various kinds of structured ASICs have been proposed that can customize logic functions using a few photomasks, which decreases the initial cost, especially that of expensive photo-masks. In the past, we have developed a via programmable structured ASIC “VPEX2” (Via Programmable logic device using EXclusive-or array) that is capable of changing logics on 2 via (the 1st and 3rd via) layers. The logic element (LE) of VPEX2 is composed of EXOR gate and 2 NOT gates. However, “VPEX2” architecture has the two important penalty, the area penalty is 5-6 times that of the ASIC and wiring congestion by detouring wires to avoid I/O terminals. In this paper, we propose a new architecture “VPEX3” in order to achieve the practical structures. In VPEX3, we applied three techniques for decrease area penalty and higher wiring efficiency: (1) LE area is reduced approximately 60% by omitting 1 NOT gate on a LE and the gate width reduction, (2) the kinds of configurable logic function on a single LE is increased from 13 to 22 by introducing “flexible AOI gate technique” and (3) flexible I/O terminal by introducing 2nd via as a programmable layers. Furthermore, the delay model for via programmable wiring is necessary in order to evaluate via programmable wiring architecture compared to standard cell ASIC. We extracted wiring delay characteristics from the ring oscillator test circuit using both of normal wiring and via-programmable wiring. These three new architectures and via programmable wiring-delay-model revealed that an area-delay product of “VPEX3” is as small as twice that of ASIC. Chip-cost estimation among FPGA, “VPEX2”, “VPEX3” and ASIC revealed that the “VPEX3” is the most cost-effective architecture for Systems-on-chips (SoCs) whose production volume is from one thousand to several tens of thousands units.
Dai YAMAMOTO Kouichi ITOH Jun YAJIMA
Compact design is very important for embedded systems such as wireless sensor nodes, RFID tags and mobile devices because of their limited hardware (H/W) resources. This paper proposes a compact H/W implementation for the KASUMI block cipher, which is the 3GPP standard encryption algorithm. In [8] and [9], Yamamoto et al. proposed a method of reducing the register size for the MISTY1 FO function (YYI-08), and implemented very compact MISTY1 H/W. In this paper we aim to implement the smallest KASUMI H/W to date by applying a YYI-08 configuration to KASUMI, whose FO function has a similar structure to that of MISTY1. However, we discovered that straightforward application of YYI-08 raises problems. We therefore propose a new YYI-08 configuration improved for KASUMI and the compact H/W architecture. The new YYI-08 configuration consists of new FL function calculation schemes and a suitable calculation order. According to our logic synthesis on a 0.11-µm ASIC process, the gate size is 2.99 K gates, which, to our knowledge, is the smallest to date.
Fast Fourier Transform (FFT) is an important algorithm in many digital signal processing applications, and it often requires parallel implementation for high throughput. In this paper, we first present the SmartCell coarse-grained reconfigurable architecture targeted for stream processing. A SmartCell prototype integrates 64 processing elements, configurable interconnections, and dedicated instruction and data memories into a single chip, which is able to provide high performance parallel processing while maintaining post-fabrication flexibility. Subsequently, we present a parallel FFT architecture targeted for multi-core platforms computing systems. This algorithm provides an optimized data flow pattern that reduces both communication and configuration overheads. The proposed parallel FFT algorithm is then mapped onto the SmartCell prototype device. Results show that the parallel FFT implementation on SmartCell is about 14.9 and 2.7 times faster than network-on-chip (NoC) and MorphoSys implementations, respectively. SmartCell also achieves the energy efficiency gains of 2.1 and 28.9 when compared with FPGA and DSP implementations.
Dai YAMAMOTO Jun YAJIMA Kouichi ITOH
This paper proposes a compact hardware (H/W) implementation for the MISTY1 block cipher, which is one of the ISO/IEC 18033-3 standard encryption algorithms. In designing the compact H/W, we focused on optimizing the implementation of FO/FI/FL functions, which are the main components of MISTY1. For this optimization, we propose three new methods; reducing temporary registers for the FO function, shortening the critical path for the FI function, and merging the FL/FL-1 functions. According to our logic synthesis on a 0.18-µm CMOS standard cell library based on our proposed methods, the gate size is 3.4 Kgates, which is the smallest as far as we know.
The development of network technology reveals the clear trend that mobile devices will soon be equipped with more and more network-based functions and services. This increase also results in more intrusions and attacks on mobile devices; therefore, mobile security mechanisms are becoming indispensable. In this paper, we propose a novel signature matching scheme for mobile security. This scheme not only emphasizes a small resource requirement and an optimal scan speed, which are both important for resource-limited mobile devices, but also focuses on practical features such as stable performance, fast signature set updates and hardware implementation. This scheme is based on the finite state machine (FSM) approach widely used for string matching. An SRAM-based two-level finite state machine (TFSM) solution is introduced to utilize the unbalanced transition distribution in the original FSM to decrease the memory requirement, and to shorten the critical path of the single-FSM solution. By adjusting the boundary of the two FSMs, optimum memory usage and throughput are obtainable. The hardware circuit of our scheme is designed and evaluated by both FPGA and ASIC technology. The result of FPGA evaluation shows that 2,168 unique patterns with a total of 32,776 characters are stored in 177.75 KB SelectRAM blocks of Xilinx XC4VLX40 FPGA and a 3.0 Gbps throughput is achieved. The result of ASIC evaluation with 180 nm-CMOS library shows a throughput of over 4.5 Gbps with 132 KB of SRAM. Because of the small amount of memory and logic cell requirements, as well as the scalability of our scheme, higher performance is achieved by instantiating several signature matching engines when more resources are provided.
A novel two-dimensional discrete wavelet transform (2-DDWT) parallel architecture for higher throughput and lower energy consumption is proposed. The proposed architecture fully exploits full-page burst accesses of DRAM and minimizes the number of DRAM activate and precharge operations. Simulation results revealed that the architecture reduces the number of clock cycles for DRAM memory accesses as well as the DRAM power consumption with moderate cost of internal memory. Evaluation of the VLSI implementation of the architecture showed that the throughput of wavelet filtering was increased by parallelizing row filtering with a minimum area cost, thereby enabling DRAM full-page burst accesses to be exploited.