Yoshinori ITOTAGAWA Koma ATSUMI Hikaru SEBE Daisuke KANEMOTO Tetsuya HIROSE
This paper describes a programmable differential bandgap reference (PD-BGR) for ultra-low-power IoT (Internet-of-Things) edge node devices. The PD-BGR consists of a current generator (CG) and a differential voltage generator (DVG). The CG is based on a bandgap reference (BGR) and generates an operating current and a voltage, while the DVG generates another voltage from the current. A differential voltage reference is obtained by taking the difference between the two voltages. The PD-BGR can produce a programmable differential output voltage by changing, with digital codes, the multipliers of the MOSFETs in a differential pair and the resistance. Simulation results showed that the proposed PD-BGR can generate 25- to 200-mV reference voltages in 25-mV steps within a ±0.7% temperature inaccuracy over a temperature range from -20 to 100°C. A Monte Carlo simulation showed that the coefficient of variation of the reference was within 1.1%. Measurement results demonstrated that our prototype chips can generate stable programmable differential output voltages that almost match the simulation results. The average power consumption was only 88.4 nW, with a voltage error of -4/+3 mV across 5 samples.
Xiaoyong SONG Zhichuan GUO Xinshuo WANG Mangu SONG
In software-defined networking (SDN), packet processing is commonly implemented using the match-action model, where packets are processed according to the actions matched in a match-action table. Because FPGA on-board resources are limited, achieving large-scale, high-throughput exact matching (EM) while resolving hash conflicts and out-of-order problems is an important challenge. To address these issues, this study proposes an FPGA-based EM table that shares rule tables across multiple pipelines to eliminate memory replication and enhance overall throughput. A reordering function ensures packet sequencing within the pipelines. Moreover, to handle collisions and increase the load factor of the hash table, multiple hash table blocks are combined and an auxiliary CAM-based EM table is integrated in each pipeline. To the best of our knowledge, this is the first design that considers the recovery of out-of-order operations in a multi-channel EM table for high-speed network packet processing. Furthermore, the design is implemented on a Xilinx Alveo U250 field-programmable gate array; it holds one million rules and achieves a processing speed of 200 million operations per second, theoretically enabling throughput exceeding 100 Gbps for 64-byte packets.
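The collision-handling idea in this abstract (several hash table blocks backed by a small auxiliary CAM) can be illustrated in software. The following is only a minimal sketch under stated assumptions: the class name, the CRC-based hash functions, the table sizes, and the insertion policy are hypothetical and are not taken from the paper, and a Python dict stands in for the CAM.

```python
import zlib


class ExactMatchTable:
    """Toy exact-match table: several hash blocks plus a small overflow store.

    Hypothetical illustration only; sizes and hash functions are assumptions.
    """

    def __init__(self, num_blocks=2, slots_per_block=8, cam_size=4):
        self.blocks = [[None] * slots_per_block for _ in range(num_blocks)]
        self.slots = slots_per_block
        self.cam = {}              # dict standing in for the auxiliary CAM
        self.cam_size = cam_size

    def _index(self, block_id, key):
        # A different CRC start value per block acts as a separate hash function.
        return zlib.crc32(key, block_id + 1) % self.slots

    def insert(self, key, action):
        # Update in place if the key is already stored somewhere.
        for b, block in enumerate(self.blocks):
            i = self._index(b, key)
            if block[i] is not None and block[i][0] == key:
                block[i] = (key, action)
                return True
        if key in self.cam:
            self.cam[key] = action
            return True
        # Otherwise use the first block whose candidate slot is free.
        for b, block in enumerate(self.blocks):
            i = self._index(b, key)
            if block[i] is None:
                block[i] = (key, action)
                return True
        # Every candidate slot collided: fall back to the overflow store.
        if len(self.cam) < self.cam_size:
            self.cam[key] = action
            return True
        return False               # table is full for this key

    def lookup(self, key):
        for b, block in enumerate(self.blocks):
            entry = block[self._index(b, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return self.cam.get(key)
```

Keys are byte strings, for example `table.insert(b"10.0.0.1:80", "forward_port_3")` followed by `table.lookup(b"10.0.0.1:80")`. As a check on the quoted throughput figure, and assuming one table operation per packet, 200 × 10^6 operations per second × 64 bytes × 8 bits gives 102.4 Gbit/s of packet data.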
Satoshi ITO Tomoaki KANAYA Akihiro NAKAO Masato OGUCHI Saneyasu YAMAGUCHI
The concepts of programmable switches and software-defined networking (SDN) give developers flexible and deep control over the behavior of switches. We expect these concepts to dramatically improve the functionality of switches. In this paper, we focus on the concept of Deeply Programmable Networks (DPN), where data planes are programmable, and application switches based on DPN. We then propose a method to improve the performance of a key-value store (KVS) through an application switch. First, we explain the DPN and application switches. The DPN is a network that makes not only control planes but also data planes programmable. An application switch is a switch that implements some functions of network applications, such as database management system (DBMS). Second, we propose a method to improve the performance of Cassandra, one of the most popular key-value based DBMS, by implementing a caching function in a switch in a dedicated network such as a data center. The proposed method is expected to be effective even though it is a simple and traditional way because it is in the data path and the center of the network application. Third, we implement a switch with the caching function, which monitors the accessed data described in packets (Ethernet frames) and dynamically replaces the cached data in the switch, and then show that the proposed caching switch can significantly improve the KVS transaction performance with this implementation. In the case of our evaluation, our method improved the KVS transaction throughput by up to 47%.
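The in-path caching idea can be sketched in a few lines. This is a hypothetical illustration, not the authors' switch implementation: the abstract does not state the replacement policy, so least-recently-used (LRU) eviction is assumed here, and a plain dict stands in for the Cassandra back end.

```python
from collections import OrderedDict


class CachingSwitch:
    """Toy model of a switch that caches key-value reads on the data path.

    Assumptions: LRU replacement and a dict as the KVS back end.
    """

    def __init__(self, backend, capacity=2):
        self.backend = backend        # stands in for the KVS server
        self.capacity = capacity
        self.cache = OrderedDict()    # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self.cache[key]        # answered without reaching the server
        self.misses += 1
        value = self.backend.get(key)     # forward the request to the server
        if value is not None:
            self._store(key, value)
        return value

    def put(self, key, value):
        self.backend[key] = value
        if key in self.cache:             # keep the cached copy consistent
            self.cache[key] = value
            self.cache.move_to_end(key)

    def _store(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
```

Reads that hit the cache never reach the server, which is the intuition behind placing the cache at the centre of the data path.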
In this paper, a circuit based on a field programmable analog array (FPAA) is proposed for three types of chaotic spiking oscillator (CSO). The input/output conversion characteristics of a specific element in the FPAA can be defined by the user. By selecting the proper characteristics, three types of CSO are realized without changing the structure of the circuit itself. Chaotic attractors are observed in a hardware experiment. It is confirmed that the dynamics of the CSOs are consistent with numerical simulations.
Modern distributed storage requires microsecond-scale tail latency, but the current coordinator-based quorum coordination causes a burdensome latency overhead. This paper presents Archon, a new quorum coordination architecture that supports low tail latency for microsecond-scale replicated storage. The key idea of Archon is to perform the quorum coordination in the network switch by leveraging the flexibility and capability of emerging programmable switch ASICs. Our in-network quorum coordination is based on the observation that the modern programmable switch provides nanosecond-scale processing delay and high flexibility simultaneously. To realize the idea, we design a custom switch data plane. We implement an Archon prototype on an Intel Tofino switch and conduct a series of testbed experiments. Our experimental results show that Archon can provide lower tail latency than the coordinator-based solution.
Yifang BAO Shigeru YAMASHITA Bing LI Tsung-Yi HO
When we use a Programmable Microfluidic Device (PMD), we need to wash the contaminated areas before the chip can be used for further experiments. Recently, a novel washing technique called Block-Flushing has been proposed. Block-Flushing washes contaminated areas in PMDs by using buffer flows. In Block-Flushing, we need to keep a buffer flow running from an input port to an output port of a PMD for a long period to dissolve residual contaminants. Thus, we may need a lot of buffer fluid and washing time even if the contaminated area is small. Another disadvantage of Block-Flushing is that buffer flows alone may not be able to completely clean residual contaminants at valves. To address these issues, this paper proposes a totally new idea for washing PMDs: our method does not use buffer flows but washes contaminated areas by using mixers. By using a mixer, we can dissolve residual contaminants at valves in the area of the mixer very efficiently. In this paper, we propose two methods to wash PMDs by using mixers. The first method can wash the whole chip area in only four times the operating time of a single 2x2 mixer. The second method is a heuristic that reduces the number of moving valves, because valves may wear down if they are used many times. We also show experimental results confirming that the second method can indeed decrease the number of valves used.
Xu BAI Ryusuke NEBASHI Makoto MIYAMURA Kazunori FUNAHASHI Naoki BANNO Koichiro OKAMOTO Hideaki NUMATA Noriyuki IGUCHI Tadahiko SUGIBAYASHI Toshitsugu SAKAMOTO Munehiro TADA
A static timing analysis (STA) tool for a 28nm atom-switch FPGA (AS-FPGA) is introduced to validate the signal delay of an application circuit before implementation. High accuracy of the STA tool is confirmed by implementing a practical application circuit on the 28nm AS-FPGA. Moreover, dramatic improvement of delay and power is demonstrated in comparison with a previous 40nm AS-FPGA.
Binhao HE Meiting XUE Shubiao LIU Feng YU Weijie CHEN
Top-K sorting is a variant of sorting used heavily in applications such as database management systems. Recently, the use of field programmable gate arrays (FPGAs) to accelerate sorting operations has attracted the interest of researchers. However, existing hardware top-K sorting algorithms are either resource-intensive or of low throughput. In this paper, we present a resource-efficient top-K sorting architecture that is composed of L cascading sorting units, each of which is composed of P sorting cells. The K=PL largest elements are produced when a variable-length input sequence is processed. This architecture can operate at a high frequency while consuming fewer resources. The experimental results show that our architecture achieved a maximum 1.2x throughput-to-resource improvement compared to previous studies.
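The relation K=PL can be illustrated with a software model of a cascade. This is a minimal sketch under assumptions: each of the L units is modelled as a min-heap holding at most P values, and a full unit passes its smallest value downstream. The function name and the heap-based units are hypothetical; the paper's sorting cells may work differently.

```python
import heapq


def cascaded_top_k(stream, P, L):
    """Return the P*L largest values of `stream`, in descending order.

    Software model only: L cascaded units, each keeping at most P values.
    """
    units = [[] for _ in range(L)]   # each unit is a min-heap of size <= P
    for x in stream:
        carry = x
        for unit in units:
            if len(unit) < P:
                heapq.heappush(unit, carry)   # unit has room: absorb the value
                carry = None
                break
            # Unit is full: keep its P largest values, pass the smallest on.
            carry = heapq.heappushpop(unit, carry)
        # A value still carried after the last unit is not in the top K
        # and is discarded.
    return sorted((v for unit in units for v in unit), reverse=True)
```

The first unit always holds the P largest values seen so far, the second unit holds the P largest of everything the first has passed on, and so on, so the units together hold the K=PL largest values for an input of any length.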
Masayuki FUKUMITSU Shingo HASEGAWA
The Schnorr signature is one of the representative signature schemes, and its security has been widely discussed. In the random oracle model (ROM), its security is provable from the DL assumption, whereas there is negative circumstantial evidence in the standard model. Fleischhacker, Jager, and Schröder showed that the tight security of the Schnorr signature is unprovable from a strong cryptographic assumption, such as the One-More DL (OM-DL) assumption and the computational and decisional Diffie-Hellman assumptions, in the ROM via a generic reduction as long as the underlying cryptographic assumption holds. However, it remains open whether the impossibility of proving the security of the Schnorr signature from a strong assumption can also be shown for a non-tight and reasonable reduction. In this paper, we show that the security of the Schnorr signature is unprovable from the OM-DL assumption in the non-programmable ROM as long as the OM-DL assumption holds. Our impossibility result is proven via a non-tight Turing reduction.
Hao XIAO Kaikai ZHAO Guangzhu LIU
This work presents a DNN accelerator architecture specifically designed for performing efficient inference on compressed and sparse DNN models. Leveraging the data sparsity, a runtime processing scheme is proposed to deal with the encoded weights and activations directly in the compressed domain without decompressing. Furthermore, a new data flow is proposed to facilitate the reuse of input activations across the fully-connected (FC) layers. The proposed design is implemented and verified using the Xilinx Virtex-7 FPGA. Experimental results show that, running AlexNet, it is 1.99× and 1.95× faster and 20.38× and 3.04× more energy efficient than CPU and mGPU platforms, respectively.
Saki HATTA Nobuyuki TANAKA Hiroyuki UZAWA Koyo NITTA
The application of network function virtualization (NFV) and software-defined networking (SDN) to passive optical networks (PONs) is attracting attention for the deployment of cost-effective access network systems. This paper presents a novel architecture of a programmable finite state machine (P-FSM) as a hardware accelerator for protocol processing in an optical line terminal (OLT). The P-FSM is programmable hardware that manages various types of FSMs to enhance flexibility in OLTs and achieve wired-rate performance with a negligible increase in total chip area. The P-FSM is implemented using three key technologies: a specific architecture for state management of communications protocols to minimize the logic area, a memory distributed implementation to minimize the program memory, and a new branch operation to minimize the memory area and reduce processing time. Evaluation results show that the P-FSM can handle 10G-EPON/NG-PON2 communications protocols in the same architecture while achieving wired-rate performance. The increase in the total designed area is only 1.5% to 4.9% depending on the number of protocols supported compared to the area of a conventional communications SoC without flexibility. We also clarify that our architecture has the scalability needed to modify the number of FSMs and the maximum number of ONU connections according to the system scale.
Foisal AHMED Michihiro SHINTANI Michiko INOUE
Analyzing aging-induced delay degradations of ring oscillators (ROs) is an effective way to detect recycled field-programmable gate arrays (FPGAs). However, it requires a large number of RO measurements for all FPGAs before shipping, which increases the measurement costs. We propose a cost-efficient recycled FPGA detection method using a statistical performance characterization technique called virtual probe (VP) based on compressed sensing. The VP technique enables the accurate prediction of the spatial process variation of RO frequencies on a die by using a very small number of sample RO measurements. Using the predicted frequency variation as a supervisor, the machine-learning model classifies target FPGAs as either recycled or fresh. Through experiments conducted using 50 commercial FPGAs, we demonstrate that the proposed method achieves 90% cost reduction for RO measurements while preserving the detection accuracy. Furthermore, a one-class support vector machine algorithm was used to classify target FPGAs with around 94% detection accuracy.
Kota MUROI Hayato MASHIKO Yukihide KOHIRA
As process technology advances, chip yield is reduced by timing violations caused by delay variation of gates and wires in fabrication. Recently, post-silicon delay tuning, which inserts programmable delay elements (PDEs) into clock trees before fabrication and adjusts the delays of the PDEs after fabrication to recover from timing violations, has become a promising way to improve the yield. Although post-silicon delay tuning improves the yield, it increases circuit area and power consumption because PDEs are inserted. In this paper, the PDE structure is taken into consideration to reduce the circuit area and the power consumption. Moreover, a delay selection algorithm and a clustering method, in which several PDEs are merged into one PDE that is inserted for multiple registers, are proposed to reduce the circuit area and the power consumption. In computational experiments, the proposed method reduced the circuit area and the power consumption in comparison with an existing method.
The CC-Link proposed by the Mitsubishi Electric Company is an industrial network used exclusively in most industries. However, the probabilities of data loss and interference with equipment control increase if the transmission time is greater than the link scan time of 381µs. The link scan time can be reduced by designing the CC-Link module as an external microprocessor (MPU) interface of R-IN32M3; however, it then suffers from expandability issues. Thus, in this paper, we propose a new CC-Link module utilizing R-IN32M3 to improve the expandability. In our designed CC-Link module, we devise a dual-port RAM (DPRAM) function in an external I/O module, which enables parallel communication between the DPRAM and the external MPU. Our experiment with the implemented CC-Link prototype demonstrates that our CC-Link design improves the communication speed owing to the parallel communication between DPRAM and external MPU, and expandability of remote I/O. Our design achieves miniaturization of the CC-Link module, wiring reduction, and an approximately 30% reduction in the link scan time. Furthermore, because we utilize both the Renesas R-IN32M3 and Xilinx XC95144XL chips widely used in diverse application areas, the designed CC-Link module reduces the investment cost. The proposed design is expected to significantly contribute to the utilization of the programmable logic controller memory and I/O expansion for factory automation and improvement of the investment efficiency in the flat panel display industry.
Toshihiro KATASHITA Masakazu HIOKI Yohei HORI Hanpei KOIKE
Field-programmable gate array (FPGA) devices are applied for accelerating specific calculations and reducing power consumption in a wide range of areas. One of the challenges associated with FPGAs is reducing static power to improve their power effectiveness. We propose a method involving fine-grained reconfiguration of body biases of logic and net resources to reduce the static power of FPGA devices. In addition, we develop an FPGA device called Flex Power FPGA with SOTB technology and demonstrate its power reduction function with a 32-bit counter circuit. In this paper, we describe the construction of an experimental platform to precisely evaluate power consumption and the maximum operating frequency of the device under various operating voltages and body biases with various practical circuits. Using this platform, we evaluate the Flex Power FPGA chip at operating voltages of 0.5-1.0 V and at body biases of 0.0-0.5 V. In the evaluation, we use a 32-bit adder, a 16-bit multiplier, and an SBOX circuit for AES cryptography. We operate the chip virtually with a uniform body bias voltage to drive all of the logic resources with the same threshold voltage. We demonstrate the advantage of the Flex Power FPGA by comparing its performance with non-reconfigurable biasing.
Yasuaki OHIRA Takahiro MATSUMOTO Hideyuki TORII Yuta IDA Shinya MATSUFUJI
In this paper, we propose a new structure for a compact matched filter bank (MFB) for an optical zero-correlation zone (ZCZ) sequence set with Zcz=2^z. The proposed MFB reduces the number of operation elements such as 2-input adders and delay elements. The number of 2-input adders decreases from O(N^2) to O(N log_2 N), and the number of delay elements decreases from O(N^2) to O(N). In addition, the proposed MFBs for sequences of length 32, 64, 128 and 256 with Zcz=2, 4 and 8 are implemented on a field programmable gate array (FPGA). As a result, the numbers of logic elements (LEs) of the proposed MFBs for the sequences with Zcz=2 of length 32, 64, 128 and 256 are suppressed to about 76.2%, 84.2%, 89.7% and 93.4% of those of the conventional MFBs, respectively.
Masayuki FUKUMITSU Shingo HASEGAWA
In recent years, Fischlin and Fleischhacker showed the impossibility of proving the security of specific types of FS-type signatures, the signatures constructed by the Fiat-Shamir transformation, via a single-instance reduction in the non-programmable random oracle model (NPROM, for short). In this paper, we pose a question whether or not the impossibility of proving the security of any FS-type signature can be shown in the NPROM. For this question, we show that each FS-type signature cannot be proven to be secure via a key-preserving reduction in the NPROM from the security against the impersonation of the underlying identification scheme under the passive attack, as long as the identification scheme is secure against the impersonation under the active attack. We also show the security incompatibility between the security of some FS-type signatures in the NPROM via a single-instance key-preserving reduction and the underlying cryptographic assumptions. By applying this result to the Schnorr signature, one can prove the incompatibility between the security of the Schnorr signature in this situation and the discrete logarithm assumption, whereas Fischlin and Fleischhacker showed that such an incompatibility cannot be proven via a non-key-preserving reduction.
An energy-efficient nonvolatile FPGA that assures highly reliable backup operation using a self-terminated power-gating scheme is proposed. Since the write current is automatically cut off just after the temporal data in the flip-flop is successfully backed up in the nonvolatile device, the amount of write energy can be minimized with no write failure. Moreover, when the backup operation in a particular cluster is completed, the power supply of the cluster is immediately turned off, which minimizes standby energy due to leakage current. The total energy consumption during the backup operation is reduced by 66% in comparison with that of a conventional worst-case-based approach, in which a long write current pulse is used to ensure a reliable write.
This paper proposes the 0-1-A-Ā LUT, a new programmable logic element using atom switches, and a delay-optimal mapping algorithm for it. An atom switch is a non-volatile memory device of very small geometry that is fabricated between the metal layers of a VLSI, and it can be used as a switch device with very small on-resistance and parasitic capacitance. While considerable area reduction of the Look Up Tables (LUTs) used in conventional Field Programmable Gate Arrays (FPGAs) has been achieved by simply replacing each SRAM element with a memory element using a pair of atom switches, our 0-1-A-Ā LUT achieves further area and delay reduction. Unlike the conventional atom-switch-based LUT, in which all k input signals are fed to a MUX, one of the input signals is fed to the switch array, resulting in area reduction due to the reduced number of MUX inputs, from 2^k to 2^(k-1), as well as delay reduction due to the reduced fanout load of the input buffers. Since the fanout of these input buffers depends on the mapped logic function, this paper also proposes technology mapping algorithms that select logic functions with fewer input-buffer fanouts to achieve further delay reduction. In our experiments, the circuit delay using our k-LUT is 0.94% smaller in the best case compared with using the conventional atom-switch-based k-LUT.
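The reduction in MUX inputs follows from splitting the function on one input A: for each setting of the other k-1 inputs, the output is constant 0, constant 1, A, or its complement. The following is a minimal sketch of deriving such a configuration from a truth table. It assumes that A is bit 0 of the truth-table index and writes the complement as "~A"; the function names are hypothetical, and this is not the paper's mapping algorithm.

```python
def map_to_01aa(truth_table, k):
    """Derive the 2^(k-1) data-input symbols of a 0-1-A-~A style LUT.

    truth_table[i] is the output for input index i, where bit 0 of i is
    input A (an assumption) and the remaining k-1 bits select a MUX input.
    """
    assert len(truth_table) == 2 ** k
    symbols = {(0, 0): "0", (1, 1): "1", (0, 1): "A", (1, 0): "~A"}
    config = []
    for sel in range(2 ** (k - 1)):
        f_a0 = truth_table[2 * sel]       # output when A = 0
        f_a1 = truth_table[2 * sel + 1]   # output when A = 1
        config.append(symbols[(f_a0, f_a1)])
    return config


def evaluate(config, sel, a):
    """Evaluate the configured LUT for MUX select value `sel` and input A."""
    s = config[sel]
    if s == "0":
        return 0
    if s == "1":
        return 1
    return a if s == "A" else 1 - a


def buffer_fanout(config):
    """Count how many MUX data inputs are driven by A and by ~A."""
    return config.count("A"), config.count("~A")
```

For a 3-input majority function, `map_to_01aa([0, 0, 0, 1, 0, 1, 1, 1], 3)` yields `['0', 'A', 'A', '1']`, so only two of the four MUX data inputs load the buffer for A. This count is the function-dependent fanout that the mapping algorithm in the abstract tries to reduce.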
The purpose of DNA sequencing is to determine the order of nucleotides within a target DNA molecule. The target DNA molecules are fragmented by a next-generation sequencing (NGS) machine into short reads, which are short fixed-length subsequences composed of ‘A’, ‘C’, ‘G’, and ‘T’. To reconstruct the target DNA from the short reads using a reference genome, which is a representative example of a species constructed in advance, it is necessary to determine the locations in the target DNA from which the reads were extracted by aligning them onto the reference genome. This process is called short read mapping, and improving its performance is important for realizing fast DNA sequencing. We propose three FPGA acceleration methods based on a hash table: (1) sorting and parallel comparison, (2) matching that allows one mutation to reduce the number of candidates, and (3) an optimized hash function using variable masks. The first reduces the number of accesses to off-chip memory to avoid the bottleneck caused by access latency. The second reduces the number of candidates without degrading mapping sensitivity by allowing one mutation in the comparison. The last reduces hash collisions using a table calculated from the reference genome in advance. We implemented the three methods on a Xilinx Virtex-7 and evaluated them to show their effectiveness. In our experiments, our system achieves a 20-fold processing speed compared with BWA, which is one of the most popular mapping tools. Furthermore, we show that our system outperforms one of the fastest FPGA short read mapping systems.
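Hash-table-based short read mapping can be illustrated with a generic seed-and-verify sketch. This is not a reconstruction of the three proposed methods: the function names, the use of a Python dict as the hash table, and the seeding scheme are assumptions made for illustration. Seeds are taken at non-overlapping offsets, so a read with at most one mismatch still has at least one exact seed when the read is at least twice the seed length.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Hash table from every seed_len-mer of the reference to its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, seed_len, max_mismatches=1):
    """Return the sorted reference positions where `read` aligns.

    Candidates come from exact seed hits; each candidate is then verified
    against the reference with at most `max_mismatches` substitutions.
    """
    hits = set()
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, ()):
            start = pos - offset            # implied start of the whole read
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            if hamming(read, window) <= max_mismatches:
                hits.add(start)
    return sorted(hits)
```

With `reference = "ACGTACGGTCAGTTACGATC"` and `index = build_index(reference, 4)`, the read `"CTGTCAGT"` differs from the reference at one base and is still located through its second seed, `"CAGT"`.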