Baoquan ZHONG Zhiqun CHENG Minshi JIA Bingxin LI Kun WANG Zhenghao YANG Zheming ZHU
Kazuya TADA
Suguru KURATOMI Satoshi USUI Yoko TATEWAKI Hiroaki USUI
Yoshihiro NAKA Masahiko NISHIMOTO Mitsuhiro YOKOTA
Hiroki Hoshino Kentaro Kusama Takayuki Arai
Tsuneki YAMASAKI
Kengo SUGAHARA
Cuong Manh BUI Hiroshi SHIRAI
Hiroyuki DEGUCHI Masataka OHIRA Mikio TSUJI
Hiroto Tochigi Masakazu Nakatani Ken-ichi Aoshima Mayumi Kawana Yuta Yamaguchi Kenji Machida Nobuhiko Funabashi Hideo Fujikake
Yuki Imamura Daiki Fujii Yuki Enomoto Yuichi Ueno Yosei Shibata Munehiro Kimura
Keiya IMORI Junya SEKIKAWA
Naoki KANDA Junya SEKIKAWA
Yongzhe Wei Zhongyuan Zhou Zhicheng Xue Shunyu Yao Haichun Wang
Mio TANIGUCHI Akito IGUCHI Yasuhide TSUJI
Kouji SHIBATA Masaki KOBAYASHI
Zhi Earn TAN Kenjiro MATSUMOTO Masaya TAKAGI Hiromasa SAEKI Masaya TAMURA
Misato ONISHI Kazuhiro YAMAGUCHI Yuji SAKAMOTO
Koya TANIKAWA Shun FUJII Soma KOGURE Shuya TANAKA Shun TASAKA Koshiro WADA Satoki KAWANISHI Takasumi TANABE
Shotaro SUGITANI Ryuichi NAKAJIMA Keita YOSHIDA Jun FURUTA Kazutoshi KOBAYASHI
Ryosuke Ichikawa Takumi Watanabe Hiroki Takatsuka Shiro Suyama Hirotsugu Yamamoto
Chan-Liang Wu Chih-Wen Lu
Umer FAROOQ Masayuki MORI Koichi MAEZAWA
Ryo ITO Sumio SUGISAKI Toshiyuki KAWAHARAMURA Tokiyoshi MATSUDA Hidenori KAWANISHI Mutsumi KIMURA
Paul Cain
Arie SETIAWAN Shu SATO Naruto YONEMOTO Hitoshi NOHMI Hiroshi MURATA
Seiichiro Izawa
Hang Liu Fei Wu
Keiji GOTO Toru KAWANO Ryohei NAKAMURA
Takahiro SASAKI Yukihiro KAMIYA
Xiang XIONG Wen LI Xiaohua TAN Yusheng HU
Tohgo HOSODA Kazuyuki SAITO
Yihan ZHU Takashi OHSAWA
Shengbao YU Fanze MENG Yihan SHEN Yuzhu HAO Haigen ZHOU
Stanislav SEDUKHIN Yoichi TOMIOKA Kohei YAMAMOTO
In this paper, starting from the algorithm, a performance- and energy-efficient 3D structure or shape of the Tensor Processing Engine (TPE) for CNN acceleration is systematically searched and evaluated. An optimal accelerator's shape maximizes the number of concurrent MAC operations per clock cycle while minimizes the number of redundant operations. The proposed 3D vector-parallel TPE architecture with an optimal shape can be very efficiently used for considerable CNN acceleration. Due to implemented support of inter-block image data independency, it is possible to use multiple of such TPEs for the additional CNN acceleration. Moreover, it is shown that the proposed TPE can also be uniformly used for acceleration of the different CNN models such as VGG, ResNet, YOLO, and SSD. We also demonstrate that our theoretical efficiency analysis is matched with the result of a real implementation for an SSD model to which a state-of-the-art channel pruning technique is applied.
Kentaro KAWAKAMI Kouji KURIHARA Masafumi YAMAZAKI Takumi HONDA Naoto FUKUMOTO
To accelerate deep learning (DL) processes on the supercomputer Fugaku, the authors have ported and optimized oneDNN for Fugaku's CPU, the Fujitsu A64FX. oneDNN is an open-source DL processing library developed by Intel for the x86_64 architecture. The A64FX CPU is based on the Armv8-A architecture. oneDNN dynamically creates the execution code for the computation kernels, which are implemented at the granularity of x86_64 instructions using Xbyak, the Just-In-Time (JIT) assembler for x86_64 architecture. To port oneDNN to A64FX, it must be rewritten into Armv8-A instructions using Xbyak_aarch64, the JIT assembler for the Armv8-A architecture. This is challenging because the number of steps to be rewritten exceeds several tens of thousands of lines. This study presents the Xbyak_translator_aarch64. Xbyak_translator_aarch64 is a binary translator that at runtime converts dynamically produced executable codes for the x86_64 architecture into executable codes for the Armv8-A architecture. Xbyak_translator_aarch64 eliminates the need to rewrite the source code for porting oneDNN to A64FX and allows us to port oneDNN to A64FX quickly.
Shunsuke TSUKADA Hikaru TAKAYASHIKI Masayuki SATO Kazuhiko KOMATSU Hiroaki KOBAYASHI
A hybrid memory architecture (HMA) that consists of some distinct memory devices is expected to achieve a good balance between high performance and large capacity. Unlike conventional memory architectures, the HMA needs the metadata for data management since the data are migrated between the memory devices during the execution of an application. The memory controller caches the metadata to avoid accessing the memory devices for the metadata reference. However, as the amount of the metadata increases in proportion to the size of the HMA, the memory controller needs to handle a large amount of metadata. As a result, the memory controller cannot cache all the metadata and increases the number of metadata references. This results in an increase in the access latency to reach the target data and degrades the performance. To solve this problem, this paper proposes a metadata prefetching mechanism for HMAs. The proposed mechanism loads the metadata needed in the near future by prefetching. Moreover, to increase the effect of the metadata prefetching, the proposed mechanism predicts the metadata used in the near future based on an address difference that is the difference between two consecutive access addresses. The evaluation results show that the proposed metadata prefetching mechanism can improve the instructions per cycle by up to 44% and 9% on average.
Takahiro KAWAGUCHI Naofumi TAKAGI
A 32-bit arithmetic logic unit (ALU) is designed for a rapid single flux quantum (RSFQ) bit-parallel processor. In the ALU, clocked gates are partially replaced by clockless gates. This reduces the number of D flip flops (DFFs) required for path balancing. The number of clocked gates, including DFFs, is reduced by approximately 40 %, and size of the clock distribution network is reduced. The number of pipeline stages becomes modest. The layout design of the ALU and simulation results show the effectiveness of using clockless gates in wide datapath circuits.
Naoki TAKEUCHI Taiki YAMAE Christopher L. AYALA Hideo SUZUKI Nobuyuki YOSHIKAWA
The adiabatic quantum-flux-parametron (AQFP) is an energy-efficient superconductor logic element based on the quantum flux parametron. AQFP circuits can operate with energy dissipation near the thermodynamic and quantum limits by maximizing the energy efficiency of adiabatic switching. We have established the design methodology for AQFP logic and developed various energy-efficient systems using AQFP logic, such as a low-power microprocessor, reversible computer, single-photon image sensor, and stochastic electronics. We have thus demonstrated the feasibility of the wide application of AQFP logic in future information and communications technology. In this paper, we present a tutorial review on AQFP logic to provide insights into AQFP circuit technology as an introduction to this research field. We describe the historical background, operating principle, design methodology, and recent progress of AQFP logic.
Fumihiro CHINA Naoki TAKEUCHI Hideo SUZUKI Yuki YAMANASHI Hirotaka TERAI Nobuyuki YOSHIKAWA
The adiabatic quantum flux parametron (AQFP) is an energy-efficient, high-speed superconducting logic device. To observe the tiny output currents from the AQFP in experiments, high-speed voltage drivers are indispensable. In the present study, we develop a compact voltage driver for AQFP logic based on a Josephson latching driver (JLD), which has been used as a high-speed driver for rapid single-flux-quantum (RSFQ) logic. In the JLD-based voltage driver, the signal currents of AQFP gates are converted into gap-voltage-level signals via an AQFP/RSFQ interface and a four-junction logic gate. Furthermore, this voltage driver includes only 15 Josephson junctions, which is much fewer than in the case for the previously designed driver based on dc superconducting quantum interference devices (60 junctions). In measurement, we successfully operate the JLD-based voltage driver up to 4 GHz. We also evaluate the bit error rate (BER) of the driver and find that the BER is 7.92×10-10 and 2.67×10-3 at 1GHz and 4GHz, respectively.
Tomoyuki TANAKA Christopher L. AYALA Nobuyuki YOSHIKAWA
Extremely energy-efficient logic devices are required for future low-power high-performance computing systems. Superconductor electronic technology has a number of energy-efficient logic families. Among them is the adiabatic quantum-flux-parametron (AQFP) logic family, which adiabatically switches the quantum-flux-parametron (QFP) circuit when it is excited by an AC power-clock. When compared to state-of-the-art CMOS technology, AQFP logic circuits have the advantage of relatively fast clock rates (5 GHz to 10 GHz) and 5 - 6 orders of magnitude reduction in energy before cooling overhead. We have been developing extremely energy-efficient computing processor components using the AQFP. The adder is the most basic computational unit and is important in the development of a processor. In this work, we designed and measured a 16-bit parallel prefix carry look-ahead Kogge-Stone adder (KSA). We fabricated the circuit using the AIST 10 kA/cm2 High-speed STandard Process (HSTP). Due to a malfunction in the measurement system, we were not able to confirm the complete operation of the circuit at the low frequency of 100 kHz in liquid He, but we confirmed that the outputs that we did observe are correct for two types of tests: (1) critical tests and (2) 110 random input tests in total. The operation margin of the circuit is wide, and we did not observe any calculation errors during measurement.
Taiki YAMAE Naoki TAKEUCHI Nobuyuki YOSHIKAWA
The adiabatic quantum-flux-parametron (AQFP) is an energy-efficient superconductor logic device. In a previous study, we proposed a low-latency clocking scheme called delay-line clocking, and several low-latency AQFP logic gates have been demonstrated. In delay-line clocking, the latency between adjacent excitation phases is determined by the propagation delay of excitation currents, and thus the rising time of excitation currents should be sufficiently small; otherwise, an AQFP gate can switch before the previous gate is fully excited. This means that delay-line clocking needs high clock frequencies, because typical excitation currents are sinusoidal and the rising time depends on the frequency. However, AQFP circuits need to be tested in a wide frequency range experimentally. Hence, in the present study, we investigate AQFP circuits adopting delay-line clocking with square excitation currents to apply delay-line clocking in a low frequency range. Square excitation currents have shorter rising time than sinusoidal excitation currents and thus enable low frequency operation. We demonstrate an AQFP buffer chain with delay-line clocking using square excitation currents, in which the latency is approximately 20ps per gate, and confirm that the operating margin for the buffer chain is kept sufficiently wide at clock frequencies below 1GHz, whereas in the sinusoidal case the operating margin shrinks below 500MHz. These results indicate that AQFP circuits adopting delay-line clocking can operate in a low frequency range by using square excitation currents.
Tomohiro YAMAJI Masayuki SHIRANE Tsuyoshi YAMAMOTO
A Josephson parametric oscillator (JPO) is an interesting system from the viewpoint of quantum optics because it has two stable self-oscillating states and can deterministically generate quantum cat states. A theoretical proposal has been made to operate a network of multiple JPOs as a quantum annealer, which can solve adiabatically combinatorial optimization problems at high speed. Proof-of-concept experiments have been actively conducted for application to quantum computations. This article provides a review of the mechanism of JPOs and their application as a quantum annealer.
Shuhei TAMATE Yutaka TABUCHI Yasunobu NAKAMURA
In this paper, we review the basic components of superconducting quantum computers. We mainly focus on the packaging and wiring technologies required to realize large-scalable superconducting quantum computers.
Kenta SATO Naonori SEGA Yuta SOMEI Hiroshi SHIMADA Takeshi ONOMI Yoshinao MIZUGAKI
We experimentally evaluated random number sequences generated by a superconducting hardware random number generator composed of a Josephson-junction oscillator, a rapid-single-flux-quantum (RSFQ) toggle flip-flop (TFF), and an RSFQ AND gate. Test circuits were fabricated using a 10 kA/cm2 Nb/AlOx/Nb integration process. Measurements were conducted in a liquid helium bath. The random numbers were generated for a trigger frequency of 500 kHz under the oscillating Josephson-junction at 29 GHz. 26 random number sequences of 20 kb length were evaluated for bias voltages between 2.0 and 2.7 mV. The NIST FIPS PUBS 140-2 tests were used for the evaluation. 100% pass rates were confirmed at the bias voltages of 2.5 and 2.6 mV. We found that the Monobit test limited the pass rates. As numerical simulations suggested, a detailed evaluation for the probability of obtaining “1” demonstrated the monotonical dependence on the bias voltage.