Phong X. NGUYEN Hung Q. CAO Khang V. T. NGUYEN Hung NGUYEN Takehisa YAIRI
In recent years, there has been an increasing trend of applying artificial intelligence in many different fields, which has a profound and direct impact on human life. Consequently, this raises the need to understand the principles of model making predictions. Since most current high-precision models are black boxes, neither the AI scientist nor the end-user profoundly understands what is happening inside these models. Therefore, many algorithms are studied to explain AI models, especially those in the image classification problem in computer vision such as LIME, CAM, GradCAM. However, these algorithms still have limitations, such as LIME's long execution time and CAM's confusing interpretation of concreteness and clarity. Therefore, in this paper, we will propose a new method called Segmentation - Class Activation Mapping (SeCAM)/ This method combines the advantages of these algorithms above while at simultaneously overcoming their disadvantages. We tested this algorithm with various models, including ResNet50, InceptionV3, and VGG16 from ImageNet Large Scale Visual Recognition Challenge (ILSVRC) data set. Outstanding results were achieved when the algorithm has met all the requirements for a specific explanation in a remarkably short space of time.
Takayuki NOZAKI Motohiko ISAKA
Low-density parity-check (LDPC) codes are widely used in communication systems for their high error-correcting performance. This survey introduces the elements of LDPC codes: decoding algorithms, code construction, encoding algorithms, and several classes of LDPC codes.
This paper presents a novel blind signal separation method for the measurement of pulse waves at multiple body positions using an array radar system. The proposed method is based on a mathematical model of pulse wave propagation. The model relies on three factors: (1) a small displacement approximation, (2) beam pattern orthogonality, and (3) an impulse response model of pulse waves. The separation of radar echoes is formulated as an optimization problem, and the associated objective function is established using the mathematical model. We evaluate the performance of the proposed method using measured radar data from participants lying in a prone position. The accuracy of the proposed method, in terms of estimating the body displacements, is measured using reference data taken from laser displacement sensors. The average estimation errors are found to be 10-21% smaller than those of conventional methods. These results indicate the effectiveness of the proposed method for achieving noncontact measurements of the displacements of multiple body positions.
Nowadays, a rapid increase of demand on high-performance computation causes the enthusiastic research activities regarding massively parallel systems. An interconnection network in a massively parallel system interconnects a huge number of processing elements so that they can cooperate to process tasks by communicating among others. By regarding a processing element and a link between a pair of processing elements as a node and an edge, respectively, many problems with respect to communication and/or routing in an interconnection network are reducible to the problems in the graph theory. For interconnection networks of the massively parallel systems, many topologies have been proposed so far. The hypercube is a very popular topology and it has many variants. The bicube is a such topology and it can interconnect the same number of nodes with the same degree as the hypercube while its diameter is almost half of that of the hypercube. In addition, the bicube keeps the node-symmetric property. Hence, we focus on the bicube and propose an algorithm that gives a minimal or shortest path between an arbitrary pair of nodes. We give a proof of correctness of the algorithm and demonstrate its execution.
Takashi YOKOTA Kanemitsu OOTSU Shun KOJIMA
Parallel computing essentially consists of computation and communication and, in many cases, communication performance is vital. Many parallel applications use collective communications, which often dominate the performance of the parallel execution. This paper focuses on collective communication performance to speed-up the parallel execution. This paper firstly offers our experimental result that splitting a session of collective communication to small portions (slices) possibly enables efficient communication. Then, based on the results, this paper proposes a new concept cup-stacking with a genetic algorithm based methodology. The preliminary evaluation results reveal the effectiveness of the proposed method.
Shiqing QIAN Wenping GE Yongxing ZHANG Pengju ZHANG
Sparse code division multiple access (SCMA) is a non-orthogonal multiple access (NOMA) technology that can improve frequency band utilization and allow many users to share quite a few resource elements (REs). This paper uses the modulation of lattice theory to develop a systematic construction procedure for the design of SCMA codebooks under Gaussian channel environments that can achieve near-optimal designs, especially for cases that consider large-scale SCMA parameters. However, under the condition of large-scale SCMA parameters, the mother constellation (MC) points will overlap, which can be solved by the method of the partial dimensions transformation (PDT). More importantly, we consider the upper bounded error probability of the signal transmission in the AWGN channels, and design a codeword allocation method to reduce the inter symbol interference (ISI) on the same RE. Simulation results show that under different codebook sizes and different overload rates, using two different message passing algorithms (MPA) to verify, the codebook proposed in this paper has a bit error rate (BER) significantly better than the reference codebooks, moreover the convergence time does not exceed that of the reference codebooks.
Fukashi MORISHITA Wataru SAITO Norihito KATO Yoichi IIZUKA Masao ITO
This paper proposes novel test techniques for high accuracy measurement of ADCs and a ramp generator on a CMOS image sensor (CIS) chip. The test circuit for the ADCs has a dual path and has an ability of multi-functional fine pattern generator that can define any input for each column to evaluate CIS specific characteristics electrically. The test circuit for the ramp generator can realize an on-chip current cell test and reject the current cell failure within 1LSB accuracy. We fabricated the test sensor using 55nm CIS process and measured the IP characteristics. Measured results show INL of 14.6LSB, crosstalk of 14.9LSB and column interference noise of 5.4LSB. These measured results agree with the designed values. By using this technique, we confirmed the accurate ADC measurement can be realized without being affected by the ambiguity of the optical input.
Takahiro KAWAGUCHI Naofumi TAKAGI
A 32-bit arithmetic logic unit (ALU) is designed for a rapid single flux quantum (RSFQ) bit-parallel processor. In the ALU, clocked gates are partially replaced by clockless gates. This reduces the number of D flip flops (DFFs) required for path balancing. The number of clocked gates, including DFFs, is reduced by approximately 40 %, and size of the clock distribution network is reduced. The number of pipeline stages becomes modest. The layout design of the ALU and simulation results show the effectiveness of using clockless gates in wide datapath circuits.
Fumihiro CHINA Naoki TAKEUCHI Hideo SUZUKI Yuki YAMANASHI Hirotaka TERAI Nobuyuki YOSHIKAWA
The adiabatic quantum flux parametron (AQFP) is an energy-efficient, high-speed superconducting logic device. To observe the tiny output currents from the AQFP in experiments, high-speed voltage drivers are indispensable. In the present study, we develop a compact voltage driver for AQFP logic based on a Josephson latching driver (JLD), which has been used as a high-speed driver for rapid single-flux-quantum (RSFQ) logic. In the JLD-based voltage driver, the signal currents of AQFP gates are converted into gap-voltage-level signals via an AQFP/RSFQ interface and a four-junction logic gate. Furthermore, this voltage driver includes only 15 Josephson junctions, which is much fewer than in the case for the previously designed driver based on dc superconducting quantum interference devices (60 junctions). In measurement, we successfully operate the JLD-based voltage driver up to 4 GHz. We also evaluate the bit error rate (BER) of the driver and find that the BER is 7.92×10-10 and 2.67×10-3 at 1GHz and 4GHz, respectively.
Taiki YAMAE Naoki TAKEUCHI Nobuyuki YOSHIKAWA
The adiabatic quantum-flux-parametron (AQFP) is an energy-efficient superconductor logic device. In a previous study, we proposed a low-latency clocking scheme called delay-line clocking, and several low-latency AQFP logic gates have been demonstrated. In delay-line clocking, the latency between adjacent excitation phases is determined by the propagation delay of excitation currents, and thus the rising time of excitation currents should be sufficiently small; otherwise, an AQFP gate can switch before the previous gate is fully excited. This means that delay-line clocking needs high clock frequencies, because typical excitation currents are sinusoidal and the rising time depends on the frequency. However, AQFP circuits need to be tested in a wide frequency range experimentally. Hence, in the present study, we investigate AQFP circuits adopting delay-line clocking with square excitation currents to apply delay-line clocking in a low frequency range. Square excitation currents have shorter rising time than sinusoidal excitation currents and thus enable low frequency operation. We demonstrate an AQFP buffer chain with delay-line clocking using square excitation currents, in which the latency is approximately 20ps per gate, and confirm that the operating margin for the buffer chain is kept sufficiently wide at clock frequencies below 1GHz, whereas in the sinusoidal case the operating margin shrinks below 500MHz. These results indicate that AQFP circuits adopting delay-line clocking can operate in a low frequency range by using square excitation currents.
Tomohiro YAMAJI Masayuki SHIRANE Tsuyoshi YAMAMOTO
A Josephson parametric oscillator (JPO) is an interesting system from the viewpoint of quantum optics because it has two stable self-oscillating states and can deterministically generate quantum cat states. A theoretical proposal has been made to operate a network of multiple JPOs as a quantum annealer, which can solve adiabatically combinatorial optimization problems at high speed. Proof-of-concept experiments have been actively conducted for application to quantum computations. This article provides a review of the mechanism of JPOs and their application as a quantum annealer.
Tomoyuki TANAKA Christopher L. AYALA Nobuyuki YOSHIKAWA
Extremely energy-efficient logic devices are required for future low-power high-performance computing systems. Superconductor electronic technology has a number of energy-efficient logic families. Among them is the adiabatic quantum-flux-parametron (AQFP) logic family, which adiabatically switches the quantum-flux-parametron (QFP) circuit when it is excited by an AC power-clock. When compared to state-of-the-art CMOS technology, AQFP logic circuits have the advantage of relatively fast clock rates (5 GHz to 10 GHz) and 5 - 6 orders of magnitude reduction in energy before cooling overhead. We have been developing extremely energy-efficient computing processor components using the AQFP. The adder is the most basic computational unit and is important in the development of a processor. In this work, we designed and measured a 16-bit parallel prefix carry look-ahead Kogge-Stone adder (KSA). We fabricated the circuit using the AIST 10 kA/cm2 High-speed STandard Process (HSTP). Due to a malfunction in the measurement system, we were not able to confirm the complete operation of the circuit at the low frequency of 100 kHz in liquid He, but we confirmed that the outputs that we did observe are correct for two types of tests: (1) critical tests and (2) 110 random input tests in total. The operation margin of the circuit is wide, and we did not observe any calculation errors during measurement.
Stanislav SEDUKHIN Yoichi TOMIOKA Kohei YAMAMOTO
In this paper, starting from the algorithm, a performance- and energy-efficient 3D structure or shape of the Tensor Processing Engine (TPE) for CNN acceleration is systematically searched and evaluated. An optimal accelerator's shape maximizes the number of concurrent MAC operations per clock cycle while minimizes the number of redundant operations. The proposed 3D vector-parallel TPE architecture with an optimal shape can be very efficiently used for considerable CNN acceleration. Due to implemented support of inter-block image data independency, it is possible to use multiple of such TPEs for the additional CNN acceleration. Moreover, it is shown that the proposed TPE can also be uniformly used for acceleration of the different CNN models such as VGG, ResNet, YOLO, and SSD. We also demonstrate that our theoretical efficiency analysis is matched with the result of a real implementation for an SSD model to which a state-of-the-art channel pruning technique is applied.
In this study, we aim to improve the performance of audio source separation for monaural mixture signals. For monaural audio source separation, semisupervised nonnegative matrix factorization (SNMF) can achieve higher separation performance by employing small supervised signals. In particular, penalized SNMF (PSNMF) with orthogonality penalty is an effective method. PSNMF forces two basis matrices for target and nontarget sources to be orthogonal to each other and improves the separation accuracy. However, the conventional orthogonality penalty is based on an inner product and does not affect the estimation of the basis matrix properly because of the scale indeterminacy between the basis and activation matrices in NMF. To cope with this problem, a new PSNMF with cosine similarity between the basis matrices is proposed. The experimental comparison shows the efficacy of the proposed cosine similarity penalty in supervised audio source separation.
Ryota YOSHIMURA Ichiro MARUTA Kenji FUJIMOTO Ken SATO Yusuke KOBAYASHI
Particle filters have been widely used for state estimation problems in nonlinear and non-Gaussian systems. Their performance depends on the given system and measurement models, which need to be designed by the user for each target system. This paper proposes a novel method to design these models for a particle filter. This is a numerical optimization method, where the particle filter design process is interpreted into the framework of reinforcement learning by assigning the randomnesses included in both models of the particle filter to the policy of reinforcement learning. In this method, estimation by the particle filter is repeatedly performed and the parameters that determine both models are gradually updated according to the estimation results. The advantage is that it can optimize various objective functions, such as the estimation accuracy of the particle filter, the variance of the particles, the likelihood of the parameters, and the regularization term of the parameters. We derive the conditions to guarantee that the optimization calculation converges with probability 1. Furthermore, in order to show that the proposed method can be applied to practical-scale problems, we design the particle filter for mobile robot localization, which is an essential technology for autonomous navigation. By numerical simulations, it is demonstrated that the proposed method further improves the localization accuracy compared to the conventional method.
Tongzhou QU Zibin DAI Yanjiang LIU Lin CHEN Xianzhao XIA
The existing research on Amdahl's law is limited to multi/many-core processors, and cannot be applied to the important parallel processing architecture of coarse-grained reconfigurable arrays. This paper studies the relation between the multi-level parallelism of block cipher algorithms and the architectural characteristics of coarse-grain reconfigurable arrays. We introduce the key variables that affect the performance of reconfigurable arrays, such as communication overhead and configuration overhead, into Amdahl's law. On this basis, we propose a performance model for coarse-grain reconfigurable block cipher array (CGRBA) based on the extended Amdahl's law. In addition, this paper establishes the optimal integer nonlinear programming model, which can provide a parameter reference for the architecture design of CGRBA. The experimental results show that: (1) reducing the communication workload ratio and increasing the number of configuration pages reasonably can significantly improve the algorithm performance on CGRBA; (2) the communication workload ratio has a linear effect on the execution time.
This paper evaluates the bluetooth low energy (BLE) positioning systems using the sparse-training data through the comparison experiments. The sparse-training data is extracted from the database including enough data for realizing the highly accurate and precise positioning. First, we define the sparse-training data, i.e., the data collection time and the number of smartphones, directions, beacons, and reference points, on BLE positioning systems. Next, the positioning performance evaluation experiments are conducted in two indoor environments, that is, an indoor corridor as a one-dimensionally spread environment and a hall as a twodimensionally spread environment. The algorithms for comparison are the conventional fingerprint algorithm and the hybrid algorithm (the authors already proposed, and combined the proximity algorithm and the fingerprint algorithm). Based on the results, we confirm that the hybrid algorithm performs well in many cases even when using sparse-training data. Consequently, the robustness of the hybrid algorithm, that the authors already proposed for the sparse-training data, is shown.
Zhimin GUO Jianfei CHEN Sheng ZHANG
Millimeter wave synthetic aperture interferometric radiometers (SAIR) are very powerful instruments, which can effectively realize high-precision imaging detection. However due to the existence of interference factor and complex near-field error, the imaging effect of near-field SAIR is usually not ideal. To achieve better imaging results, a new fully connected imaging network (FCIN) is proposed for near-field SAIR. In FCIN, the fully connected network is first used to reconstruct the image domain directly from the visibility function, and then the residual dense network is used for image denoising and enhancement. The simulation results show that the proposed FCIN method has high imaging accuracy and shorten imaging time.
Wen SHI Jianling LIU Jingyu ZHANG Yuran MEN Hongwei CHEN Deke WANG Yang CAO
Syndrome is a crucial principle of Traditional Chinese Medicine. Formula classification is an effective approach to discover herb combinations for the clinical treatment of syndromes. In this study, a local search based firefly algorithm (LSFA) for parameter optimization and feature selection of support vector machines (SVMs) for formula classification is proposed. Parameters C and γ of SVMs are optimized by LSFA. Meanwhile, the effectiveness of herbs in formula classification is adopted as a feature. LSFA searches for well-performing subsets of features to maximize classification accuracy. In LSFA, a local search of fireflies is developed to improve FA. Simulations demonstrate that the proposed LSFA-SVM algorithm outperforms other classification algorithms on different datasets. Parameters C and γ and the features are optimized by LSFA to obtain better classification performance. The performance of FA is enhanced by the proposed local search mechanism.
Yutaro BESSHO Yuto HAYAMIZU Kazuo GODA Masaru KITSUREGAWA
Parallel processing is a typical approach to answer analytical queries on large database. As the size of the database increases, we often try to increase the parallelism by incorporating more processing nodes. However, this approach increases the possibility of node failure as well. According to the conventional practice, if a failure occurs during query processing, the database system restarts the query processing from the beginning. Such temporal cost may be unacceptable to the user. This paper proposes a fault-tolerant query processing mechanism, named PhoeniQ, for analytical parallel database systems. PhoeniQ continuously takes a checkpoint for every operator pipeline and replicates the output of each stateful operator among different processing nodes. If a single processing node fails during query processing, another can promptly take over the processing. Hence, PhoneniQ allows the database system to efficiently resume query processing after a partial failure event. This paper presents a key design of PhoeniQ and prototype-based experiments to demonstrate that PhoeniQ imposes negligible performance overhead and efficiently continues query processing in the face of node failure.