The search functionality is under construction.

Author Search Result

[Author] Yasuhiko NAKASHIMA(25hit)

1-20hit(25hit)

  • FOREWORD Open Access

    Yasuhiko NAKASHIMA  

     
    FOREWORD

      Vol:
    E98-D No:12
      Page(s):
    2047-2047
  • Tensor Rank and Strong Quantum Nondeterminism in Multiparty Communication

    Marcos VILLAGRA  Masaki NAKANISHI  Shigeru YAMASHITA  Yasuhiko NAKASHIMA  

     
    PAPER-Fundamentals of Information Systems

      Vol:
    E96-D No:1
      Page(s):
    1-8

    In this paper we study quantum nondeterminism in multiparty communication. There are three (possibly) different types of nondeterminism in quantum computation: i) strong, ii) weak with classical proofs, and iii) weak with quantum proofs. Here we focus on the first one. A strong quantum nondeterministic protocol accepts a correct input with positive probability and rejects an incorrect input with probability 1. In this work we relate strong quantum nondeterministic multiparty communication complexity to the rank of the communication tensor in the Number-On-Forehead and Number-In-Hand models. In particular, by extending the definition proposed by de Wolf to nondeterministic tensor-rank (nrank), we show that for any boolean function f when there is no prior shared entanglement between the players, 1) in the Number-On-Forehead model the cost is upper-bounded by the logarithm of nrank(f); 2) in the Number-In-Hand model the cost is lower-bounded by the logarithm of nrank(f). Furthermore, we show that when the number of players is o(log log n), we have NQP BQP for Number-On-Forehead communication.

  • A Hybrid Bayesian-Convolutional Neural Network for Adversarial Robustness

    Thi Thu Thao KHONG  Takashi NAKADA  Yasuhiko NAKASHIMA  

     
    PAPER-Image Recognition, Computer Vision

      Pubricized:
    2022/04/11
      Vol:
    E105-D No:7
      Page(s):
    1308-1319

    We introduce a hybrid Bayesian-convolutional neural network (hyBCNN) for improving the robustness against adversarial attacks and decreasing the computation time in the Bayesian inference phase. Our hyBCNN models are built from a part of BNN and CNN. Based on pre-trained CNNs, we only replace convolutional layers and activation function of the initial stage of CNNs with our Bayesian convolutional (BC) and Bayesian activation (BA) layers as a term of transfer learning. We keep the remainder of CNNs unchanged. We adopt the Bayes without Bayesian Learning (BwoBL) algorithm for hyBCNN networks to execute Bayesian inference towards adversarial robustness. Our proposal outperforms adversarial training and robust activation function, which are currently the outstanding defense methods of CNNs in the resistance to adversarial attacks such as PGD and C&W. Moreover, the proposed architecture with BwoBL can easily integrate into any pre-trained CNN, especially in scaling networks, e.g., ResNet and EfficientNet, with better performance on large-scale datasets. In particular, under l∞ norm PGD attack of pixel perturbation ε=4/255 with 100 iterations on ImageNet, our best hyBCNN EfficientNet reaches 93.92% top-5 accuracy without additional training.

  • Programmable Analog Calculation Unit with Two-Stage Architecture: A Solution of Efficient Vector-Computation Open Access

    Renyuan ZHANG  Takashi NAKADA  Yasuhiko NAKASHIMA  

     
    PAPER

      Vol:
    E102-A No:7
      Page(s):
    878-885

    A programmable analog calculation unit (ACU) is designed for vector computations in continuous-time with compact circuit scale. From our early study, it is feasible to retrieve arbitrary two-variable functions through support vector regression (SVR) in silicon. In this work, the dimensions of regression are expanded for vector computations. However, the hardware cost and computing error greatly increase along with the expansion of dimensions. A two-stage architecture is proposed to organize multiple ACUs for high dimensional regression. The computation of high dimensional vectors is separated into several computations of lower dimensional vectors, which are implemented by the free combination of several ACUs with lower cost. In this manner, the circuit scale and regression error are reduced. The proof-of-concept ACU is designed and simulated in a 0.18μm technology. From the circuit simulation results, all the demonstrated calculations with nine operands are executed without iterative clock cycles by 4960 transistors. The calculation error of example functions is below 8.7%.

  • FOREWORD Open Access

    Yasuhiko Nakashima  

     
    FOREWORD

      Vol:
    E99-D No:12
      Page(s):
    2858-2859
  • Selective Check of Data-Path for Effective Fault Tolerance

    Tanvir AHMED  Jun YAO  Yuko HARA-AZUMI  Shigeru YAMASHITA  Yasuhiko NAKASHIMA  

     
    PAPER-Design Methodology

      Vol:
    E96-D No:8
      Page(s):
    1592-1601

    Nowadays, fault tolerance has been playing a progressively important role in covering increasing soft/hard error rates in electronic devices that accompany the advances of process technologies. Research shows that wear-out faults have a gradual onset, starting with a timing fault and then eventually leading to a permanent fault. Error detection is thus a required function to maintain execution correctness. Currently, however, many highly dependable methods to cover permanent faults are commonly over-designed by using very frequent checking, due to lack of awareness of the fault possibility in circuits used for the pending executions. In this research, to address the over-checking problem, we introduce a metric for permanent defects, as operation defective probability (ODP), to quantitatively instruct the check operations being placed only at critical positions. By using this selective checking approach, we can achieve a near-100% dependability by having about 53% less check operations, as compared to the ideal reliable method, which performs exhaustive checks to guarantee a zero-error propagation. By this means, we are able to reduce 21.7% power consumption by avoiding the non-critical checking inside the over-designed approach.

  • Log-Likelihood Ratio Calculation Using 3-Bit Soft-Decision for Error Correction in Visible Light Communication Systems

    Dinh-Dung LE  Duc-Phuc NGUYEN  Thi-Hong TRAN  Yasuhiko NAKASHIMA  

     
    LETTER-Communication Theory and Signals

      Vol:
    E101-A No:12
      Page(s):
    2210-2212

    Forward Error Correction (FEC) schemes have played an important role in intensity-modulation direct-detection (IM/DD) Visible Light Communication (VLC) systems. While hard-decision FEC schemes are inferior to soft-decision FEC codes in terms of decoding performance, they are widely used in these VLC systems because receivers are only capable of recognizing logical values 0 and 1. In this letter, we propose a method to calculate the log-likelihood ratios (LLR) values which are used as input of soft-decision FEC decoders. Simulation results show that Polar decoder using proposed method performs better than that of using the hard-decision technique.

  • ReVolver/C40: A Scalable Parallel Computer for Volume Rendering--Design and Implementation--

    Shin-ichiro MORI  Tomoaki TSUMURA  Masahiro GOSHIMA  Yasuhiko NAKASHIMA  Hiroshi NAKASHIMA  Shinji TOMITA  

     
    PAPER

      Vol:
    E86-D No:10
      Page(s):
    2006-2015

    This paper describes the architecture of ReVolver/C40 a scalable parallel machine for volume rendering and its prototype implementation. The most important feature of ReVolver/C40 is view-independent real time rendering of translucent 3D object by using perspective projection. In order to realize this feature, the authors propose a parallel volume memory architecture based on the principal axis oriented sampling method and parallel treble volume memory. This paper also discusses the implementation issues of ReVolver/C40 where various kinds of parallelism extracted to achieve high-perfromance rendering are explained. The prototype systems had been developed and their performance evaluation results are explained. As the results of the evaluation of the prototype systems, ReVolver/C40 with 32 parallel volume memory is estimated to achieve more than 10 frame per second for 2563 volume data on 2562 screen by using perspective projection. The authors also review the development of ReVolver/C40 from several view points.

  • Performance Optimization of Light-Field Applications on GPU

    Yuttakon YUTTAKONKIT  Shinya TAKAMAEDA-YAMAZAKI  Yasuhiko NAKASHIMA  

     
    PAPER-Computer System

      Pubricized:
    2016/08/24
      Vol:
    E99-D No:12
      Page(s):
    3072-3081

    Light-field image processing has been widely employed in many areas, from mobile devices to manufacturing applications. The fundamental process to extract the usable information requires significant computation with high-resolution raw image data. A graphics processing unit (GPU) is used to exploit the data parallelism as in general image processing applications. However, the sparse memory access pattern of the applications reduced the performance of GPU devices for both systematic and algorithmic reasons. Thus, we propose an optimization technique which redesigns the memory access pattern of the applications to alleviate the memory bottleneck of rendering application and to increase the data reusability for depth extraction application. We evaluated our optimized implementations with the state-of-the-art algorithm implementations on several GPUs where all implementations were optimally configured for each specific device. Our proposed optimization increased the performance of rendering application on GTX-780 GPU by 30% and depth extraction application on GTX-780 and GTX-980 GPUs by 82% and 18%, respectively, compared with the original implementations.

  • GPGPU Implementation of Variational Bayesian Gaussian Mixture Models

    Hiroki NISHIMOTO  Renyuan ZHANG  Yasuhiko NAKASHIMA  

     
    PAPER-Fundamentals of Information Systems

      Pubricized:
    2021/11/24
      Vol:
    E105-D No:3
      Page(s):
    611-622

    The efficient implementation strategy for speeding up high-quality clustering algorithms is developed on the basis of general purpose graphic processing units (GPGPUs) in this work. Among various clustering algorithms, a sophisticated Gaussian mixture model (GMM) by estimating parameters through variational Bayesian (VB) mechanism is conducted due to its superior performances. Since the VB-GMM methodology is computation-hungry, the GPGPU is employed to carry out massive matrix-computations. To efficiently migrate the conventional CPU-oriented schemes of VB-GMM onto GPGPU platforms, an entire migration-flow with thirteen stages is presented in detail. The CPU-GPGPU co-operation scheme, execution re-order, and memory access optimization are proposed for optimizing the GPGPU utilization and maximizing the clustering speed. Five types of real-world applications along with relevant data-sets are introduced for the cross-validation. From the experimental results, the feasibility of implementing VB-GMM algorithm by GPGPU is verified with practical benefits. The proposed GPGPU migration achieves 192x speedup in maximum. Furthermore, it succeeded in identifying the proper number of clusters, which is hardly conducted by the EM-algotihm.

  • Flexible Bayesian Inference by Weight Transfer for Robust Deep Neural Networks

    Thi Thu Thao KHONG  Takashi NAKADA  Yasuhiko NAKASHIMA  

     
    PAPER-Image Recognition, Computer Vision

      Pubricized:
    2021/07/28
      Vol:
    E104-D No:11
      Page(s):
    1981-1991

    Adversarial attacks are viewed as a danger to Deep Neural Networks (DNNs), which reveal a weakness of deep learning models in security-critical applications. Recent findings have been presented adversarial training as an outstanding defense method against adversaries. Nonetheless, adversarial training is a challenge with respect to big datasets and large networks. It is believed that, unless making DNN architectures larger, DNNs would be hard to strengthen the robustness to adversarial examples. In order to avoid iteratively adversarial training, our algorithm is Bayes without Bayesian Learning (BwoBL) that performs the ensemble inference to improve the robustness. As an application of transfer learning, we use learned parameters of pretrained DNNs to build Bayesian Neural Networks (BNNs) and focus on Bayesian inference without costing Bayesian learning. In comparison with no adversarial training, our method is more robust than activation functions designed to enhance adversarial robustness. Moreover, BwoBL can easily integrate into any pretrained DNN, not only Convolutional Neural Networks (CNNs) but also other DNNs, such as Self-Attention Networks (SANs) that outperform convolutional counterparts. BwoBL is also convenient to apply to scaling networks, e.g., ResNet and EfficientNet, with better performance. Especially, our algorithm employs a variety of DNN architectures to construct BNNs against a diversity of adversarial attacks on a large-scale dataset. In particular, under l∞ norm PGD attack of pixel perturbation ε=4/255 with 100 iterations on ImageNet, our proposal in ResNets, SANs, and EfficientNets increase by 58.18% top-5 accuracy on average, which are combined with naturally pretrained ResNets, SANs, and EfficientNets. This enhancement is 62.26% on average below l2 norm C&W attack. The combination of our proposed method with pretrained EfficientNets on both natural and adversarial images (EfficientNet-ADV) drastically boosts the robustness resisting PGD and C&W attacks without additional training. Our EfficientNet-ADV-B7 achieves the cutting-edge top-5 accuracy, which is 92.14% and 94.20% on adversarial ImageNet generated by powerful PGD and C&W attacks, respectively.

  • RazorProtector: Maintaining Razor DVS Efficiency in Large IR-Drop Zones by an Adaptive Redundant Data-Path

    Yukihiro SASAGAWA  Jun YAO  Takashi NAKADA  Yasuhiko NAKASHIMA  

     
    PAPER-Logic Synthesis, Test and Verification

      Vol:
    E95-A No:12
      Page(s):
    2319-2329

    Recently, the DVS (Dynamic Voltage Scaling) method has been aggressively applied to processors with Razor Flip-Flops. With Razor FF detecting setup errors, the supply voltage in these processors is down-scaled to a near critical setup timing level for a maximum power consumption reduction. However, the conventional Razor and DVS combinations cannot tolerate well error rate variations caused by IR-drops and environment changes. At the near critical setup timing point, even a small error rate change will result in sharp performance degradation. In this paper, we propose RazorProtector, a DVS application method based on a redundant data-path which uses a multi-cycle redundant calculation to shorten the recovery penalty after a setup error occurrence. A dynamic redundancy-adapting scheme is also given to use effectively the designed redundant data-path based on a study of the program, device and error rate characteristics. Our results show that RazorProtector with the adaptive redundancy architecture can, compared to the traditional DVS method with Razor FF, under a large setup rate caused by a 10% unwanted voltage drop, reduce EDP up to 78% at 100 µs/V, 88% at 200 µs/V voltage scaling slope.

  • SLIT: An Energy-Efficient Reconfigurable Hardware Architecture for Deep Convolutional Neural Networks Open Access

    Thi Diem TRAN  Yasuhiko NAKASHIMA  

     
    PAPER

      Pubricized:
    2020/12/18
      Vol:
    E104-C No:7
      Page(s):
    319-329

    Convolutional neural networks (CNNs) have dominated a range of applications, from advanced manufacturing to autonomous cars. For energy cost-efficiency, developing low-power hardware for CNNs is a research trend. Due to the large input size, the first few convolutional layers generally consume most latency and hardware resources on hardware design. To address these challenges, this paper proposes an innovative architecture named SLIT to extract feature maps and reconstruct the first few layers on CNNs. In this reconstruction approach, total multiply-accumulate operations are eliminated on the first layers. We evaluate new topology with MNIST, CIFAR, SVHN, and ImageNet datasets on image classification application. Latency and hardware resources of the inference step are evaluated on the chip ZC7Z020-1CLG484C FPGA with Lenet-5 and VGG schemes. On the Lenet-5 scheme, our architecture reduces 39% of latency and 70% of hardware resources with a 0.456 W power consumption compared to previous works. Even though the VGG models perform with a 10% reduction in hardware resources and latency, we hope our overall results will potentially give a new impetus for future studies to reach a higher optimization on hardware design. Notably, the SLIT architecture efficiently merges with most popular CNNs at a slightly sacrificing accuracy of a factor of 0.27% on MNIST, ranging from 0.5% to 1.5% on CIFAR, approximately 2.2% on ImageNet, and remaining the same on SVHN databases.

  • Daisy-Chained Systolic Array and Reconfigurable Memory Space for Narrow Memory Bandwidth

    Jun IWAMOTO  Yuma KIKUTANI  Renyuan ZHANG  Yasuhiko NAKASHIMA  

     
    PAPER-Computer System

      Pubricized:
    2019/12/06
      Vol:
    E103-D No:3
      Page(s):
    578-589

    A paradigm shift toward edge computing infrastructures that prioritize small footprint and scalable/easy-to-estimate performance is increasing. In this paper, we propose the following to improve the footprint and the scalability of systolic arrays: (1) column multithreading for reducing the number of physical units and maintaining the performance even for back-to-back floating-point accumulations; (2) a cascaded peer-to-peer AXI bus for a scalable multichip structure and an intra-chip parallel local memory bus for low latency; (3) multilevel loop control in any unit for reducing the startup overhead and adaptive operation shifting for efficient reuse of local memories. We designed a systolic array with a single column × 64 row configuration with Verilog HDL, evaluated the frequency and the performance on an FPGA attached to a ZYNQ system as an AXI slave device, and evaluated the area with a TSMC 28nm library and memory generator and identified the following: (1) the execution speed of a matrix multiplication/a convolution operation/a light-field depth extraction, whose size larger than the capacity of the local memory, is 6.3× / 9.2× / 6.6× compared with a similar systolic array (EMAX); (2) the estimated speed with a 4-chip configuration is 19.6× / 16.0× / 8.5×; (3) the size of a single-chip is 8.4 mm2 (0.31× of EMAX) and the basic performance per area is 2.4×.

  • Understanding Variations for Better Adjusting Parallel Supplemental Redundant Executions to Tolerate Timing Faults

    Yukihiro SASAGAWA  Jun YAO  Yasuhiko NAKASHIMA  

     
    PAPER-Architecture

      Vol:
    E97-D No:12
      Page(s):
    3083-3091

    Razor Flip-Flop (FF) is a good combination for the dynamic voltage scaling (DVS) technique to achieve high energy efficiency. We previously proposed a RazorProtector scheme, which uses, under a very high IR-drop zone, a redundant data-path to provide a very fast recovery for a Razor-FF based processor. In this paper, we propose a dynamic method to adjust the redundancy level to fine-grained fit both the program behaviors and processor manufacturing variations so as to achieve an optimal power saving. We design an online turning method to adjust the redundancy level according to the most related parameters, ILP (Instruction Level Parallelism) and DCF (Delay Criticality Factor). Our simulation results show that under a workload suite with different behaviors, the adaptive redundancy can achieve better Energy Delay Product (EDP) reduction than any static controls. Compared to the traditional application of Razor-FF and DVS, our proposed dynamic control achieves an EDP reduction of 56% in average for the workloads we studied.

  • Quantum Walks on the Line with Phase Parameters

    Marcos VILLAGRA  Masaki NAKANISHI  Shigeru YAMASHITA  Yasuhiko NAKASHIMA  

     
    PAPER

      Vol:
    E95-D No:3
      Page(s):
    722-730

    In this paper, a study on discrete-time coined quantum walks on the line is presented. Clear mathematical foundations are still lacking for this quantum walk model. As a step toward this objective, the following question is being addressed: Given a graph, what is the probability that a quantum walk arrives at a given vertex after some number of steps? This is a very natural question, and for random walks it can be answered by several different combinatorial arguments. For quantum walks this is a highly non-trivial task. Furthermore, this was only achieved before for one specific coin operator (Hadamard operator) for walks on the line. Even considering only walks on lines, generalizing these computations to a general SU(2) coin operator is a complex task. The main contribution is a closed-form formula for the amplitudes of the state of the walk (which includes the question above) for a general symmetric SU(2) operator for walks on the line. To this end, a coin operator with parameters that alters the phase of the state of the walk is defined. Then, closed-form solutions are computed by means of Fourier analysis and asymptotic approximation methods. We also present some basic properties of the walk which can be deducted using weak convergence theorems for quantum walks. In particular, the support of the induced probability distribution of the walk is calculated. Then, it is shown how changing the parameters in the coin operator affects the resulting probability distribution.

  • A Tree-Based Checkpointing Architecture for the Dependability of FPGA Computing

    Hoang-Gia VU  Shinya TAKAMAEDA-YAMAZAKI  Takashi NAKADA  Yasuhiko NAKASHIMA  

     
    PAPER-Device and Architecture

      Pubricized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    288-302

    Modern FPGAs have been integrated in computing systems as accelerators for long running applications. This integration puts more pressure on the fault tolerance of computing systems, and the requirement for dependability becomes essential. As in the case of CPU-based system, checkpoint/restart techniques are also expected to improve the dependability of FPGA-based computing. Three issues arise in this situation: how to checkpoint and restart FPGAs, how well this checkpoint/restart model works with the checkpoint/restart model of the whole computing system, and how to build the model by a software tool. In this paper, we first present a new checkpoint/restart architecture along with a checkpointing mechanism on FPGAs. We then propose a method to capture consistent snapshots of FPGA and the rest of the computing system. Third, we provide “fine-grained” management for checkpointing to reduce performance degradation. For the host CPU, we also provide a stack which includes API functions to manage checkpoint/restart procedures on FPGAs. Fourth, we present a Python-based tool to insert checkpointing infrastructure. Experimental results show that the checkpointing architecture causes less than 10% maximum clock frequency degradation, low checkpointing latencies, small memory footprints, and small increases in power consumption, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).

  • An Instruction Mapping Scheme for FU Array Accelerator

    Kazuhiro YOSHIMURA  Takuya IWAKAMI  Takashi NAKADA  Jun YAO  Hajime SHIMADA  Yasuhiko NAKASHIMA  

     
    PAPER-Computer System

      Vol:
    E94-D No:2
      Page(s):
    286-297

    Recently, we have proposed using a Linear Array Pipeline Processor (LAPP) to improve energy efficiency for various workloads such as image processing and to maintain programmability by working on VLIW codes. In this paper, we proposed an instruction mapping scheme for LAPP to fully exploit the array execution of functional units (FUs) and bypass networks by a mapper to fit the VLIW codes onto the FUs. The mapping can be finished within multi-cycles during a data prefetch before the array execution of FUs. According to an HDL based implementation, the hardware required for mapping scheme is 84% of the cost introduced by a baseline method. In addition, the proposed mapper can further help to shrink the size of array stage, as our results show that their combination becomes 88% of the baseline model in area.

  • A Feasibility Study of Multi-Domain Stochastic Computing Circuit Open Access

    Tati ERLINA  Renyuan ZHANG  Yasuhiko NAKASHIMA  

     
    PAPER-Integrated Electronics

      Pubricized:
    2020/10/29
      Vol:
    E104-C No:5
      Page(s):
    153-163

    An efficient approximate computing circuit is developed for polynomial functions through the hybrid of analog and stochastic domains. Different from the ordinary time-based stochastic computing (TBSC), the proposed circuit exploits not only the duty cycle of pulses but also the pulse strength of the analog current to carry information for multiplications. The accumulation of many multiplications is performed by merely collecting the stochastic-current. As the calculation depth increases, the growth of latency (while summations), signal power weakening, and disparity of output signals (while multiplications) are substantially avoidable in contrast to that in the conventional TBSC. Furthermore, the calculation range spreads to bipolar infinite without scaling, theoretically. The proposed multi-domain stochastic computing (MDSC) is designed and simulated in a 0.18 µm CMOS technology by employing a set of current mirrors and an improved scheme of the TBSC circuit based on the Neuron-MOS mechanism. For proof-of-concept, the multiply and accumulate calculations (MACs) are implemented, achieving an average accuracy of 95.3%. More importantly, the transistor counting, power consumption, and latency decrease to 6.1%, 55.4%, and 4.2% of the state-of-art TBSC circuit, respectively. The robustness against temperature and process variations is also investigated and presented in detail.

  • Performance Evaluation of a 3D-Stencil Library for Distributed Memory Array Accelerators

    Yoshikazu INAGAKI  Shinya TAKAMAEDA-YAMAZAKI  Jun YAO  Yasuhiko NAKASHIMA  

     
    PAPER-Architecture

      Pubricized:
    2015/09/15
      Vol:
    E98-D No:12
      Page(s):
    2141-2149

    The Energy-aware Multi-mode Accelerator eXtension [24],[25] (EMAX) is equipped with distributed single-port local memories and ring-formed interconnections. The accelerator is designed to achieve extremely high throughput for scientific computations, big data, and image processing as well as low-power consumption. However, before mapping algorithms on the accelerator, application developers require sufficient knowledge of the hardware organization and specially designed instructions. They also need significant effort to tune the code for improving execution efficiency when no well-designed compiler or library is available. To address this problem, we focus on library support for stencil (nearest-neighbor) computations that represent a class of algorithms commonly used in many partial differential equation (PDE) solvers. In this research, we address the following topics: (1) system configuration, features, and mnemonics of EMAX; (2) instruction mapping techniques that reduce the amount of data to be read from the main memory; (3) performance evaluation of the library for PDE solvers. With the features of a library that can reuse the local data across the outer loop iterations and map many instructions by unrolling the outer loops, the amount of data to be read from the main memory is significantly reduced to a minimum of 1/7 compared with a hand-tuned code. In addition, the stencil library reduced the execution time 23% more than a general-purpose processor.

1-20hit(25hit)