Author Search Result

[Author] Masato MOTOMURA(9hit)

1-9hit
  • Through Chip Interface Based Three-Dimensional FPGA Architecture Exploration

    Li-Chung HSU  Masato MOTOMURA  Yasuhiro TAKE  Tadahiro KURODA  

     
    PAPER

      Vol:
    E98-C No:4
      Page(s):
    288-297

    This paper presents work on integrating a wireless 3-D interconnection interface, namely the ThruChip Interface (TCI), into a three-dimensional field-programmable gate array (3-D FPGA) exploration tool (TPR). TCI is an emerging 3-D IC integration solution because of its advantages in cost, flexibility, reliability, comparable performance, and energy dissipation in comparison to through-silicon-via (TSV). Since the communication bandwidth of TCI is much higher than that of FPGA internal logic signals, a time-division multiplexing (TDM) scheme is adopted to fully utilize its bandwidth. The experimental results show path delay reductions of 25% on average and 58% at maximum over 2-D FPGA when five layers are used in the TCI-based 3-D FPGA architecture. Although the performance of the TCI-based 3-D FPGA architecture is 8% below that of the TSV-based 3-D FPGA on average, the TCI-based architecture can reduce the active area consumed by vertical communication channels by 42% on average in comparison to the TSV-based architecture, and hence leads to a better delay-area product.

  • A Hybrid Integer Encoding Method for Obtaining High-Quality Solutions of Quadratic Knapsack Problems on Solid-State Annealers

    Satoru JIMBO  Daiki OKONOGI  Kota ANDO  Thiem Van CHU  Jaehoon YU  Masato MOTOMURA  Kazushi KAWAMURA  

     
    PAPER

      Publicized:
    2022/05/26
      Vol:
    E105-D No:12
      Page(s):
    2019-2031

    For formulating Quadratic Knapsack Problems (QKPs) into the form of Quadratic Unconstrained Binary Optimization (QUBO), it is necessary to introduce an integer variable, which converts and incorporates the knapsack capacity constraint into the overall energy function. In QUBO, this integer variable is encoded with auxiliary binary variables, and the encoding method used for it affects the behavior of Simulated Annealing (SA) significantly. For improving the efficiency of SA for QKP instances, this paper first visualized and analyzed their annealing processes encoded by conventional binary and unary encoding methods. Based on this analysis, we proposed a novel hybrid encoding (HE), getting the best of both worlds. The proposed HE obtained feasible solutions in the evaluation, outperforming the others in small- and medium-scale models.
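
    The two conventional encodings mentioned in this abstract can be illustrated with a minimal sketch. It assumes a common textbook form of each: a binary encoding whose last coefficient is trimmed so the maximum value equals the capacity, and a unary encoding with one unit-weight variable per capacity unit. The function names are hypothetical, and the paper's hybrid encoding is not shown because the abstract does not specify it.

```python
from itertools import product


def binary_coefficients(capacity):
    """Coefficients 1, 2, 4, ... with the last one trimmed so they sum to capacity.

    Assumption: this bounded-coefficient form is a common textbook variant,
    not necessarily the one used in the paper.
    """
    coeffs = []
    power = 1
    remaining = capacity
    while remaining > 0:
        c = min(power, remaining)
        coeffs.append(c)
        remaining -= c
        power *= 2
    return coeffs


def unary_coefficients(capacity):
    """One auxiliary binary variable of weight 1 per unit of capacity."""
    return [1] * capacity


def decode(coeffs, bits):
    """Integer value represented by an assignment of the auxiliary bits."""
    return sum(c * b for c, b in zip(coeffs, bits))


def reachable_values(coeffs):
    """All integers that some bit assignment can represent."""
    return {decode(coeffs, bits) for bits in product((0, 1), repeat=len(coeffs))}


capacity = 10
binary = binary_coefficients(capacity)
unary = unary_coefficients(capacity)
```

    Under these assumptions, both encodings cover every integer from 0 to the capacity, but the binary form needs far fewer auxiliary variables, while in the unary form every single-bit flip changes the encoded value by exactly one.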

  • Dither NN: Hardware/Algorithm Co-Design for Accurate Quantized Neural Networks

    Kota ANDO  Kodai UEYOSHI  Yuka OBA  Kazutoshi HIROSE  Ryota UEMATSU  Takumi KUDO  Masayuki IKEBE  Tetsuya ASAI  Shinya TAKAMAEDA-YAMAZAKI  Masato MOTOMURA  

     
    PAPER-Computer System

      Publicized:
    2019/07/22
      Vol:
    E102-D No:12
      Page(s):
    2341-2353

    Deep neural networks (NNs) have been widely accepted for enabling various AI applications; however, limited computational and memory resources are a major problem on mobile devices. A quantized NN with reduced bit precision is an effective solution that relaxes the resource requirements, but the accuracy degradation due to its numerical approximation is another problem. We propose a novel quantized NN model employing the “dithering” technique to improve accuracy with minimal additional hardware, from the viewpoint of hardware-algorithm co-design. Dithering distributes the quantization error occurring at each pixel (neuron) spatially so that the total information loss of the plane is minimized. Experiments using software-based accuracy evaluation and FPGA-based hardware resource estimation demonstrated the effectiveness and efficiency of an NN model with dithering.
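
    The idea of spreading quantization error to neighbors can be sketched in one dimension. This is only an illustration of generic error diffusion, assuming inputs in [0, 1] and that each element passes its whole error to the next one; the function name is hypothetical, and the paper's actual dithering scheme and hardware are not described in the abstract.

```python
import math


def quantize_with_error_diffusion(values, levels=2):
    """Quantize values in [0, 1] to evenly spaced levels, carrying the error forward.

    Assumption: 1-D diffusion to the next element only; this is not the
    paper's method, just a generic error-diffusion sketch.
    """
    step = 1.0 / (levels - 1)
    out = []
    error = 0.0
    for v in values:
        target = v + error                      # add error inherited from the neighbor
        q = math.floor(target / step + 0.5) * step  # round to the nearest level
        q = min(1.0, max(0.0, q))               # clamp to the valid range
        out.append(q)
        error = target - q                      # pass the residual to the next element
    return out


def quantize_plain(values, levels=2):
    """Independent rounding of each value, with no error diffusion."""
    step = 1.0 / (levels - 1)
    return [min(1.0, max(0.0, math.floor(v / step + 0.5) * step)) for v in values]


inputs = [0.25] * 8
dithered = quantize_with_error_diffusion(inputs)
plain = quantize_plain(inputs)
```

    In this toy case, plain rounding maps every input to 0 and loses the signal entirely, whereas the diffused output preserves the sum of the inputs across the row.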

  • A Fully-Parallel Annealing Algorithm with Autonomous Pinning Effect Control for Various Combinatorial Optimization Problems

    Daiki OKONOGI  Satoru JIMBO  Kota ANDO  Thiem Van CHU  Jaehoon YU  Masato MOTOMURA  Kazushi KAWAMURA  

     
    PAPER

      Publicized:
    2023/09/19
      Vol:
    E106-D No:12
      Page(s):
    1969-1978

    Annealing computation has recently attracted attention as it can efficiently solve combinatorial optimization problems using an Ising spin-glass model. Stochastic cellular automata annealing (SCA) is a promising algorithm that can realize fast spin-update by utilizing its parallel computing capability. However, in SCA, pinning effect control to suppress the spin-flip probability is essential, which, depending on the problem, makes escaping from local minima more difficult than in serial spin-update algorithms. This paper proposes a novel approach called APC-SCA (Autonomous Pinning effect Control SCA), where the pinning effect can be controlled autonomously by focusing on individual spin-flip. The evaluation results using max-cut, N-queen, and traveling salesman problems demonstrate that APC-SCA can obtain better solutions than the original SCA that uses pinning effect control pre-optimized by a grid search. Especially in solving traveling salesman problems, we confirm that the tour distance obtained by APC-SCA is up to 56.3% closer to the best-known compared to the conventional approach.

  • FOREWORD Open Access

    Masato MOTOMURA  

     
    FOREWORD

      Vol:
    E102-D No:5
      Page(s):
    1002-1002

  • FPGA-Based Annealing Processor with Time-Division Multiplexing

    Kasho YAMAMOTO  Masayuki IKEBE  Tetsuya ASAI  Masato MOTOMURA  Shinya TAKAMAEDA-YAMAZAKI  

     
    PAPER-Computer System

      Publicized:
    2019/09/20
      Vol:
    E102-D No:12
      Page(s):
    2295-2305

    An annealing processor based on the Ising model is a remarkable candidate for combinatorial optimization problems and it is superior to general von Neumann computers. CMOS-based implementations of the annealing processor are efficient and feasible based on current semiconductor technology. However, critical problems with annealing processors remain. Due to hardware constraints, the number of simulated spins is small and the implementable graph topology is inflexible. A prior approach to overcoming these problems is to emulate a complicated graph on a simple and high-density spin array with so-called minor embedding, a spin duplication method based on graph theory. When a complicated graph is embedded on such hardware, numerous spins are consumed to represent high-degree spins by combining multiple low-degree spins. Besides consuming more spins, this approach degrades the quality of solutions because of dummy strong connections between the duplicated spins. Thus, the approach cannot handle large-scale practical problems. This paper proposes a flexible and scalable hardware architecture with time-division multiplexing for massive spins and high-degree topologies. A target graph is separated and mapped onto multiple virtual planes, and each plane is subject to interleaved simulation with time-division processing. Therefore, the behavior of high-degree spins is efficiently emulated over time, so that no dummy strong connections are required, and the solution quality is accordingly improved. We implemented a prototype hardware design for FPGAs, and we evaluated the proposed method in a software-based annealing processor simulator. The results indicate that the method increased the number of spins that can be deployed. In addition, our time-division multiplexing architecture improved the solution quality and convergence time with reasonable resource consumption.

  • Cache-Processor Coupling: A Fast and Wide On-Chip Data Cache Design

    Masato MOTOMURA  Toshiaki INOUE  Hachiro YAMADA  Akihiko KONAGAYA  

     
    PAPER

      Vol:
    E78-C No:6
      Page(s):
    623-630

    This paper presents a new data cache design, cache-processor coupling, which tightly binds an on-chip data cache with a microprocessor. Parallel architectures and high-speed circuit techniques are developed to speed up the address handling process associated with accessing the data cache. These architectures and circuit techniques reduce the address handling time by 51%. In addition, newly proposed instructions increase data cache bandwidth by eight times. Excessive power consumption due to the wide-bandwidth data transfer is carefully avoided by newly developed circuit techniques, which reduce dissipation power per bit to 1/26. Simulation study of the proposed architecture and circuit techniques yields a 1.8 ns delay each for address handling, cache access, and register access for a 16 kilobyte direct mapped cache with a 0.4 µm CMOS design rule.

  • Design of 1024-I/Os 3.84 GB/s High Bandwidth 600 mW Low Power 16 Mb DRAM Macros for Parallel Image Processing RAM

    Yoshiharu AIMOTO  Tohru KIMURA  Yoshikazu YABE  Hideki HEIUCHI  Youetsu NAKAZAWA  Masato MOTOMURA  Takuya KOGA  Yoshihiro FUJITA  Masayuki HAMADA  Takaho TANIGAWA  Hajime NOBUSAWA  Kuniaki KOYAMA  

     
    PAPER

      Vol:
    E81-C No:5
      Page(s):
    759-767

    We have developed a parallel image processing RAM (PIP-RAM) which integrates a 16-Mb DRAM and 128 processor elements (PEs) by means of 0.38-µm CMOS 64-Mb DRAM process technology. It achieves 7.68-GIPS processing performance and 3.84-GB/s memory bandwidth with only 1-W power dissipation (@ 30-MHz), and the key to this performance is the DRAM design. This paper presents the key circuit techniques employed in the DRAM design: 1) a paged-segmentation accessing scheme that reduces sense amplifier power dissipation, and 2) a clocked low-voltage-swing differential-charge-transfer scheme that reduces data line power dissipation with the help of a multi-phase synchronization DRAM control scheme. These techniques have general importance for the design of LSIs in which DRAMs and logic are tightly integrated on single chips.
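
    The bandwidth figure quoted for this entry is consistent with the I/O count in the title and the 30-MHz clock in the abstract. The short check below assumes that each of the 1024 I/Os transfers one bit per clock cycle and that GB means 10^9 bytes; both are assumptions, since the abstract does not state them.

```python
# Figures taken from the title and abstract of this entry.
io_width_bits = 1024   # "1024-I/Os"
clock_hz = 30e6        # "@ 30-MHz"

# Assumption: one bit per I/O per clock cycle, and 1 GB = 1e9 bytes.
bandwidth_bytes_per_s = io_width_bits * clock_hz / 8
bandwidth_gb_per_s = bandwidth_bytes_per_s / 1e9

print(f"{bandwidth_gb_per_s:.2f} GB/s")
```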

  • A Hierarchical Cost Estimation Technique for High Level Synthesis

    Mahmoud MERIBOUT  Masato MOTOMURA  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E86-A No:2
      Page(s):
    444-461

    The aim of this paper is to present a new cost estimation technique for synthesizing hardware from a high-level circuit description. The scheduling and allocation processes are performed in an alternating manner, using realistic cost measurement models that account for Functional Units (FUs), registers, and multiplexers. This is an improvement over previous works, most of which use very simple cost models that primarily focus on FU resources alone. Such models, however, are not accurate enough to allow effective design space exploration, since the effects of storage and interconnect resources can indeed dominate the cost function. We tested our technique on several high-level synthesis benchmarks. The results indicate that the tool can generate near-optimal bus-based and multiplexer-based architectural models with fewer registers and buses, while maintaining high throughput.