IEICE global.ieice.org Site

Author Search Result

[Author] Masato MOTOMURA(9hit)

1-9hit

Through Chip Interface Based Three-Dimensional FPGA Architecture Exploration
Li-Chung HSU Masato MOTOMURA Yasuhiro TAKE Tadahiro KURODA

PAPER

Vol:
E98-C No:4
Page(s):
288-297
This paper presents work on integrating wireless 3-D interconnection interface, namely ThruChip Interface (TCI), in three-dimensional field-programmable gate array (3-D FPGA) exploration tool (TPR). TCI is an emerging 3-D IC integration solution because of its advantages over cost, flexibility, reliability, comparable performance, and energy dissipation in comparison to through-silicon-via (TSV). Since the communication bandwidth of TCI is much higher than FPGA internal logic signals, in order to fully utilize its bandwidth, the time-division multiplexing (TDM) scheme is adopted. The experimental results show 25% on average and 58% at maximum path delay reduction over 2-D FPGA when five layers are used in TCI based 3-D FPGA architecture. Although the performance of TCI based 3-D FPGA architecture is 8% below that of TSV based 3-D FPGA on average, TCI based architecture can reduce active area consumed by vertical communication channels by 42% on average in comparison to TSV based architecture and hence leads to better delay and area product.
A Hybrid Integer Encoding Method for Obtaining High-Quality Solutions of Quadratic Knapsack Problems on Solid-State Annealers
Satoru JIMBO Daiki OKONOGI Kota ANDO Thiem Van CHU Jaehoon YU Masato MOTOMURA Kazushi KAWAMURA

PAPER

Pubricized:
2022/05/26
Vol:
E105-D No:12
Page(s):
2019-2031
For formulating Quadratic Knapsack Problems (QKPs) into the form of Quadratic Unconstrained Binary Optimization (QUBO), it is necessary to introduce an integer variable, which converts and incorporates the knapsack capacity constraint into the overall energy function. In QUBO, this integer variable is encoded with auxiliary binary variables, and the encoding method used for it affects the behavior of Simulated Annealing (SA) significantly. For improving the efficiency of SA for QKP instances, this paper first visualized and analyzed their annealing processes encoded by conventional binary and unary encoding methods. Based on this analysis, we proposed a novel hybrid encoding (HE), getting the best of both worlds. The proposed HE obtained feasible solutions in the evaluation, outperforming the others in small- and medium-scale models.
Dither NN: Hardware/Algorithm Co-Design for Accurate Quantized Neural Networks
Kota ANDO Kodai UEYOSHI Yuka OBA Kazutoshi HIROSE Ryota UEMATSU Takumi KUDO Masayuki IKEBE Tetsuya ASAI Shinya TAKAMAEDA-YAMAZAKI Masato MOTOMURA

PAPER-Computer System

Pubricized:
2019/07/22
Vol:
E102-D No:12
Page(s):
2341-2353
Deep neural network (NN) has been widely accepted for enabling various AI applications, however, the limitation of computational and memory resources is a major problem on mobile devices. Quantized NN with a reduced bit precision is an effective solution, which relaxes the resource requirements, but the accuracy degradation due to its numerical approximation is another problem. We propose a novel quantized NN model employing the “dithering” technique to improve the accuracy with the minimal additional hardware requirement at the view point of the hardware-algorithm co-designing. Dithering distributes the quantization error occurring at each pixel (neuron) spatially so that the total information loss of the plane would be minimized. The experiment we conducted using the software-based accuracy evaluation and FPGA-based hardware resource estimation proved the effectiveness and efficiency of the concept of an NN model with dithering.
A Fully-Parallel Annealing Algorithm with Autonomous Pinning Effect Control for Various Combinatorial Optimization Problems
Daiki OKONOGI Satoru JIMBO Kota ANDO Thiem Van CHU Jaehoon YU Masato MOTOMURA Kazushi KAWAMURA

PAPER

Pubricized:
2023/09/19
Vol:
E106-D No:12
Page(s):
1969-1978
Annealing computation has recently attracted attention as it can efficiently solve combinatorial optimization problems using an Ising spin-glass model. Stochastic cellular automata annealing (SCA) is a promising algorithm that can realize fast spin-update by utilizing its parallel computing capability. However, in SCA, pinning effect control to suppress the spin-flip probability is essential, making escaping from local minima more difficult than serial spin-update algorithms, depending on the problem. This paper proposes a novel approach called APC-SCA (Autonomous Pinning effect Control SCA), where the pinning effect can be controlled autonomously by focusing on individual spin-flip. The evaluation results using max-cut, N-queen, and traveling salesman problems demonstrate that APC-SCA can obtain better solutions than the original SCA that uses pinning effect control pre-optimized by a grid search. Especially in solving traveling salesman problems, we confirm that the tour distance obtained by APC-SCA is up to 56.3% closer to the best-known compared to the conventional approach.
FOREWORD Open Access
Masato MOTOMURA

FOREWORD

Vol:
E102-D No:5
Page(s):
1002-1002
FPGA-Based Annealing Processor with Time-Division Multiplexing
Kasho YAMAMOTO Masayuki IKEBE Tetsuya ASAI Masato MOTOMURA Shinya TAKAMAEDA-YAMAZAKI

PAPER-Computer System

Pubricized:
2019/09/20
Vol:
E102-D No:12
Page(s):
2295-2305
An annealing processor based on the Ising model is a remarkable candidate for combinatorial optimization problems and it is superior to general von Neumann computers. CMOS-based implementations of the annealing processor are efficient and feasible based on current semiconductor technology. However, critical problems with annealing processors remain. There are few simulated spins and inflexibility in terms of implementable graph topology due to hardware constraints. A prior approach to overcoming these problems is to emulate a complicated graph on a simple and high-density spin array with so-called minor embedding, a spin duplication method based on graph theory. When a complicated graph is embedded on such hardware, numerous spins are consumed to represent high-degree spins by combining multiple low-degree spins. In addition to the number of spins, the quality of solutions decreases as a result of dummy strong connections between the duplicated spins. Thus, the approach cannot handle large-scale practical problems. This paper proposes a flexible and scalable hardware architecture with time-division multiplexing for massive spins and high-degree topologies. A target graph is separated and mapped onto multiple virtual planes, and each plane is subject to interleaved simulation with time-division processing. Therefore, the behavior of high-degree spins is efficiently emulated over time, so that no dummy strong connections are required, and the solution quality is accordingly improved. We implemented a prototype hardware design for FPGAs, and we evaluated the proposed method in a software-based annealing processor simulator. The results indicate that the method increased the spins that can be deployed. In addition, our time-division multiplexing architecture improved the solution quality and convergence time with reasonable resource consumption.
Cache-Processor Coupling: A Fast and Wide On-Chip Data Cache Design
Masato MOTOMURA Toshiaki INOUE Hachiro YAMADA Akihiko KONAGAYA

PAPER

Vol:
E78-C No:6
Page(s):
623-630
This paper presents a new data cache design, cache-processor coupling, which tightly binds an on-chip data cache with a microprocessor. Parallel architectures and high-speed circuit techniques are developed for speeding address handling process associated with accessing the data cache. The address handling time has been reduced by 51% by these architectures and circuit techniques. On the other hand, newly proposed instructions increase data cache bandwidth by eight times. Excessive power consumption due to the wide-bandwidth data transfer is carefully avoided by newly developed circuit techniques, which reduce dissipation power per bit to 1/26. Simulation study of the proposed architecture and circuit techniques yields a 1.8 ns delay each for address handling, cache access, and register access for a 16 kilobyte direct mapped cache with a 0.4 µm CMOS design rule.
Design of 1024-I/Os 3. 84 GB/s High Bandwidth 600 mW Low Power 16 Mb DRAM Macros for Parallel Image Processing RAM
Yoshiharu AIMOTO Tohru KIMURA Yoshikazu YABE Hideki HEIUCHI Youetsu NAKAZAWA Masato MOTOMURA Takuya KOGA Yoshihiro FUJITA Masayuki HAMADA Takaho TANIGAWA Hajime NOBUSAWA Kuniaki KOYAMA

PAPER

Vol:
E81-C No:5
Page(s):
759-767
We have developed a parallel image processing RAM (PIP-RAM) which integrates a 16-Mb DRAM and 128 processor elements (PEs) by means of 0. 38-µm CMOS 64-Mb DRAM process technology. It achieves 7. 68-GIPS processing performance and 3. 84-GB/s memory bandwidth with only 1-W power dissipation (@ 30-MHz), and the key to this performance is the DRAM design. This paper presents the key circuit techniques employed in the DRAM design: 1) a paged-segmentation accessing scheme that reduces sense amplifier power dissipation, and 2) a clocked low-voltage-swing differential-charge-transfer scheme that reduces data line power dissipation with the help of a multi-phase synchronization DRAM control scheme. These techniques have general importance for the design of LSIs in which DRAMs and logic are tightly integrated on single chips.
A Hierarchical Cost Estimation Technique for High Level Synthesis
Mahmoud MERIBOUT Masato MOTOMURA

PAPER-VLSI Design Technology and CAD

Vol:
E86-A No:2
Page(s):
444-461
The aim of this paper is to present a new cost estimation technique to synthesis hardware from high level circuit description. The scheduling and allocation processes are performed in alternative manner, while using realistic cost measurements models that account for Functional Unit (FU), registers, and multiplexers. This is an improvement over previous works, were most of them use very simple cost models that primarily focus on FU resources alone. These latest, however, are not accurate enough to allow effective design space exploration since the effects of storage and interconnect resources can indeed dominates the cost function. We tested our technique on several high-level synthesis benchmarks. The results indicate that the tool can generate near-optimal bus-based and multiplexer-based architectural models with lower number of registers and buses, while presenting high throughput.

Author Search Result

[Author] Masato MOTOMURA(9hit)

Through Chip Interface Based Three-Dimensional FPGA Architecture Exploration

A Hybrid Integer Encoding Method for Obtaining High-Quality Solutions of Quadratic Knapsack Problems on Solid-State Annealers

Dither NN: Hardware/Algorithm Co-Design for Accurate Quantized Neural Networks

A Fully-Parallel Annealing Algorithm with Autonomous Pinning Effect Control for Various Combinatorial Optimization Problems

FOREWORD Open Access

FPGA-Based Annealing Processor with Time-Division Multiplexing

Cache-Processor Coupling: A Fast and Wide On-Chip Data Cache Design

Design of 1024-I/Os 3. 84 GB/s High Bandwidth 600 mW Low Power 16 Mb DRAM Macros for Parallel Image Processing RAM

A Hierarchical Cost Estimation Technique for High Level Synthesis

Latest Issue

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles