IEICE global.ieice.org Site

Author Search Result

[Author] Renyuan ZHANG(5hit)

1-5hit

GPGPU Implementation of Variational Bayesian Gaussian Mixture Models
Hiroki NISHIMOTO Renyuan ZHANG Yasuhiko NAKASHIMA

PAPER-Fundamentals of Information Systems

Pubricized:
2021/11/24
Vol:
E105-D No:3
Page(s):
611-622
The efficient implementation strategy for speeding up high-quality clustering algorithms is developed on the basis of general purpose graphic processing units (GPGPUs) in this work. Among various clustering algorithms, a sophisticated Gaussian mixture model (GMM) by estimating parameters through variational Bayesian (VB) mechanism is conducted due to its superior performances. Since the VB-GMM methodology is computation-hungry, the GPGPU is employed to carry out massive matrix-computations. To efficiently migrate the conventional CPU-oriented schemes of VB-GMM onto GPGPU platforms, an entire migration-flow with thirteen stages is presented in detail. The CPU-GPGPU co-operation scheme, execution re-order, and memory access optimization are proposed for optimizing the GPGPU utilization and maximizing the clustering speed. Five types of real-world applications along with relevant data-sets are introduced for the cross-validation. From the experimental results, the feasibility of implementing VB-GMM algorithm by GPGPU is verified with practical benefits. The proposed GPGPU migration achieves 192x speedup in maximum. Furthermore, it succeeded in identifying the proper number of clusters, which is hardly conducted by the EM-algotihm.
Daisy-Chained Systolic Array and Reconfigurable Memory Space for Narrow Memory Bandwidth
Jun IWAMOTO Yuma KIKUTANI Renyuan ZHANG Yasuhiko NAKASHIMA

PAPER-Computer System

Pubricized:
2019/12/06
Vol:
E103-D No:3
Page(s):
578-589
A paradigm shift toward edge computing infrastructures that prioritize small footprint and scalable/easy-to-estimate performance is increasing. In this paper, we propose the following to improve the footprint and the scalability of systolic arrays: (1) column multithreading for reducing the number of physical units and maintaining the performance even for back-to-back floating-point accumulations; (2) a cascaded peer-to-peer AXI bus for a scalable multichip structure and an intra-chip parallel local memory bus for low latency; (3) multilevel loop control in any unit for reducing the startup overhead and adaptive operation shifting for efficient reuse of local memories. We designed a systolic array with a single column × 64 row configuration with Verilog HDL, evaluated the frequency and the performance on an FPGA attached to a ZYNQ system as an AXI slave device, and evaluated the area with a TSMC 28nm library and memory generator and identified the following: (1) the execution speed of a matrix multiplication/a convolution operation/a light-field depth extraction, whose size larger than the capacity of the local memory, is 6.3× / 9.2× / 6.6× compared with a similar systolic array (EMAX); (2) the estimated speed with a 4-chip configuration is 19.6× / 16.0× / 8.5×; (3) the size of a single-chip is 8.4 mm2 (0.31× of EMAX) and the basic performance per area is 2.4×.
A Feasibility Study of Multi-Domain Stochastic Computing Circuit Open Access
Tati ERLINA Renyuan ZHANG Yasuhiko NAKASHIMA

PAPER-Integrated Electronics

Pubricized:
2020/10/29
Vol:
E104-C No:5
Page(s):
153-163
An efficient approximate computing circuit is developed for polynomial functions through the hybrid of analog and stochastic domains. Different from the ordinary time-based stochastic computing (TBSC), the proposed circuit exploits not only the duty cycle of pulses but also the pulse strength of the analog current to carry information for multiplications. The accumulation of many multiplications is performed by merely collecting the stochastic-current. As the calculation depth increases, the growth of latency (while summations), signal power weakening, and disparity of output signals (while multiplications) are substantially avoidable in contrast to that in the conventional TBSC. Furthermore, the calculation range spreads to bipolar infinite without scaling, theoretically. The proposed multi-domain stochastic computing (MDSC) is designed and simulated in a 0.18 µm CMOS technology by employing a set of current mirrors and an improved scheme of the TBSC circuit based on the Neuron-MOS mechanism. For proof-of-concept, the multiply and accumulate calculations (MACs) are implemented, achieving an average accuracy of 95.3%. More importantly, the transistor counting, power consumption, and latency decrease to 6.1%, 55.4%, and 4.2% of the state-of-art TBSC circuit, respectively. The robustness against temperature and process variations is also investigated and presented in detail.
A ReRAM-Based Row-Column-Oriented Memory Architecture for Convolutional Neural Networks
Yan CHEN Jing ZHANG Yuebing XU Yingjie ZHANG Renyuan ZHANG Yasuhiko NAKASHIMA

BRIEF PAPER

Vol:
E102-C No:7
Page(s):
580-584
An efficient resistive random access memory (ReRAM) structure is developed for accelerating convolutional neural network (CNN) powered by the in-memory computation. A novel ReRAM cell circuit is designed with two-directional (2-D) accessibility. The entire memory system is organized as a 2-D array, in which specific memory cells can be identically accessed by both of column- and row-locality. For the in-memory computations of CNNs, only relevant cells in an identical sub-array are accessed by 2-D read-out operations, which is hardly implemented by conventional ReRAM cells. In this manner, the redundant access (column or row) of the conventional ReRAM structures is prevented to eliminated the unnecessary data movement when CNNs are processed in-memory. From the simulation results, the energy and bandwidth efficiency of the proposed memory structure are 1.4x and 5x of a state-of-the-art ReRAM architecture, respectively.
Programmable Analog Calculation Unit with Two-Stage Architecture: A Solution of Efficient Vector-Computation Open Access
Renyuan ZHANG Takashi NAKADA Yasuhiko NAKASHIMA

PAPER

Vol:
E102-A No:7
Page(s):
878-885
A programmable analog calculation unit (ACU) is designed for vector computations in continuous-time with compact circuit scale. From our early study, it is feasible to retrieve arbitrary two-variable functions through support vector regression (SVR) in silicon. In this work, the dimensions of regression are expanded for vector computations. However, the hardware cost and computing error greatly increase along with the expansion of dimensions. A two-stage architecture is proposed to organize multiple ACUs for high dimensional regression. The computation of high dimensional vectors is separated into several computations of lower dimensional vectors, which are implemented by the free combination of several ACUs with lower cost. In this manner, the circuit scale and regression error are reduced. The proof-of-concept ACU is designed and simulated in a 0.18μm technology. From the circuit simulation results, all the demonstrated calculations with nine operands are executed without iterative clock cycles by 4960 transistors. The calculation error of example functions is below 8.7%.

Author Search Result

[Author] Renyuan ZHANG(5hit)

GPGPU Implementation of Variational Bayesian Gaussian Mixture Models

Daisy-Chained Systolic Array and Reconfigurable Memory Space for Narrow Memory Bandwidth

A Feasibility Study of Multi-Domain Stochastic Computing Circuit Open Access

A ReRAM-Based Row-Column-Oriented Memory Architecture for Convolutional Neural Networks

Programmable Analog Calculation Unit with Two-Stage Architecture: A Solution of Efficient Vector-Computation Open Access

Latest Issue

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles