Yukihiro NAKAMURA Kiyoshi OGURI Akira NAGOYA Mitsuteru YUKISHITA Ryo NOMURA
This paper describes the hierarchical behavioral description language called SFL and its processing system. This integrated CAD system, called PARTHENON, is used to design leading ASICs in the NTT Systems Labs. The paper therefore shows the effectiveness of PARTHENON as a practical high-level synthesis system through real design experience. SFL was developed to aid in the design of the hardware functions and behaviors of ASICs composed solely of clock-synchronized circuits. The main features of SFL are as follows: (1) It is not mixed with connection description, but employs only behavioral description (like procedural description in a programming language), and it provides hierarchical expression of behavioral description. (2) It permits the description of parallel processing operations by adopting a new hardware task concept. (3) It is linked with the behavioral simulator, logic synthesizer, and other components of the processing system. After describing SFL in some detail, the paper gives a brief explanation of its synthesizer and other processing components, along with its application results in the real design of some leading ASICs at the NTT Systems Laboratories.
Jun'ichiro TAKEMOTO Toshihiro GOTO Yuichiro SHIBATA Kiyoshi OGURI
In this paper, the efficient structure of an LUT (look-up table) for an asynchronous reconfigurable PCA (Plastic Cell Architecture) device is investigated. A total of 15 implementation alternatives for LUTs are evaluated and compared empirically by developing and simulating full custom layout designs. The evaluation results show that introducing transmission gates in the memory cells of an LUT improves read time by 14.3% at the cost of a 13.6% area increase compared to a conventional speed-oriented implementation. It is also shown that the use of transmission gates reduces area by 6.4% and read time by 19.2% compared to a conventional area-oriented LUT implementation.
Hideyuki ITO Kouichi NAGAMI Tsunemichi SHIOZAWA Kiyoshi OGURI Yukihiro NAKAMURA
We are working on an algorithm to optimize the logic circuits that can be realized on the super fine-grain parallel processing architecture. As a part of this work, we have developed an inverter reduction algorithm. This algorithm is based on modeling logic circuits as dynamical systems. We implement the algorithm in the PARTHENON system, which is the high level synthesis system developed in NTT's laboratories, and evaluate it using ISCAS85 benchmarks. We also compare the results with both the existing algorithm of PARTHENON and the algorithm of Jain and Bryant.
Theint Theint THU Jimpei HAMAMURA Rie SOEJIMA Yuichiro SHIBATA Kiyoshi OGURI
Field Programmable Gate Array (FPGA) based robust model fitting enjoys immense popularity in image processing because of its high efficiency. This paper focuses on the tradeoff analysis of real-time FPGA implementation of robust circle and ellipse estimations based on the random sample consensus (RANSAC) algorithm, which estimates parameters of a statistical model from a data set of feature points which contains outliers. In particular, this paper mainly highlights implementation alternatives for solvers of simultaneous equations and compares Gauss-Jordan elimination and Cramer's rule by changing matrix size and arithmetic processes. Experimental evaluation shows a Cramer's rule approach coupled with long integer arithmetic can reduce most hardware resources without unacceptable degradation of estimation accuracy compared to floating point versions.
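As a rough illustration of the Cramer's rule approach with integer arithmetic, the sketch below fits a circle through three integer feature points and counts inliers, as one RANSAC iteration might. It is not the paper's hardware design: the function names, the circle parameterization x^2 + y^2 + Dx + Ey + F = 0, and the tolerance on the squared-distance residual are all assumptions made for this example.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3x3 matrix of integers (exact, no rounding)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def circle_through(p1, p2, p3):
    """Solve D*x + E*y + F = -(x^2 + y^2) for three points with Cramer's rule.

    Returns (cx, cy, r_squared) as exact fractions, or None if the points
    are collinear (zero determinant).
    """
    pts = (p1, p2, p3)
    A = [[x, y, 1] for x, y in pts]
    rhs = [-(x * x + y * y) for x, y in pts]
    d = det3(A)
    if d == 0:
        return None

    def replaced(col):
        # Copy of A with one column swapped for the right-hand side.
        return [[rhs[r] if c == col else A[r][c] for c in range(3)]
                for r in range(3)]

    # All determinants are integers; only the final ratios are fractions.
    D, E, F = (Fraction(det3(replaced(k)), d) for k in range(3))
    cx, cy = -D / 2, -E / 2
    return cx, cy, cx * cx + cy * cy - F


def count_inliers(points, circle, tol):
    """Count points whose squared distance from the center is within tol of r^2."""
    cx, cy, r2 = circle
    return sum(1 for x, y in points
               if abs((x - cx) ** 2 + (y - cy) ** 2 - r2) <= tol)


if __name__ == "__main__":
    data = [(1, 0), (0, 1), (-1, 0), (0, -1), (5, 5)]  # last point is an outlier
    model = circle_through(data[0], data[1], data[2])
    print(model, count_inliers(data, model, Fraction(1, 10)))
```

Because the determinants are computed entirely in integers, the only divisions are the three final ratios, which is one reason a Cramer's rule formulation can pair naturally with long integer arithmetic.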
Norbert IMLIG Tsunemichi SHIOZAWA Ryusuke KONISHI Kiyoshi OGURI Kouichi NAGAMI Hideyuki ITO Minoru INAMORI Hiroshi NAKADA
This paper introduces a flexible, stream-oriented dataflow processing model based on the "Communicating Logic (CL)" framework. As the target architecture, we adopt the dual-layered "Plastic Cell Architecture (PCA)." Datapath processing functionality is encapsulated in asynchronous hardware objects with variable graining and implemented using look-up tables. Communication (i.e., connectivity and control) between the distributed processing objects is achieved by means of inter-object message passing. The key point of the CL approach is that it offers the merits of scalable performance and low-power hardware implementation together with the user-friendly compilation and linking capabilities unique to software.
Mitsuteru YUKISHITA Kiyoshi OGURI Tsukasa KAWAOKA
We developed a new test-synthesis method that operates on the basis of data transfer analysis at the language level. Using this method, an efficient scan path is inserted so that test data for the sequential circuit can be generated using only a test generation tool for the combinatorial circuit. We have applied this method successfully to the behavior, logic, and test design of a 32-bit, RISC-type processor. The size of the synthesized circuit without test synthesis is 23,407 gates; the size with test synthesis is 24,811 gates. This is an increase of only about 6%.
Kouichi NAGAMI Kiyoshi OGURI Tsunemichi SHIOZAWA Hideyuki ITO Ryusuke KONISHI
We propose an architectural reference of programmable devices that we call Plastic Cell Architecture (PCA). PCA is a reference for implementing a device with autonomous reconfigurability, which we also introduce in this paper. This reconfigurability is a further step toward new reconfigurable computing, which introduces variable- and programmable-grained parallelism to wired logic computing. This computing follows the Object-Oriented paradigm: it regards configured circuits as objects. These objects will be described in a new hardware description language dealing with the semantics of dynamic module instantiation. PCA is the fusion of SRAM-based FPGAs and cellular automata (CA), where the CA are dedicated to supporting run-time activities of objects. This paper mainly focuses on autonomous reconfigurability and PCA. The following discussions examine a research direction towards general-purpose reconfigurable computing.
Hideyuki ITO Ryusuke KONISHI Hiroshi NAKADA Kiyoshi OGURI Minoru INAMORI Akira NAGOYA
This paper describes the realization of a dynamically reconfigurable logic LSI based on a novel parallel computer architecture. The key point of the architecture is its dual-structured cell array, which enables dynamic and autonomous reconfiguration of the logic circuits. The LSI was completed by successfully introducing two specific features: fully asynchronous logic circuits and a homogeneous structure in which only LUTs are used.
Hiroshi SEKIGAWA Kiyoshi OGURI Ryo NOMURA Yukihiro NAKAMURA
In recent VLSI design of digital data paths, significantly more area is occupied by interconnect elements than by functional units and registers. Nevertheless, until recently most work in data path synthesis has concentrated on reducing the area of functional units and registers, without paying much attention to the interconnect area. Lately, research that addresses reducing the area of interconnect as well as of functional units and registers has been increasing, but most of the algorithms it uses for assigning interconnect elements are not efficient enough to optimize the interconnect area. In most current research, algorithms for interconnect element assignment are used to calculate the cost functions during the scheduling and/or allocation steps. This makes it impossible to use efficient optimization algorithms that may take a long time. This paper presents some new algorithms used to assign interconnect elements in data paths. The algorithms minimize the number of multiplexer inputs after the scheduling and operator/register allocations have been made. The algorithms have two characteristics. First, we use a branch and bound method for small problems. We confirmed that exact solutions can be obtained in practical time with this method for rather large problems, when the solutions are restricted to a one-level multiplexer model. Second, we use a certain heuristic method for larger problems. The algorithms have been implemented in C on an Apollo Domain Series 10000.
Keisuke DOHI Yuichiro SHIBATA Kiyoshi OGURI Takafumi FUJIMOTO
In this paper, we propose and discuss efficient GPU implementation techniques of absorbing boundary conditions (ABCs) for a 3D finite-difference time-domain (FDTD) electromagnetic field simulation for antenna design. In view of the architectural nature of GPUs, the idea of a periodic boundary condition is introduced into the implementation of perfectly matched layers (PMLs), as well as a transformation technique of PML equations for partial boundaries. We also present an efficient implementation method for a non-uniform grid. The evaluation results with a typical simulation model reveal that our proposed technique almost doubles the simulation performance and eventually achieves 55.8% of the peak memory bandwidth of a target GPU.
Keisuke DOHI Kazuhiro NEGI Yuichiro SHIBATA Kiyoshi OGURI
We present an external-memory-free, deeply pipelined FPGA implementation that includes HOG feature extraction and AdaBoost classification. To fit our design into a compact FPGA, we introduce some simplifications of the algorithm and make aggressive use of stream-oriented architectures. We present comparison results between our simplified fixed-point scheme and an original floating-point scheme in terms of quality of results, and the results suggest that the negative impact of the simplified scheme for hardware implementation is limited. We empirically show that our system is able to detect humans in 640×480 VGA images at up to 112 FPS on a Xilinx Virtex-5 XC5VLX50 FPGA.
Keisuke DOHI Koji OKINA Rie SOEJIMA Yuichiro SHIBATA Kiyoshi OGURI
In this paper, we discuss performance modeling of 3-D stencil computing on an FPGA accelerator with a high-level synthesis environment, aiming for efficient exploration of user-space design parameters. First, we analyze resource utilization and performance to formulate these relationships as mathematical models. Then, in order to evaluate our proposed models, we implement heat conduction simulations as a benchmark application, by using MaxCompiler, which is a high-level synthesis tool for FPGAs, and MaxGenFD, which is a domain specific framework of the MaxCompiler for finite-difference equation solvers. The experimental results with various settings of architectural design parameters show the best combination of design parameters for pipeline structure can be systematically found by using our models. The effects of changing arithmetic accuracy and using data stream compression are also discussed.
Taito MANABE Yuichiro SHIBATA Kiyoshi OGURI
Super-resolution technology is one of the solutions to fill the gap between high-resolution displays and lower-resolution images. There are various algorithms to interpolate the lost information, one of which uses a convolutional neural network (CNN). This paper presents an FPGA implementation and a performance evaluation of a novel CNN-based super-resolution system, which can process moving images in real time. We apply horizontal and vertical flips to input images instead of enlargement. This flip method prevents information loss and enables the network to make the best use of its patch size. In addition, we adopted the residue number system (RNS) in the network to reduce FPGA resource utilization. Efficient multiplication and addition with LUTs increased the network scale that can be implemented on the same FPGA by approximately 54% compared to an implementation with fixed-point operations. The proposed system can perform super-resolution from 960×540 to 1920×1080 at 60 fps with a latency of less than 1 ms. Despite the resource restrictions of the FPGA, the system can generate clear super-resolution images with smooth edges. The evaluation results also revealed superior quality in terms of the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index, compared to systems with other methods.
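To make the residue number system idea concrete, the following minimal sketch represents integers by their residues and performs addition and multiplication independently per modulus. The moduli (7, 11, 13) and all function names are assumptions chosen for this example; the abstract does not state which moduli or word lengths the actual system uses.

```python
from math import prod

# Hypothetical, pairwise coprime moduli; the representable range is 0..1000.
MODULI = (7, 11, 13)


def to_rns(x):
    """Encode a non-negative integer as a tuple of residues."""
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    """Add two RNS values; each channel works independently, with no carries."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a, b):
    """Multiply two RNS values channel by channel."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(residues):
    """Decode with the Chinese remainder theorem (requires Python 3.8+)."""
    M = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse
    return total % M


if __name__ == "__main__":
    # A multiply-accumulate step computed entirely on residues.
    acc = rns_add(rns_mul(to_rns(12), to_rns(34)), to_rns(56))
    print(from_rns(acc))
```

Each channel only ever handles values smaller than its modulus, so a per-channel operation has a small enough input space to be tabulated, which is consistent with the abstract's mention of LUT-based multiplication and addition.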
Minoru INAMORI Hiroshi NAKADA Ryusuke KONISHI Akira NAGOYA Kiyoshi OGURI
This paper proposes a method for mapping a finite state machine (FSM) into a two-dimensional array of LUTs, which is a part of our plastic cell architecture (PCA). LSIs based on the PCA have already been implemented as asynchronous devices. Functions that run on the LSIs must also be asynchronous. In order to make good use of the LSIs, a system that translates functions into circuit information for the PCA is needed. We introduce a prototype system that maps an asynchronous FSM onto the PCA. First, a basic mapping method is considered, and then we create three methods to minimize circuit size. Some benchmark suites are synthesized to estimate their efficiency. Experimental results show that all the methods can map an asynchronous FSM onto the PCA and that the three methods can effectively reduce circuit size.
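The general idea of expressing an FSM's next-state function as LUT contents can be sketched as follows. This is only a simplified, step-by-step software model: it ignores asynchronous handshaking and placement in the two-dimensional array, and the example FSM (a detector for two consecutive 1s), the 3-input LUT size, and all names are assumptions for illustration, not the paper's mapping method.

```python
STATE_BITS = 2  # hypothetical: up to 4 states, encoded in 2 bits

# Hypothetical FSM: state 2 is reached after two consecutive 1 inputs.
TRANSITIONS = {
    (0, 0): 0, (0, 1): 1,
    (1, 0): 0, (1, 1): 2,
    (2, 0): 0, (2, 1): 2,
}


def build_luts(transitions):
    """Build one 8-entry truth table per next-state bit.

    The LUT address is (s1, s0, x): the current state bits followed by the input bit.
    Unspecified (state, input) pairs default to state 0.
    """
    luts = [[0] * 8 for _ in range(STATE_BITS)]
    for addr in range(8):
        state, x = addr >> 1, addr & 1
        nxt = transitions.get((state, x), 0)
        for bit in range(STATE_BITS):
            luts[bit][addr] = (nxt >> bit) & 1
    return luts


def step(luts, state, x):
    """Compute the next state purely by table lookups."""
    addr = (state << 1) | x
    return sum(luts[bit][addr] << bit for bit in range(STATE_BITS))


def run(luts, inputs, state=0):
    """Feed a sequence of input bits and return the visited states."""
    trace = []
    for x in inputs:
        state = step(luts, state, x)
        trace.append(state)
    return trace


if __name__ == "__main__":
    luts = build_luts(TRANSITIONS)
    print(run(luts, [1, 1, 0, 1]))
```

In this model the circuit size is simply the number of LUTs, so a different state encoding or a decomposition of the transition function would change how many tables are needed.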