Tatsuma MORI Taito MANABE Yuichiro SHIBATA
The convex hull is the smallest convex set enclosing a given set of points. Since finding convex hulls has various practical applications, including embedded real-time systems, efficient acceleration of convex hull algorithms is an important problem in computational geometry. In this paper, we discuss an FPGA acceleration approach to address this problem. In order to compute the convex hull of an unsorted point set, it is necessary to store all the points during the computation, and thus the capacity of an on-chip memory is likely to be a major constraint for efficient FPGA implementation. On the other hand, approximate convex hulls are often sufficient for practical applications. Therefore, we propose a hardware-oriented approximate convex hull algorithm, which can process the input points as a stream without storing all the points in memory. We also propose some computation reduction techniques for efficient FPGA implementation. Then, we present an FPGA implementation of the proposed algorithm, which is parallelized in both the temporal and spatial domains, and evaluate its effectiveness in terms of performance and accuracy. As a result, we demonstrated 11 to 30 times faster performance compared to the widely used convex hull software library Qhull. In addition, accuracy assessment revealed that the maximum approximation error normalized to the diameters of point sets was 0.038%, which is reasonably small for practical use cases.
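The abstract does not spell out the approximation scheme, so the following is only a minimal sketch of one common streaming approach, assumed here for illustration: for each of `k` fixed directions, keep the point with the largest projection seen so far. The function name `streaming_approx_hull` and the parameter `k` are hypothetical, and this is not claimed to be the authors' algorithm. It does show the key property named in the text: memory use is proportional to `k`, not to the number of input points.

```python
import math

def streaming_approx_hull(points, k=16):
    """Approximate convex hull of a point stream using k fixed directions.

    Assumed scheme (not taken from the paper): for every direction, remember
    only the extreme point seen so far, so storage is O(k).
    """
    dirs = [(math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
            for i in range(k)]
    best = [None] * k
    best_proj = [-math.inf] * k
    for x, y in points:                      # single pass over the stream
        for i, (dx, dy) in enumerate(dirs):
            proj = x * dx + y * dy
            if proj > best_proj[i]:
                best_proj[i] = proj
                best[i] = (x, y)
    # Extreme points for increasing angles appear in counter-clockwise order;
    # drop consecutive repeats (including the wrap-around repeat).
    hull = []
    for p in best:
        if p is not None and (not hull or hull[-1] != p):
            hull.append(p)
    if len(hull) > 1 and hull[0] == hull[-1]:
        hull.pop()
    return hull
```

Increasing `k` tightens the approximation at the cost of more comparisons per point, which is the kind of accuracy/resource trade-off a hardware design would have to settle.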
A parallel phrase matching (PM) engine for dictionary compression is presented. A hardware-based parallel chaining hash eliminates erroneous PM results caused by hash collisions, while a newly designed storage architecture holding PM results resolves the data dependency issue. As a result, the average compression speed is increased by 53%.
Masamitsu TANAKA Kazuyoshi TAKAGI Naofumi TAKAGI
We present circuit implementations for computing exponentials and logarithms suitable for rapid single-flux-quantum (RSFQ) logic. We propose hardware algorithms based on the sequential table-lookup (STL) method using the radix-2 signed-digit representation that achieve high-throughput, digit-serial calculations. The circuits are implemented by processing elements formed in systolic-array-like, regularly-aligned pipeline structures. The processing elements are composed of adders, shifters, and readouts of precomputed constants. The iterative calculations are fully overlapped, and throughputs approach the maximum throughput of serial processing. The circuit size for calculating significand parts is estimated to be approximately 5-10 times larger than that of a bit-serial floating-point adder or multiplier.
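As a rough software illustration of the shift-and-add style of computation described above, the sketch below computes a natural logarithm by multiplicative normalization: the argument is driven toward 1 by factors of the form (1 + 2^-k), and the matching precomputed constants ln(1 + 2^-k) are accumulated. It assumes the plain digit set {0, 1} and floating-point arithmetic, not the radix-2 signed-digit, digit-serial form used in the paper; the name `stl_log` and the iteration count `n` are hypothetical.

```python
import math

def stl_log(x, n=40):
    """Natural logarithm of x in [0.5, 1] by table lookup, shifts, and adds.

    Simplified sketch: each step either multiplies x by (1 + 2**-k), which
    in hardware is one shift and one addition, or leaves it unchanged.
    """
    # Table of precomputed constants ln(1 + 2^-k), k = 1..n.
    table = [math.log(1 + 2.0 ** -k) for k in range(1, n + 1)]
    y = 0.0
    for k in range(1, n + 1):
        t = x + x * 2.0 ** -k        # x * (1 + 2^-k)
        if t <= 1.0:                 # accept the factor only if x stays <= 1
            x = t
            y -= table[k - 1]        # subtract the tabulated logarithm
    return y                         # x is now close to 1, so y ~ ln(x_initial)
```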
Naofumi TAKAGI Masamitsu TANAKA
Recent advances in superconducting single-flux-quantum (SFQ) circuit technology make it attractive to investigate computing systems using SFQ circuits, where arithmetic circuits play important roles. In order to develop excellent SFQ arithmetic circuits, we have to design or select their underlying algorithms, called hardware algorithms, from a different point of view than for CMOS circuits, because SFQ circuits work by pulse logic while CMOS circuits work by level logic. In this paper, we compare implementations of hardware algorithms for addition by synchronous-clocking SFQ circuits. We show that a set of individual bit-serial adders and a Kogge-Stone adder are superior to the others.
Naofumi HOMMA Yuki WATANABE Takafumi AOKI Tatsuo HIGUCHI
This paper presents a formal design of arithmetic circuits using an arithmetic description language called ARITH. The key idea in ARITH is to describe arithmetic algorithms directly with high-level mathematical objects (i.e., number representation systems and arithmetic operations/formulae). Using ARITH, we can provide formal descriptions of arithmetic algorithms, including those using unconventional number systems. In addition, the described arithmetic algorithms can be formally verified by equivalence checking with formula manipulations. The verified ARITH descriptions are easily translated into equivalent HDL descriptions. In this paper, we also present an application of ARITH to an arithmetic module generator, which supports a variety of hardware algorithms for 2-operand adders, multi-operand adders, multipliers, constant-coefficient multipliers, and multiply accumulators. The language processing system of ARITH incorporated in the generator verifies the correctness of ARITH descriptions by a formal method. As a result, we can obtain highly reliable arithmetic modules whose functions are completely verified at the algorithm level.
Naofumi TAKAGI Shunsuke KADOWAKI Kazuyoshi TAKAGI
A hardware algorithm for integer division is proposed. It is based on the radix-2 non-restoring division algorithm. Fast computation is achieved by the use of the radix-2 signed-digit (SD2) representation. The algorithm does not require normalization of the divisor, and hence requires neither area-consuming leading-one (or leading-zero) detection nor variable-amount shifts. Combinational (unfolded) implementation of the algorithm yields a regularly structured array divider, and sequential implementation yields compact dividers.
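For reference, the underlying radix-2 non-restoring recurrence can be sketched in ordinary integer arithmetic as follows. This is the textbook form with a conventional (non-redundant) partial remainder, not the SD2-based algorithm proposed in the paper; the function name `nonrestoring_divide` and the parameter `bits` are hypothetical.

```python
def nonrestoring_divide(n, d, bits):
    """Radix-2 non-restoring division of n by d, with 0 <= n < 2**bits, d > 0.

    Returns (quotient, remainder). The partial remainder r is allowed to go
    negative; the next step then adds d instead of subtracting it.
    """
    r = 0
    q = 0
    for i in range(bits - 1, -1, -1):
        bit = (n >> i) & 1
        if r >= 0:
            r = 2 * r + bit - d      # shift in the next dividend bit, subtract
        else:
            r = 2 * r + bit + d      # shift in the next dividend bit, add back
        q = (q << 1) | (1 if r >= 0 else 0)
    if r < 0:                        # single correction step at the end
        r += d
    return q, r
```

No normalization of `d` is needed in this form either: the invariant -d <= r < d holds after every step for any positive divisor.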
A hardware algorithm for computing the reciprocal of the Euclidean norm of a 3-dimensional (3-D) vector, which appears frequently in 3-D computer graphics, is proposed. It is based on a digit-recurrence algorithm for computing the Euclidean norm and an on-line division (on-line reciprocal computation) algorithm. These algorithms are modified so that the reciprocal of the Euclidean norm is computed by performing on-line division in which the divisor is the partial result of the Euclidean norm computation. Division, square-rooting, and reciprocal square-root computation, which are important operations in 3-D graphics, can also be performed using a circuit based on the proposed algorithm.
Marcelo E. KAIHARA Naofumi TAKAGI
A hardware algorithm for modular multiplication/division which performs modular division, Montgomery multiplication, and ordinary modular multiplication is proposed. The modular division in our algorithm is based on the extended Euclidean algorithm. We employ our newly proposed computation method that consists of processing the multiplier from the most significant digit first to calculate Montgomery multiplication. Finally, the ordinary modular multiplication is based on shift-and-add multiplication. Each of these three operations is carried out through the iteration of simple operations such as shifts and additions/subtractions. To avoid carry propagation in all additions and subtractions, the radix-2 signed-digit representation is employed. A modular multiplier/divider based on the algorithm has a linear array structure with a bit-slice feature and carries out n-bit modular multiplication/division in O(n) clock cycles, where the length of the clock cycle is constant and independent of n. This multiplier/divider can be implemented using a hardware amount only slightly larger than that of the modular divider.
Naofumi TAKAGI Daisuke MATSUOKA Kazuyoshi TAKAGI
A digit-recurrence algorithm for computing the reciprocal square-root, which appears frequently in multimedia and graphics applications, is proposed. The reciprocal square-root is computed by iteration of carry-propagation-free additions, shifts, and multiplications by one digit. Different specific versions of the algorithm are possible, depending on the radix, the redundancy factor of the digit set, etc. Details of a radix-2 version and a radix-4 version, and designs of a floating-point reciprocal square-root circuit based on them, are shown.
Yasuaki WATANABE Naofumi TAKAGI Kazuyoshi TAKAGI
A VLSI algorithm for division in GF(2^m) with the canonical basis representation is proposed. It is based on the extended Binary GCD algorithm for GF(2^m), and performs division through iteration of simple operations, such as shifts and bitwise exclusive-OR operations. A divider in GF(2^m) based on the algorithm has a linear array structure with a bit-slice feature and carries out division in 2m clock cycles. The amount of hardware of the divider is proportional to m and the depth is a constant independent of m.
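The following is a software sketch of division in GF(2^m) by a binary extended GCD that uses only shifts and exclusive-ORs. Field elements are represented as Python integers whose bits are polynomial coefficients, and `f` is an irreducible polynomial of degree m >= 2. The control flow is a generic formulation assumed for illustration, not the cycle-accurate algorithm of the paper; the names `gf2m_divide` and `gf2m_multiply` are hypothetical, and the multiplier is included only to check the result.

```python
def gf2m_divide(a, b, f):
    """Return a / b in GF(2^m) defined by irreducible polynomial f (b != 0)."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^m)")
    u, v = b, f          # gcd operands
    g, h = a, 0          # invariants: g*b = a*u and h*b = a*v (mod f)
    while u != 1:
        if u & 1 == 0:               # u divisible by x: divide u and g by x
            u >>= 1
            if g & 1:
                g ^= f               # make g divisible by x (f has constant term 1)
            g >>= 1
        else:
            if u.bit_length() < v.bit_length():
                u, v = v, u          # keep deg(u) >= deg(v)
                g, h = h, g
            u ^= v                   # polynomial subtraction is XOR
            g ^= h
    return g

def gf2m_multiply(a, b, f):
    """Shift-and-add multiplication in GF(2^m), used here only for checking."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f                   # reduce modulo f
    return r
```

For example, with f = x^3 + x + 1 (binary `0b1011`), multiplying `gf2m_divide(a, b, f)` by `b` gives back `a` for every nonzero `b`.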
A digit-recurrence algorithm for cube rooting is proposed. In cube rooting, the digit-recurrence equation of the residual includes the square of the partial result of the cube root. In the proposed algorithm, the square of the partial result is kept, and the square, as well as the residual, is updated by addition/subtraction, shift, and multiplication by one or two digits. Different specific versions of the algorithm are possible, depending on the radix, the digit set of the cube root, etc. Any version of the algorithm can be implemented as a sequential (folded) circuit or a combinational (unfolded) circuit, which is suitable for VLSI realization.
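The idea of keeping the square of the partial result alongside the residual can be sketched for integers as follows. This is a simplified radix-2 version with the digit set {0, 1} and exact integer arithmetic, assumed for illustration; it is not the redundant-digit recurrence of the paper. The name `digit_recurrence_cbrt` is hypothetical.

```python
def digit_recurrence_cbrt(x, bits):
    """Integer cube root of x, with 0 <= x < 2**(3*bits).

    Returns (q, r) with q = floor(cbrt(x)) and r = x - q**3.
    State: q (partial root), s = q**2 (kept square), r (residual).
    """
    q, s, r = 0, 0, x
    for i in range(bits - 1, -1, -1):
        # Growth of q**3 if bit i is set: 3*q^2*b + 3*q*b^2 + b^3, b = 2**i.
        # Every term is a shift of a kept quantity times a small constant.
        delta = 3 * (s << i) + 3 * (q << (2 * i)) + (1 << (3 * i))
        if delta <= r:
            r -= delta
            s += (q << (i + 1)) + (1 << (2 * i))   # (q + b)^2 = s + 2*q*b + b^2
            q += 1 << i
    return q, r
```

Because `s` is updated incrementally, no step multiplies two full-width operands, which mirrors the addition, shift, and small-digit multiplication structure described in the abstract.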
Itsuo TAKANAMI Tadayoshi HORITA
We propose a model for fault-tolerant 3D processor arrays using one-and-half track switches. Spare processors are laid on the two opposite surfaces of the 3D array. The fault compensation process is performed by shifting processors on a continuous straight line (called a compensation path) from a faulty processor to a spare on the surfaces. Compensation paths are not allowed to be in the near-miss relation with each other. Then, switches with only 4 states are needed to preserve the 3D mesh topology after compensating for faults. We give an algorithm, in a form convenient for hardware, for reconfiguring 3D mesh arrays with faults. The algorithm can reconfigure the 3D mesh arrays in polynomial time. By computer simulation, we show the survival rates and the reliabilities of arrays, which express the efficiency of reconfiguration according to the algorithm. The reliabilities are compared with those of the model using double tracks, for which the near-miss relation among compensation paths is allowed but whose hardware overhead is almost double that of the proposed model using one-and-half track switches. Finally, we design a logic circuit for hardware realization of the algorithm. Using the circuit, we can construct a built-in self-reconfigurable 3D mesh array in which the reconfiguration is done very quickly without the aid of a host computer.
A new hardware algorithm for block matching video motion estimation is presented. The algorithm works in the full-search fashion, but unlike the Full-Search Block Matching Algorithm (FSBMA) it adjusts the number of computations dynamically to variable picture contents. Owing to an incorporated mechanism of data-driven thresholding, the proposed algorithm performs about four times fewer operations than the FSBMA while maintaining the same quality of results. Its hardware implementation is simple and compact. A supporting hardware design as well as simulation results on benchmarks are outlined.
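The abstract does not define the thresholding rule, so the sketch below assumes one well-known data-driven form: a candidate block is abandoned as soon as its partial sum of absolute differences (SAD) reaches the best SAD found so far. Since only candidates that cannot win are cut short, the selected motion vector has the same SAD as in a plain full search. The function name and parameters are hypothetical, and frames are plain lists of rows.

```python
def full_search_with_threshold(cur, ref, bx, by, n, search):
    """Find the motion vector of the n x n block at (bx, by) in frame cur.

    Searches displacements in [-search, search] in frame ref and returns
    ((dx, dy), sad). Assumed early-termination rule: stop accumulating a
    candidate's SAD once it cannot beat the current best.
    """
    height, width = len(ref), len(ref[0])
    best_vec, best_sad = None, float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y0, x0 = by + dy, bx + dx
            if y0 < 0 or x0 < 0 or y0 + n > height or x0 + n > width:
                continue                     # candidate falls outside the frame
            sad = 0
            for r in range(n):
                for c in range(n):
                    sad += abs(cur[by + r][bx + c] - ref[y0 + r][x0 + c])
                if sad >= best_sad:          # threshold adapts to the data
                    break
            if sad < best_sad:
                best_sad, best_vec = sad, (dx, dy)
    return best_vec, best_sad
```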
An algorithm for modular division which is suitable for VLSI implementation is proposed. It is based on the plus-minus algorithm, which is a modification of the binary method for calculating the greatest common divisor (GCD). The plus-minus algorithm for calculating the GCD is extended to perform modular division. A modular division is carried out through iteration of simple operations, such as shifts and additions/subtractions. A redundant binary representation is employed so that additions/subtractions are performed without carry propagation. A modular divider based on the algorithm has a linear array structure with a bit-slice feature and carries out an n-bit modular division in O(n) clock cycles, where the length of the clock cycle is constant and independent of n.
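To illustrate how a binary GCD can be extended to modular division using only halving, subtraction, and comparison, here is a minimal sketch for an odd modulus. It uses the plain binary method with a magnitude comparison and ordinary integers, not the plus-minus variant or the redundant binary representation of the paper; the name `mod_divide_binary` is hypothetical.

```python
def mod_divide_binary(x, y, m):
    """Return x / y mod m for odd m > 1 and gcd(y, m) = 1."""
    a, b = y % m, m
    u, v = x % m, 0      # invariants: u*y = x*a and v*y = x*b (mod m)
    if a == 0:
        raise ValueError("y is not invertible modulo m")
    while a != 1:
        if a % 2 == 0:
            a //= 2                                  # halve a ...
            u = u // 2 if u % 2 == 0 else (u + m) // 2   # ... and halve u mod m
        else:
            if a < b:
                a, b = b, a
                u, v = v, u
            a -= b                                   # odd - odd = even
            u = (u - v) % m
            if a == 0:
                raise ValueError("y is not invertible modulo m")
    return u
```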
A hardware algorithm for modular division is proposed. It is based on the extended Euclidean algorithm (EEA). The procedure for finding the multiplicative inverse in EEA is modified so that it calculates the quotient. Modular division is carried out through iteration of simple operations, such as shifts and additions. A redundant binary representation is employed so that additions are performed without carry propagation. An n-bit modular division is carried out in O(n) clock cycles. The length of each clock cycle is constant independent of n. A modular divider based on the algorithm has a bit-slice structure and is suitable for VLSI implementation.
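The modification mentioned above, making the inverse-finding procedure produce the quotient directly, can be sketched by seeding the coefficient sequence of the extended Euclidean algorithm with the dividend instead of 1. This sketch uses ordinary integer division in each step, whereas the paper replaces it with shifts and carry-free additions; the name `mod_divide_eea` is hypothetical.

```python
def mod_divide_eea(x, y, m):
    """Return x / y mod m using the extended Euclidean algorithm."""
    r0, r1 = m, y % m
    t0, t1 = 0, x % m    # invariant: t_i * y = x * r_i (mod m)
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, (t0 - q * t1) % m
    if r0 != 1:
        raise ValueError("y is not invertible modulo m")
    return t0            # r0 = 1, so t0 * y = x (mod m)
```

Seeding with `x` saves the separate modular multiplication that would otherwise follow an inversion.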
It is demonstrated that enhancing the functional capability of an elemental transistor is essential in developing human-like intelligent electronic systems. For this purpose we have introduced the concept of four-terminal devices. Four-terminal devices have an additional degree of freedom in controlling currents as compared to three-terminal devices like bipolar and MOS transistors. The importance of the four-terminal device concept is demonstrated taking the neuron MOS transistor (abbreviated as neuMOS or νMOS) and its circuit applications as examples. We have found that any Boolean function can be realized by a two-stage configuration of νMOS inverters. In addition, the variable threshold nature of the device allows us to build real-time reconfigurable logic circuits (no floating gate charging effect is involved in varying the threshold). Based on this principle, we have developed Soft-Hardware Logic Circuits and Real-Time Rule-Variable Data Matching Circuits. A winner-take-all circuit which finds the largest signal by hardware parallel processing has also been developed. The circuit is applied to building an associative memory which is different from the Hopfield network in both principle and operation. The hardware algorithm in which binary, multivalue, and analog operations are merged at the very device level is essential to establishing intelligent information processing systems based on highly flexible, real-time programmable hardware realized by four-terminal devices.
A hardware algorithm for modular inversion is proposed. It is based on the extended Euclidean algorithm. All intermediate results are represented in a redundant binary representation with a digit set {0, 1, -1}. All additions/subtractions are performed without carry propagation. A modular inversion is carried out in O(n) clock cycles, where n is the word length of the modulus. The length of each clock cycle is constant and independent of n. A modular inverter based on the algorithm has a regular cellular array structure with a bit-slice feature and is very suitable for VLSI implementation. Its amount of hardware is proportional to n.
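Carry-propagation-free addition in a redundant binary representation with digits from {-1, 0, 1} can be sketched as below. The rule set is the standard one for radix-2 signed-digit addition, given here as an illustration and not taken from the paper itself. Numbers are little-endian digit lists, and the names `sd_add` and `sd_value` are hypothetical.

```python
def sd_value(digits):
    """Integer value of a little-endian signed-digit list."""
    return sum(d * (1 << i) for i, d in enumerate(digits))

def sd_add(x, y):
    """Add two radix-2 signed-digit numbers without carry propagation.

    Each output digit depends only on input digits at positions i, i-1, i-2,
    so all positions can be computed in parallel in constant time.
    """
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    inter = [0] * (n + 1)    # intermediate sum digits
    carry = [0] * (n + 1)    # carry[i + 1] is the carry out of position i
    for i in range(n):
        z = x[i] + y[i]
        # Peek one position lower to choose a carry that cannot collide.
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if z == 2:
            carry[i + 1], inter[i] = 1, 0
        elif z == 1:
            carry[i + 1], inter[i] = (1, -1) if lower_nonneg else (0, 1)
        elif z == 0:
            carry[i + 1], inter[i] = 0, 0
        elif z == -1:
            carry[i + 1], inter[i] = (0, -1) if lower_nonneg else (-1, 1)
        else:
            carry[i + 1], inter[i] = -1, 0
    # Final digits: intermediate sum plus incoming carry, always in {-1, 0, 1}.
    return [inter[i] + carry[i] for i in range(n + 1)]
```

Subtraction follows by negating every digit of the second operand before adding.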
Takahiro HANYU Michitaka KAMEYAMA Tatsuo HIGUCHI
Rapid advances in integrated circuit technology based on binary logic have made possible the fabrication of digital circuits or digital VLSI systems with not only a very large number of devices on a single chip or wafer, but also high-speed processing capability. However, the advance of processing speeds and the improvement in cost/performance ratio based on conventional binary logic will not always continue unabated in submicron geometry. Submicron integrated circuits can handle multiple-valued signals at high speed rather than binary signals, especially at the data communication level, because of the reduced interconnections. The use of nonbinary logic or discrete-analog signal processing will not be out of the question if multiple-valued hardware algorithms are developed for fast parallel operations. Moreover, in VLSI or ULSI processors, the delay time due to global communications between functional modules or chips, rather than that of each functional module itself, is the most important factor determining the total performance. Locally computable hardware implementation and new parallel hardware algorithms natural to multiple-valued data representation and circuit technologies are the keys to developing VLSI processors in submicron geometry. As a result, multiple-valued VLSI processors make it possible to significantly improve the effective chip density together with the processing speed. In this paper, we summarize several potential advantages of multiple-valued VLSI processors in submicron geometry, which stem from the great reduction of interconnections and the suitability for locally computable hardware implementation, and demonstrate that some examples of special-purpose multiple-valued VLSI processors, namely a signed-digit arithmetic VLSI processor, a residue arithmetic VLSI processor, and a matching VLSI processor, can achieve higher performance for real-world computing systems.