The search functionality is under construction.

Author Search Result

[Author] Naofumi TAKAGI(36hit)

1-20hit(36hit)

  • Nested Loop Parallelization Using Polyhedral Optimization in High-Level Synthesis

    Akihiro SUDA  Hideki TAKASE  Kazuyoshi TAKAGI  Naofumi TAKAGI  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E97-A No:12
      Page(s):
    2498-2506

    We propose a synthesis method of nested loops into parallelized circuits by integrating the polyhedral optimization, which is a state-of-the-art technique in the field of software, into high-level synthesis. Our method constructs circuits equipped with multiple processing elements (PEs), using information generated by the polyhedral optimizing compiler. Since multiple PEs cannot concurrently access the off-chip RAM, a method for constructing on-chip buffers is also proposed. Our buffering method reduces the off-chip RAM access conflicts and further enables burst accesses and data reuses. In our experimental result, the buffered circuits generated by our method are 8.2 times on average and 26.5 times at maximum faster than the sequential non-buffered ones, when each of the parallelized circuits is configured with eight PEs.

  • Low-Overhead Fault-Secure Parallel Prefix Adder by Carry-Bit Duplication

    Nobutaka KITO  Naofumi TAKAGI  

     
    PAPER

      Vol:
    E96-D No:9
      Page(s):
    1962-1970

    We propose a low-overhead fault-secure parallel prefix adder. We duplicate carry bits for checking purposes. Only one half of normal carry bits are compared with the corresponding redundant carry bits, and the hardware overhead of the adder is low. For concurrent error detection, we also predict the parity of the result. The adder uses parity-based error detection and it has high compatibility with systems that have parity-based error detection. We can implement various fault-secure parallel prefix adders such as Sklansky adder, Brent-Kung adder, Han-Carlson adder, and Kogge-Stone adder. The area overhead of the proposed adder is about 15% lower than that of a previously proposed adder that compares all the carry bits.

  • A Verification Method for Single-Flux-Quantum Circuits Using Delay-Based Time Frame Model

    Takahiro KAWAGUCHI  Kazuyoshi TAKAGI  Naofumi TAKAGI  

     
    PAPER-Logic Synthesis, Test and Verification

      Vol:
    E98-A No:12
      Page(s):
    2556-2564

    Superconducting single-flux-quantum (SFQ) device is an emerging device which can realize digital circuits with high switching speed and low power consumption. In SFQ digital circuits, voltage pulses are used for carrier of information, and the representation of logic values is different from that of CMOS circuits. Design methods exclusive to SFQ circuits have been developed. In this paper, we present timing analysis and functional verification methods for SFQ circuits based on new timing model which we call delay-based time frame model. Assuming that possible pulse arrival is periodic, the model defines comprehensive time frames and representation of logic values. In static timing analysis, expected pulse arrival time is checked based on the model, and the order among pulse arrival times is calculated for each logic gate. In functional verification, the circuit behavior is abstracted in a form similar to a synchronous sequential circuit using the order of pulse arrival times, and then the behavior is verified using formal verification tools. Using our proposed methods, we can verify the functional behavior of SFQ circuits with complex clocking scheme, which appear often in practical design but cannot be dealt with in existing verification method. Experimental results show that our method can be applied to practical designs.

  • A Hardware Algorithm for Modular Division Based on the Extended Euclidean Algorithm

    Naofumi TAKAGI  

     
    PAPER-Computer Hardware and Design

      Vol:
    E79-D No:11
      Page(s):
    1518-1522

    A hardware algorithm for modular division is proposed. It is based on the extended Euclidean algorithm (EEA). The procedure for finding the multiplicative inverse in EEA is modified so that it calculates the quotient. Modular division is carried out through iteration of simple operations, such as shifts and additions. A redundant binary representation is employed so that additions are performed without carry propagation. An n-bit modular division is carried out in O(n) clock cycles. The length of each clock cycle is constant independent of n. A modular divider based on the algorithm has a bit-slice structure and is suitable for VLSI implementation.

  • RSFQ 4-bit Bit-Slice Integer Multiplier

    Guang-Ming TANG  Kazuyoshi TAKAGI  Naofumi TAKAGI  

     
    PAPER

      Vol:
    E99-C No:6
      Page(s):
    697-702

    A rapid single-flux-quantum (RSFQ) 4-bit bit-slice multiplier is proposed. A new systolic-like multiplication algorithm suitable for RSFQ implementation is developed. The multiplier is designed using the cell library for AIST 10-kA/cm2 1.0-µm fabrication technology (ADP2). Concurrent flow clocking is used to design a fully pipelined RSFQ logic design. A 4n×4n-bit multiplier consists of 2n+17 stages. For verifying the algorithm and the logic design, a physical layout of the 8×8-bit multiplier has been designed with target operating frequency of 50GHz and simulated. It consists of 21 stages and 11,488 Josephson junctions. The simulation results show correct operation up to 62.5GHz.

  • A Digit-Recurrence Algorithm for Cube Rooting

    Naofumi TAKAGI  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E84-A No:5
      Page(s):
    1309-1314

    A digit-recurrence algorithm for cube rooting is proposed. In cube rooting, the digit-recurrence equation of the residual includes the square of the partial result of the cube root. In the proposed algorithm, the square of the partial result is kept, and the square, as well as the residual, is updated by addition/subtraction, shift, and multiplication by one or two digits. Different specific versions of the algorithm are possible, depending on the radix, the digit set of the cube root, and etc. Any version of the algorithm can be implemented as a sequential (folded) circuit or a combinational (unfolded) circuit, which is suitable for VLSI realization.

  • Design and High-Speed Demonstration of Single-Flux-Quantum Bit-Serial Floating-Point Multipliers Using a 10kA/cm2 Nb Process

    Xizhu PENG  Yuki YAMANASHI  Nobuyuki YOSHIKAWA  Akira FUJIMAKI  Naofumi TAKAGI  Kazuyoshi TAKAGI  Mutsuo HIDAKA  

     
    PAPER

      Vol:
    E97-C No:3
      Page(s):
    188-193

    Recently, we proposed a new data-path architecture, named a large-scale reconfigurable data-path (LSRDP), based on single-flux-quantum (SFQ) circuits, to establish a fundamental technology for future high-end computers. In this architecture, a large number of SFQ floating-point units (FPUs) are used as core components, and their high performance and low power consumption are essential. In this research, we implemented an SFQ half-precision bit-serial floating-point multiplier (FPM) with a target clock frequency of 50GHz, using the AIST 10kA/cm2 Nb process. The FPM was designed, based on a systolic-array architecture. It contains 11,066 Josephson junctions, including on-chip high-speed test circuits. The size and power consumption of the FPM are 6.66mm × 1.92mm and 2.83mW, respectively. Its correct operation was confirmed at a maximum frequency of 93.4GHz for the exponent part and of 72.0GHz for the significand part by on-chip high-speed tests.

  • Large-Scale Integrated Circuit Design Based on a Nb Nine-Layer Structure for Reconfigurable Data-Path Processors Open Access

    Akira FUJIMAKI  Masamitsu TANAKA  Ryo KASAGI  Katsumi TAKAGI  Masakazu OKADA  Yuhi HAYAKAWA  Kensuke TAKATA  Hiroyuki AKAIKE  Nobuyuki YOSHIKAWA  Shuichi NAGASAWA  Kazuyoshi TAKAGI  Naofumi TAKAGI  

     
    INVITED PAPER

      Vol:
    E97-C No:3
      Page(s):
    157-165

    We describe a large-scale integrated circuit (LSI) design of rapid single-flux-quantum (RSFQ) circuits and demonstrate several reconfigurable data-path (RDP) processor prototypes based on the ISTEC Advanced Process (ADP2). The ADP2 LSIs are made up of nine Nb layers and Nb/AlOx/Nb Josephson junctions with a critical current density of 10kA/cm2, allowing higher operating frequencies and integration. To realize truly large-scale RSFQ circuits, careful design is necessary, with several compromises in the device structure, logic gates, and interconnects, balancing the competing demands of integration density, design flexibility, and fabrication yield. We summarize numerical and experimental results related to the development of a cell-based design in the ADP2, which features a unit cell size reduced to 30-µm square and up to four strip line tracks in the unit cell underneath the logic gates. The ADP LSIs can achieve ∼10 times the device density and double the operating frequency with the same power consumption per junction as conventional LSIs fabricated using the Nb four-layer process. We report the design and test results of RDP processor prototypes using the ADP2 cell library. The RDP processors are composed of many arrays of floating-point units (FPUs) and switch networks, and serve as accelerators in a high-performance computing system. The prototypes are composed of two-dimensional arrays of several arithmetic logic units instead of FPUs. The experimental results include a successful demonstration of full operation and reconfiguration in a 2×2 RDP prototype made up of 11.5k junctions at 45GHz after precise timing design. Partial operation of a 4×4 RDP prototype made up of 28.5k-junctions is also demonstrated, indicating the scalability of our timing design.

  • A Method of Sequential Circuit Synthesis Using One-Hot Encoding for Single-Flux-Quantum Digital Circuits

    Koji OBATA  Kazuyoshi TAKAGI  Naofumi TAKAGI  

     
    PAPER-Superconducting Electronics

      Vol:
    E90-C No:12
      Page(s):
    2278-2284

    A method of sequential circuit synthesis is proposed for Single-Flux-Quantum (SFQ) digital circuits. Since all logic gates of SFQ digital circuits are driven by a clock signal, methods of sequential circuit synthesis for semiconductor digital circuits cannot derive the full power of high-throughput computation of SFQ circuit technology. In the method, a 'state module' consisting of a DFF and several AND gates is used. First, states of a sequential machine are encoded by one-hot encoding and state modules are assigned to the states one-by-one, and then, the modules are connected with each other according to the state transition. For the connection, Confluence Buffers (CBs), i.e., merger gates without clock signals are used. Consequently, gates driven by a clock signal are removed from its feedback loops, and therefore, a high-throughput SFQ sequential circuit is achieved. The experimental results on benchmark circuits show that compared with a conventional method for semiconductor digital circuits, the proposed method synthesizes circuits that work with 4.9 times higher clock frequency and have 17.3% more gates on average.

  • Hardware Algorithm for Computing Reciprocal of Euclidean Norm of a 3-D Vector

    Fumio KUMAZAWA  Naofumi TAKAGI  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E89-A No:6
      Page(s):
    1799-1806

    A hardware algorithm for computing the reciprocal of the Euclidean norm of a 3-dimensional (3-D) vector which appears frequently in 3-D computer graphics is proposed. It is based on a digit-recurrence algorithm for computing the Euclidean norm and an on-line division (on-line reciprocal computation) algorithm. These algorithms are modified, so that the reciprocal of the Euclidean norm is computed by performing on-line division where the divisor is the partial result of Euclidean norm computation. Division, square-rooting, and reciprocal square-root computation, which are important operations in 3-D graphics, can also be performed using a circuit based on the proposed algorithm.

  • 32-Bit ALU with Clockless Gates for RSFQ Bit-Parallel Processor Open Access

    Takahiro KAWAGUCHI  Naofumi TAKAGI  

     
    INVITED PAPER

      Pubricized:
    2021/12/03
      Vol:
    E105-C No:6
      Page(s):
    245-250

    A 32-bit arithmetic logic unit (ALU) is designed for a rapid single flux quantum (RSFQ) bit-parallel processor. In the ALU, clocked gates are partially replaced by clockless gates. This reduces the number of D flip flops (DFFs) required for path balancing. The number of clocked gates, including DFFs, is reduced by approximately 40 %, and size of the clock distribution network is reduced. The number of pipeline stages becomes modest. The layout design of the ALU and simulation results show the effectiveness of using clockless gates in wide datapath circuits.

  • Digit-Recurrence Algorithm for Computing Reciprocal Square-Root

    Naofumi TAKAGI  Daisuke MATSUOKA  Kazuyoshi TAKAGI  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E86-A No:1
      Page(s):
    221-228

    A digit-recurrence algorithm for computing reciprocal square-root which appears frequently in multimedia and graphics applications is proposed. The reciprocal square-root is computed by iteration of carry-propagation-free additions, shifts, and multiplications by one digit. Different specific versions of the algorithm are possible, depending on the radix, the redundancy factor of the digit set, and etc. Details of a radix-2 version and a radix-4 version and designs of a floating-point reciprocal square-root circuit based on them are shown.

  • Layout-Driven Skewed Clock Tree Synthesis for Superconducting SFQ Circuits

    Kazuyoshi TAKAGI  Yuki ITO  Shota TAKESHIMA  Masamitsu TANAKA  Naofumi TAKAGI  

     
    PAPER

      Vol:
    E94-C No:3
      Page(s):
    288-295

    In this paper, we propose a method for layout-driven skewed clock tree synthesis for SFQ logic circuits. For a given logic circuit without a clock tree, our algorithm outputs a circuit with a synthesized clock tree and timing adjustments achieving the given clock period and a rough placement of the clocked gates. In the proposed algorithm, clocked gates are grouped into levels and the clock tree is synthesized for each level. For each level, we estimate the clock timing for all possible placements of each gate, and then we search a placement of all gates that minimizes the total number of delay elements for timing adjustment. Once the placement is obtained, we synthesize a clock tree without wire intersections. We applied the proposed method to a moderate size circuit and confirmed that clock trees satisfying given timing requirements can be synthesized automatically.

  • A VLSI Algorithm for Division in GF(2m) Based on Extended Binary GCD Algorithm

    Yasuaki WATANABE  Naofumi TAKAGI  Kazuyoshi TAKAGI  

     
    PAPER

      Vol:
    E85-A No:5
      Page(s):
    994-999

    A VLSI algorithm for division in GF(2m) with the canonical basis representation is proposed. It is based on the extended Binary GCD algorithm for GF(2m), and performs division through iteration of simple operations, such as shifts and bitwise exclusive-OR operations. A divider in GF(2m) based on the algorithm has a linear array structure with a bit-slice feature and carries out division in 2m clock cycles. The amount of hardware of the divider is proportional to m and the depth is a constant independent of m.

  • A VLSI Architecture for Output Probability Computations of HMM-Based Recognition Systems with Store-Based Block Parallel Processing

    Kazuhiro NAKAMURA  Masatoshi YAMAMOTO  Kazuyoshi TAKAGI  Naofumi TAKAGI  

     
    PAPER-VLSI Systems

      Vol:
    E93-D No:2
      Page(s):
    300-305

    In this paper, a fast and memory-efficient VLSI architecture for output probability computations of continuous Hidden Markov Models (HMMs) is presented. These computations are the most time-consuming part of HMM-based recognition systems. High-speed VLSI architectures with small registers and low-power dissipation are required for the development of mobile embedded systems with capable human interfaces. We demonstrate store-based block parallel processing (StoreBPP) for output probability computations and present a VLSI architecture that supports it. When the number of HMM states is adequate for accurate recognition, compared with conventional stream-based block parallel processing (StreamBPP) architectures, the proposed architecture requires fewer registers and processing elements and less processing time. The processing elements used in the StreamBPP architecture are identical to those used in the StoreBPP architecture. From a VLSI architectural viewpoint, a comparison shows the efficiency of the proposed architecture through efficient use of registers for storing input feature vectors and intermediate results during computation.

  • High-Throughput Rapid Single-Flux-Quantum Circuit Implementations for Exponential and Logarithm Computation Using the Radix-2 Signed-Digit Representation

    Masamitsu TANAKA  Kazuyoshi TAKAGI  Naofumi TAKAGI  

     
    PAPER

      Vol:
    E99-C No:6
      Page(s):
    703-709

    We present circuit implementations for computing exponentials and logarithms suitable for rapid single-flux-quantum (RSFQ) logic. We propose hardware algorithms based on the sequential table-lookup (STL) method using the radix-2 signed-digit representation that achieve high-throughput, digit-serial calculations. The circuits are implemented by processing elements formed in systolic-array-like, regularly-aligned pipeline structures. The processing elements are composed of adders, shifters, and readouts of precomputed constants. The iterative calculations are fully overlapped, and throughputs approach the maximum throughput of serial processing. The circuit size for calculating significand parts is estimated to be approximately 5-10 times larger than that of a bit-serial floating-point adder or multiplier.

  • Algorithms for Evaluating the Matrix Polynomial I+A+A2+…+AN-1 with Reduced Number of Matrix Multiplications

    Kotaro MATSUMOTO  Kazuyoshi TAKAGI  Naofumi TAKAGI  

     
    PAPER-Algorithms and Data Structures

      Vol:
    E101-A No:2
      Page(s):
    467-471

    The problem of evaluating the matrix polynomial I+A+A2+…+AN-1 with a reduced number of matrix multiplications has long been considered. Several algorithms have been proposed for this problem, which find a procedure requiring O(log N) matrix multiplications for a given N. Among them, the hybrid algorithm based on the double-base representation of N, i.e., using mixed radices 2 and 3, proposed by Dimitrov and Cooklev is most efficient. It has been suggested by them that the use of higher radices would not bring any more efficient algorithms. In this paper, we show that we can derive more efficient algorithms by using higher radices, and propose several efficient algorithms.

  • A Hardware Algorithm for Integer Division Using the SD2 Representation

    Naofumi TAKAGI  Shunsuke KADOWAKI  Kazuyoshi TAKAGI  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E89-A No:10
      Page(s):
    2874-2881

    A hardware algorithm for integer division is proposed. It is based on the radix-2 non-restoring division algorithm. Fast computation is achieved by the use of the radix-2 signed-digit (SD2) representation. The algorithm does not require normalization of the divisor, and hence, does not require an area-consuming leading-one (or zero) detection nor shifts of variable-amount. Combinational (unfolded) implementation of the algorithm yields a regularly structured array divider, and sequential implementation yields compact dividers.

  • Automated Passive-Transmission-Line Routing Tool for Single-Flux-Quantum Circuits Based on A* Algorithm

    Masamitsu TANAKA  Koji OBATA  Yuki ITO  Shota TAKESHIMA  Motoki SATO  Kazuyoshi TAKAGI  Naofumi TAKAGI  Hiroyuki AKAIKE  Akira FUJIMAKI  

     
    PAPER-Digital Applications

      Vol:
    E93-C No:4
      Page(s):
    435-439

    We demonstrated an automated passive-transmission-line routing tool for single-flux-quantum (SFQ) circuits. The tool is based on the A* algorithm, which is widely used in CMOS LSI design, and tuned for microstrip/strip lines formed in the SRL 4-Nb layer structure. In large-scale SFQ circuits with 10000-20000 Josephson junctions, such as microprocessors, 80-90% of the wires can be automatically routed in about ten minutes. We verified correct operation above 40 GHz for an automatically routed 44 switch circuit from on-chip high-speed tests. The resulting circuit size and operating frequency were comparable to those of a manually designed result. We believe that the tool is useful for large-scale SFQ circuit design using conventional fabrication processes.

  • Minimum Cut Linear Arrangement of p-q Dags for VLSI Layout of Adder Trees

    Kazuyoshi TAKAGI  Naofumi TAKAGI  

     
    PAPER

      Vol:
    E82-A No:5
      Page(s):
    767-774

    Two algorithms for minimum cut linear arrangement of a class of graphs called p-q dags are proposed. A p-q dag represents the connection scheme of an adder tree, such as Wallace tree, and the VLSI layout problem of a bit slice of an adder tree is treated as the minimum cut linear arrangement problem of its corresponding p-q dag. One of the two algorithms is based on dynamic programming. It calculates an exact minimum solution within nO(1) time and space, where n is the size of a given graph. The other algorithm is an approximation algorithm which calculates a solution with O(log n) cutwidth. It requires O(n log n) time.

1-20hit(36hit)