Author Search Result

[Author] Munehiro MATSUURA(14hit)

  • On Properties of Kleene TDDs

    Yukihiro IGUCHI  Tsutomu SASAO  Munehiro MATSUURA  

     
    PAPER-Logic Simulation and Logic Optimization

    Vol: E81-D  No: 7  Page(s): 716-723

    Three types of ternary decision diagrams (TDDs) are considered: AND-TDDs, EXOR-TDDs, and Kleene-TDDs. Kleene-TDDs are useful for logic simulation in the presence of unknown inputs. Let N(BDD:f), N(AND-TDD:f), and N(EXOR-TDD:f) be the number of non-terminal nodes in the BDD, the AND-TDD, and the EXOR-TDD for f, respectively. Let N(Kleene-TDD:F) be the number of non-terminal nodes in the Kleene-TDD for F, where F is the regular ternary function corresponding to f. Then N(BDD:f) ≤ N(TDD:f). For parity functions, N(BDD:f) = N(AND-TDD:f) = N(EXOR-TDD:f) = N(Kleene-TDD:F). For unate functions, N(BDD:f) = N(AND-TDD:f). The sizes of Kleene-TDDs are O(3^n/n) and O(n^3) for arbitrary functions and symmetric functions, respectively. There exists a 2n-variable function for which Kleene-TDDs require O(n) nodes with the best order, but O(3^n) nodes with the worst order.
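
To make "logic simulation in the presence of unknown inputs" concrete, here is a minimal Python sketch of a regular ternary function derived from a Boolean function f. It assumes the usual definition: the ternary function returns 0 or 1 when every way of filling in the unknown inputs gives that value, and "unknown" otherwise. The names `U`, `regular_ternary`, `and2`, and `parity3` are illustrative and not taken from the paper, and the sketch enumerates assignments directly rather than building a Kleene-TDD.

```python
from itertools import product

U = "u"  # illustrative marker for an unknown input value


def regular_ternary(f):
    """Wrap a Boolean function f so that it accepts inputs from {0, 1, U}."""
    def ternary_f(*args):
        unknown_positions = [i for i, v in enumerate(args) if v == U]
        seen = set()
        # Try every 0/1 assignment of the unknown positions.
        for fill in product((0, 1), repeat=len(unknown_positions)):
            x = list(args)
            for pos, bit in zip(unknown_positions, fill):
                x[pos] = bit
            seen.add(f(*x))
        # A single reachable value means the output is determined.
        return seen.pop() if len(seen) == 1 else U
    return ternary_f


and2 = regular_ternary(lambda a, b: a & b)
parity3 = regular_ternary(lambda a, b, c: a ^ b ^ c)
```

With these definitions, `and2(0, U)` is determined because one 0 input forces the AND to 0, whereas any unknown input of a parity function leaves the output unknown.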

  • A Parallel Branching Program Machine for Sequential Circuits: Implementation and Evaluation

    Hiroki NAKAHARA  Tsutomu SASAO  Munehiro MATSUURA  Yoshifumi KAWAMURA  

     
    PAPER-Logic Design

    Vol: E93-D  No: 8  Page(s): 2048-2058

    The parallel branching program machine (PBM128) consists of 128 branching program machines (BMs) and a programmable interconnection. To represent logic functions on BMs, we use quaternary decision diagrams. To evaluate functions, we use 3-address quaternary branch instructions. We realized many benchmark functions on the PBM128, and compared its memory size, computation time, and power consumption with those of Intel's Core2Duo microprocessor. The PBM128 requires approximately a quarter of the memory of the Core2Duo, and is 21.4-96.1 times faster than the Core2Duo. It dissipates a quarter of the power of the Core2Duo. Also, we realized packet filters such as an access controller and a firewall, and compared their performance with software on the Core2Duo. For these packet filters, the PBM128 requires approximately 17% of the memory of the Core2Duo, and is 21.3-23.7 times faster than the Core2Duo.

  • A Design Method of a Regular Expression Matching Circuit Based on Decomposed Automaton

    Hiroki NAKAHARA  Tsutomu SASAO  Munehiro MATSUURA  

     
    PAPER-Design Methodology

    Vol: E95-D  No: 2  Page(s): 364-373

    This paper shows a design method for a regular expression matching circuit based on a decomposed automaton. To implement a regular expression matching circuit, first, we convert a regular expression into a non-deterministic finite automaton (NFA). Then, to reduce the number of states, we convert the NFA into a merged-states non-deterministic finite automaton with unbounded string transition (MNFAU) using a greedy algorithm. Next, to realize it with a feasible amount of hardware, we decompose the MNFAU into a deterministic finite automaton (DFA) and an NFA. The DFA part is implemented by an off-chip memory and a simple sequencer, while the NFA part is implemented by a cascade of logic cells. Also, in this paper, we show that the MNFAU-based implementation has lower area complexity than the DFA-based and NFA-based ones. Experiments using regular expressions from SNORT show that, in terms of the embedded memory size per character, the MNFAU is 17.17-148.70 times smaller than DFA methods. Also, in terms of the number of LCs (Logic Cells) per character, the MNFAU is 1.56-5.12 times smaller than NFA methods. This paper describes the details of the MEMOCODE2010 HW/SW co-design contest, for which we won the first place award.

  • A PC-Based Logic Simulator Using a Look-Up Table Cascade Emulator

    Hiroki NAKAHARA  Tsutomu SASAO  Munehiro MATSUURA  

     
    PAPER-Simulation and Verification

    Vol: E89-A  No: 12  Page(s): 3471-3481

    This paper presents a cycle-based logic simulation method using an LUT cascade emulator, where an LUT cascade consists of multiple-output LUTs (cells) connected in series. The LUT cascade emulator is an architecture that emulates LUT cascades. It has a control part, a memory for logic, and registers. It connects the memory to registers through a programmable interconnection circuit, and evaluates the given circuit stored in the memory. The LUT cascade emulator runs on an ordinary PC. This paper also compares the method with a Levelized Compiled Code (LCC) simulator and a simulator using a Quasi-Reduced Multi-valued Decision Diagram (QRMDD). Our simulator is 3.5 to 10.6 times faster than the LCC, and 1.1 to 3.9 times faster than the one using a QRMDD. The simulation setup time is 2.0 to 9.8 times shorter than that of the LCC. The necessary amount of memory is 1/1.8 to 1/5.5 of that of the one using a QRMDD.
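
The idea of cells connected in series can be illustrated with a small Python sketch. It assumes each cell is a table that maps the value arriving from the previous cell (the "rail") together with a few primary inputs to the value passed to the next cell. The cell contents here, a 4-input parity split over two cells, and the names `make_parity_cell` and `eval_cascade` are hypothetical and do not reproduce the emulator described in the paper.

```python
from itertools import product


def make_parity_cell(input_indices):
    """Build one cell: a table from (incoming rail, external bits) to outgoing rail."""
    table = {
        (rail, bits): rail ^ (sum(bits) & 1)
        for rail in (0, 1)
        for bits in product((0, 1), repeat=len(input_indices))
    }
    return input_indices, table


def eval_cascade(cells, inputs):
    """Evaluate the cells in order, feeding each cell's output into the next."""
    rail = 0  # the first cell has no predecessor, so start from a constant
    for input_indices, table in cells:
        bits = tuple(inputs[i] for i in input_indices)
        rail = table[(rail, bits)]  # one table look-up per cell
    return rail


# Two cells in series, each reading two of the four primary inputs.
cells = [make_parity_cell((0, 1)), make_parity_cell((2, 3))]
result = eval_cascade(cells, [1, 0, 1, 1])
```

Evaluation costs one look-up per cell regardless of the function stored in the tables, which is why the evaluation time of such a structure is easy to predict.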

  • Area-Time Complexities of Multi-Valued Decision Diagrams

    Shinobu NAGAYAMA  Tsutomu SASAO  Yukihiro IGUCHI  Munehiro MATSUURA  

     
    PAPER

    Vol: E87-A  No: 5  Page(s): 1020-1028

    This paper considers Quasi-Reduced ordered Multi-valued Decision Diagrams with k bits (QRMDD(k)s) to represent binary logic functions. Experimental results show relations between the values of k and the numbers of nodes, the memory sizes, the numbers of memory accesses, and area-time complexity for QRMDD(k)s. For many benchmark functions, the numbers of nodes and memory accesses for QRMDD(k)s are nearly equal to 1/k of those of the corresponding Quasi-Reduced ordered Binary Decision Diagrams (QRBDDs), and the memory sizes and the area-time complexities for QRMDD(k)s are minimum when k = 2 and k = 3-6, respectively.

  • A Realization of Multiple-Output Functions by a Look-Up Table Ring

    Hui QIN  Tsutomu SASAO  Munehiro MATSUURA  Shinobu NAGAYAMA  Kazuyuki NAKAMURA  Yukihiro IGUCHI  

     
    PAPER-Logic Synthesis

    Vol: E87-A  No: 12  Page(s): 3141-3150

    A look-up table (LUT) cascade is a new type of programmable logic device (PLD) that provides an alternative way to realize multiple-output functions. An LUT ring is an emulator for an LUT cascade. Compared with an LUT cascade, the LUT ring is more flexible. In this paper we discuss the realization of multiple-output functions with the LUT ring. Unlike an FPGA realization of a logic function, accurate prediction of the delay time is easy in an LUT ring realization. A prototype of an LUT ring has been custom-designed with 0.35 µm CMOS technology. Simulation results show that the LUT ring is 80 to 241 times faster than software programs on an SH-1, and 36 to 93 times faster than software programs on a Pentium III when the frequencies for the LUT ring and the MPUs are the same, but is slightly slower than commercial FPGAs.

  • A Quaternary Decision Diagram Machine: Optimization of Its Code

    Tsutomu SASAO  Hiroki NAKAHARA  Munehiro MATSUURA  Yoshifumi KAWAMURA  Jon T. BUTLER  

     
    INVITED PAPER

    Vol: E93-D  No: 8  Page(s): 2026-2035

    This paper first reviews the trends of VLSI design, focusing on power dissipation and programmability. Then, we show the advantage of Quaternary Decision Diagrams (QDDs) in representing and evaluating logic functions. That is, we show how QDDs are used to implement QDD machines, which yield high-speed implementations. We compare QDD machines with binary decision diagram (BDD) machines, and show a speed improvement of 1.28-2.02 times when QDDs are chosen. We consider 1- and 2-address BDD machines, and 3- and 4-address QDD machines, and we show a method to minimize the number of instructions.

  • Bi-Partition of Shared Binary Decision Diagrams

    Munehiro MATSUURA  Tsutomu SASAO  Jon T. BUTLER  Yukihiro IGUCHI  

     
    PAPER-Logic Synthesis

    Vol: E85-A  No: 12  Page(s): 2693-2700

    A shared binary decision diagram (SBDD) represents a multiple-output function, where nodes are shared among BDDs representing the various outputs. A partitioned SBDD consists of two or more SBDDs that share nodes. The separate SBDDs are optimized independently, often resulting in a reduction in the number of nodes over a single SBDD. We show a method for partitioning a single SBDD into two parts that reduces the node count. Among the benchmark functions tested, a node reduction of up to 23% is realized.

  • A Packet Classifier Based on Prefetching EVMDD (k) Machines

    Hiroki NAKAHARA  Tsutomu SASAO  Munehiro MATSUURA  

     
    PAPER-Logic Design

    Vol: E97-D  No: 9  Page(s): 2243-2252

    A Decision Diagram Machine (DDM) is a special-purpose processor that has special instructions to evaluate a decision diagram. Since the DDM uses only a limited number of instructions, it is faster than a general-purpose Micro Processor Unit (MPU). Also, the architecture for the DDM is much simpler than that for an MPU. This paper presents a packet classifier using a parallel EVMDD (k) machine. To reduce computation time and code size, first, a set of rules for a packet classifier is partitioned into groups. Then, the parallel EVMDD (k) machine evaluates them. To further speed up the standard EVMDD (k) machine, we propose the prefetching EVMDD (k) machine, which reads both the index and the jump address at the same time. The prefetching EVMDD (k) machine is 2.4 times faster than the standard one using the same memory size. We implemented a parallel prefetching EVMDD (k) machine consisting of 30 machines on an FPGA, and compared it with Intel's Core i5 microprocessor running at 1.7 GHz. Our parallel machine is 15.1-77.5 times faster than the Core i5, and it requires only 8.1-58.5 percent of the memory of the Core i5.

  • A Virus Scanning Engine Using an MPU and an IGU Based on Row-Shift Decomposition

    Hiroki NAKAHARA  Tsutomu SASAO  Munehiro MATSUURA  

     
    PAPER-Application

    Vol: E96-D  No: 8  Page(s): 1667-1675

    This paper shows a virus scanning engine using two-stage matching. In the first stage, a binary CAM emulator quickly detects a part of the virus pattern, while in the second stage, the MPU detects the full length of the virus pattern. The binary CAM emulator is realized by an index generation unit (IGU) based on row-shift decomposition. The proposed system uses two off-chip SRAMs and a small FPGA. Thus, its cost and power consumption are lower than those of a TCAM-based system. The system loaded 1,290,617 ClamAV virus patterns. In terms of area and throughput, this system outperforms existing two-stage matching systems using FPGAs.

  • Design Methods of Radix Converters Using Arithmetic Decompositions

    Yukihiro IGUCHI  Tsutomu SASAO  Munehiro MATSUURA  

     
    PAPER-Computer Components

    Vol: E90-D  No: 6  Page(s): 905-914

    In arithmetic circuits for digital signal processing, radixes other than two are often used to make circuits faster. In such cases, radix converters are necessary. However, in general, radix converters tend to be complex. This paper considers design methods for p-nary to binary converters. First, it considers Look-Up Table (LUT) cascade realizations. Then, it introduces a new design technique called arithmetic decomposition by using LUTs and adders. Finally, it compares the amount of hardware and performance of radix converters implemented by FPGAs. 12-digit ternary to binary converters on Cyclone II FPGAs designed by the proposed method are faster than ones by conventional methods.
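
As a rough illustration of combining LUTs and adders for radix conversion, the Python sketch below splits a ternary number into digit groups, looks up the pre-weighted binary value of each group in its own table, and sums the table outputs. This is only one plausible reading of the technique named in the abstract, not the paper's design. The group size of 4 and the names `build_luts` and `to_binary` are assumptions made for the example.

```python
from itertools import product


def build_luts(num_digits, group, p=3):
    """Create one table per digit group; each entry is the group's weighted value."""
    luts = []
    for start in range(0, num_digits, group):
        width = min(group, num_digits - start)
        table = {
            digits: sum(d * p ** (start + i) for i, d in enumerate(digits))
            for digits in product(range(p), repeat=width)
        }
        luts.append((start, width, table))
    return luts


def to_binary(digits, luts):
    """digits are least-significant first; the sum plays the role of the adders."""
    return sum(table[tuple(digits[s:s + w])] for s, w, table in luts)


# A 12-digit ternary input handled by three tables of 3**4 = 81 entries each.
luts = build_luts(num_digits=12, group=4)
largest = to_binary([2] * 12, luts)
```

A single table for all 12 digits would need 3**12 = 531441 entries, while the three grouped tables need 243 entries in total plus the additions, which shows the kind of trade-off between memory and adders that such a decomposition makes.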

  • A Memory-Based IPv6 Lookup Architecture Using Parallel Index Generation Units

    Hiroki NAKAHARA  Tsutomu SASAO  Munehiro MATSUURA  Hisashi IWAMOTO  Yasuhiro TERAO  

     
    PAPER-Architecture

    Publicized: 2014/11/19  Vol: E98-D  No: 2  Page(s): 262-271

    In the era of IPv6, since the number of IPv6 addresses is rapidly increasing and the required speed is more than one giga lookups per second (GLPS), an area-efficient and high-speed IP lookup architecture is desired. This paper shows a parallel index generation unit (IGU) for a memory-based IPv6 lookup architecture. To reduce the size of memory in the IGU, we use a linear transformation and a row-shift decomposition. A single-memory realization requires O(2^l log k) memory size, where l denotes the length of the prefix, while the realization using the IGU requires O(kl) memory size, where k denotes the number of prefixes. In IPv6 prefix lookup, since l is at most 64 and k is about 340 K, the IGU drastically reduces the memory size. Also, to reduce the cost, we realize the parallel IGU by using both on-chip and off-chip memories. We show a design algorithm for the parallel IGU to fit the given off-chip and on-chip memories. The parallel IGU has a simple architecture and performs lookup by using complete pipelines that insert pipeline registers in all the paths. We loaded more than 340 K IPv6 pseudo prefixes on a Xilinx Virtex 6 FPGA with off-chip DDRII+ Static RAMs (SRAMs). Its lookup speed is 1.100 GLPS, which is sufficient for the speed required by a next-generation 400 Gbps link throughput. In terms of normalized area and lookup speed, our implementation outperforms existing FPGA implementations.
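
The gap between the two memory bounds can be checked with a quick back-of-the-envelope calculation in Python. The sketch treats the asymptotic expressions as exact bit counts with constant factor 1 and takes 340 K to mean 340,000. Both are simplifying assumptions, so only the order of magnitude is meaningful. The function names are illustrative.

```python
import math


def single_memory_bits(l, k):
    """2^l words, each wide enough to hold an index for one of k prefixes."""
    return (2 ** l) * math.ceil(math.log2(k + 1))


def igu_bits(l, k):
    """k entries of l bits each, following the O(kl) bound."""
    return k * l


l, k = 64, 340_000  # prefix length and prefix count quoted in the abstract
ratio = single_memory_bits(l, k) / igu_bits(l, k)
```

Under these assumptions the IGU bound comes to about 21.8 Mbit, while the single-memory bound is more than twelve orders of magnitude larger, which is why a direct table indexed by the 64-bit prefix is impractical.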

  • BDD Representation for Incompletely Specified Multiple-Output Logic Functions and Its Applications to the Design of LUT Cascades

    Munehiro MATSUURA  Tsutomu SASAO  

     
    PAPER-Logic Synthesis and Verification

    Vol: E90-A  No: 12  Page(s): 2762-2769

    A multiple-output function can be represented by a binary decision diagram for its characteristic function (BDD_for_CF). This paper presents a method to represent multiple-output incompletely specified functions using BDD_for_CFs. An algorithm to reduce the widths of BDD_for_CFs is presented. This method is useful for decomposition of incompletely specified multiple-output functions. Experimental results for radix converters, adders, a multiplier, and lists of English words show that this method is useful for the synthesis of LUT cascades. An implementation of an English word list by LUT cascades and an auxiliary memory is also shown.

  • A Design Algorithm for Sequential Circuits Using LUT Rings

    Hiroki NAKAHARA  Tsutomu SASAO  Munehiro MATSUURA  

     
    PAPER-Logic Synthesis

    Vol: E88-A  No: 12  Page(s): 3342-3350

    This paper shows a design method for a sequential circuit by using a Look-Up Table (LUT) ring. The method consists of two steps. The first step partitions the outputs into groups. The second step realizes them by LUT cascades, and allocates the cells of the cascades into the memory. The system automatically finds a fast implementation by maximally utilizing available memory. With the presented algorithm, we can easily design sequential circuits satisfying given specifications. The paper also compares the LUT ring with a logic simulator for realizing sequential circuits: the LUT ring is 25 to 237 times faster than a logic simulator that uses the same amount of memory.