The search functionality is under construction.

Author Search Result

[Author] Tatsuo OHTSUKI(49hit)

1-20hit(49hit)

  • A Scan-Based Attack Based on Discriminators for AES Cryptosystems

    Ryuta NARA  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Embedded, Real-Time and Reconfigurable Systems

      Vol:
    E92-A No:12
      Page(s):
    3229-3237

    A scan chain is one of the most important testing techniques, but it can be used as side-channel attacks against a cryptography LSI. We focus on scan-based attacks, in which scan chains are targeted for side-channel attacks. The conventional scan-based attacks only consider the scan chain composed of only the registers in a cryptography circuit. However, a cryptography LSI usually uses many circuits such as memories, micro processors and other circuits. This means that the conventional attacks cannot be applied to the practical scan chain composed of various types of registers. In this paper, a scan-based attack which enables to decipher the secret key in an AES cryptography LSI composed of an AES circuit and other circuits is proposed. By focusing on bit pattern of the specific register and monitoring its change, our scan-based attack eliminates the influence of registers included in other circuits than AES. Our attack does not depend on scan chain architecture, and it can decipher practical AES cryptography LSIs.

  • Fast Scheduling and Allocation Algorithms for Entropy CODEC

    Katsuharu SUZUKI  Nozomu TOGAWA  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER-High Level Synthesis

      Vol:
    E80-D No:10
      Page(s):
    982-992

    Entropy coding/decoding are implemented on FPGAs as a fast and flexible system in which high-level synthesis technologies are key issues. In this paper, we propose scheduling and allocation algorithms for behavioral descriptions of entropy CODEC. The scheduling algorithm employs a control-flow graph as input and finds a solution with minimal hardware cost and execution time by merging nodes in the control-flow graph. The allocation algorithm assigns operations to operators with various bit lengths. As a result, register-transfer level descriptions are efficiently obtained from behavioral descriptions of entropy CODEC with complicated control flow and variable bit lengths. Experimental results demonstrate that our algorithms synthesize the same circuits as manually designed within one second.

  • A Retargetable Simulator Generator for DSP Processor Cores with Packed SIMD-type Instructions

    Nozomu TOGAWA  Kyosuke KASAHARA  Yuichiro MIYAOKA  Jinku CHOI  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Simulation Accelerator

      Vol:
    E86-A No:12
      Page(s):
    3099-3109

    A packed SIMD type operation or a SIMD operation is n-parallel b/n-bit sub-operations executed by the modified n-bit functional unit. Such a functional unit is called a SIMD functional unit and a processor core which can execute SIMD operations is called a SIMD processor core. SIMD operations can be effectively applied to image processing applications. This paper focuses on hardware/software cosynthesis of SIMD processor cores and particularly proposes a new simulator generator which simulates pipelined instructions for a SIMD processor. Generally, a SIMD functional unit has many options and then we can have so many different SIMD functional unit instances. However, since our hardware/software cosynthesis system synthesizes a special-purpose processor core for an input application program, it uses very limited SIMD functional unit instances. In the proposed approach, we consider a SIMD operation to be a set of SIMD sub-operations. By adding up the appropriate SIMD sub-operations, we construct a single SIMD operation. Then a SIMD functional unit behavior can be characterized by a collection of SIMD operations. This approach has the advantage that: if we have a small number of behavior libraries for SIMD sub-operations, we can instantiate a particular SIMD functional unit behavior. Experimental results demonstrate the effectiveness of the proposed approach.

  • A Circuit Partitioning Algorithm with Replication Capability for Multi-FPGA Systems

    Nozomu TOGAWA  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER

      Vol:
    E78-A No:12
      Page(s):
    1765-1776

    In circuit partitioning for FPGAs, partitioned signal nets are connected using I/O blocks, through which signals are coming from or going to external pins. However, the number of I/O blocks per chip is relatively small compared with the number of logic-blocks, which realize logic functions, accommodated in the FPGA chip. Because of the I/O block limitation, the size of a circuit implemented on each FPGA chip is usually small, which leads to a serious decrease of logic-block utilization. It is required to utilize unused logic-blocks in terms of reducing the number of I/O blocks and realize circuits on given FPGA chips. In this paper, we propose an algorithm which partitions an initial circuit into multi-FPGA chips. The algorithm is based on recursive bi-partitioning of a circuit. In each bi-partitioning, it searches a partitioning position of a circuit such that each of partitioned subcircuits is accommodated in each FPGA chip with making the number of signal nets between chips as small as possible. Such bi-partitioning is achieved by computing a minimum cut repeatedly applying a network flow technique, and replicating logic-blocks appropriately. Since a set of logic-blocks assigned to each chip is computed separately, logic-blocks to be replicated are naturally determined. This means that the algorithm makes good use of unused logic-blocks from the viewpoint of reducing the number of signal nets between chips, i.e. the number of required I/O blocks. The algorithm has been implemented and applied to MCNC PARTITIONING 93 benchmark circuits. The experimental results demonstrate that it decreases the maximum number of I/O blocks per chip by a maximum of 49% compared with conventional algorithms.

  • A Circuit Partitioning Algorithm with Path Delay Constraints for Multi-FPGA Systems

    Nozomu TOGAWA  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER

      Vol:
    E80-A No:3
      Page(s):
    494-505

    In this paper, we extend the circuit partitioning algorithm which we have proposed for multi-FPGA systems and present a new algorithm in which the delay of each critical signal path is within a specified upper bound imposed on it. The core of the presented algorithm is recursive bipartitioning of a circuit. The bipartitioning procedure consists of three stages: 0) detection of critical paths; 1) bipartitioning of a set of primary inputs and outputs; and 2) bipartitioning of a set of logic-blocks. In 0), the algorithm computes the lower bounds of delays for paths with path delay constraints and detects the critical paths based on the difference between the lower and upper bound dynamically in every bipartitioning procedure. The delays of the critical paths are reduced with higher priority. In 1), the algorithm attempts to assign the primary inputs and outputs on each critical path to one chip so that the critical path does not cross between chips. Finally in 2), the algorithm not only decreases the number of crossings between chips but also assigns the logic-blocks on each critical path to one chip by exploiting a network flow technique. The algorithm has been implemented and applied to MCNC PARTITIONING 93 benchmark circuits. The experimental results demonstrate that it resolves almost all path delay constraints with maintaining the maximum number of required I/O blocks per chip small compared with conventional alogorithms.

  • A Simultaneous Technology Mapping, Placement, and Global Routing Algorithm for FPGAs with Path Delay Constraints

    Nozomu TOGAWA  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER

      Vol:
    E79-A No:3
      Page(s):
    321-329

    In this paper, we propose a new FPGA design algorithm, Maple-opt, in which technology mapping, placement, and global routing are executed so that the delay of each critical signal path in an input circuit is within a specified upper bound imposed on it. The basic algorithm of Maple-opt is top-down hi-erarchical bi-partitioning of regions. Technology mapping onto logic-blocks of FPGAs, their placement, and global routing are determined simulatenously in each hierarchical process. This simultaneity leads to less congested layout for routing. In addition to that, Maple-opt computes a lower bound of delay for each path with a constraint value and determines critical paths based on the difference between the lower bound and the constraint value dynamically in each hierarchical process. Two delay reduction processes are executed for the critical paths; one is routing delay reduction and the other is logic-block delay reduction. Routing delay reduction is realized such that, when bi-partitioning a region, each constrained path is assigned to one subregion. Logic-block delay reduction is realized such that each constrained path is mapped onto fewer logic-blocks. Experimental results for some benchmark circuits show its efficiency and effectiveness.

  • Experimental Appraisal of Linear and Quadratic Objective Functions Effect on Force Directed Method for Analog Placement

    Imbaby I.MAHMOUD  Koji ASAKURA  Takashi NISHIBU  Tatsuo OHTSUKI  

     
    LETTER-Computer Aided Design (CAD)

      Vol:
    E77-A No:4
      Page(s):
    719-725

    This paper advocates the use of linear objective function in analytic analog placement. The role of linear and quadratic objctive functions in the behavior and results of an analog placement algorithm based on the force directed method is discussed. Experimental results for a MCNC benchmark circuit and another one from text books are shown to demonstrate the effect of a linear and a quadratic objective function on the analog constraint satisfaction and CPU time. By introducing linear objective function to the algorithm, we obtain better placements in terms of analog constraint satisfaction and computation cost than in case of conventional quadratic objective function.

  • CAM Processor Synthesis Based on Behavioral Descriptions

    Nozomu TOGAWA  Tatsuhiko WAKUI  Tatsuhiko YODEN  Makoto TERAJIMA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Co-design and High-level Synthesis

      Vol:
    E83-A No:12
      Page(s):
    2464-2473

    CAM (Content Addressable Memory) units are generally designed so that they can be applied to variety of application programs. However, if a particular application runs on CAM units, some functions in CAM units may be often used and other functions may never be used. We consider that appropriate design for CAM units is required depending on the requirements for a given application program. This paper proposes a CAM processor synthesis system based on behavioral descriptions. The input of the system is an application program written in C including CAM functions, and its output is hardware descriptions of a synthesized processor and a binary code executed on it. Since the system determines functions in CAM units and synthesizes a CAM processor depending on the requirements of an application program, we expect that a synthesized CAM processor can execute the application program with small processor area and delay. Experimental results demonstrate its efficiency and effectiveness.

  • FPGA-Based Reconfigurable Adaptive FEC

    Kazunori SHIMIZU  Jumpei UCHIDA  Yuichiro MIYAOKA  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-System Level Design

      Vol:
    E87-A No:12
      Page(s):
    3036-3046

    In this paper, we propose a reconfigurable adaptive FEC system. In adaptive FEC schemes, the error correction capability t is changed dynamically according to the communication channel condition. If a particular error correction capability t is given, we can implement an FEC decoder which is optimal for t by taking the number of operations into consideration. Thus, reconfiguring the optimal FEC decoder dynamically for each error correction capability allows us to maximize the throughput of each decoder within a limited hardware resource. Based on this concept, our reconfigurable adaptive FEC system can reduce the packet dropping rate more efficiently than conventional fixed hardware systems. We can improve data transmission throughput for a reliable transport protocol. Practical simulation results are also shown.

  • A CAM-Based Parallel Fault Simulation Algorithm with Minimal Storage Size

    Shinsuke OHNO  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER

      Vol:
    E78-A No:12
      Page(s):
    1755-1764

    CAMs (Content Addressable Memories) are functional memories which have functions such as word-parallel equivalence search, bilateral 1-bit data shifting between consecutive words, and word-parallel writing. Since CAMs can be integrated because of their regular structure, massively parallel CAM functions can be executed. Taking advantage of CAMs, Ishiura and Yajima have proposed a parallel fault simulation algorithm using a CAM. This algorithm, however, requires a large amount of CAM storage to simulate large-scale circuits. In this paper, we propose a new massively parallel fault simulation algorithm requiring less CAM storage, and compare it with Ishiura and Yajima's algorithm. Experimental results of the algorithm on CHARGE --the CAM-based hardware engine developed in our laboratory--are also reported.

  • X-Handling for Current X-Tolerant Compactors with More Unknowns and Maximal Compaction

    Youhua SHI  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Logic Synthesis, Test and Verfication

      Vol:
    E92-A No:12
      Page(s):
    3119-3127

    This paper presents a novel X-handling technique, which removes the effect of unknowns on compacted test response with maximal compaction ratio. The proposed method combines with the current X-tolerant compactors and inserts masking cells on scan paths to selectively mask X's. By doing this, the number of unknown responses in each scan-out cycle could be reduced to a reasonable level such that the target X-tolerant compactor would tolerate with guaranteed possible error detection. It guarantees no test loss due to the effect of X's, and achieves the maximal compaction that the target response compactor could provide as well. Moreover, because the masking cells are only inserted on the scan paths, it has no performance degradation of the designs. Experimental results demonstrate the effectiveness of the proposed method.

  • A Selective Scan Chain Reconfiguration through Run-Length Coding for Test Data Compression and Scan Power Reduction

    Youhua SHI  Shinji KIMURA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Test

      Vol:
    E87-A No:12
      Page(s):
    3208-3215

    Test data volume and power consumption for scan-based designs are two major concerns in system-on-a-chip testing. However, test set compaction by filling the don't-cares will invariably increase the scan-in power dissipation for scan testing, then the goals of test data reduction and low-power scan testing appear to be conflicted. Therefore, in this paper we present a selective scan chain reconfiguration method for test data compression and scan-in power reduction. The proposed method analyzes the compatibility of the internal scan cells for a given test set and then divides the scan cells into compatible classes. After the scan chain reconfiguration a dictionary is built to indicate the run-length of each compatible class and only the scan-in data for each class should be transferred from the ATE to the CUT so as to reduce test data volume. Experimental results for the larger ISCAS'89 benchmarks show that the proposed approach overcomes the limitations of traditional run-length coding techniques, and leads to highly reduced test data volume with significant power savings during scan testing in all cases.

  • High-Level Area/Delay/Power Estimation for Low Power System VLSIs with Gated Clocks

    Shinichi NODA  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER

      Vol:
    E85-A No:4
      Page(s):
    827-834

    At high-level synthesis for system VLSIs, their power consumption is efficiently reduced by applying gated clocks to them. Since using gated clocks causes the reduction of power consumption and the increase of area/delay, estimating trade-off between power and area/delay by applying gated clocks is very important. In this paper, we discuss the amount of variance of area, delay and power by applying gated clocks. We propose a simple gate-level circuit model and estimation equations. We vary parameters in our proposed circuit model, and evaluate power consumption by back-annotating gate-level simulation results to the original circuit. This paper also proposes a conditional expression for applying gated clocks. The expression shows whether or not we can reduce power consumption by applying gated clocks. We confirm the accuracy of proposed estimation equations by experiments.

  • Scan-Based Side-Channel Attack against RSA Cryptosystems Using Scan Signatures

    Ryuta NARA  Kei SATOH  Masao YANAGISAWA  Tatsuo OHTSUKI  Nozomu TOGAWA  

     
    PAPER-Logic Synthesis, Test and Verification

      Vol:
    E93-A No:12
      Page(s):
    2481-2489

    Scan-based side-channel attacks retrieve a secret key in a cryptography circuit by analyzing scanned data. Since they must be considerable threats to a cryptosystem LSI, we have to protect cryptography circuits from them. RSA is one of the most important cryptography algorithms because it effectively realizes a public-key cryptography system. RSA is extensively used but conventional scan-based side-channel attacks cannot be applied to it because it has a complicated algorithm. This paper proposes a scan-based side-channel attack which enables us to retrieve a secret key in an RSA circuit. The proposed method is based on detecting intermediate values calculated in an RSA circuit. We focus on a 1-bit time-sequence which is specific to some intermediate values. By monitoring the 1-bit time-sequence in the scan path, we can find out the register position specific to the intermediate value and we can know whether this intermediate value is calculated or not in the target RSA circuit. We can retrieve a secret key one-bit by one-bit from MSB to LSB. The experimental results demonstrate that a 1,024-bit secret key used in the target RSA circuit can be retrieved using 30.2 input messages within 98.3 seconds and its 2,048-bit secret key can be retrieved using 34.4 input within 634.0 seconds.

  • Simultaneous Placement and Global Routing for Transport-Processing FPGA Layout

    Nozumu TOGAWA  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER

      Vol:
    E79-A No:12
      Page(s):
    2140-2150

    Transport-processing FPGAs have been proposed for flexible telecommunication systems. Since those FPGAs have finer granularity of logic functions to implement circuits on them, the amount of routing resources tends to increase. In order to keep routing congstion small, it is necessary to execute placement and routing simultaneously. This paper proposes a simultaneous placement and global routing algorithm for transport-processing FPGAs whose primary objective is minimizing routing congestion. The algorithm is based on hierarchical bipartition of layout regions and sets of LUTs (Look Up Tables) to be placed. It achieves bipartitioning which leads to small routing congestion by applying a network flow technique to it and computing a maximum flow and a minimum cut. If there exist connections between bipartitioned LUT sets, pairs of pseudo-terminals are introduced to preserve the connections. A sequence of pseudo-terminals represents a global route of each net. As a result, both placement of LUTs and global routing are determined when hierarchical bipartitioning procedures are finished. The proposed algorithm has been implemented and applied to practical transport-processing circuits. The experimental results demonstrate that it decreases routing congestion by an average of 37% compared with a conventional algorithm and achieves 100% routing for the circuits for which the conventional algorithm causes unrouted nets.

  • A Hybrid Dictionary Test Data Compression for Multiscan-Based Designs

    Youhua SHI  Shinji KIMURA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Test

      Vol:
    E87-A No:12
      Page(s):
    3193-3199

    In this paper, we present a test data compression technique to reduce test data volume for multiscan-based designs. In our method the internal scan chains are divided into equal sized groups and two dictionaries were build to encode either an entire slice or a subset of the slice. Depending on the codeword, the decompressor may load all scan chains or may load only a group of the scan chains, which can enhance the effectiveness of dictionary-based compression. In contrast to previous dictionary coding techniques, even for the CUT with a large number of scan chains, the proposed approach can achieve satisfied reduction in test data volume with a reasonable smaller dictionary. Experimental results showed the proposed test scheme works particularly well for the large ISCAS'89 benchmarks.

  • Optimal Constraint Graph Generation Algorithm for Layout Compaction Using Enhanced Plane-Sweep Method

    Toru AWASHIMA  Masao SATO  Tatsuo OHTSUKI  

     
    PAPER

      Vol:
    E76-A No:4
      Page(s):
    507-512

    This paper presents an optimal constraint graph generation algorithm for graph-based one-dimensional layout compaction. The first published algorithm for this problem was the shadow-propagation algorithm. However, without sophisticated implementation of a shadow-front, complexity of the algorithm could fall into O(n2), where n is the number of layout objects. Although our algorithm, called the enhanced plane-sweep based graph generation algorithm, is an extension of the shadow-propagation algorithm, such a drawback is resolved by introducing an enhanced plane-sweep technique. The algorithm maintains multiple shadow-fronts simultaneously by storing them in a work-list called previous-boundary. Since a balanced search tree is selected for implementation of the worklist, total complexity of the algorithm is O(n log n) which is optimal. Experimental results show that the enhanced plane-sweep based graph generation algorithm runs in almost linear time with respect to the number of layout objects and is faster than the perpendicular plane-sweep algorithm which is also optimal in terms of time complexity.

  • Unified Dual-Radix Architecture for Scalable Montgomery Multiplications in GF(P) and GF(2n)

    Kazuyuki TANIMURA  Ryuta NARA  Shunitsu KOHARA  Youhua SHI  Nozomu TOGAWA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E92-A No:9
      Page(s):
    2304-2317

    Modular multiplication is the most dominant arithmetic operation in elliptic curve cryptography (ECC), that is a type of public-key cryptography. Montgomery multiplier is commonly used to compute the modular multiplications and requires scalability because the bit length of operands varies depending on its security level. In addition, ECC is performed in GF(P) or GF(2n), and unified architecture for multipliers in GF(P) and GF(2n) is required. However, in previous works, changing frequency is necessary to deal with delay-time difference between GF(P) and GF(2n) multipliers because the critical path of the GF(P) multiplier is longer. This paper proposes unified dual-radix architecture for scalable Montgomery multiplications in GF(P) and GF(2n). This proposed architecture unifies four parallel radix-216 multipliers in GF(P) and a radix-264 multiplier in GF(2n) into a single unit. Applying lower radix to GF(P) multiplier shortens its critical path and makes it possible to compute the operands in the two fields using the same multiplier at the same frequency so that clock dividers to deal with the delay-time difference are not required. Moreover, parallel architecture in GF(P) reduces the clock cycles increased by dual-radix approach. Consequently, the proposed architecture achieves to compute a GF(P) 256-bit Montgomery multiplication in 0.28 µs. The implementation result shows that the area of the proposal is almost the same as that of previous works: 39 kgates.

  • A Built-in Reseeding Technique for LFSR-Based Test Pattern Generation

    Youhua SHI  Zhe ZHANG  Shinji KIMURA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Timing Verification and Test Generation

      Vol:
    E86-A No:12
      Page(s):
    3056-3062

    Reseeding technique is proposed to improve the fault coverage in pseudo-random testing. However most of previous works on reseeding is based on storing the seeds in an external tester or in a ROM. In this paper we present a built-in reseeding technique for LFSR-based test pattern generation. The proposed structure can run both in pseudorandom mode and in reseeding mode. Besides, our method requires no storage for the seeds since in reseeding mode the seeds can be generated automatically in hardware. In this paper we also propose an efficient grouping algorithm based on simulated annealing to optimize test vector grouping. Experimental results for benchmark circuits indicate the superiority of our technique against other reseeding methods with respect to test length and area overhead. Moreover, since the theoretical properties of LFSRs are preserved, our method could be beneficially used in conjunction with any other techniques proposed so far.

  • A Depth-Constrained Technology Mapping Algorithm for Logic-Blocks Composed of Tree-Structured LUTs

    Nozomu TOGAWA  Koji ARA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER

      Vol:
    E82-A No:3
      Page(s):
    473-482

    This paper proposes a fast depth-constrained technology mapping algorithm for logic-blocks composed of tree-structured lookup tables. First, we propose a technology mapping algorithm which minimizes the number of logic-blocks if an input Boolean network is a tree. Second, we propose a technology mapping algorithm which minimizes logic depth for any input Boolean network. Finally, we combine those two technology mapping algorithms and propose an algorithm which realizes technology mapping whose depth is bounded by a given upper bound dc. Experimental results demonstrate the effectiveness and efficiency of the proposed algorithm.

1-20hit(49hit)