The search functionality is under construction.

Author Search Result

[Author] Masashi IMAI(9hit)

1-9hit
  • High-Throughput Partially Parallel Inter-Chip Link Architecture for Asynchronous Multi-Chip NoCs

    Naoya ONIZAWA  Akira MOCHIZUKI  Hirokatsu SHIRAHAMA  Masashi IMAI  Tomohiro YONEDA  Takahiro HANYU  

     
    PAPER-Dependable Computing

      Vol:
    E97-D No:6
      Page(s):
    1546-1556

    This paper introduces a partially parallel inter-chip link architecture for asynchronous multi-chip Network-on-Chips (NoCs). The multi-chip NoCs that operate as a large NoC have been recently proposed for very large systems, such as automotive applications. Inter-chip links are key elements to realize high-performance multi-chip NoCs using a limited number of I/Os. The proposed asynchronous link based on level-encoded dual-rail (LEDR) encoding transmits several bits in parallel that are received by detecting the phase information of the LEDR signals at each serial link. It employs a burst-mode data transmission that eliminates a per-bit handshake for a high-speed operation, but the elimination may cause data-transmission errors due to cross-talk and power-supply noises. For triggering data retransmission, errors are detected from the embedded phase information; error-detection codes are not used. The throughput is theoretically modelled and is optimized by considering the bit-error rate (BER) of the link. Using delay parameters estimated for a 0.13 µm CMOS technology, the throughput of 8.82 Gbps is achieved by using 10 I/Os, which is 90.5% higher than that of a link using 9 I/Os without an error-detection method operating under negligible low BER (<10-20).

  • Signal Propagation Delay Model in Vertically Stacked Chips

    Nanako NIIOKA  Masayuki WATANABE  Masa-aki FUKASE  Masashi IMAI  Atsushi KUROKAWA  

     
    PAPER-Device and Circuit Modeling and Analysis

      Vol:
    E98-A No:12
      Page(s):
    2614-2624

    To design high quality three-dimensional integrated circuits (3-D ICs), the effect of process and design parameters on delay must be adequately understood. This paper presents an electrical circuit model of an entire structure in through silicon via (TSV) based 3-D ICs with a new equation for on-chip interconnect capacitance and then proposes an effective model for evaluating signal propagation delay in vertically stacked chips. All electrical parameter values can be calculated by the closed-form equations without a field solver. The delay model is constructed with the first- or second-order function of each parameter to the delay obtained from a typical structure. The results obtained by on-chip interconnect capacitance equations and delay model are in excellent agreement with those by a field solver and circuit simulator, respectively. We also show that the model is very useful for evaluating effects of the process and design parameters on vertical signal propagation delay such as the sensitivity and variability analysis.

  • Synthesis of Serial Local Clock Controllers for Asynchronous Circuit Design

    Nattha SRETASEREEKUL  Hiroshi SAITO  Euiseok KIM  Metehan OZCAN  Masashi IMAI  Hiroshi NAKAMURA  Takashi NANYA  

     
    PAPER-IP Design

      Vol:
    E86-A No:12
      Page(s):
    3028-3037

    Asynchronous controllers effectively control high concurrence of datapath operations for high speed. Signal Transition Graphs (STGs) can effectively represent these concurrent events. However, highly concurrent STGs cause the state explosion problem in asynchronous synthesis tools. Many small but highly concurrent STGs cannot be synthesized to obtain control circuits. Moreover, STGs also lead to some control-time overhead of the four-phase handshake protocol. In this paper, we propose a method for deriving the serial control nodes from Control Data Flow Graphs (CDFGs) such that the concurrence of datapath operations is still preserved. The STGs derived from the serialized control nodes are serial STGs which are simpler for synthesis than the concurrent STGs. We also propose an implementation using these serialized controllers to generate local clocks at any necessary times. The implementation results in very small control-time overhead. The experimental results show that the number of synthesis states is proportional to the number of control signals, and the circuits with satisfiable small control-time overhead are obtained.

  • Design Method of High Performance and Low Power Functional Units Considering Delay Variations

    Kouichi WATANABE  Masashi IMAI  Masaaki KONDO  Hiroshi NAKAMURA  Takashi NANYA  

     
    PAPER-Circuit Synthesis

      Vol:
    E89-A No:12
      Page(s):
    3519-3528

    As VLSI technology advances, delay variations will become more serious. Delay-insensitive asynchronous dual-rail circuits tolerate any delay variation, but their energy consumption is more than double that of the single-rail circuits because signal transitions occur every cycle in all bits regardless of the input bit pattern. However, in functional units, a significant number of input bits may not change from the previous input in many cases. In such a situation, calculation of these bits is not required. Thus, we propose a method, called unflip-bits control, makes use of the above situation, to reduce energy consumption. We evaluate the energy consumption and performance penalty for the method using HSPICE and the verilog-XL simulator, and compare the method with the conventional dual-rail circuit and a synchronous circuit. Our evaluation results reveal that the proposed asynchronous dual-rail circuit has a 12-60% lower energy consumption compared with a conventional asynchronous dual-rail circuit.

  • Task Scheduling Based Redundant Task Allocation Method for the Multi-Core Systems with the DTTR Scheme

    Hiroshi SAITO  Masashi IMAI  Tomohiro YONEDA  

     
    PAPER

      Vol:
    E100-A No:7
      Page(s):
    1363-1373

    In this paper, we propose a redundant task allocation method for multi-core systems based on the Duplication with Temporary Triple-Modular Redundancy and Reconfiguration (DTTR) scheme. The proposed method determines task allocation of a given task graph to a given multi-core system model from task scheduling in given fault patterns. Fault patterns defined in this paper consist of a set of faulty cores and a set of surviving cores. To optimize the average failure rate of the system, task scheduling minimizes the execution time of the task graph preserving the property of the DTTR scheme. In addition, we propose a selection method of fault patterns to be scheduled to reduce the task allocation time. In the experiments, at first, we evaluate the proposed selection method of fault patterns in terms of the task allocation time. Then, we compare the average failure rate among the proposed method, a task allocation method which packs tasks into particular cores as much as possible, a task allocation method based on Simulated Annealing (SA), a task allocation method based on Integer Linear Programming (ILP), and a task allocation method based on task scheduling without considering the property of the DTTR scheme. The experimental results show that task allocation by the proposed method results in nearly the same average failure rate by the SA based method with shorter task allocation time.

  • Evaluation of Checkpointing Mechanism on SCore Cluster System

    Masaaki KONDO  Takuro HAYASHIDA  Masashi IMAI  Hiroshi NAKAMURA  Takashi NANYA  Atsushi HORI  

     
    PAPER-Dependable Software

      Vol:
    E86-D No:12
      Page(s):
    2553-2562

    Cluster systems are getting widely used because of good performance / cost ratio. However, their reliability has not been well discussed in practical environment so far. As the number of commodity components in a cluster system gets increased, it is indispensable to support reliability by system software. SCore cluster system software is a parallel programming environment for High Performance Computing (HPC). SCore provides checkpointing and rollback-recovery mechanism for high availability. In this paper, we analyze and evaluate the checkpointing and rollback-recovery mechanisms of SCore quantitively. The experimental results reveal that the required time for checkpointing scales very well in respect to the number of computing nodes. However, the required time is quite long due to the low effective network bandwidth. Based on the results, we modify SCore and successfully make checkpointing and recovery 1.8 2.8 times and 3.7 5.0 times faster respectively. This is very helpful for cluster systems to achieve high performance and high availability.

  • Fault Diagnosis and Reconfiguration Method for Network-on-Chip Based Multiple Processor Systems with Restricted Private Memories

    Masashi IMAI  Tomohiro YONEDA  

     
    PAPER

      Vol:
    E96-D No:9
      Page(s):
    1914-1925

    We propose a fault diagnosis and reconfiguration method based on the Pair and Swap scheme to improve the reliability and the MTTF (Mean Time To Failure) of network-on-chip based multiple processor systems where each processor core has its private memory. In the proposed scheme, two identical copies of a given task are executed on a pair of processor cores and the results are compared repeatedly in order to detect processor faults. If a fault is detected by mismatches, the fault is identified and isolated using a TMR (Triple Module Redundancy) and the system is reconfigured by the redundant processor cores. We propose that each task is quadruplicated and statically assigned to private memories so that each memory has only two different tasks. We evaluate the reliability of the proposed quadruplicated task allocation scheme in the viewpoint of MTTF. As a result, the MTTF of the proposed scheme is over 4.3 times longer than that of the duplicated task allocation scheme.

  • A Cascade ALU Architecture for Asynchronous Super-Scalar Processors

    Motokazu OZAWA  Masashi IMAI  Yoichiro UENO  Hiroshi NAKAMURA  Takashi NANYA  

     
    PAPER

      Vol:
    E84-C No:2
      Page(s):
    229-237

    Wire delays, instead of gate delays, are moving into dominance in modern VLSI design. Current synchronous processors have the critical path not in the ALU function but in the cache access. Since the cache performance enhancement is limited by the memory access delay which mainly consists of wire delays, a reduction in gate delays may no longer imply any enhancement in processor performance. To solve this problem, this paper presents a novel architecture, called the Cascade ALU. The Cascade ALU allows super-scalar processors with future technologies to move the critical path into the ALU part. Therefore the Cascade ALU can enjoy the expected progress in future device speed. Since the delay of the Cascade ALU varies depending on the executed instructions, an asynchronous system is shown to be suitable for implementing the Cascade ALU. However an asynchronous system may have a large handshake overhead, this paper also presents an asynchronous Fine Grain Pipeline technique that hides the handshake overhead. Finally, this paper presents results of performance and area evaluation for an asynchronous implementation of the cascade ALU. The results show that the cascade ALU architecture has a good performance scalability on the reduction of the ALU latency and imposes little area penalty compared with current synchronous processors.

  • Novel Implementation Method of Multiple-Way Asynchronous Arbiters

    Masashi IMAI  Tomohiro YONEDA  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E98-A No:7
      Page(s):
    1519-1528

    Multiple-way (N-way) asynchronous arbitration is an important issue in asynchronous system design. In this paper, novel implementation methods of N-way asynchronous arbiters are presented. We first present N-way rectangle mesh arbiters using 2-way mutual exclusion elements. Then, N-way token-ring arbiters based on the non-return-to-zero signaling is also presented. The former can issue grant signals with the same percentage for all the arriving request signals while the latency is proportional to the number of inputs. The latter can achieve low latency and low energy arbitration for a heavy workload environment and a large number of inputs. In this paper, we compare their performances using the 28nm FD-SOI process technologies qualitatively and quantitatively.