The search functionality is under construction.

Author Search Result

[Author] Kazuaki MURAKAMI(33hit)

1-20hit(33hit)

  • Architectural-Level Soft-Error Modeling for Estimating Reliability of Computer Systems

    Makoto SUGIHARA  Tohru ISHIHARA  Kazuaki MURAKAMI  

     
    PAPER-VLSI Design Technology

      Vol:
    E90-C No:10
      Page(s):
    1983-1991

    This paper proposes a soft-error model for accurately estimating reliability of a computer system at the architectural level within reasonable computation time. The architectural-level soft-error model identifies which part of memory modules are utilized temporally and spatially and which single event upsets (SEUs) are critical to the program execution of the computer system at the cycle accurate instruction set simulation (ISS) level. The soft-error model is capable of estimating reliability of a computer system that has several memory hierarchies with it and finding which memory module is vulnerable in the computer system. Reliability estimation helps system designers apply reliable design techniques to vulnerable part of their design. The experimental results have shown that the usage of the soft-error model achieved more accurate reliability estimation than conventional approaches. The experimental results demonstrate that reliability of computer systems depends on not only soft error rates (SERs) of memories but also the behavior of software running in computer systems.

  • Tradeoffs in Processor Design for Superscalar Architectures

    Kazuaki MURAKAMI  Morihiro KUGA  Oubong GWUN  Shinji TOMITA  

     
    PAPER-Computer Systems

      Vol:
    E74-D No:11
      Page(s):
    3883-3893

    Superscalar processors can improve uniprocessor performance further byond RISC performance by exploiting spatial instruction-level parallelism. Superscalar processor design presents more opportunities for tradeoffs than conventional RISC design. In order to utilize processor resources augmented by the superscalar approaches, processors must be carefully designed and implemented. This paper examines the various aspects of superscalar processors and discusses the design features and tradeoffs. Specific aspects of superscalar processors that are examined include: instruction fetch boundary, instruction-cache line crossing, branch prediction, data-hazard resolution, control-hazard resolution, and precise or imprecise interrupts. This paper uses a superscalar simulator that modeled a DDU (Dynamically-hazard-resolved, Dynamic-code-scheduled, Uniform) superscalar architecture, called SIMP (Single Instructions stream/Multiple instruction Pipelining), and evaluate many different SIMP hardware organizations. This paper concludes that a superscalar processor can increase the performance with major five hardwary features: instruction aligning, branch prediction with branch-target buffer, code scheduling, speculative execution with conditional mode, and imprecise interrupts. However, the first three functions are claimed to be performed by compilers rather than by hardware.

  • Relaxing Constraints due to Data and Control Dependences

    Katsuhiko METSUGI  Kazuaki MURAKAMI  

     
    PAPER-Computer Systems

      Vol:
    E86-D No:5
      Page(s):
    920-928

    TLSP (Thread-Level Speculative Parallel processing) architecture is a growing processor architecture. Parallelism of a program executed on this architecture is ruled by the combination of techniques which relax data dependences. In this paper, we evaluate the limits of parallelism of the TLSP architecture by using abstract machine models. We have three major results. First, if we use solely each technique which relaxes data dependences, "renaming" has a large effect on the TLSP architecture. Second, combinatorial use of "memory disambiguation" and "renaming" leads to huge parallelism. Third, constant effects are obtained by concurrent use of "value prediction" and other techniques.

  • Trends in High-Performance, Low-Power Processor Architectures

    Kazuaki MURAKAMI  Hidetaka MAGOSHI  

     
    PAPER

      Vol:
    E84-C No:2
      Page(s):
    131-138

    This paper briefly surveys architectural technologies of recent or future high-performance, low-power processors for improving the performance and power/energy consumption simultaneously. Achieving both high performance and low power at the same time imposes a lot of challenges on processor design, and therefore gives us a lot of opportunities for devising new technologies. The paper also tries to provide some insights into the technology direction in future.

  • Reducing On-Chip DRAM Energy via Data Transfer Size Optimization

    Takatsugu ONO  Koji INOUE  Kazuaki MURAKAMI  Kenji YOSHIDA  

     
    PAPER

      Vol:
    E92-C No:4
      Page(s):
    433-443

    This paper proposes a software-controllable variable line-size (SC-VLS) cache architecture for low power embedded systems. High bandwidth between logic and a DRAM is realized by means of advanced integrated technology. System-in-Silicon is one of the architectural frameworks to realize the high bandwidth. An ASIC and a specific SRAM are mounted onto a silicon interposer. Each chip is connected to the silicon interposer by eutectic solder bumps. In the framework, it is important to reduce the DRAM energy consumption. The specific DRAM needs a small cache memory to improve the performance. We exploit the cache to reduce the DRAM energy consumption. During application program executions, an adequate cache line size which produces the lowest cache miss ratio is varied because the amount of spatial locality of memory references changes. If we employ a large cache line size, we can expect the effect of prefetching. However, the DRAM energy consumption is larger than a small line size because of the huge number of banks are accessed. The SC-VLS cache is able to change a line size to an adequate one at runtime with a small area and power overheads. We analyze the adequate line size and insert line size change instructions at the beginning of each function of a target program before executing the program. In our evaluation, it is observed that the SC-VLS cache reduces the DRAM energy consumption up to 88%, compared to a conventional cache with fixed 256 B lines.

  • Identifying Processor Bottlenecks in Virtual Machine Based Execution of Java Bytecode

    Pradeep RAO  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E92-C No:10
      Page(s):
    1265-1275

    Despite the prevalence of Java workloads across a variety of processor architectures, there is very little published data on the impact of the various processor design decisions on Java performance. We attribute the lack of data to the large design space resulting from the complexity of the modern superscalar processor and the additional complexities associated with executing Java bytecode using a virtual machine. To address this shortcoming, we use a statistically rigorous methodology to systematically quantify the the impact of the various processor microarchitecture parameters on Java execution performance. The adopted methodology enables efficient screening of significant factor effects in a large design space consisting of 35 factors (32-billion potential configurations) using merely 72 observations per benchmark application. We quantify and tabulate the significance of each of the 35 factors for 13 benchmark applications. While these tables provide various insights into Java performance, they consistently highlight the performance significance of the instruction delivery mechanism, especially the instruction cache and the ITLB design parameters. Furthermore, these tables enable the architect to identify processor bottlenecks for Java workloads by providing an estimate of the relative impact of various design decisions.

  • A Next-Generation Enterprise Server System with Advanced Cache Coherence Chips

    Mariko SAKAMOTO  Akira KATSUNO  Go SUGIZAKI  Toshio YOSHIDA  Aiichiro INOUE  Koji INOUE  Kazuaki MURAKAMI  

     
    PAPER-VLSI Architecture for Communication/Server Systems

      Vol:
    E90-C No:10
      Page(s):
    1972-1982

    Broadcast and synchronization techniques are used for cache coherence control in conventional larger scale snoop-based SMP systems. The penalty for synchronization is directly proportional to system size. Meanwhile, advances in LSI technology now enable placing a memory controller on a CPU die. The latency to access directly linked memory is drastically reduced by an on-die controller. Developing an enterprise server system with these CPUs allows us an opportunity to achieve higher performance. Though the penalty of synchronization is counted whenever a cache miss occurs, it is necessary to improve the coherence method to receive the full benefit of this effect. In this paper, we demonstrate a coherence directory organization that fits into DSM enterprise server systems. Originally, a directory-based method was adopted in high performance computing systems because of its huge scalability in comparison with snoop-based method. Though directory capacity miss and long directory access latency are the major problems of this method, the relaxed scalability requirement of enterprise servers is advantageous to us to solve these problems along with an advanced LSI technology. Our proposed directory solves both problems by implementing a full bit vector level map of the coherence directory on an LSI chip. Our experimental results validate that a system controlled by our proposed directory can surpass a snoop-based system in performance even without applying data localization optimization to an online transaction processing (OLTP) workload.

  • Cell Library Development Methodology for Throughput Enhancement of Character Projection Equipment

    Makoto SUGIHARA  Taiga TAKATA  Kenta NAKAMURA  Ryoichi INANAMI  Hiroaki HAYASHI  Katsumi KISHIMOTO  Tetsuya HASEBE  Yukihiro KAWANO  Yusuke MATSUNAGA  Kazuaki MURAKAMI  Katsuya OKUMURA  

     
    PAPER-CAD

      Vol:
    E89-C No:3
      Page(s):
    377-383

    We propose a cell library development methodology for throughput enhancement of character projection equipment. First, an ILP (Integer Linear Programming)-based cell selection is proposed for the equipment for which both of the CP (Character Projection) and VSB (Variable Shaped Beam) methods are available, in order to minimize the number of electron beam (EB) shots, that is, time to fabricate chips. Secondly, the influence of cell directions on area and delay time of chips is examined. The examination helps to reduce the number of EB shots with a little deterioration of area and delay time because unnecessary directions of cells can be removed. Finally, a case study is shown in which the numbers of EB shots are shown for several cases.

  • A Reconfigurable Data-Path Accelerator Based on Single Flux Quantum Circuits Open Access

    Hiroshi KATAOKA  Hiroaki HONDA  Farhad MEHDIPOUR  Nobuyuki YOSHIKAWA  Akira FUJIMAKI  Hiroyuki AKAIKE  Naofumi TAKAGI  Kazuaki MURAKAMI  

     
    INVITED PAPER

      Vol:
    E97-C No:3
      Page(s):
    141-148

    The single flux quantum (SFQ) is expected to be a next-generation high-speed and low-power technology in the field of logic circuits. CMOS as the dominant technology for conventional processors cannot be replaced with SFQ technology due to the difficulty of implementing feedback loops and conditional branches using SFQ circuits. This paper investigates the applicability of a reconfigurable data-path (RDP) accelerator based on SFQ circuits. The authors introduce detailed specifications of the SFQ-RDP architecture and compare its performance and power/performance ratio with those of a graphics-processing unit (GPU). The results show at most 1600 times higher efficiency in terms of Flops/W (floating-point operations per second/Watt) for some high-performance computing application programs.

  • A Reconfigurable Functional Unit with Conditional Execution for Multi-Exit Custom Instructions

    Hamid NOORI  Farhad MEHDIPOUR  Koji INOUE  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    497-508

    Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique to enhance the performance of embedded processors. However, the addition of custom functional units to the base processor is required to support the execution of these custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market, significant non-recurring engineering and design costs are issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip-fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a matrix of functional units which is multi-cycle with the capability of conditional execution. A quantitative approach is utilized to propose an efficient architecture for the RFU and fix its constraints. To generate more effective custom instructions, they are extended over basic blocks and hence, multiple exits custom instructions are proposed. Conditional execution has been added to the RFU to support the multi-exit feature of custom instructions. Experimental results show that multi-exit custom instructions enhance the performance by an average of 67% compared to custom instructions limited to one basic block. A maximum speedup of 4.7, compared to a general embedded processor, and an average speedup of 1.85 was achieved on MiBench benchmark suite.

  • Temperature-Aware Configurable Cache to Reduce Energy in Embedded Systems

    Hamid NOORI  Maziar GOUDARZI  Koji INOUE  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    418-431

    Energy consumption is a major concern in embedded computing systems. Several studies have shown that cache memories account for 40% or more of the total energy consumed in these systems. Active power used to be the primary contributor to total power dissipation of CMOS designs, but with the technology scaling, the share of leakage in total power consumption of digital systems continues to grow. Moreover, temperature is another factor that exponentially increases the leakage current. In this paper, we show the effect of temperature on the optimal (minimum-energy-consuming) cache configuration for low energy embedded systems. Our results show that for a given application and technology, the optimal cache size moves toward smaller caches at higher temperatures, due to the larger leakage. Consequently, a Temperature-Aware Configurable Cache (TACC) is an effective way to save energy in finer technologies when the embedded system is used in different temperatures. Our results show that using a TACC, up to 61% energy can be saved for instruction cache and 77% for data cache compared to a configurable cache that has been configured for only the corner-case temperature (100). Furthermore, the TACC also enhances the performance by up to 28% for the instruction cache and up to 17% for the data cache.

  • Proposal of a Desk-Side Supercomputer with Reconfigurable Data-Paths Using Rapid Single-Flux-Quantum Circuits

    Naofumi TAKAGI  Kazuaki MURAKAMI  Akira FUJIMAKI  Nobuyuki YOSHIKAWA  Koji INOUE  Hiroaki HONDA  

     
    INVITED PAPER

      Vol:
    E91-C No:3
      Page(s):
    350-355

    We propose a desk-side supercomputer with large-scale reconfigurable data-paths (LSRDPs) using superconducting rapid single-flux-quantum (RSFQ) circuits. It has several sets of computing unit which consists of a general-purpose microprocessor, an LSRDP and a memory. An LSRDP consists of a lot of, e.g., a few thousand, floating-point units (FPUs) and operand routing networks (ORNs) which connect the FPUs. We reconfigure the LSRDP to fit a computation, i.e., a group of floating-point operations, which appears in a 'for' loop of numerical programs by setting the route in ORNs before the execution of the loop. We propose to implement the LSRDPs by RSFQ circuits. The processors and the memories can be implemented by semiconductor technology. We expect that a 10 TFLOPS supercomputer, as well as a refrigerating engine, will be housed in a desk-side rack, using a near-future RSFQ process technology, such as 0.35 µm process.

  • Improving Performance and Energy Saving in a Reconfigurable Processor via Accelerating Control Data Flow Graphs

    Farhad MEHDIPOUR  Hamid NOORI  Morteza SAHEB ZAMANI  Koji INOUE  Kazuaki MURAKAMI  

     
    PAPER-Reconfigurable Device and Design Tools

      Vol:
    E90-D No:12
      Page(s):
    1956-1966

    Extracting frequently executed (hot) portions of the application and executing their corresponding data flow graph (DFG) on the hardware accelerator brings about more speedup and energy saving for embedded systems comprising a base processor integrated with a tightly coupled accelerator. Extending DFGs to support control instructions and using Control DFGs (CDFGs) instead of DFGs results in more coverage of application code portion are being accelerated hence, more speedup and energy saving. In this paper, motivations for extending DFGs to CDFGs and handling control instructions are introduced. In addition, basic requirements for an accelerator with conditional execution support are proposed. Then, two algorithms are presented for temporal partitioning of CDFGs considering the target accelerator architectural constraints. To demonstrate effectiveness of the proposed ideas, they are applied to the accelerator of a reconfigurable processor called AMBER. Experimental results approve the remarkable effectiveness of covering control instructions and using CDFGs versus DFGs in the aspects of performance and energy reduction.

  • Technology Mapping Technique for Increasing Throughput of Character Projection Lithography

    Makoto SUGIHARA  Kenta NAKAMURA  Yusuke MATSUNAGA  Kazuaki MURAKAMI  

     
    PAPER-Lithography-Related Techniques

      Vol:
    E90-C No:5
      Page(s):
    1012-1020

    The character projection (CP) lithography is utilized for maskless lithography and is a potential for the future photomask fabrication. The drawback of the CP lithography is its low throughput and leads to a price rise of IC devices. This paper discusses a technology mapping technique for enhancing the throughput of the CP lithography. The number of electron beam (EB) shots to project an entire chip directly determines the fabrication time for the chip as well as the throughput of CP equipment. Our technology mapping technique maps EB shot count-effective cells to a circuit in order to increase the throughput of CP equipment. Our technique treats the number of EB shots as an objective to minimize. Comparing with a conventional technology mapping, our technology mapping technique has achieved 26.6% reduction of the number of EB shots for the front-end-of-the-line (FEOL) process without any performance degradation of ICs. Moreover, our technology mapping technique has achieved a 54.6% less number of EB shots under no performance constraints. It is easy for both IC designers and equipment developers to adopt our technique because our technique is a software approach with no additional modification on CP equipment.

  • Reliable Cache Architectures and Task Scheduling for Multiprocessor Systems

    Makoto SUGIHARA  Tohru ISHIHARA  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    410-417

    This paper proposes a task scheduling approach for reliable cache architectures (RCAs) of multiprocessor systems. The RCAs dynamically switch their operation modes for reducing the usage of vulnerable SRAMs under real-time constraints. A mixed integer programming model has been built for minimizing vulnerability under real-time constraints. Experimental results have shown that our task scheduling approach achieved 47.7-99.9% less vulnerability than a conventional one.

  • Optimisations Techniques for the Automatic ISA Customisation Algorithm

    Antoine TROUVE  Kazuaki MURAKAMI  

     
    LETTER-Design Optimisation

      Vol:
    E95-D No:2
      Page(s):
    437-440

    This article introduces some improvements to the already proposed custom instruction candidates selection for the automatic ISA customisation problem targeting reconfigurable processors. It introduces new opportunities to prune the search space, and a technique based on dynamic programming to check the independence between groups. The proposed new algorithm yields one order less measured number of convexity checks than the related work for the same inputs and outputs.

  • Instruction Encoding for Reducing Power Consumption of I-ROMs Based on Execution Locality

    Koji INOUE  Vasily G. MOSHNYAGA  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E86-A No:4
      Page(s):
    799-805

    In this paper, we propose an instruction encoding scheme to reduce power consumption of instruction ROMs. The power consumption of the instruction ROM strongly depends on the switching activity of bit-lines due to their large load capacitance. In our approach, the binary-patterns to be assigned as op-codes are determined based on the frequency of instructions in order to reduce the number of bit-line dis-charging. Simulation results show that our approach can reduce 40% of bit-line switchings from a conventional organization.

  • Test Architecture Optimization for System-on-a-Chip under Floorplanning Constraints

    Makoto SUGIHARA  Kazuaki MURAKAMI  Yusuke MATSUNAGA  

     
    PAPER-Test

      Vol:
    E87-A No:12
      Page(s):
    3174-3184

    In this paper, a test architecture optimization for system-on-a-chip under floorplanning constraints is proposed. The models of previous test architecture optimizations were too ideal to be applied to industrial SOCs. To make matters worse, they couldn't treat topological locality of cores, that is, floorplanning constraints. The optimization proposed in this paper can avoid long wires for TAMs in consideration of floorplanning constraints and finish optimizing test architectures within reasonable computation time.

  • Character Projection Mask Set Optimization for Enhancing Throughput of MCC Projection Systems

    Makoto SUGIHARA  Yusuke MATSUNAGA  Kazuaki MURAKAMI  

     
    PAPER-Physical Level Design

      Vol:
    E91-A No:12
      Page(s):
    3451-3460

    Character projection (CP) lithography is utilized for maskless lithography and is a potential for the future photomask manufacture because it can project ICs much faster than point beam projection or variable-shaped beam (VSB) projection. In this paper, we first present a projection mask set development methodology for multi-column-cell (MCC) systems, in which column-cells can project patterns in parallel with the CP and VSB lithographies. Next, we present an INLP (integer nonlinear programming) model as well as an ILP (integer linear programming) model for optimizing a CP mask set of an MCC projection system so that projection time is reduced. The experimental results show that our optimization has achieved 33.4% less projection time in the best case than a naive CP mask development approach. The experimental results indicate that our CP mask set optimization method has virtually increased cell pattern objects on CP masks and has decreased VSB projection so that it has achieved higher projection throughput than just parallelizing two column-cells with conventional CP masks.

  • FOREWORD

    Kazuaki MURAKAMI  

     
    FOREWORD

      Vol:
    E81-C No:9
      Page(s):
    1373-1373
1-20hit(33hit)