IEICE global.ieice.org Site

Author Search Result

[Author] Kazuaki MURAKAMI(33hit)

1-20hit(33hit)

Architectural-Level Soft-Error Modeling for Estimating Reliability of Computer Systems
Makoto SUGIHARA Tohru ISHIHARA Kazuaki MURAKAMI

PAPER-VLSI Design Technology

Vol: E90-C No: 10, Page(s): 1983-1991
This paper proposes a soft-error model for accurately estimating the reliability of a computer system at the architectural level within reasonable computation time. The architectural-level soft-error model identifies which parts of the memory modules are utilized temporally and spatially and which single event upsets (SEUs) are critical to the program execution of the computer system at the cycle-accurate instruction set simulation (ISS) level. The soft-error model is capable of estimating the reliability of a computer system that has several memory hierarchies and of finding which memory module in the computer system is vulnerable. Reliability estimation helps system designers apply reliable design techniques to the vulnerable parts of their design. The experimental results show that the soft-error model achieves more accurate reliability estimation than conventional approaches. They also demonstrate that the reliability of computer systems depends not only on the soft error rates (SERs) of memories but also on the behavior of the software running on them.
Tradeoffs in Processor Design for Superscalar Architectures
Kazuaki MURAKAMI Morihiro KUGA Oubong GWUN Shinji TOMITA

PAPER-Computer Systems

Vol: E74-D No: 11, Page(s): 3883-3893
Superscalar processors can improve uniprocessor performance beyond RISC performance by exploiting spatial instruction-level parallelism. Superscalar processor design presents more opportunities for tradeoffs than conventional RISC design. In order to utilize the processor resources augmented by the superscalar approaches, processors must be carefully designed and implemented. This paper examines the various aspects of superscalar processors and discusses the design features and tradeoffs. Specific aspects of superscalar processors that are examined include: instruction fetch boundary, instruction-cache line crossing, branch prediction, data-hazard resolution, control-hazard resolution, and precise or imprecise interrupts. This paper uses a superscalar simulator that models a DDU (Dynamically-hazard-resolved, Dynamic-code-scheduled, Uniform) superscalar architecture, called SIMP (Single Instruction stream/Multiple instruction Pipelining), and evaluates many different SIMP hardware organizations. This paper concludes that a superscalar processor can increase performance with five major hardware features: instruction aligning, branch prediction with a branch-target buffer, code scheduling, speculative execution with conditional mode, and imprecise interrupts. However, the first three functions are claimed to be performed by compilers rather than by hardware.
Relaxing Constraints due to Data and Control Dependences
Katsuhiko METSUGI Kazuaki MURAKAMI

PAPER-Computer Systems

Vol: E86-D No: 5, Page(s): 920-928
TLSP (Thread-Level Speculative Parallel processing) architecture is a growing processor architecture. The parallelism of a program executed on this architecture is governed by the combination of techniques that relax data dependences. In this paper, we evaluate the limits of parallelism of the TLSP architecture by using abstract machine models. We have three major results. First, when each technique that relaxes data dependences is used on its own, "renaming" has a large effect on the TLSP architecture. Second, combined use of "memory disambiguation" and "renaming" leads to huge parallelism. Third, constant effects are obtained by concurrent use of "value prediction" and other techniques.
Trends in High-Performance, Low-Power Processor Architectures
Kazuaki MURAKAMI Hidetaka MAGOSHI

PAPER

Vol: E84-C No: 2, Page(s): 131-138
This paper briefly surveys architectural technologies of recent or future high-performance, low-power processors for improving performance and power/energy consumption simultaneously. Achieving both high performance and low power at the same time imposes many challenges on processor design, and therefore gives us many opportunities for devising new technologies. The paper also tries to provide some insights into the direction of the technology in the future.
Reducing On-Chip DRAM Energy via Data Transfer Size Optimization
Takatsugu ONO Koji INOUE Kazuaki MURAKAMI Kenji YOSHIDA

PAPER

Vol: E92-C No: 4, Page(s): 433-443
This paper proposes a software-controllable variable line-size (SC-VLS) cache architecture for low-power embedded systems. High bandwidth between logic and a DRAM is realized by means of advanced integration technology. System-in-Silicon is one of the architectural frameworks that realize the high bandwidth. An ASIC and a specific SRAM are mounted onto a silicon interposer. Each chip is connected to the silicon interposer by eutectic solder bumps. In this framework, it is important to reduce the DRAM energy consumption. The specific DRAM needs a small cache memory to improve the performance. We exploit the cache to reduce the DRAM energy consumption. During application program execution, the adequate cache line size, that is, the one that produces the lowest cache miss ratio, varies because the amount of spatial locality of memory references changes. If we employ a large cache line size, we can expect a prefetching effect. However, the DRAM energy consumption is larger than with a small line size because a large number of banks are accessed. The SC-VLS cache is able to change the line size to an adequate one at runtime with small area and power overheads. We analyze the adequate line size and insert line-size change instructions at the beginning of each function of a target program before executing the program. In our evaluation, the SC-VLS cache reduces the DRAM energy consumption by up to 88% compared to a conventional cache with fixed 256 B lines.
Identifying Processor Bottlenecks in Virtual Machine Based Execution of Java Bytecode
Pradeep RAO Kazuaki MURAKAMI

PAPER

Vol: E92-C No: 10, Page(s): 1265-1275
Despite the prevalence of Java workloads across a variety of processor architectures, there is very little published data on the impact of the various processor design decisions on Java performance. We attribute the lack of data to the large design space resulting from the complexity of the modern superscalar processor and the additional complexities associated with executing Java bytecode using a virtual machine. To address this shortcoming, we use a statistically rigorous methodology to systematically quantify the impact of the various processor microarchitecture parameters on Java execution performance. The adopted methodology enables efficient screening of significant factor effects in a large design space consisting of 35 factors (32-billion potential configurations) using merely 72 observations per benchmark application. We quantify and tabulate the significance of each of the 35 factors for 13 benchmark applications. While these tables provide various insights into Java performance, they consistently highlight the performance significance of the instruction delivery mechanism, especially the instruction cache and the ITLB design parameters. Furthermore, these tables enable the architect to identify processor bottlenecks for Java workloads by providing an estimate of the relative impact of various design decisions.
A Next-Generation Enterprise Server System with Advanced Cache Coherence Chips
Mariko SAKAMOTO Akira KATSUNO Go SUGIZAKI Toshio YOSHIDA Aiichiro INOUE Koji INOUE Kazuaki MURAKAMI

PAPER-VLSI Architecture for Communication/Server Systems

Vol: E90-C No: 10, Page(s): 1972-1982
Broadcast and synchronization techniques are used for cache coherence control in conventional larger-scale snoop-based SMP systems. The penalty for synchronization is directly proportional to system size. Meanwhile, advances in LSI technology now enable placing a memory controller on a CPU die. The latency to access directly linked memory is drastically reduced by an on-die controller. Developing an enterprise server system with these CPUs gives us an opportunity to achieve higher performance. Because the synchronization penalty is incurred whenever a cache miss occurs, it is necessary to improve the coherence method to receive the full benefit of this effect. In this paper, we demonstrate a coherence directory organization that fits into DSM enterprise server systems. Originally, a directory-based method was adopted in high-performance computing systems because of its huge scalability in comparison with the snoop-based method. Though directory capacity misses and long directory access latency are the major problems of this method, the relaxed scalability requirement of enterprise servers, along with advanced LSI technology, helps us solve these problems. Our proposed directory solves both problems by implementing a full bit vector level map of the coherence directory on an LSI chip. Our experimental results validate that a system controlled by our proposed directory can surpass a snoop-based system in performance even without applying data localization optimization to an online transaction processing (OLTP) workload.
Cell Library Development Methodology for Throughput Enhancement of Character Projection Equipment
Makoto SUGIHARA Taiga TAKATA Kenta NAKAMURA Ryoichi INANAMI Hiroaki HAYASHI Katsumi KISHIMOTO Tetsuya HASEBE Yukihiro KAWANO Yusuke MATSUNAGA Kazuaki MURAKAMI Katsuya OKUMURA

PAPER-CAD

Vol: E89-C No: 3, Page(s): 377-383
We propose a cell library development methodology for throughput enhancement of character projection equipment. First, an ILP (Integer Linear Programming)-based cell selection is proposed for equipment in which both the CP (Character Projection) and VSB (Variable Shaped Beam) methods are available, in order to minimize the number of electron beam (EB) shots, that is, the time to fabricate chips. Second, the influence of cell directions on the area and delay time of chips is examined. The examination helps to reduce the number of EB shots with only a little deterioration of area and delay time because unnecessary cell directions can be removed. Finally, a case study is presented in which the numbers of EB shots are shown for several cases.
A Reconfigurable Data-Path Accelerator Based on Single Flux Quantum Circuits (Open Access)
Hiroshi KATAOKA Hiroaki HONDA Farhad MEHDIPOUR Nobuyuki YOSHIKAWA Akira FUJIMAKI Hiroyuki AKAIKE Naofumi TAKAGI Kazuaki MURAKAMI

INVITED PAPER

Vol: E97-C No: 3, Page(s): 141-148
The single flux quantum (SFQ) is expected to be a next-generation high-speed and low-power technology in the field of logic circuits. CMOS as the dominant technology for conventional processors cannot be replaced with SFQ technology due to the difficulty of implementing feedback loops and conditional branches using SFQ circuits. This paper investigates the applicability of a reconfigurable data-path (RDP) accelerator based on SFQ circuits. The authors introduce detailed specifications of the SFQ-RDP architecture and compare its performance and power/performance ratio with those of a graphics-processing unit (GPU). The results show at most 1600 times higher efficiency in terms of Flops/W (floating-point operations per second/Watt) for some high-performance computing application programs.
A Reconfigurable Functional Unit with Conditional Execution for Multi-Exit Custom Instructions
Hamid NOORI Farhad MEHDIPOUR Koji INOUE Kazuaki MURAKAMI

PAPER

Vol: E91-C No: 4, Page(s): 497-508
Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique to enhance the performance of embedded processors. However, the addition of custom functional units to the base processor is required to support the execution of these custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market and significant non-recurring engineering and design costs remain issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a matrix of functional units, which is multi-cycle and capable of conditional execution. A quantitative approach is utilized to propose an efficient architecture for the RFU and fix its constraints. To generate more effective custom instructions, they are extended over basic blocks, and hence multi-exit custom instructions are proposed. Conditional execution has been added to the RFU to support the multi-exit feature of custom instructions. Experimental results show that multi-exit custom instructions enhance the performance by an average of 67% compared to custom instructions limited to one basic block. A maximum speedup of 4.7 and an average speedup of 1.85, compared to a general embedded processor, were achieved on the MiBench benchmark suite.
Temperature-Aware Configurable Cache to Reduce Energy in Embedded Systems
Hamid NOORI Maziar GOUDARZI Koji INOUE Kazuaki MURAKAMI

PAPER

Vol: E91-C No: 4, Page(s): 418-431
Energy consumption is a major concern in embedded computing systems. Several studies have shown that cache memories account for 40% or more of the total energy consumed in these systems. Active power used to be the primary contributor to the total power dissipation of CMOS designs, but with technology scaling, the share of leakage in the total power consumption of digital systems continues to grow. Moreover, temperature is another factor that exponentially increases the leakage current. In this paper, we show the effect of temperature on the optimal (minimum-energy-consuming) cache configuration for low-energy embedded systems. Our results show that for a given application and technology, the optimal cache size moves toward smaller caches at higher temperatures, due to the larger leakage. Consequently, a Temperature-Aware Configurable Cache (TACC) is an effective way to save energy in finer technologies when the embedded system is used at different temperatures. Our results show that using a TACC, up to 61% of energy can be saved for the instruction cache and 77% for the data cache compared to a configurable cache that has been configured for only the corner-case temperature (100°C). Furthermore, the TACC also enhances the performance by up to 28% for the instruction cache and up to 17% for the data cache.
Proposal of a Desk-Side Supercomputer with Reconfigurable Data-Paths Using Rapid Single-Flux-Quantum Circuits
Naofumi TAKAGI Kazuaki MURAKAMI Akira FUJIMAKI Nobuyuki YOSHIKAWA Koji INOUE Hiroaki HONDA

INVITED PAPER

Vol: E91-C No: 3, Page(s): 350-355
We propose a desk-side supercomputer with large-scale reconfigurable data-paths (LSRDPs) using superconducting rapid single-flux-quantum (RSFQ) circuits. It has several computing units, each of which consists of a general-purpose microprocessor, an LSRDP and a memory. An LSRDP consists of many (e.g., a few thousand) floating-point units (FPUs) and operand routing networks (ORNs) which connect the FPUs. We reconfigure the LSRDP to fit a computation, i.e., a group of floating-point operations, which appears in a 'for' loop of numerical programs by setting the routes in the ORNs before the execution of the loop. We propose to implement the LSRDPs by RSFQ circuits. The processors and the memories can be implemented by semiconductor technology. We expect that a 10 TFLOPS supercomputer, as well as a refrigerating engine, will be housed in a desk-side rack, using a near-future RSFQ process technology, such as a 0.35 µm process.
Improving Performance and Energy Saving in a Reconfigurable Processor via Accelerating Control Data Flow Graphs
Farhad MEHDIPOUR Hamid NOORI Morteza SAHEB ZAMANI Koji INOUE Kazuaki MURAKAMI

PAPER-Reconfigurable Device and Design Tools

Vol: E90-D No: 12, Page(s): 1956-1966
Extracting frequently executed (hot) portions of the application and executing their corresponding data flow graphs (DFGs) on the hardware accelerator brings about more speedup and energy saving for embedded systems comprising a base processor integrated with a tightly coupled accelerator. Extending DFGs to support control instructions and using Control DFGs (CDFGs) instead of DFGs increases the portion of application code that is accelerated and hence yields more speedup and energy saving. In this paper, motivations for extending DFGs to CDFGs and handling control instructions are introduced. In addition, basic requirements for an accelerator with conditional execution support are proposed. Then, two algorithms are presented for temporal partitioning of CDFGs considering the architectural constraints of the target accelerator. To demonstrate the effectiveness of the proposed ideas, they are applied to the accelerator of a reconfigurable processor called AMBER. Experimental results confirm the remarkable effectiveness of covering control instructions and using CDFGs versus DFGs in terms of performance and energy reduction.
Technology Mapping Technique for Increasing Throughput of Character Projection Lithography
Makoto SUGIHARA Kenta NAKAMURA Yusuke MATSUNAGA Kazuaki MURAKAMI

PAPER-Lithography-Related Techniques

Vol: E90-C No: 5, Page(s): 1012-1020
Character projection (CP) lithography is utilized for maskless lithography and has potential for future photomask fabrication. The drawback of CP lithography is its low throughput, which leads to a price rise of IC devices. This paper discusses a technology mapping technique for enhancing the throughput of CP lithography. The number of electron beam (EB) shots needed to project an entire chip directly determines the fabrication time for the chip as well as the throughput of CP equipment. Our technology mapping technique maps EB shot count-effective cells to a circuit in order to increase the throughput of CP equipment. Our technique treats the number of EB shots as an objective to minimize. Compared with a conventional technology mapping, our technology mapping technique has achieved a 26.6% reduction in the number of EB shots for the front-end-of-the-line (FEOL) process without any performance degradation of ICs. Moreover, our technology mapping technique has achieved 54.6% fewer EB shots under no performance constraints. It is easy for both IC designers and equipment developers to adopt our technique because it is a software approach that requires no additional modification of CP equipment.
Reliable Cache Architectures and Task Scheduling for Multiprocessor Systems
Makoto SUGIHARA Tohru ISHIHARA Kazuaki MURAKAMI

PAPER

Vol: E91-C No: 4, Page(s): 410-417
This paper proposes a task scheduling approach for reliable cache architectures (RCAs) of multiprocessor systems. The RCAs dynamically switch their operation modes for reducing the usage of vulnerable SRAMs under real-time constraints. A mixed integer programming model has been built for minimizing vulnerability under real-time constraints. Experimental results have shown that our task scheduling approach achieved 47.7-99.9% less vulnerability than a conventional one.
Optimisations Techniques for the Automatic ISA Customisation Algorithm
Antoine TROUVE Kazuaki MURAKAMI

LETTER-Design Optimisation

Vol: E95-D No: 2, Page(s): 437-440
- Errata[Uploaded on February 1,2012]
This article introduces some improvements to the previously proposed selection of custom instruction candidates for the automatic ISA customisation problem targeting reconfigurable processors. It introduces new opportunities to prune the search space, and a technique based on dynamic programming to check the independence between groups. For the same inputs and outputs, the proposed new algorithm performs an order of magnitude fewer measured convexity checks than the related work.
Instruction Encoding for Reducing Power Consumption of I-ROMs Based on Execution Locality
Koji INOUE Vasily G. MOSHNYAGA Kazuaki MURAKAMI

PAPER

Vol: E86-A No: 4, Page(s): 799-805
In this paper, we propose an instruction encoding scheme to reduce the power consumption of instruction ROMs. The power consumption of an instruction ROM strongly depends on the switching activity of its bit-lines due to their large load capacitance. In our approach, the binary patterns to be assigned as op-codes are determined based on the frequency of instructions in order to reduce the number of bit-line discharges. Simulation results show that our approach can reduce bit-line switching by 40% compared with a conventional organization.
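As a rough illustration of the frequency-based encoding idea in the abstract above, the following Python sketch gives the op-codes that cause the fewest bit-line discharges to the most frequently executed instructions. It is not the authors' scheme: the function names `assign_opcodes` and `count_discharges`, the toy trace, and the assumption that every stored 0 bit discharges one bit-line on a read are all hypothetical and chosen only for this example.

```python
from collections import Counter


def zero_bits(code, width):
    """Number of 0 bits in a width-bit op-code (assumed cost: one discharge per 0)."""
    return width - bin(code).count("1")


def assign_opcodes(trace, width):
    """Give the cheapest op-codes to the most frequently executed instructions."""
    freq = Counter(trace)
    if len(freq) > 2 ** width:
        raise ValueError("op-code width too small for the instruction set")
    # Candidate codes, cheapest first (ties broken by numeric value).
    codes = sorted(range(2 ** width), key=lambda c: (zero_bits(c, width), c))
    # Instructions, most frequent first (ties broken by name).
    ranked = sorted(freq, key=lambda name: (-freq[name], name))
    return dict(zip(ranked, codes))


def count_discharges(trace, encoding, width):
    """Total assumed bit-line discharges when fetching every instruction in the trace."""
    return sum(zero_bits(encoding[name], width) for name in trace)


# Hypothetical execution trace: 'add' dominates, 'jmp' is rare.
trace = ["add"] * 5 + ["load"] * 3 + ["jmp"]
naive = {"add": 0b00, "load": 0b01, "jmp": 0b10}
tuned = assign_opcodes(trace, width=2)
print(count_discharges(trace, naive, 2), count_discharges(trace, tuned, 2))
```

Under the stated cost assumption, the frequency-aware assignment gives `add` the all-ones code, so the most common fetch discharges no bit-lines in this toy model.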
Test Architecture Optimization for System-on-a-Chip under Floorplanning Constraints
Makoto SUGIHARA Kazuaki MURAKAMI Yusuke MATSUNAGA

PAPER-Test

Vol: E87-A No: 12, Page(s): 3174-3184
In this paper, a test architecture optimization for system-on-a-chip under floorplanning constraints is proposed. The models of previous test architecture optimizations were too idealized to be applied to industrial SOCs. To make matters worse, they could not treat the topological locality of cores, that is, floorplanning constraints. The optimization proposed in this paper can avoid long wires for TAMs by taking floorplanning constraints into consideration and can finish optimizing test architectures within reasonable computation time.
Character Projection Mask Set Optimization for Enhancing Throughput of MCC Projection Systems
Makoto SUGIHARA Yusuke MATSUNAGA Kazuaki MURAKAMI

PAPER-Physical Level Design

Vol: E91-A No: 12, Page(s): 3451-3460
Character projection (CP) lithography is utilized for maskless lithography and has potential for future photomask manufacture because it can project ICs much faster than point beam projection or variable-shaped beam (VSB) projection. In this paper, we first present a projection mask set development methodology for multi-column-cell (MCC) systems, in which column-cells can project patterns in parallel with the CP and VSB lithographies. Next, we present an INLP (integer nonlinear programming) model as well as an ILP (integer linear programming) model for optimizing a CP mask set of an MCC projection system so that projection time is reduced. The experimental results show that our optimization has achieved 33.4% less projection time in the best case than a naive CP mask development approach. The experimental results indicate that our CP mask set optimization method has virtually increased the cell pattern objects on CP masks and has decreased VSB projection, so that it has achieved higher projection throughput than just parallelizing two column-cells with conventional CP masks.
FOREWORD
Kazuaki MURAKAMI

FOREWORD

Vol: E81-C No: 9, Page(s): 1373-1373
