The search functionality is under construction.

Keyword Search Result

[Keyword] microarchitecture(10hit)

1-10hit
  • An Inductive Method to Select Simulation Points

    MinSeong CHOI  Takashi FUKUDA  Masahiro GOSHIMA  Shuichi SAKAI  

     
    PAPER-Architecture

      Pubricized:
    2016/08/24
      Vol:
    E99-D No:12
      Page(s):
    2891-2900

    The time taken for processor simulation can be drastically reduced by selecting simulation points, which are dynamic sections obtained from the simulation result of processors. The overall behavior of the program can be estimated by simulating only these sections. The existing methods to select simulation points, such as SimPoint, used for selecting simulation points are deductive and based on the idea that dynamic sections executing the same static section of the program are of the same phase. However, there are counterexamples for this idea. This paper proposes an inductive method, which selects simulation points from the results obtained by pre-simulating several processors with distinctive microarchitectures, based on assumption that sections in which all the distinctive processors have similar istructions per cycle (IPC) values are of the same phase. We evaluated the first 100G instructions of SPEC 2006 programs. Our method achieved an IPC estimation error of approximately 0.1% by simulating approximately 0.05% of the 100G instructions.

  • Using Cacheline Reuse Characteristics for Prefetcher Throttling

    Hidetsugu IRIE  Takefumi MIYOSHI  Goki HONJO  Kei HIRAKI  Tsutomu YOSHINAGA  

     
    PAPER-Computer Architecture

      Vol:
    E95-D No:12
      Page(s):
    2928-2938

    One of the significant issues of processor architecture is to overcome memory latency. Prefetching can greatly improve cache performance, but it has the drawback of cache pollution, unless its aggressiveness is properly set. Several techniques that have been proposed for prefetcher throttling use accuracy as a metric, but their robustness were not sufficient because of the variations in programs' working set sizes and cache capacities. In this study, we revisit prefetcher throttling from the viewpoint of data lifetime. Exploiting the characteristics of cache line reuse, we propose Cache-Convection-Control-based Prefetch Optimization Plus (CCCPO+), which enhances the feedback algorithm of our previous CCCPO. Evaluation results showed that this novel approach achieved a 30% improvement over no prefetching in the geometric mean of the SPEC CPU 2006 benchmark suite with 256 KB LLC, 1.8% over the latest prefetcher throttling, and 0.5% over our previous CCCPO. Moreover, it showed superior stability compared to related works, while lowering the hardware cost.

  • Communication Synthesis for Interconnect Minimization Targeting Distributed Register-File Microarchitecture

    Juinn-Dar HUANG  Chia-I CHEN  Yen-Ting LIN  Wan-Ling HSU  

     
    LETTER-VLSI Design Technology and CAD

      Vol:
    E94-A No:4
      Page(s):
    1151-1155

    In deep-submicron era, wire delay is becoming a bottleneck while pursuing even higher system clock speed. Several distributed register (DR) architectures have been proposed to cope with this problem by keeping most wires local. In this article, we propose a new resource-constrained communication synthesis algorithm for optimizing both inter-island connections (IICs) and latency targeting on distributed register-file microarchitecture (DRFM). The experimental results show that up to 24.7% and 12.7% reduction on IIC and latency can be achieved respectively as compared to the previous work.

  • Register File Size Reduction through Instruction Pre-Execution Incorporating Value Prediction

    Yusuke TANAKA  Hideki ANDO  

     
    PAPER-Computer System

      Vol:
    E93-D No:12
      Page(s):
    3294-3305

    Two-step physical register deallocation (TSD) is an architectural scheme that enhances memory-level parallelism (MLP) by pre-executing instructions. Ideally, TSD allows exploitation of MLP under an unlimited number of physical registers, and consequently only a small register file is needed for MLP. In practice, however, the amount of MLP exploitable is limited, because there are cases where either 1) pre-execution is not performed; or 2) the timing of pre-execution is delayed. Both are due to data dependencies among the pre-executed instructions. This paper proposes the use of value prediction to solve these problems. This paper proposes the use of value prediction to solve these problems. Evaluation results using the SPECfp2000 benchmark confirm that the proposed scheme with value prediction for predicting addresses achieves equivalent IPC, with a smaller register file, to the previous TSD scheme. The reduction rate of the register file size is 21%.

  • Energy-Efficient Pre-Execution Techniques in Two-Step Physical Register Deallocation

    Kazunaga HYODO  Kengo IWAMOTO  Hideki ANDO  

     
    PAPER-Computer Systems

      Vol:
    E92-D No:11
      Page(s):
    2186-2195

    Instruction pre-execution is an effective way to prefetch data. We previously proposed an instruction pre-execution scheme, which we call two-step physical register deallocation (TSD). The TSD realizes pre-execution by exploiting the difference between the amount of instruction-level parallelism available with an unlimited number of physical registers and that available with an actual number of physical registers. Although previous TSD study has successfully improved performance, it still has an inefficient energy consumption. This is because attempts are made for instructions to be pre-executed as much as possible, independently of whether or not they can significantly contribute to load latency reduction, allowing for maximal performance improvement. This paper presents a scheme that improves the energy efficiency of the TSD by pre-executing only those instructions that have great benefit. Our evaluation results using the SPECfp2000 benchmark show that our scheme reduces the dynamic pre-executed instruction count by 76%, compared with the original scheme. This reduction saves 7% energy consumption of the execution core with 2% overhead. Performance degrades by 2%, compared with that of the original scheme, but is still 15% higher than that of the normal processor without the TSD.

  • Identifying Processor Bottlenecks in Virtual Machine Based Execution of Java Bytecode

    Pradeep RAO  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E92-C No:10
      Page(s):
    1265-1275

    Despite the prevalence of Java workloads across a variety of processor architectures, there is very little published data on the impact of the various processor design decisions on Java performance. We attribute the lack of data to the large design space resulting from the complexity of the modern superscalar processor and the additional complexities associated with executing Java bytecode using a virtual machine. To address this shortcoming, we use a statistically rigorous methodology to systematically quantify the the impact of the various processor microarchitecture parameters on Java execution performance. The adopted methodology enables efficient screening of significant factor effects in a large design space consisting of 35 factors (32-billion potential configurations) using merely 72 observations per benchmark application. We quantify and tabulate the significance of each of the 35 factors for 13 benchmark applications. While these tables provide various insights into Java performance, they consistently highlight the performance significance of the instruction delivery mechanism, especially the instruction cache and the ITLB design parameters. Furthermore, these tables enable the architect to identify processor bottlenecks for Java workloads by providing an estimate of the relative impact of various design decisions.

  • An Energy Efficient Instruction Window for Scalable Processor Architecture

    Min CHOI  Seungryoul MAENG  

     
    PAPER

      Vol:
    E91-C No:9
      Page(s):
    1427-1436

    Modern microprocessors achieve high application performance at the acceptable level of power dissipation. In terms of power to performance trade-off, the instruction window is particularly important. This is because enlarging the window size achieves high performance but naive scaling of the conventional instruction window can severely increase the complexity and power consumption. In this paper, we propose low-power instruction window techniques for contemporary microprocessors. First, the small reorder buffer (SROB) reduces power dissipation by deferred allocation and early release. The deferred allocation delays the SROB allocation of instructions until their all data dependencies are resolved. Then, the instructions are executed in program order and they are released faster from the SROB. This results in higher resource utilization and low power consumption. Second, we replace a conventional issue queue by a direct lookup table (DLT) with an efficient tag translation technique. The translation scheme resolves the instruction dependency, especially for the case of one producer to multiple consumers. The efficiency of the translation scheme stems from the fact that the vast majority of instruction dependency exists within a basic block. Experimental results show that our proposed design reduces the power consumption significantly for SPEC2000 benchmarks.

  • A Low-Power Branch Predictor for Embedded Processors

    Sung Woo CHUNG  Gi Ho PARK  Sung Bae PARK  

     
    LETTER-Computer Systems

      Vol:
    E87-D No:9
      Page(s):
    2253-2257

    Even in embedded processors, the accuracy in a branch prediction significantly affects the performance. In designing a branch predictor, in addition to accuracy, microarchitects should consider area, delay and power consumption. We propose two techniques to reduce the power consumption; these techniques do not requires any additional storage arrays, do not incur additional delay (except just one MUX delay) and never deteriorate accuracy. One is to look up two predictions at a time by increasing the width (decreasing the depth) of the PHT (Prediction History Table). The other is to reduce unnecessary accesses to the BTB (Branch Target Buffer) by accessing the PHT in advance. Analysis results with Samsung Memory Compiler show that the proposed techniques reduce the power consumption of the branch predictor by 15-52%.

  • A Low-Power Tournament Branch Predictor

    Sung Woo CHUNG  Gi Ho PARK  Sung Bae PARK  

     
    LETTER-Computer Systems

      Vol:
    E87-D No:7
      Page(s):
    1962-1964

    This letter proposes a low-power tournament branch predictor, in which the number of accesses to the branch predictors (local predictor or global predictor) is reduced. Analysis results with Samsung Memory Compiler show that the proposed branch predictor reduces the power consumption by 24-45%, compared to the conventional tournament branch predictor, not requiring any additional storage arrays, not incurring any additional delay and never harming accuracy.

  • Design Development of SPARC64 V Microprocessor

    Mariko SAKAMOTO  Akira KATSUNO  Aiichiro INOUE  Takeo ASAKAWA  Kuniki MORITA  Tsuyoshi MOTOKURUMADA  Yasunori KIMURA  

     
    INVITED PAPER

      Vol:
    E86-D No:10
      Page(s):
    1955-1965

    We developed a SPARC-V9 processor, the SPARC64 V. It has an operating frequency of 1.35 GHz and contains 191 million transistors fabricated using 0.13-µm CMOS technology with eight-layer copper metallization. SPECjbb2000 (CPU# 32) is 492683, highest on the market and 42% higher than the next highest system. SPEC CPU2000 performance is 858 for SPECint and 1228 for SPECfp. The processor is designed to provide the high system performance and high reliability required of enterprise server systems. It is also designed to address the performance requirements of high-performance computing. During our development of several generations of mainframe processors, we conducted many related experiments, and obtained enterprise server system (EPS) development skills, an understanding of EPS workload characteristics, and technology that provides high reliability, availability, and serviceability. We used those as bases of the new processor development. The approach quite effectively moves beyond differences between mainframe and SPARC systems. At the beginning of development and before the start of hardware design, we developed a software performance simulator so we could understand the performance impacts of created specifications, thereby enabling us to make appropriate decisions about hardware design. We took this approach to solve performance problems before tape-out and avoid spending additional time on design update and physical machine reconstruction. We were successful, completing the high-performance processor development on schedule and in a short time. This paper describes the SPARC64 V microprocessor and performance analyses for development of its design.