Author Search Result

[Author] Toshinori SATO(16hit)

1-16hit
  • A Microprocessor Architecture Utilizing Histories of Dynamic Sequences Saved in Distributed Memories

    Toshinori SATO  

     
    PAPER

    Vol: E81-C  No: 9  Page(s): 1398-1407

    To improve microprocessor performance, we propose utilizing histories of dynamic instruction sequences. Many special-purpose memories integrated on a processor chip hold the histories. In this paper, we describe the usefulness of two special-purpose memories: the Non-Consecutive basic block Buffer (NCB) and the Reference Prediction Table (RPT). The NCB improves instruction fetching efficiency in order to relieve control dependences. The RPT predicts data addresses in order to speculate on data dependences. A simulation study shows that the proposed mechanisms improve processor performance by up to 49.2%.

  • An Accuracy-Configurable Adder for Low-Power Applications

    Tongxin YANG  Toshinori SATO  Tomoaki UKEZONO  

     
    PAPER

    Vol: E103-C  No: 3  Page(s): 68-76

    Addition is a key fundamental function for many error-tolerant applications. Approximate addition is considered to be an efficient technique for trading off energy against performance and accuracy. This paper proposes a carry-maskable adder whose accuracy can be configured at runtime. The proposed scheme can dynamically select the length of the carry propagation to satisfy the quality requirements flexibly. Compared with a conventional ripple carry adder and a conventional carry look-ahead adder, the proposed 16-bit adder reduced the power consumption by 54.1% and 57.5%, respectively, and the critical path delay by 72.5% and 54.2%, respectively. In addition, results from an image processing application indicate that the quality of processed images can be controlled by the proposed adder. Good scalability of the proposed adder is demonstrated from the evaluation results using a 32-bit length.
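
The runtime-configurable carry masking described in this abstract can be illustrated with a small software model. This is a minimal sketch, not the authors' circuit: the abstract does not say what a masked bit position outputs, so the model assumes each masked low-order bit produces `a OR b` and generates no carry. The names `carry_maskable_add`, `worst_case_error`, and `masked_bits` are hypothetical.

```python
def carry_maskable_add(a: int, b: int, width: int, masked_bits: int) -> int:
    """Approximate unsigned addition with carries masked in the low bits.

    Assumption (not from the abstract): each masked bit outputs a OR b and
    generates no carry, so the carry chain starts at bit `masked_bits`.
    The result keeps the carry-out, so it can be width + 1 bits long.
    """
    if not 0 <= masked_bits <= width:
        raise ValueError("masked_bits must be between 0 and width")
    word_mask = (1 << width) - 1
    a &= word_mask
    b &= word_mask
    low_mask = (1 << masked_bits) - 1
    low = (a | b) & low_mask  # carry-free approximate part
    high = ((a >> masked_bits) + (b >> masked_bits)) << masked_bits  # exact part
    return high | low


def worst_case_error(width: int, masked_bits: int) -> int:
    """Exhaustively find the largest (exact - approximate) for one setting."""
    return max(
        (a + b) - carry_maskable_add(a, b, width, masked_bits)
        for a in range(1 << width)
        for b in range(1 << width)
    )
```

Under this assumption the error equals the bitwise AND of the masked low parts, so it is always below `2 ** masked_bits`. With `masked_bits=0` the model is an exact adder. Increasing `masked_bits` shortens the carry chain at the cost of a larger error bound.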

  • A Low-Power Instruction Issue Queue for Microprocessors

    Shingo WATANABE  Akihiro CHIYONOBU  Toshinori SATO  

     
    PAPER

    Vol: E91-C  No: 4  Page(s): 400-409

    The instruction issue queue is a key component that extracts instruction-level parallelism (ILP) in modern out-of-order microprocessors. To exploit ILP for improving processor performance, the instruction queue size should be increased. However, it is difficult to increase the size, since the instruction queue is implemented with a content addressable memory (CAM), whose power consumption and delay are large. This paper introduces a low-power and scalable instruction queue that replaces the CAM with a RAM. In this queue, instructions are explicitly woken up. Evaluation results show that the proposed instruction queue decreases processor performance by only 1.9% on average. Furthermore, the total energy consumption is reduced by 54% on average.

  • Design and Analysis of A Low-Power High-Speed Accuracy-Controllable Approximate Multiplier

    Tongxin YANG  Tomoaki UKEZONO  Toshinori SATO  

     
    PAPER

    Vol: E101-A  No: 12  Page(s): 2244-2253

    Multiplication is a key fundamental function for many error-tolerant applications. Approximate multiplication is considered to be an efficient technique for trading off energy against performance and accuracy. This paper proposes an accuracy-controllable multiplier whose final product is generated by a carry-maskable adder. The proposed scheme can dynamically select the length of the carry propagation to satisfy the accuracy requirements flexibly. The partial product tree of the multiplier is approximated by the proposed tree compressor. An 8×8 multiplier design is implemented by employing the carry-maskable adder and the compressor. Compared with a conventional Wallace tree multiplier, the proposed multiplier reduced power consumption by between 47.3% and 56.2% and critical path delay by between 29.9% and 60.5%, depending on the required accuracy. Its silicon area was also 44.6% smaller. In addition, results from two image processing applications demonstrate that the quality of the processed images can be controlled by the proposed multiplier design.

  • An Energy-Efficient Clustered Superscalar Processor

    Toshinori SATO  Akihiro CHIYONOBU  

     
    PAPER-Digital

    Vol: E88-C  No: 4  Page(s): 544-551

    Power consumption is a major concern in embedded microprocessor design. Reducing power has also been a critical design goal for general-purpose microprocessors. Since they require high performance as well as low power, power reduction at the cost of performance cannot be accepted. There are many device-level techniques that reduce power while maintaining performance. They select non-critical paths as candidates for low-power design, and performance-oriented design is used only in speed-critical paths. The same philosophy can be applied to architectural-level design. We evaluate a technique that exploits dynamic information regarding instruction criticality in order to reduce power. Specifically, we evaluate an instruction steering policy for a clustered microarchitecture, which is based on instruction criticality, and find that it is substantially energy-efficient, although it suffers performance degradation.

  • Design and Analysis of Approximate Multipliers with a Tree Compressor

    Tongxin YANG  Tomoaki UKEZONO  Toshinori SATO  

     
    PAPER-VLSI Design Technology and CAD

    Vol: E102-A  No: 3  Page(s): 532-543

    Many applications, such as image signal processing, have an inherent tolerance for insignificant inaccuracies. Multiplication is a key arithmetic function for many applications. Approximate multipliers are considered an efficient technique to trade off energy against performance and accuracy for error-tolerant applications. Here, we design and analyze four approximate multipliers that demonstrate lower power consumption and shorter critical path delay than the conventional multiplier. They employ an approximate tree compressor that halves the height of the partial product tree and generates a vector to compensate for accuracy. Compared with the conventional Wallace tree multiplier, one of the evaluated 8-bit approximate multipliers reduces power consumption and critical path delay by 36.9% and 38.9%, respectively. With a 0.25% normalized mean error distance, the silicon area required to implement the multiplier is reduced by 50.3%. Our multipliers outperform the previously proposed approximate multipliers in terms of power consumption, critical path delay, and design area. Results from two image processing applications also demonstrate that the images processed by our multipliers are sufficiently accurate for such error-tolerant applications.

  • Trading Accuracy for Power with a Configurable Approximate Adder

    Toshinori SATO  Tongxin YANG  Tomoaki UKEZONO  

     
    PAPER

    Vol: E102-C  No: 4  Page(s): 260-268

    Approximate computing is a promising paradigm to realize fast, small, and low power characteristics, which are essential for modern applications, such as Internet of Things (IoT) devices. This paper proposes the Carry-Predicting Adder (CPredA), an approximate adder that is scalable relative to accuracy and power consumption. The proposed CPredA improves the accuracy of a previously studied adder by performing carry prediction. Detailed simulations reveal that, compared to the existing approximate adder, accuracy is improved by approximately 50% with comparable energy efficiency. Two application-level evaluations demonstrate that the proposed approximate adder is sufficiently accurate for practical use.

  • A Transparent Transient Faults Tolerance Mechanism for Superscalar Processors

    Toshinori SATO  

     
    PAPER-Dependable Systems

    Vol: E86-D  No: 12  Page(s): 2508-2516

    In this paper, we propose a fault-tolerance mechanism for microprocessors, which detects transient faults and recovers from them. The investigation of fault-tolerance techniques for microprocessors is driven by two issues: One regards deep submicron fabrication technologies. Future semiconductor technologies could become more susceptible to alpha particles and other cosmic radiation. The other is the increasing popularity of mobile platforms. Cellular telephones are currently used for applications which are critical to our financial security, such as mobile banking, mobile trading, and making airline ticket reservations. Such applications demand that computer systems work correctly. In light of this, we propose a mechanism which is based on an instruction reissue technique for incorrect data speculation recovery and utilizes time redundancy, and evaluate our proposal using a timing simulator.
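
The combination of time redundancy and instruction reissue mentioned in this abstract can be sketched in software. The sketch is an illustration under stated assumptions, not the paper's mechanism: it assumes an operation is executed twice on the same unit, the two results are compared, and the pair is reissued on a mismatch. `FlakyALU`, `add_with_time_redundancy`, and `max_reissues` are hypothetical names, and faults are injected at fixed call indices so the behavior is deterministic.

```python
class FlakyALU:
    """Adder that flips one result bit on chosen calls to mimic transient faults."""

    def __init__(self, faulty_calls):
        self.faulty_calls = set(faulty_calls)  # 0-based call indices that glitch
        self.calls = 0

    def add(self, a, b):
        result = a + b
        if self.calls in self.faulty_calls:
            result ^= 1  # single-bit upset in the least significant bit
        self.calls += 1
        return result


def add_with_time_redundancy(alu, a, b, max_reissues=3):
    """Run the add twice; reissue the pair whenever the two results disagree.

    Returns (result, reissues), where reissues counts the extra attempts.
    """
    for reissues in range(max_reissues + 1):
        first = alu.add(a, b)
        second = alu.add(a, b)
        if first == second:
            return first, reissues
    raise RuntimeError("results never agreed; the fault may not be transient")
```

For example, `add_with_time_redundancy(FlakyALU({1}), 2, 3)` detects the mismatch on the first pair and succeeds after one reissue. A limitation of this simple model is that two identical faults in the same pair would go undetected.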

  • Hiding Data Cache Latency with Load Address Prediction

    Toshinori SATO  Hiroshige FUJII  Seigo SUZUKI  

     
    PAPER-Computer Systems

    Vol: E79-D  No: 11  Page(s): 1523-1532

    A new prediction method for the effective address is presented. This method uses a buffer called the address prediction buffer, and allows the data cache to be accessed speculatively. As a consequence of the trend toward increasing clock frequency, the internal cache is no longer able to fill the speed gap between the processor and the external memory, and the data cache latency degrades processor performance. The prediction method is proposed in order to hide this latency. With this method, the load address is predicted, and the data is fetched earlier than the memory access stage. If the prediction is correct, the latency is hidden. Even if the prediction is incorrect, the performance is not degraded by any miss penalties. We have found that the prediction accuracy is 81.9% on average, and thus the performance is improved by 6.6% on average and by a maximum of 12.1% for the integer programs.
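
A load address predictor of this general kind can be modeled in a few lines. The abstract does not state how the address prediction buffer forms its prediction, so this sketch assumes a common scheme: a table indexed by the load instruction's address that stores the last effective address and the last observed stride. `AddressPredictionBuffer` and `prediction_accuracy` are hypothetical names used only for illustration.

```python
class AddressPredictionBuffer:
    """Per-load table of (last address, stride); an assumed organization."""

    def __init__(self):
        self.entries = {}  # load PC -> (last_addr, stride)

    def predict(self, pc):
        """Return the predicted effective address, or None if the load is unseen."""
        entry = self.entries.get(pc)
        if entry is None:
            return None
        last_addr, stride = entry
        return last_addr + stride

    def update(self, pc, actual_addr):
        """Record the resolved address and the stride from the previous access."""
        entry = self.entries.get(pc)
        stride = 0 if entry is None else actual_addr - entry[0]
        self.entries[pc] = (actual_addr, stride)


def prediction_accuracy(trace):
    """Fraction of loads in `trace` (pairs of pc, addr) predicted correctly."""
    buf = AddressPredictionBuffer()
    correct = 0
    for pc, addr in trace:
        if buf.predict(pc) == addr:
            correct += 1
        buf.update(pc, addr)
    return correct / len(trace)
```

On a synthetic trace such as `[(0x40, 1000 + 8 * i) for i in range(10)]`, the first two accesses train the entry and the remaining eight are predicted correctly. A correct prediction lets the cache access start early; a wrong one is simply discarded in this model.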

  • Performance Evaluation of a Processing Element for an On-Chip Multiprocessor

    Masafumi TAKAHASHI  Hiroshige FUJII  Emi KANEKO  Takeshi YOSHIDA  Toshinori SATO  Hiroyuki TAKANO  Haruyuki TAGO  Seigo SUZUKI  Nobuyuki GOTO  

     
    PAPER

    Vol: E77-C  No: 7  Page(s): 1092-1100

    A 250-MIPS, 125-MFLOPS peak performance processing element (PE), which is being developed for an on-chip multiprocessor, has been modeled and evaluated. The PE includes the following new architecture components: an FPU shared by several IUs in order to increase the efficiency of the FPU pipelines, an on-chip data cache with a prefetch mechanism to reduce the clock cycles spent waiting for memory, and an interface to high-speed DRAM, such as Rambus DRAM and Synchronous DRAM. The results show that a PE model with an FPU shared by four or eight IUs causes only a 10% performance reduction compared to a model with an unshared FPU, while saving the cost of three FPUs. Furthermore, a PE model with prefetch operates 1.2 to 1.8 times faster than a model without prefetch at a 250-MHz clock rate when the Rambus DRAM is connected. It becomes clear that this PE architecture can deliver high effective performance at over 250 MHz, and is cost-effective for the on-chip multiprocessor.

  • Exploiting Configurable Approximations for Tolerating Aging-induced Timing Violations

    Toshinori SATO  Tomoaki UKEZONO  

     
    PAPER

    Vol: E103-A  No: 9  Page(s): 1028-1036

    This paper proposes a technique that increases the lifetime of large scale integration (LSI) devices. As semiconductor technology improves at miniaturizing transistors, aging effects due to bias temperature instability (BTI) seriously affect their lifetime. BTI increases the threshold voltage of transistors, thereby also increasing the delay of an electronic device, resulting in failures due to timing violations. To compensate for aging-induced timing violations, we exploit configurable approximate computing. Assuming that target circuits have exact and approximate modes, they are configured for the approximate mode if an aging sensor predicts violations. Experiments using an example circuit revealed an increase in its lifetime to >10 years.

  • A Simple Mechanism for Collapsing Instructions under Timing Speculation

    Toshinori SATO  

     
    PAPER

    Vol: E91-C  No: 9  Page(s): 1394-1401

    Deep submicron semiconductor technologies will make worst-case design impossible, since they cannot provide the design margins that it requires. We are investigating a typical-case design methodology, which we call Constructive Timing Violation (CTV). This paper extends the CTV concept to collapse dependent instructions, resulting in performance improvement. Based on detailed simulations, we find that the proposed mechanism effectively collapses dependent instructions.

  • Short Term Cell-Flipping Technique for Mitigating SNM Degradation Due to NBTI

    Yuji KUNITAKE  Toshinori SATO  Hiroto YASUURA  

     
    PAPER

    Vol: E94-C  No: 4  Page(s): 520-529

    Negative Bias Temperature Instability (NBTI) is one of the major reliability problems in advanced technologies. NBTI causes a threshold voltage shift in a PMOS transistor. When the PMOS transistor is negatively biased, the threshold voltage shifts in the negative direction. On the other hand, the threshold voltage recovers if the PMOS transistor is positively biased. In an SRAM cell, due to NBTI, the threshold voltage degrades in the load PMOS transistors. The degradation has an impact on the Static Noise Margin (SNM), which is a measure of the read stability of a 6-T SRAM cell. In this paper, we discuss the relationship between NBTI degradation in an SRAM cell and the dynamic stress and recovery condition. There are two important characteristics. One is the stress probability, which is defined as the rate at which the PMOS transistor is negatively biased. The other is the stress and recovery cycle, which is defined as the switching interval of an SRAM value. In our observations, in order to mitigate the NBTI degradation, the stress probability should be small and the stress and recovery cycle should be shorter than 10 msec. Based on the observations, we propose a novel cell-flipping technique, which makes the stress probability close to 50%. In addition, we show the results of case studies that apply the cell-flipping technique to register files and cache memories.
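
The effect of cell flipping on stress probability can be shown with a toy model. This is a sketch under simplifying assumptions, not the paper's evaluation: time is divided into equal intervals, one load PMOS (called "A" here) is taken to be stressed while the cell physically holds 1 and the other ("B") while it holds 0, and flipping inverts the physical contents every `flip_every` intervals without changing the logical value. The function name and parameters are hypothetical.

```python
def stress_probabilities(stored_bits, flip_every=None):
    """Return the fraction of intervals each load PMOS spends under stress.

    Modeling assumption: PMOS "A" is stressed while the cell physically holds 1
    and PMOS "B" while it holds 0. With flip_every=n, the physical contents are
    inverted every n intervals; the logical value seen by software is unchanged.
    """
    inverted = 0
    stressed_a = 0
    for t, bit in enumerate(stored_bits):
        if flip_every and t > 0 and t % flip_every == 0:
            inverted ^= 1  # toggle the physical polarity of the cell
        physical = bit ^ inverted
        stressed_a += physical
    total = len(stored_bits)
    return stressed_a / total, (total - stressed_a) / total
```

A cell that holds a constant 0 for 1000 intervals stresses only transistor B without flipping, whereas flipping every interval or every 10 intervals splits the stress evenly between the two transistors. The model covers only the stress probability; the length of the stress and recovery cycle, which the abstract also identifies as important, would correspond to the choice of `flip_every` and the real duration of an interval.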

  • Potential of Constructive Timing-Violation

    Toshinori SATO  Itsujiro ARITA  

     
    PAPER-High-Performance Technologies

    Vol: E85-C  No: 2  Page(s): 323-330

    This paper proposes constructive timing-violation (CTV) and evaluates its potential. It can be utilized both for increasing clock frequency and for reducing energy consumption. Increasing the clock frequency beyond that determined by the critical paths causes timing violations. On the other hand, while supply voltage reduction can result in substantial power savings, it also causes larger gate delay, and thus the clock must be slowed down in order not to violate the timing constraints of critical paths. However, if tolerance mechanisms are provided for the timing violations, it is not necessary to keep the constraints. Rather, the violations would be constructive for high clock frequency or for energy savings. From these observations, we propose the CTV, which is supported by a tolerance mechanism based on contemporary speculative execution mechanisms. We evaluate the CTV using a cycle-by-cycle simulator and present its considerably promising potential.

  • Resolving Load Data Dependency Using Tunneling-Load Technique

    Toshinori SATO  

     
    PAPER-Computer Systems

    Vol: E81-D  No: 8  Page(s): 829-838

    A new technique for reducing the load latency is presented. This technique, named tunneling-load, utilizes the register specifier buffer in order to reduce the load latency without fetching from the data cache speculatively, and thus eliminates the drawback of load address prediction techniques. As a consequence of the trend toward increasing clock frequency, the internal cache is no longer able to fill the speed gap between the processor and the external memory, and the data cache latency degrades processor performance. In order to hide this latency, several techniques that predict the load address have been proposed. These techniques carry out speculative data cache fetching, which causes an explosion of memory traffic and pollution of the data cache. The tunneling-load solves these problems. We have evaluated the effects of the tunneling-load, and found that on an in-order-issue superscalar platform the instruction-level parallelism is increased by approximately 10%.

  • Enhancements of a Circuit-Level Timing Speculation Technique and Their Evaluations Using a Co-simulation Environment

    Yuji KUNITAKE  Kazuhiro MIMA  Toshinori SATO  Hiroto YASUURA  

     
    PAPER

    Vol: E92-C  No: 4  Page(s): 483-491

    Deep submicron semiconductor technology has increased process variations. This makes it difficult to estimate the worst-case design margin. In order to realize robust designs, we are investigating a typical-case design methodology, which we call Constructive Timing Violation (CTV). In the CTV-based design, we can relax timing constraints. However, relaxing timing constraints might cause some timing errors. While we have applied the CTV-based design to a processor, unfortunately, the timing error recovery has a serious impact on processor performance. In this paper, we investigate enhancement techniques for the CTV-based design. In addition, in order to accurately evaluate the CTV-based design, we build a co-simulation framework to consider circuit delay at the architectural level. From the co-simulation results, we find that the performance penalty is significantly reduced by the enhancement techniques.