The search functionality is under construction.

Keyword Search Result

[Keyword] approximate computing (19 hits)

Hits 1-19
  • Identification of Redundant Flip-Flops Using Fault Injection for Low-Power Approximate Computing Circuits

    Jiaxuan LU  Yutaka MASUDA  Tohru ISHIHARA  

     
    PAPER-VLSI Design Technology and CAD

      Publicized:
    2023/08/31
      Vol:
    E107-A No:3
      Page(s):
    540-548

    Approximate computing (AC) saves energy and improves performance by introducing approximation into computation in error-tolerant applications. This work focuses on an AC strategy that accurately performs important computations and approximates others. To make AC circuits practical, we need to carefully determine how important each computation is, and thus to approximate the redundant computations appropriately while maintaining the required computational quality. In this paper, we focus on the importance of computations at the flip-flop (FF) level and propose a novel importance evaluation methodology. The key idea of the proposed methodology is a two-step fault injection (FI) algorithm to extract the near-optimal set of redundant FFs in the circuit. In the first step, the proposed methodology performs the FI simulation for each FF and extracts the candidates of redundant FFs. Then, in the second step, the proposed methodology extracts the set of redundant FFs in a binary search manner. Thanks to the two-step strategy, the proposed algorithm reduces the complexity of architecture exploration from an exponential order to a linear order without understanding the functionality and behavior of the target application program. Experimental results show that the proposed methodology identifies the candidates of redundant FFs depending on the given constraints. In a case study of an image processing accelerator, the truncation for identified redundant FFs reduces the circuit area by 29.6% and the power dissipation by 44.8% under the ASIC implementation while satisfying the PSNR constraint. Similarly, the dynamic power dissipation is reduced by 47.2% under the FPGA implementation.

  • Dynamic Verification Framework of Approximate Computing Circuits using Quality-Aware Coverage-Based Grey-Box Fuzzing

    Yutaka MASUDA  Yusei HONDA  Tohru ISHIHARA  

     
    PAPER

      Publicized:
    2022/09/02
      Vol:
    E106-A No:3
      Page(s):
    514-522

    Approximate computing (AC) has recently emerged as a promising approach to the energy-efficient design of digital systems. To realize practical AC designs, we need to verify whether the designed circuit can operate correctly under various operating conditions. Namely, the verification needs to efficiently find fatal logic errors or timing errors that violate the constraint of computational quality. This work focuses on verification where the computational results can be observed, the computational quality can be calculated from the computational results, and the constraint of computational quality is given, i.e., it is defined as the constraint imposed on the computational quality of the designed AC circuit under the given workloads. Then, this paper proposes a novel dynamic verification framework for AC circuits. The key idea of the proposed framework is to incorporate a quality assessment capability into Coverage-based Grey-box Fuzzing (CGF). CGF is one of the most promising techniques in the research field of software security testing. By repeating (1) mutation of test patterns, (2) execution of the program under test (PUT), and (3) aggregation of coverage information and feedback to the next test pattern generation, CGF can explore the verification space quickly and automatically. However, CGF by itself cannot take the computational quality into account. To overcome this quality unawareness in CGF, the proposed framework additionally embeds the Design Under Verification (DUV) component into the calculation part of computational quality. Thanks to the DUV integration, the proposed framework realizes the quality-aware feedback loop in CGF and thus quickly enhances the verification coverage for test patterns that violate the quality constraint. In this work, we quantitatively compared the verification coverage of the approximate arithmetic circuits between the proposed framework and the random test. In a case study of an approximate multiply-accumulate (MAC) unit, we experimentally confirmed that the proposed framework achieved 3.85 to 10.36 times higher coverage than the random test.
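
The three-step loop named in the abstract above (mutate, execute, feed coverage back) can be illustrated with a toy sketch. This is not the authors' framework: `approx_add` is a hypothetical design under verification, the "coverage" signal is simply the set of distinct error magnitudes seen so far, and `max_error` is an assumed quality constraint.

```python
import random

def approx_add(a: int, b: int, masked_bits: int = 4) -> int:
    """Hypothetical DUV: an adder that ignores the low `masked_bits` of each operand."""
    mask = ~((1 << masked_bits) - 1)
    return (a & mask) + (b & mask)

def fuzz(iterations: int = 2000, max_error: int = 20, seed: int = 0):
    """Toy coverage-guided loop over pairs of 8-bit operands."""
    rng = random.Random(seed)
    corpus = [(0, 0)]   # test patterns worth mutating further
    seen = set()        # assumed coverage metric: distinct error magnitudes
    violations = []     # patterns that break the assumed quality constraint
    for _ in range(iterations):
        a, b = rng.choice(corpus)
        # (1) Mutation: flip one random bit in one of the two operands.
        bit = 1 << rng.randrange(8)
        if rng.random() < 0.5:
            a ^= bit
        else:
            b ^= bit
        # (2) Execution: run the DUV and measure the quality loss.
        error = abs((a + b) - approx_add(a, b))
        # (3) Feedback: keep patterns that reveal a new error magnitude.
        if error not in seen:
            seen.add(error)
            corpus.append((a, b))
        if error > max_error:
            violations.append((a, b, error))
    return seen, violations
```

Using the observed error as the feedback signal is one simple way to make the loop quality-aware; a real framework would combine it with structural coverage of the circuit.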

  • An Accuracy Reconfigurable Vector Accelerator based on Approximate Logarithmic Multipliers for Energy-Efficient Computing

    Lingxiao HOU  Yutaka MASUDA  Tohru ISHIHARA  

     
    PAPER

      Publicized:
    2022/09/02
      Vol:
    E106-A No:3
      Page(s):
    532-541

    The approximate logarithmic multiplier proposed by Mitchell provides an efficient alternative for processing dense multiplication or multiply-accumulate operations in applications such as image processing and real-time robotics. It offers the advantages of small area and high energy efficiency, and it is suitable for applications that do not necessarily require high accuracy. However, its maximum error of 11.1% makes it challenging to deploy in applications requiring relatively high accuracy. This paper proposes a novel operand decomposition (OD) method that decomposes one multiplication into the sum of multiple approximate logarithmic multiplications to substantially reduce Mitchell multiplier errors while taking full advantage of its area savings. Based on the proposed OD method, this paper also proposes an accuracy reconfigurable multiply-accumulate (MAC) unit that provides multiple reconfigurable accuracies with high parallelism. Compared to a MAC unit consisting of accurate multipliers, the area is reduced to less than half, improving the hardware parallelism while satisfying the required accuracy for various scenarios. The experimental results show the excellent applicability of our proposed MAC unit in image smoothing and robot localization and mapping applications. We have also designed a prototype processor that integrates the minimum functionality of this MAC unit as a vector accelerator and have implemented a software-level accuracy reconfiguration in the form of an instruction set extension. We experimentally confirmed the correct operation of the proposed vector accelerator, which provides different degrees of accuracy and parallelism at the software level.
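
For readers unfamiliar with Mitchell's approximation, the following is a minimal software sketch of the basic (undecomposed) logarithmic multiplication and of where the 11.1% maximum error comes from. It models only the arithmetic, not the hardware or the OD method of the paper; the function name is illustrative.

```python
def mitchell_multiply(a: int, b: int) -> int:
    """Approximate a*b for non-negative integers using Mitchell's log approximation.

    Write a = 2**k1 * (1 + x1) and b = 2**k2 * (1 + x2) with 0 <= x1, x2 < 1,
    and approximate log2(1 + x) by x. Then:
      x1 + x2 <  1:  a*b ~= 2**(k1+k2)   * (1 + x1 + x2)
      x1 + x2 >= 1:  a*b ~= 2**(k1+k2+1) * (x1 + x2)
    """
    if a == 0 or b == 0:
        return 0
    k1 = a.bit_length() - 1  # position of the leading one of a
    k2 = b.bit_length() - 1  # position of the leading one of b
    one = 1 << (k1 + k2)     # the value 1.0 scaled by 2**(k1+k2)
    # (x1 + x2) scaled by 2**(k1+k2), kept as an exact integer.
    frac_sum = ((a - (1 << k1)) << k2) + ((b - (1 << k2)) << k1)
    if frac_sum < one:
        return one + frac_sum
    return 2 * frac_sum


# The worst case occurs when both fractional parts are 0.5, e.g. 3 * 3:
# the exact product is 9, the approximation is 8, a relative error of 1/9.
print(mitchell_multiply(3, 3), mitchell_multiply(4, 8))
```

The approximation never overestimates the product, and it is exact whenever one operand is a power of two.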

  • Low-Power Design Methodology of Voltage Over-Scalable Circuit with Critical Path Isolation and Bit-Width Scaling Open Access

    Yutaka MASUDA  Jun NAGAYAMA  TaiYu CHENG  Tohru ISHIHARA  Yoichi MOMIYAMA  Masanori HASHIMOTO  

     
    PAPER

      Publicized:
    2021/08/31
      Vol:
    E105-A No:3
      Page(s):
    509-517

    This work proposes a design methodology that reduces power dissipation under voltage over-scaling (VOS) operation. The key idea of the proposed design methodology is to combine critical path isolation (CPI) and bit-width scaling (BWS) under the constraint of computational quality, e.g., Peak Signal-to-Noise Ratio (PSNR) in the image processing domain. Conventional CPI inherently cannot reduce the delay of intrinsic critical paths (CPs), which may significantly restrict the power saving effect. In contrast, the proposed methodology tries to reduce the delay of both intrinsic and non-intrinsic CPs. Therefore, our design dramatically reduces the supply voltage and power dissipation while satisfying the quality constraint. Moreover, to reduce the co-design exploration space, the proposed methodology utilizes the exclusiveness of the paths targeted by CPI and BWS, where CPI aims at reducing the minimum supply voltage of non-intrinsic CPs, and BWS focuses on intrinsic CPs in arithmetic units. Based on this exclusiveness, the proposed design splits the simultaneous optimization problem into three sub-problems for reducing power dissipation: (1) the determination of bit-width reduction, (2) the timing optimization for non-intrinsic CPs, and (3) the investigation of the minimum supply voltage of the BWS and CPI-applied circuit under the quality constraint. Thanks to the problem splitting, the proposed methodology can efficiently find the quality-constrained minimum-power design. Evaluation results show that CPI and BWS are highly compatible, and they significantly enhance the efficacy of VOS. In a case study of a GPGPU processor, the proposed design reduces the power dissipation by 42.7% with an image processing workload and by 51.2% with a neural network inference workload.

  • Energy Efficient Approximate Storing of Image Data for MTJ Based Non-Volatile Flip-Flops and MRAM

    Yoshinori ONO  Kimiyoshi USAMI  

     
    PAPER

      Publicized:
    2021/01/06
      Vol:
    E104-C No:7
      Page(s):
    338-349

    A non-volatile memory (NVM) employing MTJ has many strengths, such as read/write performance, excellent endurance, and operating-voltage compatibility with standard CMOS. However, it consumes a lot of energy when writing data. This becomes an obstacle when applying it to battery-operated mobile devices. To solve this problem, we propose an approach to augment the capability of the precision scaling technique for the write operation in NVM. Precision scaling is an approximate computing technique that reduces the bit width of data (i.e., precision) for energy reduction. When writing image data to NVM with precision scaling, the write energy and the image quality change according to the write time and the target bit range. We propose an energy-efficient approximate storing scheme for non-volatile flip-flops and a magnetic random-access memory (MRAM) that allows us to write the data by optimizing the bit positions at which to split the data and the write time for each bit range. By using a statistical model, we obtained optimal values for the write time and the targeted bit range under the trade-off between write energy reduction and image quality degradation. Simulation results have demonstrated that by using these optimal values the write energy can be reduced by up to 50% while maintaining acceptable image quality. We also investigated in detail the relationship between the input images and the output image quality when using this approach. In addition, we evaluated the energy benefits when applying our approach to nine types of image processing including linear filters and edge detectors. Results showed that the write energy is reduced by a further 12.5% at most.
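
As a rough illustration of precision scaling as defined in the abstract above (reducing the bit width of data), the sketch below drops low-order bits of 8-bit pixels and measures the resulting quality with PSNR. It does not model write time, write energy, or the bit-range splitting of the proposed scheme; the function names and the sample pixel values are assumptions made for the example.

```python
import math

def precision_scale(pixel: int, dropped_bits: int) -> int:
    """Clear the `dropped_bits` least significant bits of a pixel value."""
    return pixel & ~((1 << dropped_bits) - 1)

def psnr(original, approximated, peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((o - a) ** 2 for o, a in zip(original, approximated)) / len(original)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)

# Hypothetical sample pixels: dropping more bits lowers the PSNR.
pixels = [12, 200, 183, 77, 255, 64]
for dropped in (2, 4):
    scaled = [precision_scale(p, dropped) for p in pixels]
    print(dropped, round(psnr(pixels, scaled), 2))
```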

  • Preliminary Performance Analysis of Distributed DNN Training with Relaxed Synchronization

    Koichi SHIRAHATA  Amir HADERBACHE  Naoto FUKUMOTO  Kohta NAKASHIMA  

     
    BRIEF PAPER

      Publicized:
    2020/12/01
      Vol:
    E104-C No:6
      Page(s):
    257-260

    Scalability of distributed DNN training can be limited by slowdown of specific processes due to unexpected hardware failures. We propose a dynamic process exclusion technique so that training throughput is maximized. Our evaluation using 32 processes with ResNet-50 shows that our proposed technique reduces slowdown by 12.5% to 50% without accuracy loss through excluding the slow processes.

  • A Feasibility Study of Multi-Domain Stochastic Computing Circuit Open Access

    Tati ERLINA  Renyuan ZHANG  Yasuhiko NAKASHIMA  

     
    PAPER-Integrated Electronics

      Publicized:
    2020/10/29
      Vol:
    E104-C No:5
      Page(s):
    153-163

    An efficient approximate computing circuit is developed for polynomial functions through a hybrid of the analog and stochastic domains. Unlike ordinary time-based stochastic computing (TBSC), the proposed circuit exploits not only the duty cycle of pulses but also the pulse strength of the analog current to carry information for multiplications. The accumulation of many multiplications is performed by merely collecting the stochastic current. As the calculation depth increases, the growth of latency (during summations), the weakening of signal power, and the disparity of output signals (during multiplications) are substantially avoided, in contrast to conventional TBSC. Furthermore, the calculation range theoretically extends to bipolar infinity without scaling. The proposed multi-domain stochastic computing (MDSC) is designed and simulated in a 0.18 µm CMOS technology by employing a set of current mirrors and an improved scheme of the TBSC circuit based on the Neuron-MOS mechanism. As a proof of concept, multiply-and-accumulate calculations (MACs) are implemented, achieving an average accuracy of 95.3%. More importantly, the transistor count, power consumption, and latency decrease to 6.1%, 55.4%, and 4.2% of those of the state-of-the-art TBSC circuit, respectively. The robustness against temperature and process variations is also investigated and presented in detail.

  • Transient Fault Tolerant State Assignment for Stochastic Computing Based on Linear Finite State Machines

    Hideyuki ICHIHARA  Motoi FUKUDA  Tsuyoshi IWAGAKI  Tomoo INOUE  

     
    PAPER

      Vol:
    E103-A No:12
      Page(s):
    1464-1471

    Stochastic computing (SC), which is an approximate computation with probabilities, has attracted attention owing to its small area, small power consumption and high fault tolerance. In this paper, we focus on the transient fault tolerance of SC based on linear finite state machines (linear FSMs). We show that state assignment of FSMs considerably affects the fault tolerance of linear FSM-based SC circuits, and present a Markov model for representing the impact of the state assignment on the behavior of faulty FSMs and estimating the expected error significance of the faulty FSM-based SC circuits. Furthermore, we propose a heuristic algorithm for appropriate state assignment that can mitigate the influence of transient faults. Experimental analysis shows that the state assignment has an impact on the transient fault tolerance of linear FSM-based SC circuits and the proposed state assignment algorithm can achieve a quasi-optimal state assignment in terms of high fault tolerance.
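
As background for the phrase "approximate computation with probabilities", here is a minimal sketch of the most basic stochastic computing operation: multiplying two probabilities by ANDing two independent random bitstreams (unipolar encoding). It does not implement the linear FSM-based circuits or the state assignment studied in the paper; the function names and the default stream length are illustrative assumptions.

```python
import random

def to_bitstream(p: float, length: int, rng: random.Random) -> list:
    """Encode probability p as a random bitstream whose fraction of ones is about p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(p: float, q: float, length: int = 10000, seed: int = 0) -> float:
    """Estimate p*q by ANDing two independent bitstreams and counting the ones."""
    rng = random.Random(seed)
    xs = to_bitstream(p, length, rng)
    ys = to_bitstream(q, length, rng)
    return sum(x & y for x, y in zip(xs, ys)) / length

# The estimate approaches 0.5 * 0.4 = 0.2 as the stream length grows.
print(sc_multiply(0.5, 0.4))
```

A single flipped bit changes the result by only 1/length, which is the intuition behind the fault tolerance mentioned in the abstract.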

  • Exploiting Configurable Approximations for Tolerating Aging-induced Timing Violations

    Toshinori SATO  Tomoaki UKEZONO  

     
    PAPER

      Vol:
    E103-A No:9
      Page(s):
    1028-1036

    This paper proposes a technique that increases the lifetime of large scale integration (LSI) devices. As semiconductor technology improves at miniaturizing transistors, aging effects due to bias temperature instability (BTI) seriously affect their lifetime. BTI increases the threshold voltage of transistors, thereby also increasing the delay of an electronic device and resulting in failures due to timing violations. To compensate for aging-induced timing violations, we exploit configurable approximate computing. Assuming that target circuits have exact and approximate modes, they are configured for the approximate mode if an aging sensor predicts violations. Experiments using an example circuit revealed an increase in its lifetime to >10 years.

  • Approximate FPGA-Based Multipliers Using Carry-Inexact Elementary Modules

    Yi GUO  Heming SUN  Ping LEI  Shinji KIMURA  

     
    PAPER

      Vol:
    E103-A No:9
      Page(s):
    1054-1062

    Approximate multiplier design is an effective technique to improve hardware performance at the cost of accuracy loss. Current approximate multipliers are mostly ASIC-based and are dedicated to one particular application. In contrast, FPGAs have been an attractive choice for many applications because of their high performance, reconfigurability, and fast development cycle. This paper presents a novel methodology for designing approximate multipliers by employing FPGA-based fabrics (primarily look-up tables and carry chains). The area and latency are significantly reduced by applying approximation to carry results and cutting the carry propagation path in the multiplier. Moreover, we explore the architectural space of higher-order multipliers by using our proposed small-size approximate multipliers as elementary modules. For different accuracy-hardware requirements, eight configurations of an approximate 8×8 multiplier are discussed. In terms of mean relative error distance (MRED), the error of the proposed 8×8 multiplier is as low as 1.06%. Compared with the exact multiplier, our proposed design can reduce area by 43.66% and power by 24.24%. The critical path latency reduction is up to 29.50%. The proposed multiplier design has a better accuracy-hardware tradeoff than other designs with comparable accuracy. Moreover, image sharpening is used to assess the efficiency of the approximate multipliers at the application level.
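
The mean relative error distance (MRED) quoted in the abstract above is commonly computed as the average of |approximate - exact| / exact over all operand pairs. The sketch below assumes that definition and skips zero operands, for which the relative error is undefined. `truncated_multiply` is a hypothetical stand-in for an approximate multiplier, not the carry-inexact design of the paper.

```python
from itertools import product

def truncated_multiply(a: int, b: int, dropped_bits: int = 2) -> int:
    """Hypothetical approximate multiplier: ignore the low bits of both operands."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) * (b & mask)

def mred(approx_fn, bits: int = 8) -> float:
    """Mean relative error distance over all pairs of non-zero `bits`-bit operands."""
    total, count = 0.0, 0
    for a, b in product(range(1, 1 << bits), repeat=2):
        exact = a * b
        total += abs(approx_fn(a, b) - exact) / exact
        count += 1
    return total / count

print(f"MRED of the truncated 8x8 multiplier: {mred(truncated_multiply):.4f}")
```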

  • An Accuracy-Configurable Adder for Low-Power Applications

    Tongxin YANG  Toshinori SATO  Tomoaki UKEZONO  

     
    PAPER

      Vol:
    E103-C No:3
      Page(s):
    68-76

    Addition is a key fundamental function for many error-tolerant applications. Approximate addition is considered to be an efficient technique for trading off energy against performance and accuracy. This paper proposes a carry-maskable adder whose accuracy can be configured at runtime. The proposed scheme can dynamically select the length of the carry propagation to satisfy the quality requirements flexibly. Compared with a conventional ripple carry adder and a conventional carry look-ahead adder, the proposed 16-bit adder reduced the power consumption by 54.1% and 57.5%, respectively, and the critical path delay by 72.5% and 54.2%, respectively. In addition, results from an image processing application indicate that the quality of processed images can be controlled by the proposed adder. Good scalability of the proposed adder is demonstrated from the evaluation results using a 32-bit length.

  • Design of Low-Cost Approximate Multipliers Based on Probability-Driven Inexact Compressors

    Yi GUO  Heming SUN  Ping LEI  Shinji KIMURA  

     
    PAPER

      Vol:
    E102-A No:12
      Page(s):
    1781-1791

    Approximate computing has emerged as a promising approach for error-tolerant applications to improve hardware performance at the cost of some loss of accuracy. Multiplication is a key arithmetic operation in these applications. In this paper, we propose a low-cost approximate multiplier design by employing new probability-driven inexact compressors. This compressor design is introduced to reduce the height of the partial product matrix to two rows, based on the probability distribution of the sum result of partial products. To compensate for the accuracy loss of the multiplier, a grouped error recovery scheme is proposed that achieves different levels of accuracy. In terms of mean relative error distance (MRED), the accuracy losses of the proposed multipliers range from 1.07% to 7.86%. Compared with the Wallace multiplier using a 40 nm process, the most accurate variant of the proposed multipliers can reduce power by 59.75% and area by 42.47%. The critical path delay reduction is larger than 12.78%. The proposed multiplier design has a better accuracy-performance trade-off than other designs with comparable accuracy. In addition, the efficiency of the proposed multiplier design is assessed in an image processing application.

  • Programmable Analog Calculation Unit with Two-Stage Architecture: A Solution of Efficient Vector-Computation Open Access

    Renyuan ZHANG  Takashi NAKADA  Yasuhiko NAKASHIMA  

     
    PAPER

      Vol:
    E102-A No:7
      Page(s):
    878-885

    A programmable analog calculation unit (ACU) is designed for vector computations in continuous-time with compact circuit scale. From our early study, it is feasible to retrieve arbitrary two-variable functions through support vector regression (SVR) in silicon. In this work, the dimensions of regression are expanded for vector computations. However, the hardware cost and computing error greatly increase along with the expansion of dimensions. A two-stage architecture is proposed to organize multiple ACUs for high dimensional regression. The computation of high dimensional vectors is separated into several computations of lower dimensional vectors, which are implemented by the free combination of several ACUs with lower cost. In this manner, the circuit scale and regression error are reduced. The proof-of-concept ACU is designed and simulated in a 0.18μm technology. From the circuit simulation results, all the demonstrated calculations with nine operands are executed without iterative clock cycles by 4960 transistors. The calculation error of example functions is below 8.7%.

  • Trading Accuracy for Power with a Configurable Approximate Adder

    Toshinori SATO  Tongxin YANG  Tomoaki UKEZONO  

     
    PAPER

      Vol:
    E102-C No:4
      Page(s):
    260-268

    Approximate computing is a promising paradigm to realize fast, small, and low power characteristics, which are essential for modern applications, such as Internet of Things (IoT) devices. This paper proposes the Carry-Predicting Adder (CPredA), an approximate adder that is scalable relative to accuracy and power consumption. The proposed CPredA improves the accuracy of a previously studied adder by performing carry prediction. Detailed simulations reveal that, compared to the existing approximate adder, accuracy is improved by approximately 50% with comparable energy efficiency. Two application-level evaluations demonstrate that the proposed approximate adder is sufficiently accurate for practical use.

  • Design and Analysis of Approximate Multipliers with a Tree Compressor

    Tongxin YANG  Tomoaki UKEZONO  Toshinori SATO  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E102-A No:3
      Page(s):
    532-543

    Many applications, such as image signal processing, have an inherent tolerance for insignificant inaccuracies. Multiplication is a key arithmetic function for many applications. Approximate multipliers are considered an efficient technique to trade off energy against performance and accuracy for error-tolerant applications. Here, we design and analyze four approximate multipliers that demonstrate lower power consumption and shorter critical path delay than the conventional multiplier. They employ an approximate tree compressor that halves the height of the partial product tree and generates a vector to compensate for the accuracy loss. Compared with the conventional Wallace tree multiplier, one of the evaluated 8-bit approximate multipliers reduces power consumption and critical path delay by 36.9% and 38.9%, respectively. With a 0.25% normalized mean error distance, the silicon area required to implement the multiplier is reduced by 50.3%. Our multipliers outperform the previously proposed approximate multipliers in terms of power consumption, critical path delay, and design area. Results from two image processing applications also demonstrate that the images processed by our multipliers are sufficiently accurate for such error-tolerant applications.

  • Design and Analysis of A Low-Power High-Speed Accuracy-Controllable Approximate Multiplier

    Tongxin YANG  Tomoaki UKEZONO  Toshinori SATO  

     
    PAPER

      Vol:
    E101-A No:12
      Page(s):
    2244-2253

    Multiplication is a key fundamental function for many error-tolerant applications. Approximate multiplication is considered to be an efficient technique for trading off energy against performance and accuracy. This paper proposes an accuracy-controllable multiplier whose final product is generated by a carry-maskable adder. The proposed scheme can dynamically select the length of the carry propagation to satisfy the accuracy requirements flexibly. The partial product tree of the multiplier is approximated by the proposed tree compressor. An 8×8 multiplier design is implemented by employing the carry-maskable adder and the compressor. Compared with a conventional Wallace tree multiplier, the proposed multiplier reduced power consumption by between 47.3% and 56.2% and critical path delay by between 29.9% and 60.5%, depending on the required accuracy. Its silicon area was also 44.6% smaller. In addition, results from two image processing applications demonstrate that the quality of the processed images can be controlled by the proposed multiplier design.

  • Parallel Precomputation with Input Value Prediction for Model Predictive Control Systems

    Satoshi KAWAKAMI  Takatsugu ONO  Toshiyuki OHTSUKA  Koji INOUE  

     
    PAPER-Real-time Systems

      Publicized:
    2018/09/18
      Vol:
    E101-D No:12
      Page(s):
    2864-2877

    We propose a parallel precomputation method for real-time model predictive control. The key idea is to use predicted input values produced by model predictive control to solve an optimal control problem in advance. It is well known that control systems are not suitable for multi- or many-core processors because feedback-loop control systems are inherently based on sequential operations. However, since the proposed method does not rely on conventional thread-/data-level parallelism, it can be easily applied to such control systems without changing the algorithm in applications. A practical evaluation using three real-world model predictive control system simulation programs demonstrates drastic performance improvement without degrading control quality offered by the proposed method.

  • Extension and Performance/Accuracy Formulation for Optimal GeAr-Based Approximate Adder Designs

    Ken HAYAMIZU  Nozomu TOGAWA  Masao YANAGISAWA  Youhua SHI  

     
    PAPER

      Vol:
    E101-A No:7
      Page(s):
    1014-1024

    Approximate computing is a promising solution for future energy-efficient designs because it can provide great improvements in performance, area and/or energy consumption over traditional exact-computing designs for non-critical error-tolerant applications. However, the most challenging issue in designing approximate circuits is how to guarantee the pre-specified computation accuracy while achieving energy reduction and performance improvement. To address this problem, this paper starts from the state-of-the-art general approximate adder model (GeAr) and extends it for more possible approximate design candidates by relaxing the design restrictions. And then a maximum-error-distance-based performance/accuracy formulation, which can be used to select the performance/energy-accuracy optimal design from the extended design space, is proposed. Our evaluation results show the effectiveness of the proposed method in terms of area overhead, performance, energy consumption, and computation accuracy.

  • A Systematic Methodology for Design and Worst-Case Error Analysis of Approximate Array Multipliers

    Takahiro YAMAMOTO  Ittetsu TANIGUCHI  Hiroyuki TOMIYAMA  Shigeru YAMASHITA  Yuko HARA-AZUMI  

     
    LETTER

      Vol:
    E100-A No:7
      Page(s):
    1496-1499

    Approximate computing is considered a promising approach to the design of power- or area-efficient digital circuits. This paper proposes a systematic methodology for the design and worst-case accuracy analysis of approximate array multipliers. Our methodology systematically designs a series of approximate array multipliers with different area, delay, power and accuracy characteristics so that an LSI designer can select the one which best fits the requirements of her/his applications. Our experiments explore the trade-offs among area, delay, power and accuracy of the approximate multipliers.