1-17hit |
Yutaka MASUDA Jun NAGAYAMA TaiYu CHENG Tohru ISHIHARA Yoichi MOMIYAMA Masanori HASHIMOTO
This work proposes a design methodology that saves the power dissipation under voltage over-scaling (VOS) operation. The key idea of the proposed design methodology is to combine critical path isolation (CPI) and bit-width scaling (BWS) under the constraint of computational quality, e.g., Peak Signal-to-Noise Ratio (PSNR) in the image processing domain. Conventional CPI inherently cannot reduce the delay of intrinsic critical paths (CPs), which may significantly restrict the power saving effect. On the other hand, the proposed methodology tries to reduce both intrinsic and non-intrinsic CPs. Therefore, our design dramatically reduces the supply voltage and power dissipation while satisfying the quality constraint. Moreover, for reducing co-design exploration space, the proposed methodology utilizes the exclusiveness of the paths targeted by CPI and BWS, where CPI aims at reducing the minimum supply voltage of non-intrinsic CP, and BWS focuses on intrinsic CPs in arithmetic units. From this key exclusiveness, the proposed design splits the simultaneous optimization problem into three sub-problems; (1) the determination of bit-width reduction, (2) the timing optimization for non-intrinsic CPs, and (3) investigating the minimum supply voltage of the BWS and CPI-applied circuit under quality constraint, for reducing power dissipation. Thanks to the problem splitting, the proposed methodology can efficiently find quality-constrained minimum-power design. Evaluation results show that CPI and BWS are highly compatible, and they significantly enhance the efficacy of VOS. In a case study of a GPGPU processor, the proposed design saves the power dissipation by 42.7% with an image processing workload and by 51.2% with a neural network inference workload.
Teruo TANIMOTO Takatsugu ONO Koji INOUE
Correctly understanding microarchitectural bottlenecks is important to optimize performance and energy of OoO (Out-of-Order) processors. Although CPI (Cycles Per Instruction) stack has been utilized for this purpose, it stacks architectural events heuristically by counting how many times the events occur, and the order of stacking affects the result, which may be misleading. It is because CPI stack does not consider the execution path of dynamic instructions. Critical path analysis (CPA) is a well-known method to identify the critical execution path of dynamic instruction execution on OoO processors. The critical path consists of the sequence of events that determines the execution time of a program on a certain processor. We develop a novel representation of CPCI stack (Cycles Per Critical Instruction stack), which is CPI stack based on CPA. The main challenge in constructing CPCI stack is how to analyze a large number of paths because CPA often results in numerous critical paths. In this paper, we show that there are more than ten to the tenth power critical paths in the execution of only one thousand instructions in 35 benchmarks out of 48 from SPEC CPU2006. Then, we propose a statistical method to analyze all the critical paths and show a case study using the benchmarks.
Song BIAN Shumpei MORITA Michihiro SHINTANI Hiromitsu AWANO Masayuki HIROMOTO Takashi SATO
As technology further scales semiconductor devices, aging-induced device degradation has become one of the major threats to device reliability. In addition, aging mechanisms like the negative bias temperature instability (NBTI) are known to be sensitive to workload (i.e., signal probability) that is hard to be assumed at design phase. In this work, we analyze the workload dependence of NBTI degradation using a processor, and propose a novel technique to estimate the worst-case paths. In our approach, we exploit the fact that the deterministic nature of circuit structure limits the amount of NBTI degradation on different paths, and propose a two-stage path extraction algorithm to identify the invariant critical paths (ICPs) in the processor. Utilizing these paths, we also propose an optimization technique for the replacement of internal node control logic that mitigates the NBTI degradation in the design. Through numerical experiment on two processor designs, we achieved nearly 300x reduction in the sheer number of paths on both designs. Utilizing the extracted ICPs, we achieved 96x-197x speedup without loss in mitigation gain.
Jiongyao YE Yu WAN Takahiro WATANABE
Current trends in modern out-of-order processors involve implementing deeper pipelines and a large instruction window to achieve high performance, which lead to the penalty of the branch misprediction recovery being a critical factor in overall processor performance. Multi path execution is proposed to reduce this penalty by executing both paths following a branch, simultaneously. However, there are some drawbacks in this mechanism, such as design complexity caused by processing both paths after a branch and performance degradation due to hardware resource competition between two paths. In this paper, we propose a new recovery mechanism, called Recovery Critical Misprediction (RCM), to reduce the penalty of branch misprediction recovery. The mechanism uses a small trace cache to save the decoded instructions from the alternative path following a branch. Then, during the subsequent predictions, the trace cache is accessed. If there is a hit, the processor forks the second path of this branch at the renamed stage so that the design complexity in the fetch stage and decode stage is alleviated. The most contribution of this paper is that our proposed mechanism employs critical path prediction to identify the branches that will be most harmful if mispredicted. Only the critical branch can save its alternative path into the trace cache, which not only increases the usefulness of a limited size of trace cache but also avoids the performance degradation caused by the forked non-critical branch. Experimental results employing SPECint 2000 benchmark show that a processor with our proposed RCM improves IPC value by 10.05% compared with a conventional processor.
Critical path selection is very important in delay testing. Critical paths found by conventional static timing analysis (STA) tools are inadequate to represent the real timing of the circuit, since neither the testability of paths nor the statistical variation of cell delays caused by process variation is considered. This paper proposed a novel path selection method considering process variation. The circuit is firstly simplified by eliminating non-critical edges under statistical timing model, and then divided into sub-circuits, while each sub-circuit has only one prime input (PI) and one prime output (PO). Critical paths are selected only in critical sub-circuits. The concept of partially critical edges (PCEs) and completely critical edges (CCEs) are introduced to speed up the path selection procedure. Two path selection strategies are also presented to search for a testable critical path set to cover all the critical edges. The experimental results showed that the proposed circuit division approach is efficient in path number reduction, and PCEs and CCEs play an important role as a guideline during path selection.
Yoshiyuki KAWAKAMI Makoto TERAO Masahiro FUKUI Shuji TSUKIYAMA
With the advent of the deep submicron age, circuit performance is strongly impacted by process variations and the influence on the circuit delay to the power-supply voltage increases more and more due to CMOS feature size shrinkage. Power grid optimization which considers the timing error risk caused by the variations and IR drop becomes very important for stable and hi-speed operation of system-on-chip. Conventionally, a lot of power grid optimization algorithms have been proposed, and most of them use IR drop as their object functions. However, the IR drop is an indirect metric and we suspect that it is vague metric for the real goal of LSI design. In this paper, first, we propose an approach which uses the "timing error risk caused by IR drop" as a direct objective function. Second, the critical path map is introduced to express the existence of critical paths distributed in the entire chip. The timing error risk is decreased by using the critical path map and the new objective function. Some experimental results show the effectiveness.
Hyoun Soo PARK Wook KIM Dai Joon HYUN Young Hwan KIM
Block-based SSTA analyzes the timing variation of a chip caused by process variations effectively. However, block-based SSTA cannot identify critical nodes, nodes that highly influence the timing yield of a chip, used as the effective guidance of timing yield optimization. In this paper, we propose a new timing criticality to identify those nodes, referred to as the timing yield criticality (TYC). The proposed TYC is defined as the change in the timing yield, which is induced by the change in the mean arrival time at a node. For efficiency, we estimate the TYC through linear approximation instead of propagating the changed arrival time at a node to its fanouts. In experiments using the ISCAS 85 benchmark circuits, the proposed method estimated TYCs with the expense of 9.8% of the runtime for the exact computation. The proposed method identified the node that gives the greatest effect on the timing yield in all benchmark circuits, except C6288, while existing methods did not identify that for any circuit. In addition, the proposed method identified 98.4% of the critical nodes in the top 1% in the effect on the timing yield, while existing methods identified only about 10%.
Wei-min WANG Du-yan BI Xing-min DU Lin-hua MA
A novel high-speed and area-efficient Reed-Solomon decoder is proposed, which employs pipelining architecture of minimized modified Euclid (ME) algorithm. The logic synthesis and simulation results of its VLSI implementation show that it not only can operate at a higher clock frequency, but also consumes fewer hardware resources.
Feng CHENG Junfa MAO Xiaochun LI
A timing-driven placement algorithm based on path topology analysis is presented. The optimization for path delay is transformed into cell location optimization. The algorithm pays much attention on path topologies and applies an effective force directed method to find cell target locations. Total wire length optimization is combined with the timing-driven placement algorithm. MCNC (Microelectronics Centre of North-Carolina) standard cell benchmarks are experimented and results show that our timing-driven placement algorithm can make the longest path delay improve up to 13% compared with wirelength driven placement.
Toshinori SATO Akihiro CHIYONOBU
Power consumption is a major concern in embedded microprocessors design. Reducing power has also been a critical design goal for general-purpose microprocessors. Since they require high performance as well as low power, power reduction at the cost of performance cannot be accepted. There are a lot of device-level techniques that reduce power with maintaining performance. They select non-critical paths as candidates for low-power design, and performance-oriented design is used only in speed-critical paths. The same philosophy can be applied to architectural-level design. We evaluate a technique, which exploits dynamic information regarding instruction criticality in order to reduce power. We evaluate an instruction steering policy for a clustered microarchitecture, which is based on instruction criticality, and find it is substantially energy-efficient while it suffers performance degradation.
Jing-Jia LIOU Li-C. WANG Angela KRSTIĆ Kwang-Ting (Tim) CHENG
Critical path selection is an indispensable step for AC delay test and timing validation. Traditionally, this step relies on the construction of a set of worse-case paths based upon discrete timing models. However, the assumption of discrete timing models can be invalidated by timing defects and process variation in the deep sub-micron domain, which are often continuous in nature. As a result, critical paths defined in a traditional timing analysis approach may not be truly critical in reality. In this paper, we propose using a statistical delay evaluation framework for estimating the quality of a path set. Based upon the new framework, we demonstrate how the traditional definition of a critical path set may deviate from the true critical path set in the deep sub-micron domain. To remedy the problem, we discuss improvements to the existing path selection strategies by including new objectives. We then compare statistical approaches with traditional approaches based upon experimental analysis of both defect-free and defect-injected cases.
Sandeep KULKARNI Bezawada BRUHADESHWAR
In this paper, we focus on the problem of secure multicast in dynamic groups. In this problem, a group of users communicate using a shared group key. Due to the dynamic nature of these groups, to preserve secrecy, it is necessary to change the group key whenever the group membership changes. While the group key is being changed, the group communication needs to be interrupted until the rekeying is complete. This interruption is especially necessary if the rekeying is done because a user has left (or is removed). We split the rekeying cost into two parts: the cost of the critical path--where each user receives the new group key, and the cost of the non-critical path--where each user receives any other keys that it needs to obtain. We present a family of algorithms that show the tradeoff between the cost of the critical path and the cost of the non-critical path. Our solutions allow the group controller to choose the appropriate algorithm for key distribution by considering the requirements on critical and non-critical cost. In our solutions, the group controller can dynamically change the algorithm for key distribution to adapt to changing application requirements. Moreover, we argue that our solutions allow the group controller to effectively manage heterogeneous groups where users have different requirements/capabilities.
Masashi MIZUNO James OKELLO Hiroshi OCHI
In this paper, we propose a pipelined architecture for an equalizer based on the Multilevel Modified Constant Modulus Algorithm (MMCMA). We also provide the correction factor that mathematically converts the proposed pipelined adaptive equalizer into an equivalent non-pipelined conventional MMCMA based equalizer. The proposed method of pipelining uses modules with 6 filter coefficients, resulting in an overall latency of a single sampling period, along the main transmission line. The basic concept of the proposed architecture is to implement the Finite Impulse Response (FIR) filter and the algorithm portion of the adaptive equalizer, such that the critical path of the whole circuit has a maximum of three complex multipliers and three adders.
This paper presents a heuristic algorithm that minimizes the delay of the given circuit through a two-pass cell selection in cell-based design. First, we introduce a new graph, called candidate web, which conveniently represents all cell combinations available for the implementation of the given circuit. We, then, present an efficient method to obtain a tentative set of optimal cells, while estimating the delay of the longest path between each cell and the primary output on the candidate web. In this step, multiple cells are allowed to bind the same logic gate. Finally, we describe how the proposed approach actually selects the optimal cells from the tentative set, which would minimize the circuit delay. Experimental results on a set of benchmarks show that the proposed approach is effective and efficient in minimizing the delay of the given circuit.
This paper presents an event-driven approach to the timing verification of latch-synchronized systems. The proposed method performs critical path extraction and timing error detection at the same time, and extracts the critical path only if necessary. By doing so, the complexity of analysis is reduced and efficiency is greatly improved over the conventional approaches which detect timing errors after extracting the complete critical paths of the system. Experimental results show that, compared to the existing methods, it provides a more than 12-fold improvement in speed on the average for ISCAS benchmark circuits, and the relative efficiency of analysis improves as the circuit size grows.
This paper describes new problems in delay analysis for high-performance LSI design and presents a static delay analysis tool PCHECK. PCHECK is characterized by (1) a new critical path trace algorithm for avoiding the error caused by signal transient time and (2) a precise delay calculation model for resistive shielding. Experimental results show that the delay calculation error in the worst case is less than 20 ps.
Ikuo HARADA Yuichiro TAKEI Hitoshi KITAZAWA
A timing-driven global routing algorithm is proposed that directly models the path-based timing constraints. By keeping track of the critical path delay and channel densities, and using novel heuristic criteria, it can select routing paths that minimize area as well as satisfy the timing constraints. Using bipolar-specific features, this router can be used to design LSI chips that handle signals with speeds greater that a gigabit per second. Experimental results shows an average delay improvement of 17.6%.