Takeshi KAKEHI Ryoichi SHINKUMA Tutomu MURASE Gen MOTOYOSHI Kyoko YAMORI Tatsuro TAKAHASHI
The market growths of smart-phones and thin clients have been significantly increasing communication traffic in mobile networks. To handle the increased traffic, network operators should consider how to leverage distributed wireless resources such as distributed spots of wireless local access networks. In this paper, we consider the system where multiple moving users share distributed wireless access points on their traveling routes between their start and goal points and formulate as an optimization problem. Then, we come up with three algorithms as a solution for the problem. The key idea here is 'longcut route instruction', in which users are instructed to choose a traveling route where less congested access points are available; even if the moving distance increases, the throughput for users in the system would improve. In this paper, we define the gain function. Moreover, we analyze the basic characteristics of the system using as a simple model as possible.
Antoine TROUVE Kazuaki MURAKAMI
This article introduces some improvements to the already proposed custom instruction candidates selection for the automatic ISA customisation problem targeting reconfigurable processors. It introduces new opportunities to prune the search space, and a technique based on dynamic programming to check the independence between groups. The proposed new algorithm yields one order less measured number of convexity checks than the related work for the same inputs and outputs.
Chao YAN Hongjun DAI Tianzhou CHEN
Soft error has become an increasingly significant concern in modern micro-processor design, it is reported that the instruction-level temporal redundancy in out-of-order cores suffers an performance degradation up to 45%. In this work, we propose a fault tolerant architecture with fast error correcting codes (such as the two-dimensional code) based on double execution. Experimental results show that our scheme can gain back IPC loss between 9.1% and 10.2%, with an average around 9.2% compared with the conventional double execution architecture.
Many scientific applications require efficient variable-precision floating-point arithmetic. This paper presents a special-purpose Very Large Instruction Word (VLIW) architecture for variable precision floating-point arithmetic (VV-Processor) on FPGA. The proposed processor uses a unified hardware structure, equipped with multiple custom variable-precision arithmetic units, to implement various variable-precision algebraic and transcendental functions. The performance is improved through the explicitly parallel technology of VLIW instruction and by dynamically varying the precision of intermediate computation. We take division and exponential function as examples to illustrate the design of variable-precision elementary algorithms in VV-Processor. Finally, we create a prototype of VV-Processor unit on a Xilinx XC6VLX760-2FF1760 FPGA chip. The experimental results show that one VV-Processor unit, running at 253 MHz, outperforms the approach of a software-based library running on an Intel Core i3 530 CPU at 2.93 GHz by a factor of 5X-37X for basic variable-precision arithmetic operations and elementary functions.
Jiongyao YE Yingtao HU Hongfeng DING Takahiro WATANABE
Power consumption has become an increasing concern in high performance microprocessor design. Especially, Instruction Cache (I-Cache) contributes a large portion of the total power consumption in a microprocessor, since it is a complex unit and is accessed very frequently. Several studies on low-power design have been presented for the power-efficient cache design. However, these techniques usually suffer from the restrictions in the traditional Instruction Fetch Unit (IFU) architectures where the fetch address needs to be sent to I-Cache once it is available. Therefore, work to reduce the power consumption is limited after the address generation and before starting an access. In this paper, we present a new power-aware IFU architecture, named Analysis Before Starting an Access (ABSA), which aims at maximizing the power efficiency of the low-power designs by eliminating the restrictions on those low-power designs of the traditional IFU. To achieve this goal, ABSA reorganizes the IFU pipeline and carefully assigns tasks for each stages so that sufficient time and information can be provided for the low-power techniques to maximize the power efficiency before starting an access. The proposed design is fully scalable and its cost is low. Compared to a conventional IFU design, simulation results show that ABSA saves about 30.3% fetch power consumption, on average. I-Cache employed by ABSA reduces both static and dynamic power consumptions about 85.63% and 66.92%, respectively. Meanwhile the performance degradation is only about 0.97%.
To reduce the huge search space when customizing accelerators for the application specific instruction-set processor (ASIP), this paper proposes an automated customization method based on the data flow graph exploration. This method integrates the instruction identification and selection using an iterative improvement strategy, which uses a seed-growth algorithm to select the valid patterns that can bring higher performance enhancement. The search space is reduced by considering the performance factors during the identification stage. The experimental results indicate that the proposed method is feasible enough compared to the previous exhaustive algorithms.
Osamu NISHII Yoichi YUYAMA Masayuki ITO Yoshikazu KIYOSHIGE Yusuke NITTA Makoto ISHIKAWA Tetsuya YAMADA Junichi MIYAKOSHI Yasutaka WADA Keiji KIMURA Hironori KASAHARA Hideo MAEJIMA
We built a 12.4 mm12.4 mm, 45-nm CMOS, chip that integrates eight 648-MHz general purpose cores, two matrix processor (MX-2) cores, four flexible engine (FE) cores and media IP (VPU5) to establish heterogeneous multi-core chip architecture. The general purpose core had its IPC (instructions per cycle) performance enhanced by adding 32-bit instructions to the existing 16-bit fixed-length instruction set and executing up to two 32-bit instructions per cycle. Considering these five-to-seven years of embedded LSI and increasing trend of access-master within LSI, we predict that the memory usage of single core will not exceed 32-bit physical area (i.e. 4 GB), but chip-total memory usage will exceed 4 GB. Based on this prediction, the physical address was expanded from 32-bit to 40-bit. The fabricated chip was tested and a parallel operation of eight general purpose cores and four FE cores and eight data transfer units (DTU) is obtained on AAC (Advanced Audio Coding) encode processing.
Kazuhiro YOSHIMURA Takuya IWAKAMI Takashi NAKADA Jun YAO Hajime SHIMADA Yasuhiko NAKASHIMA
Recently, we have proposed using a Linear Array Pipeline Processor (LAPP) to improve energy efficiency for various workloads such as image processing and to maintain programmability by working on VLIW codes. In this paper, we proposed an instruction mapping scheme for LAPP to fully exploit the array execution of functional units (FUs) and bypass networks by a mapper to fit the VLIW codes onto the FUs. The mapping can be finished within multi-cycles during a data prefetch before the array execution of FUs. According to an HDL based implementation, the hardware required for mapping scheme is 84% of the cost introduced by a baseline method. In addition, the proposed mapper can further help to shrink the size of array stage, as our results show that their combination becomes 88% of the baseline model in area.
Hao BAI Chang-zhen HU Gang ZHANG Xiao-chuan JING Ning LI
The letter proposes a novel binary vulnerability analyzer for executable programs that is based on the Hidden Markov Model. A vulnerability instruction library (VIL) is primarily constructed by collecting binary frames located by double precision analysis. Executable programs are then converted into structurized code sequences with the VIL. The code sequences are essentially context-sensitive, which can be modeled by Hidden Markov Model (HMM). Finally, the HMM based vulnerability analyzer is built to recognize potential vulnerabilities of executable programs. Experimental results show the proposed approach achieves lower false positive/negative rate than latest static analyzers.
Two-step physical register deallocation (TSD) is an architectural scheme that enhances memory-level parallelism (MLP) by pre-executing instructions. Ideally, TSD allows exploitation of MLP under an unlimited number of physical registers, and consequently only a small register file is needed for MLP. In practice, however, the amount of MLP exploitable is limited, because there are cases where either 1) pre-execution is not performed; or 2) the timing of pre-execution is delayed. Both are due to data dependencies among the pre-executed instructions. This paper proposes the use of value prediction to solve these problems. This paper proposes the use of value prediction to solve these problems. Evaluation results using the SPECfp2000 benchmark confirm that the proposed scheme with value prediction for predicting addresses achieves equivalent IPC, with a smaller register file, to the previous TSD scheme. The reduction rate of the register file size is 21%.
Sung-Rae LEE Ser-Hoon LEE Sun-Young HWANG
This paper presents an efficient instruction scheduling algorithm which generates low-power codes for embedded system applications. Reordering and recoding are concurrently applied for low-power code generation in the proposed algorithm. By appropriate reordering of instruction sequences, the efficiency of instruction recoding is increased. The proposed algorithm constructs program codes on a basic-block basis by selecting a code sequence from among the schedules generated randomly and maintained by the system. By generating random schedules for each of the basic blocks constituting an application program, the proposed algorithm constructs a histogram graph for each of the instruction fields to estimate the figure-of-merits achievable by reordering instruction sequences. For further optimization, the system performs simulated annealing on the generated code. Experimental results for benchmark programs show that the codes generated by the proposed algorithm consume 37.2% less power on average when compared to the previous algorithm which performs list scheduling prior to instruction recoding.
Takuji HIEDA Hiroaki TANAKA Keishi SAKANUSHI Yoshinori TAKEUCHI Masaharu IMAI
Partial forwarding is a design method to place forwarding paths on a part of processor pipeline. Hardware cost of processor can be reduced without performance loss by partial forwarding. However, compiler with the instruction scheduler which considers partial forwarding structure of the target processor is required since conventional scheduling algorithm cannot make the most of partial forwarding structure. In this paper, we propose a heuristic instruction scheduling method for processors with partial forwarding structure. The proposed algorithm uses available distance to schedule instructions which are suitable for the target partial forwarding processor. Experimental results show that the proposed method generates near-optimal solutions in practical time and some of the optimized codes for partial forwarding processor run in the shortest time among the target processors. It also shows that the proposed method is superior to hazard detection unit.
Farhad MEHDIPOUR Hamid NOORI Koji INOUE Kazuaki MURAKAMI
Multitude parameters in the design process of a reconfigurable instruction-set processor (RISP) may lead to a large design space and remarkable complexity. Quantitative design approach uses the data collected from applications to satisfy design constraints and optimize the design goals while considering the applications' characteristics; however it highly depends on designer observations and analyses. Exploring design space can be considered as an effective technique to find a proper balance among various design parameters. Indeed, this approach would be computationally expensive when the performance evaluation of the design points is accomplished based on the synthesis-and-simulation technique. A combined analytical and simulation-based model (CAnSO**) is proposed and validated for performance evaluation of a typical RISP. The proposed model consists of an analytical core that incorporates statistics collected from cycle-accurate simulation to make a reasonable evaluation and provide a valuable insight. CAnSO has clear speed advantages and therefore it can be used for easing a cumbersome design space exploration of a reconfigurable RISP processor and quick performance evaluation of slightly modified architectures.
Kazunaga HYODO Kengo IWAMOTO Hideki ANDO
Instruction pre-execution is an effective way to prefetch data. We previously proposed an instruction pre-execution scheme, which we call two-step physical register deallocation (TSD). The TSD realizes pre-execution by exploiting the difference between the amount of instruction-level parallelism available with an unlimited number of physical registers and that available with an actual number of physical registers. Although previous TSD study has successfully improved performance, it still has an inefficient energy consumption. This is because attempts are made for instructions to be pre-executed as much as possible, independently of whether or not they can significantly contribute to load latency reduction, allowing for maximal performance improvement. This paper presents a scheme that improves the energy efficiency of the TSD by pre-executing only those instructions that have great benefit. Our evaluation results using the SPECfp2000 benchmark show that our scheme reduces the dynamic pre-executed instruction count by 76%, compared with the original scheme. This reduction saves 7% energy consumption of the execution core with 2% overhead. Performance degrades by 2%, compared with that of the original scheme, but is still 15% higher than that of the normal processor without the TSD.
Iver STUBDAL Arda KARADUMAN Hideharu AMANO
Code density is often a critical issue in embedded computers, since the memory size of embedded systems is strictly limited. Echo instructions have been proposed as a method for reducing code size. This paper presents a new type of echo instruction, split echo, and evaluates an implementation of both split echo and traditional echo instructions on a MIPS R3000 based processor. Evaluation results show that memory requirement is reduced by 12% on average with small additional hardware cost.
Gi-Ho PARK Jung-Wook PARK Hoi-Jin LEE Gunok JUNG Sung-Bae PARK Shin-Dug KIM
This paper presents a cache way enabling mechanism using branch target addresses. This mechanism uses branch prediction information to avoid the power consumption due to unnecessary cache way access by enabling only the cache way(s) that should be accessed. The proposed cache way enabling mechanism reduces the power consumption of the instruction cache by 63% without any performance degradation of the processor. An ARM1136 processor simulator and the Synopsys PrimeTime are used to perform the performance/power simulation and static timing analysis of the proposed mechanisms respectively.
The deep submicron semiconductor technologies will make the worst-case design impossible, since they can not provide design margins that it requires. We are investigating a typical-case design methodology, which we call the Constructive Timing Violation (CTV). This paper extends the CTV concept to collapse dependent instructions, resulting in performance improvement. Based on detailed simulations, we find the proposed mechanism effectively collapses dependent instructions.
Modern microprocessors achieve high application performance at the acceptable level of power dissipation. In terms of power to performance trade-off, the instruction window is particularly important. This is because enlarging the window size achieves high performance but naive scaling of the conventional instruction window can severely increase the complexity and power consumption. In this paper, we propose low-power instruction window techniques for contemporary microprocessors. First, the small reorder buffer (SROB) reduces power dissipation by deferred allocation and early release. The deferred allocation delays the SROB allocation of instructions until their all data dependencies are resolved. Then, the instructions are executed in program order and they are released faster from the SROB. This results in higher resource utilization and low power consumption. Second, we replace a conventional issue queue by a direct lookup table (DLT) with an efficient tag translation technique. The translation scheme resolves the instruction dependency, especially for the case of one producer to multiple consumers. The efficiency of the translation scheme stems from the fact that the vast majority of instruction dependency exists within a basic block. Experimental results show that our proposed design reduces the power consumption significantly for SPEC2000 benchmarks.
Ikkyun KIM Koohong KANG Yangseo CHOI Daewon KIM Jintae OH Jongsoo JANG Kijun HAN
The ability to recognize quickly inside network flows to be executable is prerequisite for malware detection. For this purpose, we introduce an instruction transition probability matrix (ITPX) which is comprised of the IA-32 instruction sets and reveals the characteristics of executable code's instruction transition patterns. And then, we propose a simple algorithm to detect executable code inside network flows using a reference ITPX which is learned from the known Windows Portable Executable files. We have tested the algorithm with more than thousands of executable and non-executable codes. The results show that it is very promising enough to use in real world.
Soong Hyun SHIN Sung Woo CHUNG Eui-Young CHUNG Chu Shik JHON
As technology scales down, leakage energy accounts for a greater proportion of total energy. Applying the drowsy technique to a cache, is regarded as one of the most efficient techniques for reducing leakage energy. However, it increases the Soft Error Rate (SER), thus, many researchers doubt the reliability of the drowsy technique. In this paper, we show several reasons why the instruction cache can adopt the drowsy technique without reliability problems. First, an instruction cache always stores read-only data, leading to soft error recovery by re-fetching the instructions from lower level memory. Second, the effect of the re-fetching caused by soft errors on performance is negligible. Additionally, a considerable percentage of soft errors can occur without harming the performance. Lastly, unrecoverable soft errors can be controlled by the scrubbing method. The simulation results show that the drowsy instruction cache rarely increases the rate of unrecoverable errors and negligibly degrades the performance.