
Keyword Search Result

[Keyword] instruction (79 hits)

Hits 1-20 of 79

  • A Principal Factor of Performance in Decoupled Front-End

    Yuya DEGAWA  Toru KOIZUMI  Tomoki NAKAMURA  Ryota SHIOYA  Junichiro KADOMOTO  Hidetsugu IRIE  Shuichi SAKAI  

     
    PAPER

      Publicized:
    2023/06/30
      Vol:
    E106-D No:12
      Page(s):
    1960-1968

    One of the performance bottlenecks of a processor is the front-end that supplies instructions. Various techniques, such as cache replacement algorithms and hardware prefetching, have been investigated to facilitate smooth instruction supply at the front-end and to improve processor performance. In these approaches, one of the most important factors has been the reduction in the number of instruction cache misses. By using the number of instruction cache misses or derived factors, previous studies have explained the performance improvements achieved by their proposed methods. However, we found that the number of instruction cache misses does not always explain performance changes well in modern processors. This is because the front-end in modern processors handles subsequent instruction cache misses in overlap with earlier ones. Based on this observation, we propose a novel factor: the number of miss regions. We define a region as a sequence of instructions from one branch misprediction to the next, while we define a miss region as a region that contains one or more instruction cache misses. At the boundary of each region, the pipeline is flushed owing to a branch misprediction. Thus, cache misses after this boundary are not handled in overlap with cache misses before the boundary. As a result, the number of miss regions is equal to the number of cache misses that are processed without overlap. In this paper, we demonstrate that the number of miss regions can well explain the variation in performance through mathematical models and simulation results. The results show that the model explains cycles per instruction with an average error of 1.0% and maximum error of 4.1% when applying an existing prefetcher to the instruction cache. The idea of miss regions highlights that instruction cache misses and branch mispredictions interact with each other in processors with a decoupled front-end. We hope that considering this interaction will motivate the development of fast performance estimation methods and new microarchitectural methods.
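
    As a rough illustration of the miss-region definition in this abstract, the following Python sketch counts miss regions in an event trace. The trace format (a sequence of the strings `"miss"` and `"mispredict"`) and the function name `count_miss_regions` are assumptions made for this example; they are not taken from the paper, and the sketch does not reproduce the paper's performance model.

```python
from typing import Iterable


def count_miss_regions(events: Iterable[str]) -> int:
    """Count regions that contain at least one instruction cache miss.

    A region runs from one branch misprediction to the next.
    `events` is a hypothetical trace: "miss" marks an instruction cache
    miss, "mispredict" marks a branch misprediction (a region boundary),
    and any other string is ignored.
    """
    miss_regions = 0
    region_has_miss = False
    for event in events:
        if event == "miss":
            region_has_miss = True
        elif event == "mispredict":
            # The boundary closes the current region.
            if region_has_miss:
                miss_regions += 1
            region_has_miss = False
    # Account for a trailing region that was never closed by a misprediction.
    if region_has_miss:
        miss_regions += 1
    return miss_regions


trace = ["miss", "miss", "mispredict", "other", "mispredict", "miss"]
print(count_miss_regions(trace))
```

    In this trace there are three cache misses but only two miss regions, because the first two misses fall in the same region and would be handled in overlap.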

  • An Accuracy Reconfigurable Vector Accelerator based on Approximate Logarithmic Multipliers for Energy-Efficient Computing

    Lingxiao HOU  Yutaka MASUDA  Tohru ISHIHARA  

     
    PAPER

      Publicized:
    2022/09/02
      Vol:
    E106-A No:3
      Page(s):
    532-541

    The approximate logarithmic multiplier proposed by Mitchell provides an efficient alternative for processing dense multiplication or multiply-accumulate operations in applications such as image processing and real-time robotics. It offers the advantages of small area and high energy efficiency and is suitable for applications that do not necessarily require high accuracy. However, its maximum error of 11.1% makes it challenging to deploy in applications requiring relatively high accuracy. This paper proposes a novel operand decomposition (OD) method that decomposes one multiplication into the sum of multiple approximate logarithmic multiplications to substantially reduce Mitchell multiplier errors while taking full advantage of its area savings. Based on the proposed OD method, this paper also proposes an accuracy reconfigurable multiply-accumulate (MAC) unit that provides multiple reconfigurable accuracies with high parallelism. Compared to a MAC unit consisting of accurate multipliers, the area is significantly reduced, to less than half, improving the hardware parallelism while satisfying the required accuracy for various scenarios. The experimental results show the excellent applicability of our proposed MAC unit to image smoothing and robot localization and mapping applications. We have also designed a prototype processor that integrates the minimum functionality of this MAC unit as a vector accelerator and have implemented a software-level accuracy reconfiguration in the form of an instruction set extension. We experimentally confirmed the correct operation of the proposed vector accelerator, which provides different degrees of accuracy and parallelism at the software level.
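
    For readers unfamiliar with Mitchell's approximation, the sketch below shows the baseline technique only: each operand is written as 2^k (1 + x) with 0 <= x < 1, log2 is approximated by k + x, the approximate logarithms are added, and the antilogarithm is approximated the same way. The function name and the use of floating-point arithmetic are assumptions for illustration; this is not the paper's OD method or its hardware design.

```python
def mitchell_multiply(a: int, b: int) -> float:
    """Approximate a * b for non-negative integers using Mitchell's method."""
    if a == 0 or b == 0:
        return 0.0
    # Characteristic: position of the leading one bit.
    k1 = a.bit_length() - 1
    k2 = b.bit_length() - 1
    # Mantissa fraction in [0, 1): a = 2**k1 * (1 + x1).
    x1 = a / (1 << k1) - 1.0
    x2 = b / (1 << k2) - 1.0
    s = x1 + x2
    if s < 1.0:
        # No carry out of the fraction: 2**(k1 + k2) * (1 + x1 + x2).
        return (1 << (k1 + k2)) * (1.0 + s)
    # Carry into the characteristic: 2**(k1 + k2 + 1) * (x1 + x2).
    return (1 << (k1 + k2 + 1)) * s


# Worst case occurs when both fractions are 0.5, e.g. 3 * 3.
exact = 3 * 3
approx = mitchell_multiply(3, 3)
print(approx, (exact - approx) / exact)
```

    The 3 * 3 case yields 8 instead of 9, a relative error of 1/9 (about 11.1%), which matches the maximum error quoted in the abstract.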

  • Flow Processing Optimization with Accelerated Flow Actions on High Speed Programmable Data Plane

    Zhiyuan LING  Xiao CHEN  Lei SONG  

     
    PAPER-Network System

      Publicized:
    2022/08/10
      Vol:
    E106-B No:2
      Page(s):
    133-144

    With the development of network technology, next-generation networks must satisfy many new requirements for network functions and performance. The processing of overlong packet fields is one of the requirements and is also the basis for ID-based routing and content lookup, and packet field addition/deletion mechanisms. The current SDN switches do not provide good support for the processing of overlong fields. In this paper, we propose a series of optimization mechanisms for protocol-oblivious instructions, in which we address the problem of insufficient support for overlong data in existing SDN switches by extending the bit width of instructions and accelerating them using SIMD instruction sets. We also provide an intermediate representation of the protocol-oblivious instruction set to improve the efficiency of storing and reading instruction blocks, and further reduce the execution time of instruction blocks by preprocessing them. The experiments show that our approach improves the performance of overlong data processing by 56%. For instructions involving packet field addition and deletion, the improvement in performance reaches 455%. In normal forwarding scenarios, our solution reduces the packet forwarding latency by around 30%.

  • Reducing Energy Consumption of Wakeup Logic through Double-Stage Tag Comparison

    Yasutaka MATSUDA  Ryota SHIOYA  Hideki ANDO  

     
    PAPER-Computer System

      Publicized:
    2021/11/02
      Vol:
    E105-D No:2
      Page(s):
    320-332

    The high energy consumption of current processors causes several problems, including a limited clock frequency, short battery lifetime, and reduced device reliability. It is therefore important to reduce the energy consumption of the processor. Among resources in a processor, the issue queue (IQ) is a large consumer of energy, much of which is consumed by the wakeup logic. Within the wakeup logic, the tag comparison that checks source operand readiness consumes a significant amount of energy. This paper proposes an energy reduction scheme for tag comparison, called double-stage tag comparison. This scheme first compares the lower bits of the tag and then, only if these match, compares the higher bits. Because the energy consumption of tag comparison is roughly proportional to the total number of bits compared, energy is saved by reducing this number. However, this sequential comparison increases the delay of the IQ, thereby increasing the clock cycle time. Although this can be avoided by allocating an extra cycle to the issue operation, this in turn degrades the IPC. To avoid IPC degradation, we reconfigure a small number of entries in the IQ, where several oldest instructions that are likely to have an adverse effect on performance reside, to a single stage for tag comparison. Our evaluation results for SPEC2017 benchmark programs show that the double-stage tag comparison achieves on average a 21% reduction in the energy consumed by the wakeup logic (15% when including the overhead) with only 3.0% performance degradation.
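
    The two-step comparison described in this abstract can be sketched in software as follows. The tag width (8 bits), the split point (4 lower bits), and the function name are hypothetical values chosen for the example; the paper's actual circuit parameters are not given in the abstract. The sketch also returns the number of bits compared, as a stand-in for the energy argument.

```python
from typing import Tuple


def double_stage_match(tag: int, src: int,
                       tag_bits: int = 8, low_bits: int = 4) -> Tuple[bool, int]:
    """Compare a destination tag with a source tag in two stages.

    Returns (matched, bits_compared). The lower `low_bits` bits are
    compared first; the higher bits are compared only if those match.
    """
    low_mask = (1 << low_bits) - 1
    bits_compared = low_bits
    if (tag & low_mask) != (src & low_mask):
        # First stage mismatch: the higher bits are never examined.
        return False, bits_compared
    # Second stage: compare the remaining higher bits.
    bits_compared += tag_bits - low_bits
    return (tag >> low_bits) == (src >> low_bits), bits_compared


print(double_stage_match(0b10100011, 0b10100011))  # full match
print(double_stage_match(0b10100011, 0b10100010))  # lower bits differ
print(double_stage_match(0b10100011, 0b00100011))  # only higher bits differ
```

    Most comparisons in a wakeup broadcast are mismatches, so under these assumptions many of them stop after the first stage and touch only the lower bits.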

  • Instruction Prefetch for Improving GPGPU Performance

    Jianli CAO  Zhikui CHEN  Yuxin WANG  He GUO  Pengcheng WANG  

     
    PAPER-VLSI Design Technology and CAD

      Publicized:
    2020/11/16
      Vol:
    E104-A No:5
      Page(s):
    773-785

    Like many processors, GPGPUs suffer from the memory wall. The traditional solution for this issue is to use efficient schedulers to hide long memory access latency or to use a data prefetch mechanism to reduce the latency caused by data transfer. In this paper, we study the instruction fetch stage of the GPU pipeline and analyze the relationship between the capacity of the GPU kernel and the instruction miss rate. We improve the next-line prefetch mechanism to fit the SIMT model of the GPU and determine the optimal parameters of the prefetch mechanism on the GPU through experiments. The experimental results show that the prefetch mechanism can achieve a 12.17% performance improvement on average. Compared with the solution of enlarging the I-Cache, the prefetch mechanism has the advantages of more beneficiaries and lower cost.

  • Instruction Filters for Mitigating Attacks on Instruction Emulation in Hypervisors

    Kenta ISHIGURO  Kenji KONO  

     
    PAPER-Dependable Computing

      Publicized:
    2020/04/06
      Vol:
    E103-D No:7
      Page(s):
    1660-1671

    Vulnerabilities in hypervisors are crucial in multi-tenant clouds and attractive for attackers because a vulnerability in the hypervisor can undermine all the virtual machine (VM) security. This paper focuses on vulnerabilities in instruction emulators inside hypervisors. Vulnerabilities in instruction emulators are not rare; CVE-2017-2583, CVE-2016-9756, CVE-2015-0239, CVE-2014-3647, to name a few. For backward compatibility with legacy x86 CPUs, conventional hypervisors emulate arbitrary instructions at any time if requested. This design leads to a large attack surface, making it hard to get rid of vulnerabilities in the emulator. This paper proposes FWinst, which narrows the attack surface against vulnerabilities in the emulator. The key insight behind FWinst is that the emulator should emulate only a small subset of instructions, depending on the underlying CPU micro-architecture and the hypervisor configuration. FWinst recognizes emulation contexts in which the instruction emulator is invoked, and identifies a legitimate subset of instructions that are allowed to be emulated in the current context. By filtering out illegitimate instructions, FWinst narrows the attack surface. In particular, FWinst is effective on recent x86 micro-architectures because the legitimate subset becomes very small. Our experimental results demonstrate that FWinst prevents existing vulnerabilities in the emulator from being exploited on Westmere and Skylake micro-architectures, and that the runtime overhead is negligible.

  • Anomaly Detection of Folding Operations for Origami Instruction with Single Camera

    Hiroshi SHIMANUKI  Toyohide WATANABE  Koichi ASAKURA  Hideki SATO  Taketoshi USHIAMA  

     
    PAPER-Pattern Recognition

      Publicized:
    2020/02/25
      Vol:
    E103-D No:5
      Page(s):
    1088-1098

    When people learn a handicraft with instructional content such as books, videos, and web pages, many of them give up halfway because the content does not always make clear how to make it. This study aims to provide origami learners, especially beginners, with feedback on their folding operations. An approach for recognizing the state of the learner by using a single top-view camera and pointing out the mistakes made during the origami folding operation is proposed. First, an instruction model that stores easy-to-follow folding operations is defined. Second, a method for recognizing the state of the learner's origami paper sheet is proposed. Third, a method for detecting mistakes made by the learner by means of anomaly detection using a one-class support vector machine (one-class SVM) classifier (using the folding progress and the difference between the learner's origami shape and the correct shape) is proposed. Because noise exists in the camera images due to shadows and occlusions caused by the learner's hands, the shapes of the origami sheet are not always extracted accurately. To train the one-class SVM classifier with high accuracy, a data cleansing method that automatically sifts out noisy video frames is proposed. Moreover, using the statistics of features extracted from the frames in a sliding window makes it possible to reduce the influence of the noise. The proposed method was experimentally demonstrated to be sufficiently accurate and robust against noise, and its false alarm rate (false positive rate) can be reduced to zero. Requiring only a single camera and common origami paper, the proposed method makes it possible to monitor mistakes made by origami learners and support their self-learning.

  • Laser-Induced Controllable Instruction Replacement Fault Attack (Open Access)

    Junichi SAKAMOTO  Daisuke FUJIMOTO  Tsutomu MATSUMOTO  

     
    PAPER

      Vol:
    E103-A No:1
      Page(s):
    11-20

    To develop countermeasures against fault attacks, it is important to model an attacker's ability. The instruction skip model is a well-studied practical model for fault attacks on software. Contrastingly, few studies have investigated the instruction replacement model, which is a generalization of the instruction skip model, because replacing an instruction with a desired one is considered difficult. Some previous studies have reported successful instruction replacements; however, those studies concluded that such instruction replacements are not practical attacks because the outcomes of the replacements are uncontrollable. This paper proposes the concept of a controllable instruction replacement technique that uses the laser irradiation of flash memory. The feasibility of the proposed technique is demonstrated experimentally using a smartcard-type ARM SC100 microcontroller. Then, practical cryptosystem attacks that exploit the proposed technique are investigated. The targeted cryptosystems employ the AES with software-based anti-fault countermeasures. We demonstrate that an existing anti-instruction-skip countermeasure can be circumvented by replacing a critical instruction, e.g., a branch instruction to detect fault occurrence.

  • An ASIC Crypto Processor for 254-Bit Prime-Field Pairing Featuring Programmable Arithmetic Core Optimized for Quadratic Extension Field

    Hiromitsu AWANO  Tadayuki ICHIHASHI  Makoto IKEDA  

     
    PAPER

      Vol:
    E102-A No:1
      Page(s):
    56-64

    An ASIC crypto processor optimized for the 254-bit prime-field optimal-ate pairing over a Barreto-Naehrig (BN) curve is proposed. The data path of the proposed crypto processor is designed to compute five Fp2 operations (a multiplication, three additions/subtractions, and an inversion) simultaneously. We further propose a design methodology to automate the instruction scheduling by using a combinatorial optimization solver, with which the total cycle count is reduced to 1/2 of the previously reported counts. The proposed crypto processor is designed and fabricated by using a 65nm silicon-on-thin-box (SOTB) CMOS process. The chip measurement result shows that the fabricated chip successfully computes a pairing in 0.185ms when a typical operating voltage of 1.20V is applied, which corresponds to a 2.8× speedup compared to the current state-of-the-art pairing implementation on an ASIC platform.

  • Design Exploration of SHA-3 ASIP for IoT on a 32-bit RISC-V Processor

    Jinli RAO  Tianyong AO  Shu XU  Kui DAI  Xuecheng ZOU  

     
    PAPER-Cryptographic Techniques

      Publicized:
    2018/08/22
      Vol:
    E101-D No:11
      Page(s):
    2698-2705

    Data integrity is a key metric of security for Internet of Things (IoT) which refers to accuracy and reliability of data during transmission, storage and retrieval. Cryptographic hash functions are common means used for data integrity verification. Newly announced SHA-3 is the next generation hash function standard to replace existing SHA-1 and SHA-2 standards for better security. However, its underlying Keccak algorithm is computation intensive and thus limits its deployment on IoT systems which are normally equipped with 32-bit resource constrained embedded processors. This paper proposes two efficient SHA-3 ASIPs based on an open 32-bit RISC-V embedded processor named Z-scale. The first operation-oriented ASIP (OASIP) focuses on accelerating time-consuming operations with instruction set extensions to improve resource efficiency. And next datapath-oriented ASIP (DASIP) targets exploiting advance data and instruction level parallelism with extended auxiliary registers and customized datapath to achieve high performance. Implementation results show that both proposed ASIPs can effectively accelerate SHA-3 algorithm with 14.6% and 26.9% code size reductions, 30% and 87% resource efficiency improvements, 71% and 262% better maximum throughputs as well as 40% and 288% better power efficiencies than reference design. This work makes SHA-3 algorithm integration practical for both low-cost and high-performance IoT systems.

  • Evaluation of Register Number Abstraction for Enhanced Instruction Register Files

    Naoki FUJIEDA  Kiyohiro SATO  Ryodai IWAMOTO  Shuichi ICHIKAWA  

     
    PAPER-Computer System

      Publicized:
    2018/03/14
      Vol:
    E101-D No:6
      Page(s):
    1521-1531

    Instruction set randomization (ISR) is a cost-effective obfuscation technique that modifies or enhances the relationship between instructions and machine languages. An Instruction Register File (IRF), a list of frequently used instructions, can be used for ISR by providing the way of indirect access to them. This study examines the IRF that integrates a positional register, which was proposed as a supplementary unit of the IRF, for the sake of tamper resistance. According to our evaluation, with a new design for the contents of the positional register, the measure of tamper resistance was increased by 8.2% at a maximum, which corresponds to a 32.2% increase in the size of the IRF. The number of logic elements increased by the addition of the positional register was 3.5% of its baseline processor.

  • PROVIT-CI: A Classroom-Oriented Educational Program Visualization Tool

    Yu YAN  Kohei HARA  Takenobu KAZUMA  Yasuhiro HISADA  Aiguo HE  

     
    PAPER-Educational Technology

      Publicized:
    2017/11/01
      Vol:
    E101-D No:2
      Page(s):
    447-454

    Studies have shown that program visualization (PV) is effective for student programming exercises or self-study support. However, very few instructors actively use PV tools for programming lectures. This article discusses the impediments that instructors meet when bringing PV tools into lecture classrooms and proposes a C programming classroom instruction support tool based on program visualization: PROVIT-CI (PROgram VIsualization Tool for Classroom Instruction). PROVIT-CI has been consecutively and actively used by the instructors in the authors' university to enhance their lectures since 2015. The evaluation of application results in an introductory C programming course shows that PROVIT-CI is effective and helpful for instructors' classroom use.

  • A Memory-Access-Efficient Implementation for Computing the Approximate String Matching Algorithm on GPUs

    Lucas Saad Nogueira NUNES  Jacir Luiz BORDIM  Yasuaki ITO  Koji NAKANO  

     
    PAPER-GPU computing

      Publicized:
    2016/08/24
      Vol:
    E99-D No:12
      Page(s):
    2995-3003

    The closeness of a match is an important measure with a number of practical applications, including computational biology, signal processing and text retrieval. The approximate string matching (ASM) problem asks to find a substring of string Y of length n that is most similar to string X of length m. It is well known that the ASM can be solved by a dynamic programming technique that computes a table of size m×n. The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU. The proposed GPU implementation relies on warp shuffle instructions, which are used to accelerate the communication between threads without resorting to shared memory access. Despite the fact that O(mn) memory access operations are necessary to access all elements of a table of size m×n, the proposed implementation performs only $O(\frac{mn}{w})$ memory access operations, where w is the warp size. Experimental results carried out on a GeForce GTX 980 GPU show that the proposed implementation, called w-SCAN, provides a speed-up of over twofold in computing the ASM as compared to another prominent alternative.
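
    The dynamic programming table mentioned in this abstract can be sketched sequentially as below. The sketch assumes unit-cost edit distance as the similarity measure and uses a hypothetical function name; it is a plain CPU version that keeps only two rows of the table, not the paper's w-SCAN GPU implementation.

```python
def asm_distance(x: str, y: str) -> int:
    """Smallest edit distance between x and any substring of y."""
    n = len(y)
    # Row 0 is all zeros: a matching substring may start anywhere in y.
    prev = [0] * (n + 1)
    for i in range(1, len(x) + 1):
        # Column 0: matching x[:i] against an empty substring costs i deletions.
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete x[i-1]
                         cur[j - 1] + 1,      # insert y[j-1]
                         prev[j - 1] + cost)  # match or substitute
        prev = cur
    # The best substring may end at any position in y.
    return min(prev)


print(asm_distance("abc", "xxabcxx"))
print(asm_distance("abc", "xxabdxx"))
```

    Each cell depends only on its left, upper, and upper-left neighbours, which is the dependency pattern that a parallel implementation has to respect.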

  • Performance of Dynamic Instruction Window Resizing for a Given Power Budget under DVFS Control

    Hideki ANDO  Ryota SHIOYA  

     
    PAPER-Computer System

      Publicized:
    2015/11/12
      Vol:
    E99-D No:2
      Page(s):
    341-350

    Dynamic instruction window resizing (DIWR) is a scheme that effectively exploits both memory-level parallelism and instruction-level parallelism by configuring the instruction window size appropriately for exploiting each parallelism. Although a previous study has shown that the DIWR processor achieves a significant speedup, power consumption has not been explored. The power consumption is increased in DIWR because the instruction window resources are enlarged in memory-intensive phases. If the power consumption exceeds the power budget determined by certain requirements, the DIWR processor must save power and thus, the performance previously presented cannot be achieved. In this paper, we explore to what extent the DIWR processor can achieve improved performance for a given power budget, assuming that dynamic voltage and frequency scaling (DVFS) is introduced as a power saving technique. Evaluation results using the SPEC2006 benchmark programs show that the DIWR processor, even with a constrained power budget, achieves a speedup over the conventional processor over a wide range of given power budgets. At the most important power budget point, i.e., when the power a conventional processor consumes without any power constraint is supplied, DIWR achieves a 16% speedup.

  • RX v2: Renesas's New-Generation MCU Processor

    Sugako OTANI  Hiroyuki KONDO  

     
    PAPER

      Vol:
    E98-C No:7
      Page(s):
    544-549

    RXv2 is the new generation of Renesas's processor architecture for microcontrollers with high-capacity flash memory. An enhanced instruction set and pipeline structure with an advanced fetch unit (AFU) provide an effective balance between power consumption performance and high processing performance. Enhanced instructions such as DSP functions and floating-point operations and a five-stage dual-issue pipeline synergistically boost the performance of digital signal applications. The RXv2 processor delivers 1.9 - 3.7x the cycle performance of the RXv1 in these applications. The decrease in the number of flash memory accesses achieved by the AFU is a dominant factor in reducing power consumption. The AFU of RXv2 benefits from adopting a branch target cache, which has a comparatively smaller area than that of typical cache systems. High code density delivers low power consumption by reducing instruction memory bandwidth. The implementation of RXv2 delivers up to 46% reduction in static code size and up to 30% reduction in dynamic code size relative to RISC architectures. RXv2 reaches 4.0 Coremark per MHz and operates at up to 240MHz. The RXv2 processor delivers approximately 2.2 - 5.7x the power efficiency of the RXv1. The RXv2 microprocessor achieves the best possible computing performance in various applications such as building automation, medical, motor control, e-metering, and home appliances, which lead to higher memory capacity, frequency and processing performance.

  • A New Approach to Embedded Software Optimization Based on Reverse Engineering

    Nguyen Ngoc BINH  Pham Van HUONG  Bui Ngoc HAI  

     
    PAPER-Computer System

      Publicized:
    2015/03/17
      Vol:
    E98-D No:6
      Page(s):
    1166-1175

    Optimizing embedded software is a problem of scientific and practical significance. Optimizing embedded software can be done in different phases of the software life cycle under different optimal conditions. Most studies of embedded software optimization are done in forward engineering, and these studies have not given an overall model for the optimization problem of embedded software in both forward engineering and reverse engineering. Therefore, in this paper, we propose a new approach to embedded software optimization based on reverse engineering. First, we construct an overall model for embedded software optimization in both forward engineering and reverse engineering and present a process of embedded software optimization in reverse engineering. The main idea of this approach is to decompile executable code to source code, convert the source code to models, and optimize the embedded software at different levels such as source code and model. Then, the optimal source code is recompiled. To develop this approach, we present two optimization techniques: optimizing power consumption of assembly programs based on instruction scheduling and optimizing performance based on alternating equivalent expressions.

  • Distributed Synchronization for Message-Passing Based Embedded Multiprocessors

    Hao XIAO  Ning WU  Fen GE  Guanyu ZHU  Lei ZHOU  

     
    LETTER-Architecture

      Vol:
    E98-D No:2
      Page(s):
    272-275

    This paper presents a synchronization mechanism to effectively implement the lock and barrier protocols in a decentralized manner through explicit message passing. In the proposed solution, a simple and efficient synchronization control mechanism is proposed to support queued synchronization without contention. By using state-of-the-art Application-Specific Instruction-set Processor (ASIP) technology, we embed the synchronization functionality into a baseline processor, making the proposed mechanism feature ultra-low overhead. Experimental results show the proposed synchronization achieves ultra-low latency and almost ideal scalability when the number of processors increases.

  • MLP-Aware Dynamic Instruction Window Resizing in Superscalar Processors for Adaptively Exploiting Available Parallelism

    Yuya KORA  Kyohei YAMAGUCHI  Hideki ANDO  

     
    PAPER-Computer System

      Publicized:
    2014/09/22
      Vol:
    E97-D No:12
      Page(s):
    3110-3123

    Single-thread performance has not improved much over the past few years, despite an ever increasing transistor budget. One of the reasons for this is that there is a speed gap between the processor and main memory, known as the memory wall. A promising method to overcome this memory wall is aggressive out-of-order execution by extensively enlarging the instruction window resources to exploit memory-level parallelism (MLP). However, simply enlarging the window resources lengthens the clock cycle time. Although pipelining the resources solves this problem, it in turn prevents instruction-level parallelism (ILP) from being exploited because issuing instructions requires multiple clock cycles. This paper proposes a dynamic scheme that adaptively resizes the instruction window based on the predicted available parallelism, either ILP or MLP. Specifically, if the scheme predicts that MLP is available during execution, the instruction window is enlarged and the window resources are pipelined, thereby exploiting MLP. Conversely, if the scheme predicts that less MLP is available, that is, ILP is exploitable for improved performance, the instruction window is shrunk and the window resources are de-pipelined, thereby exploiting ILP. Our evaluation results using the SPEC2006 benchmark programs show that the proposed scheme achieves nearly the best performance possible with fixed-size resources. On average, our scheme realizes a performance improvement of 21% over that of a conventional processor, with additional cost of only 6% of the area of the conventional processor core or 3% of that of the entire processor chip. The evaluation results also show 8% better energy efficiency in terms of 1/EDP (energy-delay product).

  • Fast and Accurate Architecture Exploration for High Performance and Low Energy VLIW Data-Path

    Ittetsu TANIGUCHI  Kohei AOKI  Hiroyuki TOMIYAMA  Praveen RAGHAVAN  Francky CATTHOOR  Masahiro FUKUI  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E97-A No:2
      Page(s):
    606-615

    A fast and accurate architecture exploration method for high-performance and low-energy VLIW data-paths is proposed. The main contribution is a method to find Pareto optimal FU structures, i.e., the optimal number of FUs and the best instruction assignment for each FU. The proposed architecture exploration method is based on GA and enables the effective exploration of a vast solution space. Experimental results showed that the proposed method was able to achieve fast and accurate architecture exploration. For most cases, the estimation error was less than 1%.

  • Efficient Utilization of Vector Registers to Improve FFT Performance on SIMD Microprocessors

    Feng YU  Ruifeng GE  Zeke WANG  

     
    LETTER-Digital Signal Processing

      Vol:
    E96-A No:7
      Page(s):
    1637-1641

    We investigate the utilization of vector registers (VRs) for reducing memory references in single instruction multiple data (SIMD) fast Fourier transform calculation. We propose to group the butterfly computations in several consecutive stages to maximize utilization of the available VRs and take advantage of the symmetries in twiddle factors. All the butterflies sharing identical twiddle factors are clustered and computed together to further improve performance. The relationship between the number of fused stages and the number of available VRs is then examined. Experimental results on different platforms show that the proposed method is effective.
