The search functionality is under construction.
The search functionality is under construction.

Keyword Search Result

[Keyword] instruction(79hit)

41-60hit(79hit)

  • Fast Custom Instruction Identification Algorithm Based on Basic Convex Pattern Model for Supporting ASIP Automated Design

    Kang ZHAO  Jinian BIAN  Sheqin DONG  Yang SONG  Satoshi GOTO  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E91-A No:6
      Page(s):
    1478-1487

    To improve the computation efficiency of the application specific instruction-set processor (ASIP), a strategy of hardware/software collaborative design is usually utilized. In this process, the auto-customization of specific instruction set has always been a key part to support the automated design of ASIP. The key issue of this problem is how to effectively reduce the huge exponential exploration space in the instruction identification process. To address this issue, we first formulate it as a feasible sub-graph enumeration problem under multiple constraints, and then propose a fast instruction identification algorithm based on a new model called basic convex pattern (BCP). The kernel technique in this algorithm is the transformation from the graph exploration to the formula-based computations. The experimental results have indicated that the proposed algorithm has a distinct reduction in the execution time.

  • A Low-Power Instruction Issue Queue for Microprocessors

    Shingo WATANABE  Akihiro CHIYONOBU  Toshinori SATO  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    400-409

    Instruction issue queue is a key component which extracts instruction level parallelism (ILP) in modern out-of-order microprocessors. In order to exploit ILP for improving processor performance, instruction queue size should be increased. However, it is difficult to increase the size, since instruction queue is implemented by a content addressable memory (CAM) whose power and delay are much large. This paper introduces a low power and scalable instruction queue that replaces the CAM with a RAM. In this queue, instructions are explicitly woken up. Evaluation results show that the proposed instruction queue decreases processor performance by only 1.9% on average. Furthermore, the total energy consumption is reduced by 54% on average.

  • A Reconfigurable Functional Unit with Conditional Execution for Multi-Exit Custom Instructions

    Hamid NOORI  Farhad MEHDIPOUR  Koji INOUE  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    497-508

    Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique to enhance the performance of embedded processors. However, the addition of custom functional units to the base processor is required to support the execution of these custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market, significant non-recurring engineering and design costs are issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip-fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a matrix of functional units which is multi-cycle with the capability of conditional execution. A quantitative approach is utilized to propose an efficient architecture for the RFU and fix its constraints. To generate more effective custom instructions, they are extended over basic blocks and hence, multiple exits custom instructions are proposed. Conditional execution has been added to the RFU to support the multi-exit feature of custom instructions. Experimental results show that multi-exit custom instructions enhance the performance by an average of 67% compared to custom instructions limited to one basic block. A maximum speedup of 4.7, compared to a general embedded processor, and an average speedup of 1.85 was achieved on MiBench benchmark suite.

  • Diversification of Processors Based on Redundancy in Instruction Set

    Shuichi ICHIKAWA  Takashi SAWADA  Hisashi HATA  

     
    PAPER-Implementation

      Vol:
    E91-A No:1
      Page(s):
    211-220

    By diversifying processor architecture, computer software is expected to be more resistant to plagiarism, analysis, and attacks. This study presents a new method to diversify instruction set architecture (ISA) by utilizing the redundancy in the instruction set. Our method is particularly suited for embedded systems implemented with FPGA technology, and realizes a genuine instruction set randomization, which has not been provided by the preceding studies. The evaluation results on four typical ISAs indicate that our scheme can provide a far larger degree of freedom than the preceding studies. Diversified processors based on MIPS architecture were actually implemented and evaluated with Xilinx Spartan-3 FPGA. The increase of logic scale was modest: 5.1% in Specialized design and 3.6% in RAM-mapped design. The performance overhead was also modest: 3.4% in Specialized design and 11.6% in RAM-mapped design. From these results, our scheme is regarded as a practical and promising way to secure FPGA-based embedded systems.

  • Generation of Pack Instruction Sequence for Media Processors Using Multi-Valued Decision Diagram

    Hiroaki TANAKA  Yoshinori TAKEUCHI  Keishi SAKANUSHI  Masaharu IMAI  Hiroki TAGAWA  Yutaka OTA  Nobu MATSUMOTO  

     
    PAPER-System Level Design

      Vol:
    E90-A No:12
      Page(s):
    2800-2809

    SIMD instructions are often implemented in modern multimedia oriented processors. Although SIMD instructions are useful for many digital signal processing applications, most compilers do not exploit SIMD instructions. The difficulty in the utilization of SIMD instructions stems from data parallelism in registers. In assembly code generation, the positions of data in registers must be noted. A technique of generating pack instructions which pack or reorder data in registers is essential for exploitation of SIMD instructions. This paper presents a code generation technique for SIMD instructions with pack instructions. SIMD instructions are generated by finding and grouping the same operations in programs. After the SIMD instruction generation, pack instructions are generated. In the pack instruction generation, Multi-valued Decision Diagram (MDD) is introduced to represent and to manipulate sets of packed data. Experimental results show that the proposed code generation technique can generate assembly code with SIMD and pack instructions performing repacking of 8 packed data in registers for a RISC processor with a dual-issue coprocessor which supports SIMD and pack instructions. The proposed method achieved speedup ratio up to about 8.5 by SIMD instructions and multiple-issue mechanism of the target processor.

  • Architectural-Level Soft-Error Modeling for Estimating Reliability of Computer Systems

    Makoto SUGIHARA  Tohru ISHIHARA  Kazuaki MURAKAMI  

     
    PAPER-VLSI Design Technology

      Vol:
    E90-C No:10
      Page(s):
    1983-1991

    This paper proposes a soft-error model for accurately estimating reliability of a computer system at the architectural level within reasonable computation time. The architectural-level soft-error model identifies which part of memory modules are utilized temporally and spatially and which single event upsets (SEUs) are critical to the program execution of the computer system at the cycle accurate instruction set simulation (ISS) level. The soft-error model is capable of estimating reliability of a computer system that has several memory hierarchies with it and finding which memory module is vulnerable in the computer system. Reliability estimation helps system designers apply reliable design techniques to vulnerable part of their design. The experimental results have shown that the usage of the soft-error model achieved more accurate reliability estimation than conventional approaches. The experimental results demonstrate that reliability of computer systems depends on not only soft error rates (SERs) of memories but also the behavior of software running in computer systems.

  • Power Estimation of Partitioned Register Files in a Clustered Architecture with Performance Evaluation

    Yukinori SATO  Ken-ichi SUZUKI  Tadao NAKAMURA  

     
    PAPER-VLSI Systems

      Vol:
    E90-D No:3
      Page(s):
    627-636

    High power consumption and slow access of enlarged and multiported register files make it difficult to design high performance superscalar processors. The clustered architecture, where the conventional monolithic register file is partitioned into several smaller register files, is expect to overcome the register file issues. In the clustered architecture, the more a monolithic register file is partitioned, the lower power and faster access register files can be realized. However, the partitioning causes losses of IPC (instructions per clock cycle) due to communication among register files. Therefore, degree of partitioning has a strong impact on the trade-off between power consumption and performance. In addition, the organization of partitioned register files also affects the trade-off. In this paper, we attempt to investigate appropriate degrees of partitioning and organizations of partitioned register files in a clustered architecture to assess the trade-off. From the results of execute-driven simulation, we find that the organization of register files and the degree of partitioning have a strong impact on the IPC, and the configuration with non-consistent register files can make use of the partitioned resources more effectively. From the results of register file access time and energy modeling, we find that the configurations with the highly partitioned non-consistent register file organization can receive benefit of the partitioning in terms of operating frequency and access energy of register files. Further, we examine relationship between IPS (instructions per second) and the product of IPC and operating frequency of register files. The results suggest that highly partitioned non-consistent configurations tends to gain more advantage in performance and power.

  • Efficient DSP Architecture for Viterbi Decoding with Small Trace Back Latency

    Weon Heum PARK  Myung Hoon SUNWOO  Seong Keun OH  

     
    PAPER-Fundamental Theories for Communications

      Vol:
    E89-B No:10
      Page(s):
    2813-2818

    This paper proposes efficient DSP instructions and their hardware architecture for the Viterbi algorithm. The implementation of the Viterbi algorithm on a DSP chip has been attracting more interest for its flexibility, programmability, etc. The proposed architecture can reduce the Trace Back (TB) latency and can support various wireless communication standards. The proposed instructions perform the Add Compare Select (ACS) and TB operations in parallel and the architecture has special hardware, called the Offset Calculation Unit (OCU), which automatically calculates data addresses for acceleration of the trellis butterfly computations. When the constraint length K is 5, the proposed architecture can reduce the decoding cycles about 17% compared with Carmel DSP and about 45% compared with TMS320C55x.

  • An Interactive Multimedia Instruction System: IMPRESSION for Double Loop Instructional Design Process Model

    Yuki HIGUCHI  Takashi MITSUISHI  Kentaro GO  

     
    PAPER-Service and System

      Vol:
    E89-D No:6
      Page(s):
    1877-1884

    In this paper, we propose an interactive instruction system named IMPRESSION, which allows performance of interactive presentations using multimedia educational materials in class. In recent years, although many practices of educational methodology with information technology and presentation tools using multimedia resources as educational materials have come into common use, instructors can only present such materials in a slide-sheet form through the use of such presentation tools in class. Therefore, instructors can neither do formative evaluations nor can they present suitable materials according to students' reactions in class. Our proposed methodology employs a scenario-based approach in a double loop instructional design process to overcome such problems. Instructors design an instructional plan as a scenario, and subsequently implement and modify the plan through formative evaluation during the class. They then conduct a summative evaluation based on planned and implemented instructions for redesign. To realize our methodology, in this paper we propose and design an instruction system that provides functions to select and present multimedia materials interactively provided on the Internet during the class; we then record these instructions. After implementing it, we confirmed that we can conduct the class flexibly based on our methodology through its practical use in an actual classroom environment.

  • Instruction Based Synthesizable Testbench Architecture

    Ho-Seok CHOI  Hae-Wook CHOI  Sin-Chong PARK  

     
    LETTER-Integrated Electronics

      Vol:
    E89-C No:5
      Page(s):
    653-657

    This paper presents a synthesizable testbench architecture based on a defined instruction for standalone mode verification. A set of instructions describes transitions of a signal. The set of instructions can be changed easily to describe different signal transitions by loading the different set of instructions on emulator's memory. The proposed testbench enables a fast emulation and increases flexibility and reusability by using an instruction set. To prove the performance of instruction based synthesizable testbench, we verified Bluetooth and IEEE 802.11a PHY baseband systems and compared their performance with those of co-sim mode and modified co-sim mode emulation.

  • An Energy-Efficient Partitioned Instruction Cache Architecture for Embedded Processors

    CheolHong KIM  SungWoo CHUNG  ChuShik JHON  

     
    PAPER-Computer Systems

      Vol:
    E89-D No:4
      Page(s):
    1450-1458

    Energy efficiency of cache memories is crucial in designing embedded processors. Reducing energy consumption in the instruction cache is especially important, since the instruction cache consumes a significant portion of total processor energy. This paper proposes a new instruction cache architecture, named Partitioned Instruction Cache (PI-Cache), for reducing dynamic energy consumption in the instruction cache by partitioning it to smaller (less power-consuming) sub-caches. When the proposed PI-Cache is accessed, only one sub-cache is accessed by utilizing the temporal/spatial locality of applications. In the meantime, other sub-caches are not accessed, leading to dynamic energy reduction. The PI-Cache also reduces dynamic energy consumption by eliminating the energy consumed in tag lookup and comparison. Moreover, the performance gap between the conventional instruction cache and the proposed PI-Cache becomes little when the physical cache access time is considered. We evaluated the energy efficiency by running a cycle accurate simulator, SimpleScalar, with power parameters obtained from CACTI. Simulation results show that the PI-Cache improves the energy-delay product by 20%-54% compared to the conventional direct-mapped instruction cache.

  • An Efficient Matrix-Based 2-D DCT Splitter and Merger for SIMD Instructions

    Yuh-Jue CHUANG  Ja-Ling WU  

     
    PAPER-Image Processing and Multimedia Systems

      Vol:
    E88-D No:7
      Page(s):
    1569-1577

    Recent microprocessors have included SIMD (single instruction multiple data) extensions into their instruction set architecture to improve the performance of multimedia applications. SIMD instructions speed up the execution of programs but pose lots of challenges to software developers. An efficient matrix-based splitter (or merger), which can split an N N 2-D DCT block into four N/2 N/2 or two N N/2 (or N/2 N) 2-D DCT blocks (or merger small size blocks into a large size one), specialized for SIMD architectures is presented in this paper. The programming-level complexity of the proposed methods is lower than that of the direct approach. Furthermore, even without using SIMD instructions, the algorithmic-level complexity of the proposed DCT splitter/merger is still lower than that of the direct one and is the same as that of the most efficient approach existed in the literature. When N = 8, our method can be applied to act as a transcoder between the latest video coding standards AVC/H.264 and the older ones, such as MPEG-1, MPEG-2 and MPEG-4 part 2. We also provide the image quality tests to show the performance of the proposed 2-D DCT splitter and merger.

  • A SIMD Instruction Set and Functional Unit Synthesis Algorithm with SIMD Operation Decomposition

    Nozomu TOGAWA  Koichi TACHIKAKE  Yuichiro MIYAOKA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Programmable Logic, VLSI, CAD and Layout

      Vol:
    E88-D No:7
      Page(s):
    1340-1349

    This paper focuses on SIMD processor synthesis and proposes a SIMD instruction set/functional unit synthesis algorithm. Given an initial assembly code and a timing constraint, the proposed algorithm synthesizes an area-optimized processor core with optimal SIMD functional units. It also synthesizes a SIMD instruction set. The input initial assembly code is assumed to run on a full-resource SIMD processor (virtual processor) which has all the possible SIMD functional units. In our algorithm, we introduce the SIMD operation decomposition and apply it to the initial assembly code and the full-resource SIMD processor. By gradually reducing SIMD operations or decomposing SIMD operations, we can finally find a processor core with small area under the given timing constraint. The promising experimental results are also shown.

  • Instruction-Level Power Estimation Method by Considering Hamming Distance of Registers

    Akihiko HIGUCHI  Kazutoshi KOBAYASHI  Hidetoshi ONODERA  

     
    PAPER

      Vol:
    E87-A No:4
      Page(s):
    823-829

    This paper proposes an instruction-level power estimation method for an embedded RISC processor, the power consumption of which fluctuates so much by applications and input data. The proposed method estimates the power consumption from the result of ISS (Instruction Set Simulator) and energy tables according to Hamming Distance of Registers (HDR) of all instructions. It is over 105 times faster than the gate-level detailed logic simulation, while the estimated power curves have the same tendency with those from the logic simulation. The proposed method realizes both accurate and fast power estimation of embedded processors.

  • A Retargetable Simulator Generator for DSP Processor Cores with Packed SIMD-type Instructions

    Nozomu TOGAWA  Kyosuke KASAHARA  Yuichiro MIYAOKA  Jinku CHOI  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    PAPER-Simulation Accelerator

      Vol:
    E86-A No:12
      Page(s):
    3099-3109

    A packed SIMD type operation or a SIMD operation is n-parallel b/n-bit sub-operations executed by the modified n-bit functional unit. Such a functional unit is called a SIMD functional unit and a processor core which can execute SIMD operations is called a SIMD processor core. SIMD operations can be effectively applied to image processing applications. This paper focuses on hardware/software cosynthesis of SIMD processor cores and particularly proposes a new simulator generator which simulates pipelined instructions for a SIMD processor. Generally, a SIMD functional unit has many options and then we can have so many different SIMD functional unit instances. However, since our hardware/software cosynthesis system synthesizes a special-purpose processor core for an input application program, it uses very limited SIMD functional unit instances. In the proposed approach, we consider a SIMD operation to be a set of SIMD sub-operations. By adding up the appropriate SIMD sub-operations, we construct a single SIMD operation. Then a SIMD functional unit behavior can be characterized by a collection of SIMD operations. This approach has the advantage that: if we have a small number of behavior libraries for SIMD sub-operations, we can instantiate a particular SIMD functional unit behavior. Experimental results demonstrate the effectiveness of the proposed approach.

  • A Transparent Transient Faults Tolerance Mechanism for Superscalar Processors

    Toshinori SATO  

     
    PAPER-Dependable Systems

      Vol:
    E86-D No:12
      Page(s):
    2508-2516

    In this paper, we propose a fault-tolerance mechanism for microprocessors, which detects transient faults and recovers from them. The investigation of fault-tolerance techniques for microprocessors is driven by two issues: One regards deep submicron fabrication technologies. Future semiconductor technologies could become more susceptible to alpha particles and other cosmic radiation. The other is the increasing popularity of mobile platforms. Cellular telephones are currently used for applications which are critical to our financial security, such as mobile banking, mobile trading, and making airline ticket reservations. Such applications demand that computer systems work correctly. In light of this, we propose a mechanism which is based on an instruction reissue technique for incorrect data speculation recovery and utilizes time redundancy, and evaluate our proposal using a timing simulator.

  • A Hardware/Software Partitioning Algorithm for Processor Cores with Packed SIMD-Type Instructions

    Nozomu TOGAWA  Koichi TACHIKAKE  Yuichiro MIYAOKA  Masao YANAGISAWA  Tatsuo OHTSUKI  

     
    LETTER-Design Methodology

      Vol:
    E86-A No:12
      Page(s):
    3218-3224

    This letter proposes a new hardware/software partitioning algorithm for processor cores with SIMD instructions. Given a compiled assembly code including SIMD instructions and a timing constraint, the proposed algorithm synthesizes an area-optimized processor core with a new assembly code. Firstly, we assume for each operation type a super SIMD functional unit which can execute all the SIMD instructions. Secondly we reduce a SIMD instruction or "sub-function" of each super functional unit, one by one, while the timing constraint is satisfied. At the same time, we update the assembly code so that it can run on the new processor configuration. By repeating this process, we finally find SIMD functional unit configuration as well as a processor core architecture. The promising experimental results are also shown.

  • Implementation of Java Accelerator for High-Performance Embedded Systems

    Motoki KIMURA  Morgan Hirosuke MIKI  Takao ONOYE  Isao SHIRAKAWA  

     
    PAPER-Simulation Accelerator

      Vol:
    E86-A No:12
      Page(s):
    3079-3088

    A Java execution environment is implemented, in which a hardware engine is operated in parallel with an embedded processor. This pair of hardware facilities together with an additional software kernel are devised for existing embedded systems, so as to execute Java applications more efficiently in such a way that 39 instructions are added to the original Java Virtual Machine to implement the software kernel. The exploration of design parameters is also attempted to attain a low hardware cost and high performance. The proposed hardware engine of a 6-stage pipeline can be integrated in a single chip using 30 k gates together with the instruction and data cache memories. The proposed approach improves the execution speed by a factor of 5 in comparison with the J2ME software implementation.

  • Performance Evaluation of Instruction Set Architecture of MBP-Light in JUMP-1

    Noriaki SUZUKI  Hideharu AMANO  

     
    PAPER

      Vol:
    E86-D No:10
      Page(s):
    1996-2005

    The instruction set architecture of MBP-light, a dedicated processor for the DSM (Distributed Shared Memory) management of JUMP-1 is analyzed with a real prototype. The Buffer-Register Architecture proposed for MBP-core improves performance with 5.64% in the home cluster and 6.27% in a remote cluster. Only a special instruction for hashing cluster address is efficient and improves the performance with 2.80%, but other special instructions are almost useless. It appears that the dominant operations in the DSM management program were handling packet queues assigned into the local cluster. Thus, common RISC instructions, especially load/store instructions, are frequently used. Separating instruction and data memory improves performance with 33%. The results suggest that another alternative which provides separate on-chip cache and instructions dedicated for packet queue management is advantageous.

  • Resource-Optimal Software Pipelining Using Flow Graphs

    Dirk FIMMEL  Jan MULLER  Renate MERKER  

     
    INVITED PAPER-Software Systems and Technologies

      Vol:
    E86-D No:9
      Page(s):
    1560-1568

    We present a new approach to the loop scheduling problem, which excels previous solutions in two important aspects: The resource constraints are formulated using flow graphs, and the initiation interval λ is treated as a rational variable. The approach supports heterogeneous processor architectures and pipelined functional units, and the Integer Linear Programming implementation produces an optimum loop schedule, whereby a minimum λ is achieved. Our flow graph model facilitates the cyclic binding of loop operations to functional units. Compared to previous research results, the solution can provide faster loop schedules and a significant reduction of the problem complexity and solution time.

41-60hit(79hit)