
Keyword Search Result

[Keyword] embedded processor (23 hits)

Showing 1-20 of 23 hits

  • HOG-Based Object Detection Processor Design Using ASIP Methodology

    Shanlin XIAO  Tsuyoshi ISSHIKI  Dongju LI  Hiroaki KUNIEDA  

     
    PAPER-VLSI Design Technology and CAD

    Vol: E100-A No:12, Page(s): 2972-2984

    Object detection is an essential and expensive process in many computer vision systems, and standard off-the-shelf embedded processors struggle to achieve the performance-power balance it requires. In this work, we explore an Application Specific Instruction set Processor (ASIP) for object detection using the Histogram of Oriented Gradients (HOG) feature. Algorithm simplifications are adopted to reduce memory bandwidth requirements and mathematical complexity without sacrificing reliability. In addition, parallel histogram generation and an on-the-fly Support Vector Machine (SVM) calculation architecture are employed to reduce the required cycle counts. The HOG algorithm on the proposed ASIP runs 63x faster than a pure software implementation. The ASIP was synthesized for a standard 90 nm CMOS library, occupying 1.31 mm² of silicon and consuming 47.8 mW at 200 MHz. Our object detection processor achieves 42 frames per second (fps) on VGA video. The evaluation and implementation results show that the proposed ASIP is both area- and power-efficient, is competitive with commercial CPUs/DSPs, and exhibits performance comparable even to hard-wired designs.
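
    A minimal Python sketch of the pipeline this abstract describes (orientation-histogram generation followed by a linear SVM dot product) is shown below. The 9-bin unsigned-gradient setup and the simple magnitude voting are common HOG defaults assumed for illustration, not details taken from the paper.

        import numpy as np

        def hog_cell_histogram(mag, ang, bins=9):
            # Accumulate gradient magnitudes of one cell into orientation bins;
            # unsigned gradients, so angles fold into [0, 180) degrees.
            hist = np.zeros(bins)
            bin_width = 180.0 / bins
            for m, a in zip(mag.ravel(), ang.ravel()):
                hist[int((a % 180.0) // bin_width)] += m
            return hist

        def svm_score(window_features, weights, bias):
            # The on-the-fly SVM evaluation is a dot product plus a bias,
            # which is why it can be interleaved with histogram generation.
            return float(np.dot(window_features, weights) + bias)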

  • Design of an Application Specific Instruction Set Processor for Real-Time Object Detection Using AdaBoost Algorithm

    Shanlin XIAO  Tsuyoshi ISSHIKI  Dongju LI  Hiroaki KUNIEDA  

     
    PAPER

    Vol: E100-A No:7, Page(s): 1384-1395

    Object detection is at the heart of nearly all computer vision systems, and standard off-the-shelf embedded processors struggle to meet the trade-offs among performance, power consumption, and flexibility that object detection applications require. This paper therefore presents an Application Specific Instruction set Processor (ASIP) for object detection using the AdaBoost learning algorithm with Haar-like features as weak classifiers. Algorithm optimizations are employed to reduce memory bandwidth requirements without sacrificing reliability. In the proposed ASIP, a Single Instruction Multiple Data (SIMD) architecture is adopted to fully exploit the data-level parallelism inherent in the target algorithm. By adding pipeline stages, application-specific hardware components, and custom instructions, the AdaBoost algorithm is accelerated by a factor of 13.7x over an optimized pure software implementation. Compared with the ARM946 and TMS320C64+, our ASIP shows 32x and 7x better throughput, 10x and 224x better area efficiency, and 6.8x and 18.8x better power efficiency, respectively. Furthermore, compared to hard-wired designs, evaluation results show an advantage in chip-area efficiency while maintaining reliable performance and achieving real-time object detection at 32 fps on VGA video.
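
    A hedged sketch of the classifier structure named here (Haar-like weak classifiers combined by AdaBoost) follows; it uses the standard integral-image formulation rather than the paper's exact hardware mapping, and all names are illustrative.

        import numpy as np

        def integral_image(img):
            # Cumulative-sum table: any rectangle sum then costs four lookups.
            return img.cumsum(axis=0).cumsum(axis=1)

        def rect_sum(ii, x, y, w, h):
            # Sum of the w-by-h rectangle with top-left corner (x, y).
            a = ii[y + h - 1, x + w - 1]
            b = ii[y - 1, x + w - 1] if y > 0 else 0
            c = ii[y + h - 1, x - 1] if x > 0 else 0
            d = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
            return a - b - c + d

        def strong_classify(ii, weak_classifiers):
            # AdaBoost strong classifier: weighted vote of thresholded
            # Haar-feature responses (feat computes a rectangle difference).
            score = 0.0
            for feat, theta, polarity, alpha in weak_classifiers:
                vote = 1 if polarity * feat(ii) < polarity * theta else 0
                score += alpha * vote
            return score >= 0.5 * sum(w[3] for w in weak_classifiers)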

  • Fast Montgomery Modular Multiplication and Squaring on Embedded Processors

    Yang LI  Jinlin WANG  Xuewen ZENG  Xiaozhou YE  

     
    PAPER-Fundamental Theories for Communications

    Publicized: 2016/12/06, Vol: E100-B No:5, Page(s): 680-690

    Montgomery modular multiplication is one of the most efficient algorithms for modular multiplication of large integers. On resource-constrained embedded processors, memory-access operations play as important a role in the modular multiplication as arithmetic operations do. To improve the efficiency of Montgomery modular multiplication on embedded processors, this paper concentrates on reducing memory-access operations by adding a few working registers. We first revisit popular Montgomery modular multiplication algorithms, and then present improved algorithms for Montgomery modular multiplication and squaring over arbitrary prime fields. The algorithms adopt the general ideas of the hybrid multiplication algorithm proposed by Gura et al. and the lazy doubling algorithm proposed by Lee et al. Through careful optimization and redesign, we propose novel implementations of Montgomery multiplication and squaring called the coarsely integrated product and operand hybrid scanning algorithm (CIPOHS) and the coarsely integrated lazy doubling algorithm (CILD). We then implement the algorithms on a general MIPS64 processor and on an OCTEON CN6645 processor equipped with dedicated multiply-add instructions. Experiments show that CIPOHS and CILD offer the best performance on both processors, with a clear advantage on processors that provide dedicated multiply-add instructions, such as the OCTEON CN6645. When the modulus is 2048 bits, CIPOHS and CILD outperform the CIOS algorithm by 47% and 58%, respectively.
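
    For reference, a compact Python model of the baseline CIOS algorithm that the paper measures against is sketched below, following the classic Koc-Acar-Kaliski formulation with 32-bit words; the proposed CIPOHS/CILD variants reorganize this loop structure to save memory accesses, which a plain Python model cannot exhibit.

        W = 1 << 32  # 32-bit words; operands are little-endian word lists

        def cios_montmul(a, b, n, n0_inv, s):
            # Returns a*b*R^-1 mod n, with R = 2^(32*s) and
            # n0_inv = -n[0]^-1 mod 2^32 (the usual Montgomery constant).
            t = [0] * (s + 2)
            for i in range(s):
                carry = 0                      # multiplication: t += a * b[i]
                for j in range(s):
                    cs = t[j] + a[j] * b[i] + carry
                    t[j], carry = cs % W, cs // W
                cs = t[s] + carry
                t[s], t[s + 1] = cs % W, cs // W
                m = (t[0] * n0_inv) % W        # reduction: make t divisible by 2^32
                carry = (t[0] + m * n[0]) // W
                for j in range(1, s):
                    cs = t[j] + m * n[j] + carry
                    t[j - 1], carry = cs % W, cs // W
                cs = t[s] + carry
                t[s - 1], t[s] = cs % W, t[s + 1] + cs // W
            r = sum(w << (32 * k) for k, w in enumerate(t[:s + 1]))
            mod = sum(w << (32 * k) for k, w in enumerate(n))
            r = r - mod if r >= mod else r     # conditional final subtraction
            return [(r >> (32 * k)) % W for k in range(s)]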

  • A Leakage Efficient Instruction TLB Design for Embedded Processors

    Zhao LEI  Hui XU  Daisuke IKEBUCHI  Tetsuya SUNATA  Mitaro NAMIKI  Hideharu AMANO  

     
    PAPER-Computer System

    Vol: E94-D No:8, Page(s): 1565-1574

    This paper presents a leakage-efficient instruction TLB (Translation Lookaside Buffer) design for embedded processors. The key observation is that when programs enter a physical page, the following instructions tend to be fetched from the same page for a rather long time. Thus, by employing a small storage component which holds the recent address-translation information, the TLB access frequency can be drastically decreased, and the instruction TLB can be turned into the low-leakage mode with the dual voltage supply technique. Based on such a design philosophy, three leakage control policies are proposed to maximize the leakage reduction efficiency. Evaluation results with eight MiBench programs show that the proposed design can reduce the leakage power of the instruction TLB by 50% on average, with only 0.01% performance degradation.
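
    The key observation translates into a very small software model: a one-entry filter in front of the TLB that serves translations while fetches stay on the same page. The sketch below is illustrative (4 KiB pages, a single filter entry); the paper's storage component and its three leakage-control policies are richer than this.

        PAGE_SHIFT = 12  # assume 4 KiB pages

        class RecentPageFilter:
            # While instruction fetches stay on one page, this filter answers
            # translations and the full TLB can remain in low-leakage mode.
            def __init__(self):
                self.vpn = self.pfn = None
                self.tlb_wakeups = 0

            def translate(self, vaddr, tlb_lookup):
                offset = vaddr & ((1 << PAGE_SHIFT) - 1)
                vpn = vaddr >> PAGE_SHIFT
                if vpn != self.vpn:            # page change: wake the TLB once
                    self.tlb_wakeups += 1
                    self.vpn, self.pfn = vpn, tlb_lookup(vpn)
                return (self.pfn << PAGE_SHIFT) | offset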

  • Core Working Set Based Scratchpad Memory Management

    Ning DENG  Weixing JI  Jiaxin LI  Qi ZUO  Feng SHI  

     
    PAPER-Computer System

    Vol: E94-D No:2, Page(s): 274-285

    Many state-of-the-art embedded systems adopt scratch-pad memory (SPM) as the main on-chip memory because of its advantages in energy consumption and on-chip area. Whereas a cache is managed automatically by the hardware, SPM is generally managed by software. Traditional compiler-based SPM allocation methods use static analysis and profiling knowledge to identify data that will be frequently used at runtime, so data transfers are fixed at compile time. These methods are fragile when the access pattern is unpredictable at compile time. Moreover, as embedded devices diversify, SPM management should support application portability across platforms. This paper proposes a novel runtime SPM management method based on the core working set (CWS) theory. A counting-based CWS identification algorithm heuristically determines the data blocks in the program's working set that have high reference frequency, and these promising blocks are allocated to SPM. The novelty of this method lies in using the program's dynamic access pattern as the main cue for SPM allocation at runtime, thus offloading SPM management from the compiler. The method requires the assistance of the MMU to redirect addresses after data transfers. We evaluate the new approach by comparing it with a cache system and a classical profiling-driven method; the results indicate that CWS-based SPM management achieves a considerable energy reduction over both reference systems without notable performance degradation.
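
    A hedged sketch of the counting-based identification step described here: blocks whose reference count in the current window crosses a threshold are treated as the core working set and become SPM-allocation candidates. The block size and threshold are illustrative parameters, not the paper's values.

        from collections import Counter

        def core_working_set(access_trace, block_size=1024, threshold=16):
            # Count references per data block over one execution window and
            # keep the heavily reused blocks as SPM-allocation candidates.
            counts = Counter(addr // block_size for addr in access_trace)
            return {blk for blk, c in counts.items() if c >= threshold}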

  • A Leakage Efficient Data TLB Design for Embedded Processors

    Zhao LEI  Hui XU  Daisuke IKEBUCHI  Tetsuya SUNATA  Mitaro NAMIKI  Hideharu AMANO  

     
    PAPER-Computer System

    Vol: E94-D No:1, Page(s): 51-59

    This paper presents a leakage efficient data TLB (Translation Look-aside Buffer) design for embedded processors. Due to the data locality in programs, data TLB references tend to hit only a small number of pages during short execution intervals. After dividing the overall execution time into smaller time slices, a leakage reduction mechanism is proposed to detect TLB entries which actually serve for virtual-to-physical address translations within each time slice. Thus, with the integration of the dual voltage supply technique, those TLB entries which are not used for address translations can be put into low leakage mode (with lower voltage supply) to save power. Evaluation results with eight MiBench programs show that the proposed design can reduce the leakage power of a data TLB by 37% on average, with performance degradation less than 0.01%.

  • A Way Enabling Mechanism Based on the Branch Prediction Information for Low Power Instruction Cache

    Gi-Ho PARK  Jung-Wook PARK  Hoi-Jin LEE  Gunok JUNG  Sung-Bae PARK  Shin-Dug KIM  

     
    LETTER

    Vol: E92-C No:4, Page(s): 517-521

    This paper presents a cache way enabling mechanism that uses branch target addresses. The mechanism exploits branch prediction information to avoid the power consumed by unnecessary cache way accesses, enabling only the cache way(s) that actually need to be accessed. It reduces the power consumption of the instruction cache by 63% without any performance degradation of the processor. An ARM1136 processor simulator and Synopsys PrimeTime were used for the performance/power simulation and the static timing analysis of the proposed mechanism, respectively.
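
    As an illustration of the idea (a predictor remembers which way holds a branch target, so only that way is powered on a predicted fetch), a behavioural sketch is given below; the table layout and fallback policy are assumptions, not the letter's exact design.

        class WayEnabler:
            # BTB-style table mapping branch targets to the cache way that
            # holds them; a hit lets the fetch enable a single way.
            def __init__(self, num_ways):
                self.num_ways = num_ways
                self.target_way = {}

            def ways_to_enable(self, predicted_target):
                if predicted_target in self.target_way:
                    return [self.target_way[predicted_target]]  # one way powered
                return list(range(self.num_ways))               # fallback: all ways

            def record_fill(self, target_addr, way):
                self.target_way[target_addr] = way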

  • A Partial Access Mechanism on a Register for Low-Cost Embedded Multimedia ASIP

    Ha-young JEONG  Min-young CHO  Won HUR  Yong-surk LEE  

     
    LETTER-Integrated Electronics

    Vol: E91-C No:7, Page(s): 1171-1174

    In this letter, we propose a partial access mechanism for the register file of a low-cost embedded multimedia processor architecture. In embedded systems, supporting SIMD operations is a burden because of the wide SIMD register file and its execution unit. We therefore propose a new architecture that increases SIMD performance with minimal additional hardware. To evaluate the performance and hardware overhead, the architecture was applied to an embedded multimedia processor and simulated with five DSP benchmarks. The simulation results indicate that performance increases by 38% while total area grows by only 13.4%. The proposed partial access mechanism may thus be useful for low-cost embedded multimedia ASIPs.

  • Cooperative Cache System: A Low Power Cache System for Embedded Processors

    Gi-Ho PARK  Kil-Whan LEE  Tack-Don HAN  Shin-Dug KIM  

     
    PAPER-Digital

    Vol: E90-C No:4, Page(s): 708-717

    This paper presents a dual data cache structure, called a cooperative cache system, designed as a low-power cache structure for embedded processors. The cooperative cache system consists of two caches: a direct-mapped temporal oriented cache (TOC) and a four-way set-associative spatial oriented cache (SOC). It improves performance and reduces power consumption by virtue of the structural characteristics of the two caches, which are designed to complement each other. An evaluation chip of an embedded processor with the cooperative cache system was manufactured by Samsung Electronics Co. using 0.25 µm 4-metal process technology.

  • Dynamic Reconfiguration of Cache Indexing in Embedded Processors

    Junhee KIM  Sung-Soo LIM  Jihong KIM  

     
    PAPER-VLSI Systems

    Vol: E90-D No:3, Page(s): 637-647

    Cache performance optimization is an important design consideration in building high-performance embedded processors. Unlike general-purpose microprocessors, embedded processors can take advantage of application-specific information when optimizing cache performance. One such example is to use modified cache index bits (over conventional index bits), chosen from memory access traces of key target embedded applications, so that the number of conflict misses is reduced. In this paper, we present a novel fine-grained cache reconfiguration technique that allows intra-program reconfiguration of cache index bits, better reflecting the changing characteristics of a program's execution. The proposed technique, called dynamic reconfiguration of index bits (DRIB), dynamically changes cache index bits at the function level. This compiler-directed, fine-grained approach allows each function to execute with its own optimal index bits and requires no additional hardware support. To avoid potential performance degradation from the frequent cache invalidations that reconfiguring index bits can cause, we describe an efficient algorithm for selecting the target functions whose cache index bits are reconfigured. The algorithm ensures that the cache misses eliminated by DRIB outnumber the additional misses introduced by cache invalidations. We also propose a new cache architecture, the Two-Level Indexing (TLI) cache, which further reduces conflict misses by intelligently dividing indexing into two stages. Our experimental results show that the DRIB approach combined with the TLI cache reduces the number of cache misses by 35% over conventional cache indexing.
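
    The search that DRIB performs at compile time can be pictured with a small model: simulate a direct-mapped cache for each candidate choice of index bits against a function's access trace and keep the choice with the fewest misses. The exhaustive search and the parameters below are illustrative simplifications, not the paper's algorithm.

        from itertools import combinations

        def misses_for_index_bits(trace, bits, block_shift=5):
            # Direct-mapped miss count when the set index is assembled from
            # the given block-address bit positions.
            store, misses = {}, 0
            for addr in trace:
                blk = addr >> block_shift
                idx = 0
                for k, b in enumerate(bits):
                    idx |= ((blk >> b) & 1) << k
                if store.get(idx) != blk:
                    misses += 1
                    store[idx] = blk
            return misses

        def best_index_bits(trace, sets_log2, candidate_bits):
            # Pick the bit combination minimizing misses for this trace.
            return min(combinations(candidate_bits, sets_log2),
                       key=lambda bits: misses_for_index_bits(trace, bits))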

  • A Hardware Accelerator for JavaTM Platforms on a 130-nm Embedded Processor Core

    Tetsuya YAMADA  Naohiko IRIE  Takanobu TSUNODA  Takahiro IRITA  Kenji KITAGAWA  Ryohei YOSHIDA  Keisuke TOYAMA  Motoaki SATOYAMA  

     
    PAPER-Integrated Electronics

    Vol: E90-C No:2, Page(s): 523-530

    We have developed a hardware accelerator for Java platforms, integrated on a SuperH microprocessor core, using a 130-nm CMOS process. The Java accelerator, a bytecode translation unit (BTU), is tightly coupled with the CPU to share resources. The BTU supports 159 basic bytecodes and 5 or 6 optional bytecodes, and supports both the connected device configuration (CDC) 1.0 and connected limited device configuration (CLDC) 1.0.4 technologies. The BTU is matched to the dual-issue superscalar CPU and applies a new control-sharing method: the BTU always tracks the pipeline status of the CPU, and the Java program is processed by the BTU and the CPU together. To implement this method, we developed several acceleration techniques: fast branch requests, enhanced CPU instructions, Java runtime exception detection hardware, and reduced handover overhead cycles between the BTU and the CPU. In particular, the BTU can detect Java runtime exceptions in parallel with other processing, such as an array access; previous methods lose CPU efficiency on such Java-specific processing, for example array index bounds checking. The sample chip was fabricated in Renesas 130-nm, five-layer Cu, dual-vth low-power CMOS technology. The chip runs at 216 MHz and 1.2 V, and the BTU occupies 75 kgates. A benchmark on an evaluation board showed 6.55 embedded caffeine marks (ECM)/MHz on the CLDC 1.0.4 configuration, a tenfold speed increase over running without the BTU at roughly the same power consumption; in other words, a 90 percent power saving at the same performance.

  • An Energy-Efficient Partitioned Instruction Cache Architecture for Embedded Processors

    CheolHong KIM  SungWoo CHUNG  ChuShik JHON  

     
    PAPER-Computer Systems

    Vol: E89-D No:4, Page(s): 1450-1458

    Energy efficiency of cache memories is crucial in designing embedded processors. Reducing energy consumption in the instruction cache is especially important, since the instruction cache consumes a significant portion of total processor energy. This paper proposes a new instruction cache architecture, named the Partitioned Instruction Cache (PI-Cache), which reduces dynamic energy consumption in the instruction cache by partitioning it into smaller (less power-consuming) sub-caches. When the proposed PI-Cache is accessed, only one sub-cache is accessed, exploiting the temporal/spatial locality of applications; the other sub-caches remain idle, reducing dynamic energy. The PI-Cache also saves the energy consumed in tag lookup and comparison. Moreover, the performance gap between the conventional instruction cache and the proposed PI-Cache becomes small when the physical cache access time is considered. We evaluated the energy efficiency with a cycle-accurate simulator, SimpleScalar, using power parameters obtained from CACTI. Simulation results show that the PI-Cache improves the energy-delay product by 20%-54% over a conventional direct-mapped instruction cache.
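
    A behavioural sketch of the access principle (only one sub-cache is woken per fetch, selected by address bits, so dynamic energy scales with the sub-cache rather than the whole cache) is given below under assumed parameters; the PI-Cache's actual selection logic may differ.

        class PartitionedICache:
            def __init__(self, num_subs=4, sets_per_sub=128, block_shift=5):
                self.num_subs, self.sets = num_subs, sets_per_sub
                self.shift = block_shift
                self.store = [dict() for _ in range(num_subs)]
                self.sub_activations = 0   # vs. whole-cache activations

            def access(self, addr):
                blk = addr >> self.shift
                sub = (blk // self.sets) % self.num_subs  # select one sub-cache
                idx = blk % self.sets
                self.sub_activations += 1                 # only that sub-cache wakes
                hit = self.store[sub].get(idx) == blk
                self.store[sub][idx] = blk                # fill on miss
                return hit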

  • Reducing Clock Power Consumption of a 90 nm Embedded Processor Core

    Tetsuya YAMADA  Masahide ABE  Yusuke NITTA  Kenji OGURA  Manabu KUSAOKE  Makoto ISHIKAWA  Motokazu OZAWA  Kiwamu TAKADA  Fumio ARAKAWA  Osamu NISHII  Toshihiro HATTORI  

     
    PAPER-Low Power Techniques

    Vol: E89-C No:3, Page(s): 287-294

    A low-power SuperHTM embedded processor core, the SH-X2, has been designed in 90-nm CMOS technology. Its power consumption was reduced by hierarchical fine-grained clock gating that cuts the power of the flip-flops and the clock tree, by synthesis and layout that support the implementation of this clock gating, and by power evaluations at several design levels to refine the RTL. With this clock gating and RTL refinement, the power consumption of the clock tree and the flip-flops was reduced by 35% and 59%, respectively, including process-shrinking effects. As a result, the SH-X2 achieved 6,000 MIPS/W using a Renesas low-power process at a lowered voltage, a performance-power efficiency 25% better than that of the 130-nm-process SH-X.

  • Speculative Branch Folding for Pipelined Processors

    Sang-Hyun PARK  Sungwook YU  Jung-Wan CHO  

     
    LETTER-Computer Systems

    Vol: E88-D No:5, Page(s): 1064-1066

    This paper proposes an effective branch folding technique that combines branch instructions with predicted instructions. The technique can be implemented using an instruction queue that buffers prefetched instructions. Most instructions in the queue are forwarded to the execution unit in sequence; branch instructions, however, are combined with predicted instructions in the queue, and these folded instructions are forwarded to the execution unit. A misprediction is recovered from by flushing the folded instructions, without any processor state recovery, and restarting from the other path. Simulation and implementation results show that both performance and power consumption are significantly improved at little additional hardware cost.
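
    Since the instruction queue already holds the predicted-path instructions, folding amounts to pairing each branch with the entry that follows it, as in this illustrative model (the Instr record and the queue discipline are assumptions, not the letter's design):

        from collections import namedtuple

        Instr = namedtuple("Instr", "op is_branch")

        def fold_branches(queue):
            # Pair each branch with the next queue entry (the predicted-path
            # instruction) so the pair issues as one folded entry; on a
            # misprediction only these folded entries need flushing.
            out, i = [], 0
            while i < len(queue):
                if queue[i].is_branch and i + 1 < len(queue):
                    out.append((queue[i], queue[i + 1]))
                    i += 2
                else:
                    out.append((queue[i],))
                    i += 1
            return out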

  • An Exact Leading Non-Zero Detector for a Floating-Point Unit

    Fumio ARAKAWA  Tomoichi HAYASHI  Masakazu NISHIBORI  

     
    PAPER-Digital

    Vol: E88-C No:4, Page(s): 570-575

    Parallel execution of the carry propagate adder (CPA) and a leading non-zero (LNZ) detector for the CPA result is a common way to reduce the latency of floating-point instructions. However, conventional parallel LNZ methods usually produce one-bit errors. We developed an exact LNZ detection circuit that operates in parallel with the CPA; it is implemented in the floating-point unit of our newly developed embedded processor core. Circuit simulation results show that the LNZ circuit is of similar speed to the CPA and contributes to a small, low-power FPU for an embedded processor core.
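
    Functionally, the unit finds the position of the most significant set bit of the sum, but derives it from the operands in parallel with the adder. A plain software equivalent of the function (computed after the sum, which is exactly the serialization the hardware avoids) looks like this:

        def leading_nonzero(x, width=64):
            # Index of the most significant set bit (width-1 = MSB), or -1.
            for pos in range(width - 1, -1, -1):
                if (x >> pos) & 1:
                    return pos
            return -1

        # A speculative LNZ derived from the operands can be off by one bit;
        # an exact unit removes the correcting shift from the critical path.
        a, b = 0x3F0, -0x3E8
        assert leading_nonzero(a + b) == 3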

  • An Embedded Processor Core for Consumer Appliances with 2.8GFLOPS and 36 M Polygons/s FPU

    Fumio ARAKAWA  Motokazu OZAWA  Osamu NISHII  Toshihiro HATTORI  Takeshi YOSHINAGA  Tomoichi HAYASHI  Yoshikazu KIYOSHIGE  Takashi OKADA  Masakazu NISHIBORI  Tomoyuki KODAMA  Tatsuya KAMEI  Makoto ISHIKAWA  

     
    PAPER-System Level Design

    Vol: E87-A No:12, Page(s): 3068-3074

    A SuperHTM embedded processor core implemented in a 130-nm CMOS process running at 400 MHz achieved 720 MIPS and 2.8 GFLOPS at a power of 250 mW in worst-case conditions. It has a dual-issue seven-stage pipeline architecture but maintains the 1.8 MIPS/MHz of the previous five-stage processor. The processor meets the requirements of a wide range of applications, and is suitable for digital appliances aimed at the consumer market, such as cellular phones, digital still/video cameras, and car navigation systems.

  • Memory Data Organization for Low-Energy Address Buses

    Hiroyuki TOMIYAMA  Hiroaki TAKADA  Nikil D. DUTT  

     
    PAPER

    Vol: E87-C No:4, Page(s): 606-612

    Energy consumption has become one of the most critical constraints in the design of portable multimedia systems. For media applications, the address buses between the processor and data memory consume a considerable amount of energy due to their large capacitance and frequent use. This paper studies the impact of memory data organization on address-bus energy. Our experiments show that address-bus activity is reduced by 50% through exploring memory data organization combined with address-bus encoding.
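
    The metric behind these results is easy to state: address-bus energy is roughly proportional to the Hamming distance between consecutive addresses. The sketch below counts those transitions for two traversal orders of the same array; the sizes are arbitrary examples, not the paper's benchmarks.

        def bus_transitions(addresses):
            # Total Hamming distance between consecutive bus values, the
            # usual proxy for address-bus switching energy.
            return sum(bin(a ^ b).count("1")
                       for a, b in zip(addresses, addresses[1:]))

        N, word = 64, 4  # 64x64 array of 4-byte words, row-major layout
        row_order = [word * (N * i + j) for i in range(N) for j in range(N)]
        col_order = [word * (N * i + j) for j in range(N) for i in range(N)]
        # the sequential row-order walk flips fewer address bits overall
        # than the strided column-order walk of the same data
        print(bus_transitions(row_order), bus_transitions(col_order))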

  • Instruction-Level Power Estimation Method by Considering Hamming Distance of Registers

    Akihiko HIGUCHI  Kazutoshi KOBAYASHI  Hidetoshi ONODERA  

     
    PAPER

    Vol: E87-A No:4, Page(s): 823-829

    This paper proposes an instruction-level power estimation method for an embedded RISC processor whose power consumption fluctuates considerably with the application and its input data. The proposed method estimates power consumption from the result of an Instruction Set Simulator (ISS) run and energy tables indexed by the Hamming Distance of Registers (HDR) for all instructions. It is over 10^5 times faster than detailed gate-level logic simulation, while the estimated power curves show the same tendency as those from the logic simulation. The proposed method thus realizes both accurate and fast power estimation for embedded processors.
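
    The model reduces to a per-instruction base energy plus a term proportional to HDR, as in this hedged sketch; the table contents would come from gate-level characterization, and the names and trace format here are assumptions.

        def estimate_energy(trace, base_energy, hdr_coeff):
            # trace: (opcode, packed register bits) pairs from an ISS run;
            # base_energy/hdr_coeff: per-opcode characterization tables.
            total, prev_regs = 0.0, 0
            for opcode, reg_bits in trace:
                hdr = bin(prev_regs ^ reg_bits).count("1")
                total += base_energy[opcode] + hdr_coeff[opcode] * hdr
                prev_regs = reg_bits
            return total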

  • Randomized Caches for Power-Efficiency

    Hans VANDIERENDONCK  Koen De BOSSCHERE  

     
    PAPER-Integrated Electronics

    Vol: E86-C No:10, Page(s): 2137-2144

    Embedded processors are used in numerous devices executing dedicated applications. This setting makes it worthwhile to optimize the processor for the application it executes in order to increase its power efficiency. This paper proposes enhancing direct-mapped data caches with automatically tuned randomized set index functions to achieve that goal. We show how randomization functions can be generated automatically and compare them to traditional set-associative caches in terms of performance and energy consumption. A 16 kB randomized direct-mapped cache consumes 22% less energy than a 2-way set-associative cache, while being less than 3% slower. When the randomization function is made configurable (i.e., it can be adapted to the program), the additional reduction in conflicts outweighs the added hardware complexity, provided there is a sufficient number of conflict misses.
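
    XOR-based functions are a standard way to build such randomized set indices; a minimal sketch is below. The fold depth and bit widths are illustrative, and the paper generates and tunes its functions automatically, which this fixed sketch does not capture.

        def xor_fold_index(addr, index_bits=7, block_shift=5):
            # Fold higher block-address bits into the conventional index
            # bits so conflicting addresses spread over different sets.
            blk = addr >> block_shift
            low = blk & ((1 << index_bits) - 1)
            high = (blk >> index_bits) & ((1 << index_bits) - 1)
            return low ^ high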

  • A Low-Power Embedded RISC Microprocessor with an Integrated DSP for Mobile Applications

    Tetsuya YAMADA  Makoto ISHIKAWA  Yuji OGATA  Takanobu TSUNODA  Takahiro IRITA  Saneaki TAMAKI  Kunihiko NISHIYAMA  Tatsuya KAMEI  Ken TATEZAWA  Fumio ARAKAWA  Takuichiro NAKAZAWA  Toshihiro HATTORI  Kunio UCHIYAMA  

     
    INVITED PAPER

    Vol: E85-C No:2, Page(s): 253-262

    A 32-bit embedded RISC microprocessor core integrating a DSP has been developed using a 0.18-µm five-layer-metal CMOS technology. The integrated DSP has a single MAC and exploits CPU resources to reduce hardware, occupying only 0.5 mm². The processor core includes a large on-chip 128 kB SRAM called U-memory; such a large-capacity on-chip memory decreases traffic to external memory and is effective for low-power, high-performance operation. To keep the power dissipated by U-memory accesses low, the active ratio of U-memory accesses is reduced. The critical path is the load path from the U-memory, which we optimized across the whole chip. The chip achieves 0.79 mA/MHz executing Dhrystone 1.1 at 108 MHz, making it suitable for mobile applications.
