The search functionality is under construction.

Keyword Search Result

[Keyword] ASIP(26hit)

1-20hit(26hit)

  • HOG-Based Object Detection Processor Design Using ASIP Methodology

    Shanlin XIAO  Tsuyoshi ISSHIKI  Dongju LI  Hiroaki KUNIEDA  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E100-A No:12
      Page(s):
    2972-2984

    Object detection is an essential and expensive process in many computer vision systems. Standard off-the-shelf embedded processors are hard to achieve performance-power balance for implementation of object detection applications. In this work, we explore an Application Specific Instruction set Processor (ASIP) for object detection using Histogram of Oriented Gradients (HOG) feature. Algorithm simplifications are adopted to reduce memory bandwidth requirements and mathematical complexity without losing reliability. Also, parallel histogram generation and on-the-fly Support Vector Machine (SVM) calculation architecture are employed to reduce the necessary cycle counts. The HOG algorithm on the proposed ASIP was accelerated by a factor of 63x compared to the pure software implementation. The ASIP was synthesized for a standard 90nm CMOS library, with a silicon area of 1.31mm2 and 47.8mW power consumption at a 200MHz frequency. Our object detection processor can achieve 42 frames-per-second (fps) on VGA video. The evaluation and implementation results show that the proposed ASIP is both area-efficient and power-efficient while being competitive with commercial CPUs/DSPs. Furthermore, our ASIP exhibits comparable performance even with hard-wire designs.

  • Design of an Application Specific Instruction Set Processor for Real-Time Object Detection Using AdaBoost Algorithm

    Shanlin XIAO  Tsuyoshi ISSHIKI  Dongju LI  Hiroaki KUNIEDA  

     
    PAPER

      Vol:
    E100-A No:7
      Page(s):
    1384-1395

    Object detection is at the heart of nearly all the computer vision systems. Standard off-the-shelf embedded processors are hard to meet the trade-offs among performance, power consumption and flexibility required by object detection applications. Therefore, this paper presents an Application Specific Instruction set Processor (ASIP) for object detection using AdaBoost-based learning algorithm with Haar-like features as weak classifiers. Algorithm optimizations are employed to reduce memory bandwidth requirements without losing reliability. In the proposed ASIP, Single Instruction Multiple Data (SIMD) architecture is adopted for fully exploiting data-level parallelism inherent to the target algorithm. With adding pipeline stages, application-specific hardware components and custom instructions, the AdaBoost algorithm is accelerated by a factor of 13.7x compared to the optimized pure software implementation. Compared with ARM946 and TMS320C64+, our ASIP shows 32x and 7x better throughput, 10x and 224x better area efficiency, 6.8x and 18.8x better power efficiency, respectively. Furthermore, compared to hard-wired designs, evaluation results show an advantage of the proposed architecture in terms of chip area efficiency while maintain a reliable performance and achieve real-time object detection at 32fps on VGA video.

  • Retargeting Derivative-ASIP with Assembly Converter Tool

    Agus BEJO  Dongju LI  Tsuyoshi ISSHIKI  Hiroaki KUNIEDA  

     
    PAPER-Computer System

      Vol:
    E97-D No:5
      Page(s):
    1188-1195

    This paper firstly presents a processor design with Derivative ASIP approach. The architecture of processor is designed by making use of a well-known embedded processor's instruction-set as a base architecture. To improve its performance, the architecture is enhanced with more hardware resources such as registers, interfaces and instruction extensions which might achieve target specifications. Secondly, a new approach for retargeting compiler by means of assembly converter tool is proposed. Our retargeting approach is practical because it is performed by the assembly converter tool with a simple configuration file and independent from a base compiler. With our proposed approach, both architecture flexibility and a good quality of assembly code can be obtained at once. Compared to other compilers, experiments show that our approach capable of generating code as high efficiency as its base compiler and the developed ASIP results in better performance than its base processor.

  • A Design of High Performance Parallel Architecture and Communication for Multi-ASIP Based Image Processing Engine

    Hsuan-Chun LIAO  Mochamad ASRI  Tsuyoshi ISSHIKI  Dongju LI  Hiroaki KUNIEDA  

     
    PAPER

      Vol:
    E96-A No:6
      Page(s):
    1222-1235

    Image processing engine is crucial for generating high quality images in video system. As Application Specific Integrated Circuit (ASIC) is dedicated for specific standards, Application Specific Instruction-set Processor (ASIP) which provides high flexibility and high performance seems to have more advantages in supporting the nonstandard pre/post image processing in video system. In our previous work, we have designed some ASIPs that can perform several image processing algorithms with a reconfigurable datapath. ASIP is as efficient as DSP, but its area is considerably smaller than DSP. As the resolution of image and the complexity of processing increase, the performance requirement also increases accordingly. In this paper, we presents a novel multi ASIP based image processing unit (IPU) which can provide sufficient performance for the emerging very-high-resolution applications. In order to provide a high performance image processing engine, we propose several new techniques and architecture such as multi block-pipes architecture, pixel direct transmission and boundary pixel write-through. Multi block-pipes architecture has flexible scalability in supporting a various ranges of resolution, which ranges from low resolution to high resolution. The boundary pixel write-through technique provides high efficient parallel processing, and pixel direct transmission technique is implemented in each ASIP to further reduce the data transmission time. Cycle-accurate SystemC simulations are performed, and the experimental results show that the maximum bandwidth of the proposed communication approach can achieve up to 1580 Mbyte/s at 400 MHz. Moreover, communication overhead can be reduced about a maximum of 88% compared to our previous works.

  • A High Level Design of Reconfigurable and High-Performance ASIP Engine for Image Signal Processing

    Hsuan-Chun LIAO  Mochamad ASRI  Tsuyoshi ISSHIKI  Dongju LI  Hiroaki KUNIEDA  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E95-A No:12
      Page(s):
    2373-2383

    Emerging image and video applications and conventional MPSoC architectures encounter drastically increasing performance and flexibility requirements. In order to display high quality images, large amount of image processing needs to be carried out. These image processing algorithms are nonstandard and vary case by case, and it is difficult to achieve real time processing by using general purpose processors or DSP. In this paper, we present two reconfigurable Application Specific Instruction-set Processors (ASIP) which can perform several image processing algorithms by using the same processor architecture. These ASIPs can achieve performance similar to DSP; meanwhile, while the area is considerably smaller than DSP and slightly bigger than conventional RISC processor. 1D ASIP can perform 16 times higher compared to a RISC processor, and 2D ASIP can perform 3 to 7 times higher compared to a RISC processor.

  • Design Approach and Implementation of Application Specific Instruction Set Processor for SHA-3 BLAKE Algorithm

    Yuli ZHANG  Jun HAN  Xinqian WENG  Zhongzhu HE  Xiaoyang ZENG  

     
    PAPER-Electronic Circuits

      Vol:
    E95-C No:8
      Page(s):
    1415-1426

    This paper presents an Application Specific Instruction-set Processor (ASIP) for the SHA-3 BLAKE algorithm family by instruction set extensions (ISE) from an RISC (reduced instruction set computer) processor. With a design space exploration for this ASIP to increase the performance and reduce the area cost, we accomplish an efficient hardware and software implementation of BLAKE algorithm. The special instructions and their well-matched hardware function unit improve the calculation of the key section of the algorithm, namely G-functions. Also, relaxing the time constraint of the special function unit can decrease its hardware cost, while keeping the high data throughput of the processor. Evaluation results reveal the ASIP achieves 335 Mbps and 176 Mbps for BLAKE-256 and BLAKE-512. The extra area cost is only 8.06k equivalent gates. The proposed ASIP outperforms several software approaches on various platforms in cycle per byte. In fact, both high throughput and low hardware cost achieved by this programmable processor are comparable to that of ASIC implementations.

  • A Small-Area and Low-Power SoC for Less-Invasive Pressure Sensing Capsules in Ambulatory Urodynamic Monitoring

    Hirofumi IWATO  Keishi SAKANUSHI  Yoshinori TAKEUCHI  Masaharu IMAI  

     
    PAPER

      Vol:
    E95-C No:4
      Page(s):
    487-494

    To measure the detrusor pressure for diagnosing lower urinary tract symptoms, we designed a small-area and low-power System on a Chip (SoC). The SoC should be small and low power because it is encapsulated in tiny air-tight capsules which are simultaneously inserted in the urinary bladder and rectum for several days. Since the SoC is also required to be programmable, we designed an Application Specific Instruction set Processor (ASIP) for pressure measurement and wireless communication, and implemented almost required functions on the ASIP. The SoC was fabricated using a 0.18 µm CMOS mixed-signal process and the chip size is 2.5 2.5 mm2. Evaluation results show that the power consumption of the SoC is 93.5 µW, and that it can operate the capsule for seven days with a tiny battery.

  • SIS Junctions for Millimeter and Submillimeter Wave Mixers Open Access

    Takashi NOGUCHI  Toyoaki SUZUKI  Tomonori TAMURA  

     
    INVITED PAPER

      Vol:
    E95-C No:3
      Page(s):
    320-328

    We have developed a process for the fabrication of high-quality Nb/AlOx/Nb tunnel junctions with small area and high current densities for the heterodyne mixers at millimeter and submillimeter wavelengths. Their dc I-V curves are numerically studied, including the broadening of quasiparticle density of states resulting from the existence of an imaginary part of the gap energy of Nb. We have found both experimentally and numerically that the subgap current is strongly dependent on bias voltage at temperatures below 4.2 K unlike the prediction of the BCS tunneling theory. It is shown that calculated dc I-V curves taking into account the complex number of the gap energy agree well with those of Nb/AlOx/Nb junctions measured at temperatures from 0.4 to 4.2 K. We have successfully built receivers at millimeter and submillimeter wavelengths with the noise temperature as low as 4 times the quantum photon noise, employing those high-quality Nb/AlOx/Nb junctions. Those low-noise receivers are to be installed in the ALMA (Atacama Large Millimeter/Submillimeter Array) telescope and they are going into series production now.

  • Pipeline-Based Partition Exploration for Heterogeneous Multiprocessor Synthesis

    Kang ZHAO  Jinian BIAN  Sheqin DONG  Yang SONG  Satoshi GOTO  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E92-A No:9
      Page(s):
    2283-2294

    To achieve an automated implementation for the application-specific heterogeneous multiprocessor systems-on-chip (MPSoC), partitioning and mapping the sequential programs onto multiple parallel processors is one of the most difficult challenges. However, the existing traditional parallelizing techniques cannot solve the MPSoC-related problems effectively, so designers are still required to manually extract the concurrency potentials in the program. To solve this bottleneck, an automated application partition technique is needed. However, completely automatic parallelism is ineffective, so it is promising to explore concurrency for certain practical special structures. To settle those issues, this paper proposes a template-based algorithm to automatically partition a special load-compute-store (LCS) loop structure. Since specific-instruction customization for the application specific instruction-set processors (ASIPs) has interactions with task partitioning, the proposed algorithm integrates the dynamic pipelining and ASIP techniques using an iterative improvement strategy: first, an initial pipelining scheme is generated to obtain the maximum parallelism; second, under the primary partition results specific instructions are customized respectively for each subprogram; third, the program is repartitioned via pipelining under the specific instruction configurations. The proposed method has been implemented in the context of a commercial extensible multiprocessor design flow, using the Xtensa-based XTMP platform from Tensilica Inc. Based on a case study of Fast Fourier Transform (FFT), the experimental results indicate that the partitioned programs by the proposed method demonstrate an average speedup of 10 compared to the original sequential programs which have not been partitioned and run on the uniprocessor system.

  • Reconfigurable AGU: An Address Generation Unit Based on Address Calculation Pattern for Low Energy and High Performance Embedded Processors

    Ittetsu TANIGUCHI  Praveen RAGHAVAN  Murali JAYAPALA  Francky CATTHOOR  Yoshinori TAKEUCHI  Masaharu IMAI  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E92-A No:4
      Page(s):
    1161-1173

    Low energy and high performance embedded processor is crucial in the future nomadic embedded systems design. Improvement of memory accesses, especially improvement of spatial and temporal locality is well known technique to reduce energy and increase performance. However, after transformations that improve locality, address calculation often becomes a bottleneck. In this paper, we propose novel AGU (Address Generation Unit) exploration and mapping technique based on a reconfigurable AGU model. Experimental results show that the proposed techniques help exploring AGU architectures effectively and designers can get trade-offs of real life applications for about 10 hours.

  • Exploring Partitions Based on Search Space Smoothing for Heterogeneous Multiprocessor System

    Kang ZHAO  Jinian BIAN  Sheqin DONG  Yang SONG  Satoshi GOTO  

     
    PAPER-Electronic Circuits and Systems

      Vol:
    E91-A No:9
      Page(s):
    2456-2464

    Programming the multiprocessor system-on-chip (MPSoC) requires partitioning the sequential reference programs onto multiple processors running in parallel. However, designers still need to partition the code manually due to the lack of automated partition techniques. To settle this issue, this paper proposes a partition exploration algorithm based on the search space smoothing techniques, and implements the proposed method using a commercial extensible processor (Xtensa LX2 processor from Tensilica Inc.). We have verified the feasibility of the algorithm by implementing the MPEG2 benchmark on the Xtensa-based two-processor system. The final experimental results indicate a performance improvement of at least 1.6 compared to the single-processor system.

  • A Partial Access Mechanism on a Register for Low-Cost Embedded Multimedia ASIP

    Ha-young JEONG  Min-young CHO  Won HUR  Yong-surk LEE  

     
    LETTER-Integrated Electronics

      Vol:
    E91-C No:7
      Page(s):
    1171-1174

    In this letter, we propose a partial access mechanism to be used on a register file for low-cost embedded multimedia processor architecture. In the embedded system, supporting the SIMD operations is a burden because of the wide SIMD register file and its execution unit. So a new architecture is proposed to increase the performance of SIMD operations with minimal additional hardware overhead. To evaluate the performance and hardware overhead, this architecture is adopted to an embedded multimedia processor and simulated with five DSP benchmarks. The simulation results indicate that the performance is increased by 38% and the total area is increased by 13.4%. The proposed partial access mechanism may be useful for low-cost embedded multimedia ASIP.

  • Fast Custom Instruction Identification Algorithm Based on Basic Convex Pattern Model for Supporting ASIP Automated Design

    Kang ZHAO  Jinian BIAN  Sheqin DONG  Yang SONG  Satoshi GOTO  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E91-A No:6
      Page(s):
    1478-1487

    To improve the computation efficiency of the application specific instruction-set processor (ASIP), a strategy of hardware/software collaborative design is usually utilized. In this process, the auto-customization of specific instruction set has always been a key part to support the automated design of ASIP. The key issue of this problem is how to effectively reduce the huge exponential exploration space in the instruction identification process. To address this issue, we first formulate it as a feasible sub-graph enumeration problem under multiple constraints, and then propose a fast instruction identification algorithm based on a new model called basic convex pattern (BCP). The kernel technique in this algorithm is the transformation from the graph exploration to the formula-based computations. The experimental results have indicated that the proposed algorithm has a distinct reduction in the execution time.

  • An Efficient Method for System-Level Exploration of Global Optimum in a Parameterized ASIP Design

    Yeong-Geol KIM  Tag-Gon KIM  

     
    LETTER-VLSI Design Technology and CAD

      Vol:
    E86-A No:12
      Page(s):
    3297-3302

    This paper proposes an efficient method for design space exploration of the global optimum configuration for parameterized ASIPs. The method not only guarantees the optimum configuration, but also provides robust speedup for a wide range of processor architectures such as SoC, ASIC as well as ASIP. The optimization procedure within this method takes a two-steps approach. Firstly, design parameters are partitioned into clusters of inter-dependent parameters using parameter dependency information. Secondly, parameters are optimized for each cluster, the results of which are merged for global optimum. In such optimization, inferior configurations are extensively pruned with a detailed optimality mapping between dependent parameters. Experimental results with mediabench applications show an optimization speedup of 4.1 times faster than the previous work on average, which is significant improvement for practical use.

  • A Compiler Generation Method for HW/SW Codesign Based on Configurable Processors

    Shinsuke KOBAYASHI  Kentaro MITA  Yoshinori TAKEUCHI  Masaharu IMAI  

     
    PAPER-Hardware/Software Codesign

      Vol:
    E85-A No:12
      Page(s):
    2586-2595

    This paper proposes a compiler generation method for PEAS-III (Practical Environment for ASIP development), which is a configurable processor development environment for application domain specific embedded systems. Using the PEAS-III system, not only the HDL description of a target processor but also its target compiler can be generated. Therefore, execution cycles and dynamic power consumption can be rapidly evaluated. Two processors and their derivatives were designed using the PEAS-III system in the experiment. Experimental results show that the trade-offs among area, performance and power consumption of processors were analyzed in about twelve hours and the optimal processor was selected under the design constraints by using generated compilers and processors.

  • Josephson and Quasiparticle Tunneling in Anisotropic High-Tc d-Wave Superconductors

    Ienari IGUCHI  Takuya IMAIZUMI  Tomoyuki KAWAI  Yukio TANAKA  Satoshi KASHIWAYA  

     
    INVITED PAPER-Novel Devices and Device Physics

      Vol:
    E85-C No:3
      Page(s):
    789-796

    We report the measurements on the ramp-edge type Josephson and quasiparticle tunnel junctions with the different interface angle geometry using high-Tc YBa2Cu3O7-y (YBCO) electrodes. The YBCO/I/Ag tunnel junctions with different crystal-interface boundary angles are fabricated for the investigation of zero bias conductance peak. The angle dependent zero bias conductance peak typical to a dx2-y2-wave superconductor is observable. For Josephson junctions, YBCO ramp-edge junctions with different ab-plane electrodes relatively rotated by 45are fabricated using a CeO2 seed-layer technique. The temperature dependence of the maximum Josephson current for YBCO/PBCO/YBCO junctions (PBCO: PrBa2Cu3O7-y) exhibits angle-dependent behavior, qualitatively different from the Ambegaokar-Baratoff prediction. Under microwave irradiation of 9 GHz, the Shapiro steps appear at integer and/or half integer multiples of the voltage satisfying Josephson voltage-frequency relation, whose behavior depends on the sample angle geometry. The results are reasonably interpreted by the dx2-y2-wave theory by taking the zero energy state into account.

  • Optimization of 1.5 µm-Band LiNbO3 Quasiphase Matched Wavelength Converters for Optical Communication Systems

    Chang-Qing XU  Ken FUJITA  Andrew R. PRATT  Yoh OGAWA  Takeshi KAMIJOH  

     
    PAPER-WDM Network Devices

      Vol:
    E83-C No:6
      Page(s):
    884-891

    1.5 µm-band LiNbO3 quasiphase matched (QPM) wavelength converters consisting of a periodical domain inverted structure and a proton exchanged waveguide, have been studied in detail both theoretically and experimentally. Optimum device fabrication conditions are investigated with respected to waveguide propagation loss, coupling loss to a single-mode fiber and wavelength conversion efficiency. A normalized conversion efficiency as high as 200 %/W (by a SHG measurement) and a fiber-to-fiber insertion loss of less than 3.5 dB (@1.55 µm) is obtained for a wavelength converter module with a device length of 40 mm. It is shown that a highly uniform periodical domain inverted structure and a uniform proton exchange waveguide are key to obtaining efficient wavelength conversion. The tolerance of the waveguide width fluctuation is found to be very critical and is less than 20 nm for a 40 mm-long device. Future optimization of LiNbO3 QPM wavelength converters and the possible device applications in future optical communication systems are also presented.

  • A Performance Optimization Method for Pipelined ASIPs in Consideration of Clock Frequency

    Katsuya SHINOHARA  Norimasa OHTSUKI  Yoshinori TAKEUCHI  Masaharu IMAI  

     
    PAPER

      Vol:
    E82-A No:11
      Page(s):
    2356-2365

    This paper proposes an ASIP performance optimization method taking clock frequency into account. The performance of an instruction set processor can be measured using the execution time of an application program, which can be determined by the clock cycles to perform the application program divided by the applied clock frequency. Therefore, the clock frequency should also be tuned in order to maximize the performance of the processor under the given design constraints. Experimental results show that the proposed method determines an optimal combination of FUs considering clock frequency.

  • An Optimization Algorithm for High Performance ASIP Design with Considering the RAM and ROM Sizes

    Nguyen Ngoc BINH  Masaharu IMAI  Yoshinori TAKEUCHI  

     
    PAPER-Co-design

      Vol:
    E81-A No:12
      Page(s):
    2612-2620

    In designing ASIPs (Application Specific Integrated Processors), the papers investigated so far have almost focused on the optimization of the CPU core and did not pay enough attention to the optimization of the RAM and ROM sizes together. This paper overcomes this limitation and proposes an optimization algorithm to define the best ratio between the CPU core, RAM and ROM of an ASIP chip to achieve the highest performance while satisfying design constraints on the chip area. The partitioning problem is formalized as a combinatorial optimization problem that partitions the operations into hardware and software so that the performance of the designed ASIP is maximized under given chip area constraint, where the chip area includes the HW cost of the register file for a given application program with associated input data set. The optimization problem is parameterized so that it can be applied with different technologies to synthesize CPU cores, RAMs or ROMs. The experimental results show that the proposed algorithm is found to be effective and efficient.

  • A New Approach for Datapath Synthesis of Application Specific Instruction Processor

    Kyung-Sik JANG  Hiroaki KUNIEDA  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E80-A No:8
      Page(s):
    1478-1488

    In this paper, a systematic method which synthesizes the datapath of Application Specific Instruction Processor (ASIP) is proposed. The behavioral description of application is written in instruction code defined on abstract machine. We introduce register transfer graph (RTG) to represent instructions and synthesis constraint tree to select the combinations of synthesis constraints to explore design space along area and performance axis. The high performance is achieved by scheduling micro-operations of instruction in out-of-order. The practical datapath is synthesized by considering connection geometry as well as the maximum utilization of hardware resources. To reduce connection cost, data transfer paths are minimized by replacing an inefficient data transfer path with its bypass route. The feasibility of the proposed synthesis method is verified with several experimental instruction sequences.

1-20hit(26hit)