Shanlin XIAO Tsuyoshi ISSHIKI Dongju LI Hiroaki KUNIEDA
Object detection is an essential and expensive process in many computer vision systems. Standard off-the-shelf embedded processors struggle to balance performance and power when running object detection applications. In this work, we explore an Application Specific Instruction set Processor (ASIP) for object detection using the Histogram of Oriented Gradients (HOG) feature. Algorithm simplifications are adopted to reduce memory bandwidth requirements and mathematical complexity without losing reliability. In addition, parallel histogram generation and an on-the-fly Support Vector Machine (SVM) calculation architecture are employed to reduce the necessary cycle counts. The HOG algorithm on the proposed ASIP was accelerated by a factor of 63x compared to the pure software implementation. The ASIP was synthesized for a standard 90 nm CMOS library, with a silicon area of 1.31 mm² and 47.8 mW power consumption at a 200 MHz frequency. Our object detection processor can achieve 42 frames per second (fps) on VGA video. The evaluation and implementation results show that the proposed ASIP is both area-efficient and power-efficient while being competitive with commercial CPUs/DSPs. Furthermore, our ASIP exhibits performance comparable even to hard-wired designs.
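For readers unfamiliar with the HOG feature, the following is a minimal sketch of the textbook per-cell orientation histogram that such a processor accelerates. It is not the simplified algorithm of the paper, whose details the abstract does not give; the function name `cell_histogram`, the 9-bin unsigned orientation range, and the central-difference gradient are assumptions chosen for illustration.

```python
import math

def cell_histogram(cell, bins=9):
    """Orientation histogram of one cell (illustrative, textbook HOG).

    `cell` is a list of rows of gray levels. Gradients use central
    differences, so only interior pixels contribute. Orientations are
    unsigned (0 to 180 degrees), and each pixel votes with its gradient
    magnitude into a single bin (no interpolation between bins).
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    height, width = len(cell), len(cell[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The extra modulo guards against angle rounding up to 180.0.
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

if __name__ == "__main__":
    # A horizontal intensity ramp: every gradient points along the x axis.
    ramp = [[x for x in range(8)] for _ in range(8)]
    print(cell_histogram(ramp))
```

The square root and arctangent in this form are the kind of mathematical complexity that algorithm simplifications typically target in hardware.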
Shanlin XIAO Tsuyoshi ISSHIKI Dongju LI Hiroaki KUNIEDA
Object detection is at the heart of nearly all computer vision systems. Standard off-the-shelf embedded processors struggle to meet the trade-offs among performance, power consumption and flexibility required by object detection applications. Therefore, this paper presents an Application Specific Instruction set Processor (ASIP) for object detection using an AdaBoost-based learning algorithm with Haar-like features as weak classifiers. Algorithm optimizations are employed to reduce memory bandwidth requirements without losing reliability. In the proposed ASIP, a Single Instruction Multiple Data (SIMD) architecture is adopted to fully exploit the data-level parallelism inherent in the target algorithm. By adding pipeline stages, application-specific hardware components and custom instructions, the AdaBoost algorithm is accelerated by a factor of 13.7x compared to the optimized pure software implementation. Compared with ARM946 and TMS320C64+, our ASIP shows 32x and 7x better throughput, 10x and 224x better area efficiency, and 6.8x and 18.8x better power efficiency, respectively. Furthermore, compared to hard-wired designs, evaluation results show an advantage of the proposed architecture in terms of chip area efficiency while maintaining reliable performance and achieving real-time object detection at 32 fps on VGA video.
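As background, a Haar-like feature is a difference of pixel sums over adjacent rectangles, and it is commonly evaluated with an integral image so that each rectangle sum costs four lookups. The sketch below illustrates that standard technique only; the abstract does not state whether the ASIP uses an integral image, and the function names are hypothetical.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) table where ii[y][x] is the sum of img[0:y][0:x]."""
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle with top-left corner (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def two_rect_feature(ii, x, y, w, h):
    """Two-rectangle Haar-like feature: left half minus right half (w even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

if __name__ == "__main__":
    image = [[1, 2], [3, 4]]
    table = integral_image(image)
    print(rect_sum(table, 0, 0, 2, 2), two_rect_feature(table, 0, 0, 2, 2))
```

Because every window position evaluates many such features independently, the computation maps naturally onto the data-level parallelism that a SIMD datapath exploits.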
Agus BEJO Dongju LI Tsuyoshi ISSHIKI Hiroaki KUNIEDA
This paper first presents a processor design based on the Derivative ASIP approach. The processor architecture is designed using a well-known embedded processor's instruction set as the base architecture. To improve performance, the architecture is enhanced with additional hardware resources, such as registers, interfaces and instruction extensions, chosen to meet the target specifications. Second, a new approach to compiler retargeting by means of an assembly converter tool is proposed. Our retargeting approach is practical because it is performed by the assembly converter tool with a simple configuration file and is independent of the base compiler. With the proposed approach, both architecture flexibility and good assembly code quality can be obtained at once. In comparisons with other compilers, experiments show that our approach is capable of generating code as efficient as that of its base compiler, and that the developed ASIP performs better than its base processor.
Hsuan-Chun LIAO Mochamad ASRI Tsuyoshi ISSHIKI Dongju LI Hiroaki KUNIEDA
An image processing engine is crucial for generating high-quality images in a video system. Because an Application Specific Integrated Circuit (ASIC) is dedicated to specific standards, an Application Specific Instruction-set Processor (ASIP), which provides high flexibility and high performance, seems to have more advantages in supporting the nonstandard pre/post image processing in a video system. In our previous work, we designed ASIPs that can perform several image processing algorithms with a reconfigurable datapath. The ASIP is as efficient as a DSP, but its area is considerably smaller. As image resolution and processing complexity increase, the performance requirement increases accordingly. In this paper, we present a novel multi-ASIP-based image processing unit (IPU) which can provide sufficient performance for emerging very-high-resolution applications. To provide a high-performance image processing engine, we propose several new techniques and architectures: the multi block-pipes architecture, pixel direct transmission and boundary pixel write-through. The multi block-pipes architecture scales flexibly to support a wide range of resolutions, from low to high. The boundary pixel write-through technique provides highly efficient parallel processing, and the pixel direct transmission technique is implemented in each ASIP to further reduce the data transmission time. Cycle-accurate SystemC simulations are performed, and the experimental results show that the maximum bandwidth of the proposed communication approach reaches 1580 Mbyte/s at 400 MHz. Moreover, communication overhead is reduced by up to 88% compared to our previous work.
Hsuan-Chun LIAO Mochamad ASRI Tsuyoshi ISSHIKI Dongju LI Hiroaki KUNIEDA
Emerging image and video applications and conventional MPSoC architectures encounter drastically increasing performance and flexibility requirements. To display high-quality images, a large amount of image processing needs to be carried out. These image processing algorithms are nonstandard and vary case by case, and it is difficult to achieve real-time processing using general-purpose processors or DSPs. In this paper, we present two reconfigurable Application Specific Instruction-set Processors (ASIPs) which can perform several image processing algorithms using the same processor architecture. These ASIPs achieve performance similar to a DSP, while their area is considerably smaller than that of a DSP and slightly larger than that of a conventional RISC processor. The 1D ASIP achieves 16 times higher performance than a RISC processor, and the 2D ASIP achieves 3 to 7 times higher performance than a RISC processor.
Yuli ZHANG Jun HAN Xinqian WENG Zhongzhu HE Xiaoyang ZENG
This paper presents an Application Specific Instruction-set Processor (ASIP) for the SHA-3 BLAKE algorithm family, built by instruction set extensions (ISE) of a RISC (reduced instruction set computer) processor. Through a design space exploration of this ASIP aimed at increasing performance and reducing area cost, we accomplish an efficient hardware and software implementation of the BLAKE algorithm. The special instructions and their well-matched hardware function unit speed up the calculation of the key section of the algorithm, namely the G-functions. Also, relaxing the timing constraint of the special function unit can decrease its hardware cost while keeping the high data throughput of the processor. Evaluation results reveal that the ASIP achieves 335 Mbps and 176 Mbps for BLAKE-256 and BLAKE-512, respectively. The extra area cost is only 8.06k equivalent gates. The proposed ASIP outperforms several software approaches on various platforms in cycles per byte. In fact, both the high throughput and the low hardware cost achieved by this programmable processor are comparable to those of ASIC implementations.
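To show what the G-function involves, here is a minimal software sketch of the BLAKE-256 G function as it is commonly stated in the public BLAKE documentation (32-bit additions, XORs and right rotations by 16, 12, 8 and 7). It should be checked against the specification before any real use. The names `g256` and `rotr32` are hypothetical, and the caller is assumed to have already selected the message words and constants through the round permutation.

```python
MASK32 = 0xFFFFFFFF

def rotr32(value, amount):
    """Rotate a 32-bit word right by `amount` bits."""
    return ((value >> amount) | (value << (32 - amount))) & MASK32

def g256(a, b, c, d, m0, m1, k0, k1):
    """One BLAKE-256 G function on state words a, b, c, d.

    m0, m1 are the two message words and k0, k1 the two constants
    chosen by the round permutation (assumed done by the caller).
    Note the cross pairing: m0 is mixed with k1, and m1 with k0.
    """
    a = (a + b + (m0 ^ k1)) & MASK32
    d = rotr32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 12)
    a = (a + b + (m1 ^ k0)) & MASK32
    d = rotr32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 7)
    return a, b, c, d

if __name__ == "__main__":
    print([hex(word) for word in g256(1, 2, 3, 4, 5, 6, 7, 8)])
```

Each call is a chain of eight dependent add, XOR and rotate steps on a plain RISC core, which is why fusing them into special instructions with a dedicated function unit pays off.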
Hirofumi IWATO Keishi SAKANUSHI Yoshinori TAKEUCHI Masaharu IMAI
To measure the detrusor pressure for diagnosing lower urinary tract symptoms, we designed a small-area and low-power System on a Chip (SoC). The SoC must be small and low power because it is encapsulated in tiny air-tight capsules which are simultaneously inserted in the urinary bladder and rectum for several days. Since the SoC is also required to be programmable, we designed an Application Specific Instruction set Processor (ASIP) for pressure measurement and wireless communication, and implemented almost all required functions on the ASIP. The SoC was fabricated using a 0.18 µm CMOS mixed-signal process and the chip size is 2.5 × 2.5 mm². Evaluation results show that the power consumption of the SoC is 93.5 µW, and that it can operate the capsule for seven days with a tiny battery.
Takashi NOGUCHI Toyoaki SUZUKI Tomonori TAMURA
We have developed a process for the fabrication of high-quality Nb/AlOx/Nb tunnel junctions with small area and high current densities for the heterodyne mixers at millimeter and submillimeter wavelengths. Their dc I-V curves are numerically studied, including the broadening of quasiparticle density of states resulting from the existence of an imaginary part of the gap energy of Nb. We have found both experimentally and numerically that the subgap current is strongly dependent on bias voltage at temperatures below 4.2 K unlike the prediction of the BCS tunneling theory. It is shown that calculated dc I-V curves taking into account the complex number of the gap energy agree well with those of Nb/AlOx/Nb junctions measured at temperatures from 0.4 to 4.2 K. We have successfully built receivers at millimeter and submillimeter wavelengths with the noise temperature as low as 4 times the quantum photon noise, employing those high-quality Nb/AlOx/Nb junctions. Those low-noise receivers are to be installed in the ALMA (Atacama Large Millimeter/Submillimeter Array) telescope and they are going into series production now.
Kang ZHAO Jinian BIAN Sheqin DONG Yang SONG Satoshi GOTO
To achieve an automated implementation of application-specific heterogeneous multiprocessor systems-on-chip (MPSoC), partitioning and mapping sequential programs onto multiple parallel processors is one of the most difficult challenges. However, existing parallelizing techniques cannot solve the MPSoC-related problems effectively, so designers are still required to manually extract the concurrency potential of the program. To remove this bottleneck, an automated application partitioning technique is needed. However, completely automatic parallelization is ineffective, so it is promising to explore concurrency for certain practical special structures. To settle these issues, this paper proposes a template-based algorithm to automatically partition a special load-compute-store (LCS) loop structure. Since specific-instruction customization for application specific instruction-set processors (ASIPs) interacts with task partitioning, the proposed algorithm integrates the dynamic pipelining and ASIP techniques using an iterative improvement strategy: first, an initial pipelining scheme is generated to obtain the maximum parallelism; second, under the primary partition results, specific instructions are customized for each subprogram; third, the program is repartitioned via pipelining under the specific instruction configurations. The proposed method has been implemented in the context of a commercial extensible multiprocessor design flow, using the Xtensa-based XTMP platform from Tensilica Inc. Based on a case study of the Fast Fourier Transform (FFT), the experimental results indicate that the programs partitioned by the proposed method demonstrate an average speedup of 10 compared to the original sequential programs, which are not partitioned and run on a uniprocessor system.
Ittetsu TANIGUCHI Praveen RAGHAVAN Murali JAYAPALA Francky CATTHOOR Yoshinori TAKEUCHI Masaharu IMAI
Low-energy, high-performance embedded processors are crucial in the design of future nomadic embedded systems. Improving memory accesses, especially their spatial and temporal locality, is a well-known technique to reduce energy and increase performance. However, after transformations that improve locality, address calculation often becomes a bottleneck. In this paper, we propose a novel AGU (Address Generation Unit) exploration and mapping technique based on a reconfigurable AGU model. Experimental results show that the proposed techniques help explore AGU architectures effectively and that designers can obtain trade-offs for real-life applications in about 10 hours.
Kang ZHAO Jinian BIAN Sheqin DONG Yang SONG Satoshi GOTO
Programming the multiprocessor system-on-chip (MPSoC) requires partitioning the sequential reference programs onto multiple processors running in parallel. However, designers still need to partition the code manually due to the lack of automated partition techniques. To settle this issue, this paper proposes a partition exploration algorithm based on the search space smoothing techniques, and implements the proposed method using a commercial extensible processor (Xtensa LX2 processor from Tensilica Inc.). We have verified the feasibility of the algorithm by implementing the MPEG2 benchmark on the Xtensa-based two-processor system. The final experimental results indicate a performance improvement of at least 1.6 compared to the single-processor system.
Ha-young JEONG Min-young CHO Won HUR Yong-surk LEE
In this letter, we propose a partial access mechanism to be used on a register file for a low-cost embedded multimedia processor architecture. In an embedded system, supporting SIMD operations is a burden because of the wide SIMD register file and its execution unit. A new architecture is therefore proposed to increase the performance of SIMD operations with minimal additional hardware overhead. To evaluate the performance and hardware overhead, this architecture is applied to an embedded multimedia processor and simulated with five DSP benchmarks. The simulation results indicate that the performance is increased by 38% and the total area is increased by 13.4%. The proposed partial access mechanism may be useful for low-cost embedded multimedia ASIPs.
Kang ZHAO Jinian BIAN Sheqin DONG Yang SONG Satoshi GOTO
To improve the computation efficiency of the application specific instruction-set processor (ASIP), a strategy of hardware/software collaborative design is usually utilized. In this process, the auto-customization of the specific instruction set has always been a key part of supporting the automated design of ASIPs. The key issue is how to effectively reduce the exponentially large exploration space in the instruction identification process. To address this issue, we first formulate it as a feasible sub-graph enumeration problem under multiple constraints, and then propose a fast instruction identification algorithm based on a new model called the basic convex pattern (BCP). The kernel technique in this algorithm is the transformation from graph exploration to formula-based computations. The experimental results indicate that the proposed algorithm distinctly reduces the execution time.
This paper proposes an efficient method for design space exploration of the global optimum configuration for parameterized ASIPs. The method not only guarantees the optimum configuration, but also provides robust speedup for a wide range of processor architectures such as SoC, ASIC as well as ASIP. The optimization procedure within this method takes a two-step approach. First, design parameters are partitioned into clusters of inter-dependent parameters using parameter dependency information. Second, parameters are optimized for each cluster, and the results are merged into the global optimum. In this optimization, inferior configurations are extensively pruned with a detailed optimality mapping between dependent parameters. Experimental results with mediabench applications show an average optimization speedup of 4.1 times over the previous work, which is a significant improvement for practical use.
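The two-step idea can be sketched as follows, under stated assumptions: parameters are grouped into clusters as connected components of a dependency graph, and each cluster is then searched independently. The exhaustive per-cluster search stands in for the paper's pruning with optimality mapping, which the abstract does not detail, and the cost is assumed to decompose additively across clusters. All names (`clusters`, `optimize`, `cluster_cost`) and the example parameters are hypothetical.

```python
from itertools import product

def clusters(params, dependencies):
    """Group parameters into connected components using union-find."""
    parent = {p: p for p in params}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in dependencies:
        parent[find(a)] = find(b)
    groups = {}
    for p in params:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())

def optimize(domains, dependencies, cluster_cost):
    """Optimize each cluster on its own, then merge the partial results.

    Assumes the total cost is the sum of independent per-cluster costs,
    so the merged configuration is optimal under that assumption.
    """
    best = {}
    for group in clusters(list(domains), dependencies):
        candidates = product(*(domains[p] for p in group))
        choice = min(candidates,
                     key=lambda values: cluster_cost(dict(zip(group, values))))
        best.update(zip(group, choice))
    return best

if __name__ == "__main__":
    domains = {"cache_kb": [1, 2, 4], "line_bytes": [16, 32], "regs": [8, 16]}
    deps = [("cache_kb", "line_bytes")]

    def cost(cfg):  # toy cost model, purely illustrative
        if "regs" in cfg:
            return abs(cfg["regs"] - 16)
        return abs(cfg["cache_kb"] * cfg["line_bytes"] - 64) + cfg["cache_kb"]

    print(optimize(domains, deps, cost))
```

In this toy case the search visits 6 + 2 configurations instead of the 12 in the full cross product, and the gap widens quickly as the number of independent clusters grows.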
Shinsuke KOBAYASHI Kentaro MITA Yoshinori TAKEUCHI Masaharu IMAI
This paper proposes a compiler generation method for PEAS-III (Practical Environment for ASIP development), which is a configurable processor development environment for application domain specific embedded systems. Using the PEAS-III system, not only the HDL description of a target processor but also its target compiler can be generated. Therefore, execution cycles and dynamic power consumption can be rapidly evaluated. Two processors and their derivatives were designed using the PEAS-III system in the experiment. Experimental results show that the trade-offs among area, performance and power consumption of processors were analyzed in about twelve hours and the optimal processor was selected under the design constraints by using generated compilers and processors.
Ienari IGUCHI Takuya IMAIZUMI Tomoyuki KAWAI Yukio TANAKA Satoshi KASHIWAYA
We report measurements on ramp-edge type Josephson and quasiparticle tunnel junctions with different interface angle geometries using high-Tc YBa2Cu3O7-y (YBCO) electrodes. YBCO/I/Ag tunnel junctions with different crystal-interface boundary angles are fabricated for the investigation of the zero bias conductance peak. The angle-dependent zero bias conductance peak typical of a dx²-y²-wave superconductor is observable. For Josephson junctions, YBCO ramp-edge junctions with different ab-plane electrodes relatively rotated by 45° are fabricated using a CeO2 seed-layer technique. The temperature dependence of the maximum Josephson current for YBCO/PBCO/YBCO junctions (PBCO: PrBa2Cu3O7-y) exhibits angle-dependent behavior, qualitatively different from the Ambegaokar-Baratoff prediction. Under microwave irradiation of 9 GHz, the Shapiro steps appear at integer and/or half integer multiples of the voltage satisfying the Josephson voltage-frequency relation, whose behavior depends on the sample angle geometry. The results are reasonably interpreted by the dx²-y²-wave theory by taking the zero energy state into account.
Chang-Qing XU Ken FUJITA Andrew R. PRATT Yoh OGAWA Takeshi KAMIJOH
1.5 µm-band LiNbO3 quasiphase matched (QPM) wavelength converters, consisting of a periodical domain inverted structure and a proton exchanged waveguide, have been studied in detail both theoretically and experimentally. Optimum device fabrication conditions are investigated with respect to waveguide propagation loss, coupling loss to a single-mode fiber and wavelength conversion efficiency. A normalized conversion efficiency as high as 200 %/W (by a SHG measurement) and a fiber-to-fiber insertion loss of less than 3.5 dB (@1.55 µm) are obtained for a wavelength converter module with a device length of 40 mm. It is shown that a highly uniform periodical domain inverted structure and a uniform proton exchange waveguide are key to obtaining efficient wavelength conversion. The tolerance of the waveguide width fluctuation is found to be very critical and is less than 20 nm for a 40 mm-long device. Future optimization of LiNbO3 QPM wavelength converters and the possible device applications in future optical communication systems are also presented.
Katsuya SHINOHARA Norimasa OHTSUKI Yoshinori TAKEUCHI Masaharu IMAI
This paper proposes an ASIP performance optimization method taking clock frequency into account. The performance of an instruction set processor can be measured using the execution time of an application program, which can be determined by the clock cycles to perform the application program divided by the applied clock frequency. Therefore, the clock frequency should also be tuned in order to maximize the performance of the processor under the given design constraints. Experimental results show that the proposed method determines an optimal combination of FUs considering clock frequency.
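The relation stated above is execution time = clock cycles / clock frequency, so a configuration that needs more cycles can still win if it permits a faster clock. A minimal sketch of that comparison follows; the candidate configurations and their numbers are invented for illustration and are not taken from the paper.

```python
def execution_time(cycles, clock_hz):
    """Execution time in seconds: clock cycles divided by clock frequency."""
    return cycles / clock_hz

def best_configuration(configs):
    """Pick the configuration with the shortest execution time."""
    return min(configs, key=lambda c: execution_time(c["cycles"], c["clock_hz"]))

if __name__ == "__main__":
    # Hypothetical functional-unit choices: the fast multiplier saves cycles
    # but lengthens the critical path, which lowers the achievable clock.
    candidates = [
        {"name": "single-cycle multiplier", "cycles": 1_000_000, "clock_hz": 50e6},
        {"name": "multi-cycle multiplier", "cycles": 1_200_000, "clock_hz": 80e6},
    ]
    for cfg in candidates:
        print(cfg["name"], execution_time(cfg["cycles"], cfg["clock_hz"]))
    print("best:", best_configuration(candidates)["name"])
```

Minimizing cycle count alone would pick the first candidate here, which is the pitfall that tuning the clock frequency together with the functional units avoids.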
Nguyen Ngoc BINH Masaharu IMAI Yoshinori TAKEUCHI
In designing ASIPs (Application Specific Integrated Processors), the papers published so far have mostly focused on the optimization of the CPU core and have not paid enough attention to optimizing the RAM and ROM sizes together with it. This paper overcomes this limitation and proposes an optimization algorithm to define the best ratio between the CPU core, RAM and ROM of an ASIP chip to achieve the highest performance while satisfying design constraints on the chip area. The partitioning problem is formalized as a combinatorial optimization problem that partitions the operations into hardware and software so that the performance of the designed ASIP is maximized under a given chip area constraint, where the chip area includes the HW cost of the register file for a given application program with an associated input data set. The optimization problem is parameterized so that it can be applied with different technologies to synthesize CPU cores, RAMs or ROMs. The experimental results show that the proposed algorithm is effective and efficient.
Kyung-Sik JANG Hiroaki KUNIEDA
In this paper, a systematic method for synthesizing the datapath of an Application Specific Instruction Processor (ASIP) is proposed. The behavioral description of the application is written in instruction code defined on an abstract machine. We introduce the register transfer graph (RTG) to represent instructions and the synthesis constraint tree to select combinations of synthesis constraints, in order to explore the design space along the area and performance axes. High performance is achieved by scheduling the micro-operations of instructions out of order. A practical datapath is synthesized by considering connection geometry as well as the maximum utilization of hardware resources. To reduce connection cost, data transfer paths are minimized by replacing an inefficient data transfer path with its bypass route. The feasibility of the proposed synthesis method is verified with several experimental instruction sequences.