The search functionality is under construction.

Author Search Result

[Author] Min ZHU(14hit)

1-14hit
  • Parallelization of Computing-Intensive Tasks of the H.264 High Profile Decoding Algorithm on a Reconfigurable Multimedia System

    Tongsheng GENG  Leibo LIU  Shouyi YIN  Min ZHU  Shaojun WEI  

     
    PAPER

      Vol:
    E93-D No:12
      Page(s):
    3223-3231

    This paper proposes approaches to perform HW/SW (Hardware/Software) partition and parallelization of computing-intensive tasks of the H.264 HiP (High Profile) decoding algorithm on an embedded coarse-grained reconfigurable multimedia system, called REMUS (REconfigurable MUltimedia System). Several techniques, such as MB (Macro-Block) based parallelization, unfixed sub-block operation etc., are utilized to speed up the decoding process, satisfying the requirements of real-time and high quality H.264 applications. Tests show that the execution performance of MC (Motion Compensation), deblocking, and IDCT-IQ (Inverse Discrete Cosine Transform-Inverse Quantization) on REMUS is improved by 60%, 73%, 88.5% in the typical case and 60%, 69%, 88.5% in the worst case, respectively compared with that on XPP PACT (a commercial reconfigurable processor). Compared with ASIC solutions, the performance of MC is improved by 70%, 74% in the typical and in the worst case, respectively, while those of Deblocking remain the same. As for IDCT_IQ, the performance is improved by 17% no matter in the typical or worst case. Relying on the proposed techniques, 1080p@30 fps of H.264 HiP@ Level 4 decoding could be achieved on REMUS when utilizing a 200 MHz working frequency.

  • An Implementation of Multiple-Standard Video Decoder on a Mixed-Grained Reconfigurable Computing Platform

    Leibo LIU  Dong WANG  Yingjie CHEN  Min ZHU  Shouyi YIN  Shaojun WEI  

     
    PAPER-Computer System

      Pubricized:
    2016/02/02
      Vol:
    E99-D No:5
      Page(s):
    1285-1295

    This paper presents the design of a multiple-standard 1080 high definition (HD) video decoder on a mixed-grained reconfigurable computing platform integrating coarse-grained reconfigurable processing units (RPUs) and FPGAs. The proposed RPU, including 16×16 multi-functional processing elements (PEs), is used to accelerate compute-intensive tasks in the video decoding. A soft-core-based microprocessor array is implemented on the FPGA and adopted to speed-up the dynamic reconfiguration of the RPU. Furthermore, a mail-box-based communication scheme is utilized to improve the communication efficiency between RPUs and FPGAs. By exploiting dynamic reconfiguration of the RPUs and static reconfiguration of the FPGAs, the proposed platform achieves scalable performances and cost trade-offs to support a variety of video coding standards, including MPEG-2, AVS, H.264, and HEVC. The measured results show that the proposed platform can support H.264 1080 HD video streams at up to 57 frames per second (fps) and HEVC 1080 HD video streams at up to 52fps under 250MHz, at the same time, it achieves a 3.6× performance gain over an industrial coarse-grained reconfigurable processor for H.264 decoding, and a 6.43× performance boosts over a general purpose processor based implementation for HEVC decoding.

  • Hardware Software Co-design of H.264 Baseline Encoder on Coarse-Grained Dynamically Reconfigurable Computing System-on-Chip

    Hung K. NGUYEN  Peng CAO  Xue-Xiang WANG  Jun YANG  Longxing SHI  Min ZHU  Leibo LIU  Shaojun WEI  

     
    PAPER-Computer System

      Vol:
    E96-D No:3
      Page(s):
    601-615

    REMUS-II (REconfigurable MUltimedia System 2) is a coarse-grained dynamically reconfigurable computing system for multimedia and communication baseband processing. This paper proposes a real-time H.264 baseline profile encoder on REMUS-II. First, we propose an overall mapping flow for mapping algorithms onto the platform of REMUS-II system and then illustrate it by implementing the H.264 encoder. Second, parallel and pipelining techniques are considered for fully exploiting the abundant computing resources of REMUS-II, thus increasing total computing throughput and solving high computational complexity of H.264 encoder. Besides, some data-reuse schemes are also used to increase data-reuse ratio and therefore reduce the required data bandwidth. Third, we propose a scheduling scheme to manage run-time reconfiguration of the system. The scheduling is also responsible for synchronizing the data communication between tasks and handling conflict between hardware resources. Experimental results prove that the REMUS-MB (REMUS-II version for mobile applications) system can perform a real-time H.264/AVC baseline profile encoder. The encoder can encode CIF@30 fps video sequences with two reference frames and maximum search range of [-16,15]. The implementation, thereby, can be applied to handheld devices targeted at mobile multimedia applications. The platform of REMUS-MB system is designed and synthesized by using TSMC 65 nm low power technology. The die size of REMUS-MB is 13.97 mm2. REMUS-MB consumes, on average, about 100 mW while working at 166 MHz. To my knowledge, in the literature this is the first implementation of H.264 encoding algorithm on a coarse-grained dynamically reconfigurable computing system.

  • Reducing the Inaccuracy Caused by Inappropriate Time Window in Probabilistic Fault Localization

    Jianxin LIAO  Cheng ZHANG  Tonghong LI  Xiaomin ZHU  

     
    PAPER-Network Management/Operation

      Vol:
    E94-B No:1
      Page(s):
    128-138

    To reduce the inaccuracy caused by inappropriate time window, we propose two probabilistic fault localization schemes based on the idea of "extending time window." The global window extension algorithm (GWE) uses a window extension strategy for all candidate faults, while the on-demand window extension algorithm (OWE) uses the extended window only for a small set of faults when necessary. Both algorithms can increase the metric values of actual faults and thus improve the accuracy of fault localization. Simulation results show that both schemes perform better than existing algorithms. Furthermore, OWE performs better than GWE at the cost of a bit more computing time.

  • Date Flow Optimization of Dynamically Coarse Grain Reconfigurable Architecture for Multimedia Applications

    Xinning LIU  Chen MEI  Peng CAO  Min ZHU  Longxing SHI  

     
    PAPER-Design Methodology

      Vol:
    E95-D No:2
      Page(s):
    374-382

    This paper proposes a novel sub-architecture to optimize the data flow of REMUS-II (REconfigurable MUltimedia System 2), a dynamically coarse grain reconfigurable architecture. REMUS-II consists of a µPU (Micro-Processor Unit) and two RPUs (Reconfigurable Processor Unit), which are used to speeds up control-intensive tasks and data-intensive tasks respectively. The parallel computing capability and flexibility of REMUS-II makes itself an excellent candidate to process multimedia applications, which require a large amount of memory accesses. In this paper, we specifically optimize the data flow to deal with those performance-hazard and energy-hungry memory accessing in order to meet the bandwidth requirement of parallel computing. The RPU internal memory could work in multiple modes, like 2D-access mode and transformation mode, according to different multimedia access patterns. This novel design can improve the performance up to 26% compared to traditional on-chip memory. Meanwhile, the block buffer is implemented to optimize the off-chip data flow through reducing off-chip memory accesses, which reducing up to 43% compared to direct DDR access. Based on RTL simulation, REMUS-II can achieve 1080p@30 fps of H.264 High Profile@ Level 4 and High Level MPEG2 at 200 MHz clock frequency. The REMUS-II is implemented into 23.7 mm2 silicon on TSMC 65 nm logic process with a 400 MHz maximum working frequency.

  • Reconfiguration Process Optimization of Dynamically Coarse Grain Reconfigurable Architecture for Multimedia Applications

    Bo LIU  Peng CAO  Min ZHU  Jun YANG  Leibo LIU  Shaojun WEI  Longxing SHI  

     
    PAPER-Computer System

      Vol:
    E95-D No:7
      Page(s):
    1858-1871

    This paper presents a novel architecture design to optimize the reconfiguration process of a coarse-grained reconfigurable architecture (CGRA) called Reconfigurable Multimedia System II ( REMUS-II ). In REMUS-II, the tasks in multi-media applications are divided into two parts: computing-intensive tasks and control-intensive tasks. Two Reconfigurable Processor Units (RPUs) for accelerating computing-intensive tasks and a Micro-Processor Unit (µPU) for accelerating control-intensive tasks are contained in REMUS-II. As a large-scale CGRA, REMUS-II can provide satisfying solutions in terms of both efficiency and flexibility. This feature makes REMUS-II well-suited for video processing, where higher flexibility requirements are posed and a lot of computation tasks are involved. To meet the high requirement of the dynamic reconfiguration performance for multimedia applications, the reconfiguration architecture of REMUS-II should be well designed. To optimize the reconfiguration architecture of REMUS-II, a hierarchical configuration storage structure and a 3-stage reconfiguration processing structure are proposed. Furthermore, several optimization methods for configuration reusing are also introduced, to further improve the performance of reconfiguration process. The optimization methods include two aspects: the multi-target reconfiguration method and the configuration caching strategies. Experimental results showed that, with the reconfiguration architecture proposed, the performance of reconfiguration process will be improved by 4 times. Based on RTL simulation, REMUS-II can support the 1080p@32 fps of H.264 HiP@Level4 and 1080p@40 fps High-level MPEG-2 stream decoding at the clock frequency of 200 MHz. The proposed REMUS-II system has been implemented on a TSMC 65 nm process. The die size is 23.7 mm2 and the estimated on-chip dynamic power is 620 mW.

  • Blind Preprocessing of Multichannel Feedforward ANC in Frequency Domain

    Min ZHU  Huigang WANG  Guoyue CHEN  Kenji MUTO  

     
    LETTER-Noise and Vibration

      Vol:
    E95-A No:9
      Page(s):
    1615-1618

    It is shown that simple preprocessing on the reference signals in multichannel feedforward ANC system can improve the convergence performance of the adaptive ANC algorithm. A fast and efficient blind preprocessing algorithm in frequency domain is proposed to reduce the computational complexity even that the reference sensors are located far from the noise sources. The permutation problem at different frequency bin is also addressed and solved by an independent vector analysis algorithm. The basic principle and performance comparison are given to verify our conclusion.

  • Multipath Probing and Grouping in Multihomed Networks

    Jianxin LIAO  Jingyu WANG  Tonghong LI  Xiaomin ZHU  

     
    LETTER-Information Network

      Vol:
    E94-D No:3
      Page(s):
    710-713

    We propose a novel probing scheme capable of discovering shared bottlenecks among multiple paths between two multihomed hosts simultaneously, without any specific help from the network routers, and a subsequent grouping approach for partitioning these paths into groups. Simulation results show that the probing and grouping have an excellent performance under different network conditions.

  • A Power Analysis Attack Countermeasure Based on Random Data Path Execution For CGRA

    Wei GE  Shenghua CHEN  Benyu LIU  Min ZHU  Bo LIU  

     
    PAPER-Computer System

      Pubricized:
    2020/02/10
      Vol:
    E103-D No:5
      Page(s):
    1013-1022

    Side-channel Attack, such as simple power analysis and differential power analysis (DPA), is an efficient method to gather the key, which challenges the security of crypto chips. Side-channel Attack logs the power trace of the crypto chip and speculates the key by statistical analysis. To reduce the threat of power analysis attack, an innovative method based on random execution and register randomization is proposed in this paper. In order to enhance ability against DPA, the method disorders the correspondence between power trace and operands by scrambling the data execution sequence randomly and dynamically and randomize the data operation path to randomize the registers that store intermediate data. Experiments and verification are done on the Sakura-G FPGA platform. The results show that the key is not revealed after even 2 million power traces by adopting the proposed method and only 7.23% slices overhead and 3.4% throughput rate cost is introduced. Compared to unprotected chip, it increases more than 4000× measure to disclosure.

  • Fast AdaBoost-Based Face Detection System on a Dynamically Coarse Grain Reconfigurable Architecture

    Jian XIAO  Jinguo ZHANG  Min ZHU  Jun YANG  Longxing SHI  

     
    PAPER-Application

      Vol:
    E95-D No:2
      Page(s):
    392-402

    An AdaBoost-based face detection system is proposed, on a Coarse Grain Reconfigurable Architecture (CGRA) named “REMUS-II”. Our work is quite distinguished from previous ones in three aspects. First, a new hardware-software partition method is proposed and the whole face detection system is divided into several parallel tasks implemented on two Reconfigurable Processing Units (RPU) and one micro Processors Unit (µPU) according to their relationships. These tasks communicate with each other by a mailbox mechanism. Second, a strong classifier is treated as a smallest phase of the detection system, and every phase needs to be executed by these tasks in order. A phase of Haar classifier is dynamically mapped onto a Reconfigurable Cell Array (RCA) only when needed, and it's quite different from traditional Field Programmable Gate Array (FPGA) methods in which all the classifiers are fabricated statically. Third, optimized data and configuration word pre-fetch mechanisms are employed to improve the whole system performance. Implementation results show that our approach under 200 MHz clock rate can process up-to 17 frames per second on VGA size images, and the detection rate is over 95%. Our system consumes 194 mW, and the die size of fabricated chip is 23 mm2 using TSMC 65 nm standard cell based technology. To the best of our knowledge, this work is the first implementation of the cascade Haar classifier algorithm on a dynamically CGRA platform presented in the literature.

  • The Organization of On-Chip Data Memory in One Coarse-Grained Reconfigurable Architecture

    Yansheng WANG  Leibo LIU  Shouyi YIN  Min ZHU  Peng CAO  Jun YANG  Shaojun WEI  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E96-A No:11
      Page(s):
    2218-2229

    RCP (Reconfigurable Computing Processor) is intended to fill the gap between ASIC and GPP (General Purpose processor), which achieves much higher energy efficiency than GPP, while is much more flexible than ASIC. In this paper, one organization of on-chip data memory called LIBODM (LIfetime Based On-chip Data Memory) is proposed to reduce the reference delay for data and on-chip data memory size in RCP. In the LIBODM, the allocation of data is based on the data dependency. The data with low data dependency are stored off-chip to save the storage costs, while the data with high data dependency are stored on-chip to reduce the reference delay. Besides, in the LIBODM, the on-chip data are classified into two types, and the classification is based on the lifetime of data. For short lifetime data, they are preferred to be stored into FIFO to increase the reuse ratio of memory space naturally. For long lifetime data, they are preferred to be stored into RAM for several time references. The LIBODM has been testified in one CGRA (Coarse Grained Reconfigurable Architecture) called RPU (Reconfigurable Processing Unit), and two RPUs has been integrated in a RCP-REMUS_HP (High Performance version of Reconfigurable MUlti-media System) focused on video decoding. Thanks to the LIBODM, although the size of on-chip data memory in REMUS_HP is small, a high performance can still be achieved. Compared with XPP and ADRES, in REMUS_HP, the on-chip data memory size at same performance level is only 23.9% and 14.8%. REMUS_HP is implemented on a 48.9mm2 silicon with TSMC 65nm technology. Simulation shows that 1920*1088 @30fps can be achieved for H.264 high-profile decoding when exploiting a 200MHz working frequency. Compared with the high performance version of XPP, the performance is 150% boosted, while the energy efficiency is 17.59x boosted.

  • A Cycle-Accurate Simulator for a Reconfigurable Multi-Media System

    Min ZHU  Leibo LIU  Shouyi YIN  Chongyong YIN  Shaojun WEI  

     
    PAPER

      Vol:
    E93-D No:12
      Page(s):
    3202-3210

    This paper introduces a cycle-accurate Simulator for a dynamically REconfigurable MUlti-media System, called SimREMUS. SimREMUS can either be used at transaction-level, which allows the modeling and simulation of higher-level hardware and embedded software, or at register transfer level, if the dynamic system behavior is desired to be observed at signal level. Trade-offs among a set of criteria that are frequently used to characterize the design of a reconfigurable computing system, such as granularity, programmability, configurability as well as architecture of processing elements and route modules etc., can be quickly evaluated. Moreover, a complete tool chain for SimREMUS, including compiler and debugger, is developed. SimREMUS could simulate 270 k cycles per second for million gates SoC (System-on-a-Chip) and produced one H.264 1080p frame in 15 minutes, which might cost days on VCS (platform: CPU: E5200@ 2.5 Ghz, RAM: 2.0 GB). Simulation showed that 1080p@30 fps of H.264 High Profile@ Level 4 can be achieved when exploiting a 200 MHz working frequency on the VLSI architecture of REMUS.

  • Adaptive Handoff with Dynamic Hysteresis Value Using Distance Information in Cellular Communications

    Huamin ZHU  Kyung Sup KWAK  

     
    LETTER

      Vol:
    E89-A No:7
      Page(s):
    1972-1975

    In this study, we propose an adaptive handoff scheme with dynamic hysteresis value for cellular communications, which is based on distance between the mobile station and the serving base station. Performance is evaluated in terms of the expected number of handoffs, the expected handoff delay, standard deviation of handoff location, and the expected link degradation probability as well. Numerical results and simulations show that the proposed scheme outperforms the handoff schemes with static hysteresis levels. The effect of distance error is also discussed.

  • Configuration Context Reduction for Coarse-Grained Reconfigurable Architecture

    Shouyi YIN  Chongyong YIN  Leibo LIU  Min ZHU  Shaojun WEI  

     
    PAPER-Design Methodology

      Vol:
    E95-D No:2
      Page(s):
    335-344

    Coarse-grained reconfigurable architecture (CGRA) combines the performance of application-specific integrated circuits (ASICs) and the flexibility of general-purpose processors (GPPs), which is a promising solution for embedded systems. With the increasing complexity of reconfigurable resources (processing elements, routing cells, I/O blocks, etc.), the reconfiguration cost is becoming the performance bottleneck. The major reconfiguration cost comes from the frequent memory-read/write operations for transferring the configuration context from main memory to context buffer. To improve the overall performance, it is critical to reduce the amount of configuration context. In this paper, we propose a configuration context reduction method for CGRA. The proposed method exploits the structure correlation of computation tasks that are mapped onto CGRA and reduce the redundancies in configuration context. Experimental results show that the proposed method can averagely reduce the configuration context size up to 71% and speed up the execution up to 68%. The proposed method does not depend on any architectural feature and can be applied to CGRA with an arbitrary architecture.