Author Search Result

[Author] Heming SUN(7hit)

1-7hit
  • Accelerating HEVC Inter Prediction with Improved Merge Mode Handling

    Zhengxue CHENG  Heming SUN  Dajiang ZHOU  Shinji KIMURA  

     
    PAPER-VIDEO CODING

    Vol: E100-A No:2
    Page(s): 546-554

    High Efficiency Video Coding (HEVC/H.265) achieves a 50% bit rate reduction over the H.264/AVC standard at comparable quality, at the cost of high computational complexity. Merge mode is one of the most important new features introduced in HEVC's inter prediction. Merge mode and traditional inter mode together consume about 90% of the total encoding time. To address this high complexity, this paper uses the merge mode to accelerate inter prediction through four strategies. 1) A merge candidate decision based on the sum of absolute transformed differences (SATD) cost is proposed. 2) An early merge termination is presented with more than 90% accuracy. 3) Due to the compensation effect of merge candidates, symmetric motion partition (SMP) mode is disabled for non-8×8 coding units (CUs). 4) A fast coding unit filtering strategy is proposed to reduce the number of CUs that need fine processing. Experimental results demonstrate that our fast strategies achieve a 35.4%-58.7% time reduction with a 0.68%-1.96% BD-rate increase in the random-access (RA) case. Compared with similar works, the proposed strategies are not only among the best performing in average-case complexity reduction, but also notably better in the worst cases.
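
As a rough illustration of an SATD-based candidate decision, the sketch below computes an unnormalized 4×4 Hadamard SATD and picks the candidate prediction with the lowest cost. The function names, the 4×4 block size, the missing normalization, and the plain minimum-cost rule are all assumptions made for illustration; the abstract does not specify the paper's actual cost model or decision rule.

```python
# Illustrative sketch only: names, block size, and decision rule are assumptions.
H4 = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

def matmul4(a, b):
    """Multiply two 4x4 integer matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd4x4(orig, pred):
    """Sum of absolute values of the Hadamard-transformed residual (unnormalized)."""
    diff = [[orig[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    h_t = [list(row) for row in zip(*H4)]          # transpose of H4
    transformed = matmul4(matmul4(H4, diff), h_t)  # H * diff * H^T
    return sum(abs(v) for row in transformed for v in row)

def best_merge_candidate(orig, candidate_preds):
    """Return (index, cost) of the candidate prediction with the lowest SATD."""
    costs = [satd4x4(orig, pred) for pred in candidate_preds]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]

orig_block = [[10] * 4 for _ in range(4)]
cand_off_by_one = [[9] * 4 for _ in range(4)]
cand_exact = [[10] * 4 for _ in range(4)]
best_index, best_cost = best_merge_candidate(orig_block, [cand_off_by_one, cand_exact])
```

A constant residual concentrates all of its energy in the DC coefficient, which is why SATD tends to track the post-transform coding cost more closely than a plain sum of absolute differences.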

  • A Low-Power VLSI Architecture for HEVC De-Quantization and Inverse Transform

    Heming SUN  Dajiang ZHOU  Shuping ZHANG  Shinji KIMURA  

     
    PAPER

    Vol: E99-A No:12
    Page(s): 2375-2387

    In this paper, we present a low-power system for the de-quantization and inverse transform of HEVC. First, we present a low-delay circuit to process the coded results of the syntax elements, and reduce the number of multipliers from 16 to 4 for the de-quantization of each 4×4 block. Second, we give two efficient data mapping schemes: one for the memory between de-quantization and inverse transform, and one for the transpose memory. Third, zero information is exploited throughout the whole system. For both memory parts, the write and read operations for zero blocks, rows, and coefficients can all be skipped to save power. The results show that up to 86% of the power consumption of the memory part can be saved under the “Random-access” configuration with common QPs. For the logic part, the proposed de-quantization architecture reduces area by 77%. Overall, our system supports real-time coding of 8K × 4K 120 fps video sequences, and the normalized area is reduced by 68% compared with the latest work.

  • A Low-Cost VLSI Architecture of Multiple-Size IDCT for H.265/HEVC

    Heming SUN  Dajiang ZHOU  Peilin LIU  Satoshi GOTO  

     
    PAPER-High-Level Synthesis and System-Level Design

    Vol: E97-A No:12
    Page(s): 2467-2476

    In this paper, we present an area-efficient 4/8/16/32-point inverse discrete cosine transform (IDCT) architecture for an HEVC decoder. Compared with previous work, this work reduces the hardware cost in two ways. First, we reduce the logic cost of the 1D IDCT by proposing a reordered parallel-in serial-out (RPISO) scheme. By using the RPISO scheme, we can reduce the required calculations for butterfly inputs in each cycle. Second, we reduce the area of the transpose architecture by proposing a cyclic data mapping scheme that achieves 100% I/O utilization of each SRAM. To design a fully pipelined 2D IDCT architecture, we propose a pipelining schedule for the row and column transforms. The results show that the area of the logic IDCT part, normalized by maximum throughput, can be reduced by 25%, and the memory area can be reduced by 62%. The maximum throughput reaches 1248 Mpixels/s, which supports real-time decoding of a 4K × 2K 60 fps video sequence.
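
For readers unfamiliar with the butterfly structure mentioned above, the sketch below shows a 4-point inverse transform using the HEVC integer core transform coefficients, computed both as a direct matrix product and as an even/odd butterfly. It is not the RPISO scheme from the paper, which the abstract does not describe in detail. The function names are hypothetical, and the rounding and shift stages of a real decoder are deliberately omitted.

```python
# Illustrative sketch only: plain 4-point butterfly, no rounding/shift stages.
HEVC_DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]

def idct4_direct(c):
    """Inverse transform as a matrix product: out[n] = sum_k M[k][n] * c[k]."""
    return [sum(HEVC_DCT4[k][n] * c[k] for k in range(4)) for n in range(4)]

def idct4_butterfly(c):
    """Same result via an even/odd split, using 6 multiplications instead of 16."""
    even0 = 64 * (c[0] + c[2])
    even1 = 64 * (c[0] - c[2])
    odd0 = 83 * c[1] + 36 * c[3]
    odd1 = 36 * c[1] - 83 * c[3]
    return [even0 + odd0, even1 + odd1, even1 - odd1, even0 - odd0]

coeffs = [12, -3, 5, 7]
direct_out = idct4_direct(coeffs)
butterfly_out = idct4_butterfly(coeffs)
```

The even part of an N-point butterfly is itself an N/2-point transform, which is what makes it possible to share hardware across the 4/8/16/32-point sizes.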

  • Human Detection Method Based on Non-Redundant Gradient Semantic Local Binary Patterns

    Jiu XU  Ning JIANG  Wenxin YU  Heming SUN  Satoshi GOTO  

     
    PAPER

    Vol: E98-A No:8
    Page(s): 1735-1742

    In this paper, a feature named Non-Redundant Gradient Semantic Local Binary Patterns (NRGSLBP) is proposed for human detection as a modified version of the conventional Semantic Local Binary Patterns (SLBP). This feature is calculated for both the intensity image and the gradient magnitude image so that texture and gradient information are combined. Moreover, to the best of our knowledge, non-redundant patterns are adopted on SLBP for the first time, allowing better discrimination. Compared with SLBP, NRGSLBP requires no additional feature dimensions, and its calculation complexity is considerably smaller than that of other features. Experimental results on several datasets show that the detection rate of our proposed feature outperforms those of other features such as Histogram of Oriented Gradients (HOG), Histogram of Templates (HOT), Bidirectional Local Template Patterns (BLTP), Gradient Local Binary Patterns (GLBP), SLBP and Covariance matrix (COV).
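
As background for the feature family discussed above, the sketch below computes a conventional 8-neighbor local binary pattern code for the center of a 3×3 patch, and then folds each code together with its bitwise complement, which is the commonly cited definition of a non-redundant LBP. This is not the NRGSLBP feature itself: the semantic grouping and the gradient-magnitude channel are not reproduced, and the function names, neighbor ordering, and `>=` comparison are assumptions.

```python
# Illustrative sketch only: basic LBP plus complement folding, not NRGSLBP.
# Neighbor offsets in clockwise order, starting at the top-left pixel.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(patch):
    """8-bit LBP code of the center pixel of a 3x3 patch (nested lists)."""
    center = patch[1][1]
    code = 0
    for bit, (dy, dx) in enumerate(NEIGHBORS):
        if patch[1 + dy][1 + dx] >= center:
            code |= 1 << bit
    return code

def non_redundant(code, bits=8):
    """Map a code and its complement to the same value: min(code, 2^bits - 1 - code)."""
    return min(code, (1 << bits) - 1 - code)

dark_on_bright = [[9, 9, 9], [9, 5, 9], [9, 9, 9]]
bright_on_dark = [[1, 1, 1], [1, 5, 1], [1, 1, 1]]
code_a = lbp_code(dark_on_bright)
code_b = lbp_code(bright_on_dark)
```

Folding complements makes the descriptor insensitive to whether an object is darker or brighter than its background, and it halves the number of distinct codes.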

  • Design of Low-Cost Approximate Multipliers Based on Probability-Driven Inexact Compressors

    Yi GUO  Heming SUN  Ping LEI  Shinji KIMURA  

     
    PAPER

    Vol: E102-A No:12
    Page(s): 1781-1791

    Approximate computing has emerged as a promising approach for error-tolerant applications to improve hardware performance at the cost of some loss of accuracy. Multiplication is a key arithmetic operation in these applications. In this paper, we propose a low-cost approximate multiplier design that employs new probability-driven inexact compressors. This compressor design is introduced to reduce the height of the partial product matrix to two rows, based on the probability distribution of the sum of the partial products. To compensate for the accuracy loss of the multiplier, a grouped error recovery scheme is proposed that achieves different levels of accuracy. In terms of mean relative error distance (MRED), the accuracy losses of the proposed multipliers range from 1.07% to 7.86%. Compared with the Wallace multiplier in a 40 nm process, the most accurate variant of the proposed multipliers reduces power by 59.75% and area by 42.47%. The critical path delay is reduced by more than 12.78%. The proposed multiplier design has a better accuracy-performance trade-off than other designs with comparable accuracy. In addition, the efficiency of the proposed multiplier design is assessed in an image processing application.
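
To make the MRED metric concrete, the sketch below evaluates it exhaustively for an 8-bit multiplier, assuming the usual definition: the mean of |approximate - exact| / exact over all operand pairs with a nonzero exact product. The approximate multiplier here is a simple operand-truncation stand-in chosen only for illustration; it is not the compressor-based design from the paper, and all names are hypothetical.

```python
# Illustrative sketch only: the truncating multiplier is a stand-in, not the paper's design.
def truncated_multiply(a, b, k):
    """Hypothetical approximate multiplier: clear the k low bits of each operand."""
    mask = ~((1 << k) - 1)
    return (a & mask) * (b & mask)

def mred(approx_mul, bits=8):
    """Mean relative error distance over all operand pairs with a nonzero product."""
    total = 0.0
    count = 0
    for a in range(1, 1 << bits):
        for b in range(1, 1 << bits):
            exact = a * b
            total += abs(approx_mul(a, b) - exact) / exact
            count += 1
    return total / count

exact_mred = mred(lambda a, b: a * b)
approx_mred = mred(lambda a, b: truncated_multiply(a, b, 2))
```

Because each error is divided by the exact product, MRED weights mistakes on small operands heavily, so a design can have a low mean error distance and still show a noticeable MRED.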

  • Approximate FPGA-Based Multipliers Using Carry-Inexact Elementary Modules

    Yi GUO  Heming SUN  Ping LEI  Shinji KIMURA  

     
    PAPER

    Vol: E103-A No:9
    Page(s): 1054-1062

    Approximate multiplier design is an effective technique to improve hardware performance at the cost of accuracy loss. Current approximate multipliers are mostly ASIC-based and dedicated to one particular application. In contrast, the FPGA has been an attractive choice for many applications because of its high performance, reconfigurability, and fast development cycle. This paper presents a novel methodology for designing approximate multipliers by employing FPGA-based fabrics (primarily look-up tables and carry chains). The area and latency are significantly reduced by applying approximation to the carry results and cutting the carry propagation path in the multiplier. Moreover, we explore the architectural space of higher-order multipliers by using our proposed small-size approximate multipliers as elementary modules. For different accuracy-hardware requirements, eight configurations of an approximate 8×8 multiplier are discussed. In terms of mean relative error distance (MRED), the error of the proposed 8×8 multiplier is as low as 1.06%. Compared with the exact multiplier, our proposed design reduces area by 43.66% and power by 24.24%. The critical path latency is reduced by up to 29.50%. The proposed multiplier design has a better accuracy-hardware trade-off than other designs with comparable accuracy. Moreover, image sharpening is used to assess the efficiency of the approximate multipliers in an application.
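
The idea of building a higher-order multiplier from small elementary modules can be sketched with the standard split of an 8×8 product into four 4×4 partial products. The module size, the function names, and the sample inexact module below are assumptions for illustration; the abstract does not state which module sizes or carry approximations the paper uses.

```python
# Illustrative sketch only: module size and the inexact module are assumptions.
def exact_4x4(a, b):
    """Exact 4-bit by 4-bit elementary multiplier."""
    return a * b

def drop_lsb_4x4(a, b):
    """Hypothetical inexact module: the exact product with its lowest bit cleared."""
    return (a * b) & ~1

def compose_8x8(a, b, mul4=exact_4x4):
    """Build an 8x8 product from four 4x4 module calls and shifted additions."""
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    return ((mul4(a_hi, b_hi) << 8)
            + ((mul4(a_hi, b_lo) + mul4(a_lo, b_hi)) << 4)
            + mul4(a_lo, b_lo))

exact_product = compose_8x8(200, 123)
approx_product = compose_8x8(200, 123, mul4=drop_lsb_4x4)
```

Different accuracy-hardware configurations then correspond to choosing which of the four module positions use an inexact module, since errors in the high-order position are weighted far more heavily than errors in the low-order one.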

  • Fast Prediction Unit Selection and Mode Selection for HEVC Intra Prediction

    Heming SUN  Dajiang ZHOU  Peilin LIU  Satoshi GOTO  

     
    PAPER

    Vol: E97-A No:2
    Page(s): 510-519

    As a next-generation video compression standard, High Efficiency Video Coding (HEVC) achieves enhanced coding performance relative to prior standards such as H.264/AVC. In the new standard, the improved intra prediction plays an important role in bit rate saving. Meanwhile, it also involves significantly increased complexity, due to the adoption of a highly flexible coding unit structure and a large number of angular prediction modes. In this paper, we present a low-complexity intra prediction algorithm for HEVC. We first propose a fast preprocessing stage based on a simplified cost model. Based on its results, a fast prediction unit selection scheme reduces the number of prediction unit (PU) levels that require fine processing from 5 to 2. To supply the PU size decision with appropriate thresholds, a fast training method is also designed. Also based on the preprocessing results, an efficient mode selection scheme reduces the maximum number of angular modes to evaluate from 35 to 8. This achieves further algorithm acceleration by eliminating the need to perform fine Hadamard cost calculation. We also propose a 32×32 PU compensation scheme to alleviate the mismatch of cost functions for large transform units, which effectively improves coding performance for high-resolution sequences. In comparison with HM 7.0, the proposed algorithm achieves over 50% complexity reduction in terms of encoding time, with a corresponding bit rate increase lower than 2.0%. Moreover, the achieved complexity reduction is relatively stable and independent of sequence characteristics.