Keyword Search Result

[Keyword] h.264(137hit)

41-60hit(137hit)

  • A 530 Mpixels/s Intra Prediction Architecture for Ultra High Definition H.264/AVC Encoder

    Gang HE  Dajiang ZHOU  Jinjia ZHOU  Tianruo ZHANG  Satoshi GOTO  

     
    PAPER

      Vol:
    E94-C No:4
      Page(s):
    419-427

    Intra coding in H.264/AVC significantly enhances video compression efficiency. However, because of the high data dependency of intra prediction in H.264, both pipelining and parallel processing are difficult to apply. Moreover, it is difficult to achieve high hardware utilization and throughput because of the long block/MB-level reconstruction loops. This paper proposes a high-performance intra prediction architecture that supports the H.264/AVC high profile. The proposed MB/block co-reordering avoids data dependency and improves pipeline utilization. Therefore, the timing constraint of real-time 4096×2160 encoding can be met with negligible quality loss. A 16×16 prediction engine and an 8×8 prediction engine work in parallel for prediction and coefficient generation. A reordering interlaced reconstruction is also designed for the fully pipelined architecture. It takes only 160 cycles to process one macroblock (MB). Hardware utilization of the prediction and reconstruction modules is almost 100%. Furthermore, a PE-reusable 8×8 intra predictor and a hybrid SAD & SATD mode decision are proposed to save hardware cost. The design is implemented in 90 nm CMOS technology with 113.2 k gates and can encode 4096×2160 video sequences at 60 fps at an operating frequency of 332 MHz.
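
The figures in this abstract can be cross-checked with a little arithmetic. The sketch below is not taken from the paper: it assumes 16×16 macroblocks, a 4096×2160 frame at 60 fps, and the stated 160 cycles per macroblock, and the function names are hypothetical. Under those assumptions it reproduces the pixel rate in the title (about 530 Mpixels/s) and the clock frequency in the abstract (about 332 MHz).

```python
def macroblocks_per_second(width, height, fps, mb_size=16):
    """Number of macroblocks that must be processed each second."""
    mbs_per_row = -(-width // mb_size)   # ceiling division
    mbs_per_col = -(-height // mb_size)
    return mbs_per_row * mbs_per_col * fps


def required_clock_hz(width, height, fps, cycles_per_mb):
    """Minimum clock rate for real-time operation at a fixed cycle budget per MB."""
    return macroblocks_per_second(width, height, fps) * cycles_per_mb


# Assumed operating point: 4096x2160 at 60 fps, 160 cycles per macroblock.
pixel_rate = 4096 * 2160 * 60
clock = required_clock_hz(4096, 2160, 60, 160)

print(f"pixel rate: {pixel_rate / 1e6:.1f} Mpixels/s")
print(f"required clock: {clock / 1e6:.1f} MHz")
```

The same helper applies to the other cycles-per-MB figures in this listing, for example `required_clock_hz(3840, 2160, 60, 64)` for a 64 cycles/MB design.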

  • A Novel Low-Cost High-Throughput CAVLC Decoder for H.264/AVC

    Kyu-Yeul WANG  Byung-Soo KIM  Sang-Seol LEE  Dong-Sun KIM  Duck-Jin CHUNG  

     
    PAPER-Image Processing and Video Processing

      Vol:
    E94-D No:4
      Page(s):
    895-904

    This paper presents a novel low-cost, high-performance CAVLC decoder for H.264/AVC. The proposed CAVLC decoder generates the lengths of the coeff_token and total_zeros symbols with simple arithmetic operations, so it can be implemented with a reduced look-up table. We also propose a multi-symbol run_before decoder with enhanced throughput. It can decode more than 2.5 symbols per cycle if there are run_before symbols to be decoded. The hardware cost is about 12 K gates when synthesized at 125 MHz.

  • An All-Zero Block Mode Decision Algorithm for H.264/AVC Optimization

    Chaoke PEI  Li GAO  Donghui WANG  Chaohuan HOU  

     
    LETTER-Image Processing and Video Processing

      Vol:
    E94-D No:2
      Page(s):
    384-387

    The H.264/AVC standard achieves significantly higher coding efficiency if multiple-block-size motion estimation is adopted. However, the complexity of motion estimation and DCT increases dramatically as a result. In previous work, we proposed an early mode decision algorithm to control the complexity, based on all-zero-block detection at the 16×16 size. In this paper, we improve the algorithm. First, we propose detecting all-zero blocks at the 16×16, 8×8 and 4×4 sizes to simplify the mode decision process. Second, we define the thresholds used to terminate motion estimation and mode decision early for these sizes. Last, we present the whole proposed algorithm. Experiments show that about 77% of the encoding time and 85% of the motion estimation time can be saved on average, which is better than state-of-the-art approaches.

  • Parallelization of Computing-Intensive Tasks of the H.264 High Profile Decoding Algorithm on a Reconfigurable Multimedia System

    Tongsheng GENG  Leibo LIU  Shouyi YIN  Min ZHU  Shaojun WEI  

     
    PAPER

      Vol:
    E93-D No:12
      Page(s):
    3223-3231

    This paper proposes approaches to perform HW/SW (Hardware/Software) partitioning and parallelization of computing-intensive tasks of the H.264 HiP (High Profile) decoding algorithm on an embedded coarse-grained reconfigurable multimedia system called REMUS (REconfigurable MUltimedia System). Several techniques, such as MB (Macro-Block) based parallelization and unfixed sub-block operation, are utilized to speed up the decoding process, satisfying the requirements of real-time, high-quality H.264 applications. Tests show that the execution performance of MC (Motion Compensation), deblocking, and IDCT-IQ (Inverse Discrete Cosine Transform-Inverse Quantization) on REMUS is improved by 60%, 73%, and 88.5% in the typical case and by 60%, 69%, and 88.5% in the worst case, respectively, compared with that on XPP PACT (a commercial reconfigurable processor). Compared with ASIC solutions, the performance of MC is improved by 70% and 74% in the typical and worst cases, respectively, while that of deblocking remains the same. As for IDCT-IQ, the performance is improved by 17% in both the typical and worst cases. Relying on the proposed techniques, 1080p@30 fps H.264 HiP@Level 4 decoding can be achieved on REMUS at a 200 MHz working frequency.

  • Architecture and Circuit Optimization of Hardwired Integer Motion Estimation Engine for H.264/AVC

    Zhenyu LIU  Dongsheng WANG  Takeshi IKENAGA  

     
    PAPER-Image Processing

      Vol:
    E93-A No:11
      Page(s):
    2065-2073

    Variable block size motion estimation, introduced by the latest video coding standard H.264/AVC, is an efficient approach to reducing temporal redundancy. The intensive computational complexity of the variable block size technique makes a hardwired accelerator essential for real-time applications. Propagate partial sums of absolute differences (Propagate Partial SAD) and SAD Tree hardwired engines outperform their counterparts, especially when the impact of supporting the variable block size technique is considered. In this paper, the authors apply architecture-level and circuit-level approaches to improve the maximum operating frequency and reduce the hardware overhead of Propagate Partial SAD and SAD Tree, while the other metrics of the original architectures (latency, memory bandwidth and hardware utilization) are maintained. Experiments demonstrate that, at a 110.8 MHz operating frequency, the proposed approaches save 14.7% and 18.0% of the gate count for Propagate Partial SAD and SAD Tree, respectively, compared with the original architectures. With TSMC 0.18 µm 1P6M CMOS technology, the proposed Propagate Partial SAD architecture achieves a 231.6 MHz operating frequency at a cost of 84.1 k gates. Correspondingly, the maximum operating frequency of the optimized SAD Tree architecture is improved to 204.8 MHz, almost twice that of the original, while its hardware overhead is merely 88.5 k gates.
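
As background for the SAD-based engines mentioned here, the following is a minimal software sketch of how SADs for several block sizes can share work. It is an illustration only, not the Propagate Partial SAD or SAD Tree hardware from the paper: it assumes a 16×16 macroblock stored as a list of rows, computes the sixteen 4×4 SADs once, and then merges them into 8×8 and 16×16 SADs. All function names are hypothetical.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid of SADs, one per 4x4 sub-block of a 16x16 macroblock."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def merge_sads(grid):
    """Reuse the 4x4 SADs to form the four 8x8 SADs and the 16x16 SAD."""
    sad_8x8 = [
        [
            grid[2 * by][2 * bx] + grid[2 * by][2 * bx + 1]
            + grid[2 * by + 1][2 * bx] + grid[2 * by + 1][2 * bx + 1]
            for bx in range(2)
        ]
        for by in range(2)
    ]
    sad_16x16 = sum(sum(row) for row in sad_8x8)
    return sad_8x8, sad_16x16


# Toy data: the reference block differs from the current block by 1 at every pixel.
cur = [[x + y for x in range(16)] for y in range(16)]
ref = [[x + y + 1 for x in range(16)] for y in range(16)]

grid = sad_4x4_grid(cur, ref)
sad_8x8, sad_16x16 = merge_sads(grid)
print(grid[0][0], sad_8x8[0][0], sad_16x16)
```

Because every larger partition is a sum of 4×4 results, the per-pixel absolute differences are computed only once per candidate position, which is the property that makes variable block sizes affordable.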

  • A High-Throughput Binary Arithmetic Coding Architecture for H.264/AVC CABAC

    Yizhong LIU  Tian SONG  Takashi SHIMAMOTO  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E93-A No:9
      Page(s):
    1594-1604

    In this paper, we propose a high-throughput binary arithmetic coding architecture for CABAC (Context Adaptive Binary Arithmetic Coding), which is one of the entropy coding tools used in the H.264/AVC main and high profiles. The full CABAC encoding functions, including binarization, context model selection, arithmetic encoding and bit generation, are implemented in this proposal. Binarization and context model selection are implemented in a proposed binarizer, in which a FIFO is used to pack the binarization results and output 4 bins per clock. Arithmetic encoding and bit generation are implemented in a four-stage pipeline with an encoding ability of 4 bins/clock. To improve the processing speed, the context variable access and update for 4 bins are parallelized and the pipeline path is balanced. Also, because of the outstanding bits issue, a bit packing and generation strategy for 4-bin parallel processing is proposed. Implemented in Verilog HDL and synthesized with Synopsys Design Compiler using 90 nm libraries, this proposal can work at a clock frequency of 250 MHz and takes up about 58 K standard cells, 3.2 Kbits of register files and 27.6 Kbits of ROM. A throughput of 1000 M bins per second can be achieved for HDTV applications.

  • Adaptive Zero-Coefficient Distribution Scan for Inter Block Mode Coding of H.264/AVC

    Jing-Xin WANG  Alvin W.Y. SU  

     
    PAPER-Image Processing and Video Processing

      Vol:
    E93-D No:8
      Page(s):
    2273-2280

    Scanning quantized transform coefficients is an important tool for video coding. For example, the MPEG-4 video coder adopts three different scans to get better coding efficiency. This paper proposes an adaptive zero-coefficient distribution scan in inter block coding. The proposed method attempts to improve H.264/AVC zero coefficient coding by modifying the scan operation. Since the zero-coefficient distribution is changed by the proposed scan method, new VLC tables for syntax elements used in context-adaptive variable length coding (CAVLC) are also provided. The savings in bit-rate range from 2.2% to 5.1% in the high bit-rate cases, depending on different test sequences.
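
For context, the sketch below shows the conventional 4×4 zig-zag scan that such adaptive scans modify. It is not the scan proposed in the paper. The scan table is the standard frame zig-zag order expressed as raster indices, and the example block and function name are made up for illustration.

```python
# Standard 4x4 zig-zag order, given as raster indices (row * 4 + col).
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def zigzag_scan(block):
    """Flatten a 4x4 block (list of 4 rows) into zig-zag scan order."""
    flat = [v for row in block for v in row]
    return [flat[i] for i in ZIGZAG_4X4]


# Hypothetical quantized block with energy concentrated at low frequencies.
block = [
    [7, 3, 0, 0],
    [2, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

scanned = zigzag_scan(block)
print(scanned)
```

The scan groups the nonzero coefficients at the front and the zeros at the back, which is what the zero-related syntax elements of CAVLC rely on. A different scan order changes that zero distribution, which is why the abstract pairs its scan with new VLC tables.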

  • A Bandwidth Optimized, 64 Cycles/MB Joint Parameter Decoder Architecture for Ultra High Definition H.264/AVC Applications

    Jinjia ZHOU  Dajiang ZHOU  Xun HE  Satoshi GOTO  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E93-A No:8
      Page(s):
    1425-1433

    In this paper, a VLSI architecture of a joint parameter decoder is proposed to realize the calculation of motion vector (MV), intra prediction mode (IPM) and boundary strength (BS) for ultra high definition H.264/AVC applications. For this architecture, a 64-cycle-per-MB pipeline with simplified control modes is designed to increase system throughput and reduce hardware cost. Moreover, in order to save memory bandwidth, the data, which includes the motion information for the co-located picture and the last decoded line, is pre-processed before being stored to DRAM. A partition based storage format is applied to condense the MB level data, while a variable length coding based compression method is utilized to reduce the data size in each partition. Experimental results show our design is capable of real-time 3840×2160@60 fps decoding at less than 133 MHz, with 37.2 k logic gates. Meanwhile, by applying the proposed scheme, 85-98% bandwidth saving is achieved, compared with storing the original information for every 4×4 block to DRAM.

  • Fast Intra Prediction Mode Decision for H.264/AVC

    Do QUAN  Yo-Sung HO  

     
    LETTER-Image Processing and Video Processing

      Vol:
    E93-D No:7
      Page(s):
    2012-2015

    In this letter, we present a simple but efficient intra prediction mode decision for H.264/AVC. Based on our investigation, the DC mode appears to be the superior prediction mode among the various candidates. We propose an intra-mode decision algorithm where the DC mode is chosen as a candidate for the best prediction mode. Experimental results show that, on average, the proposed algorithm saves 81.905% of the entire encoding time compared to the H.264 reference software, with a negligible reduction in peak signal-to-noise ratio (PSNR) and a slight increase in bitrate.

  • Constant Bit-Rate Multi-Stage Rate Control for Rate-Distortion Optimized H.264/AVC Encoders

    Shuijiong WU  Peilin LIU  Yiqing HUANG  Qin LIU  Takeshi IKENAGA  

     
    PAPER

      Vol:
    E93-D No:7
      Page(s):
    1716-1726

    An H.264/AVC encoder employs rate control to adaptively adjust the quantization parameter (QP) so that coded video can be transmitted over a constant bit-rate (CBR) channel. Here, bit allocation is crucial since it is directly related to actual bit generation and coding quality. Meanwhile, the rate-distortion-optimization (RDO) based mode-decision technique also strongly affects performance because of the close relation among mode, bits, and quality. This paper presents a multi-stage rate control scheme for R-D optimized H.264/AVC encoders under CBR video transmission. First, to enhance the precision of complexity estimation and bit allocation, a frequency-domain parameter named mean-absolute-transform-difference (MATD) is adopted to represent frame and macroblock (MB) residual complexity. Second, the MATD ratio is utilized to enhance the accuracy of frame layer bit prediction. Then, by considering the bit usage status of the whole sequence, a measurement combining forward and backward bit analysis is proposed to adjust the Lagrange multiplier λMODE on the frame layer to optimize the mode decision for all MBs within the current frame. In the next stage, bits are allocated on the MB layer by the proposed remaining complexity analysis. The computed QP is further adjusted according to the predicted MB texture bits. Simulation results show that the PSNR improvement is up to 1.13 dB with our algorithm, and the stress on output buffer control is also largely relieved compared with the recommended rate control in the H.264/AVC reference software JM13.2.

  • A Predictive Block Based DC Offset for H.264/AVC Video Coding

    Jie JIA  Daeil YOON  Hae Kwang KIM  

     
    LETTER-Digital Signal Processing

      Vol:
    E93-A No:5
      Page(s):
    976-980

    The video coding standard H.264/AVC employs transform coding to exploit spatial correlation in the inter picture prediction residue. This paper presents a block based DC offset that further exploits the correlation among spatially neighboring blocks and gives H.264/AVC enhanced coding efficiency. The proposed method applies a DC offset to the inter picture prediction residue and encodes the offset-compensated residual signal. The DC offset is derived from the reconstructed residue in neighboring blocks. No additional bits are required to represent the DC offset. Simulation results show that the proposed method yields an average bit rate reduction of 2.67% for high resolution video over the H.264 baseline profile.

  • Fast Intra Mode Decision Using DCT Coefficient Distribution in H.264/AVC

    Sung-Wook HONG  Yung-Lyul LEE  

     
    LETTER-Image

      Vol:
    E93-A No:3
      Page(s):
    660-663

    The rate-distortion optimization (RDO) method in the H.264/AVC encoder is an informative technology that improves the coding efficiency but increases the computational complexity. In this letter, a fast intra mode decision algorithm using the DCT (Discrete Cosine Transform) coefficient distribution is proposed to reduce the H.264 encoder complexity. The proposed method reduces the encoder complexity by 63.44% on average, while the coding efficiency is slightly decreased compared with the H.264/AVC encoder.

  • Highly Parallel Fractional Motion Estimation Engine for Super Hi-Vision 4k×4k@60 fps

    Yiqing HUANG  Takeshi IKENAGA  

     
    PAPER

      Vol:
    E93-C No:3
      Page(s):
    244-252

    A Super Hi-Vision (SHV) 4k×4k@60 fps fractional motion estimation (FME) engine is proposed in this paper. First, two complexity reduction schemes are proposed at the algorithm level. By analyzing the integer motion cost of the sub-blocks in each inter mode, the mode reduction based mode pre-filtering scheme achieves a 48% clock cycle saving compared with the previous algorithm. By further checking the motion cost of the search points around the best integer candidate, the motion cost oriented directional one-pass scheme provides a 50% clock cycle saving and a 36% reduction in the number of processing units (PU). Second, at the hardware level, two parallel schemes, namely 16-Pel processing and the MB-parallel scheme, are presented, which reduce the required clock frequency to only 145 MHz for SHV FME processing. Quarter sub-sampling is also adopted in our design, reducing the hardware cost of each PU by 75%. Third, a unified pixel block loading scheme is proposed. About 28.67% to 86.39% of pixels are reused and the related memory access is saved. Furthermore, we present a parity pixel organization scheme to resolve the memory access conflicts of the MB-parallel scheme. Using TSMC 0.18 µm technology under worst-case operating conditions (1.62 V, 125 °C), our FME engine achieves real-time processing for SHV 4k×4k@60 fps with 412 k gates of hardware.

  • An Improved Run_before Coding for H.264 CAVLC

    Jie JIA  Daeil YOON  Hae Kwang KIM  

     
    LETTER-Coding Theory

      Vol:
    E93-A No:2
      Page(s):
    561-564

    Context-based adaptive variable length coding (CAVLC) is an entropy coding scheme employed in H.264/AVC for transform coefficient compression. CAVLC encodes the levels of the nonzero-valued coefficients and then indicates their positions with run_before, which is the number of zeros preceding each nonzero coefficient in scan order. In H.264, run_before is coded using lookup tables selected by the number of zero-valued coefficients that have not yet been coded. This paper presents an improved run_before coding method that encodes run_before using tables that take both zero-valued and nonzero-valued coefficients into consideration. Simulation results show that the proposed method yields an average bit rate reduction of 4.40% for run_before coding over the H.264 baseline profile with an intra-only coding structure. This corresponds to a 0.52% saving in total bit rate on average.
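
To make the run_before definition concrete, here is a small sketch that derives the values from a coefficient list in scan order. It illustrates only the definition given in the abstract, not the improved tables from the paper or a complete CAVLC encoder. The function name is hypothetical, and the pairs it returns (zeros not yet coded, run_before) are listed from the last nonzero coefficient back to the first, on the assumption that coefficients are visited in reverse scan order.

```python
def run_before_pairs(coeffs):
    """Return (zeros_left, run_before) for each nonzero coefficient,
    visiting the nonzero coefficients in reverse scan order."""
    runs = []
    zeros = 0
    for c in coeffs:          # forward pass: count zeros before each nonzero
        if c == 0:
            zeros += 1
        else:
            runs.append(zeros)
            zeros = 0
    zeros_left = sum(runs)    # zeros located before the last nonzero coefficient
    pairs = []
    for run in reversed(runs):
        pairs.append((zeros_left, run))
        zeros_left -= run
    return pairs


# Example coefficients in scan order (trailing zeros are not counted).
coeffs = [0, 3, 0, 1, -1, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
pairs = run_before_pairs(coeffs)
print(pairs)
```

In this sketch the first element of each pair plays the role of the table selector described in the abstract, and the second is the value to be coded. A real encoder skips some of these values when they can be inferred, which this illustration does not model.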

  • Entropy Decoding Processor for Modern Multimedia Applications

    Sumek WISAYATAKSIN  Dongju LI  Tsuyoshi ISSHIKI  Hiroaki KUNIEDA  

     
    PAPER-Embedded, Real-Time and Reconfigurable Systems

      Vol:
    E92-A No:12
      Page(s):
    3248-3257

    An entropy decoding engine plays an important role in modern multimedia decoders. Previous studies that focused on decoding performance paid considerable attention to only one parameter, such as the data parsing speed, but did not consider the performance impact of table configuration time and memory size. In this paper, we develop a novel method of entropy decoding based on a two-step group matching scheme. Our approach achieves high performance in both data parsing speed and configuration time while requiring little memory. We also use our decoding scheme to implement an entropy decoding processor, which performs operations based on normal processor instructions and VLD instructions for decoding variable length codes. Several extended VLD instructions are provided to speed up bitstream parsing in modern multimedia applications. This processor provides a solution with the flexibility of software and the speed of hardware for stand-alone entropy decoding engines. The VLSI hardware is designed with the Language for Instruction Set Architecture (LISA), with 23 Kgates and a 110 MHz maximum clock frequency under TSMC 0.18 µm technology. The experimental simulations show that the proposed processor achieves higher performance and is suitable for many practical applications such as MPEG-2, MPEG-4, H.264/AVC and AAC.

  • Fast Mode Decision on the Enhancement Layer in H.264 Scalable Extension

    Tae-Kyoung KIM  Jeong-Hwan BOO  Sang Ju PARK  

     
    LETTER-Image Processing and Video Processing

      Vol:
    E92-D No:12
      Page(s):
    2545-2547

    Scalable video coding (SVC) was standardized as an extension of H.264/AVC by the JVT (Joint Video Team) in Nov. 2007. The most distinctive feature of SVC is multi-layered coding, where two or more video sequences are compressed into a single bit-stream. This letter proposes a fast block mode decision algorithm for the spatial enhancement layer of SVC. The proposed algorithm achieves early decision by limiting the number of candidate modes for blocks with a certain characteristic, called same motion vector blocks (SMVB). Our proposed method reduces the complexity, in terms of encoding time, by up to 66.17%, while degrading PSNR by at most 0.16 dB and increasing the bit-rate by at most 0.64%.

  • A 48 Cycles/MB H.264/AVC Deblocking Filter Architecture for Ultra High Definition Applications

    Dajiang ZHOU  Jinjia ZHOU  Jiayi ZHU  Satoshi GOTO  

     
    PAPER-Embedded, Real-Time and Reconfigurable Systems

      Vol:
    E92-A No:12
      Page(s):
    3203-3210

    In this paper, a highly parallel deblocking filter architecture for H.264/AVC is proposed to process one macroblock in 48 clock cycles and give real-time support to QFHD@60 fps sequences at less than 100 MHz. Four edge filters, organized in two groups for simultaneously processing vertical and horizontal edges, are applied in this architecture to enhance its throughput. As parallelism increases, pipeline hazards arise owing to the latency of the edge filters and the data dependency of the deblocking algorithm. To solve this problem, a zig-zag processing schedule is proposed to eliminate the pipeline bubbles. The data path of the architecture is then derived according to the processing schedule and optimized through data flow merging, so as to minimize the cost of logic and internal buffer. Meanwhile, the architecture's data input rate is designed to be identical to its throughput, and the transmission order of the input data also matches the zig-zag processing schedule. Therefore, no intercommunication buffer is required between the deblocking filter and the preceding component for speed matching or data reordering. As a result, only one 24×64 two-port SRAM is required as internal buffer in this design. When synthesized with an SMIC 130 nm process, the architecture has a gate count of 30.2 k, which is competitive considering its high performance.

  • Fast Mode Decision Using Global Disparity Vector for Multiview Video Coding

    Dong-Hoon HAN  Yung-Ki LEE  Yung-Lyul LEE  

     
    LETTER-Image

      Vol:
    E92-A No:12
      Page(s):
    3407-3411

    Since multiview video coding (MVC) based on H.264/AVC uses a prediction scheme that exploits inter-view correlation among multiview video, an MVC encoder compresses multiple views more efficiently than a simulcast H.264/AVC encoder. However, as the number of views to be encoded in MVC increases, the total encoding time increases greatly. To reduce the computational complexity of MVC, a fast mode decision using both macroblock-based region segmentation information and the global disparity vector among views is proposed. The proposed method reduces the total encoding time by 1.5 to 2.9 times on average, with a PSNR (Peak Signal-to-Noise Ratio) degradation of about 0.05 dB.

  • Macroblock and Motion Feature Analysis to H.264/AVC Fast Inter Mode Decision

    Yiqing HUANG  Qin LIU  Shuijiong WU  Zhewen ZHENG  Takeshi IKENAGA  

     
    PAPER-Coding

      Vol:
    E92-A No:12
      Page(s):
    3361-3368

    A fast inter mode decision algorithm is proposed in this paper. The algorithm is divided into two stages. In the pre-stage, a skip mode early detection scheme is proposed that exploits the spatial and temporal information of encoded macroblocks (MBs). The homogeneity of the current MB is also analyzed to filter out small inter modes in this stage. Then, during the block matching stage, a motion feature based inter mode decision scheme is introduced that analyzes the motion vector predictor's accuracy, the block overlapping situation and the smoothness of the SAD (sum of absolute differences) value. Moreover, the rate distortion cost is checked at an early stage, and some constraints are set to speed up the whole decision flow. Experiments show that our algorithm achieves a speed-up of up to 53.4% for sequences with different motion types. The overall bit increment and quality degradation are negligible compared with existing works.

  • Adaptive Sub-Sampling Based Reconfigurable SAD Tree Architecture for HDTV Application

    Yiqing HUANG  Qin LIU  Satoshi GOTO  Takeshi IKENAGA  

     
    PAPER-Video Coding

      Vol:
    E92-A No:11
      Page(s):
    2819-2829

    This paper presents a reconfigurable SAD Tree (RSADT) architecture based on an adaptive sub-sampling algorithm for HDTV applications. First, to capture the features of the HDTV picture, pixel difference analysis is applied to each macroblock (MB). Three hardware-friendly sub-sampling patterns are selected adaptively to reduce the complexity of homogeneous MBs and preserve video quality for textured MBs. Second, since two pipeline stages are inserted, the overall clock speed of the RSADT structure is enhanced. Third, to solve the data reuse and hardware utilization problems of the adaptive algorithm, the RSADT structure adopts pixel data organization at both the memory and architecture levels, which leads to full data reuse and hardware utilization. Additionally, a cross reuse structure is proposed to efficiently generate a 16-pixel scaled configurable SAD (sum of absolute differences). Experimental results show that our RSADT architecture saves 61.71% of the processing cycles of the integer motion estimation engine on average and achieves two or four times the processing capability for homogeneous MBs. The maximum clock frequency of our design is 208 MHz under TSMC 0.18 µm technology under worst-case operating conditions (1.62 V, 125 °C). Furthermore, the proposed algorithm and reconfigurable structure are well suited to power-aware real-time encoding systems.
