Gang HE Dajiang ZHOU Jinjia ZHOU Tianruo ZHANG Satoshi GOTO
Intra coding in H.264/AVC significantly enhances video compression efficiency. However, because of the strong data dependency of intra prediction in H.264, pipelining and parallel processing techniques are difficult to apply. Moreover, the long block/MB-level reconstruction loops make it difficult to achieve high hardware utilization and throughput. This paper proposes a high-performance intra prediction architecture that supports the H.264/AVC high profile. The proposed MB/block co-reordering avoids data dependency and improves pipeline utilization, so the timing constraint of real-time 4096×2160 encoding can be met with negligible quality loss. A 16×16 prediction engine and an 8×8 prediction engine work in parallel for prediction and coefficient generation. A reordering interlaced reconstruction is also designed for the fully pipelined architecture. It takes only 160 cycles to process one macroblock (MB). Hardware utilization of the prediction and reconstruction modules is almost 100%. Furthermore, a PE-reusable 8×8 intra predictor and a hybrid SAD & SATD mode decision are proposed to save hardware cost. The design is implemented in 90 nm CMOS technology with 113.2 k gates and can encode 4096×2160 video sequences at 60 fps at an operating frequency of 332 MHz.
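The figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below is only an illustration: the function name `required_clock_mhz` is hypothetical, and it assumes 16×16 macroblocks with no per-frame overhead cycles. It computes the minimum clock needed to process every macroblock of every frame in real time.

```python
import math


def required_clock_mhz(width, height, fps, cycles_per_mb, mb_size=16):
    """Minimum clock (MHz) to process all macroblocks of all frames in real time.

    Assumes a fixed cycle count per macroblock and no other overhead.
    """
    mbs_per_frame = math.ceil(width / mb_size) * math.ceil(height / mb_size)
    return mbs_per_frame * fps * cycles_per_mb / 1e6


# 4096x2160 at 60 fps with 160 cycles per macroblock
clock = required_clock_mhz(4096, 2160, 60, 160)
print(f"{clock:.1f} MHz")  # 331.8 MHz
```

Under these assumptions the result is consistent with the 332 MHz operating frequency stated above.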
Kyu-Yeul WANG Byung-Soo KIM Sang-Seol LEE Dong-Sun KIM Duck-Jin CHUNG
This paper presents a novel low-cost, high-performance CAVLC decoder for H.264/AVC. The proposed CAVLC decoder generates the lengths of the coeff_token and total_zeros symbols with simple arithmetic operations, so it can be implemented with a reduced look-up table. We also propose a multi-symbol run_before decoder with enhanced throughput: it can decode more than 2.5 symbols per cycle when there are run_before symbols to be decoded. The hardware cost is about 12 K gates when synthesized at 125 MHz.
Chaoke PEI Li GAO Donghui WANG Chaohuan HOU
The H.264/AVC standard achieves significantly higher coding efficiency when multiple-block-size motion estimation is adopted. However, the complexity of motion estimation and DCT increases dramatically as a result. In previous work we proposed an early mode decision algorithm to control the complexity, based on all-zero-block detection at the 16×16 size. In this paper, we improve the algorithm. First, we propose to detect all-zero blocks at the 16×16, 8×8 and 4×4 sizes to simplify the mode decision process. Second, we define the thresholds used to terminate motion estimation and mode decision early for these sizes. Last, we present the complete proposed algorithm. Experiments show that about 77% of the encoding time and 85% of the motion estimation time can be saved on average, which is better than state-of-the-art approaches.
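To make the notion of an all-zero block concrete, the following sketch checks whether a 4×4 residual block quantizes to all zeros. It is not the authors' detection method: their thresholds are not given in the abstract, so this sketch simply runs the H.264 4×4 forward core transform and a quantizer of the form commonly described for the reference encoder. The rounding offset of one sixth (a typical inter-block choice) and all function names are assumptions made for illustration.

```python
# H.264 4x4 forward core transform matrix
CF = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]

# Quantization multipliers per QP % 6, for the three coefficient position classes
MF = [
    (13107, 5243, 8066), (11916, 4660, 7490), (10082, 4194, 6554),
    (9362, 3647, 5825), (8192, 3355, 5243), (7282, 2893, 4559),
]


def matmul(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def core_transform(residual):
    """Compute W = CF * X * CF^T for a 4x4 residual block X."""
    cf_t = [list(row) for row in zip(*CF)]
    return matmul(matmul(CF, residual), cf_t)


def is_all_zero_block(residual, qp):
    """Return True if every quantized coefficient of the 4x4 block is zero."""
    qbits = 15 + qp // 6
    offset = (1 << qbits) // 6  # assumed inter rounding offset
    a, b, c = MF[qp % 6]
    w = core_transform(residual)
    for i in range(4):
        for j in range(4):
            if i % 2 == 0 and j % 2 == 0:
                mf = a
            elif i % 2 == 1 and j % 2 == 1:
                mf = b
            else:
                mf = c
            if (abs(w[i][j]) * mf + offset) >> qbits:
                return False
    return True


flat = [[1] * 4 for _ in range(4)]   # small, flat residual
print(is_all_zero_block(flat, 28))   # True
print(is_all_zero_block(flat, 0))    # False
```

An early-termination scheme of the kind described above aims to predict this outcome from a cheaper measure, so that the transform and quantization can be skipped.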
Tongsheng GENG Leibo LIU Shouyi YIN Min ZHU Shaojun WEI
This paper proposes approaches to HW/SW (Hardware/Software) partitioning and parallelization of the computing-intensive tasks of the H.264 HiP (High Profile) decoding algorithm on an embedded coarse-grained reconfigurable multimedia system called REMUS (REconfigurable MUltimedia System). Several techniques, such as MB (Macro-Block) based parallelization and unfixed sub-block operation, are utilized to speed up the decoding process, satisfying the requirements of real-time, high-quality H.264 applications. Tests show that the execution performance of MC (Motion Compensation), deblocking, and IDCT_IQ (Inverse Discrete Cosine Transform-Inverse Quantization) on REMUS is improved by 60%, 73%, and 88.5% in the typical case and by 60%, 69%, and 88.5% in the worst case, respectively, compared with XPP PACT (a commercial reconfigurable processor). Compared with ASIC solutions, the performance of MC is improved by 70% in the typical case and 74% in the worst case, while that of deblocking remains the same. The performance of IDCT_IQ is improved by 17% in both the typical and worst cases. With the proposed techniques, 1080p@30 fps H.264 HiP@Level 4 decoding can be achieved on REMUS at a 200 MHz working frequency.
Zhenyu LIU Dongsheng WANG Takeshi IKENAGA
Variable block size motion estimation, adopted by the latest video coding standard H.264/AVC, is an efficient approach to reducing temporal redundancy. The intensive computational complexity of the variable block size technique makes a hardwired accelerator essential for real-time applications. The propagate partial sums of absolute differences (Propagate Partial SAD) and SAD Tree hardwired engines outperform their counterparts, especially when the impact of supporting variable block sizes is considered. In this paper, the authors apply architecture-level and circuit-level approaches to improve the maximum operating frequency and reduce the hardware overhead of Propagate Partial SAD and SAD Tree, while maintaining the other metrics of the original architectures, namely latency, memory bandwidth and hardware utilization. Experiments demonstrate that, at a 110.8 MHz operating frequency, the proposed approaches save 14.7% and 18.0% of the gate count for Propagate Partial SAD and SAD Tree, respectively, compared with the original architectures. With TSMC 0.18 µm 1P6M CMOS technology, the proposed Propagate Partial SAD architecture achieves a 231.6 MHz operating frequency at a cost of 84.1 k gates. Correspondingly, the maximum operating frequency of the optimized SAD Tree architecture is improved to 204.8 MHz, almost twice that of the original, while its hardware overhead is merely 88.5 k gates.
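The reason variable block sizes fit SAD engines of this kind is that the SAD of every larger partition is a sum of 4×4 SADs, so one pass over the macroblock yields all 41 partition SADs. The sketch below shows that reuse in software. It is an assumption-laden illustration rather than the hardware described above, and the function names are hypothetical.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid of SADs, one per 4x4 sub-block of a 16x16 macroblock."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def merge(grid, rows, cols):
    """Sum grid cells into partitions spanning rows x cols cells of 4x4 pixels."""
    return [[sum(grid[r + i][c + j] for i in range(rows) for j in range(cols))
             for c in range(0, 4, cols)]
            for r in range(0, 4, rows)]


def variable_block_sads(cur, ref):
    """SADs for all H.264 partition sizes, keyed as width x height."""
    g = sad_4x4_grid(cur, ref)
    return {
        "4x4": g,
        "8x4": merge(g, 1, 2),
        "4x8": merge(g, 2, 1),
        "8x8": merge(g, 2, 2),
        "16x8": merge(g, 2, 4),
        "8x16": merge(g, 4, 2),
        "16x16": merge(g, 4, 4),
    }


cur = [[1] * 16 for _ in range(16)]
ref = [[0] * 16 for _ in range(16)]
sads = variable_block_sads(cur, ref)
print(sum(len(row) for part in sads.values() for row in part))  # 41
print(sads["16x16"][0][0])  # 256
```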
Yizhong LIU Tian SONG Takashi SHIMAMOTO
In this paper, we propose a high-throughput binary arithmetic coding architecture for CABAC (Context Adaptive Binary Arithmetic Coding), one of the entropy coding tools used in the H.264/AVC main and high profiles. The full set of CABAC encoding functions, including binarization, context model selection, arithmetic encoding and bit generation, is implemented in this proposal. Binarization and context model selection are implemented in a proposed binarizer, in which a FIFO is used to pack the binarization results and output 4 bins per clock. Arithmetic encoding and bit generation are implemented in a four-stage pipeline with an encoding capability of 4 bins/clock. To improve the processing speed, context variable access and update for the 4 bins are parallelized and the pipeline paths are balanced. In addition, to handle the outstanding bits issue, a bit packing and generation strategy for 4-bin parallel processing is proposed. Implemented in Verilog HDL and synthesized with Synopsys Design Compiler using 90 nm libraries, the design works at a clock frequency of 250 MHz and takes up about 58 K standard cells, 3.2 Kbits of register files and 27.6 Kbits of ROM. A throughput of 1000 M bins per second can be achieved for HDTV applications.
Scanning quantized transform coefficients is an important tool for video coding. For example, the MPEG-4 video coder adopts three different scans to achieve better coding efficiency. This paper proposes an adaptive zero-coefficient distribution scan for inter block coding. The proposed method attempts to improve H.264/AVC zero-coefficient coding by modifying the scan operation. Since the zero-coefficient distribution is changed by the proposed scan method, new VLC tables for the syntax elements used in context-adaptive variable length coding (CAVLC) are also provided. The bit-rate savings range from 2.2% to 5.1% in the high bit-rate cases, depending on the test sequence.
Jinjia ZHOU Dajiang ZHOU Xun HE Satoshi GOTO
In this paper, a VLSI architecture for a joint parameter decoder is proposed to calculate the motion vector (MV), intra prediction mode (IPM) and boundary strength (BS) for ultra high definition H.264/AVC applications. For this architecture, a 64-cycle-per-MB pipeline with simplified control modes is designed to increase system throughput and reduce hardware cost. Moreover, to save memory bandwidth, the data that includes the motion information for the co-located picture and the last decoded line is pre-processed before being stored to DRAM. A partition-based storage format is applied to condense the MB-level data, while a variable length coding based compression method is utilized to reduce the data size in each partition. Experimental results show that our design is capable of real-time 3840×2160@60 fps decoding at less than 133 MHz, with 37.2 k logic gates. Meanwhile, the proposed scheme achieves an 85-98% bandwidth saving compared with storing the original information for every 4×4 block to DRAM.
In this letter, we present a simple but efficient intra prediction mode decision for H.264/AVC. Based on our investigation, the DC mode appears to be the superior prediction mode among the various candidates. We propose an intra-mode decision algorithm in which the DC mode is chosen as a candidate for the best prediction mode. Experimental results show that, on average, the proposed algorithm saves 81.905% of the entire encoding time compared to the H.264 reference software, with a negligible loss in peak signal-to-noise ratio (PSNR) and a slight increase in bitrate.
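For readers unfamiliar with the DC mode, the sketch below shows how an H.264 4×4 luma DC prediction is formed from the reconstructed neighbors above and to the left of the block, for 8-bit video. It illustrates the standard DC predictor only and not the decision algorithm of this letter; the function name is hypothetical.

```python
def predict_dc_4x4(top, left):
    """Build a 4x4 DC prediction block (8-bit video).

    top, left: lists of 4 reconstructed neighbor pixels, or None if unavailable.
    """
    if top is not None and left is not None:
        dc = (sum(top) + sum(left) + 4) >> 3   # rounded mean of 8 neighbors
    elif top is not None:
        dc = (sum(top) + 2) >> 2               # rounded mean of 4 neighbors
    elif left is not None:
        dc = (sum(left) + 2) >> 2
    else:
        dc = 128                               # mid-gray when nothing is available
    return [[dc] * 4 for _ in range(4)]


pred = predict_dc_4x4([10, 12, 14, 16], [20, 20, 20, 20])
print(pred[0])  # [17, 17, 17, 17]
```

Because every predicted pixel takes the same value, the DC mode is cheap to evaluate, which is part of what makes it attractive as a default candidate.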
Shuijiong WU Peilin LIU Yiqing HUANG Qin LIU Takeshi IKENAGA
An H.264/AVC encoder employs rate control to adaptively adjust the quantization parameter (QP) so that coded video can be transmitted over a constant bit-rate (CBR) channel. In rate control, bit allocation is crucial since it is directly related to actual bit generation and coding quality. Meanwhile, the rate-distortion-optimization (RDO) based mode-decision technique also strongly affects performance because of the close relation among mode, bits, and quality. This paper presents a multi-stage rate control scheme for R-D optimized H.264/AVC encoders under CBR video transmission. First, to enhance the precision of complexity estimation and bit allocation, a frequency-domain parameter named mean-absolute-transform-difference (MATD) is adopted to represent frame and macroblock (MB) residual complexity. Second, the MATD ratio is utilized to enhance the accuracy of frame-layer bit prediction. Then, by considering the bit usage status of the whole sequence, a measurement combining forward and backward bit analysis is proposed to adjust the Lagrange multiplier λMODE on the frame layer to optimize the mode decision for all MBs within the current frame. In the next stage, bits are allocated on the MB layer by the proposed remaining complexity analysis. The computed QP is further adjusted according to the predicted MB texture bits. Simulation results show a PSNR improvement of up to 1.13 dB with our algorithm, and the stress of output buffer control is also largely relieved compared with the recommended rate control in the H.264/AVC reference software JM13.2.
Jie JIA Daeil YOON Hae Kwang KIM
The video coding standard H.264/AVC employs transform coding to exploit spatial correlation in the inter picture prediction residue. This paper presents a block-based DC offset that further exploits the correlation among spatially neighboring blocks and gives H.264/AVC enhanced coding efficiency. The proposed method applies a DC offset to the inter picture prediction residue and encodes the offset-compensated residual signal. The DC offset is derived from the reconstructed residue in neighboring blocks, so no additional bits are required to represent it. Simulation results report that the proposed method yields an average bit rate reduction of 2.67% for high resolution video over the H.264 baseline profile.
The rate-distortion optimization (RDO) method in the H.264/AVC encoder improves coding efficiency but increases computational complexity. In this letter, a fast intra mode decision algorithm using the distribution of DCT (Discrete Cosine Transform) coefficients is proposed to reduce the H.264 encoder complexity. The proposed method reduces the encoder complexity by 63.44% on average, while the coding efficiency is slightly decreased compared with the H.264/AVC encoder.
A Super Hi-Vision (SHV) 4k×4k@60 fps fractional motion estimation (FME) engine is proposed in this paper. First, two complexity reduction schemes are proposed at the algorithm level. By analyzing the integer motion cost of the sub-blocks in each inter mode, the mode reduction based mode pre-filtering scheme achieves a 48% clock cycle saving compared with the previous algorithm. By further checking the motion cost of the search points around the best integer candidate, the motion cost oriented directional one-pass scheme provides a 50% clock cycle saving and a 36% reduction in the number of processing units (PU). Second, at the hardware level, two parallelism schemes, namely 16-Pel processing and the MB-parallel scheme, are presented, which reduce the required clock frequency to only 145 MHz for SHV FME processing. Quarter sub-sampling is also adopted in our design, reducing the hardware cost of each PU by 75%. Third, a unified pixel block loading scheme is proposed. About 28.67% to 86.39% of the pixels are reused and the related memory access is saved. Furthermore, we present a parity pixel organization scheme to resolve the memory access conflicts of the MB-parallel scheme. Using TSMC 0.18 µm technology under worst-case conditions (1.62 V, 125 °C), our FME engine achieves real-time processing for SHV 4k×4k@60 fps with 412 k gates of hardware.
Jie JIA Daeil YOON Hae Kwang KIM
Context-based adaptive variable length coding (CAVLC) is an entropy coding scheme employed in H.264/AVC for transform coefficient compression. CAVLC encodes the levels of the nonzero-valued coefficients and then indicates their positions with run_before, which is the number of zeros preceding each nonzero coefficient in scan order. In H.264, run_before is coded using lookup tables selected by the number of zero-valued coefficients that have not yet been coded. This paper presents an improved run_before coding method that encodes run_before using tables that take both zero-valued and nonzero-valued coefficients into consideration. Simulation results report that the proposed method yields an average bit rate reduction of 4.40% for run_before coding over the H.264 baseline profile with an intra-only coding structure. This corresponds to an average saving of 0.52% of the total bit rate.
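The baseline behavior described here can be sketched as follows: given a block of coefficients in scan order, derive total_zeros and the run_before value of each nonzero coefficient, then list the (run_before, zeros_left) pairs that standard CAVLC would actually code, working from the highest-frequency coefficient backwards. This shows standard H.264 behavior only, not the improved tables proposed in the paper. The function names are hypothetical, and the VLC table lookup itself is omitted.

```python
def cavlc_runs(coeffs):
    """Return (total_coeffs, total_zeros, runs) for coefficients in scan order.

    runs lists run_before per nonzero coefficient, highest frequency first.
    """
    positions = [i for i, c in enumerate(coeffs) if c != 0]
    if not positions:
        return 0, 0, []
    total_coeffs = len(positions)
    total_zeros = positions[-1] + 1 - total_coeffs  # zeros before the last nonzero
    runs = []
    prev = -1
    for p in positions:
        runs.append(p - prev - 1)  # zeros immediately preceding this coefficient
        prev = p
    runs.reverse()
    return total_coeffs, total_zeros, runs


def coded_run_before(total_zeros, runs):
    """Return the (run_before, zeros_left) pairs that are actually coded.

    zeros_left selects the lookup table. Coding stops once no zeros remain,
    and the run of the lowest-frequency coefficient is never coded.
    """
    zeros_left = total_zeros
    coded = []
    for run in runs[:-1]:
        if zeros_left == 0:
            break
        coded.append((run, zeros_left))
        zeros_left -= run
    return coded


block = [0, 3, 0, 1, -1, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
n, tz, runs = cavlc_runs(block)
print(n, tz, runs)                 # 5 3 [1, 0, 0, 1, 1]
print(coded_run_before(tz, runs))  # [(1, 3), (0, 2), (0, 2), (1, 2)]
```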
Sumek WISAYATAKSIN Dongju LI Tsuyoshi ISSHIKI Hiroaki KUNIEDA
An entropy decoding engine plays an important role in modern multimedia decoders. Previous studies that focused on decoding performance paid considerable attention to only one parameter, such as the data parsing speed, and did not consider the effect of table configuration time and memory size on performance. In this paper, we develop a novel entropy decoding method based on a two-step group matching scheme. Our approach achieves high performance in both data parsing speed and configuration time while requiring little memory. We also apply our decoding scheme to implement an entropy decoding processor, which operates on normal processor instructions and VLD instructions for decoding variable length codes. Several extended VLD instructions are provided to speed up bitstream parsing in modern multimedia applications. This processor combines software flexibility with hardware speed for stand-alone entropy decoding engines. The VLSI hardware is designed in the Language for Instruction Set Architecture (LISA) with 23 Kgates and a 110 MHz maximum clock frequency under TSMC 0.18 µm technology. Experimental simulations show that the proposed processor achieves higher performance and is suitable for many practical applications such as MPEG-2, MPEG-4, H.264/AVC and AAC.
Tae-Kyoung KIM Jeong-Hwan BOO Sang Ju PARK
Scalable video coding (SVC) was standardized as an extension of H.264/AVC by the JVT (Joint Video Team) in Nov. 2007. The most distinctive feature of SVC is multi-layered coding, in which two or more video sequences are compressed into a single bit-stream. This letter proposes a fast block mode decision algorithm for the spatial enhancement layer of SVC. The proposed algorithm achieves early decision by limiting the number of candidate modes for blocks with a certain characteristic, called same motion vector blocks (SMVB). Our proposed method reduces the complexity, in terms of encoding time, by up to 66.17%, with a negligible PSNR degradation of at most 0.16 dB and a bit-rate increase of at most 0.64%.
Dajiang ZHOU Jinjia ZHOU Jiayi ZHU Satoshi GOTO
In this paper, a highly parallel deblocking filter architecture for H.264/AVC is proposed that processes one macroblock in 48 clock cycles and gives real-time support to QFHD@60 fps sequences at less than 100 MHz. Four edge filters, organized in two groups for simultaneously processing vertical and horizontal edges, are applied in this architecture to enhance its throughput. As parallelism increases, pipeline hazards arise owing to the latency of the edge filters and the data dependency of the deblocking algorithm. To solve this problem, a zig-zag processing schedule is proposed to eliminate the pipeline bubbles. The data path of the architecture is then derived from the processing schedule and optimized through data flow merging, so as to minimize the cost of logic and internal buffer. Meanwhile, the architecture's data input rate is designed to be identical to its throughput, and the transmission order of the input data matches the zig-zag processing schedule. Therefore, no intercommunication buffer is required between the deblocking filter and the preceding component for speed matching or data reordering. As a result, only one 24×64 two-port SRAM is required as an internal buffer in this design. When synthesized with the SMIC 130 nm process, the architecture costs a gate count of 30.2 k, which is competitive considering its high performance.
Dong-Hoon HAN Yung-Ki LEE Yung-Lyul LEE
Since multiview video coding (MVC) based on H.264/AVC uses a prediction scheme that exploits inter-view correlation among multiview video, an MVC encoder compresses multiple views more efficiently than a simulcast H.264/AVC encoder. However, as the number of views to be encoded increases, the total encoding time increases greatly. To reduce the computational complexity of MVC, a fast mode decision using both macroblock-based region segmentation information and the global disparity vector among views is proposed. The proposed method reduces the total encoding time by 1.5× to 2.9× on average, with a PSNR (Peak Signal-to-Noise Ratio) degradation of about 0.05 dB.
Yiqing HUANG Qin LIU Shuijiong WU Zhewen ZHENG Takeshi IKENAGA
A fast inter mode decision algorithm is proposed in this paper. The algorithm is divided into two stages. In the pre-stage, a skip mode early detection scheme is proposed that exploits the spatial and temporal information of encoded macroblocks (MBs). The homogeneity of the current MB is also analyzed to filter out small inter modes in this stage. In the second stage, during block matching, a motion feature based inter mode decision scheme is introduced that analyzes the accuracy of the motion vector predictor, the block overlapping situation and the smoothness of the SAD (sum of absolute differences) value. Moreover, the rate distortion cost is checked at an early stage and constraints are set to speed up the whole decision flow. Experiments show that our algorithm achieves a speed-up of up to 53.4% for sequences with different motion types. The overall bit increment and quality degradation are negligible compared with existing works.
Yiqing HUANG Qin LIU Satoshi GOTO Takeshi IKENAGA
This paper presents a reconfigurable SAD Tree (RSADT) architecture based on an adaptive sub-sampling algorithm for HDTV applications. First, to capture the features of the HDTV picture, pixel difference analysis is applied to each macroblock (MB). Three hardware-friendly sub-sampling patterns are selected adaptively to reduce the complexity for homogeneous MBs and preserve video quality for textured MBs. Second, two pipeline stages are inserted, which raises the overall clock speed of the RSADT structure. Third, to solve the data reuse and hardware utilization problems of the adaptive algorithm, the RSADT structure adopts pixel data organization at both the memory and architecture levels, which leads to full data reuse and hardware utilization. Additionally, a cross reuse structure is proposed to efficiently generate the 16-pixel scaled configurable SAD (sum of absolute differences). Experimental results show that our RSADT architecture saves 61.71% of the processing cycles of the integer motion estimation engine on average and provides two or four times the processing capability for homogeneous MBs. The maximum clock frequency of our design is 208 MHz under TSMC 0.18 µm technology in worst-case conditions (1.62 V, 125 °C). Furthermore, the proposed algorithm and reconfigurable structure are well suited to power-aware real-time encoding systems.