Jong-Ho KIM Byung-Gyu KIM Chang-Sik CHO
A fast intra-mode decision algorithm is proposed on the basis of the inter-mode block type for inter-frames (P-slices). Each macroblock (MB) type has its own intra prediction modes (I16MB and 8×8 chroma: 4 modes; I4MB and I8MB: 9 modes). Evaluating all of them adds considerable computational complexity on top of the inter-mode decision procedure. In most cases, there is a high correlation between the best inter-mode block type and the direction of the texture edge or object boundary. Based on this correlation, only a small number of intra-prediction modes are evaluated to determine the best intra mode. We experimentally verify that the proposed scheme significantly reduces the overall encoding time with a negligible loss of image quality and a minimal bit increase. The average loss in PSNR was -0.012 to 0.036 dB and the bit increment was approximately -0.194 to 0.751%.
Jeongae PARK Misun YOON Hyunchul SHIN
Motion estimation (ME) is a computation-intensive procedure in H.264. For ME with variable block sizes, an effective scan ordering method has been devised that terminates the absolute difference computation early when the termination does not affect performance. The new ME circuit with effective scan ordering reduces the amount of computation by 70% compared to JM8.2 and by 30% compared to the disable approximation unit (DAU) approach.
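The early-termination idea can be illustrated with a minimal sketch. It assumes a plain row-by-row scan and a hypothetical function name `sad_early_exit`; the scan ordering that the abstract credits for the savings is not described there and is not reproduced here.

```python
def sad_early_exit(cur, ref, best_so_far):
    """Sum of absolute differences with early termination.

    cur, ref: equally sized 2-D lists of pixel values.
    best_so_far: smallest SAD found for earlier candidates.
    Returns the SAD, or None once the partial sum can no longer
    beat best_so_far (the candidate is rejected early).
    """
    total = 0
    for row_cur, row_ref in zip(cur, ref):
        # Accumulate one row of absolute differences (assumed scan order).
        total += sum(abs(c - r) for c, r in zip(row_cur, row_ref))
        if total >= best_so_far:
            return None  # remaining rows cannot make this candidate win
    return total
```

Because the partial sum only grows, rejecting a candidate at this point never changes which candidate wins, which is why the termination does not affect performance.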
Dae-Yeon KIM Dong-Kyun KIM Yung-Lyul LEE
In H.264/AVC, the quantized coefficients are scanned in a zigzag pattern. However, zigzag scanning is not always efficient for the directional spatial predictions used in the intra coding of H.264/AVC. In this letter, we propose an adaptive scanning method that uses the similarity of neighboring pixels to achieve enhanced intra coding performance. The proposed method reduces the bit rate by approximately 2% compared with H.264/AVC without video quality degradation.
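For reference, the baseline zigzag pattern on a 4×4 block can be generated as in the sketch below. The function names are illustrative, and the adaptive scan proposed in the letter is not specified in the abstract, so only the conventional order is shown.

```python
def zigzag_order(n=4):
    """Return (row, col) positions of an n x n block in zigzag order."""
    order = []
    for s in range(2 * n - 1):
        # All positions on the anti-diagonal row + col == s, top to bottom.
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals are walked bottom-left to top-right, odd ones reversed.
        order.extend(diag[::-1] if s % 2 == 0 else diag)
    return order


def zigzag_scan(block):
    """Flatten a square block of coefficients into zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]
```

Scanning a block whose entries are their own raster indices yields 0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15.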
Yutao DONG Xiangzhong FANG Jing YANG
The frame-level R-D optimization in H.264 is very important in video storage scenarios. Among all of the sub-optimal algorithms, a greedy iteration algorithm (GIA) can best lower the computational complexity of frame-level R-D optimization. In order to further lower the computational complexity, a ρ-domain frame-level R-D optimization algorithm is proposed in this letter. Different from GIA, every frame's rate and distortion can be estimated accurately without actual encoding in our proposed algorithm. Simulation results show that our proposed algorithm can lower the computational complexity greatly with negligible variation in peak signal-to-noise ratio (PSNR) compared with GIA.
Ming SHAO Zhenyu LIU Satoshi GOTO Takeshi IKENAGA
Fractional Motion Estimation (FME) is an advanced feature adopted in the H.264/AVC video compression standard with quarter-pixel accuracy. Although FME achieves considerably higher encoding efficiency, sub-pixel interpolation and sum of absolute transformed difference (SATD) computation, the main parts of FME, substantially increase the computational complexity. To reduce the complexity of FME, this paper proposes a VLSI-oriented algorithm with full computation reuse. By exploiting the similarity among motion vectors (MVs) of partitions in the same macroblock (MB), temporary computation results can be fully reused. Furthermore, a simple and effective searching method is adopted to make the proposed method more suitable for VLSI implementation. Experimental results show that up to 80% of add operations and 85% of internal reference frame memory access operations are saved without any degradation in coding quality.
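To make the SATD cost concrete, the sketch below computes an unscaled 4×4 SATD with a Hadamard transform. The helper names are hypothetical, encoders commonly apply a scaling factor that is omitted here, and the computation-reuse scheme of the paper is not reproduced.

```python
# 4x4 Hadamard matrix; it is symmetric, so it equals its own transpose.
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def satd4x4(cur, ref):
    """Unscaled sum of absolute transformed differences of two 4x4 blocks."""
    diff = [[c - r for c, r in zip(row_c, row_r)] for row_c, row_r in zip(cur, ref)]
    transformed = matmul(matmul(H4, diff), H4)  # H4 * diff * H4^T
    return sum(abs(v) for row in transformed for v in row)
```

Each candidate position needs one such transform per 4×4 block, which is the kind of repeated work that sharing results between partitions with similar MVs can avoid.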
In this paper, novel hardware sharing architectures are proposed for realizations of fast 4×4 and 8×8 forward/inverse integer transforms in H.264/AVC applications. Based on matrix factorizations, cost-effective architectures for fast one-dimensional (1-D) 4×4 and 8×8 forward/inverse integer transforms can be derived through Kronecker and direct sum operations. By applying the concept of hardware sharing, the proposed hardware schemes for fast integer transforms need fewer shifters and adders than the direct realization architecture, which implements the 4×4 and 8×8 integer transforms independently. With low hardware cost and regular modularity, the proposed hardware sharing architectures can operate at up to 125 MHz with a cost-effective area and are suitable for VLSI implementations of H.264/AVC signal processing.
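As a small illustration of why factorized transforms need only shifters and adders, the sketch below evaluates the 1-D 4×4 forward core transform of H.264/AVC both as a matrix product and as a butterfly. The function names are illustrative, and the shared 4×4/8×8 architecture of the paper is not reproduced.

```python
# 4x4 forward core transform matrix of H.264/AVC.
CF4 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def forward4_matrix(x):
    """Direct evaluation: one dot product per output sample."""
    return [sum(c * v for c, v in zip(row, x)) for row in CF4]


def forward4_butterfly(x):
    """Same result using only additions, subtractions and shifts."""
    s0, s1 = x[0] + x[3], x[1] + x[2]  # first stage: sums
    d0, d1 = x[0] - x[3], x[1] - x[2]  # first stage: differences
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]
```

The butterfly uses eight additions and two shifts per four input samples and no multiplier.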
Shen LI Lingfeng LI Takeshi IKENAGA Shunichi ISHIWATA Masataka MATSUI Satoshi GOTO
The coexistence of MPEG-2 and its powerful successor H.264/AVC has created a huge need for MPEG-2/H.264 video transcoding. However, a traditional transcoder where an MPEG-2 decoder is simply cascaded to an H.264 encoder requires huge computational power due to the adoption of a complicated rate-distortion based mode decision process in H.264. This paper proposes a 2-D Sobel filter based motion vector domain method and a DCT domain method to measure macroblock complexity and realize content-based H.264 candidate mode decision. A new local edge based fast INTRA prediction mode decision method is also adopted to boost the encoding efficiency. Simulation results confirm that with the proposed methods the computational burden of a traditional transcoder can be reduced by 20% to 30% with only a negligible bit-rate increase for a wide range of video sequences.
In this paper, we study and analyze the computational complexity of the deblocking filter in an H.264/AVC baseline decoder using the SimpleScalar/ARM simulator. The simulation results show that memory references, content activity check operations, and filter operations are very time consuming in the decoder of this new video coding standard. In order to improve overall system performance, we propose a novel processing order with an efficient VLSI architecture that simultaneously processes the horizontal filtering of vertical edges and the vertical filtering of horizontal edges. As a result, the memory performance of the proposed architecture is improved by a factor of four compared to the software implementation. Moreover, the system performance of our design significantly outperforms the previous proposals.
Zhenyu LIU Yang SONG Takeshi IKENAGA Satoshi GOTO
A full search variable block size motion estimation (VBSME) architecture with integer pixel accuracy is proposed in this paper. The proposed architecture has the following features: (1) By widening the data path from the search area memories, m processing element groups (PEG) can be scheduled to work in parallel and be fully utilized, where m is a factor of sixteen. Each PEG has sixteen processing elements (PE) and costs just 8.5K gates. This feature gives users more flexibility to trade off hardware cost against performance. (2) Based on pipelining and multi-cycle data path techniques, the architecture can work at a high clock frequency. (3) The memory partition number is greatly reduced. When sixteen PEGs are adopted, only two memory partitions are required for the search area data storage. Therefore, both the system hardware cost and power consumption can be reduced. A 16-PEG design with a 48×32 search range has been implemented with TSMC 0.18 µm CMOS technology. In typical work conditions, its maximum clock frequency is 261 MHz. Compared with the previous 2-D architecture [9], about 13.4% of the hardware cost and 5.7% of the power consumption can be saved.
In this letter, efficient two-dimensional (2-D) fast algorithms for realizations of 8×8 forward and inverse integer transforms in H.264/AVC fidelity range extensions (FRExt) are proposed. Based on matrix factorizations with Kronecker product and direct sum operations, efficient fast 2-D 8×8 forward and inverse integer transforms can be derived from the one-dimensional (1-D) fast 8×8 forward and inverse integer transforms through matrix operations. The proposed fast 2-D 8×8 forward and inverse integer transform designs don't require transpose memory in hardware realizations. The fast 2-D 8×8 integer transforms require fewer latency delays and provide a larger throughput rate than the row-column based method. With regular modularity, the proposed fast algorithms are suitable for VLSI implementations to achieve H.264/AVC FRExt high-profile signal processing.
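The standard identity behind this kind of derivation can be stated as follows. The symbols are assumptions introduced here: T is the 1-D transform matrix, X the input block, Y the transformed block, and vec stacks a matrix into a single vector.

```latex
\[
Y = T\,X\,T^{\mathsf{T}}
\quad\Longleftrightarrow\quad
\operatorname{vec}(Y) = (T \otimes T)\,\operatorname{vec}(X)
\]
```

The left form is the row-column method, which needs the intermediate result transposed between the two 1-D passes. The right form is a single matrix-vector product, and any factorization of T into sparse stages carries over to T ⊗ T through the mixed-product property of the Kronecker product.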
Kentaro KAWAKAMI Jun TAKEMURA Mitsuhiko KURODA Hiroshi KAWAGUCHI Masahiko YOSHIMOTO
We propose an elastic pipeline that can apply dynamic voltage scaling (DVS) to hardwired logic circuits. In order to demonstrate its feasibility, a hardwired H.264/AVC HDTV decoder is designed as a real-time application. An entropy decoding process is divided into context-based adaptive binary arithmetic coding (CABAC) and syntax element decoding (SED), which has advantages of smoothing workload for CABAC and keeping efficiency of the elastic pipeline. An operating frequency and supply voltage are dynamically modulated every slot depending on workload of H.264 decoding to minimize power. We optimize the number of slots per frame to enhance power reduction. The proposed decoder achieves a power reduction of 50% in a 90-nm process technology, compared to the conventional clock-gating scheme.
In H.264, context-based adaptive variable length coding (CAVLC) is used for lossless compression. A direct table-lookup implementation is costly because it requires a large memory to produce the encoded results. In this letter, we present a more efficient technique for CAVLC implementation. Compared with previous CAVLC chips, our design requires the lowest hardware cost.
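As background on what a CAVLC encoder has to derive before any table lookup, the sketch below counts the nonzero coefficients and the trailing ±1 coefficients (at most three) of a zigzag-scanned block. The function name is hypothetical, and the letter's hardware technique is not reproduced.

```python
def cavlc_counts(coeffs):
    """Return (total_coeff, trailing_ones) for zigzag-ordered coefficients.

    trailing_ones counts the consecutive +/-1 values at the high-frequency
    end of the nonzero coefficients, capped at three.
    """
    nonzero = [c for c in coeffs if c != 0]
    trailing_ones = 0
    for c in reversed(nonzero):
        if abs(c) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break  # a larger magnitude, or the cap, ends the run
    return len(nonzero), trailing_ones
```

For the sequence 0, 3, 0, 1, -1, -1, 0, 1 followed by zeros, this yields five nonzero coefficients and three trailing ones.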
Yang SONG Zhenyu LIU Takeshi IKENAGA Satoshi GOTO
A one-dimensional (1-D) full search variable block size motion estimation (VBSME) architecture is presented in this paper. By properly choosing the partial sum of absolute differences (SAD) registers and scheduling the addition operations, the architecture can be implemented with simple control logic and regular workflow. Moreover, only one single-port SRAM is used to store the search area data. The design is realized in TSMC 0.18 µm 1P6M technology with a hardware cost of 67.6K gates. In typical working conditions (1.8 V, 25 °C), a clock frequency of 266 MHz can be achieved.
Junichi MIYAKOSHI Yuichiro MURACHI Tetsuro MATSUNO Masaki HAMAMOTO Takahiro IINUMA Tomokazu ISHIHARA Hiroshi KAWAGUCHI Masayuki MIYAMA Masahiko YOSHIMOTO
We propose a sub-mW H.264 baseline-profile motion estimation processor for portable video applications. It features a VLSI-oriented block partitioning strategy and a low-power SIMD/systolic-array datapath architecture, where the datapath can be switched between an SIMD and a systolic array depending on the processing flow. The processor supports all seven block modes and can handle three reference frames for sequences from CIF (352×288) at 30 fps to QCIF (176×144) at 15 fps with quarter-pixel accuracy. It integrates 3.3 million transistors and occupies 2.8×3.1 mm² in a 130-nm CMOS technology. The proposed processor achieves a power of 800 µW for a QCIF 15-fps sequence with one reference picture.
Junichi MIYAKOSHI Yuichiro MURACHI Tomokazu ISHIHARA Hiroshi KAWAGUCHI Masahiko YOSHIMOTO
For super-parallel video processing, we propose a power- and area-efficient SRAM core architecture with segmentation-free access, which means accessibility to arbitrary consecutive pixels, and horizontal/vertical access. To achieve these flexible accesses, a spirally-connected local-wordline select signal and a multi-selection scheme in wordlines are proposed, so that the extra X-decoders of the conventional multi-division SRAM can be eliminated. Consequently, the proposed SRAM reduces power and area by 57-60% and 60%, respectively, when it is applied to a 128-parallel architecture. The proposed 160-kbit SRAM with 16 read ports (2-read-port SRAM with eight-parallel architecture) is implemented as a search window buffer for an H.264 motion estimation processor core which dissipates 800 µW for QCIF 15-fps in a 130-nm technology.
This paper proposes a fast motion estimation algorithm for variable block sizes in H.264 that uses a bottom-up motion vector procedure. The refined motion vectors of adjacent small blocks are merged to predict the motion vectors of larger blocks, which reduces the computation. Experimental results show that our proposed method has lower computational complexity than full search, fast full search and fast motion estimation of the H.264 reference software JM93, with a slight quality decrease and a small bit-rate increase.
Donghyung KIM Jongho KIM Jechang JEONG
The H.264 standard allows each macroblock to have up to sixteen motion vectors, four reference frames, and a macroblock mode. Exploiting this feature, we present an efficient temporal error concealment algorithm for H.264-coded video. The proposed method shows good performance compared with conventional approaches.
We devised an efficient deblocking filter architecture and implemented the circuit with 15,400 logic gates and a 160×32 dual-port SRAM using 0.25 µm standard cell technology. This circuit can process 88 image frames with 1,280×720 pixels per second at 166 MHz. Our circuit requires a smaller number of accesses to the external memory than other approaches and hence causes less bus traffic in the SoC design platform.
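The quoted throughput can be turned into a per-macroblock cycle budget with a short calculation. This is a back-of-the-envelope sketch that assumes 16×16 macroblocks and reads the figures as 88 frames of 1,280 by 720 pixels per second at a 166 MHz clock; the function name is illustrative.

```python
def cycles_per_macroblock(clock_hz, width, height, fps, mb_size=16):
    """Average clock cycles available per macroblock at a given frame rate."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    return clock_hz / (mbs_per_frame * fps)


budget = cycles_per_macroblock(166_000_000, 1280, 720, 88)
```

Under these assumptions a frame holds 3,600 macroblocks, so the budget comes to roughly 524 cycles per macroblock.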
This paper presents a video coding method that improves the error resilience of H.264 while retaining good coding efficiency. The method is based on PD (polyphase downsampling) multiple description coding. The only changes to H.264 are inserting PD before the DCT process and adding new data partitioning NAL units. A coded slice is sent in 3 data partitioning NAL units. A header NAL unit contains motion vectors and block modes. Each of the other two NAL units contains a description generated by PD multiple description coding. The experimental results on all 9 of the test sequences of JVT SVC show that the proposed method gives a 0.5 to 5 dB enhancement over the existing H.264 FMO checkerboard mode with motion vector based error concealment.
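Polyphase downsampling into two descriptions can be sketched as below. The abstract does not say which sampling pattern is used, so the even/odd column split, the function names, and the even block width are all assumptions made for illustration.

```python
def pd_split(block):
    """Split a 2-D block into two descriptions by column parity (assumed pattern)."""
    even = [row[0::2] for row in block]  # columns 0, 2, 4, ...
    odd = [row[1::2] for row in block]   # columns 1, 3, 5, ...
    return even, odd


def pd_merge(even, odd):
    """Interleave two descriptions back into the original block."""
    merged = []
    for even_row, odd_row in zip(even, odd):
        row = []
        for e, o in zip(even_row, odd_row):
            row.extend([e, o])
        merged.append(row)
    return merged
```

Because neighboring samples are highly correlated, a decoder that receives only one description can still estimate the missing samples from the ones it has.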
Myung-Suk BYEON Yil-Mi SHIN Yong-Beom CHO
This paper describes an efficient VLSI architecture for the UMHexagonS (hybrid Unsymmetrical cross Multi Hexagon grid Search) matching algorithm. This algorithm is used for ME (Motion Estimation) in the H.264/AVC video compression standard. UMHexagonS is called a hybrid algorithm because it uses several kinds of search patterns. The VLSI architecture based on UMHexagonS is designed to provide a good tradeoff between gate count and throughput. We implemented this architecture with about 309 K gates and 1/1792 throughput [block/cycle] for a search range of 16 and 4×4 macroblocks using synthesizable Verilog HDL.