1-8hit |
Dajiang ZHOU Jinjia ZHOU Jiayi ZHU Satoshi GOTO
In this paper, a highly parallel deblocking filter architecture for H.264/AVC is proposed to process one macroblock in 48 clock cycles and give real-time support to QFHD@60 fps sequences at less than 100 MHz. 4 edge filters organized in 2 groups for simultaneously processing vertical and horizontal edges are applied in this architecture to enhance its throughput. While parallelism increases, pipeline hazards arise owing to the latency of edge filters and data dependency of deblocking algorithm. To solve this problem, a zig-zag processing schedule is proposed to eliminate the pipeline bubbles. Data path of the architecture is then derived according to the processing schedule and optimized through data flow merging, so as to minimize the cost of logic and internal buffer. Meanwhile, the architecture's data input rate is designed to be identical to its throughput, while the transmission order of input data can also match the zig-zag processing schedule. Therefore no intercommunication buffer is required between the deblocking filter and its previous component for speed matching or data reordering. As a result, only one 2464 two-port SRAM as internal buffer is required in this design. When synthesized with SMIC 130 nm process, the architecture costs a gate count of 30.2 k, which is competitive considering its high performance.
Shuping ZHANG Jinjia ZHOU Dajiang ZHOU Shinji KIMURA Satoshi GOTO
In this paper, a hamburger architecture with a 3D stacked reconfigurable memory is proposed for a 4K motion estimation (ME) processor. By positioning the memory dies on both the top and bottom sides of the processor die, the proposed hamburger architecture can reduce the usage of the signal through-silicon via (TSV), and balance the power delivery network and the clock tree of the entire system. It results in 1/3 reduction of the usage of signal TSVs. Moreover, a stacked reconfigurable memory architecture is proposed to reduce the fabrication complexity and further reduce the number of signal TSVs by more than 1/2. The reduction of signal TSVs in the entire design is 71.24%. Finally, we address unique issues that occur in electronic design automation (EDA) tools during 3D large-scale integration (LSI) designs. As a result, a 4K ME processor with 7-die stacking 3D system-on-chip design is implemented. The proposed design can support real time 3840 × 2160 @ 120 fps encoding at 130 MHz with less than 540 mW.
Jinjia ZHOU Dajiang ZHOU Gang HE Satoshi GOTO
In this paper, we present a cache based motion compensation (MC) architecture for Quad-HD H.264/AVC video decoder. With the significantly increased throughput requirement, VLSI design for MC is greatly challenged by the huge area cost and power consumption. Moreover, the long memory system latency leads to performance drop of the MC pipeline. To solve these problems, three optimization schemes are proposed in this work. Firstly, a high-performance interpolator based on Horizontal-Vertical Expansion and Luma-Chroma Parallelism (HVE-LCP) is proposed to efficiently increase the processing throughput to at least over 4 times as the previous designs. Secondly, an efficient cache memory organization scheme (4S×4) is adopted to improve the on-chip memory utilization, which contributes to memory area saving of 25% and memory power saving of 3949%. Finally, by employing a Split Task Queue (STQ) architecture, the cache system is capable of tolerating much longer latency of the memory system. Consequently, the cache idle time is saved by 90%, which contributes to reducing the overall processing time by 2440%. When implemented with SMIC 90 nm process, this design costs a logic gate count and on-chip memory of 108.8 k and 3.1 kB respectively. The proposed MC architecture can support real-time processing of 3840×2160@60 fps with less than 166 MHz.
Dajiang ZHOU Jinjia ZHOU Satoshi GOTO
In the latest video coding frameworks, efficiency of motion vector (MV) coding is becoming increasingly important because of the growing bit rate portion of motion information. However, neither the conventional median predictor, nor the newer schemes such as the minimum bit rate prediction scheme and the hybrid scheme, can effectively eliminate the local redundancy of motion vectors. In this paper, we present the prioritized reference decision scheme for efficient motion vector coding, based on the H.264/AVC framework. This scheme makes use of a boolean indicator to specify whether the median predictor is to be used for the current MV or not. If not, the median prediction is considered not suitable for the current MV, and this information is used for refining the possible space of a group of reference MVs including 4 neighboring MVs and the zero MV. This group of MVs is organized to be a prioritized list so that the reference MV with highest priority is to be selected as the prediction value. Furthermore, the boolean indicators are coded into the modified code words of mb_type and sub_mb_type, so as to reduce the overhead. By applying the proposed scheme, the structure and the applicability problems with the state-of-the-art MBP scheme have been overcome. Experimental result shows that the proposed scheme achieves a considerable reduction of bits for MVDs, compared with the conventional median prediction algorithm. It also achieves a better and much stabler performance than MBP-based MV coding.
Muchen LI Jinjia ZHOU Dajiang ZHOU Xiao PENG Satoshi GOTO
As the successive video compression standard of H.264/AVC, High Efficiency Video Codec (HEVC) will play an important role in video coding area. In the deblocking filter part, HEVC inherits the basic property of H.264/AVC and gives some new features. Based on this variation, this paper introduces a novel dual-mode deblocking filter architecture which could support both of the HEVC and H.264/AVC standards. For HEVC standard, the proposed symmetric unified-cross unit (SUCU) based filtering scheme greatly reduces the design complexity. As a result, processing a 1616 block needs 24 clock cycles. For H.264/AVC standard, it takes 48 clock cycles for a 1616 macro-block (MB). In synthesis result, the proposed architecture occupies 41.6k equivalent gate count at frequency of 200 MHz in SMIC 65 nm library, which could satisfy the throughput requirement of super hi-vision (SHV) on 60 fps. With filter reusing scheme, the universal design for the two standards saves 30% gate counts than the dedicated ones in filter part. In addition, the total power consumption could be reduced by 57.2% with skipping mode when the edges need not be filtered.
Jinjia ZHOU Dajiang ZHOU Xun HE Satoshi GOTO
In this paper, VLSI architecture of a joint parameter decoder is proposed to realize the calculation of motion vector (MV), intra prediction mode (IPM) and boundary strength (BS) for ultra high definition H.264/AVC applications. For this architecture, a 64-cycle-per-MB pipeline with simplified control modes is designed to increase system throughput and reduce hardware cost. Moreover, in order to save memory bandwidth, the data which includes the motion information for the co-located picture and the last decoded line, is pre-processed before being stored to DRAM. A partition based storage format is applied to condense the MB level data, while variable length coding based compression method is utilized to reduce the data size in each partition. Experimental results show our design is capable of real-time 38402160@60 fps decoding at less than 133 MHz, with 37.2 k logic gates. Meanwhile, by applying the proposed scheme, 85-98% bandwidth saving is achieved, compared with storing the original information for every 44 block to DRAM.
Gang HE Dajiang ZHOU Jinjia ZHOU Tianruo ZHANG Satoshi GOTO
Intra coding in H.264/AVC significantly enhances video compression efficiency. However, due to the high data dependency of intra prediction in H.264, both pipelining and parallel processing techniques are limited to be applied. Moreover, it is difficult to get high hardware utilization and throughput because of the long block/MB-level reconstruction loops. This paper proposes a high-performance intra prediction architecture that can support H.264/AVC high profile. The proposed MB/block co-reordering can avoid data dependency and improve pipeline utilization. Therefore, the timing constraint of real-time 40962160 encoding can be achieved with negligible quality loss. 1616 prediction engine and 88 prediction engine work parallel for prediction and coefficients generating. A reordering interlaced reconstruction is also designed for fully pipelined architecture. It takes only 160 cycles to process one macroblock (MB). Hardware utilization of prediction and reconstruction modules is almost 100%. Furthermore, PE-reusable 88 intra predictor and hybrid SAD & SATD mode decision are proposed to save hardware cost. The design is implemented by 90 nm CMOS technology with 113.2 k gates and can encode 40962160 video sequences at 60 fps with operation frequency of 332 MHz.
Shuping ZHANG Jinjia ZHOU Dajiang ZHOU Shinji KIMURA Satoshi GOTO
Motion estimation (ME) is a key encoding component of almost all modern video coding standards. ME contributes significantly to video coding efficiency, but, it also consumes the most power of any component in a video encoder. In this paper, an ME processor with 3D stacked memory architecture is proposed to reduce memory and core power consumption. First, a memory die is designed and stacked with ME die. By adding face-to-face (F2F) pads and through-silicon-via (TSV) definitions, 2D electronic design automation (EDA) tools can be extended to support the proposed 3D stacking architecture. Moreover, a special memory controller is applied to control data transmission and timing between the memory die and the ME processor die. Finally, a 3D physical design is completed for the entire system. This design includes TSV/F2F placement, floor plan optimization, and power network generation. Compared to 2D technology, the number of input/output (IO) pins is reduced by 77%. After optimizing the floor plan of the processor die and memory die, the routing wire lengths are reduced by 13.4% and 50%, respectively. The stacking static random access memory contributes the most power reduction in this work. The simulation results show that the design can support real-time 720p @ 60fps encoding at 8MHz using less than 65mW in power, which is much better compared to the state-of-the-art ME processor.