
Author Search Result

[Author] Zhenyu LIU(21hit)

1-20hit(21hit)

  • A Mode Mapping and Optimized MV Conjunction Based H.264/SVC to H.264/AVC Transcoder with Medium-Grain Quality Scalability for Videoconferencing

    Lei SUN  Zhenyu LIU  Takeshi IKENAGA  

     
    PAPER

      Vol:
    E97-A No:2
      Page(s):
    501-509

    Scalable Video Coding (SVC) is an extension of H.264/AVC, aiming to provide the ability to adapt to heterogeneous networks or requirements. It offers great flexibility for bitstream adaptation in multi-point applications such as videoconferencing. However, transcoding between SVC and AVC is necessary due to the existence of legacy AVC-based systems. The straightforward re-encoding method incurs a high computational cost, and delay-sensitive applications like videoconferencing require a much faster transcoding scheme. This paper proposes a 3-stage fast SVC-to-AVC transcoder with medium-grain quality scalability (MGS) for videoconferencing applications. A hierarchical-P structured SVC bitstream is transcoded into an IPPP structured AVC bitstream with multiple reference frames. In the first stage, mode decision is accelerated by the proposed SVC-to-AVC mode mapping scheme. In the second stage, INTER motion estimation is accelerated by an optimized motion vector (MV) conjunction method that predicts the MV with a reduced search range. In the last stage, Hadamard-based all-zero block (AZB) detection is used for early termination. Simulation results show that the proposed transcoder achieves coding efficiency very close to the optimal result while saving 89.6% of the computational time on average.

  • A VLSI Architecture for Variable Block Size Motion Estimation in H.264/AVC with Low Cost Memory Organization

    Yang SONG  Zhenyu LIU  Takeshi IKENAGA  Satoshi GOTO  

     
    PAPER-VLSI Architecture

      Vol:
    E89-A No:12
      Page(s):
    3594-3601

    A one-dimensional (1-D) full search variable block size motion estimation (VBSME) architecture is presented in this paper. By properly choosing the partial sum of absolute differences (SAD) registers and scheduling the addition operations, the architecture can be implemented with simple control logic and regular workflow. Moreover, only one single-port SRAM is used to store the search area data. The design is realized in TSMC 0.18 µm 1P6M technology with a hardware cost of 67.6K gates. In typical working conditions (1.8 V, 25 °C), a clock frequency of 266 MHz can be achieved.

  • Content-Aware Write Reduction Mechanism of 3D Stacked Phase-Change RAM Based Frame Store in H.264 Video Codec System

    Sanchuan GUO  Zhenyu LIU  Guohong LI  Takeshi IKENAGA  Dongsheng WANG  

     
    PAPER

      Vol:
    E96-A No:6
      Page(s):
    1273-1282

    An H.264 video codec system requires a large-capacity, high-bandwidth Frame Store (FS) for buffering reference frames. State-of-the-art three-dimensional (3D) stacked Phase-change Random Access Memory (PRAM) is a promising approach for on-chip caching of the reference signals, as 3D stacking offers high memory bandwidth, while PRAM offers high density and low leakage power. However, the write endurance problem, that is, a PRAM cell can tolerate only a limited number of write operations, is the main barrier in practical applications. This paper studies wear reduction techniques for a PRAM-based FS in an H.264 codec system. On the basis of rate-distortion theory, content-oriented selective writing mechanisms are proposed to reduce bit updates in the reference frame buffers. With the proposed control parameter a, our methods make a quantitative trade-off between quality degradation and PRAM lifetime prolongation. Specifically, taking a in the range of [0.2,2], experimental results demonstrate that our methods save 29.9–35.5% of bit-wise write operations on average and reduce power by 52–57%, at the cost of a 12.95–20.57% BDBR bit-rate increase.

  • A Drift-Constrained Frequency-Domain Ultra-Low-Delay H.264/SVC to H.264/AVC Transcoder with Medium-Grain Quality Scalability for Videoconferencing

    Lei SUN  Zhenyu LIU  Takeshi IKENAGA  

     
    PAPER

      Vol:
    E96-A No:6
      Page(s):
    1253-1263

    Scalable Video Coding (SVC) is an extension of H.264/AVC, aiming to provide the ability to adapt to heterogeneous networks or requirements. It offers great flexibility for bitstream adaptation in multi-point applications such as videoconferencing. However, transcoding between SVC and AVC is necessary due to the existence of legacy AVC-based systems. The straightforward re-encoding method incurs a high computational cost, and delay-sensitive applications like videoconferencing require a much faster transcoding scheme. This paper proposes an ultra-low-delay SVC-to-AVC MGS (Medium-Grain quality Scalability) transcoder for videoconferencing applications. Transcoding is performed purely in the frequency domain with partial decoding/encoding in order to achieve significant speed-up. Three fast frequency-domain transcoding methods are proposed for macroblocks with different coding modes in non-KEY pictures. KEY pictures are transcoded by reusing the base layer motion data, and error propagation is constrained between KEY pictures. Simulation results show that the proposed transcoder achieves an average 38.5 times speed-up compared with the re-encoding method, while introducing merely 0.71 dB BDPSNR coding quality loss for videoconferencing sequences.

  • Scalable VLSI Architecture for Variable Block Size Integer Motion Estimation in H.264/AVC

    Yang SONG  Zhenyu LIU  Satoshi GOTO  Takeshi IKENAGA  

     
    PAPER

      Vol:
    E89-A No:4
      Page(s):
    979-988

    Because of the data correlation in the motion estimation (ME) algorithm of H.264/AVC reference software, it is difficult to implement an efficient ME hardware architecture. In order to make parallel processing feasible, four modified hardware-friendly ME workflows are proposed in this paper. Based on these workflows, a scalable full search ME architecture is presented, which has the following characteristics: (1) The sum of absolute differences (SAD) results of 4×4 sub-blocks are accumulated and reused to calculate SADs of bigger sub-blocks. (2) The number of PE groups is configurable. For a search range of M×N pixels, where M is width and N is height, up to M PE groups can be configured to work in parallel with a peak processing speed of N×16 clock cycles to fulfill a full search variable block size ME (VBSME). (3) Only conventional single-port SRAM is required, which makes this architecture suitable for standard-cell-based implementation. A design with 8 PE groups has been realized with TSMC 0.18 µm CMOS technology. The core area is 2.13 mm × 1.60 mm and the clock frequency is 228 MHz in typical conditions (1.8 V, 25 °C).
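
    The SAD reuse in characteristic (1) can be illustrated in software. The following is a minimal sketch, not the paper's hardware design: it computes the sixteen 4×4 SADs of one 16×16 macroblock for a single search position and then merges them into the SADs of the larger H.264 partitions. The function names and the dictionary layout are assumptions made for illustration.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid holding the SAD of each 4x4 sub-block of a 16x16 macroblock.

    cur and ref are 16x16 lists of integer pixel values.
    """
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def merge_sads(s4):
    """Build SADs of larger partitions by adding 4x4 SADs (no pixel is revisited).

    Keys are written as width x height; grids are indexed [block_row][block_col].
    """
    s8x4 = [[s4[r][2 * c] + s4[r][2 * c + 1] for c in range(2)] for r in range(4)]
    s4x8 = [[s4[2 * r][c] + s4[2 * r + 1][c] for c in range(4)] for r in range(2)]
    s8x8 = [[s8x4[2 * r][c] + s8x4[2 * r + 1][c] for c in range(2)] for r in range(2)]
    s16x8 = [s8x8[r][0] + s8x8[r][1] for r in range(2)]   # top half, bottom half
    s8x16 = [s8x8[0][c] + s8x8[1][c] for c in range(2)]   # left half, right half
    s16x16 = s16x8[0] + s16x8[1]
    return {"4x4": s4, "8x4": s8x4, "4x8": s4x8, "8x8": s8x8,
            "16x8": s16x8, "8x16": s8x16, "16x16": s16x16}
```

    Merging in this way yields 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 SAD values per search position from a single pass over the pixels.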

  • Content-Aware Fast Motion Estimation for H.264/AVC

    Zhenyu LIU  Satoshi GOTO  Takeshi IKENAGA  

     
    PAPER

      Vol:
    E91-A No:8
      Page(s):
    1944-1952

    The key to high performance in video coding lies in efficiently reducing the temporal redundancies. For this purpose, the H.264/AVC coding standard has adopted variable block size motion estimation on multiple reference frames to improve the coding gain. However, the computational complexity of motion estimation also increases in proportion to the product of the number of reference frames and the number of inter modes. The mathematical analysis in this paper reveals that the prediction errors mainly depend on the image edge gradient amplitude and the quantization parameter. Consequently, this paper proposes an image content based early termination algorithm, which outperforms the original method adopted by the JVT reference software, especially at high and moderate bit rates. In light of rate-distortion theory, this paper also relates the homogeneity of the image to the quantization parameter. For a homogeneous block, the search computation for futile reference frames and inter modes can be efficiently discarded. Therefore, the computation saving increases with the value of the quantization parameter. These content based fast algorithms were integrated with the Unsymmetrical-cross Multihexagon-grid Search (UMHexagonS) algorithm to demonstrate their performance. Compared to the original UMHexagonS fast matching algorithm, 26.14-54.97% of the search time can be saved with an average coding quality degradation of 0.0369 dB.

  • Improving Cache Partitioning Algorithms for Pseudo-LRU Policies

    Xi ZHANG  Chuanyi LIU  Zhenyu LIU  Dongsheng WANG  

     
    PAPER

      Vol:
    E96-D No:12
      Page(s):
    2514-2523

    As the number of concurrently running applications on chip multiprocessors (CMPs) increases, efficient management of the shared last-level cache (LLC) is crucial to guarantee overall performance. Recent studies have shown that cache partitioning can provide benefits in throughput, fairness and quality of service. Most prior work applies true Least Recently Used (LRU) as the underlying cache replacement policy and relies on its stack property to work properly. However, in commodity processors, pseudo-LRU policies without the stack property are commonly used instead of LRU for their simplicity and low storage overhead. Therefore, this study sets out to understand whether LRU-based cache partitioning techniques can be applied to commodity processors. In this work, we instead propose a cache partitioning mechanism for two popular pseudo-LRU policies: Not Recently Used (NRU) and Binary Tree (BT). Without the help of true LRU's stack property, we propose a profiling logic that applies curve approximation methods to derive the hit curve (hit counts under varied way allocations) for an application. We then propose a hybrid partitioning mechanism, which mitigates the gap between the predicted hit curve and the actual statistics. Simulation results demonstrate that our proposal can improve throughput by 15.3% on average and outperforms the stack-estimate proposal by 12.6% on average. Similar results can be achieved in weighted speedup. For the cache configurations under study, it requires less than 0.5% storage overhead compared to the last-level cache. In addition, we also show that a profiling mechanism with only one true LRU ATD achieves comparable performance and can further reduce the hardware cost by nearly two thirds compared with the hybrid mechanism.
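
    For readers unfamiliar with the NRU policy named above, the sketch below models one cache set under one common formulation of NRU: a single "used" bit per way, cleared for all other ways once every bit is set. It only illustrates why NRU keeps no full recency ordering (and hence has no stack property); it is not the partitioning or profiling mechanism of the paper, and the class name `NRUSet` is an assumption.

```python
class NRUSet:
    """One cache set managed by a simple Not Recently Used (NRU) policy."""

    def __init__(self, ways):
        if ways < 2:
            raise ValueError("this sketch needs at least two ways")
        self.tags = [None] * ways     # tag stored in each way (None = empty)
        self.used = [False] * ways    # one 'recently used' bit per way

    def _touch(self, way):
        # Mark the way as used; if every bit is now set, clear all the others
        # so that at least one replacement candidate always remains.
        self.used[way] = True
        if all(self.used):
            self.used = [False] * len(self.used)
            self.used[way] = True

    def access(self, tag):
        """Access a tag; return True on a hit and False on a miss."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        victim = self.used.index(False)   # first way not recently used
        self.tags[victim] = tag
        self._touch(victim)
        return False
```

    Because only one bit per way is kept, the policy cannot tell which of several unused lines is the oldest, which is why techniques that rely on the LRU stack distance do not carry over directly.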

  • Communication-Efficient Federated Indoor Localization with Layerwise Swapping Training-FedAvg

    Jinjie LIANG  Zhenyu LIU  Zhiheng ZHOU  Yan XU  

     
    PAPER-Mobile Information Network and Personal Communications

      Publicized:
    2022/05/11
      Vol:
    E105-A No:11
      Page(s):
    1493-1502

    Federated learning is a promising strategy for indoor localization that can reduce the labor cost of constructing a fingerprint dataset in a distributed training manner without privacy disclosure. However, the traffic generated during the whole training process of federated learning is a burden on the up-and-down link, which leads to huge energy consumption for mobile devices. Moreover, the non-independent and identically distributed (Non-IID) problem impairs the global localization performance during the federated learning. This paper proposes a communication-efficient FedAvg method for federated indoor localization which is improved by the layerwise asynchronous aggregation strategy and layerwise swapping training strategy. Energy efficiency can be improved by performing asynchronous aggregation between the model layers to reduce the traffic cost in the training process. Moreover, the impact of the Non-IID problem on the localization performance can be mitigated by performing swapping training on the deep layers. Extensive experimental results show that the proposed methods reduce communication traffic and improve energy efficiency significantly while mitigating the impact of the Non-IID problem on the precision of localization.
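
    As background for the FedAvg method that this paper builds on, the following minimal sketch shows plain FedAvg aggregation: the server averages client parameters weighted by each client's local sample count. The optional `layers` argument, which restricts a round to a subset of layers, is a hypothetical illustration of aggregating layer by layer; the abstract does not specify the actual layerwise asynchronous schedule or the swapping training procedure, so neither is modeled here.

```python
def fedavg(client_weights, client_sizes, layers=None):
    """Sample-count-weighted average of client model parameters.

    client_weights: list of dicts mapping layer name -> list of floats.
    client_sizes:   number of local training samples of each client.
    layers:         optional iterable of layer names to aggregate
                    (hypothetical knob; defaults to all layers).
    """
    total = sum(client_sizes)
    names = list(client_weights[0]) if layers is None else list(layers)
    averaged = {}
    for name in names:
        width = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(width)
        ]
    return averaged
```

    Aggregating only some layers in a round means only those layers need to be uploaded, which is the general intuition behind reducing traffic by working at layer granularity.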

  • Parallel Improved HDTV720p Targeted Propagate Partial SAD Architecture for Variable Block Size Motion Estimation in H.264/AVC

    Yiqing HUANG  Zhenyu LIU  Yang SONG  Satoshi GOTO  Takeshi IKENAGA  

     
    PAPER

      Vol:
    E91-A No:4
      Page(s):
    987-997

    A hardware-efficient, high-speed architecture for variable block size motion estimation (VBSME) in H.264 is presented in this paper. By improving the pipeline structure and processing element (PE) circuits, the system latency and hardware cost are reduced, which makes this structure more hardware efficient than the original Propagate Partial SAD architecture. For coding small and medium frame size pictures, the proposed structure can save 12.1% hardware cost compared with the original Propagate Partial SAD structure. In the case of HDTV, since small inter modes contribute only trivially to the coding quality, we remove modes below 8×8 in our design. By adopting this mode reduction technique, when the number of PE array sets is less than 8, the proposed mode reduction based Propagate Partial SAD structure can work at a faster clock speed and with less hardware cost than the widely used SAD Tree architecture. It is more robust to high-speed timing constraints when parallel processing is considered. With TSMC 0.18 µm technology in worst-case working conditions (1.62 V, 125 °C), the peak throughput of the 8-set PE array structure is 720p@30 Hz with a 128×64 search range and 5 reference frames. Our design reduces the hardware cost by 12 k gates compared with the parallel SAD Tree architecture.

  • Architecture and Circuit Optimization of Hardwired Integer Motion Estimation Engine for H.264/AVC

    Zhenyu LIU  Dongsheng WANG  Takeshi IKENAGA  

     
    PAPER-Image Processing

      Vol:
    E93-A No:11
      Page(s):
    2065-2073

    Variable block size motion estimation, adopted by the latest video coding standard H.264/AVC, is an efficient approach to reducing temporal redundancies. The intensive computational complexity of the variable block size technique makes a hardwired accelerator essential for real-time applications. Propagate partial sums of absolute differences (Propagate Partial SAD) and SAD Tree hardwired engines outperform their counterparts, especially considering the impact of supporting the variable block size technique. In this paper, the authors apply architecture-level and circuit-level approaches to improve the maximum operating frequency and reduce the hardware overhead of Propagate Partial SAD and SAD Tree, while the other metrics of the original architectures (latency, memory bandwidth and hardware utilization) are maintained. Experiments demonstrate that, with the proposed approaches at a 110.8 MHz operating frequency, 14.7% and 18.0% of the gate count can be saved for Propagate Partial SAD and SAD Tree, respectively, compared with the original architectures. With TSMC 0.18 µm 1P6M CMOS technology, the proposed Propagate Partial SAD architecture achieves a 231.6 MHz operating frequency at a cost of 84.1 k gates. Correspondingly, the maximum operating frequency of the optimized SAD Tree architecture is improved to 204.8 MHz, almost twice that of the original, while its hardware overhead is merely 88.5 k gates.

  • Fast Mode and Depth Decision for HEVC Intra Prediction Based on Edge Detection and Partition Reconfiguration

    Gaoxing CHEN  Lei SUN  Zhenyu LIU  Takeshi IKENAGA  

     
    PAPER

      Vol:
    E97-A No:11
      Page(s):
    2130-2138

    High efficiency video coding (HEVC) is a video compression standard that outperforms the predecessor H.264/AVC by doubling the compression efficiency. To enhance the intra prediction accuracy, 35 intra prediction modes were used in the prediction units (PUs), with partition sizes ranging from 4 × 4 to 64 × 64 in HEVC. However, the manifold prediction modes dramatically increase the encoding complexity. This paper proposes a fast mode- and depth-decision algorithm based on edge detection and reconfiguration to alleviate the large computational complexity in intra prediction with trivial degradation in accuracy. For mode decision, we propose pixel gradient statistics (PGS) and mode refinement (MR). PGS uses pixel gradient information to assist in selecting the prediction mode after rough mode decision (RMD). MR uses the neighboring mode information to select the best PU mode (BPM). For depth decision, we propose a partition reconfiguration algorithm to replace the original partitioning order with a more reasonable structure, by using the smoothness of the coding unit as a criterion in deciding the prediction depth. Smoothness detection is based on the PGS result. Experiment results show that the proposed method saves about 41.50% of the original processing time with little degradation (BD bitrate increased by 0.66% and BDPSNR decreased by 0.060dB) in the coding gain.

  • Novel Channel Estimation Method Based on Training Sequence Cyclic Reconstruction for TDS-OFDM System

    Zhenyu LIU  Fang YANG  Jian SONG  

     
    LETTER-Wireless Communication Technologies

      Vol:
    E94-B No:7
      Page(s):
    2158-2160

    In this paper, a novel channel estimation method for time domain synchronous orthogonal frequency division multiplexing (TDS-OFDM) based on training sequence cyclic reconstruction is proposed to eliminate residual inter-block interference (IBI); it estimates the channel impulse response (CIR) in an iterative manner. Simulation and analysis show that the proposed method can effectively perform channel estimation over long-delay multipath channels with low complexity.

  • A VLSI Array Processing Oriented Fast Fourier Transform Algorithm and Hardware Implementation

    Zhenyu LIU  Yang SONG  Takeshi IKENAGA  Satoshi GOTO  

     
    PAPER-VLSI Architecture

      Vol:
    E88-A No:12
      Page(s):
    3523-3530

    Many parallel Fast Fourier Transform (FFT) algorithms adopt a multi-stage architecture to increase performance. However, data permutation between stages consumes considerable memory and processing time. An FFT array processing mapping algorithm is proposed in this paper to overcome this drawback. In this algorithm, any $2^k$ butterfly units (BUs) can be scheduled to work in parallel on $n = 2^s$ data ($k = 0, 1, \ldots, s-1$). Because no inter-stage data transfer is required, memory consumption and system latency are both greatly reduced. Moreover, as the number of BUs increases, not only does throughput increase linearly, but system latency also decreases linearly. This array processing oriented architecture provides a flexible tradeoff between hardware cost and system performance. In theory, the system latency is $(s \times 2^{s-k}) \times t_{clk}$ and the throughput is $n / (s \times 2^{s-k} \times t_{clk})$, where $t_{clk}$ is the system clock period. Based on this mapping algorithm, several 18-bit word-length 1024-point FFT processors implemented with TSMC 0.18 µm CMOS technology are given to demonstrate its scalability and high performance. The core area of the 4-BU design is 2.991 × 1.121 mm² and the clock frequency is 326 MHz in typical conditions (1.8 V, 25 °C). This processor completes a 1024-point FFT calculation in 7.839 µs.
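
    The latency and throughput expressions can be checked numerically. The sketch below assumes the latency expression reads as s × 2^(s−k) clock cycles for 2^k butterfly units working on n = 2^s points; the function names are illustrative and not taken from the paper.

```python
def fft_latency_cycles(s, k):
    """Clock cycles for an n = 2**s point FFT on 2**k butterfly units."""
    if not 0 <= k <= s - 1:
        raise ValueError("k must satisfy 0 <= k <= s - 1")
    return s * 2 ** (s - k)


def fft_latency_seconds(s, k, f_clk_hz):
    """Latency in seconds, with clock period t_clk = 1 / f_clk_hz."""
    return fft_latency_cycles(s, k) / f_clk_hz


def fft_throughput(s, k, f_clk_hz):
    """Throughput in samples per second: n / latency."""
    return (2 ** s) / fft_latency_seconds(s, k, f_clk_hz)


if __name__ == "__main__":
    # 1024-point FFT (s = 10) on 4 butterfly units (k = 2) at 326 MHz.
    print(fft_latency_cycles(10, 2))
    print(fft_latency_seconds(10, 2, 326e6))
```

    Under this reading, the 4-BU 1024-point design needs 2560 cycles, which at 326 MHz is roughly 7.85 µs and is in line with the 7.839 µs quoted in the abstract. Doubling the number of BUs (k + 1) halves the cycle count, matching the stated linear scaling.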

  • Bayesian Theory Based Adaptive Proximity Data Accessing for CMP Caches

    Guohong LI  Zhenyu LIU  Sanchuan GUO  Dongsheng WANG  

     
    PAPER

      Vol:
    E96-A No:6
      Page(s):
    1293-1305

    As the number of cores and the working sets of parallel workloads increase, shared L2 caches exhibit fewer misses than private L2 caches by making a better use of the total available cache capacity, but they induce higher overall L1 miss latencies because of the longer average distance between the requestor and the home node, and the potential congestions at certain nodes. We observed that there is a high probability that the target data of an L1 miss resides in the L1 cache of a neighbor node. In such cases, these long-distance accesses to the home nodes can be potentially avoided. In order to leverage the aforementioned property, we propose Bayesian Theory based Adaptive Proximity Data Accessing (APDA). In our proposal, we organize the multi-core into clusters of 2x2 nodes, and introduce the Proximity Data Prober (PDP) to detect whether an L1 miss can be served by one of the cluster L1 caches. Furthermore, we devise the Bayesian Decision Classifier (BDC) to adaptively select the remote L2 cache or the neighboring L1 node as the server according to the minimum miss cost. We evaluate this approach on a 64-node multi-core using SPLASH-2 and PARSEC benchmarks, and we find that the APDA can reduce the execution time by 20% and reduce the energy by 14% compared to a standard multi-core with a shared L2. The experimental results demonstrate that our proposal outperforms the up-to-date mechanisms, such as ASR, DCC and RNUCA.

  • Low-Complexity Hybrid-Domain H.264/SVC to H.264/AVC Spatial Transcoding with Drift Compensation for Videoconferencing

    Lei SUN  Zhenyu LIU  Takeshi IKENAGA  

     
    PAPER-Image Processing

      Vol:
    E96-A No:11
      Page(s):
    2142-2153

    As an extension of H.264/AVC, Scalable Video Coding (SVC) provides the ability to adapt to heterogeneous networks and user-end requirements, which offers great scalability in multi-point applications such as videoconferencing. However, transcoding between SVC and AVC becomes necessary due to the existence of legacy AVC-based systems. The straightforward full re-encoding method incurs a high computational cost, and fast SVC-to-AVC spatial transcoding techniques have not yet been thoroughly investigated. This paper proposes a low-complexity hybrid-domain SVC-to-AVC spatial transcoder with drift compensation, which provides even better coding efficiency than the full re-encoding method. The macroblocks (MBs) of the input SVC bitstream are divided into two types, one suited to pixel-domain processing and the other to transform-domain processing. In the pixel-domain transcoding, a fast re-encoding method is proposed based on mode mapping and motion vector (MV) refinement. In the transform-domain transcoding, the quantized transform coefficients together with other motion data are reused directly to avoid re-quantization loss. The drift problem caused by the proposed transcoder is solved by separate compensation techniques for I frames and P frames. Simulation results show that the proposed transcoder achieves an average 96.4% time reduction compared with the full re-encoding method, and outperforms the reference methods in coding efficiency.

  • Lossless VLSI Oriented Full Computation Reusing Algorithm for H.264/AVC Fractional Motion Estimation

    Ming SHAO  Zhenyu LIU  Satoshi GOTO  Takeshi IKENAGA  

     
    PAPER

      Vol:
    E90-A No:4
      Page(s):
    756-763

    Fractional Motion Estimation (FME) is an advanced feature adopted in H.264/AVC video compression standard with quarter-pixel accuracy. Although FME could gain considerably higher encoding efficiency, sub-pixel interpolation and sum of absolute transformed difference (SATD) computation, as main parts of FME, increase the computation complexity a lot. To reduce the complexity of FME, this paper proposes a full computation reusable VLSI oriented algorithm. Through exploiting the similarity among motion vectors (MVs) of partitions in the same macroblock (MB), temporary computation results can be fully reused. Furthermore, a simple and effective searching method is adopted to make the proposed method more suitable for VLSI implementation. Experiment results show that up to 80% add operations and 85% internal reference frame memory access operations are saved without any degradation in the coding quality.

  • A Fine-Grain Scalable and Low Memory Cost Variable Block Size Motion Estimation Architecture for H.264/AVC

    Zhenyu LIU  Yang SONG  Takeshi IKENAGA  Satoshi GOTO  

     
    PAPER-Integrated Electronics

      Vol:
    E89-C No:12
      Page(s):
    1928-1936

    A full search variable block size motion estimation (VBSME) architecture with integer pixel accuracy is proposed in this paper. The proposed architecture has the following features: (1) By widening the data path from the search area memories, m processing element groups (PEGs) can be scheduled to work in parallel and be fully utilized, where m is a factor of sixteen. Each PEG has sixteen processing elements (PEs) and costs just 8.5K gates. This feature gives users more flexibility to trade off hardware cost against performance. (2) Based on pipelining and multi-cycle data path techniques, this architecture can work at a high clock frequency. (3) The number of memory partitions is greatly reduced. When sixteen PEGs are adopted, only two memory partitions are required for the search area data storage. Therefore, both the system hardware cost and power consumption can be reduced. A 16-PEG design with a 48×32 search range has been implemented with TSMC 0.18 µm CMOS technology. In typical working conditions, its maximum clock frequency is 261 MHz. Compared with the previous 2-D architecture [9], about 13.4% of the hardware cost and 5.7% of the power consumption can be saved.

  • Lossy Strict Multilevel Successive Elimination Algorithm for Fast Motion Estimation

    Yang SONG  Zhenyu LIU  Takeshi IKENAGA  Satoshi GOTO  

     
    PAPER

      Vol:
    E90-A No:4
      Page(s):
    764-770

    This paper presents a simple and effective method to further reduce the search points in multilevel successive elimination algorithm (MSEA). Because the calculated sea values of those best matching search points are much smaller than the current minimum SAD, we can simply increase the calculated sea values to increase the elimination ratio without much affecting the coding quality. Compared with the original MSEA algorithm, the proposed strict MSEA algorithm (SMSEA) can provide average 6.52 times speedup. Compared with other lossy fast ME algorithms such as TSS and DS, the proposed SMSEA can maintain more stable image quality. In practice, the proposed technique can also be used in the fine granularity SEA (FGSEA) algorithm and the calculation process is almost the same.
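
    The elimination idea can be sketched with a single-level successive elimination test, which is simpler than the multilevel scheme the paper extends. The test uses the fact that the absolute difference of two block sums never exceeds their SAD, so a candidate whose bound already reaches the current minimum SAD can be skipped. The `scale` parameter is a hypothetical stand-in for "increasing the calculated SEA values": with `scale = 1.0` the search is lossless, while `scale > 1.0` eliminates more candidates at the risk of missing the true minimum. The paper's actual adjustment rule is not given in the abstract.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sea_search(cur, candidates, scale=1.0):
    """Single-level successive elimination over flattened candidate blocks.

    Returns (best_index, best_sad, skipped_count).
    """
    cur_sum = sum(cur)
    best_idx, best_sad, skipped = -1, float("inf"), 0
    for i, cand in enumerate(candidates):
        # |sum(cur) - sum(cand)| <= SAD(cur, cand); scale > 1 is the lossy part.
        bound = abs(cur_sum - sum(cand)) * scale
        if bound >= best_sad:
            skipped += 1          # cannot (or is assumed not to) beat the best
            continue
        d = sad(cur, cand)
        if d < best_sad:
            best_idx, best_sad = i, d
    return best_idx, best_sad, skipped
```

    In a real encoder the block sums would be precomputed for the whole search window, so the bound costs far less than a full SAD.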

  • Hardware Oriented Enhanced Category Determination Based on CTU Boundary Deblocking Strength Prediction for SAO in HEVC Encoder

    Gaoxing CHEN  Zhenyu PEI  Zhenyu LIU  Takeshi IKENAGA  

     
    PAPER-Digital Signal Processing

      Vol:
    E99-A No:4
      Page(s):
    788-797

    High efficiency video coding (HEVC) is a video compression standard that outperforms the predecessor H.264/AVC by doubling the compression efficiency. To enhance the coding accuracy, HEVC adopts sample adaptive offset (SAO), which reduces the distortion of reconstructed pixels using classification based non-linear filtering. In the traditional coding tree unit (CTU) grain based VLSI encoder implementation, during the pixel classification stage, SAO cannot use the raw samples in the boundary of the current CTU because these pixels have not been processed by deblocking filter (DF). This paper proposes a hardware-oriented category determination algorithm based on estimating the deblocking strengths on CTU boundaries and selectively adopting the promising samples in these areas during SAO classification. Compared with HEVC test mode (HM11.0), experimental results indicate that the proposed method achieves an average 0.13%, 0.14%, and 0.12% BD-bitrate reduction (equivalent to 0.0055dB, 0.0058dB, and 0.0097dB increases in PSNR) in CTU sizes of 64 × 64, 32 × 32, and 16 × 16, respectively.

  • Low-Power Partial Distortion Sorting Fast Motion Estimation Algorithms and VLSI Implementations

    Yang SONG  Zhenyu LIU  Takeshi IKENAGA  Satoshi GOTO  

     
    PAPER

      Vol:
    E90-D No:1
      Page(s):
    108-117

    This paper presents two hardware-friendly, low-power oriented fast motion estimation (ME) algorithms and their VLSI implementations. The basic idea of the proposed partial distortion sorting (PDS) algorithm is to disable the search points that have larger partial distortions during the ME process, and keep only those search points with smaller ones. To further reduce the computation overhead, a simplified local PDS (LPDS) algorithm is also presented. Experiments show that the PDS and LPDS algorithms can provide almost the same image quality as full search with only 36.7% of the computation complexity. The two proposed algorithms can be integrated into different FSBMA architectures to save power. In this paper, the 1-D inter ME architecture [12] is used as a detailed example. Under the worst-case working conditions (1.62 V, 125 °C) and a 166 MHz clock frequency, the PDS algorithm can reduce power consumption by 33.3% with 4.05 K gates of extra hardware cost, and the LPDS algorithm can reduce power consumption by 37.8% with 1.73 K gates of overhead.
