The search functionality is under construction.

Author Search Result

[Author] Yibo FAN(12hit)

1-12hit
  • Reconfigurable Variable Block Size Motion Estimation Architecture for Search Range Reduction Algorithm

    Yibo FAN  Takeshi IKENAGA  Satoshi GOTO  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    440-448

    Variable Block Size Motion Estimation (VBSME) costs a lot of computation during video coding. Search range reduction algorithm is widely used to reduce computational cost of motion estimation. Current VBSME designs are not suitable for this algorithm. This paper proposes a reconfigurable design of VBSME which can be efficiently used with search range reduction algorithm. While using proposed design, nm reference MBs form an MB array which can be processed in parallel. n and m can be configured according to the new search range shape calculated by algorithm. In this way, the parallelism of proposed design is very flexible and can be adapted to any search range shape. The hardware resource is also fully used while performing VBSME. There are two primary reconfigurable modules in this design: PEGA (PE Group Array) and SAD comparator. By using TSMC 0.18 µm standard cell library, the implementation results show that the hardware cost of design which uses 16 PEGs (PE Groups) is about 179 K Gates, the clock frequency is 167 MHz.

  • An 88/44 Adaptive Hadamard Transform Based FME VLSI Architecture for 4 K2 K H.264/AVC Encoder

    Yibo FAN  Jialiang LIU  Dexue ZHANG  Xiaoyang ZENG  Xinhua CHEN  

     
    PAPER

      Vol:
    E95-C No:4
      Page(s):
    447-455

    Fidelity Range Extension (FRExt) (i.e. High Profile) was added to the H.264/AVC recommendation in the second version. One of the features included in FRExt is the Adaptive Block-size Transform (ABT). In order to conform to the FRExt, a Fractional Motion Estimation (FME) architecture is proposed to support the 88/44 adaptive Hadamard Transform (88/44 AHT). The 88/44 AHT circuit contributes to higher throughput and encoding performance. In order to increase the utilization of SATD (Sum of Absolute Transformed Difference) Generator (SG) in unit time, the proposed architecture employs two 8-pel interpolators (IP) to time-share one SG. These two IPs can work in turn to provide the available data continuously to the SG, which increases the data throughput and significantly reduces the cycles that are needed to process one Macroblock. Furthermore, this architecture also exploits the linear feature of Hadamard Transform to generate the quarter-pel SATD. This method could help to shorten the long datapath in the second-step of two-iteration FME algorithm. Finally, experimental results show that this architecture could be used in the applications requiring different performances by adjusting the supported modes and operation frequency. It can support the real-time encoding of the seven-mode 4 K2 K@24 fps or six-mode 4 K2 K@30 fps video sequences.

  • Architecture and Physical Implementation of Reconfigurable Multi-Port Physical Unclonable Functions in 65 nm CMOS

    Pengjun WANG  Yuejun ZHANG  Jun HAN  Zhiyi YU  Yibo FAN  Zhang ZHANG  

     
    PAPER-Cryptography and Information Security

      Vol:
    E96-A No:5
      Page(s):
    963-970

    In modern cryptographic systems, physical unclonable functions (PUFs) are efficient mechanisms for many security applications, which extract intrinsic random physical variations to generate secret keys. The classical PUFs mainly exhibit static challenge-response behaviors and generate static keys, while many practical cryptographic systems need reconfigurable PUFs which allow dynamic keys derived from the same circuit. In this paper, the concept of reconfigurable multi-port PUFs (RM-PUFs) is proposed. RM-PUFs not only allow updating the keys without physically replacement, but also generate multiple keys from different ports in one clock cycle. A practical RM-PUFs construction is designed based on asynchronous clock and fabricated in TSMC low-power 65 nm CMOS process. The area of test chip is 1.1 mm2, and the maximum clock frequency is 0.8 GHz at 1.2 V. The average power consumption is 27.6 mW at 27. Finally, test results show that the RM-PUFs generate four reconfigurable 128-bit secret keys, and the keys are secure and reliable over a range of environmental variations such as supply voltage and temperature.

  • A High-Speed Design of Montgomery Multiplier

    Yibo FAN  Takeshi IKENAGA  Satoshi GOTO  

     
    PAPER

      Vol:
    E91-A No:4
      Page(s):
    971-977

    With the increase of key length used in public cryptographic algorithms such as RSA and ECC, the speed of Montgomery multiplication becomes a bottleneck. This paper proposes a high speed design of Montgomery multiplier. Firstly, a modified scalable high-radix Montgomery algorithm is proposed to reduce critical path. Secondly, a high-radix clock-saving dataflow is proposed to support high-radix operation and one clock cycle delay in dataflow. Finally, a hardware-reused architecture is proposed to reduce the hardware cost and a parallel radix-16 design of data path is proposed to accelerate the speed. By using HHNEC 0.25 µm standard cell library, the implementation results show that the total cost of Montgomery multiplier is 130 KGates, the clock frequency is 180 MHz and the throughput of 1024-bit RSA encryption is 352 kbps. This design is suitable to be used in high speed RSA or ECC encryption/decryption. As a scalable design, it supports any key-length encryption/decryption up to the size of on-chip memory.

  • Optimized 2-D SAD Tree Architecture of Integer Motion Estimation for H.264/AVC

    Yibo FAN  Xiaoyang ZENG  Satoshi GOTO  

     
    PAPER

      Vol:
    E94-C No:4
      Page(s):
    411-418

    Integer Motion Estimation (IME) costs much computation in H.264/AVC video encoder. 2-D SAD tree IME architecture provides very high performance for encoder, and it has been used by many video codec designs. This paper proposes an optimized hardware design of 2-D SAD tree IME. Firstly, a new hardware architecture is proposed to reduce on-chip memory size. Secondly, a new search pattern is proposed to fully use memory bandwidth and reduce external memory access. Thirdly, the data-path is redesigned, and the performance is greatly improved. In order to compare with other IME designs, an IME design support D1 size, 30 fps with search range [32, 32] is implemented. The hardware cost of this design includes 118 KGates and 8 Kb SRAM, the maximum clock frequency is 200 MHz. Compared to the original 2-D SAD tree IME, our design saves 87.5% on-chip memory, and achieves 3 times performance than original one. Our design provides a new way to design a low cost and high performance IME for H.264/AVC encoder.

  • A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM

    Yibo FAN  Leilei HUANG  Kewei CHEN  Xiaoyang ZENG  

     
    PAPER-Integrated Electronics

      Pubricized:
    2019/11/27
      Vol:
    E103-C No:5
      Page(s):
    263-273

    The neural network has been one of the most useful techniques in the area of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural networks (RNNs), has been widely implemented on CPUs and GPUs. However, those software implementations offer a poor parallelism while the existing hardware implementations lack in configurability. In order to make up for this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve the goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of structure; the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function is carefully optimized to balance the hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on XCZU6EG FPGA, which takes only 3K look-up table (LUT). Compared with the implementation on Intel Xeon E5-2620 CPU @ 2.10GHz, this work achieves about 90× speedup for small networks and 25× speed-up for large ones. The consumption of resources is also much less than that of the state-of-the-art works.

  • A 64 Cycles/MB, Luma-Chroma Parallelized H.264/AVC Deblocking Filter for 4 K2 K Applications

    Weiwei SHEN  Yibo FAN  Xiaoyang ZENG  

     
    PAPER

      Vol:
    E95-C No:4
      Page(s):
    441-446

    In this paper, a high-throughput debloking filter is presented for H.264/AVC standard, catering video applications with 4 K2 K (40962304) ultra-definition resolution. In order to strengthen the parallelism without simply increasing the area, we propose a luma-chroma parallel method. Meanwhile, this work reduces the number of processing cycles, the amount of external memory traffic and the working frequency, by using triple four-stage pipeline filters and a luma-chroma interlaced sequence. Furthermore, it eliminates most unnecessary off-chip memory bandwidth with a highly reusable memory scheme, and adopts a “slide window” buffer scheme. As a result, our design can support 4 K2 K at 30 fps applications at the working frequency of only 70.8 MHz.

  • CCTSS: The Combination of CNN and Transformer with Shared Sublayer for Detection and Classification

    Aorui GOU  Jingjing LIU  Xiaoxiang CHEN  Xiaoyang ZENG  Yibo FAN  

     
    PAPER-Image

      Pubricized:
    2023/07/06
      Vol:
    E107-A No:1
      Page(s):
    141-156

    Convolutional Neural Networks (CNNs) and Transformers have achieved remarkable performance in detection and classification tasks. Nevertheless, their feature extraction cannot consider both local and global information, so the detection and classification performance can be further improved. In addition, more and more deep learning networks are designed as more and more complex, and the amount of computation and storage space required is also significantly increased. This paper proposes a combination of CNN and transformer, and designs a local feature enhancement module and global context modeling module to enhance the cascade network. While the local feature enhancement module increases the range of feature extraction, the global context modeling is used to capture the feature maps' global information. To decrease the model complexity, a shared sublayer is designed to realize the sharing of weight parameters between the adjacent convolutional layers or cross convolutional layers, thereby reducing the number of convolutional weight parameters. Moreover, to effectively improve the detection performance of neural networks without increasing network parameters, the optimal transport assignment approach is proposed to resolve the problem of label assignment. The classification loss and regression loss are the summations of the cost between the demander and supplier. The experiment results demonstrate that the proposed Combination of CNN and Transformer with Shared Sublayer (CCTSS) performs better than the state-of-the-art methods in various datasets and applications.

  • An Unequal Secure Encryption Scheme for H.264/AVC Video Compression Standard

    Yibo FAN  Jidong WANG  Takeshi IKENAGA  Yukiyasu TSUNOO  Satoshi GOTO  

     
    PAPER-Symmetric Cryptography

      Vol:
    E91-A No:1
      Page(s):
    12-21

    H.264/AVC is the newest video coding standard. There are many new features in it which can be easily used for video encryption. In this paper, we propose a new scheme to do video encryption for H.264/AVC video compression standard. We define Unequal Secure Encryption (USE) as an approach that applies different encryption schemes (with different security strength) to different parts of compressed video data. This USE scheme includes two parts: video data classification and unequal secure video data encryption. Firstly, we classify the video data into two partitions: Important data partition and unimportant data partition. Important data partition has small size with high secure protection, while unimportant data partition has large size with low secure protection. Secondly, we use AES as a block cipher to encrypt the important data partition and use LEX as a stream cipher to encrypt the unimportant data partition. AES is the most widely used symmetric cryptography which can ensure high security. LEX is a new stream cipher which is based on AES and its computational cost is much lower than AES. In this way, our scheme can achieve both high security and low computational cost. Besides the USE scheme, we propose a low cost design of hybrid AES/LEX encryption module. Our experimental results show that the computational cost of the USE scheme is low (about 25% of naive encryption at Level 0 with VEA used). The hardware cost for hybrid AES/LEX module is 4678 Gates and the AES encryption throughput is about 50 Mbps.

  • A Micro-Code-Based IME Engine for HEVC and Its Hardware Implementation

    Leilei HUANG  Yibo FAN  Chenhao GU  Xiaoyang ZENG  

     
    PAPER-Integrated Electronics

      Vol:
    E102-C No:10
      Page(s):
    756-765

    High Efficiency Video Coding (HEVC) standard is now becoming one of the most widespread video coding standards in the world. As a successor of H.264 standard, it aims to provide a much superior encoding performance. To fulfill this goal, several new notations along with the corresponding computation processes are introduced by this standard. Among those computation processes, the integer motion estimation (IME) is one of bottlenecks due to the complex partitions of the inter prediction units (PU) and the large search window commonly adopted. Many algorithms have been proposed to address this issue and usually put emphasis on a large search window and great computation amount. However, the coding efforts should be related to the scenes. To be more specific, for relatively static videos, a small search window along with a simple search scheme should be adopted to reduce the time cost and power consumption. In view of this, a micro-code-based IME engine is proposed in this paper, which could be applied with search schemes of different complexity. To test the performance, three different search schemes based on this engine are designed and evaluated under HEVC test model (HM) 16.9, achieving a B-D rate increase of 0.55/-0.07/-0.14%. Compared with our previous work, the hardware implementation is optimized to reduce 64.2% of the SRAMs bits and 32.8% of the logic gate count. The final design could support 4K×2K @139/85/37fps videos @500MHz.

  • A Unified Forward/Inverse Transform Architecture for Multi-Standard Video Codec Design

    Sha SHEN  Weiwei SHEN  Yibo FAN  Xiaoyang ZENG  

     
    PAPER-Digital Signal Processing

      Vol:
    E96-A No:7
      Page(s):
    1534-1542

    This paper describes a unified VLSI architecture which can be applied to various types of transforms used in MPEG-2/4, H.264, VC-1, AVS and the emerging new video coding standard named HEVC (High Efficiency Video Coding). A novel design named configurable butterfly array (CBA) is also proposed to support both the forward transform and the inverse transform in this unified architecture. Hadamard transform or 4/8-point DCT/IDCT are used in traditional video coding standards while 16/32-point DCT/IDCT are newly introduced in HEVC. The proposed architecture can support all these transform types in a unified architecture. Two levels (architecture level and block level) of hardware sharing are adopted in this design. In the architecture level, the forward transform can share the hardware resource with the inverse transform. In the block level, the hardware for smaller size transform can be recursively reused by larger size transform. The multiplications of 4 or 8-point transform are implemented with Multiplierless MCM (Multiple Constant Multiplication). In order to reduce the hardware overhead, the multiplications of 16/32 point DCT are implemented with ICM (input-muxed constant multipliers) instead of MCM or regular multipliers. The proposed design is 51% more area efficient than previous work. To the author's knowledge, this is the first published work to support both forward and inverse 4/8/16/32-point integer transform for HEVC standard in a unified architecture.

  • A High-Throughput and Compact Hardware Implementation for the Reconstruction Loop in HEVC Intra Encoding

    Yibo FAN  Leilei HUANG  Zheng XIE  Xiaoyang ZENG  

     
    PAPER-Integrated Electronics

      Vol:
    E100-C No:6
      Page(s):
    643-654

    In the newly finalized video coding standard, namely high efficiency video coding (HEVC), new notations like coding unit (CU), prediction unit (PU) and transformation unit (TU) are introduced to improve the coding performance. As a result, the reconstruction loop in intra encoding is heavily burdened to choose the best partitions or modes for them. In order to solve the bottleneck problems in cycle and hardware cost, this paper proposed a high-throughput and compact implementation for such a reconstruction loop. By “high-throughput”, it refers to that it has a fixed throughput of 32 pixel/cycle independent of the TU/PU size (except for 4×4 TUs). By “compact”, it refers to that it fully explores the reusability between discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) as well as that between quantization (Q) and de-quantization (IQ). Besides the contributions made in designing related hardware, this paper also provides a universal formula to analyze the cycle cost of the reconstruction loop and proposed a parallel-process scheme to further reduce the cycle cost. This design is verified on the Stratix IV FPGA. The basic structure achieved a maximum frequency of 150MHz and a hardware cost of 64K ALUTs, which could support the real time TU/PU partition decision for 4K×2K@20fps videos.