Author Search Result

[Author] Lei HUANG(4hit)

1-4hit
  • A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM

    Yibo FAN  Leilei HUANG  Kewei CHEN  Xiaoyang ZENG  

     
    PAPER-Integrated Electronics

      Publicized:
    2019/11/27
      Vol:
    E103-C No:5
      Page(s):
    263-273

    Neural networks have been among the most useful techniques in speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural network (RNN), has been widely implemented on CPUs and GPUs. However, those software implementations offer poor parallelism, while the existing hardware implementations lack configurability. To fill this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve this goal, the workflow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the structural complexity; and the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function are carefully optimized to balance hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on an XCZU6EG FPGA, using only 3K look-up tables (LUTs). Compared with an implementation on an Intel Xeon E5-2620 CPU @ 2.10 GHz, this work achieves about 90× speedup for small networks and 25× speedup for large ones. Its resource consumption is also much lower than that of state-of-the-art works.
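The abstract does not say how the σ and tanh functions were optimized. Purely as an illustration of the cost/accuracy trade-off it mentions, the sketch below uses a PLAN-style piecewise-linear sigmoid, a common hardware-friendly choice whose slopes are powers of two, so multiplications reduce to shifts. This is an assumed technique, not the authors' design, and the function names are hypothetical.

```python
import math


def plan_sigmoid(x: float) -> float:
    """PLAN-style piecewise-linear approximation of the logistic sigmoid.

    Assumption: illustrative only; the paper's actual approximation is not
    described in the abstract.
    """
    a = abs(x)
    if a >= 5.0:
        y = 1.0
    elif a >= 2.375:
        y = 0.03125 * a + 0.84375  # slope 2**-5
    elif a >= 1.0:
        y = 0.125 * a + 0.625      # slope 2**-3
    else:
        y = 0.25 * a + 0.5         # slope 2**-2
    # Use the symmetry sigmoid(-x) = 1 - sigmoid(x) for negative inputs.
    return y if x >= 0 else 1.0 - y


def plan_tanh(x: float) -> float:
    """tanh built from the same unit via tanh(x) = 2*sigmoid(2x) - 1."""
    return 2.0 * plan_sigmoid(2.0 * x) - 1.0


def exact_sigmoid(x: float) -> float:
    """Reference floating-point sigmoid used to measure the error."""
    return 1.0 / (1.0 + math.exp(-x))


def max_abs_error(approx, exact, lo=-8.0, hi=8.0, steps=16001) -> float:
    """Largest absolute difference between approx and exact on a uniform grid."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        worst = max(worst, abs(approx(x) - exact(x)))
    return worst


if __name__ == "__main__":
    print("sigmoid max error:", max_abs_error(plan_sigmoid, exact_sigmoid))
    print("tanh max error:   ", max_abs_error(plan_tanh, math.tanh))
```

Because tanh reuses the sigmoid unit, its worst-case error is twice that of the sigmoid. That is the kind of trade-off between hardware cost and accuracy a design like this has to weigh.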

  • Computationally Efficient Method of Signal Subspace Fitting for Direction-of-Arrival Estimation

    Lei HUANG  Dazheng FENG  Linrang ZHANG  Shunjun WU  

     
    PAPER-Antennas and Propagation

      Vol:
    E88-B No:8
      Page(s):
    3408-3415

    Resolving coherent signals impinging upon a linear sensor array with low computational complexity is of interest in array signal processing. In this paper, a computationally efficient method of signal subspace fitting (SSF) for direction-of-arrival (DOA) estimation is developed, based on the multi-stage Wiener filter (MSWF). To find the new signal subspace, the proposed method only needs to compute the matched filters in the forward recursion of the MSWF and does not involve estimating an array covariance matrix or any eigendecomposition, which makes it computationally efficient. Numerical results show that the proposed method provides estimation accuracy comparable to that of the classical weighted subspace fitting (WSF) method for uncorrelated signals at reasonably high SNR and with a reasonably large number of samples, and surpasses it for coherent signals at low SNR and with a small number of samples. When the SNR is low and the number of samples is small, the proposed method is less accurate than the classical WSF method for uncorrelated signals. This drawback is balanced by the computational advantage of the proposed method.

  • A Micro-Code-Based IME Engine for HEVC and Its Hardware Implementation

    Leilei HUANG  Yibo FAN  Chenhao GU  Xiaoyang ZENG  

     
    PAPER-Integrated Electronics

      Vol:
    E102-C No:10
      Page(s):
    756-765

    The High Efficiency Video Coding (HEVC) standard is becoming one of the most widespread video coding standards in the world. As the successor to the H.264 standard, it aims to provide much better encoding performance. To fulfill this goal, the standard introduces several new notations along with the corresponding computation processes. Among those computation processes, integer motion estimation (IME) is one of the bottlenecks due to the complex partitions of the inter prediction units (PUs) and the large search window commonly adopted. Many algorithms have been proposed to address this issue, and they usually emphasize a large search window and a large amount of computation. However, the coding effort should be related to the scene. More specifically, for relatively static videos, a small search window along with a simple search scheme should be adopted to reduce the time cost and power consumption. In view of this, a micro-code-based IME engine is proposed in this paper, which can be used with search schemes of different complexity. To test the performance, three different search schemes based on this engine are designed and evaluated under HEVC test model (HM) 16.9, achieving B-D rate increases of 0.55/-0.07/-0.14%. Compared with our previous work, the hardware implementation is optimized to reduce the SRAM bits by 64.2% and the logic gate count by 32.8%. The final design can support 4K×2K @139/85/37fps videos @500MHz.
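To make the idea of one engine running search schemes of different complexity concrete, here is a minimal software sketch of SAD-based integer block matching in which a scheme is simply a list of candidate offsets. The offset list is a hypothetical stand-in for the paper's micro-code, whose real instruction format the abstract does not describe. All names (`sad`, `run_scheme`, `full_search`, `cross_search`) are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Frame = List[List[int]]   # frame[y][x] holds one luma sample
Offset = Tuple[int, int]  # (dx, dy) displacement in integer pixels


def sad(cur: Frame, ref: Frame, bx: int, by: int,
        dx: int, dy: int, size: int) -> int:
    """Sum of absolute differences between a block and a displaced reference block."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def run_scheme(cur: Frame, ref: Frame, bx: int, by: int, size: int,
               scheme: Sequence[Offset]) -> Optional[Tuple[int, Offset]]:
    """Evaluate every offset in the scheme; return (best cost, best offset).

    Offsets that would read outside the reference frame are skipped.
    Returns None when no offset in the scheme is valid.
    """
    height, width = len(ref), len(ref[0])
    best: Optional[Tuple[int, Offset]] = None
    for dx, dy in scheme:
        inside = (0 <= bx + dx and bx + dx + size <= width and
                  0 <= by + dy and by + dy + size <= height)
        if not inside:
            continue
        cost = sad(cur, ref, bx, by, dx, dy, size)
        if best is None or cost < best[0]:
            best = (cost, (dx, dy))
    return best


def full_search(radius: int) -> List[Offset]:
    """Exhaustive scheme: every offset in a (2*radius + 1)^2 window."""
    return [(dx, dy)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]


def cross_search(radius: int) -> List[Offset]:
    """Cheaper scheme: the centre plus horizontal and vertical offsets only."""
    scheme: List[Offset] = [(0, 0)]
    for d in range(1, radius + 1):
        scheme += [(d, 0), (-d, 0), (0, d), (0, -d)]
    return scheme
```

With a radius of 3, `full_search` evaluates 49 candidates and `cross_search` only 13. The cheaper scheme can miss a diagonal displacement, which mirrors the trade-off between coding performance and time or power cost that the abstract describes.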

  • A High-Throughput and Compact Hardware Implementation for the Reconstruction Loop in HEVC Intra Encoding

    Yibo FAN  Leilei HUANG  Zheng XIE  Xiaoyang ZENG  

     
    PAPER-Integrated Electronics

      Vol:
    E100-C No:6
      Page(s):
    643-654

    In the newly finalized video coding standard, high efficiency video coding (HEVC), new notations such as the coding unit (CU), prediction unit (PU) and transformation unit (TU) are introduced to improve the coding performance. As a result, the reconstruction loop in intra encoding is heavily burdened with choosing the best partitions or modes for them. To solve the bottleneck problems in cycle and hardware cost, this paper proposes a high-throughput and compact implementation for such a reconstruction loop. “High-throughput” means that the design has a fixed throughput of 32 pixels/cycle independent of the TU/PU size (except for 4×4 TUs). “Compact” means that it fully exploits the reusability between the discrete cosine transform (DCT) and the inverse discrete cosine transform (IDCT) as well as that between quantization (Q) and de-quantization (IQ). Besides the contributions made in designing the related hardware, this paper also provides a universal formula to analyze the cycle cost of the reconstruction loop and proposes a parallel-process scheme to further reduce the cycle cost. The design is verified on a Stratix IV FPGA. The basic structure achieves a maximum frequency of 150MHz at a hardware cost of 64K ALUTs, which can support real-time TU/PU partition decision for 4K×2K@20fps videos.
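As a back-of-the-envelope reading of the quoted figures, the sketch below compares the peak pixel rate of the loop (32 pixels/cycle at 150 MHz) with the pixel rate of the video. It assumes that 4K×2K means 4096×2160 and that the video is 4:2:0 (1.5 samples per luma pixel). Neither is stated in the abstract, and the function names are hypothetical.

```python
def required_pixel_rate(width: int, height: int, fps: float,
                        samples_per_pixel: float = 1.5) -> float:
    """Samples per second in the video (1.5 assumes 4:2:0 chroma subsampling)."""
    return width * height * fps * samples_per_pixel


def loop_passes_per_pixel(freq_hz: float, pixels_per_cycle: int,
                          width: int, height: int, fps: float,
                          samples_per_pixel: float = 1.5) -> float:
    """How many times each sample could pass through the loop in real time."""
    peak_rate = freq_hz * pixels_per_cycle
    return peak_rate / required_pixel_rate(width, height, fps, samples_per_pixel)


if __name__ == "__main__":
    # Figures from the abstract: 32 pixels/cycle, 150 MHz, 4K x 2K @ 20 fps.
    # The 4096 x 2160 resolution is an assumption.
    budget = loop_passes_per_pixel(150e6, 32, 4096, 2160, 20)
    print(f"about {budget:.1f} loop passes per sample")
```

Under these assumptions each sample can traverse the loop roughly 18 times per frame. One plausible reading is that this headroom is what allows several candidate partitions or modes to be tried in real time; the paper's own cycle-cost formula is not reproduced here.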