Author Search Result

[Author] Xiaoyang ZENG(23hit)

1-20hit(23hit)

  • Efficient Iterative Frequency Domain Equalization for Single Carrier System with Insufficient Cyclic Prefix

    Chuan WU  Dan BAO  Xiaoyang ZENG  Yun CHEN  

     
    LETTER-Wireless Communication Technologies

      Vol:
    E94-B No:7
      Page(s):
    2174-2177

    In this letter we present efficient iterative frequency domain equalization for single-carrier (SC) transmission systems with insufficient cyclic prefix (CP). Based on minimum mean square error (MMSE) criteria, iterative decision feedback frequency domain equalization (IDF-FDE) combined with cyclic prefix reconstruction (CPR) is derived to mitigate inter-symbol interference (ISI) and inter-carrier interference (ICI). Computer simulation results reveal that the proposed scheme significantly improves the performance of SC systems with insufficient CP compared with previous schemes.
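    For orientation, the conventional one-tap linear MMSE frequency domain equalizer that such schemes build on can be sketched as follows. This is only the textbook baseline under a sufficient cyclic prefix (so the channel acts as a circular convolution), not the IDF-FDE with CPR proposed in the letter. The function names, the two-tap channel and the SNR value are illustrative assumptions.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT; adequate for a short illustrative block."""
    n = len(x)
    s = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(s * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def circular_convolve(x, h):
    """Channel model when the cyclic prefix is long enough (assumption)."""
    n = len(x)
    return [sum(h[l] * x[(t - l) % n] for l in range(len(h))) for t in range(n)]

def mmse_fde(received, channel, snr_linear):
    """One-tap MMSE equalizer per bin: W_k = conj(H_k) / (|H_k|^2 + 1/SNR)."""
    n = len(received)
    h_padded = list(channel) + [0] * (n - len(channel))
    H = dft(h_padded)
    R = dft(received)
    X_hat = [r * hk.conjugate() / (abs(hk) ** 2 + 1.0 / snr_linear)
             for r, hk in zip(R, H)]
    return dft(X_hat, inverse=True)

# Noise-free toy block of BPSK symbols through a hypothetical two-tap channel.
symbols = [1, -1, 1, 1, -1, -1, 1, -1]
channel = [1.0, 0.5]
received = circular_convolve(symbols, channel)
equalized = mmse_fde(received, channel, snr_linear=1e6)
decisions = [1 if v.real >= 0 else -1 for v in equalized]
```

    When the cyclic prefix is shorter than the channel memory, the circular-convolution assumption above no longer holds, which is the situation the letter addresses.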

  • A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM

    Yibo FAN  Leilei HUANG  Kewei CHEN  Xiaoyang ZENG  

     
    PAPER-Integrated Electronics

      Publicized:
    2019/11/27
      Vol:
    E103-C No:5
      Page(s):
    263-273

    The neural network has been one of the most useful techniques in speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural network (RNN), has been widely implemented on CPUs and GPUs. However, those software implementations offer poor parallelism, while the existing hardware implementations lack configurability. To fill this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve this goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of the structure; and the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function are carefully optimized to balance hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on an XCZU6EG FPGA, using only 3K look-up tables (LUTs). Compared with the implementation on an Intel Xeon E5-2620 CPU @ 2.10 GHz, this work achieves about 90× speed-up for small networks and 25× speed-up for large ones. The resource consumption is also much lower than that of state-of-the-art works.

  • A Reconfigurable 74-140Mbps LDPC Decoding System for CCSDS Standard

    Yun CHEN  Jimin WANG  Shixian LI  Jinfou XIE  Qichen ZHANG  Keshab K. PARHI  Xiaoyang ZENG  

     
    PAPER

      Publicized:
    2021/05/25
      Vol:
    E104-A No:11
      Page(s):
    1509-1515

    Accumulate Repeat-4 Jagged-Accumulate (AR4JA) codes, which are channel codes designed for deep-space communications, are a series of QC-LDPC codes. The structure of these codes' generator matrices can be exploited to design reconfigurable encoders. To make the decoder reconfigurable and achieve a shorter convergence time, turbo-like decoding message passing (TDMP) is chosen as the hardware decoder's decoding schedule, and the normalized min-sum algorithm (NMSA) is used as the decoding algorithm to reduce hardware complexity. In this paper, we propose a reconfigurable decoder and present its FPGA implementation results. The decoder can achieve a throughput greater than 74 Mbps.
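    As a rough illustration of why the normalized min-sum algorithm is cheap in hardware, the generic check-node update can be sketched as below. This is the textbook form of the update, not the decoder described in the paper. The function name and the normalization factor alpha = 0.75 are illustrative assumptions.

```python
def nms_check_node_update(msgs, alpha=0.75):
    """Normalized min-sum check-node update for one check node.

    msgs  : incoming variable-to-check LLRs (at least two edges).
    alpha : normalization factor; 0.75 is an illustrative value only.
    Returns the outgoing check-to-variable LLR for every edge.
    """
    mags = [abs(m) for m in msgs]
    # Only the two smallest magnitudes are needed: excluding edge i leaves
    # min1 as the minimum unless edge i itself holds min1.
    idx_min1 = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx_min1]
    min2 = min(m for i, m in enumerate(mags) if i != idx_min1)
    # Product of all incoming signs.
    total_sign = 1
    for m in msgs:
        if m < 0:
            total_sign = -total_sign
    out = []
    for i, m in enumerate(msgs):
        sign_i = -1 if m < 0 else 1
        mag = min2 if i == idx_min1 else min1
        # Multiplying by sign_i removes edge i's own sign from the product.
        out.append(alpha * total_sign * sign_i * mag)
    return out

result = nms_check_node_update([2.0, -1.0, 4.0])
```

    The update needs only comparisons, sign logic and one scaling per edge, which is why min-sum variants are commonly preferred over the full sum-product rule in hardware decoders.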

  • A 64 Cycles/MB, Luma-Chroma Parallelized H.264/AVC Deblocking Filter for 4K×2K Applications

    Weiwei SHEN  Yibo FAN  Xiaoyang ZENG  

     
    PAPER

      Vol:
    E95-C No:4
      Page(s):
    441-446

    In this paper, a high-throughput deblocking filter is presented for the H.264/AVC standard, catering to video applications with 4K×2K (4096×2304) ultra-high-definition resolution. In order to strengthen the parallelism without simply increasing the area, we propose a luma-chroma parallel method. Meanwhile, this work reduces the number of processing cycles, the amount of external memory traffic and the working frequency by using triple four-stage pipeline filters and a luma-chroma interlaced sequence. Furthermore, it eliminates most unnecessary off-chip memory bandwidth with a highly reusable memory scheme, and adopts a “slide window” buffer scheme. As a result, our design can support 4K×2K at 30 fps applications at a working frequency of only 70.8 MHz.
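    The quoted working frequency follows directly from the figures in the title and abstract, as the short calculation below shows. The only assumption beyond the abstract is the standard 16×16 H.264/AVC macroblock size. The function name is illustrative.

```python
# Frame geometry, frame rate and cycles per macroblock come from the
# title and abstract; the 16x16 macroblock size is standard in H.264/AVC.
WIDTH, HEIGHT = 4096, 2304
MB_SIZE = 16
FPS = 30
CYCLES_PER_MB = 64

def required_clock_hz(width, height, fps, cycles_per_mb, mb_size=MB_SIZE):
    """Minimum clock if every macroblock takes a fixed number of cycles."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    return mbs_per_frame * fps * cycles_per_mb

clock_hz = required_clock_hz(WIDTH, HEIGHT, FPS, CYCLES_PER_MB)
print(f"{clock_hz / 1e6:.1f} MHz")  # prints 70.8 MHz
```

    That is, 256 × 144 = 36,864 macroblocks per frame, times 30 frames per second, times 64 cycles per macroblock gives about 70.8 million cycles per second.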

  • A Fully Programmable Reed-Solomon Decoder on a Multi-Core Processor Platform

    Bei HUANG  Kaidi YOU  Yun CHEN  Zhiyi YU  Xiaoyang ZENG  

     
    PAPER-Computer Architecture

      Vol:
    E95-D No:12
      Page(s):
    2939-2947

    Reed-Solomon (RS) codes are widely used in digital communication and storage systems. Unlike the usual VLSI approaches, this paper presents a high-throughput, fully programmable Reed-Solomon decoder on a multi-core processor. The multi-core processor platform is a two-dimensional mesh array of Single Instruction Multiple Data (SIMD) cores, and it is well suited for digital communication applications. By fully extracting the parallelizable operations of the RS decoding process, we propose multiple optimization techniques to improve system throughput, including task-level parallelism on different cores, data-level parallelism on each SIMD core, minimized memory access, and route-length-minimized task mapping. For RS(255, 239, 8), experimental results show that our 12-core implementation achieves a throughput of 4.35 Gbps, which is much better than several other published implementations. The results also suggest that, with our approach, the throughput scales linearly with the number of cores.

  • A Cost-Efficient LDPC Decoder for DVB-S2 with the Solution to Address Conflict Issue

    Yan YING  Dan BAO  Zhiyi YU  Xiaoyang ZENG  Yun CHEN  

     
    PAPER-Digital Signal Processing

      Vol:
    E93-A No:8
      Page(s):
    1415-1424

    In this paper, a cost-efficient LDPC decoder for DVB-S2 is presented. Based on the Normalized Min-Sum algorithm and the turbo-decoding message-passing (TDMP) algorithm, a dual line-scan scheduling is proposed to enable hardware reuse. Furthermore, we present a solution to the address conflict issue caused by the characteristics of the parity-check matrix defined by DVB-S2 LDPC codes. Based on the SMIC 0.13 µm standard CMOS process, the LDPC decoder has an area of 12.51 mm². The operating frequency required to meet the throughput requirement of 135 Mbps with a maximum iteration number of 30 is 105 MHz. Compared with the latest published DVB-S2 LDPC decoder, the proposed decoder reduces area cost by 34%.

  • A Flexible LDPC Decoder Architecture Supporting TPMP and TDMP Decoding Algorithms

    Shuangqu HUANG  Xiaoyang ZENG  Yun CHEN  

     
    PAPER-Application

      Vol:
    E95-D No:2
      Page(s):
    403-412

    In this paper, a programmable and area-efficient decoder architecture supporting two decoding algorithms for Block-LDPC codes is presented. The novel decoder can be configured to decode in either TPMP or TDMP decoding mode according to different Block-LDPC codes, essentially combining the advantages of the two decoding algorithms. With a regular and scalable data-path, a Reconfigurable Serial Processing Engine (RSPE) is proposed to achieve area efficiency. To verify our proposed architecture, a flexible LDPC decoder fully compliant with IEEE 802.16e applications is implemented in a 130 nm 1P8M CMOS technology with a total area of 6.3 mm² and a maximum operating frequency of 250 MHz. The chip dissipates 592 mW when operating at 250 MHz with a 1.2 V supply.

  • CCTSS: The Combination of CNN and Transformer with Shared Sublayer for Detection and Classification

    Aorui GOU  Jingjing LIU  Xiaoxiang CHEN  Xiaoyang ZENG  Yibo FAN  

     
    PAPER-Image

      Publicized:
    2023/07/06
      Vol:
    E107-A No:1
      Page(s):
    141-156

    Convolutional Neural Networks (CNNs) and Transformers have achieved remarkable performance in detection and classification tasks. Nevertheless, their feature extraction cannot consider both local and global information, so the detection and classification performance can be further improved. In addition, deep learning networks are becoming increasingly complex, and the amount of computation and storage space they require is also increasing significantly. This paper proposes a combination of CNN and Transformer, and designs a local feature enhancement module and a global context modeling module to enhance the cascade network. While the local feature enhancement module increases the range of feature extraction, the global context modeling module is used to capture the feature maps' global information. To decrease the model complexity, a shared sublayer is designed to share weight parameters between adjacent convolutional layers or across convolutional layers, thereby reducing the number of convolutional weight parameters. Moreover, to effectively improve the detection performance of neural networks without increasing network parameters, an optimal transport assignment approach is proposed to resolve the problem of label assignment. The classification loss and regression loss are the summations of the cost between the demander and supplier. The experimental results demonstrate that the proposed Combination of CNN and Transformer with Shared Sublayer (CCTSS) performs better than the state-of-the-art methods on various datasets and applications.

  • Design Approach and Implementation of Application Specific Instruction Set Processor for SHA-3 BLAKE Algorithm

    Yuli ZHANG  Jun HAN  Xinqian WENG  Zhongzhu HE  Xiaoyang ZENG  

     
    PAPER-Electronic Circuits

      Vol:
    E95-C No:8
      Page(s):
    1415-1426

    This paper presents an Application Specific Instruction-set Processor (ASIP) for the SHA-3 BLAKE algorithm family, built by instruction set extensions (ISE) of a RISC (reduced instruction set computer) processor. With a design space exploration for this ASIP to increase the performance and reduce the area cost, we accomplish an efficient hardware and software implementation of the BLAKE algorithm. The special instructions and their well-matched hardware function unit accelerate the calculation of the key section of the algorithm, namely the G-functions. Also, relaxing the timing constraint of the special function unit can decrease its hardware cost while keeping the high data throughput of the processor. Evaluation results reveal that the ASIP achieves 335 Mbps and 176 Mbps for BLAKE-256 and BLAKE-512, respectively. The extra area cost is only 8.06k equivalent gates. The proposed ASIP outperforms several software approaches on various platforms in cycles per byte. In fact, both the high throughput and the low hardware cost achieved by this programmable processor are comparable to those of ASIC implementations.

  • An Attention Nested U-Structure Suitable for Salient Ship Detection in Complex Maritime Environment

    Weina ZHOU  Ying ZHOU  Xiaoyang ZENG  

     
    PAPER-Information Network

      Publicized:
    2022/03/23
      Vol:
    E105-D No:6
      Page(s):
    1164-1171

    Salient ship detection plays an important role in ensuring the safety of maritime transportation and navigation. However, due to the influence of waves, special weather, and illumination on the sea, existing saliency methods are still unable to achieve effective ship detection in a complex marine environment. To solve this problem, this paper proposes a novel saliency method based on an attention nested U-Structure (AU2Net). First, to make up for the shortcomings of the U-shaped structure, the pyramid pooling module (PPM) and global guidance paths (GGPs) are designed to guide the restoration of feature information. Then, attention modules are added to the nested U-shaped structure to further refine the target characteristics. Finally, multi-level features and global context features are integrated through the feature aggregation module (FAM) to improve the ability to locate targets. Experimental results demonstrate that the proposed method achieves up to a 36.75% improvement in F-measure (Favg) compared to the other state-of-the-art methods.

  • Obstacle Detection for Unmanned Surface Vehicles by Fusion Refinement Network

    Weina ZHOU  Xinxin HUANG  Xiaoyang ZENG  

     
    PAPER-Information Network

      Publicized:
    2022/05/12
      Vol:
    E105-D No:8
      Page(s):
    1393-1400

    As a kind of marine vehicle, Unmanned Surface Vehicles (USVs) are widely used in military and civilian fields because of their low cost, good concealment, strong mobility and high speed. High-precision detection of obstacles plays an important role in USV autonomous navigation, as it underpins subsequent path planning. In order to further improve obstacle detection performance, we propose an encoder-decoder architecture named Fusion Refinement Network (FRN). The encoder, with its deeper network structure, is able to extract richer visual features. In particular, a dilated convolution layer is used in the encoder to obtain a large range of obstacle features in complex marine environments. The decoder performs multi-path feature fusion. Attention Refinement Modules (ARM) are added to optimize features, and a learnable fusion algorithm called Feature Fusion Module (FFM) is used to fuse visual information. Experimental validation results on three different datasets with real marine images show that FRN is superior to state-of-the-art semantic segmentation networks in performance evaluation. The MIoU and MPA of FRN peak at 97.01% and 98.37%, respectively. Moreover, FRN maintains high accuracy with only 27.67M parameters, which is far fewer than the latest obstacle detection network for USVs (WaSR).

  • A Flexible Architecture for TURBO and LDPC Codes

    Yun CHEN  Yuebin HUANG  Chen CHEN  Changsheng ZHOU  Xiaoyang ZENG  

     
    LETTER-High-Level Synthesis and System-Level Design

      Vol:
    E95-A No:12
      Page(s):
    2392-2395

    Turbo codes and LDPC (Low-Density Parity-Check) codes are two of the most powerful error correction codes that can approach the Shannon limit in many communication systems. However, few architectures have been presented that support both LDPC and Turbo codes, especially as ASICs. This paper implements a common architecture that can decode both LDPC and Turbo codes, and it is capable of supporting the WiMAX, WiFi and 3GPP-LTE standards on the same hardware. In this paper, we describe in detail how to share memory and logic devices across the different operation modes. The chip is designed in a 130 nm CMOS technology, and the maximum clock frequency reaches 160 MHz. The maximum throughput is about 104 Mbps @ 5.5 iterations for Turbo codes and 136 Mbps @ 10 iterations for LDPC codes. Compared with other existing structures, the design has significant advantages in speed and area.

  • A Micro-Code-Based IME Engine for HEVC and Its Hardware Implementation

    Leilei HUANG  Yibo FAN  Chenhao GU  Xiaoyang ZENG  

     
    PAPER-Integrated Electronics

      Vol:
    E102-C No:10
      Page(s):
    756-765

    The High Efficiency Video Coding (HEVC) standard is now becoming one of the most widespread video coding standards in the world. As the successor of the H.264 standard, it aims to provide much superior encoding performance. To fulfill this goal, several new notions, along with the corresponding computation processes, are introduced by this standard. Among those computation processes, integer motion estimation (IME) is one of the bottlenecks due to the complex partitions of the inter prediction units (PU) and the large search window commonly adopted. Many algorithms have been proposed to address this issue, and they usually put emphasis on a large search window and a great amount of computation. However, the coding effort should be related to the scene. To be more specific, for relatively static videos, a small search window along with a simple search scheme should be adopted to reduce the time cost and power consumption. In view of this, a micro-code-based IME engine is proposed in this paper, which can be applied with search schemes of different complexity. To test the performance, three different search schemes based on this engine are designed and evaluated under HEVC test model (HM) 16.9, achieving a B-D rate increase of 0.55/-0.07/-0.14%. Compared with our previous work, the hardware implementation is optimized to reduce the SRAM bits by 64.2% and the logic gate count by 32.8%. The final design can support 4K×2K @139/85/37fps videos @500MHz.

  • A Unified Forward/Inverse Transform Architecture for Multi-Standard Video Codec Design

    Sha SHEN  Weiwei SHEN  Yibo FAN  Xiaoyang ZENG  

     
    PAPER-Digital Signal Processing

      Vol:
    E96-A No:7
      Page(s):
    1534-1542

    This paper describes a unified VLSI architecture which can be applied to various types of transforms used in MPEG-2/4, H.264, VC-1, AVS and the emerging new video coding standard named HEVC (High Efficiency Video Coding). A novel design named configurable butterfly array (CBA) is also proposed to support both the forward transform and the inverse transform in this unified architecture. The Hadamard transform or 4/8-point DCT/IDCT are used in traditional video coding standards, while 16/32-point DCT/IDCT are newly introduced in HEVC. The proposed architecture can support all these transform types in a unified architecture. Two levels (architecture level and block level) of hardware sharing are adopted in this design. At the architecture level, the forward transform can share the hardware resources with the inverse transform. At the block level, the hardware for a smaller-size transform can be recursively reused by a larger-size transform. The multiplications of the 4- or 8-point transforms are implemented with multiplierless MCM (Multiple Constant Multiplication). In order to reduce the hardware overhead, the multiplications of the 16/32-point DCT are implemented with ICM (input-muxed constant multipliers) instead of MCM or regular multipliers. The proposed design is 51% more area efficient than previous work. To the authors' knowledge, this is the first published work to support both forward and inverse 4/8/16/32-point integer transforms for the HEVC standard in a unified architecture.
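    The block-level reuse mentioned above rests on a general property of the DCT-II: the even-indexed outputs of an N-point transform equal an N/2-point transform of the input folded onto itself. The sketch below checks this numerically with an unnormalized floating-point DCT-II. It is not the paper's configurable butterfly array and does not use the HEVC integer matrices. The function names and sample data are illustrative.

```python
import math

def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_t x[t] * cos(pi * (2t + 1) * k / (2N))."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
                for t in range(n))
            for k in range(n)]

def even_outputs_via_half_size(x):
    """Even-indexed N-point outputs computed by an N/2-point DCT-II.

    The butterfly stage folds the input (x[t] + x[N-1-t]); the smaller
    transform then does the rest, so its hardware can be reused.
    """
    n = len(x)
    folded = [x[t] + x[n - 1 - t] for t in range(n // 2)]
    return dct2(folded)

x = [3, 1, 4, 1, 5, 9, 2, 6]
full = dct2(x)
half = even_outputs_via_half_size(x)
matches = all(math.isclose(full[2 * k], half[k], abs_tol=1e-9)
              for k in range(len(half)))
```

    Applying the same folding again to the N/2-point transform gives the recursive structure in which each larger transform contains the smaller ones.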

  • Efficient Implementation of OFDM Inner Receiver on a Programmable Multi-Core Processor Platform

    Wenhua FAN  Chen CHEN  Yun CHEN  Zhiyi YU  Xiaoyang ZENG  

     
    PAPER

      Vol:
    E95-B No:4
      Page(s):
    1241-1248

    This paper presents an efficient implementation of an OFDM inner receiver on a programmable multi-core processor platform, with CMMB as an application. The platform consists of an array of programmable SIMD processors interconnected in a 2-D mesh network, which can provide high performance and is well suited to wireless communication applications. Implemented on one cluster with 8 cores, the receiver includes symbol timing, carrier frequency offset and sampling frequency offset synchronization, channel estimation and equalization. Multiple optimization techniques are explored to improve system throughput, such as task-level parallelism on many cores, data-level parallelism on SIMD cores, minimization of memory access and route-length-minimizing task mapping. In addition, an efficient memory strategy and specific instructions for complex computation increase the performance. The simulation results show that the inner receiver can achieve a throughput of up to 120 Mbps when operating at 750 MHz.

  • A 1.5 Gb/s Highly Parallel Turbo Decoder for 3GPP LTE/LTE-Advanced

    Yun CHEN  Xubin CHEN  Zhiyuan GUO  Xiaoyang ZENG  Defeng HUANG  

     
    LETTER-Fundamental Theories for Communications

      Vol:
    E96-B No:5
      Page(s):
    1211-1214

    A highly parallel turbo decoder for 3GPP LTE/LTE-Advanced systems is presented. It consists of 32 radix-4 soft-in/soft-out (SISO) decoders. Each SISO decoder is based on the proposed full-parallel sliding window (SW) schedule. Implemented in a 0.13 µm CMOS technology, the proposed design occupies 12.96 mm² and achieves 1.5 Gb/s while decoding size-6144 blocks with 5.5 iterations. Compared with the conventional SW schedule, the throughput is improved by 30–76% with 19.2% area overhead and negligible energy overhead.

  • A High-Throughput and Compact Hardware Implementation for the Reconstruction Loop in HEVC Intra Encoding

    Yibo FAN  Leilei HUANG  Zheng XIE  Xiaoyang ZENG  

     
    PAPER-Integrated Electronics

      Vol:
    E100-C No:6
      Page(s):
    643-654

    In the newly finalized video coding standard, namely high efficiency video coding (HEVC), new notions such as the coding unit (CU), prediction unit (PU) and transform unit (TU) are introduced to improve the coding performance. As a result, the reconstruction loop in intra encoding is heavily burdened with choosing the best partitions or modes for them. In order to solve the bottleneck problems in cycle and hardware cost, this paper proposes a high-throughput and compact implementation for such a reconstruction loop. “High-throughput” means that it has a fixed throughput of 32 pixels/cycle independent of the TU/PU size (except for 4×4 TUs). “Compact” means that it fully exploits the reusability between the discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) as well as that between quantization (Q) and de-quantization (IQ). Besides the contributions made in designing the related hardware, this paper also provides a universal formula to analyze the cycle cost of the reconstruction loop and proposes a parallel-process scheme to further reduce the cycle cost. This design is verified on a Stratix IV FPGA. The basic structure achieves a maximum frequency of 150 MHz and a hardware cost of 64K ALUTs, which can support the real-time TU/PU partition decision for 4K×2K@20fps videos.

  • A Scalable and Reconfigurable Fault-Tolerant Distributed Routing Algorithm for NoCs

    Zewen SHI  Xiaoyang ZENG  Zhiyi YU  

     
    PAPER-Computer System

      Vol:
    E94-D No:7
      Page(s):
    1386-1397

    Manufacturing defects in the deep sub-micron VLSI process and aging-induced device problems during the lifecycle are inevitable, so fault-tolerant routing algorithms are important for providing the required communication in NoCs in spite of failures. The proposed algorithm, referred to as scalable and reconfigurable fault-tolerant distributed routing (RFDR), partitions the system into nine regions using the concept of divide-and-conquer. It is a distributed algorithm: each router guarantees fault tolerance within its own region, and the system can still be sustained with multiple fault areas. The proposed RFDR has excellent scalability, with the hardware cost remaining constant regardless of system size. It is also completely reconfigurable when new nodes fail. Simulations under various synthetic traffic patterns show better performance than the Extended-XY routing algorithm. Moreover, there is almost no hardware overhead compared to Logic-Based Distributed Routing (LBDR), while the fault-tolerance capacity is enhanced in the proposed algorithm. Hardware cost is reduced by 37% compared to Reconfigurable Distributed Scalable Predictable Interconnect Network (R-DSPIN), which only supports a single fault region.

  • An 8×8/4×4 Adaptive Hadamard Transform Based FME VLSI Architecture for 4K×2K H.264/AVC Encoder

    Yibo FAN  Jialiang LIU  Dexue ZHANG  Xiaoyang ZENG  Xinhua CHEN  

     
    PAPER

      Vol:
    E95-C No:4
      Page(s):
    447-455

    Fidelity Range Extension (FRExt) (i.e., High Profile) was added to the H.264/AVC recommendation in its second version. One of the features included in FRExt is the Adaptive Block-size Transform (ABT). In order to conform to FRExt, a Fractional Motion Estimation (FME) architecture is proposed to support the 8×8/4×4 adaptive Hadamard Transform (8×8/4×4 AHT). The 8×8/4×4 AHT circuit contributes to higher throughput and encoding performance. In order to increase the utilization of the SATD (Sum of Absolute Transformed Difference) Generator (SG) per unit time, the proposed architecture employs two 8-pel interpolators (IP) that time-share one SG. These two IPs work in turn to provide data continuously to the SG, which increases the data throughput and significantly reduces the number of cycles needed to process one macroblock. Furthermore, this architecture also exploits the linearity of the Hadamard Transform to generate the quarter-pel SATD. This method helps to shorten the long datapath in the second step of the two-iteration FME algorithm. Finally, experimental results show that this architecture can be used in applications with different performance requirements by adjusting the supported modes and operating frequency. It can support the real-time encoding of seven-mode 4K×2K@24 fps or six-mode 4K×2K@30 fps video sequences.
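    The linearity property referred to above can be checked with a small sketch: the Hadamard transform of a sum of two residual blocks equals the sum of their transforms, so transformed blocks can be combined in the transform domain before the absolute sum is taken. This only illustrates the mathematical property with a 4×4 transform. It is not the paper's datapath, and it ignores the rounding in H.264 quarter-pel interpolation. The helper names and sample blocks are illustrative.

```python
# 4x4 Hadamard matrix (symmetric, so H^T == H).
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]

def matmul(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def hadamard2d(block):
    """2-D Hadamard transform of a 4x4 block: H * block * H."""
    return matmul(matmul(H4, block), H4)

def satd(block):
    """Sum of absolute transformed differences of a 4x4 residual block."""
    return sum(abs(v) for row in hadamard2d(block) for v in row)

def add_blocks(a, b):
    return [[a[i][j] + b[i][j] for j in range(4)] for i in range(4)]

# Two hypothetical residual blocks.
res_a = [[1, -2, 0, 3], [4, 0, -1, 2], [0, 5, -3, 1], [2, 2, 0, -4]]
res_b = [[0, 1, 1, -1], [-2, 3, 0, 0], [1, -1, 2, 2], [0, 0, -3, 1]]

# Transform of the sum equals the sum of the transforms.
lhs = hadamard2d(add_blocks(res_a, res_b))
rhs = add_blocks(hadamard2d(res_a), hadamard2d(res_b))
linear = lhs == rhs
```

    Note that SATD itself is not linear because of the absolute value. Only the transform coefficients combine linearly, and the absolute sum still has to be computed on the combined coefficients.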

  • A High Speed Reconfigurable Face Detection Architecture Based on AdaBoost Cascade Algorithm

    Weina ZHOU  Lin DAI  Yao ZOU  Xiaoyang ZENG  Jun HAN  

     
    PAPER-Application

      Vol:
    E95-D No:2
      Page(s):
    383-391

    Face detection has become an independent technology that plays an important role in more and more fields, which makes it necessary and urgent to have a reconfigurable architecture that can meet different demands on detection capability. This paper proposes a face detection architecture that can be adjusted by the user according to the background, the sensor resolution, and the detection accuracy and speed required in different situations. This user-adjustable mode makes reconfiguration simple and efficient, and is especially suitable for portable mobile terminals whose working conditions change frequently. In addition, this architecture can work as an accelerator within a larger and more powerful system integrated with other functional modules. Experimental results show that the reconfiguration of the architecture is well suited to face detection, and the synthesis report also indicates its advantage of low area and power consumption.
