IEICE global.ieice.org Site

Author Search Result

[Author] Xiaoyang ZENG(23hit)

1-20hit(23hit)

Efficient Iterative Frequency Domain Equalization for Single Carrier System with Insufficient Cyclic Prefix
Chuan WU Dan BAO Xiaoyang ZENG Yun CHEN

LETTER-Wireless Communication Technologies

Vol:
E94-B No:7
Page(s):
2174-2177
In this letter, we present efficient iterative frequency domain equalization for single-carrier (SC) transmission systems with insufficient cyclic prefix (CP). Based on the minimum mean square error (MMSE) criterion, iterative decision feedback frequency domain equalization (IDF-FDE) combined with cyclic prefix reconstruction (CPR) is derived to mitigate inter-symbol interference (ISI) and inter-carrier interference (ICI). Computer simulation results reveal that the proposed scheme significantly improves the performance of SC systems with insufficient CP compared with previous schemes.
A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM
Yibo FAN Leilei HUANG Kewei CHEN Xiaoyang ZENG

PAPER-Integrated Electronics

Publicized:
2019/11/27
Vol:
E103-C No:5
Page(s):
263-273
The neural network has been one of the most useful techniques in the areas of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural network (RNN), has been widely implemented on CPUs and GPUs. However, those software implementations offer poor parallelism, while the existing hardware implementations lack configurability. To close this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve this goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of the structure; and the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function are carefully optimized to balance hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on an XCZU6EG FPGA, using only 3K look-up tables (LUTs). Compared with the implementation on an Intel Xeon E5-2620 CPU @ 2.10 GHz, this work achieves about 90× speedup for small networks and 25× speedup for large ones. The resource consumption is also much lower than that of state-of-the-art works.
A Reconfigurable 74-140Mbps LDPC Decoding System for CCSDS Standard
Yun CHEN Jimin WANG Shixian LI Jinfou XIE Qichen ZHANG Keshab K. PARHI Xiaoyang ZENG

PAPER

Publicized:
2021/05/25
Vol:
E104-A No:11
Page(s):
1509-1515
Accumulate Repeat-4 Jagged-Accumulate (AR4JA) codes, which are channel codes designed for deep-space communications, are a series of QC-LDPC codes. The structure of these codes' generator matrices can be exploited to design reconfigurable encoders. To make the decoder reconfigurable and achieve shorter convergence time, turbo-like decoding message passing (TDMP) is chosen as the hardware decoder's decoding schedule, and the normalized min-sum algorithm (NMSA) is used as the decoding algorithm to reduce hardware complexity. In this paper, we propose a reconfigurable decoder and present its FPGA implementation results. The decoder can achieve throughput greater than 74 Mbps.
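The abstract names the normalized min-sum algorithm without giving details. As a rough illustration only, the check-node update of the textbook algorithm can be sketched as below. The function name `nms_check_node_update` and the normalization factor `alpha=0.75` are assumptions made for this sketch; they are not taken from the paper, and the sketch says nothing about the authors' hardware architecture.

```python
def nms_check_node_update(v2c, alpha=0.75):
    """Normalized min-sum update for a single check node.

    v2c   -- incoming variable-to-check messages (LLRs), one per edge
    alpha -- normalization factor (hypothetical value, chosen for illustration)
    Returns the outgoing check-to-variable message for every edge.
    """
    if len(v2c) < 2:
        raise ValueError("a check node needs at least two connected edges")

    mags = [abs(m) for m in v2c]
    # Only the two smallest magnitudes are needed: every edge receives the
    # overall minimum, except the edge that supplied it, which receives
    # the second minimum.
    min1_idx = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[min1_idx]
    min2 = min(m for i, m in enumerate(mags) if i != min1_idx)

    # Product of all signs; multiplying by an edge's own sign removes it again.
    total_sign = 1
    for m in v2c:
        if m < 0:
            total_sign = -total_sign

    out = []
    for i, m in enumerate(v2c):
        mag = min2 if i == min1_idx else min1
        own_sign = -1 if m < 0 else 1
        out.append(alpha * total_sign * own_sign * mag)
    return out


print(nms_check_node_update([2.0, -3.0, 0.5, -1.0]))
# -> [0.375, -0.375, 0.75, -0.375]
```

Tracking only the two smallest magnitudes is the usual reason min-sum variants are considered hardware-friendly: no per-edge nonlinear function is evaluated.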
A 64 Cycles/MB, Luma-Chroma Parallelized H.264/AVC Deblocking Filter for 4K×2K Applications
Weiwei SHEN Yibo FAN Xiaoyang ZENG

PAPER

Vol:
E95-C No:4
Page(s):
441-446
In this paper, a high-throughput deblocking filter is presented for the H.264/AVC standard, catering to video applications with 4K×2K (4096×2304) ultra-high-definition resolution. In order to strengthen the parallelism without simply increasing the area, we propose a luma-chroma parallel method. Meanwhile, this work reduces the number of processing cycles, the amount of external memory traffic and the working frequency by using triple four-stage pipeline filters and a luma-chroma interlaced sequence. Furthermore, it eliminates most unnecessary off-chip memory bandwidth with a highly reusable memory scheme, and adopts a “slide window” buffer scheme. As a result, our design can support 4K×2K at 30 fps applications at a working frequency of only 70.8 MHz.
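The 70.8 MHz figure follows from the 64 cycles/MB in the title. A quick sanity check, assuming a 4096×2304 frame and the 16×16-pixel macroblocks of H.264/AVC (the variable and function names are illustrative only):

```python
# Clock frequency implied by 64 cycles per macroblock (MB).
# Assumptions: 4096x2304 frame, 16x16-pixel macroblocks, 30 frames per second.
MB_SIZE = 16
WIDTH, HEIGHT = 4096, 2304
FPS = 30
CYCLES_PER_MB = 64


def required_frequency_hz(width, height, fps, cycles_per_mb, mb_size=MB_SIZE):
    """Minimum clock rate needed to process every macroblock in real time."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    return mbs_per_frame * fps * cycles_per_mb


freq = required_frequency_hz(WIDTH, HEIGHT, FPS, CYCLES_PER_MB)
print(f"{freq / 1e6:.1f} MHz")  # -> 70.8 MHz
```

That is, 256 × 144 = 36,864 macroblocks per frame, times 30 frames per second, times 64 cycles gives about 70.8 million cycles per second.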
A Fully Programmable Reed-Solomon Decoder on a Multi-Core Processor Platform
Bei HUANG Kaidi YOU Yun CHEN Zhiyi YU Xiaoyang ZENG

PAPER-Computer Architecture

Vol:
E95-D No:12
Page(s):
2939-2947
Reed-Solomon (RS) codes are widely used in digital communication and storage systems. Unlike the usual VLSI approaches, this paper presents a high-throughput, fully programmable Reed-Solomon decoder on a multi-core processor. The multi-core processor platform is a 2-dimensional mesh array of Single Instruction Multiple Data (SIMD) cores, and it is well suited for digital communication applications. By fully extracting the parallelizable operations of the RS decoding process, we propose multiple optimization techniques to improve system throughput, including task-level parallelism on different cores, data-level parallelism on each SIMD core, minimized memory access, and route-length-minimized task mapping. For RS(255, 239, 8), experimental results show that our 12-core implementation achieves a throughput of 4.35 Gbps, which is much better than several other published implementations. The results suggest that, with our approach, the throughput scales linearly with the number of cores.
A Cost-Efficient LDPC Decoder for DVB-S2 with the Solution to Address Conflict Issue
Yan YING Dan BAO Zhiyi YU Xiaoyang ZENG Yun CHEN

PAPER-Digital Signal Processing

Vol:
E93-A No:8
Page(s):
1415-1424
In this paper, a cost-efficient LDPC decoder for DVB-S2 is presented. Based on the Normalized Min-Sum algorithm and the turbo-decoding message-passing (TDMP) algorithm, a dual line-scan scheduling is proposed to enable hardware reuse. Furthermore, we present a solution to the address conflict issue caused by the characteristics of the parity-check matrix defined for DVB-S2 LDPC codes. Based on the SMIC 0.13 µm standard CMOS process, the LDPC decoder has an area of 12.51 mm². The operating frequency required to meet the throughput requirement of 135 Mbps with a maximum iteration number of 30 is 105 MHz. Compared with the latest published DVB-S2 LDPC decoder, the proposed decoder reduces area cost by 34%.
A Flexible LDPC Decoder Architecture Supporting TPMP and TDMP Decoding Algorithms
Shuangqu HUANG Xiaoyang ZENG Yun CHEN

PAPER-Application

Vol:
E95-D No:2
Page(s):
403-412
In this paper, a programmable and area-efficient decoder architecture supporting two decoding algorithms for Block-LDPC codes is presented. The novel decoder can be configured to decode in either TPMP or TDMP decoding mode according to different Block-LDPC codes, essentially combining the advantages of the two decoding algorithms. With a regular and scalable data-path, a Reconfigurable Serial Processing Engine (RSPE) is proposed to achieve area efficiency. To verify our proposed architecture, a flexible LDPC decoder fully compliant with IEEE 802.16e applications is implemented in a 130 nm 1P8M CMOS technology with a total area of 6.3 mm² and a maximum operating frequency of 250 MHz. The chip dissipates 592 mW when operating at 250 MHz with a 1.2 V supply.
CCTSS: The Combination of CNN and Transformer with Shared Sublayer for Detection and Classification
Aorui GOU Jingjing LIU Xiaoxiang CHEN Xiaoyang ZENG Yibo FAN

PAPER-Image

Publicized:
2023/07/06
Vol:
E107-A No:1
Page(s):
141-156
Convolutional Neural Networks (CNNs) and Transformers have achieved remarkable performance in detection and classification tasks. Nevertheless, their feature extraction cannot consider both local and global information, so the detection and classification performance can be further improved. In addition, deep learning networks are becoming increasingly complex, and the amount of computation and storage space they require is also increasing significantly. This paper proposes a combination of CNN and Transformer, and designs a local feature enhancement module and a global context modeling module to enhance the cascade network. While the local feature enhancement module increases the range of feature extraction, the global context modeling module is used to capture the global information of the feature maps. To decrease the model complexity, a shared sublayer is designed to share weight parameters between adjacent convolutional layers or across convolutional layers, thereby reducing the number of convolutional weight parameters. Moreover, to effectively improve the detection performance of neural networks without increasing network parameters, the optimal transport assignment approach is proposed to resolve the problem of label assignment. The classification loss and regression loss are the summations of the cost between the demander and supplier. Experimental results demonstrate that the proposed Combination of CNN and Transformer with Shared Sublayer (CCTSS) performs better than state-of-the-art methods on various datasets and applications.
Design Approach and Implementation of Application Specific Instruction Set Processor for SHA-3 BLAKE Algorithm
Yuli ZHANG Jun HAN Xinqian WENG Zhongzhu HE Xiaoyang ZENG

PAPER-Electronic Circuits

Vol:
E95-C No:8
Page(s):
1415-1426
This paper presents an Application Specific Instruction-set Processor (ASIP) for the SHA-3 BLAKE algorithm family, built by adding instruction set extensions (ISE) to a RISC (reduced instruction set computer) processor. With a design space exploration for this ASIP to increase the performance and reduce the area cost, we accomplish an efficient hardware and software implementation of the BLAKE algorithm. The special instructions and their well-matched hardware function unit improve the calculation of the key section of the algorithm, namely the G-functions. Also, relaxing the timing constraint of the special function unit can decrease its hardware cost while keeping the high data throughput of the processor. Evaluation results reveal that the ASIP achieves 335 Mbps and 176 Mbps for BLAKE-256 and BLAKE-512, respectively. The extra area cost is only 8.06k equivalent gates. The proposed ASIP outperforms several software approaches on various platforms in cycles per byte. In fact, both the high throughput and the low hardware cost achieved by this programmable processor are comparable to those of ASIC implementations.
An Attention Nested U-Structure Suitable for Salient Ship Detection in Complex Maritime Environment
Weina ZHOU Ying ZHOU Xiaoyang ZENG

PAPER-Information Network

Publicized:
2022/03/23
Vol:
E105-D No:6
Page(s):
1164-1171
Salient ship detection plays an important role in ensuring the safety of maritime transportation and navigation. However, due to the influence of waves, special weather, and illumination on the sea, existing saliency methods are still unable to achieve effective ship detection in a complex marine environment. To solve this problem, this paper proposes a novel saliency method based on an attention nested U-Structure (AU2Net). First, to make up for the shortcomings of the U-shaped structure, the pyramid pooling module (PPM) and global guidance paths (GGPs) are designed to guide the restoration of feature information. Then, attention modules are added to the nested U-shaped structure to further refine the target characteristics. Finally, multi-level features and global context features are integrated through the feature aggregation module (FAM) to improve the ability to locate targets. Experimental results demonstrate that the proposed method achieves up to a 36.75% improvement in F-measure (Favg) compared with other state-of-the-art methods.
Obstacle Detection for Unmanned Surface Vehicles by Fusion Refinement Network
Weina ZHOU Xinxin HUANG Xiaoyang ZENG

PAPER-Information Network

Publicized:
2022/05/12
Vol:
E105-D No:8
Page(s):
1393-1400
As a kind of marine vehicle, Unmanned Surface Vehicles (USVs) are widely used in military and civilian fields because of their low cost, good concealment, strong mobility and high speed. High-precision obstacle detection plays an important role in USV autonomous navigation, as it underpins the subsequent path planning. In order to further improve obstacle detection performance, we propose an encoder-decoder architecture named Fusion Refinement Network (FRN). The encoder, with its deeper network structure, can extract richer visual features. In particular, a dilated convolution layer is used in the encoder to obtain a large range of obstacle features in complex marine environments. The decoder performs multi-path feature fusion. Attention Refinement Modules (ARMs) are added to optimize features, and a learnable fusion algorithm called the Feature Fusion Module (FFM) is used to fuse visual information. Experimental validation on three different datasets of real marine images shows that FRN is superior to state-of-the-art semantic segmentation networks in performance evaluation. The MIoU and MPA of FRN peak at 97.01% and 98.37%, respectively. Moreover, FRN maintains high accuracy with only 27.67M parameters, far fewer than the latest obstacle detection network for USVs (WaSR).
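The abstract reports MIoU and MPA without defining them. Assuming they denote the usual mean intersection-over-union and mean pixel accuracy of semantic segmentation (an assumption; the abstract does not spell this out), both can be computed from a confusion matrix as in the sketch below. The function name `mean_iou_and_mpa` is hypothetical.

```python
def mean_iou_and_mpa(confusion):
    """Compute (MIoU, MPA) from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i and
    whose predicted class is j. Classes that never occur are skipped.
    """
    n = len(confusion)
    ious, accs = [], []
    for c in range(n):
        tp = confusion[c][c]
        true_total = sum(confusion[c])                       # pixels truly in class c
        pred_total = sum(confusion[r][c] for r in range(n))  # pixels predicted as c
        union = true_total + pred_total - tp
        if union > 0:
            ious.append(tp / union)
        if true_total > 0:
            accs.append(tp / true_total)
    return sum(ious) / len(ious), sum(accs) / len(accs)


# Toy two-class example (made-up counts, not data from the paper).
miou, mpa = mean_iou_and_mpa([[3, 1], [1, 5]])
print(f"MIoU = {miou:.4f}, MPA = {mpa:.4f}")
```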
A Flexible Architecture for TURBO and LDPC Codes
Yun CHEN Yuebin HUANG Chen CHEN Changsheng ZHOU Xiaoyang ZENG

LETTER-High-Level Synthesis and System-Level Design

Vol:
E95-A No:12
Page(s):
2392-2395
Turbo codes and LDPC (Low-Density Parity-Check) codes are two of the most powerful error correction codes and can approach the Shannon limit in many communication systems. However, few architectures have been presented that support both LDPC and Turbo codes, especially as ASICs. This paper implements a common architecture that can decode both LDPC and Turbo codes, and it is capable of supporting the WiMAX, WiFi and 3GPP-LTE standards on the same hardware. In this paper, we describe in detail how memory and logic devices are shared between the different operation modes. The chip is designed in a 130 nm CMOS technology, and the maximum clock frequency reaches 160 MHz. The maximum throughput is about 104 Mbps @ 5.5 iterations for Turbo codes and 136 Mbps @ 10 iterations for LDPC codes. Compared with other existing structures, the design has significant advantages in speed and area.
A Micro-Code-Based IME Engine for HEVC and Its Hardware Implementation
Leilei HUANG Yibo FAN Chenhao GU Xiaoyang ZENG

PAPER-Integrated Electronics

Vol:
E102-C No:10
Page(s):
756-765
The High Efficiency Video Coding (HEVC) standard is now becoming one of the most widespread video coding standards in the world. As the successor of the H.264 standard, it aims to provide much better encoding performance. To fulfill this goal, several new notations along with the corresponding computation processes are introduced by this standard. Among those computation processes, integer motion estimation (IME) is one of the bottlenecks due to the complex partitions of the inter prediction units (PUs) and the large search window commonly adopted. Many algorithms have been proposed to address this issue, and they usually put emphasis on a large search window and a large amount of computation. However, the coding effort should be related to the scene. To be more specific, for relatively static videos, a small search window along with a simple search scheme should be adopted to reduce the time cost and power consumption. In view of this, a micro-code-based IME engine is proposed in this paper, which can be applied with search schemes of different complexity. To test the performance, three different search schemes based on this engine are designed and evaluated under HEVC test model (HM) 16.9, achieving a B-D rate increase of 0.55/-0.07/-0.14%. Compared with our previous work, the hardware implementation is optimized to reduce the SRAM bits by 64.2% and the logic gate count by 32.8%. The final design supports 4K×2K @ 139/85/37 fps videos @ 500 MHz.
A Unified Forward/Inverse Transform Architecture for Multi-Standard Video Codec Design
Sha SHEN Weiwei SHEN Yibo FAN Xiaoyang ZENG

PAPER-Digital Signal Processing

Vol:
E96-A No:7
Page(s):
1534-1542
This paper describes a unified VLSI architecture which can be applied to various types of transforms used in MPEG-2/4, H.264, VC-1, AVS and the emerging new video coding standard named HEVC (High Efficiency Video Coding). A novel design named configurable butterfly array (CBA) is also proposed to support both the forward transform and the inverse transform in this unified architecture. The Hadamard transform or 4/8-point DCT/IDCT are used in traditional video coding standards, while 16/32-point DCT/IDCT are newly introduced in HEVC. The proposed architecture can support all these transform types in a unified architecture. Two levels of hardware sharing (architecture level and block level) are adopted in this design. At the architecture level, the forward transform shares hardware resources with the inverse transform. At the block level, the hardware for smaller-size transforms is recursively reused by larger-size transforms. The multiplications of the 4- or 8-point transforms are implemented with multiplierless MCM (Multiple Constant Multiplication). In order to reduce the hardware overhead, the multiplications of the 16/32-point DCT are implemented with ICM (input-muxed constant multipliers) instead of MCM or regular multipliers. The proposed design is 51% more area efficient than previous work. To the authors' knowledge, this is the first published work to support both forward and inverse 4/8/16/32-point integer transforms for the HEVC standard in a unified architecture.
Efficient Implementation of OFDM Inner Receiver on a Programmable Multi-Core Processor Platform
Wenhua FAN Chen CHEN Yun CHEN Zhiyi YU Xiaoyang ZENG

PAPER

Vol:
E95-B No:4
Page(s):
1241-1248
This paper presents an efficient implementation of an OFDM inner receiver on a programmable multi-core processor platform, with CMMB as an application. The platform consists of an array of programmable SIMD processors interconnected in a 2-D mesh network, which can provide high performance and is well suited to wireless communication applications. Implemented on one cluster with 8 cores, the receiver includes symbol timing, carrier frequency offset and sampling frequency offset synchronization, channel estimation and equalization. Multiple optimization techniques are explored to improve system throughput, such as task-level parallelism on many cores, data-level parallelism on SIMD cores, minimization of memory access and route-length-minimized task mapping. In addition, an efficient memory strategy and specific instructions for complex computation increase the performance. The simulation results show that the inner receiver can achieve a throughput of up to 120 Mbps when operating at 750 MHz.
A 1.5 Gb/s Highly Parallel Turbo Decoder for 3GPP LTE/LTE-Advanced
Yun CHEN Xubin CHEN Zhiyuan GUO Xiaoyang ZENG Defeng HUANG

LETTER-Fundamental Theories for Communications

Vol:
E96-B No:5
Page(s):
1211-1214
A highly parallel turbo decoder for 3GPP LTE/LTE-Advanced systems is presented. It consists of 32 radix-4 soft-in/soft-out (SISO) decoders. Each SISO decoder is based on the proposed full-parallel sliding window (SW) schedule. Implemented in a 0.13 µm CMOS technology, the proposed design occupies 12.96 mm² and achieves 1.5 Gb/s while decoding size-6144 blocks with 5.5 iterations. Compared with the conventional SW schedule, the throughput is improved by 30–76% with 19.2% area overhead and negligible energy overhead.
A High-Throughput and Compact Hardware Implementation for the Reconstruction Loop in HEVC Intra Encoding
Yibo FAN Leilei HUANG Zheng XIE Xiaoyang ZENG

PAPER-Integrated Electronics

Vol:
E100-C No:6
Page(s):
643-654
In the newly finalized video coding standard, namely high efficiency video coding (HEVC), new notations like coding unit (CU), prediction unit (PU) and transformation unit (TU) are introduced to improve the coding performance. As a result, the reconstruction loop in intra encoding is heavily burdened with choosing the best partitions or modes for them. In order to solve the bottleneck problems in cycle and hardware cost, this paper proposes a high-throughput and compact implementation for such a reconstruction loop. “High-throughput” means that it has a fixed throughput of 32 pixels/cycle independent of the TU/PU size (except for 4×4 TUs). “Compact” means that it fully exploits the reusability between the discrete cosine transform (DCT) and the inverse discrete cosine transform (IDCT), as well as that between quantization (Q) and de-quantization (IQ). Besides the contributions made in designing the related hardware, this paper also provides a universal formula to analyze the cycle cost of the reconstruction loop and proposes a parallel-process scheme to further reduce the cycle cost. This design is verified on a Stratix IV FPGA. The basic structure achieves a maximum frequency of 150 MHz and a hardware cost of 64K ALUTs, which can support the real-time TU/PU partition decision for 4K×2K@20 fps videos.
A Scalable and Reconfigurable Fault-Tolerant Distributed Routing Algorithm for NoCs
Zewen SHI Xiaoyang ZENG Zhiyi YU

PAPER-Computer System

Vol:
E94-D No:7
Page(s):
1386-1397
Manufacturing defects in the deep sub-micron VLSI process and aging-induced device problems during the lifecycle are inevitable, and fault-tolerant routing algorithms are important for providing the required communication in NoCs in spite of failures. The proposed algorithm, referred to as scalable and reconfigurable fault-tolerant distributed routing (RFDR), partitions the system into nine regions using the concept of divide-and-conquer. It is a distributed algorithm: each router guarantees fault tolerance within its own region, and the system can still be sustained with multiple fault areas. The proposed RFDR has excellent scalability, with hardware cost remaining constant regardless of system size. It is also completely reconfigurable when new nodes fail. Simulations under various synthetic traffic patterns show better performance than the Extended-XY routing algorithm. Moreover, there is almost no hardware overhead compared with Logic-Based Distributed Routing (LBDR), while the fault-tolerance capacity is enhanced in the proposed algorithm. Hardware cost is reduced by 37% compared with Reconfigurable Distributed Scalable Predictable Interconnect Network (R-DSPIN), which supports only a single fault region.
An 8×8/4×4 Adaptive Hadamard Transform Based FME VLSI Architecture for 4K×2K H.264/AVC Encoder
Yibo FAN Jialiang LIU Dexue ZHANG Xiaoyang ZENG Xinhua CHEN

PAPER

Vol:
E95-C No:4
Page(s):
447-455
Fidelity Range Extension (FRExt) (i.e., High Profile) was added to the H.264/AVC recommendation in its second version. One of the features included in FRExt is the Adaptive Block-size Transform (ABT). In order to conform to FRExt, a Fractional Motion Estimation (FME) architecture is proposed to support the 8×8/4×4 adaptive Hadamard Transform (8×8/4×4 AHT). The 8×8/4×4 AHT circuit contributes to higher throughput and encoding performance. In order to increase the utilization of the SATD (Sum of Absolute Transformed Difference) Generator (SG) per unit time, the proposed architecture employs two 8-pel interpolators (IPs) that time-share one SG. These two IPs work in turn to provide data continuously to the SG, which increases the data throughput and significantly reduces the number of cycles needed to process one macroblock. Furthermore, this architecture also exploits the linearity of the Hadamard transform to generate the quarter-pel SATD. This method helps to shorten the long datapath in the second step of the two-iteration FME algorithm. Finally, experimental results show that this architecture can be used in applications with different performance requirements by adjusting the supported modes and operating frequency. It can support real-time encoding of seven-mode 4K×2K@24 fps or six-mode 4K×2K@30 fps video sequences.
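The linearity that the abstract refers to can be illustrated for the 4×4 case. The sketch below only demonstrates the mathematical property (the transform of a sum equals the sum of the transforms) together with an unnormalized SATD. How the authors exploit this in their quarter-pel datapath is not described in the abstract, and all function names and sample blocks here are hypothetical.

```python
# 4x4 Hadamard matrix (entries are +1/-1); row order does not affect the SATD.
H4 = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]


def matmul(a, b):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def hadamard4x4(block):
    """2-D Hadamard transform H * X * H^T of a 4x4 block (no scaling)."""
    return matmul(matmul(H4, block), transpose(H4))


def satd(block):
    """Unnormalized sum of absolute transformed differences of a residual block."""
    return sum(abs(c) for row in hadamard4x4(block) for c in row)


def add(a, b):
    """Element-wise sum of two 4x4 blocks."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Two made-up residual blocks.
res_a = [[i + j for j in range(4)] for i in range(4)]
res_b = [[(i * j) % 3 - 1 for j in range(4)] for i in range(4)]

# Linearity: transforming the sum equals summing the transforms.
assert hadamard4x4(add(res_a, res_b)) == add(hadamard4x4(res_a), hadamard4x4(res_b))
print(satd(res_a), satd(res_b))
```

Because of this property, transformed coefficients of a block that is a linear combination of already-transformed blocks can be obtained by combining coefficients instead of running the transform again.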
A High Speed Reconfigurable Face Detection Architecture Based on AdaBoost Cascade Algorithm
Weina ZHOU Lin DAI Yao ZOU Xiaoyang ZENG Jun HAN

PAPER-Application

Vol:
E95-D No:2
Page(s):
383-391
Face detection has become an independent technology that plays an important role in more and more fields, which makes it necessary and urgent for its architecture to be reconfigurable to meet different demands on detection capabilities. This paper proposes a face detection architecture that can be adjusted by the user according to the background, the sensor resolution, and the detection accuracy and speed required in different situations. This user-adjustable mode makes reconfiguration simple and efficient, and is especially suitable for portable mobile terminals whose working conditions change frequently. In addition, this architecture can work as an accelerator in a larger and more powerful system integrated with other functional modules. Experimental results show that the reconfiguration of the architecture is effective for face detection, and the synthesis report also indicates its advantage of low area and power consumption.
