
Author Search Result

[Author] Jun YANG(20hit)

1-20hit
  • Fast AdaBoost-Based Face Detection System on a Dynamically Coarse Grain Reconfigurable Architecture

    Jian XIAO  Jinguo ZHANG  Min ZHU  Jun YANG  Longxing SHI  

     
    PAPER-Application

      Vol:
    E95-D No:2
      Page(s):
    392-402

    An AdaBoost-based face detection system is proposed on a Coarse Grain Reconfigurable Architecture (CGRA) named “REMUS-II”. Our work differs from previous work in three aspects. First, a new hardware-software partition method is proposed, and the whole face detection system is divided into several parallel tasks implemented on two Reconfigurable Processing Units (RPUs) and one Micro-Processor Unit (µPU) according to their relationships. These tasks communicate with each other through a mailbox mechanism. Second, a strong classifier is treated as the smallest phase of the detection system, and every phase must be executed by these tasks in order. A phase of the Haar classifier is dynamically mapped onto a Reconfigurable Cell Array (RCA) only when needed, which is quite different from traditional Field Programmable Gate Array (FPGA) methods in which all the classifiers are fabricated statically. Third, optimized data and configuration word pre-fetch mechanisms are employed to improve overall system performance. Implementation results show that our approach, at a 200 MHz clock rate, can process up to 17 frames per second on VGA-size images, with a detection rate over 95%. Our system consumes 194 mW, and the die size of the fabricated chip is 23 mm² using TSMC 65 nm standard-cell-based technology. To the best of our knowledge, this work is the first implementation of the cascade Haar classifier algorithm on a dynamically reconfigurable CGRA platform presented in the literature.

  • The Organization of On-Chip Data Memory in One Coarse-Grained Reconfigurable Architecture

    Yansheng WANG  Leibo LIU  Shouyi YIN  Min ZHU  Peng CAO  Jun YANG  Shaojun WEI  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E96-A No:11
      Page(s):
    2218-2229

    RCP (Reconfigurable Computing Processor) is intended to fill the gap between ASIC and GPP (General Purpose Processor): it achieves much higher energy efficiency than GPP while being much more flexible than ASIC. In this paper, an organization of on-chip data memory called LIBODM (LIfetime Based On-chip Data Memory) is proposed to reduce the data reference delay and the on-chip data memory size in RCP. In LIBODM, the allocation of data is based on data dependency. Data with low data dependency are stored off-chip to save storage costs, while data with high data dependency are stored on-chip to reduce the reference delay. In addition, on-chip data are classified into two types according to their lifetime. Short-lifetime data are preferably stored in a FIFO, which naturally increases the reuse ratio of memory space. Long-lifetime data are preferably stored in RAM so that they can be referenced several times. LIBODM has been verified in a CGRA (Coarse Grained Reconfigurable Architecture) called RPU (Reconfigurable Processing Unit), and two RPUs have been integrated in an RCP named REMUS_HP (High Performance version of Reconfigurable MUlti-media System), focused on video decoding. Thanks to LIBODM, high performance can still be achieved although the on-chip data memory in REMUS_HP is small. At the same performance level, the on-chip data memory size of REMUS_HP is only 23.9% and 14.8% of that of XPP and ADRES, respectively. REMUS_HP is implemented on 48.9 mm² of silicon with TSMC 65 nm technology. Simulation shows that 1920×1088 @ 30 fps can be achieved for H.264 high-profile decoding at a working frequency of 200 MHz. Compared with the high-performance version of XPP, the performance is boosted by 150% and the energy efficiency by 17.59×.

  • Multi-Scale Contrastive Learning for Human Pose Estimation Open Access

    Wenxia BAO  An LIN  Hua HUANG  Xianjun YANG  Hemu CHEN  

     
    PAPER-Image Recognition, Computer Vision

      Publicized:
    2024/06/17
      Vol:
    E107-D No:10
      Page(s):
    1332-1341

    Recent years have seen remarkable progress in human pose estimation. However, manual annotation of keypoints remains tedious and imprecise. To alleviate this problem, this paper proposes a novel method called Multi-Scale Contrastive Learning (MSCL). This method uses a siamese network structure with upper and lower branches that capture different views of the same image. Each branch uses a backbone network to extract image representations, employing multi-scale feature vectors to capture information. These feature vectors are then passed through an enhanced feature pyramid for fusion, producing more robust feature representations. The feature vectors are then further encoded by mapping and prediction heads to predict the feature vector of another view. Using negative cosine similarity between vectors as a loss function, the backbone network is pre-trained on a large-scale unlabeled dataset, enhancing its capacity to extract visual representations. Finally, transfer learning is performed on a small amount of labelled data for the pose estimation task. Experiments on the COCO dataset show significant improvements in Average Precision (AP) of 1.8%, 0.9%, and 1.2% with 1%, 5%, and 10% labelled data, respectively. In addition, the Percentage of Correct Keypoints (PCK) improves by 0.5% on MPII&AIC, outperforming mainstream contrastive learning methods.
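
The negative cosine similarity loss mentioned in this abstract can be illustrated with a minimal sketch. The function names and the symmetric averaging over both views are assumptions made for illustration (a common choice in siamese pre-training); the abstract does not state how the two branches are combined.

```python
import math

def negative_cosine_similarity(p, z):
    """Return -cos(p, z) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def symmetric_loss(p1, z2, p2, z1):
    """Hypothetical symmetric form: average the loss over both prediction directions.

    p1, p2 stand for prediction-head outputs of the two views and
    z1, z2 for the feature vectors they try to predict (assumed names).
    """
    return 0.5 * negative_cosine_similarity(p1, z2) + 0.5 * negative_cosine_similarity(p2, z1)
```

The loss reaches its minimum of -1 when each prediction points in the same direction as the other view's feature vector, regardless of vector length.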

  • VACED-SIM: A Simulator for Scalability Prediction in Large-Scale Parallel Computing

    Yufei LIN  Xuejun YANG  Xinhai XU  Xiaowei GUO  

     
    PAPER-Computer System

      Vol:
    E96-D No:7
      Page(s):
    1430-1442

    Scaling up the system size has been the common approach to achieving high performance in parallel computing. However, designing and implementing a large-scale parallel system can be very costly in terms of money and time. When building a target system, it is desirable to initially build a smaller version by using processing nodes with the same architecture as those in the target system. This allows us to achieve efficient and scalable prediction by using the smaller system to predict the performance of the target system. Such scalability prediction is critical because it enables system designers to evaluate different design alternatives so that a certain performance goal can be successfully achieved. As the de facto standard for writing parallel applications, MPI is widely used in large-scale parallel computing. By categorizing the discrete event simulation methods for MPI programs and analyzing the characteristics of scalability prediction, we propose a novel simulation method, called virtual-actual combined execution-driven (VACED) simulation, to achieve scalable prediction for MPI programs. The basic idea is to predict the execution time of an MPI program on a target machine by running it on a smaller system, so that we can predict its communication time by virtual simulation and obtain its sequential computation time by actual execution. We introduce a model for the VACED simulation as well as the design and implementation of VACED-SIM, a lightweight simulator based on fine-grained activity and event definitions. We have validated our approach on a sub-system of Tianhe-1A. Our experimental results show that VACED-SIM exhibits higher accuracy and efficiency than MPI-SIM. In particular, for a target system with 1024 cores, the relative errors of VACED-SIM are less than 10% and the slowdowns are close to 1.

  • A Fast Algorithm for the Sound Projection Using Multiple Sources

    Yuan WEN  Woon-Seng GAN  Jun YANG  

     
    LETTER

      Vol:
    E88-A No:7
      Page(s):
    1765-1766

    An algorithm for the sound projection using multiple sources is presented. The source strength vector is obtained by using a fast estimation approach instead of the conventional eigenvalue decomposition (EVD) method. The computation load is therefore greatly reduced, which makes the algorithm more efficient in practical applications.

  • Strategies for an Acoustical-Hotspot Generation

    Yuan WEN  Jun YANG  Woon-Seng GAN  

     
    PAPER-Sound Field Reproduction

      Vol:
    E88-A No:7
      Page(s):
    1739-1746

    Two methods for hotspot generation using multiple sources, known as the time-delay (TD) method and the maximum-control-gain (MCG) method, are investigated in two typical acoustical fields, namely, the free field and a rectangular room. Based on the theoretical analysis and simulations, strategies are developed according to the sound field where the target region is defined. In the free field, the MCG method can be used if the performance in terms of control gain is the priority for an optimal control, whereas the TD method is preferable if simplicity of implementation is the first consideration. In a room environment, if a target region is defined in the near field where the direct sound dominates, the TD method is still effective. However, in the far field where the reverberant sound prevails, only the MCG method is applicable. The near field and far field can be roughly separated according to the critical distance from the sources in the room.

  • Target-Oriented Acoustic Radiation Generation Technique for Sound Field Control

    Yuan WEN  Jun YANG  Woon-Seng GAN  

     
    PAPER-Engineering Acoustics

      Vol:
    E89-A No:12
      Page(s):
    3671-3677

    A multiple-source system for rendering the sound pressure distribution in a target region can be modeled as a multi-input-multi-output (MIMO) system with the inputs being the source strengths and the outputs being the pressures on multiple measuring points/sensors. In this paper, we propose a target-oriented acoustic radiation generation technique (TARGET) for sound field control. For the MIMO system of a given geometry, a series of basic radiation modes, namely, target-oriented radiation modes (TORMs), can be derived using eigenvector analysis. Different TORMs have different contributions to the system control gain, which is defined as the ratio of the acoustic energy generated in the target zone to the transmitter output power. TARGET can be effectively applied to sound reproduction and suppression, which correspond to the generation of bright and dark zones, respectively. In acoustically bright zone generation and sound beamforming, the highest-gain TORM can be employed to determine the optimal source strengths. In active noise control, the strengths of the secondary sources can be derived using low-gain TORMs. Simulation results show that the proposed method performs better than or comparably to the traditional techniques.
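
As a rough illustration of the eigenvector analysis described in this abstract, the sketch below computes a gain of the form |Gq|² / |q|² for a transfer matrix G and finds the highest-gain source-strength vector by power iteration on GᵀG. Several things here are assumptions rather than the paper's formulation: G is taken as real-valued (acoustic transfer functions are complex in practice), |q|² is used as a stand-in for transmitter output power, and power iteration is swapped in for a full eigendecomposition.

```python
import math

def mat_vec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def control_gain(g, q):
    """Energy at the measuring points divided by |q|^2 (assumed power proxy)."""
    p = mat_vec(g, q)
    return sum(v * v for v in p) / sum(v * v for v in q)

def highest_gain_mode(g, iterations=200):
    """Power iteration for the dominant eigenvector of G^T G.

    Assumes G is non-zero and that the all-ones start vector is not
    orthogonal to the dominant eigenvector.
    """
    gt = transpose(g)
    q = [1.0] * len(g[0])
    for _ in range(iterations):
        q = mat_vec(gt, mat_vec(g, q))
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    return q
```

Under these assumptions, the gain of the returned vector equals the largest eigenvalue of GᵀG, and any other source-strength vector yields a gain no greater than that.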

  • Unique Shape Reconstruction Using Interreflections

    Jun YANG  Dili ZHANG  Noboru OHNISHI  Noboru SUGIE  

     
    PAPER-Image Processing,Computer Graphics and Pattern Recognition

      Vol:
    E81-D No:3
      Page(s):
    307-316

    We discuss the uniqueness of 3-D shape reconstruction of a polyhedron from a single shading image. First, we analytically show that multiple convex (and concave) shape solutions usually exist for a simple polyhedron if interreflections are not considered. Then we propose a new approach to uniquely determine the concave shape solution using interreflections as a constraint. An example, in which two convex and two concave shapes were obtained from a single shaded image for a trihedral corner, has been given by Horn. However, the number of solutions that exist for a general polyhedron was not described. We analytically show that multiple convex (and concave) shape solutions usually exist for a pyramid using a reflectance map, if interreflection distribution is not considered. However, if interreflection distribution is used as a constraint that limits the shape solution for a concave polyhedron, the polyhedral shape can be uniquely determined. Interreflections, which were considered to be deleterious in conventional approaches, are used as a constraint to determine the shape solution in our approach.

  • Parallelism Analysis of H.264 Decoder and Realization on a Coarse-Grained Reconfigurable SoC

    Gugang GAO  Peng CAO  Jun YANG  Longxing SHI  

     
    PAPER-Application

      Vol:
    E96-D No:8
      Page(s):
    1654-1666

    One of the largest challenges for coarse-grained reconfigurable arrays (CGRAs) is how to efficiently map applications. The key issues for mapping are (1) how to reduce the memory bandwidth, (2) how to exploit parallelism in algorithms and (3) how to achieve load balancing and take full advantage of the hardware potential. In this paper, we propose a novel parallelism scheme, called ‘Hybrid partitioning’, for mapping an H.264 high definition (HD) decoder onto REMUS-II, a CGRA system-on-chip (SoC). Combining good features of data partitioning and task partitioning, our methodology mainly consists of three levels from top to bottom: (1) hybrid task pipeline based on slice and macroblock (MB) level; (2) MB row-level data parallelism; (3) sub-MB level parallelism method. Further, on the sub-MB level, we propose a few mapping strategies such as hybrid variable block size motion compensation (Hybrid VBSMC) for MC, 2D-wave for intra 4×4, and a parallel processing order for deblocking. With our mapping strategies, we improved the algorithm's performance on REMUS-II. For example, with a luma 16×16 MB, the Hybrid VBSMC achieves 4 times greater performance than VBSMC and 2.2 times greater performance than the fixed 4×4 partition approach. Finally, we achieve 1080p@33fps H.264 high-profile (HiP)@level 4.1 decoding when the working frequency of REMUS-II is 200 MHz. Compared with typical hardware platforms, we can achieve better performance, area, and flexibility. For example, our performance is approximately 175% better than that of the commercial CGRA processor XPP-III while using only 70% of its area.

  • Hardware Software Co-design of H.264 Baseline Encoder on Coarse-Grained Dynamically Reconfigurable Computing System-on-Chip

    Hung K. NGUYEN  Peng CAO  Xue-Xiang WANG  Jun YANG  Longxing SHI  Min ZHU  Leibo LIU  Shaojun WEI  

     
    PAPER-Computer System

      Vol:
    E96-D No:3
      Page(s):
    601-615

    REMUS-II (REconfigurable MUltimedia System 2) is a coarse-grained dynamically reconfigurable computing system for multimedia and communication baseband processing. This paper proposes a real-time H.264 baseline profile encoder on REMUS-II. First, we propose an overall flow for mapping algorithms onto the REMUS-II platform and then illustrate it by implementing the H.264 encoder. Second, parallel and pipelining techniques are considered for fully exploiting the abundant computing resources of REMUS-II, thus increasing total computing throughput and addressing the high computational complexity of the H.264 encoder. In addition, some data-reuse schemes are used to increase the data-reuse ratio and therefore reduce the required data bandwidth. Third, we propose a scheduling scheme to manage run-time reconfiguration of the system. The scheduling is also responsible for synchronizing the data communication between tasks and handling conflicts between hardware resources. Experimental results show that the REMUS-MB (REMUS-II version for mobile applications) system can run a real-time H.264/AVC baseline profile encoder. The encoder can encode CIF@30 fps video sequences with two reference frames and a maximum search range of [-16,15]. The implementation can therefore be applied to handheld devices targeted at mobile multimedia applications. The REMUS-MB platform is designed and synthesized using TSMC 65 nm low power technology. The die size of REMUS-MB is 13.97 mm². REMUS-MB consumes, on average, about 100 mW while working at 166 MHz. To our knowledge, this is the first implementation of the H.264 encoding algorithm on a coarse-grained dynamically reconfigurable computing system in the literature.

  • Nonlinear Wave Propagation for a Parametric Loudspeaker

    Jun YANG  Kan SHA  Woon-Seng GAN  Jing TIAN  

     
    PAPER

      Vol:
    E87-A No:9
      Page(s):
    2395-2400

    A directional audible sound can be generated by an amplitude-modulated (AM) ultrasound wave from a parametric array. To synthesize audio signals produced by the self-demodulation effect of the AM sound wave, a quasi-linear analytical solution, which describes the nonlinear wave propagation, is developed for fast numerical evaluation. The radiated sound field is expressed as the superposition of Gaussian beams. Numerical results are presented for a rectangular parametric loudspeaker, which are in good agreement with previously published experimental data.

  • A Novel Fast-Lock-in Digitally Controlled Phase-Locked Loop

    Xin CHEN  Jun YANG  Long-xing SHI  

     
    LETTER-Integrated Electronics

      Vol:
    E91-C No:12
      Page(s):
    1971-1975

    A novel fast lock-in digitally controlled phase-locked loop (DCPLL) is proposed in this letter. This DCPLL adopts a novel frequency search algorithm to reduce the lock-in time. Furthermore, to reduce the power consumption, the frequency divider is reused as a frequency detector during the frequency acquisition, and reused as a time-to-digital converter module during the phase acquisition. To verify the proposed algorithm and architecture, a DCPLL design is implemented by SMIC 0.18 µm 1P6M CMOS technology. The Spice simulation results show that the DCPLL can achieve frequency acquisition in 3 reference cycles and complete phase acquisition in 11 reference cycles when locking to 200 MHz. The corresponding power consumption of DCPLL is 3.71 mW.

  • Shape and Reflectance of a Polyhedron from Interreflections by Two-Image Photometric Stereo

    Jun YANG  Noboru OHNISHI  Noboru SUGIE  

     
    LETTER

      Vol:
    E77-D No:9
      Page(s):
    1017-1021

    In this paper, we extend the two-image photometric stereo method to treat a concave polyhedron, and present an iterative algorithm to remove the influence of interreflections. With this method we can obtain the shape and reflectance of a concave polyhedron with perfectly diffuse (Lambertian) and unknown constant reflectance. Both simulation and experiment show the feasibility and accuracy of the method.

  • Analysis of Delay Characteristics in MPsLS Forwarding Scheme

    Jun YANG  Yasushi HIBINO  

     
    PAPER-Switching for Communications

      Vol:
    E89-B No:6
      Page(s):
    1738-1746

    The delay characteristics of MPsLS, a data forwarding scheme used for a core area of the integrated data service network, are discussed and analyzed. MPsLS has the capability of guaranteeing QoS on the per-flow level for time-sensitive applications while simultaneously maintaining high utilization of network resources. In the MPsLS core area, the forwarding process is implemented with a fine-grain slot synchronization model, and at the ingress edge nodes, the forwarding process is carried out with a coarse-grain frame synchronization model. The delay analyses are done according to three service models: the exact synchronization model, the less strict synchronization model for the appointed channels, and an asynchronous model for the filler channels. The authors give estimation equations for the mean edge-to-edge delay in an MPsLS network, and introduce an effective method to determine the reserved bandwidth for given application flows based on numerical calculations from the theoretical analysis and simple simulation results.

  • A Broadband Kalman Filtering Approach to Blind Multichannel Identification

    Yuanlei QI  Feiran YANG  Ming WU  Jun YANG  

     
    PAPER-Digital Signal Processing

      Vol:
    E102-A No:6
      Page(s):
    788-795

    Blind multichannel identification is useful in many applications. Although many approaches have been proposed to address this challenging problem, the adaptive filtering-based methods are attractive due to their computational efficiency and good convergence property. The multichannel normalized least mean-square (MCNLMS) algorithm is easy to implement, but it converges very slowly for a correlated input. The multichannel affine projection algorithm (MCAPA) was thus proposed to speed up the convergence. However, the convergence of the MCNLMS and MCAPA is still unsatisfactory in practice. In this paper, we propose a time-domain Kalman filtering approach to the blind multichannel identification problem. Specifically, the proposed adaptive Kalman filter is based on the cross relation method and also uses more past input vectors to exploit the decorrelation property. Simulation results indicate that the proposed method outperforms the MCNLMS and MCAPA significantly in terms of the initial convergence and tracking capability.
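
The cross relation that this abstract builds on can be checked numerically with a short sketch. When two channel outputs are x1 = s * h1 and x2 = s * h2 (with * denoting convolution), then x1 * h2 = x2 * h1, because convolution is commutative. The signal and impulse responses below are made-up illustrative values, and the sketch only verifies the identity; it is not the Kalman filter proposed in the paper.

```python
def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Hypothetical source signal and channel impulse responses
source = [1, -2, 3, 0, 1]
h1 = [1, 2]
h2 = [3, -1, 2]

# Channel outputs observed by the two sensors
x1 = convolve(source, h1)
x2 = convolve(source, h2)

# Cross relation: x1 * h2 - x2 * h1 is zero for the true channels
cross_relation_error = [u - v for u, v in zip(convolve(x1, h2), convolve(x2, h1))]
```

Adaptive methods based on this relation search for channel estimates that drive this error toward zero using only the observed outputs, without access to the source signal.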

  • Analysis of Jitter in CMOS Ring Oscillators due to Power Supply Noise

    Xiaoying DENG  Xin CHEN  Jun YANG  Jianhui WU  

     
    LETTER-Electronic Circuits

      Vol:
    E92-C No:7
      Page(s):
    973-975

    In this letter, a new analytical method is presented for estimating the timing jitter of CMOS ring oscillators due to power supply noise. A predictive jitter equation is presented, and the proposed method is used to study the jitter induced by power supply noise in an inverter-based ring oscillator, which is designed and simulated in the SMIC 0.13-µm standard CMOS process. A comparison between the results obtained by the proposed method and those obtained by HSPICE simulation confirms the accuracy of the predictive equation. Most of the errors between the theoretical calculation and the simulation results are less than 3 ps.

  • WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs

    Xinhai XU  Xuejun YANG  Yufei LIN  

     
    PAPER-Computer System

      Vol:
    E95-D No:3
      Page(s):
    786-796

    As supercomputers increase in size, the mean time between failures (MTBF) of a system becomes shorter, and the reliability problem of supercomputers becomes more and more serious. MPI is currently the de facto standard used to build high-performance applications, and research on fault tolerance methods for MPI is a perennially hot topic. However, due to the characteristics of MPI programs, most current checkpointing methods for MPI programs need to modify the MPI library (or even the operating system), or implement a complicated protocol by logging many messages. In this paper, we carry forward the idea of Application-Level Checkpointing (ALC). Based on the general fact that programmers are familiar with the communication characteristics of applications, we have developed BC-ALC, a new portable blocking coordinated ALC for MPI programs. BC-ALC neither modifies the MPI library (or the operating system) nor logs any message. It implements coordination using only Barrier operations instead of a complicated protocol. Furthermore, in order to reduce the cost of fault tolerance, we reduce the synchronization range of the barrier and design WBC-ALC, a weak blocking coordinated ALC that uses group synchronization instead of global synchronization, based on the communication relationship between processes. We also propose a fault-tolerance framework developed on top of WBC-ALC and discuss an implementation of it. Experimental results on NPB3.3-MPI benchmarks validate BC-ALC and WBC-ALC, and show that, compared with BC-ALC, the average coordination time and the average backup time of a single checkpoint in WBC-ALC are reduced by 44.5% and 5.7%, respectively.

  • Reconfiguration Process Optimization of Dynamically Coarse Grain Reconfigurable Architecture for Multimedia Applications

    Bo LIU  Peng CAO  Min ZHU  Jun YANG  Leibo LIU  Shaojun WEI  Longxing SHI  

     
    PAPER-Computer System

      Vol:
    E95-D No:7
      Page(s):
    1858-1871

    This paper presents a novel architecture design to optimize the reconfiguration process of a coarse-grained reconfigurable architecture (CGRA) called Reconfigurable Multimedia System II (REMUS-II). In REMUS-II, the tasks in multimedia applications are divided into two parts: computing-intensive tasks and control-intensive tasks. REMUS-II contains two Reconfigurable Processor Units (RPUs) for accelerating computing-intensive tasks and a Micro-Processor Unit (µPU) for accelerating control-intensive tasks. As a large-scale CGRA, REMUS-II can provide satisfying solutions in terms of both efficiency and flexibility. This feature makes REMUS-II well-suited for video processing, which demands high flexibility and involves many computation tasks. To meet the high dynamic reconfiguration performance required by multimedia applications, the reconfiguration architecture of REMUS-II must be well designed. To optimize the reconfiguration architecture of REMUS-II, a hierarchical configuration storage structure and a 3-stage reconfiguration processing structure are proposed. Furthermore, several optimization methods for configuration reuse are introduced to further improve the performance of the reconfiguration process. The optimization methods cover two aspects: the multi-target reconfiguration method and the configuration caching strategies. Experimental results show that the proposed reconfiguration architecture improves the performance of the reconfiguration process by 4 times. Based on RTL simulation, REMUS-II can support 1080p@32 fps H.264 HiP@Level4 and 1080p@40 fps High-level MPEG-2 stream decoding at a clock frequency of 200 MHz. The proposed REMUS-II system has been implemented on a TSMC 65 nm process. The die size is 23.7 mm² and the estimated on-chip dynamic power is 620 mW.

  • A GPS Bit Synchronization Method Based on Frequency Compensation

    Xinning LIU  Yuxiang NIU  Jun YANG  Peng CAO  

     
    PAPER-Navigation, Guidance and Control Systems

      Vol:
    E98-B No:4
      Page(s):
    746-753

    TTFF (Time-To-First-Fix) is an important indicator of GPS receiver performance, and must be reduced as much as possible. Bit synchronization is a precondition of positioning and therefore affects TTFF. Frequency error leads to power loss, which makes it difficult to find the bit edge. Conventional bit synchronization methods only work well when there is no or very small frequency error. The bit synchronization process is generally carried out after the pull-in stage, where the carrier loop is already stable. In this paper, a new bit synchronization method based on frequency compensation is proposed. By compensating the frequency error, the new method reduces the signal power loss caused by the accumulation of coherent integration. The performance of the new method in different frequency error scenarios is compared. The parameters in the proposed method are analyzed and optimized to reduce the computational complexity. Simulation results show that the new method has good performance when the frequency error is less than 25 Hz. Test results show that the new method can tolerate dynamic frequency errors, and it is possible to move the bit synchronization to the pull-in process to reduce the TTFF.
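
To give a feel for why frequency error causes power loss during coherent integration, the sketch below evaluates the standard sinc² attenuation model for a constant frequency offset. This model is a textbook assumption and is not taken from the paper; the function names are hypothetical, and the 20 ms integration time corresponds to one GPS L1 C/A navigation data bit.

```python
import math

GPS_BIT_PERIOD_S = 0.020  # one GPS L1 C/A navigation data bit lasts 20 ms

def coherent_power_loss(freq_error_hz, integration_time_s):
    """Fraction of signal power retained: sinc^2(pi * df * T) for a constant offset df."""
    x = math.pi * freq_error_hz * integration_time_s
    if x == 0.0:
        return 1.0
    return (math.sin(x) / x) ** 2

def loss_in_db(power_fraction):
    """Convert a retained-power fraction to decibels (negative means loss)."""
    return 10.0 * math.log10(power_fraction)

# Example: retained power after integrating over one data bit with a 25 Hz error
retained = coherent_power_loss(25.0, GPS_BIT_PERIOD_S)
```

Under this model, the retained power falls as the frequency error grows and reaches a null when the error equals the reciprocal of the integration time (50 Hz for a 20 ms integration).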

  • Robust Regularization for Enhanced Virtual Sound Imaging

    Jun YANG  Yew-Hin LIEW  Woon-Seng GAN  

     
    LETTER

      Vol:
    E86-A No:8
      Page(s):
    2061-2062

    This letter outlines a scheme to produce a wider robust bandwidth, with better approximations to the perfect reproduction of pre-recorded acoustic signals. A multi-parameter inverse filtering method is proposed for the virtual sound imaging system to improve robustness. The superiority of this new type of inverse filter is demonstrated on a 3-speaker system.