
Keyword Search Result

[Keyword] acceleration(52hit)

1-20hit(52hit)

  • GPU-Accelerated Estimation and Targeted Reduction of Peak IR-Drop during Scan Chain Shifting

    Shiling SHI  Stefan HOLST  Xiaoqing WEN  

     
    PAPER-Dependable Computing

      Publicized:
    2023/07/07
      Vol:
    E106-D No:10
      Page(s):
    1694-1704

    High power dissipation during scan test often causes undue yield loss, especially for low-power circuits. One major reason is that the resulting IR-drop in shift mode may corrupt test data. A common approach to solving this problem is partial-shift, in which multiple scan chains are formed and only one group of scan chains is shifted at a time. However, existing partial-shift based methods suffer from two major problems: (1) their IR-drop estimation is not accurate enough or computationally too expensive to be done for each shift cycle; (2) partial-shift is hence applied to all shift cycles, resulting in long test time. This paper addresses these two problems with a novel IR-drop-aware scan shift method, featuring: (1) Cycle-based IR-Drop Estimation (CIDE) supported by a GPU-accelerated dynamic power simulator to quickly find potential shift cycles with excessive peak IR-drop; (2) a scan shift scheduling method that generates a scan chain grouping targeted for each considered shift cycle to reduce the impact on test time. Experiments on ITC'99 benchmark circuits show that: (1) the CIDE is computationally feasible; (2) the proposed scan shift schedule can achieve a global peak IR-drop reduction of up to 47%. Its scheduling efficiency is 58.4% higher than that of an existing typical method on average, which means our method has less test time.

  • Adaptive Channel Scheduling for Acceleration and Fine Control of RNN-Based Image Compression

    Sang Hoon KIM  Jong Hwan KO  

     
    LETTER-Image

      Publicized:
    2023/06/13
      Vol:
    E106-A No:9
      Page(s):
    1211-1215

    The existing target-dependent scalable image compression network can control the target of the compressed images between the human visual system and the deep learning based classification task. However, its RNN-based structure controls the bit-rate through the number of iterations, where each iteration generates a fixed size of the bit stream. Therefore, a large number of iterations are required at the high BPP, and fine-grained image quality control is not supported at the low BPP. In this paper, we propose a novel RNN-based image compression model that can schedule the channel size per iteration, reducing the number of iterations at the high BPP and enabling fine-grained bit-rate control at the low BPP. To further enhance the efficiency, multiple network models for various channel sizes are combined into a single model using the slimmable network architecture. The experimental results show that the proposed method achieves comparable performance to the existing method with finer BPP adjustment, increases parameters by only 0.15%, and reduces the average amount of computation by 40.4%.

  • An Accuracy Reconfigurable Vector Accelerator based on Approximate Logarithmic Multipliers for Energy-Efficient Computing

    Lingxiao HOU  Yutaka MASUDA  Tohru ISHIHARA  

     
    PAPER

      Publicized:
    2022/09/02
      Vol:
    E106-A No:3
      Page(s):
    532-541

    The approximate logarithmic multiplier proposed by Mitchell provides an efficient alternative for processing dense multiplication or multiply-accumulate operations in applications such as image processing and real-time robotics. It offers the advantages of small area and high energy efficiency, and is suitable for applications that do not necessarily require high accuracy. However, its maximum error of 11.1% makes it challenging to deploy in applications requiring relatively high accuracy. This paper proposes a novel operand decomposition method (OD) that decomposes one multiplication into the sum of multiple approximate logarithmic multiplications to greatly reduce Mitchell multiplier errors while taking full advantage of its area savings. Based on the proposed OD method, this paper also proposes an accuracy reconfigurable multiply-accumulate (MAC) unit that provides multiple reconfigurable accuracies with high parallelism. Compared to a MAC unit consisting of accurate multipliers, the area is significantly reduced to less than half, improving the hardware parallelism while satisfying the required accuracy for various scenarios. The experimental results show the excellent applicability of our proposed MAC unit in image smoothing and robot localization and mapping applications. We have also designed a prototype processor that integrates the minimum functionality of this MAC unit as a vector accelerator and have implemented a software-level accuracy reconfiguration in the form of an instruction set extension. We experimentally confirmed the correct operation of the proposed vector accelerator, which provides different degrees of accuracy and parallelism at the software level.
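
The 11.1% figure quoted in the abstract above can be reproduced with a small software model of Mitchell's approximation. The sketch below is only an illustration under stated assumptions: it is not the authors' hardware design and does not implement their operand decomposition (OD) method. The names `mitchell_mul` and `worst_relative_error` are hypothetical, and floating-point arithmetic stands in for the shift-and-add datapath a real multiplier would use.

```python
def mitchell_mul(a: int, b: int) -> float:
    """Approximate a*b for non-negative integers with Mitchell's method.

    Each operand is written as 2**k * (1 + x) with 0 <= x < 1, and
    log2(1 + x) is approximated by x, so the logarithms can simply be added.
    """
    if a == 0 or b == 0:
        return 0.0
    k1 = a.bit_length() - 1          # position of the leading one bit of a
    k2 = b.bit_length() - 1          # position of the leading one bit of b
    x1 = a / (1 << k1) - 1.0         # fractional part of a, in [0, 1)
    x2 = b / (1 << k2) - 1.0         # fractional part of b, in [0, 1)
    s = x1 + x2
    if s < 1.0:
        # No carry out of the fraction: antilog is 2**(k1+k2) * (1 + s)
        return float(1 << (k1 + k2)) * (1.0 + s)
    # Carry into the exponent: antilog is 2**(k1+k2+1) * s
    return float(1 << (k1 + k2 + 1)) * s


def worst_relative_error(limit: int) -> float:
    """Largest relative error of mitchell_mul over 1 <= a, b < limit."""
    return max(
        (a * b - mitchell_mul(a, b)) / (a * b)
        for a in range(1, limit)
        for b in range(1, limit)
    )


if __name__ == "__main__":
    print(mitchell_mul(3, 3))          # approximation of 9
    print(worst_relative_error(256))   # close to 1/9
```

The worst case occurs when both fractional parts equal 0.5 (for example 3 x 3), where the approximation returns 8 instead of 9, a relative error of 1/9.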

  • An eFPGA Generation Suite with Customizable Architecture and IDE

    Morihiro KUGA  Qian ZHAO  Yuya NAKAZATO  Motoki AMAGASAKI  Masahiro IIDA  

     
    PAPER

      Publicized:
    2022/10/07
      Vol:
    E106-A No:3
      Page(s):
    560-574

    From edge devices to cloud servers, providing optimized hardware acceleration for specific applications has become a key approach to improve the efficiency of computer systems. Traditionally, many systems employ commercial field-programmable gate arrays (FPGAs) to implement a dedicated hardware accelerator as the CPU's co-processor. However, commercial FPGAs are designed in generic architectures and are provided in the form of discrete chips, which makes it difficult to meet increasingly diversified market needs, such as balancing reconfigurable hardware resources for a specific application, or integration into a customer's system-on-a-chip (SoC) in the form of an embedded FPGA (eFPGA). In this paper, we propose an eFPGA generation suite with customizable architecture and integrated development environment (IDE), which covers the entire eFPGA design generation, testing, and utilization stages. For the eFPGA design generation, our intellectual property (IP) generation flow can explore the optimal logic cell, routing, and array structures for given target applications. For the testability, we employ a previously proposed shipping test method that is 100% accurate at detecting all stuck-at faults in the entire FPGA-IP. In addition, we propose a user-friendly and customizable Web-based IDE framework for the generated eFPGA based on the NODE-RED development framework. In the case study, we show an eFPGA architecture exploration example for a differential privacy encryption application using the proposed suite. Then we show the implementation and evaluation of the eFPGA prototype with a 55nm test element group chip design.

  • Flow Processing Optimization with Accelerated Flow Actions on High Speed Programmable Data Plane

    Zhiyuan LING  Xiao CHEN  Lei SONG  

     
    PAPER-Network System

      Publicized:
    2022/08/10
      Vol:
    E106-B No:2
      Page(s):
    133-144

    With the development of network technology, next-generation networks must satisfy many new requirements for network functions and performance. The processing of overlong packet fields is one of the requirements and is also the basis for ID-based routing and content lookup, and packet field addition/deletion mechanisms. The current SDN switches do not provide good support for the processing of overlong fields. In this paper, we propose a series of optimization mechanisms for protocol-oblivious instructions, in which we address the problem of insufficient support for overlong data in existing SDN switches by extending the bit width of instructions and accelerating them using SIMD instruction sets. We also provide an intermediate representation of the protocol-oblivious instruction set to improve the efficiency of storing and reading instruction blocks, and further reduce the execution time of instruction blocks by preprocessing them. The experiments show that our approach improves the performance of overlong data processing by 56%. For instructions involving packet field addition and deletion, the improvement in performance reaches 455%. In normal forwarding scenarios, our solution reduces the packet forwarding latency by around 30%.

  • Resource Efficient Top-K Sorter on FPGA

    Binhao HE  Meiting XUE  Shubiao LIU  Feng YU  Weijie CHEN  

     
    LETTER-Digital Signal Processing

      Publicized:
    2022/03/02
      Vol:
    E105-A No:9
      Page(s):
    1372-1376

    Top-K sorting is a variant of sorting used heavily in applications such as database management systems. Recently, the use of field programmable gate arrays (FPGAs) to accelerate sorting operations has attracted the interest of researchers. However, existing hardware top-K sorting algorithms are either resource-intensive or of low throughput. In this paper, we present a resource-efficient top-K sorting architecture that is composed of L cascading sorting units, and each sorting unit is composed of P sorting cells. The K=PL largest elements are produced when a variable-length input sequence is processed. This architecture can operate at a high frequency while consuming fewer resources. The experimental results show that our architecture achieved a maximum 1.2x throughput-to-resource improvement compared to previous studies.
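
The relation K=PL in the abstract above can be illustrated with a simple software model of a cascade. This is only a sketch under assumptions that the abstract does not confirm: each unit is assumed to keep the P largest values it has seen and to forward whichever value it displaces (or rejects) to the next unit. The class `SortingUnit` and the function `top_k` are hypothetical names, and a heap stands in for the P hardware sorting cells.

```python
import heapq
from typing import Iterable, List, Optional


class SortingUnit:
    """Software stand-in for one sorting unit holding up to P values."""

    def __init__(self, p: int) -> None:
        self.p = p
        self.cells: List[int] = []   # min-heap of the values currently kept

    def push(self, value: int) -> Optional[int]:
        """Offer a value; return the value passed downstream, if any."""
        if len(self.cells) < self.p:
            heapq.heappush(self.cells, value)
            return None                       # absorbed, nothing forwarded
        if value > self.cells[0]:
            # Keep the new value and forward the smallest one held so far.
            return heapq.heapreplace(self.cells, value)
        return value                          # too small, forward unchanged


def top_k(stream: Iterable[int], p: int, l: int) -> List[int]:
    """Return the K = p * l largest elements of the stream, descending."""
    units = [SortingUnit(p) for _ in range(l)]
    for value in stream:
        forwarded: Optional[int] = value
        for unit in units:
            forwarded = unit.push(forwarded)
            if forwarded is None:
                break                         # value settled in this unit
        # A value forwarded past the last unit is simply discarded.
    kept = [v for unit in units for v in unit.cells]
    return sorted(kept, reverse=True)


if __name__ == "__main__":
    print(top_k([5, 1, 9, 3, 7, 2, 8, 6, 4, 0], p=2, l=2))
```

Because every input either stays in a unit or is forwarded, the first unit ends up with the P largest values, the second with the next P, and so on, which is why the cascade as a whole yields the P x L largest elements regardless of input length.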

  • Ray Tracing Acceleration using Rank Minimization for Radio Map Simulation

    Norisato SUGA  Ryohei SASAKI  

     
    LETTER-Digital Signal Processing

      Publicized:
    2022/02/22
      Vol:
    E105-A No:8
      Page(s):
    1157-1161

    In this letter, a ray tracing (RT) acceleration method based on rank minimization is proposed. RT is a general tool used to simulate wireless communication environments. However, the simulation is time consuming because of the large number of ray calculations. This letter focuses on radio map interpolation as an acceleration approach. Conventional methods cannot appropriately estimate the short-span variation caused by multipath fading. To overcome this shortcoming of the conventional methods, we adopt rank minimization based interpolation. A computational simulation using commercial RT software revealed that the interpolation accuracy of the proposed method was higher than those of other radio map interpolation methods and that RT simulation can be accelerated approximately fivefold at a missing rate of 0.8.

  • Acceleration of Automatic Building Extraction via Color-Clustering Analysis Open Access

    Masakazu IWAI  Takuya FUTAGAMI  Noboru HAYASAKA  Takao ONOYE  

     
    LETTER-Computer Graphics

      Vol:
    E103-A No:12
      Page(s):
    1599-1602

    In this paper, we improve upon the automatic building extraction method, which uses a variational inference Gaussian mixture model for performing color clustering, by accelerating its computational speed. The improved method decreases the computational time using an image with reduced resolution upon applying color clustering. According to our experiment, in which we used 106 scenery images, the improved method could extract buildings at a rate 86.54% faster than that of the conventional methods. Furthermore, the improved method significantly increased the extraction accuracy by 1.8% or more by preventing over-clustering using the reduced image, which also had a reduced number of the colors.

  • Representative Spatial Selection and Temporal Combination for 60fps Real-Time 3D Tracking of Twelve Volleyball Players on GPU

    Xina CHENG  Yiming ZHAO  Takeshi IKENAGA  

     
    PAPER-Image

      Vol:
    E102-A No:12
      Page(s):
    1882-1890

    Real-time 3D player tracking plays an important role in sports analysis, especially for the live services of sports broadcasting, which have a strict limitation on processing time. For these kinds of applications, 3D trajectories of players contribute to high-level game analysis such as tactic analysis and commercial applications such as TV contents. Thus a real-time implementation of 3D player tracking is expected. To achieve real-time processing of 60fps videos with high accuracy (that is, a processing time of less than 16.67ms per frame), the factors that limit the processing time of the target algorithm must be addressed: 1) the large image area of each player; 2) repeated processing of multiple players in multiple views; 3) the complex calculation of the observation algorithm. To deal with the above challenges, this paper proposes a representative spatial selection and temporal combination based real-time implementation for multi-view volleyball player tracking on the GPU device. First, the representative spatial pixel selection, which detects the pixels that best represent one image region to scale down the image spatially, reduces the number of pixels to be processed. Second, the representative temporal likelihood combination shares observation calculation by using the temporal correlation between images so that the number of complex calculations is reduced. The experiments are based on videos of the Final and Semi-Final Game of 2014 Japan Inter High School Games of Men's Volleyball in Tokyo Metropolitan Gymnasium. On the GPU device GeForce GTX 1080Ti, the tracking system achieves real-time performance on 60fps videos and keeps the tracking accuracy higher than 97%.

  • High Performance Application Specific Stream Architecture for Hardware Acceleration of HOG-SVM on FPGA

    Piyumal RANAWAKA  Mongkol EKPANYAPONG  Adriano TAVARES  Mathew DAILEY  Krit ATHIKULWONGSE  Vitor SILVA  

     
    PAPER

      Vol:
    E102-A No:12
      Page(s):
    1792-1803

    Conventional sequential processing in software on a general-purpose CPU has become insufficient for certain heavy computations because of the high processing power needed to deliver adequate throughput and performance. For many reasons, there is a high degree of interest in high-performance real-time video processing on embedded systems. However, embedded processing platforms with limited performance can hardly meet the processing demand of several such intensive computations in the computer vision domain. Therefore, hardware acceleration is an ideal solution, in which process-intensive computations are accelerated using application-specific hardware integrated with a general-purpose CPU. In this research we have focused on building a parallelized high-performance application-specific architecture for such a hardware accelerator for HOG-SVM computation implemented on a Zynq 7000 FPGA. The Histogram of Oriented Gradients (HOG) technique combined with a Support Vector Machine (SVM) based classifier is versatile and extremely popular in the computer vision domain despite its high demand for processing power. Because of this popularity and versatility, various previous studies have attempted to obtain adequate throughput on HOG-SVM. This research, with a high throughput of 240FPS on a single scale on VGA frames of size 640x480, outperforms the best-case single-scale performance of previous research by approximately a factor of 3-4. Further, it is an approximately 15x speedup over the GPU-accelerated software version with the same accuracy. This research has explored the possibility of using a novel architecture based on deep pipelining, parallel processing and BRAM structures for achieving high performance on the HOG-SVM computation. Further, the video processing unit (VPU) developed above, which acts as a hardware accelerator, will be integrated as a co-processing peripheral to a host CPU using a novel custom accelerator structure with on-chip buses in a System-on-Chip (SoC) fashion. This could be used to offload the heavy, redundant computations of video stream processing to the VPU, while the processing power of the CPU is preserved for running lightweight applications. This research mainly focuses on the architectural techniques used to achieve higher performance on the hardware accelerator and on the novel accelerator structure used to integrate the accelerator with the host CPU.

  • Acceleration Using Upper and Lower Smoothing Filters for Generating Oil-Film-Like Images

    Toru HIRAOKA  Kiichi URAHAMA  

     
    LETTER-Computer Graphics

      Publicized:
    2019/09/10
      Vol:
    E102-D No:12
      Page(s):
    2642-2645

    A non-photorealistic rendering method has been proposed for generating oil-film-like images from photographic images using a bilateral infra-envelope filter. The conventional method has the disadvantage that it takes a long time to process. We propose a method for generating oil-film-like images that runs faster than the conventional method. The proposed method uses an iterative process with upper and lower smoothing filters. To verify the effectiveness of the proposed method, we conduct experiments using the Lenna image. The experiments show that the proposed method runs faster than the conventional method.

  • A Deep Learning Approach to Writer Identification Using Inertial Sensor Data of Air-Handwriting

    Yanfang DING  Yang XUE  

     
    LETTER-Pattern Recognition

      Publicized:
    2019/07/18
      Vol:
    E102-D No:10
      Page(s):
    2059-2063

    To the best of our knowledge, there are few studies on air-handwriting character-level writer identification employing only acceleration and angular velocity data. In this paper, we propose a deep learning approach to writer identification using only inertial sensor data of air-handwriting. In particular, we separate different representations of the degrees of freedom (DoF) of air-handwriting to extract local dependency and interrelationship in different CNNs separately. Experiments on a public dataset achieve good average performance without any extra hand-designed feature extraction.

  • A Method for Smartphone Theft Prevention When the Owner Dozes Off Open Access

    Kouhei NAGATA  Yoshiaki SEKI  

     
    LETTER-Physical Security

      Publicized:
    2019/06/04
      Vol:
    E102-D No:9
      Page(s):
    1686-1688

    We propose a method for preventing smartphone theft when the owner dozes off. The owner of the smartphone wears a wristwatch type device that has an acceleration sensor and a vibration mode. This device detects when the owner dozes off. When the acceleration sensor in the smartphone detects an accident while dozing, the device vibrates. We implemented this function and tested its usefulness.

  • Bicycle Behavior Recognition Using 3-Axis Acceleration Sensor and 3-Axis Gyro Sensor Equipped with Smartphone

    Yuri USAMI  Kazuaki ISHIKAWA  Toshinori TAKAYAMA  Masao YANAGISAWA  Nozomu TOGAWA  

     
    PAPER-Intelligent Transport System

      Vol:
    E102-A No:8
      Page(s):
    953-965

    It becomes possible to prevent accidents beforehand by predicting dangerous riding behavior based on recognition of bicycle behaviors. In this paper, we propose a bicycle behavior recognition method using a three-axis acceleration sensor and three-axis gyro sensor equipped with a smartphone when it is installed on a bicycle handlebar. We focus on the periodic handlebar motions for balancing while running a bicycle and reduce the sensor noises caused by them. After that, we use machine learning for recognizing the bicycle behaviors, effectively utilizing the motion features in bicycle behavior recognition. The experimental results demonstrate that the proposed method accurately recognizes the four bicycle behaviors of stop, run straight, turn right, and turn left and its F-measure becomes around 0.9. The results indicate that, even if the smartphone is installed on the noisy bicycle handlebar, our proposed method can recognize the bicycle behaviors with almost the same accuracy as the one when a smartphone is installed on a rear axle of a bicycle on which the handlebar motion noises can be much reduced.

  • Fast Superpixel Segmentation via Boundary Sampling and Interpolation

    Li XU  Bing LUO  Mingming KONG  Bo LI  Zheng PEI  

     
    LETTER-Image Recognition, Computer Vision

      Publicized:
    2019/01/22
      Vol:
    E102-D No:4
      Page(s):
    871-874

    This letter proposes a fast superpixel segmentation method based on boundary sampling and interpolation. The basic idea is as follows: instead of labeling local region pixels, we estimate the superpixel boundary by interpolating candidate boundary pixels from a segmentation of the down-sampled image. On the one hand, there exists high spatial redundancy within each local region, which can be discarded. On the other hand, we estimate the labels of candidate boundary pixels by sampling the superpixel boundary within the corresponding neighbourhood. Benefiting from the reduction in candidate pixel distance calculations, the proposed method significantly accelerates superpixel segmentation. Experiments on the BSD500 benchmark demonstrate that our method needs half the time of state-of-the-art methods with almost no reduction in accuracy.

  • Proposed Hyperbolic NILT Method — Acceleration Techniques and Two-Dimensional Expansion for Electrical Engineering Applications

    Nawfal AL-ZUBAIDI R-SMITH  Lubomír BRANČÍK  

     
    PAPER-Numerical Analysis and Optimization

      Vol:
    E101-A No:5
      Page(s):
    763-771

    Numerical inverse Laplace transform (NILT) methods are potential methods for time domain simulations, for instance the analysis of the transient phenomena in systems with lumped and/or distributed parameters. This paper proposes a numerical inverse Laplace transform method based originally on hyperbolic relations. The method is further enhanced by properly adapting several convergence acceleration techniques, namely, the epsilon algorithm of Wynn, the quotient-difference algorithm of Rutishauser and the Euler transform. The resulting accelerated models are compared in terms of their accuracy and computational efficiency. Moreover, an expansion to two dimensions is presented for the first time in the context of the accelerated hyperbolic NILT method, followed by the error analysis. The expansion is done by repeated application of one-dimensional partial numerical inverse Laplace transforms. A detailed static error analysis of the resulting 2D NILT is performed to prove the effectiveness of the method. The work is followed by a practical application of the 2D NILT method to simulate voltage/current distributions along a transmission line. The method and application are programmed using the Matlab language.
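
Of the convergence acceleration techniques named in the abstract above, Wynn's epsilon algorithm is compact enough to sketch. The Python code below is the textbook scalar form of the algorithm applied to a list of partial sums; how the paper adapts it to the hyperbolic NILT series is not described in the abstract, so this should be read as a generic illustration rather than the authors' implementation. The function name `wynn_epsilon` is hypothetical, and the alternating series for ln 2 is used only as a familiar slowly converging test case.

```python
from typing import List, Sequence


def wynn_epsilon(partial_sums: Sequence[float]) -> float:
    """Estimate the limit of a sequence with Wynn's epsilon algorithm.

    Uses the recurrence
        eps[k+1][j] = eps[k-1][j+1] + 1 / (eps[k][j+1] - eps[k][j])
    with eps[-1] = 0 and eps[0] = partial sums. Only even-numbered
    columns are estimates of the limit. The input must be non-empty.
    """
    n = len(partial_sums)
    prev: List[float] = [0.0] * (n + 1)       # column k-1 (starts as eps[-1])
    curr: List[float] = list(partial_sums)    # column k   (starts as eps[0])
    best = curr[-1]
    for k in range(1, n):
        nxt: List[float] = []
        for j in range(len(curr) - 1):
            diff = curr[j + 1] - curr[j]
            if diff == 0.0:
                return best                   # sequence has stopped changing
            nxt.append(prev[j + 1] + 1.0 / diff)
        prev, curr = curr, nxt
        if k % 2 == 0:
            best = curr[-1]                   # even column: accelerated value
    return best


if __name__ == "__main__":
    # Partial sums of 1 - 1/2 + 1/3 - ... (converges slowly to ln 2)
    sums, total = [], 0.0
    for i in range(1, 12):
        total += (-1) ** (i + 1) / i
        sums.append(total)
    print(sums[-1], wynn_epsilon(sums))
```

With the same eleven terms, the raw partial sum is still several percent away from ln 2, while the accelerated estimate is far closer, which is the kind of gain that motivates applying such transformations to slowly converging series.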

  • Multi-Peak Estimation for Real-Time 3D Ping-Pong Ball Tracking with Double-Queue Based GPU Acceleration

    Ziwei DENG  Yilin HOU  Xina CHENG  Takeshi IKENAGA  

     
    PAPER-Machine Vision and its Applications

      Publicized:
    2018/02/16
      Vol:
    E101-D No:5
      Page(s):
    1251-1259

    3D ball tracking is of great significance in ping-pong game analysis and can be utilized in applications such as TV contents and tactic analysis, some of which require real-time implementation. This paper proposes a CPU-GPU platform based Particle Filter for multi-view ball tracking including 4 proposals. The multi-peak estimation and the ball-like observation model are proposed in the algorithm design. The multi-peak estimation aims at obtaining a precise ball position in case the particles' likelihood distribution has multiple peaks under complex circumstances. The ball-like observation model, with 4 different likelihood evaluations, utilizes the ball's unique features to evaluate the particle's similarity with the target. In the GPU implementation, the double-queue structure and the vectorized data combination are proposed. The double-queue structure aims at achieving task parallelism between some data-independent tasks. The vectorized data combination reduces the time cost of memory access by combining 3 different image data into 1 vector data. Experiments are based on ping-pong videos recorded in an official match by 4 cameras located in the 4 corners of the court. The tracking success rate reaches 99.59% on CPU. With the GPU acceleration, the time consumption is 8.8 ms/frame, which is a speedup by a factor of 98 compared with the CPU version.

  • Hardware Accelerated Marking for Mark & Sweep Garbage Collection

    Shinji KAWAMURA  Tomoaki TSUMURA  

     
    PAPER-Computer System

      Publicized:
    2018/01/15
      Vol:
    E101-D No:4
      Page(s):
    1107-1115

    Many mobile systems need to achieve both high performance and low memory usage, and the total performance of such systems can be largely affected by the effectiveness of GC. Hence, the recent popularization of mobile devices makes GC performance play an important role on a wide range of platforms. The degradation in response performance caused by suspending all processes for GC has been a well-known potential problem. Therefore, GC algorithms have been actively studied and improved, but they still have not reached any fundamental solution. In this paper, we focus on the point that the same objects are redundantly marked during the GC procedure implemented on DalvikVM, which is one of the best-known runtime environments for mobile devices. We then propose a hardware support technique for improving the marking routine of GC. We installed a set of tables in a processor for managing marked objects, and redundant marking of already marked objects can be omitted by referring to these tables. The result of the simulation experiment shows that the percentage of redundant marking is reduced by more than 50%.

  • Enabling FPGA-as-a-Service in the Cloud with hCODE Platform

    Qian ZHAO  Motoki AMAGASAKI  Masahiro IIDA  Morihiro KUGA  Toshinori SUEYOSHI  

     
    PAPER-Design Methodology and Platform

      Publicized:
    2017/11/17
      Vol:
    E101-D No:2
      Page(s):
    335-343

    Major cloud service providers, including Amazon and Microsoft, have started employing field-programmable gate arrays (FPGAs) to build high-performance and low-power-consumption cloud capability. However, utilizing an FPGA-enabled cloud is still challenging for two main reasons. First, the introduction of software and hardware co-design leads to high development complexity. Second, FPGA virtualization and accelerator scheduling techniques are not fully researched for cluster deployment. In this paper, we propose an open-source FPGA-as-a-service (FaaS) platform, the hCODE, to simplify the design, management and deployment of FPGA accelerators at cluster scale. The proposed platform implements a Shell-and-IP design pattern and an open accelerator repository to reduce design and management costs of FPGA projects. Efficient FPGA virtualization and accelerator scheduling techniques are proposed to deploy accelerators on the FPGA-enabled cluster easily. With the proposed hCODE, hardware designers and accelerator users can be organized on one platform to efficiently build an open-hardware ecosystem.

  • Accelerated Widely-Linear Signal Detection by Polynomials for Over-Loaded Large-Scale MIMO Systems

    Qian DENG  Li GUO  Chao DONG  Jiaru LIN  Xueyan CHEN  

     
    PAPER-Antennas and Propagation

      Publicized:
    2017/07/13
      Vol:
    E101-B No:1
      Page(s):
    185-194

    In this paper, we propose a low-complexity widely-linear minimum mean square error (WL-MMSE) signal detection based on the Chebyshev polynomials accelerated symmetric successive over relaxation (SSORcheb) algorithm for uplink (UL) over-loaded large-scale multiple-input multiple-output (MIMO) systems. The technique of utilizing Chebyshev acceleration not only speeds up the convergence rate significantly, and maximizes the data throughput, but also reduces the cost. By utilizing the random matrix theory, we present good estimates for the Chebyshev acceleration parameters of the proposed signal detection in real large-scale MIMO systems. Simulation results demonstrate that the new WL-SSORcheb-MMSE detection not only outperforms the recently proposed linear iterative detection, and the optimal polynomial expansion (PE) WL-MMSE detection, but also achieves a performance close to the exact WL-MMSE detection. Additionally, the proposed detection offers superior sum rate and bit error rate (BER) performance compared to the precision MMSE detection with substantially fewer arithmetic operations in a short coherence time. Therefore, the proposed detection can satisfy the high-density and high-mobility requirements of some of the emerging wireless networks, such as, the high-mobility Internet of Things (IoT) networks.
