Hiromu MIYAZAKI Takuto KANAMORI Md Ashraful ISLAM Kenji KISE
RISC-V is an open, royalty-free, RISC-based instruction set architecture that has been under development since 2010 and can be used for cost-effective soft processors on FPGAs. The basic 32-bit integer instruction set in RISC-V is defined as RV32I, which is sufficient to support an operating system environment and suits embedded systems. In this paper, we propose an optimized RV32I soft processor named RVCoreP adopting five-stage pipelining. Three effective methods are applied to the processor to improve the operating frequency: instruction fetch unit optimization, ALU optimization, and data memory optimization. We implement RVCoreP in Verilog HDL and verify its behavior using Verilog simulation and an actual Xilinx Artix-7 FPGA board. We evaluate IPC (instructions per cycle), operating frequency, hardware resource utilization, and processor performance. From the evaluation results, we show that RVCoreP achieves a 30.0% performance improvement compared with VexRiscv, a high-performance, open-source RV32I processor selected from related work.
Piyumal RANAWAKA Mongkol EKPANYAPONG Adriano TAVARES Mathew DAILEY Krit ATHIKULWONGSE Vitor SILVA
Conventional sequential software processing on a general-purpose CPU has become insufficient for certain heavy computations because of the processing power needed to deliver adequate throughput and performance. Interest in high-performance real-time video processing on embedded systems is high. However, embedded processing platforms with limited performance can hardly meet the processing demand of such intensive computations in the computer vision domain. Hardware acceleration is therefore an attractive solution, in which process-intensive computations are accelerated using application-specific hardware integrated with a general-purpose CPU. In this research we focus on building a parallelized, high-performance, application-specific architecture for such a hardware accelerator for HOG-SVM computation, implemented on a Zynq 7000 FPGA. The Histogram of Oriented Gradients (HOG) technique combined with a Support Vector Machine (SVM) based classifier is versatile and extremely popular in the computer vision domain, despite its high demand for processing power. Because of this popularity and versatility, several previous studies have attempted to obtain adequate throughput on HOG-SVM. This research achieves a throughput of 240 FPS at a single scale on VGA frames of size 640x480, outperforming the best single-scale performance of previous research by approximately a factor of 3-4. Furthermore, it is approximately a 15x speedup over the GPU-accelerated software version with the same accuracy. This research has explored the possibility of using a novel architecture based on deep pipelining, parallel processing, and BRAM structures to achieve high performance on the HOG-SVM computation.
Further, the video processing unit (VPU) developed above, which acts as a hardware accelerator, will be integrated as a co-processing peripheral to a host CPU using a novel custom accelerator structure with on-chip buses in a System-on-Chip (SoC) fashion. This could be used to offload the heavy, redundant computations of video stream processing to the VPU, while the processing power of the CPU is preserved for running lightweight applications. This research mainly focuses on the architectural techniques used to achieve higher performance on the hardware accelerator and on the novel accelerator structure used to integrate the accelerator with the host CPU.
Eun-Sung JUNG Si LIU Rajkumar KETTIMUTHU Sungwook CHUNG
The scale of scientific data generated by experimental facilities and simulations in high-performance computing facilities has been proliferating with the emergence of Internet of Things (IoT)-based big data. In many cases, this data must be transmitted rapidly and reliably to remote facilities for storage, analysis, or sharing by IoT applications. To ensure its integrity, IoT data can be verified using a checksum after the data has been written to the disk at the destination. However, this end-to-end integrity verification inevitably creates overheads (extra disk I/O and more computation), so the overall data transfer time increases. In this article, we evaluate strategies to maximize the overlap between data transfer and checksum computation for astronomical observation data. Specifically, we examine file-level and block-level (with various block sizes) pipelining to overlap data transfer and checksum computation. We analyze these pipelining approaches in the context of GridFTP, a widely used protocol for scientific data transfers. Theoretical analysis and experiments are conducted to evaluate our methods. The results show that block-level pipelining is effective in maximizing this overlap, and can improve the overall data transfer time with end-to-end integrity verification by up to 70% compared to the sequential execution of transfer and checksum, and by up to 60% compared to file-level pipelining.
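The difference between sequential execution, file-level pipelining, and block-level pipelining can be illustrated with a simple two-stage timing model. This is only a sketch under stated assumptions, not the analysis from the article: the function names, the constant `transfer_rate` and `checksum_rate` parameters, and the idea that a unit can be checksummed as soon as it has fully arrived are all hypothetical simplifications.

```python
def sequential_time(units, transfer_rate, checksum_rate):
    """Transfer everything first, then checksum everything (no overlap)."""
    total = sum(units)
    return total / transfer_rate + total / checksum_rate


def pipelined_time(units, transfer_rate, checksum_rate):
    """Two-stage pipeline: unit i is checksummed while unit i+1 is transferred.

    `units` is a list of sizes (whole files or blocks); rates are in the same
    size unit per second. Both rates are assumed constant (an assumption).
    """
    transfer_done = 0.0   # time at which the current unit has fully arrived
    checksum_done = 0.0   # time at which the checksum stage becomes free
    for size in units:
        transfer_done += size / transfer_rate
        # The checksum of a unit starts once it has arrived AND the stage is free.
        checksum_done = max(checksum_done, transfer_done) + size / checksum_rate
    return checksum_done


def split_into_blocks(file_sizes, block_size):
    """Cut each file into fixed-size blocks (the last block may be shorter)."""
    blocks = []
    for size in file_sizes:
        full, rest = divmod(size, block_size)
        blocks.extend([block_size] * full)
        if rest:
            blocks.append(rest)
    return blocks


# Hypothetical workload: two 1000 MB files, 100 MB/s for both stages.
files = [1000, 1000]
t_seq = sequential_time(files, 100, 100)
t_file = pipelined_time(files, 100, 100)
t_block = pipelined_time(split_into_blocks(files, 100), 100, 100)
```

With these made-up numbers the model gives 40 s for sequential execution, 30 s for file-level pipelining, and 21 s for block-level pipelining with 100 MB blocks. Smaller units shorten the part of the checksum work that cannot overlap with transfer, which is the effect the abstract attributes to block-level pipelining.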
This paper proposes and analyzes a pipelining scheme for a hardware squarer that can square unsigned integers of up to 12 bits. Each stage is designed and adjusted such that stage delays are well balanced and that the critical path delay of the design does not exceed the reference value which is set up based on the analysis. The resultant design has the critical path delay of approximately 3.5 times a full-adder delay. In an implementation using an Intel Stratix V FPGA, the design operates at approximately 23% higher frequency than the comparable pipelined squarer provided in the Intel library.
Runzi ZHANG Jinlin WANG Yiqiang SHENG Xiao CHEN Xiaozhou YE
Cache affinity has been proved to have great impact on the performance of packet processing applications on multi-core platforms. Flow-based packet scheduling can make the best of data cache affinity with flow associated data and context structures. However, little work on packet scheduling algorithms has been conducted when it comes to instruction cache (I-Cache) affinity in modified pipelining (MPL) architecture for multi-core systems. In this paper, we propose a protocol-aware packet scheduling (PAPS) algorithm aiming at maximizing I-Cache affinity at protocol dependent stages in MPL architecture for multi-protocol processing (MPP) scenario. The characteristics of applications in MPL are analyzed and a mapping model is introduced to illustrate the procedure of MPP. Besides, a stage processing time model for MPL is presented based on the analysis of multi-core cache hierarchy. PAPS is a kind of flow-based packet scheduling algorithm and it schedules flows in consideration of both application-level protocol of flows and load balancing. Experiments demonstrate that PAPS outperforms the Round Robin algorithm and the HRW-based (HRW) algorithm for MPP applications. In particular, PAPS can eliminate all I-Cache misses at protocol dependent stage and reduce the average CPU cycle consumption per packet by more than 10% in comparison with HRW.
We present a hierarchical replicated state machine (H-RSM) and its corresponding consensus protocol D-Paxos for replication across multiple data centers in the cloud. Our H-RSM is based on the idea of parallel processing and aims to improve resource utilization. We detail D-Paxos and theoretically prove that D-Paxos implements an H-RSM. With batching and logical pipelining, D-Paxos efficiently utilizes the idle time caused by high-latency message transmission in a wide-area network and available bandwidth in a local-area network. Experiments show that D-Paxos provides higher throughput and better scalability than other Paxos variants for replication across multiple data centers. To predict the optimal batch sizes when D-Paxos reaches its maximum throughput, an analytical model is developed theoretically and validated experimentally.
Bing XU Shouyi YIN Leibo LIU Shaojun WEI
Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising platform because of their high performance and low cost. Researchers have developed efficient compilers for mapping compute-intensive applications onto CGRAs using modulo scheduling. To generate a loop kernel, every stage of the kernel is forced to have the same execution time, which is determined by the critical PE. Hence, non-critical PEs can decrease their supply voltage according to their slack time. The variable Dual-VDD CGRA incorporates this feature to reduce power consumption. Previous work mainly focuses on calculating a globally optimal VDDL using an overall optimization method that does not fully exploit the flexibility of the architecture. In this brief, we adopt a variable optimal VDDL for each stage of the kernel according to its pattern, instead of the fixed simulated globally optimal VDDL. Experiments show that our proposed heuristic approach can reduce power by 27.6% on average without decreasing performance. The compilation time is also acceptable.
This paper presents a prediction model based on historical data to achieve optimal values of pipelining, concurrency, and parallelism (PCP) in GridFTP data transfers in Cloud systems. Setting the correct values for these three parameters is crucial in achieving high throughput in end-to-end data movement. However, predicting and setting the optimal values for these parameters is a challenging task, especially in shared and unpredictable network conditions. Several factors can affect the optimal values for these parameters, such as the background network traffic, available bandwidth, Round-Trip Time (RTT), TCP buffer size, and file size. Existing models either fail to provide accurate predictions or come with very high prediction overheads. The author shows that the new model based on historical data can achieve high accuracy with low overhead.
Yunpyo HONG Juwon BYUN Youngjo KIM Jaeseok KIM
This letter proposes a pipelined architecture with prediction mode scheduling for high efficiency video coding (HEVC). The increased number of intra prediction modes in HEVC has introduced a new technique, named rough mode decision (RMD). This development, however, means that pipeline architectures for H.264 cannot be used in HEVC. The proposed scheme executes the RMD and the rate-distortion optimization (RDO) process simultaneously by grouping the intra prediction modes and changing the candidate selection method of the RMD algorithm. The proposed scheme reduces execution cycles by up to 26% with negligible coding loss.
The BCH code is one of the well-known error correction codes, and its decoding involves many operations in a Galois field. On ordinary processors, these operations require many instruction steps or a large memory area for look-up tables. While dedicated hardware BCH decoders achieve higher decoding speed than software, the advantage of software decoding is its flexibility to decode BCH codes of variable parameters. In this paper, an auxiliary circuit to be embedded in a pipelined processor is proposed, which accelerates software decoding of various BCH codes.
Anh-Tuan HOANG Katsuhiro YAMAZAKI Shigeru OYANAGI
The secure hash algorithm 512 (SHA-512), which is used to verify the integrity of a message, involves computational iterations on data. The huge computation delay generated in such iterations limits the entire throughput of the system and makes it difficult to pipeline the computation. We describe a way to pipeline the computation using fine-grained pipelining with balanced critical paths. In this method, one critical path is broken into two stages by using data forwarding. The other critical path is broken into three stages by using computation postponement. The resulting critical paths all have two adder-layers with some data movements, and thus are balanced. In addition, the method allows register reduction. The similarities between SHA-384 and SHA-512 are also used for a multi-mode design, which can generate a message digest for both versions with the same throughput and with only a small increase in hardware size. Experimental results show that our implementation achieved not only the best area performance rate (throughput divided by area), but also a higher throughput than almost all related work.
Shanghua GAO Hiroaki YOSHIDA Kenshu SETO Satoshi KOMATSU Masahiro FUJITA
In the deep-submicron era, interconnect delays are becoming one of the most important factors affecting performance in VLSI design. Many state-of-the-art studies in high-level synthesis try to consider the effect of interconnect delays. These studies indeed achieve better performance than traditional ones that ignore interconnect delays. When applications contain large loops, however, there is still much room to improve performance by exploiting parallelism. In this paper, we propose, for the first time, a method that utilizes pipelining techniques and takes interconnect delays into account together so as to improve the quality of high-level synthesis. The proposed method has the following two characteristics: 1) it separates the consideration of interconnect delay from computation delay, and allows concurrent data transfer and computation; 2) it belongs to the modulo scheduling framework, in the sense that all iterations have identical schedules and are initiated periodically. We evaluate our method from two different points of view. First, we compare our method with an existing interconnect-aware high-level synthesis method that does not utilize pipelining techniques, and the experimental results show that our method obtains about a 3.4 times performance improvement on average. Second, we compare our method with an existing pipeline synthesis method that does not consider interconnect delays, and the results show that our method obtains about a 1.5 times performance improvement on average. In addition, we evaluate our proposed architecture, and the experimental results demonstrate that it is better than the existing architecture in [1].
Yeu-Horng SHIAU Jer Min JOU Chin-Chi LIU
In this paper, two efficient VLSI architectures for the biorthogonal wavelet transform are proposed. One is constructed by the filter bank implementation and the other by the lifting scheme. In the filter bank implementation, due to the symmetric property of the biorthogonal wavelet transform, the proposed architecture uses fewer multipliers than the orthogonal wavelet transform. In addition, polyphase decomposition is adopted to speed up the processing by a factor of 2. In the lifting scheme implementation, the pipeline-scheduling technique is employed to optimize the architecture. Both architectures have the advantages of lower implementation complexity and higher throughput rate. Moreover, they can also be applied to realize the inverse DWT efficiently. Based on the above properties, the two architectures can be applied to time-critical image compression, such as JPEG2000. Finally, the architecture constructed by the lifting scheme is implemented as a single chip in 0.35 µm 1P4M CMOS technology, and its area and operating frequency are 5.005 × 5.005 mm² and 50 MHz, respectively.
Akihiko HYODO Masanori MUROYAMA Hiroto YASUURA
This paper presents a variable pipeline depth processor that can dynamically adjust its pipeline depth and operating voltage at run time, a technique we call dynamic pipeline and voltage scaling (DPVS), depending on the workload characteristics under timing constraints. The advantage of adjusting the pipeline depth is that it can eliminate the useless energy dissipation of additional stalls, or NOPs, and wrong-path instructions, which increase as the pipeline grows deeper than the inherent parallelism warrants. Although dynamic voltage scaling (DVS) is a very effective technique in itself for reducing energy dissipation, lowering the supply voltage also causes performance degradation. By combining it with dynamic pipeline scaling (DPS), it becomes possible to retain performance at the required level while reducing energy dissipation much further. Experimental results show the effectiveness of our DPVS approach for a variety of benchmarks, reducing total energy dissipation by up to 64.90%, with an average of 27.42%, without any effect on performance, compared with a processor using only DVS.
In this paper, a high-performance pipelining architecture for the 2-D inverse discrete wavelet transform (IDWT) is proposed. We use a tree-block pipeline-scheduling scheme to increase computation performance and reduce temporary buffers. The scheme divides the input subbands into several wavelet blocks and processes these blocks one by one, so the size of the buffers for storing temporary subbands is greatly reduced. After scheduling the data flow, we fold the computations of all wavelet blocks into the same low-pass and high-pass filters to achieve higher hardware utilization and minimize hardware cost, and pipeline these two filters efficiently to reach a higher throughput rate. For the computation of an N × N-sample 2-D IDWT with filter length K, our architecture takes at most (2/3)N² cycles and requires 2N(K-2) registers. In addition, each filter is designed regularly and modularly, so it is easily scalable for different filter lengths and different levels. Because of its small storage, regularity, and high performance, the architecture can be applied to time-critical image decompression.
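As a quick worked example of the cost figures quoted above, the following sketch evaluates them for concrete sizes. It reads the stated bounds as at most (2/3)·N^2 cycles and 2·N·(K-2) registers for an N-by-N image and filter length K. The function name `idwt_cost` and the sample values N = 512, K = 9 are hypothetical choices for illustration only.

```python
from fractions import Fraction


def idwt_cost(n, k):
    """Evaluate the quoted bounds for an n-by-n IDWT with filter length k.

    Returns (upper bound on cycles, number of registers). The cycle bound is
    kept as an exact fraction because (2/3) * n^2 need not be an integer.
    """
    cycles_bound = Fraction(2, 3) * n * n
    registers = 2 * n * (k - 2)
    return cycles_bound, registers


# Hypothetical example: a 512 x 512 image and 9-tap filters.
cycles, registers = idwt_cost(512, 9)
```

For these sample values the bound works out to 524288/3 (about 174,763) cycles and 7168 registers. Note that the cycle bound grows with the image area, while the register count grows only linearly in N.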
Dirk FIMMEL Jan MULLER Renate MERKER
We present a new approach to the loop scheduling problem, which improves on previous solutions in two important aspects: the resource constraints are formulated using flow graphs, and the initiation interval λ is treated as a rational variable. The approach supports heterogeneous processor architectures and pipelined functional units, and the Integer Linear Programming implementation produces an optimum loop schedule, whereby a minimum λ is achieved. Our flow graph model facilitates the cyclic binding of loop operations to functional units. Compared to previous research results, the solution can provide faster loop schedules and a significant reduction of the problem complexity and solution time.
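Why a rational initiation interval can give faster loop schedules is easiest to see from the usual resource-based lower bound on λ. The sketch below is not the flow-graph or ILP formulation of the paper; it only computes the textbook bound (operations per iteration divided by available units, maximized over resource types), assuming fully pipelined units that accept one operation per cycle. The function name and the dictionary layout are hypothetical.

```python
from fractions import Fraction
from math import ceil


def resource_bound(op_counts, unit_counts):
    """Resource-constrained lower bound on the initiation interval.

    op_counts[t]   -- operations of type t in one loop iteration
    unit_counts[t] -- available functional units of type t
    Assumes each unit is fully pipelined (one new operation per cycle).
    """
    return max(Fraction(op_counts[t], unit_counts[t]) for t in op_counts)


# Hypothetical loop body: 3 multiplications and 2 additions per iteration,
# scheduled on 2 multipliers and 2 adders.
ops = {"mul": 3, "add": 2}
units = {"mul": 2, "add": 2}

rational_bound = resource_bound(ops, units)   # iterations may start every 3/2 cycles
integer_bound = ceil(rational_bound)          # an integer interval must round up to 2
```

In this made-up case an integer initiation interval cannot go below 2 cycles, whereas a rational interval of 3/2 (two iterations started every three cycles) keeps the multipliers fully busy.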
This paper presents the design of a modulated complex lapped transform (MCLT) processor and its complex programmable logic device (CPLD) implementation. The MCLT is a 2x oversampled DFT filter bank; it performs well in applications that require a complex filter bank, such as noise reduction and acoustic echo cancellation. First, we show that the MCLT can be mapped to a Fast Fourier Transform (FFT). Then an efficient implementation for fast MCLT computation is realized on the CPLD hardware using pipelining techniques. A detailed circuit design for the MCLT processor is presented, as well as timing diagrams for design verification and performance evaluation.
This paper presents unified VLSI architectures that efficiently realize several widely used one-dimensional (1-D) and two-dimensional (2-D) real discrete trigonometric transforms (RDTT), including the discrete Hartley transform (DHT), discrete sine transform (DST), and discrete cosine transform (DCT). First, the succinct and unrestrictive Clenshaw's recurrence formula, along with the inherent symmetry of the trigonometric functions, is employed to derive efficient recurrences for computing these 1-D RDTT. By utilizing an appropriate row-column decomposition approach, the same set of recurrences can also be used to compute both the row transform and the column transform of the 2-D RDTT. Array architectures based on the developed recurrences are then introduced to implement these 1-D and 2-D RDTT. Both architectures provide substantial hardware savings compared with previous works. In addition, they are not only applicable to 1-D and 2-D RDTT of arbitrary size, but can also be easily adapted to compute all the aforementioned RDTT with only minor modifications. A complete set of input/output (I/O) buffers along with a bidirectional circular shift matrix is also described, enabling the architectures to operate in a fully pipelined manner and to reorder the transformed results into natural order. Moreover, the resulting architectures are highly regular, modular, and locally connected, and are thus amenable to VLSI implementation.
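To give a feel for how Clenshaw's recurrence turns a trigonometric sum into a short multiply-add loop, here is a minimal sketch for a plain cosine series, the sum over k of c_k·cos(kθ). It is the generic textbook form of the recurrence, not the specific recurrences derived in the paper for the DHT, DST, or DCT; the function names are hypothetical.

```python
import math


def clenshaw_cosine_sum(coeffs, theta):
    """Evaluate sum_k coeffs[k] * cos(k * theta) with Clenshaw's recurrence.

    Uses b_k = c_k + 2*cos(theta)*b_{k+1} - b_{k+2} with b_N = b_{N+1} = 0,
    and the closing step S = c_0 + cos(theta)*b_1 - b_2.
    `coeffs` must be non-empty. Only one cosine value is needed for the loop.
    """
    cos_theta = math.cos(theta)
    b1 = b2 = 0.0                       # b_{k+1} and b_{k+2}
    for c in reversed(coeffs[1:]):      # k = N-1 down to 1
        b1, b2 = c + 2.0 * cos_theta * b1 - b2, b1
    return coeffs[0] + cos_theta * b1 - b2


def direct_cosine_sum(coeffs, theta):
    """Reference evaluation that calls cos() once per term."""
    return sum(c * math.cos(k * theta) for k, c in enumerate(coeffs))


# Hypothetical coefficients and angle, used only to compare both evaluations.
sample = [1.0, 2.0, 3.0, 4.0]
recurrence_value = clenshaw_cosine_sum(sample, 0.3)
reference_value = direct_cosine_sum(sample, 0.3)
```

Each loop step needs one multiplication by a fixed constant and two additions, which hints at why such recurrences map well onto small, regular hardware cells.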
Akio HARADA Kiyoshi NISHIKAWA Hitoshi KIYA
In this paper, we propose two new pipelined adaptive digital filter architectures. The architectures are based on an equivalent expression of the least mean square (LMS) algorithm. It is shown that one of the proposed architectures achieves the minimum output latency, namely zero, without affecting the convergence characteristics. We also show that, by increasing the output latency by one, the other architecture, which has a shorter critical path, can be obtained.
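For reference, the standard (non-pipelined) LMS update that such architectures start from can be sketched as follows. This is the textbook algorithm, not the equivalent expression or the pipelined architectures proposed in the paper; the function name, the two-tap test system, and the step size are hypothetical.

```python
import random


def lms_identify(inputs, desired, num_taps, step_size):
    """Standard LMS adaptive FIR filter.

    For each sample: y = w . x, e = d - y, then w <- w + step_size * e * x.
    Returns the final weight vector.
    """
    weights = [0.0] * num_taps
    window = [0.0] * num_taps            # most recent input sample first
    for x, d in zip(inputs, desired):
        window = [x] + window[:-1]       # shift the new sample into the delay line
        y = sum(w * u for w, u in zip(weights, window))    # filter output
        e = d - y                                          # error signal
        weights = [w + step_size * e * u for w, u in zip(weights, window)]
    return weights


# Hypothetical noise-free system identification test.
rng = random.Random(0)
true_taps = [0.5, -0.3]
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
ds = [true_taps[0] * xs[n] + (true_taps[1] * xs[n - 1] if n > 0 else 0.0)
      for n in range(len(xs))]
estimate = lms_identify(xs, ds, num_taps=2, step_size=0.05)
```

In this loop the error computed from the current weights feeds straight back into the next weight update. That feedback is generally what makes an LMS filter hard to pipeline, and it is the kind of dependency a reformulated expression would have to work around.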
Katsushige MATSUBARA Kiyoshi NISHIKAWA Hitoshi KIYA
A pipelined adaptive digital filter (ADF) architecture based on a two-dimensional least mean square algorithm is proposed. This architecture enables the ADF to operate at a high clock rate and reduces the required amount of hardware. To achieve this reduction, we introduce a new building unit, called a block, and propose implementing the pipelined ADF using the block. Since the number of blocks in a cell is adjustable, we derive a condition for satisfying given specifications. We show that the smallest number of blocks and the corresponding delay can be determined by using the proposed method.