Binhao HE Meiting XUE Shubiao LIU Feng YU Weijie CHEN
Top-K sorting is a variant of sorting used heavily in applications such as database management systems. Recently, the use of field programmable gate arrays (FPGAs) to accelerate sorting operations has attracted the interest of researchers. However, existing hardware top-K sorting algorithms are either resource-intensive or low in throughput. In this paper, we present a resource-efficient top-K sorting architecture composed of L cascaded sorting units, each of which is composed of P sorting cells. The K = PL largest elements are produced when a variable-length input sequence is processed. This architecture can operate at a high frequency while consuming fewer resources. The experimental results show that our architecture achieves up to a 1.2x improvement in throughput-to-resource ratio compared to previous studies.
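The relationship K = PL can be illustrated with a small software model. The sketch below is only a behavioural approximation written for illustration, not the hardware design described in the abstract: the function name `top_k_cascade` and the use of a min-heap inside each unit are assumptions. Each unit keeps its P largest values and hands everything else to the next unit, so the L units together hold the K = PL largest elements of the stream.

```python
import heapq


def top_k_cascade(stream, P, L):
    """Behavioural model (assumed, not the paper's circuit):
    L cascaded units, each keeping at most P values."""
    units = [[] for _ in range(L)]  # each unit is a min-heap of size <= P
    for x in stream:
        carry = x
        for heap in units:
            if len(heap) < P:
                # The unit still has a free cell: store the value and stop.
                heapq.heappush(heap, carry)
                carry = None
                break
            # Full unit: keep the P largest, pass the smallest one downstream.
            carry = heapq.heappushpop(heap, carry)
        # A value that falls out of the last unit is simply discarded.
    kept = [v for heap in units for v in heap]
    return sorted(kept, reverse=True)


if __name__ == "__main__":
    print(top_k_cascade([5, 1, 9, 3, 7, 2, 8, 6, 4, 0], P=2, L=2))
```

With P = 2 and L = 2 the model keeps the four largest values of the input, whatever the input length.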
Wen SHAO Rei KAWAKAMI Takeshi NAEMURA
Previous studies on anomaly detection in videos have trained detectors to perform reconstruction and prediction tasks on normal data, so that frames on which task performance is low are detected as anomalies during testing. This paper proposes a new approach that involves sorting video clips using a generative network structure. Our approach learns spatial contexts from appearances and temporal contexts from the order relationship of the frames. Experiments were conducted on four datasets, and we categorized the anomalous sequences by appearance and motion. Evaluations were conducted not only on each whole dataset but also on each of the categories. Our method improved detection performance both on anomalies whose appearance differs from normality and on anomalies whose motion differs from normality. Moreover, combining our approach with a prediction method produced improvements in precision at high recall.
Computing the Lempel-Ziv factorization (LZ77) of a string is one of the most important problems in computer science. It is widely used in applications such as data compression, text indexing, and pattern discovery, and has become the heart of many file compressors such as gzip and 7zip. In this paper, we present a linear-time algorithm called Xone for computing the LZ77 factorization. Xone has the same space requirement as BGone, the previous most space-efficient linear-time LZ77 factorization algorithm, and greatly improves on its efficiency. Experiments show that the two versions of Xone, XoneT and XoneSA, are about 27% and 31% faster than BGoneT and BGoneSA, respectively.
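For readers unfamiliar with the factorization itself, the following sketch computes it naively. It is not Xone or BGone and runs in quadratic time or worse; the function name `lz77_factorize` is hypothetical. Each factor is either the longest prefix of the remaining text that also starts at an earlier position (overlap allowed), or a single new character.

```python
def lz77_factorize(s):
    """Naive LZ77 factorization (illustrative only, not linear time).
    Returns the factors as a list of substrings of s."""
    factors = []
    n = len(s)
    i = 0
    while i < n:
        best_len = 0
        # Try every earlier starting position j and extend the match greedily.
        for j in range(i):
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            best_len = max(best_len, length)
        if best_len == 0:
            # Character not seen before: it forms a factor on its own.
            factors.append(s[i])
            i += 1
        else:
            factors.append(s[i:i + best_len])
            i += best_len
    return factors


if __name__ == "__main__":
    print(lz77_factorize("abababab"))
```

Concatenating the factors always reproduces the input string, which is a convenient sanity check for any faster implementation.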
Data sorting is an important operation in computer science. It is extensively used in applications such as databases and searching. While high-performance sorting accelerators are in demand, it is very important to pay attention to the hardware resources such sorters consume. In this paper, we propose three FPGA-based architectures that accelerate sorting based on the merge sort algorithm. We call our proposals WMS (Wide Merge Sorter), EHMS (Efficient Hardware Merge Sorter), and EHMSP (Efficient Hardware Merge Sorter Plus). We target the Virtex UltraScale FPGA device. Evaluation results show that our proposed merge sorters are both high-performance and cost-effective. While using far fewer hardware resources, they achieve higher performance than the state-of-the-art. For instance, when 256 sorted records are produced per cycle, implementation results show that EHMS requires only about 66% of the flip-flops (FFs) and 79% of the look-up tables (LUTs) of the state-of-the-art merge sorter. Moreover, while requiring fewer hardware resources, EHMS achieves about 1.4x higher throughput than the state-of-the-art merge sorter. For the same number of produced records, the proposed WMS also achieves about 1.6x higher throughput than the state-of-the-art while requiring about 81% of the FFs and 76% of the LUTs needed by the state-of-the-art sorter.
Takahiro OTA Hiroyoshi MORITA Akiko MANADA
This paper proposes two variants of improved Compression by Substring Enumeration (CSE) with a finite alphabet. In previous studies on CSE, an encoder utilizes inequalities that bound the number of occurrences of a substring or a minimal forbidden word (MFW) to be encoded. The inequalities are derived from a contingency table that includes the number of occurrences of a substring or an MFW. The codeword length of a substring or an MFW grows with the difference between the upper and lower bounds deduced from the inequalities; however, the lower bound is not tight. Therefore, we derive a new tight lower bound based on the contingency table and propose a new CSE algorithm using the new inequality. We also propose a new encoding order of substrings and MFWs based on a sorted contingency table in which both the row and column marginal totals are sorted in descending order, instead of the lexicographical order used in previous studies. We then propose a new CSE algorithm, which is the first to use the new encoding order. Experimental results show that the compression ratios of the proposed algorithms are better than those of a previous study on CSE with a finite alphabet for all files of the Calgary corpus. Moreover, the compression ratio of the second proposed CSE is better than or equal to that of a well-known compressor for 11 of the 14 files in the corpus.
Meiting XUE Huan ZHANG Weijun LI Feng YU
Sorting is one of the most fundamental problems in mathematics and computer science. Because high-throughput and flexible sorting is a key requirement in modern databases, this paper presents efficient techniques for designing a high-throughput sorting matrix that supports continuous data sequences. There have been numerous studies on the optimization of sorting circuits on FPGA (field-programmable gate array) platforms. These studies focused on attaining high throughput for a single command with fixed data width. However, the architectures proposed do not meet the requirement of diversity for database data types. A sorting matrix architecture is thus proposed to overcome this problem. Our design consists of a matrix of identical basic sorting cells. The sorting cells work in a pipeline and in parallel, and the matrix can simultaneously process multiple data streams, which can be combined into a high-width single-channel data stream or low-width multiple-channel data streams. It can handle continuous sequences and allows for sorting variable-length data sequences. Its maximum throughput is approximately 1.4 GB/s for 32-bit sequences and approximately 2.5 GB/s for 64-bit sequences on our platform.
Runze WU Jiajia ZHU Liangrui TANG Chen XU Xin WU
Deploying low power nodes (LPNs), which reuse the spectrum licensed to a macrocell network, is considered a promising way to significantly boost network capacity. Because the spectrum is shared, the deployment of LPNs can cause severe interference, including intra-tier interference among dense LPNs and inter-tier interference between LPNs and the macro base station (MBS), which strongly affects system performance. In this paper, we investigate a spectrum-sharing approach in the downlink for two-tier networks, which consist of small cells (SCs) with several LPNs and a macrocell with an MBS, aiming to mitigate the interference and improve the capacity of the SCs. The spectrum-sharing approach is formulated as a multi-objective optimization problem. The problem is solved by the nondominated sorting genetic algorithm version II (NSGA-II), and the simulations show that the proposed spectrum-sharing approach is superior to the existing one.
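The "nondominated sorting" step that gives NSGA-II its name can be sketched as follows. This is a deliberately simple front-peeling version for illustration, assuming all objectives are minimised; it yields the same fronts as NSGA-II's fast nondominated sort but without its bookkeeping, and it has no connection to the spectrum-sharing objectives of the paper. The function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives minimised, by assumption)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def nondominated_fronts(points):
    """Partition point indices into Pareto fronts by repeated peeling."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        # A point belongs to the current front if no remaining point dominates it.
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


if __name__ == "__main__":
    pts = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
    print(nondominated_fronts(pts))
```

In NSGA-II, the front index obtained this way serves as the primary selection criterion, with crowding distance used to break ties within a front.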
Sorting is an extremely important computation kernel that has been accelerated in many fields such as databases, image processing, and genome analysis. With the advent of the Internet of Things (IoT) era driven by progress in mobile technology, the future needs a sorting method that is available in any environment, not only on high-performance systems such as servers but also on machines with low computational performance such as embedded systems. In this paper, we present an FPGA-based sorting accelerator that combines a Sorting Network and a Merge Sorter Tree and can be customized by tuning design parameters. The proposed FPGA accelerator sorts data sent from a host PC via the PCIe bus and sends the fully sorted data sequence back to it. We also present a detailed analytical model that accurately estimates the sorting performance. With these characteristics, designers can know in advance how fast a given sorting hardware design will be and can implement the one that best fulfills the cost and performance constraints. Our experiments show that the proposed hardware achieves up to 19.5x the sorting performance of an Intel Core i7-3770K operating at 3.50 GHz when sorting 256M 32-bit integer elements. However, this result is limited by insufficient memory bandwidth. To overcome this problem, we propose a data compression mechanism; the experimental results show that the sorting hardware with it achieves almost 90% of the estimated performance, while the hardware without it achieves about 60%. To allow every designer to use this accelerator easily and freely, the RTL source code is released as open-source hardware.
Tingting CHEN Weijun LI Feng YU Qianjian XING
A modular serial pipelined sorting architecture for continuous input sequences is presented. It supports continuous sequences, whose lengths can be dynamically changed, and does so using a very simple control strategy. It consists of identical serial cascaded sorting cells, and lends itself to high frequency implementation with any number of sorting cells, because both data and control signals are pipelined. With L cascaded sorting cells, it produces a fully sorted result for sequences whose length N is equal to or less than L+1; for longer sequences, the largest L elements are sorted out. Being modularly designed, several independent smaller sorters can be dynamically configured to form a larger sorter.
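The claim about sequence lengths N ≤ L+1 can be checked with a small software model. The sketch below is an assumption-laden illustration, not the pipelined circuit: the name `cascade_sort` and the compare-and-swap rule (each cell keeps the larger value and forwards the smaller one) are chosen here for clarity.

```python
def cascade_sort(seq, L):
    """Model of L cascaded cells (assumed behaviour): each cell keeps the
    larger of its stored value and the incoming one, and forwards the smaller.
    Returns (cell contents, values that left the last cell)."""
    cells = [None] * L
    dropped = []
    for x in seq:
        carry = x
        for i in range(L):
            if cells[i] is None:
                # Empty cell: store the value and stop propagating.
                cells[i] = carry
                carry = None
                break
            if carry > cells[i]:
                # Keep the larger value in the cell, pass the smaller onward.
                cells[i], carry = carry, cells[i]
        if carry is not None:
            dropped.append(carry)
    return [c for c in cells if c is not None], dropped


if __name__ == "__main__":
    print(cascade_sort([4, 1, 3, 2], L=3))
```

For N = L+1 exactly one value leaves the last cell, and it is the minimum, so the cell contents followed by that value form the fully sorted sequence. For longer inputs the cells hold only the largest L elements in descending order.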
Pyung KIM Younho LEE Hyunsoo YOON
In this paper, we present a faster (wall-clock time) sorting method for numerical data subjected to fully homomorphic encryption (FHE). Owing to circuit-based construction and the FHE security property, most existing sorting methods cannot be applied to encrypted data without significantly compromising efficiency. The proposed algorithm utilizes the cryptographic single-instruction multiple-data (SIMD) operation, which is supported by most existing FHE algorithms, to reduce the computational overhead. We conducted a careful analysis of the number of required recryption operations, which are the computationally dominant operations in FHE. Accordingly, we verified that the proposed SIMD-based sorting algorithm completes the given task more quickly than existing sorting methods if the number of data items and (or) the maximum bit length of each data item exceed specific thresholds.
Atsuki NAGAO Kazuhisa SETO Junichi TERUYAMA
We propose efficient algorithms for Sorting k-Sets in Bins. The Sorting k-Sets in Bins problem can be described as follows. We are given n numbered bins with k balls in each bin. Balls in the i-th bin are numbered n-i+1. We can only swap balls between adjacent bins. Our task is to move all of the balls to the bins with the same numbers. For this problem, we give an efficient greedy algorithm with $\frac{k+1}{4}n^2+O(k+n)$ swaps and provide a detailed analysis for k=3. In addition, we give a more efficient recursive algorithm using $\frac{15}{16}n^2+O(n)$ swaps for k=3.
Yuling LIU Xinxin QU Guojiang XIN Peng LIU
A novel ROI-based reversible data hiding scheme is proposed for medical images, which is able to hide an electronic patient record (EPR) and protect the region of interest (ROI) with tamper localization and recovery. The proposed scheme combines prediction error expansion with the sorting technique to embed the EPR into the ROI, and the recovery information is embedded into the region of non-interest (RONI) using the histogram shifting (HS) method, which rarely leads to overflow and underflow problems. The experimental results show that the proposed scheme not only can embed a large amount of information with low distortion, but also can localize and recover the tampered area inside the ROI.
Arne KUTZNER Pok-Son KIM Won-Kwang PARK
We propose a family of algorithms for efficient merging on contemporary GPUs, in which each algorithm requires O(m log(n/m + 1)) element comparisons, where m and n are the sizes of the input sequences with m ≤ n. According to the lower bounds for merging, all proposed algorithms are asymptotically optimal regarding the number of necessary comparisons. First we introduce a parallel-structured algorithm that splits a merging problem of size 2^l into 2^i subproblems of size 2^(l-i), for some arbitrary i with 0 ≤ i ≤ l. This algorithm represents a merger for i = l, but it is rather inefficient in this case. The efficiency is boosted by moving to a two-stage approach in which the splitting process stops at some predetermined level and transfers control to several block-mergers operating in parallel. We formally prove the asymptotic optimality of the splitting process and show that, for symmetrically sized inputs, our approach delivers up to 4 times faster runtimes than the thrust::merge function that is part of the Thrust library. To assess the value of our merging technique in the context of sorting, we construct and evaluate a MergeSort on top of it. In our benchmarks, the resulting MergeSort clearly outperforms the MergeSort implementation provided by the Thrust library as well as Cederman's GPU-optimized variant of QuickSort.
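The general idea of splitting one merge into independent subproblems can be sketched on the CPU. The code below uses the generic co-ranking (merge-path) binary search as a stand-in; it is not the splitting procedure of the paper, and the names `co_rank` and `split_merge` are hypothetical. Each chunk of the output can be produced without looking at any other chunk, which is what makes the scheme parallelisable.

```python
import heapq


def co_rank(k, a, b):
    """Return (i, j) with i + j == k such that merging a[:i] and b[:j]
    yields the first k elements of the stable merge of sorted a and b."""
    lo = max(0, k - len(b))
    hi = min(k, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        j = k - i
        if a[i] <= b[j - 1]:
            lo = i + 1  # a[i] must come before b[j-1]: take more from a
        else:
            hi = i
    return lo, k - lo


def split_merge(a, b, parts):
    """Merge sorted lists a and b as `parts` independent sub-merges."""
    n = len(a) + len(b)
    bounds = [co_rank(n * p // parts, a, b) for p in range(parts + 1)]
    out = []
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        # Each sub-merge touches only its own slices, so the loop
        # iterations could run in parallel.
        out.extend(heapq.merge(a[i0:i1], b[j0:j1]))
    return out


if __name__ == "__main__":
    print(split_merge([1, 3, 5, 7, 9], [2, 3, 4, 8], parts=3))
```

Here the sub-merges are executed sequentially with `heapq.merge`; on a GPU each one would be assigned to its own block or thread group.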
Ukrit WATCHAREERUETAI Tetsuya MATSUMOTO Yoshinori TAKEUCHI Hiroaki KUDO Noboru OHNISHI
We propose a new multi-objective genetic programming (MOGP) method for the automatic construction of image feature extraction programs (FEPs). The proposed method is derived from a well-known multi-objective evolutionary algorithm (MOEA), NSGA-II. The key difference is that redundancy-regulation mechanisms are applied in the three main processes of the MOGP, i.e., population truncation, sampling, and offspring generation, to improve population diversity as well as convergence rate. Experimental results indicate that the proposed MOGP-based FEP construction system outperforms two conventional MOEAs (NSGA-II and SPEA2) on a test problem. Moreover, we compared the programs constructed by the proposed MOGP with four human-designed object recognition programs. The results show that the constructed programs are better than two of the human-designed methods and comparable with the other two on the test problem.
Young-Shin HAN SoYoung KIM TaeKyu KIM Jason J. JUNG
We exploit a structural knowledge representation scheme called the System Entity Structure (SES) methodology to represent and manage wafer failure patterns, which can significantly affect FABs in the semiconductor industry. It is important for engineers to simulate various system verification processes by using the predefined system entities (e.g., decomposition, taxonomy, and coupling relationships of a system) contained in the SES. For better computational performance, given a certain failure pattern, a Pruned SES (PES) can be extracted by selecting only the relevant system entities from the SES. The SES-based simulation system therefore allows engineers to efficiently evaluate and monitor semiconductor data by i) analyzing failures to find the corresponding causes and ii) managing historical data related to such failures.
In this paper, we propose CRadix sort, a new string sorting algorithm based on MSD radix sort. CRadix sort causes fewer cache misses than MSD radix sort by uniquely associating a small block of main memory, called the key buffer, with each key and temporarily storing a portion of each key in its corresponding key buffer. Experimental running-time comparisons with other string sorting algorithms are provided to show the effectiveness of CRadix sort.
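As background, the baseline that CRadix sort builds on can be sketched as follows. This is plain MSD radix sort for strings, written for clarity rather than speed; it does not include the key buffers of CRadix sort, and the function name is hypothetical.

```python
def msd_radix_sort(keys, depth=0):
    """Plain MSD radix sort for strings (baseline only, no key buffers).
    Keys are bucketed by the character at position `depth`, and each
    bucket is sorted recursively on the next character."""
    if len(keys) <= 1:
        return list(keys)
    # Keys that end at this depth precede every longer key sharing the prefix.
    out = [k for k in keys if len(k) == depth]
    buckets = {}
    for k in keys:
        if len(k) > depth:
            buckets.setdefault(k[depth], []).append(k)
    # Visit buckets in character order and recurse one position deeper.
    for ch in sorted(buckets):
        out.extend(msd_radix_sort(buckets[ch], depth + 1))
    return out


if __name__ == "__main__":
    print(msd_radix_sort(["banana", "band", "apple", "ban", "bandana", "app"]))
```

Every recursion level reads one character from each key wherever that key lives in memory, which is the access pattern that leads to the cache misses the abstract refers to.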
Yang SONG Zhenyu LIU Takeshi IKENAGA Satoshi GOTO
This paper presents two hardware-friendly, low-power-oriented fast motion estimation (ME) algorithms and their VLSI implementations. The basic idea of the proposed partial distortion sorting (PDS) algorithm is to disable the search points that have larger partial distortions during the ME process and keep only those search points with smaller ones. To further reduce the computation overhead, a simplified local PDS (LPDS) algorithm is also presented. Experiments show that the PDS and LPDS algorithms provide almost the same image quality as full search with only 36.7% of the computational complexity. The two proposed algorithms can be integrated into different FSBMA architectures to reduce power consumption. In this paper, the 1-D inter ME architecture [12] is used as a detailed example. Under the worst working conditions (1.62 V, 125 °C) and a 166 MHz clock frequency, the PDS algorithm reduces power consumption by 33.3% at an extra hardware cost of 4.05 K gates, and the LPDS algorithm reduces power consumption by 37.8% with an overhead of 1.73 K gates.
A radio network (RN for short) is a distributed system with no central arbiter, consisting of n radio transceivers, henceforth referred to as stations. We assume that the stations run on batteries and expend power while broadcasting or receiving a data packet. Thus, the most important measure for evaluating protocols on the radio network is the number of awake time slots, in which a station is broadcasting or receiving a data packet. We also assume that the stations are identical and have no unique ID number, and that no station knows the number n of stations. Given n keys, one for each station, the ranking problem asks each station to determine the number of keys in the RN smaller than its own key. The main contribution of this paper is an optimal randomized ranking protocol on the k-channel RN. Our protocol solves the ranking problem, with high probability, in O(n/k + log n) time slots with every station being awake for at most O(log n) time slots. We also prove that any randomized ranking protocol is required to run in expected Ω(n/k + log n) time slots with at least one station being awake for expected Ω(log n) time slots. Therefore, our ranking protocol is optimal.
Shyue-Horng SHIAU Chang-Biau YANG
The generalized sorting problem is to find the k largest elements among n input elements and to report them in sorted order. In this paper, we propose a fast generalized sorting algorithm under the single-hop wireless network model with collision detection (WNCD). The algorithm is based on the maximum finding algorithm and the sorting algorithm. The key point of our algorithm is to use successful broadcasts to build broadcasting layers logically and then to distribute the data elements into those logical layers properly. Thus, the number of broadcast conflicts is reduced. We prove that the average time complexity required for our generalized sorting algorithm is Θ(k + log(n - k)). When k = 1, our generalized sorting algorithm performs maximum finding, and when k = n, it performs sorting. Thus, the analysis of our algorithm builds a connection between the two extreme special cases, maximum finding and sorting.
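The problem statement, as opposed to the wireless broadcast algorithm, can be made concrete with a centralised sketch. The function name `generalized_sort` is hypothetical, and the code ignores the WNCD model entirely; it only shows how the two special cases k = 1 and k = n fall out of the same definition.

```python
import heapq


def generalized_sort(items, k):
    """Return the k largest items in descending order (centralised
    illustration of the problem, not the broadcast-based algorithm)."""
    return heapq.nlargest(k, items)


if __name__ == "__main__":
    data = [12, 7, 30, 4, 18]
    print(generalized_sort(data, 1))          # maximum finding
    print(generalized_sort(data, 2))          # the two largest, sorted
    print(generalized_sort(data, len(data)))  # full sort, descending
```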
A new sorting algorithm and architecture for a fast median filter are proposed. The algorithm results in a low-area VLSI architecture that produces low switching activity and uses no feedback. The main idea is to employ an extra matrix for a fast search of the rank of the oldest window element. We simulated and synthesized this algorithm using SYNOPSYS™ and showed that it is sufficient for real-time operation.
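The role played by the rank of the oldest window element can be seen in a software sliding-window median. The sketch below is an illustration under assumptions, not the proposed VLSI architecture: it keeps the window in sorted order, locates the oldest sample by binary search instead of an extra matrix, and assumes an odd window width. The function name is hypothetical.

```python
from bisect import bisect_left, insort
from collections import deque


def running_median(samples, width):
    """Sliding-window median for an odd `width` (software sketch only)."""
    window = deque()  # samples in arrival order
    ordered = []      # the same samples kept sorted
    medians = []
    for x in samples:
        window.append(x)
        insort(ordered, x)
        if len(window) > width:
            oldest = window.popleft()
            # Rank of the oldest element in the sorted window, found by
            # binary search; the element at that rank is then removed.
            del ordered[bisect_left(ordered, oldest)]
        if len(window) == width:
            medians.append(ordered[width // 2])
    return medians


if __name__ == "__main__":
    print(running_median([5, 1, 4, 2, 3], width=3))
```

Each new sample costs one removal and one insertion in the sorted window rather than a full re-sort.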