IEICE global.ieice.org Site

Keyword Search Result

[Keyword] parallel algorithm (73 hits)

Hits 1-20 of 73

Parallelization on a Minimal Substring Search Algorithm for Regular Expressions
Yosuke OBE Hiroaki YAMAMOTO Hiroshi FUJIWARA

PAPER-Fundamentals of Information Systems

Publicized:
2023/02/08
Vol:
E106-D No:5
Page(s):
952-958
Let us consider a regular expression r of length m and a text string T of length n over an alphabet Σ. Then, the RE minimal substring search problem is to find all minimal substrings of T matching r. Yamamoto proposed an O(mn)-time and O(m)-space algorithm using a Thompson automaton. In this paper, we improve Yamamoto's algorithm by introducing parallelism. The proposed algorithm runs in O(mn) time in the worst case and in O(mn/p) time in the best case, where p denotes the number of processors. In addition, we present a parameter related to the parallel time of the proposed algorithm. We evaluate the algorithm experimentally.
Accelerating the Smith-Waterman Algorithm Using the Bitwise Parallel Bulk Computation Technique on the GPU
Takahiro NISHIMURA Jacir Luiz BORDIM Yasuaki ITO Koji NAKANO

PAPER-Fundamentals of Information Systems

Publicized:
2019/07/09
Vol:
E102-D No:12
Page(s):
2400-2408
The bulk execution of a sequential algorithm is to execute it for many different inputs in turn or at the same time. It is known that the bulk execution of an oblivious sequential algorithm can be implemented to run efficiently on a GPU. The bulk execution supports fine grained bitwise parallelism, allowing it to achieve high acceleration over a straightforward sequential computation. The main contribution of this work is to present a Bitwise Parallel Bulk Computation (BPBC) to accelerate the Smith-Waterman Algorithm (SWA) using the affine gap penalty. Thus, our idea is to convert this computation into a circuit simulation using the BPBC technique to compute multiple instances simultaneously. The proposed BPBC technique for the SWA has been implemented on the GPU and CPU. Experimental results show that the proposed BPBC for the SWA accelerates the computation by over 646 times as compared to a single CPU implementation and by 6.9 times as compared to a multi-core CPU implementation with 160 threads.
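For orientation, the computation being accelerated can be sketched as a plain sequential Smith-Waterman scoring routine with an affine gap penalty (Gotoh's recurrences). This is only an illustrative baseline, not the BPBC circuit simulation from the paper; the function name and the scoring parameters (`match`, `mismatch`, `gap_open`, `gap_extend`) are assumptions chosen for the example.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score; a gap of length k costs gap_open + (k - 1) * gap_extend.

    The scoring values are hypothetical defaults, not taken from the paper.
    """
    n, m = len(a), len(b)
    neg = float("-inf")
    # H: best score of an alignment ending at (i, j)
    # E: best score ending with a gap in `a` (move along j)
    # F: best score ending with a gap in `b` (move along i)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] - gap_extend, H[i][j - 1] - gap_open)
            F[i][j] = max(F[i - 1][j] - gap_extend, H[i - 1][j] - gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

The sequence of table cells visited does not depend on the input strings, which is the kind of regular access pattern that bulk execution of many instances relies on.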
An Efficient GPU Implementation of CKY Parsing Using the Bitwise Parallel Bulk Computation Technique
Toru FUJITA Koji NAKANO Yasuaki ITO Daisuke TAKAFUJI

PAPER-GPU computing

Publicized:
2017/08/04
Vol:
E100-D No:12
Page(s):
2857-2865
The main contribution of this paper is to present an efficient GPU implementation of bulk computation of the CKY parsing for a context-free grammar, which determines if a context-free grammar derives each of a lot of input strings. The bulk computation is to execute the same algorithm for a lot of inputs in turn or at the same time. The CKY parsing is to determine if a context-free grammar derives a given string. We show that the bulk computation of the CKY parsing can be implemented in the GPU efficiently using Bitwise Parallel Bulk Computation (BPBC) technique. We also show the rule minimization technique and the dynamic scheduling method for further acceleration of the CKY parsing on the GPU. The experimental results using NVIDIA TITAN X GPU show that our implementation of the bitwise-parallel CKY parsing for strings of length 32 takes 395µs per string with 131072 production rules for 512 non-terminal symbols.
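As a reminder of what a single CKY instance computes, here is a minimal sequential sketch for a grammar in Chomsky normal form. It is not the bitwise-parallel GPU implementation; the function name, the rule encoding (`unary` as `(A, terminal)` pairs, `binary` as `(A, B, C)` triples for A → B C), and the start symbol `"S"` are assumptions made for the example.

```python
def cky_accepts(word, unary, binary, start="S"):
    """Return True if the CNF grammar derives `word`.

    unary:  iterable of (A, terminal) for rules A -> terminal
    binary: iterable of (A, B, C) for rules A -> B C
    """
    n = len(word)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive word[i..j] (inclusive)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][i] = {A for A, t in unary if t == ch}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):  # split point between word[i..k] and word[k+1..j]
                for A, B, C in binary:
                    if B in table[i][k] and C in table[k + 1][j]:
                        table[i][j].add(A)
    return start in table[0][n - 1]


# Hypothetical grammar for { a^n b^n : n >= 1 }
UNARY = [("A", "a"), ("B", "b")]
BINARY = [("S", "A", "B"), ("S", "A", "T"), ("T", "S", "B")]
```

With the example grammar, `cky_accepts("aabb", UNARY, BINARY)` holds while `cky_accepts("abab", UNARY, BINARY)` does not.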
Fully Parallelized LZW Decompression for CUDA-Enabled GPUs
Shunji FUNASAKA Koji NAKANO Yasuaki ITO

PAPER-GPU computing

Publicized:
2016/08/25
Vol:
E99-D No:12
Page(s):
2986-2994
The main contribution of this paper is to present a work-optimal parallel algorithm for LZW decompression and to implement it in a CUDA-enabled GPU. Since sequential LZW decompression creates a dictionary table by reading codes in a compressed file one by one, it is not easy to parallelize it. We first present a work-optimal parallel LZW decompression algorithm on the CREW-PRAM (Concurrent-Read Exclusive-Write Parallel Random Access Machine), which is a standard theoretical parallel computing model with a shared memory. We then go on to present an efficient implementation of this parallel algorithm on a GPU. The experimental results show that our GPU implementation performs LZW decompression in 1.15 milliseconds for a gray scale TIFF image with 4096×3072 pixels stored in the global memory of GeForce GTX 980. On the other hand, sequential LZW decompression for the same image stored in the main memory of Intel Core i7 CPU takes 50.1 milliseconds. Thus, our parallel LZW decompression on the global memory of the GPU is 43.6 times faster than a sequential LZW decompression on the main memory of the CPU for this image. To show the applicability of our GPU implementation for LZW decompression, we evaluated the SSD-GPU data loading time for three scenarios. The experimental results show that the scenario using our LZW decompression on the GPU is faster than the others.
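The sequential dependency mentioned above (each code read extends the dictionary that later codes refer to) can be seen in a minimal sequential decoder. This sketch assumes fixed-width integer codes and no clear or end-of-data codes, so it is simpler than the TIFF variant used in the experiments; the function name and `alphabet_size` parameter are illustrative.

```python
def lzw_decompress(codes, alphabet_size=256):
    """Sequential LZW decoding: the table grows by one entry per code read."""
    table = [bytes([i]) for i in range(alphabet_size)]
    if not codes:
        return b""
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry that is about to be created.
            entry = prev + prev[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table.append(prev + entry[:1])  # new dictionary entry depends on the previous code
        prev = entry
    return b"".join(out)
```

For example, the code sequence `[65, 66, 256, 258]` decodes to `b"ABABABA"`.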
A Cloud-Friendly Communication-Optimal Implementation for Strassen's Matrix Multiplication Algorithm
Jie ZHOU Feng YU

PAPER-Fundamentals of Information Systems

Publicized:
2015/07/27
Vol:
E98-D No:11
Page(s):
1896-1905
Due to its on-demand and pay-as-you-go properties, cloud computing has become an attractive alternative for HPC applications. However, communication-intensive applications with complex communication patterns still cannot be performed efficiently on cloud platforms, which are equipped with MapReduce technologies, such as Hadoop and Spark. In particular, one major obstacle is that MapReduce's simple programming model cannot explicitly manipulate data transfers between compute nodes. Another obstacle is the cloud's relatively poor network performance compared with traditional HPC platforms. The traditional Strassen's algorithm for square matrix multiplication has a recursive and complex communication pattern on the HPC platform. Therefore, it cannot be directly applied to the cloud platform. In this paper, we demonstrate how to make Strassen's algorithm with complex communication patterns “cloud-friendly”. By reorganizing Strassen's algorithm in an iterative pattern, we completely separate its computations and communications, making it fit the MapReduce programming model. By adopting a novel data/task parallel strategy, we solve Strassen's data dependency problems, making it well balanced. This is the first instance of Strassen's algorithm in MapReduce-style systems, which also matches Strassen's communication lower bound. Further experimental results show that it achieves a speedup ranging from 1.03× to 2.50× over the classical Θ(n^3) algorithm. We believe the principle can be applied to many other complex scientific applications.
Parallelization of Dynamic Time Warping on a Heterogeneous Platform
Yao ZHENG Limin XIAO Wenqi TANG Lihong SHANG Guangchao YAO Li RUAN

LETTER-Algorithms and Data Structures

Vol:
E97-A No:11
Page(s):
2258-2262
The dynamic time warping (DTW) algorithm is widely used for time series similarity search. As DTW has quadratic time complexity, the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. In this paper, we present a parallel approach for DTW on a heterogeneous platform with a graphics processing unit (GPU). In order to exploit fine-grained data-level parallelism, we propose a specific parallel decomposition in DTW. Furthermore, we introduce an optimization technique called diamond tiling to improve the utilization of threads. Results show that our approach substantially reduces computational time.
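The quadratic cost comes from filling an (n+1) × (m+1) table, as in the following minimal sequential sketch. It assumes numeric sequences and absolute difference as the local cost, which the abstract does not specify; the function name is illustrative, and the diamond tiling decomposition is not shown.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j]: minimal cumulative cost of aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # assumed local distance
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on the same anti-diagonal are independent of one another; this is the data-level parallelism that such a table offers.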
Hybrid Consultant-Guided Search for the Traveling Salesperson Problem
Hiroyuki EBARA Yudai HIRANUMA Koki NAKAYAMA

PAPER-Algorithms and Data Structures

Vol:
E97-A No:8
Page(s):
1728-1738
Metaheuristic methods have been studied for combinatorial optimization problems for some time. Recently, a Consultant-Guided Search (CGS) has been proposed as a metaheuristic method for the Traveling Salesperson Problem (TSP). This approach is an algorithm in which a virtual person called a client creates a solution based on consultation with a virtual person called a consultant. In this research, we propose a parallel algorithm which uses the Ant Colony System (ACS) to create a solution with a consultant in a Consultant-Guided Search, and calculate an approximate solution for the TSP. Finally, we execute a computer experiment using the benchmark problems (TSPLIB). Our algorithm provides a solution with less than 2% error rate for problem instances with fewer than 2000 cities.
A Fast Parallel Algorithm for Indexing Human Genome Sequences
Woong-Kee LOH Kyoung-Soo HAN

LETTER-Data Engineering, Web Information Systems

Vol:
E97-D No:5
Page(s):
1345-1348
A suffix tree is widely adopted for indexing genome sequences. While supporting highly efficient search, the suffix tree has a few shortcomings such as very large size and very long construction time. In this paper, we propose a very fast parallel algorithm to construct a disk-based suffix tree for human genome sequences. Our algorithm constructs a suffix array for part of the suffixes in the human genome sequence and then converts it into a suffix tree very quickly. It outperformed the previous algorithms by Loh et al. and Barsky et al. by up to 2.09 and 3.04 times, respectively.
Asynchronous Memory Machine Models with Barrier Synchronization
Koji NAKANO

PAPER-Parallel and Distributed Computing

Vol:
E97-D No:3
Page(s):
431-441
The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. It is assumed that warps (or groups of threads) on the DMM and the UMM work synchronously in a round-robin manner. However, warps work asynchronously in real GPUs, in the sense that they are randomly (or arbitrarily) dispatched for execution. The first contribution of this paper is to introduce asynchronous versions of these models in which warps are arbitrarily dispatched. In addition, we assume that threads can execute the “syncthreads” instruction for barrier synchronization. Since the barrier synchronization operation may be costly, we should evaluate and minimize the number of barrier synchronization operations executed by parallel algorithms. The second contribution of this paper is to show a parallel algorithm that computes the sum of n numbers in optimal computing time and few barrier synchronization steps. Our parallel algorithm computes the sum of n numbers in O(n/w + l log n) time units and O(log l/log w + log log w) barrier synchronization steps using wl threads on the asynchronous UMM with width w and latency l. Since the computation of the sum takes at least Ω(n/w + l log n) time units, this algorithm is time optimal. Finally, we show that the prefix-sums of n numbers can also be computed in O(n/w + l log n) time units and O(log l/log w + log log w) barrier synchronization steps using wl threads.
A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation
Yasuaki ITO Koji NAKANO

PAPER

Vol:
E96-D No:12
Page(s):
2596-2603
This paper presents a GPU (Graphics Processing Unit) implementation of dynamic programming for the optimal polygon triangulation. Recently, GPUs have come to be used for general-purpose parallel computation. Users can develop parallel programs running on GPUs using a programming architecture called CUDA (Compute Unified Device Architecture) provided by NVIDIA. The optimal polygon triangulation problem for a convex polygon is an optimization problem to find a triangulation with minimum total weight. It is known that this problem for a convex n-gon can be solved using the dynamic programming technique in O(n^3) time using a work space of size O(n^2). In this paper, we propose an efficient parallel implementation of this O(n^3)-time algorithm on the GPU. In our implementation, we have used two new ideas to accelerate the dynamic programming. The first idea (adaptive granularity) is to partition the dynamic programming algorithm into many sequential kernel calls of CUDA, and to select the best parameters for the size and the number of blocks for each kernel call. The second idea (sliding and mirroring arrangements) is to arrange the working data for coalesced access of the global memory in the GPU to minimize the memory access overhead. Our implementation using these two ideas solves the optimal polygon triangulation problem for a convex 8192-gon in 5.57 seconds on the NVIDIA GeForce GTX 680, while a conventional CPU implementation runs in 1939.02 seconds. Thus, our GPU implementation attains a speedup factor of 348.02.
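The O(n^3)-time, O(n^2)-space dynamic programming scheme can be sketched sequentially as follows. The abstract does not say how weights are assigned, so this sketch assumes a caller-supplied `weight(i, k, j)` for each triangle; that function and the name `min_weight_triangulation` are hypothetical, and none of the GPU-specific ideas (adaptive granularity, sliding and mirroring arrangements) are shown.

```python
def min_weight_triangulation(n, weight):
    """Minimum total weight of a triangulation of a convex n-gon (vertices 0..n-1).

    weight(i, k, j) is an assumed per-triangle weight for vertices i < k < j.
    """
    # best[i][j]: minimum weight of triangulating the sub-polygon i, i+1, ..., j
    best = [[0] * n for _ in range(n)]
    for gap in range(2, n):           # sub-polygons with gap + 1 vertices
        for i in range(n - gap):
            j = i + gap
            best[i][j] = min(
                best[i][k] + best[k][j] + weight(i, k, j)
                for k in range(i + 1, j)
            )
    return best[0][n - 1] if n >= 3 else 0
```

As a sanity check, with unit weight per triangle the result for a hexagon is 4, since every triangulation of a convex n-gon has n - 2 triangles.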
Optimal Parallel Algorithms for Computing the Sum, the Prefix-Sums, and the Summed Area Table on the Memory Machine Models
Koji NAKANO

PAPER

Vol:
E96-D No:12
Page(s):
2626-2634
The main contribution of this paper is to show optimal parallel algorithms to compute the sum, the prefix-sums, and the summed area table on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. These models have three parameters: the number p of threads, the width w of the memory, and the memory access latency l. We first show that the sum of n numbers can be computed in $O(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units on the DMM and the UMM. We then go on to show that $\Omega(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units are necessary to compute the sum. We also present a parallel algorithm that computes the prefix-sums of n numbers in $O(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units on the DMM and the UMM. Finally, we show that the summed area table of size $\sqrt{n}\times\sqrt{n}$ can be computed in $O(\frac{n}{w}+\frac{nl}{p}+l\log n)$ time units on the DMM and the UMM. Since the computation of the prefix-sums and the summed area table is at least as hard as the sum computation, these parallel algorithms are also optimal.
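To make the two outputs concrete, the following sequential sketch defines the prefix-sums and the summed area table; it says nothing about the DMM/UMM algorithms themselves. The function names are illustrative.

```python
from itertools import accumulate


def prefix_sums(xs):
    """prefix_sums(xs)[i] = xs[0] + ... + xs[i]."""
    return list(accumulate(xs))


def summed_area_table(a):
    """table[i][j] = sum of a[r][c] over all r <= i and c <= j."""
    table = [list(accumulate(row)) for row in a]   # row-wise prefix sums
    for i in range(1, len(table)):                 # then column-wise prefix sums
        for j in range(len(table[i])):
            table[i][j] += table[i - 1][j]
    return table
```

The summed area table is thus obtained by applying prefix-sums along the rows and then along the columns, which is why it is at least as hard as a single sum.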
Asymptotically Optimal Merging on ManyCore GPUs
Arne KUTZNER Pok-Son KIM Won-Kwang PARK

PAPER-Parallel and Distributed Computing

Vol:
E95-D No:12
Page(s):
2769-2777
We propose a family of algorithms for efficiently merging on contemporary GPUs, so that each algorithm requires O(m log(n/m + 1)) element comparisons, where m and n are the sizes of the input sequences with m ≤ n. According to the lower bounds for merging, all proposed algorithms are asymptotically optimal regarding the number of necessary comparisons. First we introduce a parallel-structured algorithm that splits a merging problem of size 2^l into 2^i subproblems of size 2^(l-i), for some arbitrary i with 0 ≤ i ≤ l. This algorithm represents a merger for i = l but it is rather inefficient in this case. The efficiency is boosted by moving to a two-stage approach where the splitting process stops at some predetermined level and transfers control to several block-mergers operating in parallel. We formally prove the asymptotic optimality of the splitting process and show that for symmetrically sized inputs our approach delivers up to 4 times faster runtimes than the thrust::merge function that is part of the Thrust library. For assessing the value of our merging technique in the context of sorting we construct and evaluate a MergeSort on top of it. In the context of our benchmarking the resulting MergeSort clearly outperforms the MergeSort implementation provided by the Thrust library as well as Cederman's GPU optimized variant of QuickSort.
A Dynamically Reconfigurable FPGA-Based Pattern Matching Hardware for Subclasses of Regular Expressions
Yusaku KANETA Shingo YOSHIZAWA Shin-ichi MINATO Hiroki ARIMURA Yoshikazu MIYANAGA

PAPER-Computer System

Vol:
E95-D No:7
Page(s):
1847-1857
In this paper, we propose a novel architecture for large-scale regular expression matching, called dynamically reconfigurable bit-parallel NFA architecture (Dynamic BP-NFA), which allows dynamic loading of regular expressions on-the-fly as well as efficient pattern matching for fast data streams. This is the first dynamically reconfigurable hardware with guaranteed performance for the class of extended patterns, which is a subclass of regular expressions consisting of unions of characters and their repeats. This class allows operators such as character classes, gaps, optional characters, and bounded and unbounded repeats of character classes. The key to our architecture is the use of a bit-parallel pattern matching approach, in which the information of an input non-deterministic finite automaton (NFA) is first compactly encoded in bit-masks stored in a collection of registers and block RAMs. Then, the NFA is efficiently simulated by fixed circuitry using bitwise Boolean and arithmetic operations, consuming one input character per clock regardless of the actual contents of an input text. Experimental results showed that our hardware for both string and extended patterns was comparable in performance to previous dynamically reconfigurable hardware.
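The bit-parallel idea (NFA states packed into a bit vector, one mask per character, one update per input character) can be illustrated in software with the classic Shift-And algorithm for plain string patterns. This is a swapped-in textbook technique for illustration only: it does not cover the extended patterns or the hardware design described above, and the function name is an assumption.

```python
def shift_and_search(pattern, text):
    """Return the start positions of all occurrences of `pattern` in `text`."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # masks[ch] has bit i set when pattern[i] == ch
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0  # bit i set: pattern[:i + 1] matches a suffix of the text read so far
    hits = []
    for pos, ch in enumerate(text):
        # One shift, one OR, one AND per input character, independent of the text contents.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits
```

For instance, searching for `"aba"` in `"ababa"` reports the overlapping occurrences at positions 0 and 2.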
An Optimal Parallel Algorithm for Constructing a Spanning Tree on Circular Permutation Graphs
Hirotoshi HONMA Saki HONMA Shigeru MASUYAMA

PAPER

Vol:
E92-D No:2
Page(s):
141-148
The spanning tree problem is to find a tree that connects all the vertices of a graph G. This problem has many applications, such as electric power systems, computer network design and circuit analysis. Klein and Stein demonstrated that a spanning tree can be found in O(log n) time with O(n+m) processors on the CRCW PRAM. In general, it is known that more efficient parallel algorithms can be developed by restricting classes of graphs. Circular permutation graphs properly contain the set of permutation graphs as a subclass and were first introduced by Rotem and Urrutia. They provided an O(n^2.376)-time recognition algorithm. Circular permutation graphs and their models find several applications in VLSI layout. In this paper, we propose an optimal parallel algorithm for constructing a spanning tree on circular permutation graphs. It runs in O(log n) time with O(n/log n) processors on the EREW PRAM.
An Optimal Parallel Algorithm for Constructing a Spanning Forest on Trapezoid Graphs
Hirotoshi HONMA Shigeru MASUYAMA

PAPER

Vol:
E91-A No:9
Page(s):
2296-2300
Given a simple graph G with n vertices, m edges and k connected components, the spanning forest problem is to find a spanning tree for each connected component of G. This problem has applications to the electrical power demand problem, computer network design, circuit analysis, etc. An optimal parallel algorithm for finding a spanning tree on the trapezoid graph is given by Bera et al.; it takes O(log n) time with O(n/log n) processors on the EREW (Exclusive-Read Exclusive-Write) PRAM. Bera et al.'s algorithm is very simple and elegant. Moreover, it can correctly construct a spanning tree when the graph is connected. However, their algorithm cannot accept a disconnected graph as an input. When their algorithm is applied to a disconnected graph, a concurrent write occurs once for each connected component, so it cannot be executed on the EREW PRAM. In this paper we present an O(log n) time parallel algorithm with O(n/log n) processors for constructing a spanning forest on a trapezoid graph G on the EREW PRAM even if G is a disconnected graph.
An Optimal Parallel Algorithm for Finding All Hinge Vertices of a Circular-Arc Graph
Hirotoshi HONMA Shigeru MASUYAMA

PAPER-Algorithms and Data Structures

Vol:
E91-A No:1
Page(s):
383-391
Let G =(V, E) be an undirected simple graph with u ∈ V. If there exist any two vertices in G whose distance becomes longer when a vertex u is removed, then u is defined as a hinge vertex. Finding the set of hinge vertices in a graph is useful for identifying critical nodes in an actual network. A number of studies concerning hinge vertices have been made in recent years. In a number of graph problems, it is known that more efficient sequential or parallel algorithms can be developed by restricting classes of graphs. In this paper, we shall propose a parallel algorithm which runs in O(log n) time with O(n/log n) processors on EREW PRAM for finding all hinge vertices of a circular-arc graph.
Generalization of Sorting in Single Hop Wireless Networks
Shyue-Horng SHIAU Chang-Biau YANG

PAPER-Computation and Computational Models

Vol:
E89-D No:4
Page(s):
1432-1439
The generalized sorting problem is to find the first k largest elements among n input elements and to report them in a sorted order. In this paper, we propose a fast generalized sorting algorithm under the single hop wireless networks model with collision detection (WNCD). The algorithm is based on the maximum finding algorithm and the sorting algorithm. The key point of our algorithm is to use successful broadcasts to build broadcasting layers logically and then to distribute the data elements into those logic layers properly. Thus, the number of broadcast conflicts is reduced. We prove that the average time complexity required for our generalized sorting algorithm is Θ(k + log(n - k)). When k = 1, our generalized sorting algorithm does the work of finding maximum, and when k = n, it does the work of sorting. Thus, the analysis of our algorithm builds a connection between the two extremely special cases which are maximum finding and sorting.
A Fast Initialization Algorithm for Single-Hop Wireless Networks
Shyue-Horng SHIAU Chang-Biau YANG

PAPER-Network

Vol:
E88-B No:11
Page(s):
4285-4292
Given a set of n stations, the initialization problem is to assign each station a unique identification number, from 1 to n. In single-hop wireless networks with collision detection, Nakano and Olariu proposed an algorithm to build a partition tree and solve the problem. In this paper, we shall classify the partition tree into four parts. By reviewing the classification, we find that three ideas can improve the algorithm. We show that it needs 2.88n time slots for solving the problem containing n stations. After applying our three ideas, the number of time slots will be improved to 2.46n.
A Two-Dimensional Quantum Transport Simulation of Nanoscale Double-Gate MOSFETs Using Parallel Adaptive Technique
Yiming LI Shao-Ming YU

PAPER-Scientific and Engineering Computing with Applications

Vol:
E87-D No:7
Page(s):
1751-1758
In this paper we apply a parallel adaptive solution algorithm to simulate nanoscale double-gate metal-oxide-semiconductor field effect transistors (MOSFETs) on a personal computer (PC)-based Linux cluster with the message passing interface (MPI) libraries. Based on a posteriori error estimation, the triangular mesh generation, the adaptive finite volume method, the monotone iterative method, and the parallel domain decomposition algorithm, a set of two-dimensional quantum correction hydrodynamic (HD) equations is solved numerically on our constructed cluster system. This parallel adaptive simulation methodology with 1-irregular mesh was successfully developed and applied to deep-submicron semiconductor device simulation in our recent work. A 10 nm n-type double-gate MOSFET is simulated with the developed parallel adaptive simulator. In terms of physical quantities and refined adaptive mesh, simulation results demonstrate very good accuracy and computational efficiency. Benchmark results, such as load-balancing, speedup, and parallel efficiency are achieved and exhibit excellent parallel performance. On a 16-node PC-based Linux cluster, the maximum difference among CPUs is less than 6%. A 12.8 times speedup and 80% parallel efficiency are simultaneously attained with respect to different simulation cases.
The Fault-Tolerant Early Bird Problem
Bjorn FAY Martin KUTRIB

PAPER

Vol:
E87-D No:3
Page(s):
687-693
The capabilities of reliable computations in one-dimensional cellular automata are investigated by means of the Early Bird Problem. The problem is typical for situations in massively parallel systems where a global behavior must be achieved by only local interactions between the single elements. The cells that cause the misoperations are assumed to behave as follows. They run a self-diagnosis before the actual computation once. The result is stored locally such that the working state of a cell becomes visible to its neighbors. A non-working (defective) cell cannot modify information but is able to transmit it unchanged with unit speed. We present an O(n log (n) log (n))-time fault-tolerant solution of the Early Bird Problem.
