IEICE global.ieice.org Site

Keyword Search Result

[Keyword] parallel algorithms (29 hits)

1-20 of 29 hits

Accelerating the Smith-Waterman Algorithm Using the Bitwise Parallel Bulk Computation Technique on the GPU
Takahiro NISHIMURA Jacir Luiz BORDIM Yasuaki ITO Koji NAKANO

PAPER-Fundamentals of Information Systems

Publicized:
2019/07/09
Vol:
E102-D No:12
Page(s):
2400-2408
The bulk execution of a sequential algorithm is to execute it for many different inputs in turn or at the same time. It is known that the bulk execution of an oblivious sequential algorithm can be implemented to run efficiently on a GPU. The bulk execution supports fine-grained bitwise parallelism, allowing it to achieve high acceleration over a straightforward sequential computation. The main contribution of this work is to present a Bitwise Parallel Bulk Computation (BPBC) technique to accelerate the Smith-Waterman Algorithm (SWA) using the affine gap penalty. Our idea is to convert this computation into a circuit simulation using the BPBC technique to compute multiple instances simultaneously. The proposed BPBC technique for the SWA has been implemented on the GPU and CPU. Experimental results show that the proposed BPBC for the SWA accelerates the computation by over 646 times as compared to a single CPU implementation and by 6.9 times as compared to a multi-core CPU implementation with 160 threads.
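The idea of simulating a circuit on many instances at once can be illustrated with a much simpler circuit than the SWA. The following is a minimal sketch, not the authors' implementation: it bit-slices several fixed-width integers so that one Python integer holds the same bit position of every instance, then runs a ripple-carry adder with bitwise operations so that all instances are added simultaneously. The function names and the 8-bit width are assumptions chosen for illustration.

```python
def to_slices(values, width):
    """Bit-slice the inputs: slices[j] holds bit j of every instance,
    with instance i stored at bit position i of slices[j]."""
    slices = [0] * width
    for i, v in enumerate(values):
        for j in range(width):
            if (v >> j) & 1:
                slices[j] |= 1 << i
    return slices


def from_slices(slices, count):
    """Inverse of to_slices: rebuild one integer per instance."""
    values = [0] * count
    for j, s in enumerate(slices):
        for i in range(count):
            if (s >> i) & 1:
                values[i] |= 1 << j
    return values


def bulk_add(a_slices, b_slices):
    """Ripple-carry adder evaluated as a circuit on bit slices.
    Each bitwise operation acts on all instances at once.
    The final carry is dropped, so sums are taken modulo 2**width."""
    carry = 0
    out = []
    for a, b in zip(a_slices, b_slices):
        out.append(a ^ b ^ carry)                 # sum bit of a full adder
        carry = (a & b) | (carry & (a ^ b))       # carry-out of a full adder
    return out


a = [3, 5, 250, 17]
b = [4, 11, 10, 200]
width = 8
sums = from_slices(bulk_add(to_slices(a, width), to_slices(b, width)), len(a))
print(sums)
```

On a GPU the slices would typically be 32-bit or 64-bit machine words, so each logical operation advances 32 or 64 instances in one step.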
An Efficient GPU Implementation of CKY Parsing Using the Bitwise Parallel Bulk Computation Technique
Toru FUJITA Koji NAKANO Yasuaki ITO Daisuke TAKAFUJI

PAPER-GPU computing

Publicized:
2017/08/04
Vol:
E100-D No:12
Page(s):
2857-2865
The main contribution of this paper is an efficient GPU implementation of the bulk computation of CKY parsing for a context-free grammar, which determines whether the grammar derives each of many input strings. Bulk computation executes the same algorithm for many inputs, either in turn or at the same time. CKY parsing determines whether a context-free grammar derives a given string. We show that the bulk computation of CKY parsing can be implemented efficiently on the GPU using the Bitwise Parallel Bulk Computation (BPBC) technique. We also present a rule minimization technique and a dynamic scheduling method that further accelerate CKY parsing on the GPU. Experimental results on an NVIDIA TITAN X GPU show that our implementation of bitwise-parallel CKY parsing for strings of length 32 takes 395µs per string with 131072 production rules for 512 non-terminal symbols.
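For readers unfamiliar with CKY parsing, the following is a minimal sequential sketch for a grammar in Chomsky normal form. It is a plain textbook version, not the bitwise-parallel GPU implementation described in the abstract. The function name, the rule encoding, and the example grammar (which generates strings of the form a^n b^n) are assumptions made for illustration.

```python
def cky_accepts(word, unary, binary, start="S"):
    """Return True if the grammar derives `word`.
    unary:  list of (lhs, terminal) rules, e.g. ("A", "a")
    binary: list of (lhs, left, right) rules, e.g. ("S", "A", "B")
    """
    n = len(word)
    if n == 0:
        return False  # a CNF grammar without an empty rule derives no empty string
    # table[i][k] = set of nonterminals deriving word[i : i + k + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = {lhs for lhs, t in unary if t == ch}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, b, c in binary:
                    if b in left and c in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]


# Hypothetical grammar for { a^n b^n : n >= 1 } in Chomsky normal form.
unary = [("A", "a"), ("B", "b")]
binary = [("S", "A", "B"), ("S", "A", "T"), ("T", "S", "B")]

print(cky_accepts("aabb", unary, binary))
print(cky_accepts("aab", unary, binary))
```

The table entries are sets of nonterminals, so they can be represented as bit vectors; that representation is what makes bitwise-parallel variants possible.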
A Cloud-Friendly Communication-Optimal Implementation for Strassen's Matrix Multiplication Algorithm
Jie ZHOU Feng YU

PAPER-Fundamentals of Information Systems

Publicized:
2015/07/27
Vol:
E98-D No:11
Page(s):
1896-1905
Due to its on-demand and pay-as-you-go properties, cloud computing has become an attractive alternative for HPC applications. However, communication-intensive applications with complex communication patterns still cannot be performed efficiently on cloud platforms equipped with MapReduce technologies such as Hadoop and Spark. In particular, one major obstacle is that MapReduce's simple programming model cannot explicitly manipulate data transfers between compute nodes. Another obstacle is the cloud's relatively poor network performance compared with traditional HPC platforms. The traditional Strassen's algorithm for square matrix multiplication has a recursive and complex pattern on the HPC platform. Therefore, it cannot be directly applied to the cloud platform. In this paper, we demonstrate how to make Strassen's algorithm with complex communication patterns “cloud-friendly”. By reorganizing Strassen's algorithm in an iterative pattern, we completely separate its computations and communications, so that it fits the MapReduce programming model. By adopting a novel data/task parallel strategy, we solve Strassen's data dependency problems, making it well balanced. This is the first instance of Strassen's algorithm in MapReduce-style systems, and it also matches Strassen's communication lower bound. Further experimental results show that it achieves a speedup ranging from 1.03× to 2.50× over the classical Θ(n^3) algorithm. We believe the principle can be applied to many other complex scientific applications.
Asynchronous Memory Machine Models with Barrier Synchronization
Koji NAKANO

PAPER-Parallel and Distributed Computing

Vol:
E97-D No:3
Page(s):
431-441
The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. It is assumed that warps (or groups of threads) on the DMM and the UMM work synchronously in a round-robin manner. However, warps work asynchronously in real GPUs, in the sense that they are randomly (or arbitrarily) dispatched for execution. The first contribution of this paper is to introduce asynchronous versions of these models in which warps are arbitrarily dispatched. In addition, we assume that threads can execute the “syncthreads” instruction for barrier synchronization. Since the barrier synchronization operation may be costly, we should evaluate and minimize the number of barrier synchronization operations executed by parallel algorithms. The second contribution of this paper is to show a parallel algorithm that computes the sum of n numbers in optimal computing time and few barrier synchronization steps. Our parallel algorithm computes the sum of n numbers in O(n/w + l log n) time units and O(log l/log w + log log w) barrier synchronization steps using wl threads on the asynchronous UMM with width w and latency l. Since the computation of the sum takes at least Ω(n/w + l log n) time units, this algorithm is time optimal. Finally, we show that the prefix-sums of n numbers can also be computed in O(n/w + l log n) time units and O(log l/log w + log log w) barrier synchronization steps using wl threads.
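To see why the number of barrier steps is worth counting, consider the standard pairwise tree reduction, simulated sequentially below. This is a baseline sketch and not the algorithm of the paper: every round must finish before the next one starts, so it needs one barrier per round, about log2 n barriers in total, which is more than the bound stated in the abstract. The function name is hypothetical.

```python
def tree_sum(values):
    """Pairwise tree reduction, simulated sequentially.
    Returns (sum, rounds), where each round would end with one barrier
    synchronization if the inner loop were executed by parallel threads."""
    data = list(values)
    n = len(data)
    barriers = 0
    stride = 1
    while stride < n:
        # All additions in this loop are independent of each other.
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        barriers += 1  # a barrier would separate this round from the next
        stride *= 2
    return (data[0] if data else 0), barriers


total, rounds = tree_sum(range(1, 11))
print(total, rounds)
```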
A GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation
Yasuaki ITO Koji NAKANO

PAPER

Vol:
E96-D No:12
Page(s):
2596-2603
This paper presents a GPU (Graphics Processing Unit) implementation of dynamic programming for the optimal polygon triangulation. GPUs can now be used for general purpose parallel computation. Users can develop parallel programs running on GPUs using a programming architecture called CUDA (Compute Unified Device Architecture) provided by NVIDIA. The optimal polygon triangulation problem for a convex polygon is an optimization problem to find a triangulation with minimum total weight. It is known that this problem for a convex n-gon can be solved using the dynamic programming technique in O(n^3) time using a work space of size O(n^2). In this paper, we propose an efficient parallel implementation of this O(n^3)-time algorithm on the GPU. In our implementation, we have used two new ideas to accelerate the dynamic programming. The first idea (adaptive granularity) is to partition the dynamic programming algorithm into many sequential kernel calls of CUDA, and to select the best parameters for the size and the number of blocks for each kernel call. The second idea (sliding and mirroring arrangements) is to arrange the working data for coalesced access of the global memory in the GPU to minimize the memory access overhead. Our implementation using these two ideas solves the optimal polygon triangulation problem for a convex 8192-gon in 5.57 seconds on the NVIDIA GeForce GTX 680, while a conventional CPU implementation runs in 1939.02 seconds. Thus, our GPU implementation attains a speedup factor of 348.02.
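The O(n^3)-time, O(n^2)-space dynamic program mentioned above can be sketched sequentially as follows. This is the textbook recurrence, not the GPU implementation, and it assumes that the total weight is a sum of per-triangle costs supplied by a caller-defined function; the abstract does not specify the weight model, so `cost` and the vertex-weight example are assumptions.

```python
def min_triangulation(n, cost):
    """Minimum total cost of triangulating a convex n-gon with vertices 0..n-1.
    cost(i, k, j) is the (assumed) cost of the triangle on vertices i < k < j.
    m[i][j] is the optimum for the sub-polygon i, i+1, ..., j."""
    m = [[0] * n for _ in range(n)]
    for gap in range(2, n):            # sub-polygons with gap + 1 vertices
        for i in range(n - gap):
            j = i + gap
            m[i][j] = min(
                m[i][k] + m[k][j] + cost(i, k, j) for k in range(i + 1, j)
            )
    return m[0][n - 1] if n >= 3 else 0


# Hypothetical cost: product of per-vertex weights.
w = [1, 2, 3, 4]
print(min_triangulation(len(w), lambda i, k, j: w[i] * w[k] * w[j]))
```

All entries with the same `gap` are independent of each other, which is the source of parallelism that a GPU implementation can exploit.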
Asymptotically Optimal Merging on ManyCore GPUs
Arne KUTZNER Pok-Son KIM Won-Kwang PARK

PAPER-Parallel and Distributed Computing

Vol:
E95-D No:12
Page(s):
2769-2777
We propose a family of algorithms for efficient merging on contemporary GPUs, in which each algorithm requires O(m log(n/m + 1)) element comparisons, where m and n are the sizes of the input sequences with m ≤ n. According to the lower bounds for merging, all proposed algorithms are asymptotically optimal with respect to the number of necessary comparisons. First we introduce a parallel-structured algorithm that splits a merging problem of size 2^l into 2^i subproblems of size 2^(l-i), for an arbitrary i with 0 ≤ i ≤ l. This algorithm represents a merger for i = l, but it is rather inefficient in this case. The efficiency is boosted by moving to a two-stage approach in which the splitting process stops at some predetermined level and transfers control to several block-mergers operating in parallel. We formally prove the asymptotic optimality of the splitting process and show that for symmetrically sized inputs our approach delivers up to 4 times faster runtimes than the thrust::merge function that is part of the Thrust library. To assess the value of our merging technique in the context of sorting, we construct and evaluate a MergeSort on top of it. In our benchmarks the resulting MergeSort clearly outperforms the MergeSort implementation provided by the Thrust library as well as Cederman's GPU-optimized variant of QuickSort.
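As background on why merging parallelizes at all, the sketch below shows a simple rank-based merge in which every output position is computed independently by binary search. This is a swapped-in illustration and not the splitting algorithm of the paper: it uses more comparisons than the O(m log(n/m + 1)) bound. The function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def rank_merge(a, b):
    """Merge two sorted lists. The final position of each element is its own
    index plus its rank in the other list, so every iteration of both loops
    is independent and could be handled by a separate thread."""
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        # elements of b strictly smaller than x precede it
        out[i + bisect_left(b, x)] = x
    for j, y in enumerate(b):
        # elements of a smaller than or equal to y precede it (keeps the merge stable)
        out[j + bisect_right(a, y)] = y
    return out


print(rank_merge([1, 3, 3, 7], [2, 3, 8]))
```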
An Optimal Parallel Algorithm for Constructing a Spanning Tree on Circular Permutation Graphs
Hirotoshi HONMA Saki HONMA Shigeru MASUYAMA

PAPER

Vol:
E92-D No:2
Page(s):
141-148
The spanning tree problem is to find a tree that connects all the vertices of a graph G. This problem has many applications, such as electric power systems, computer network design and circuit analysis. Klein and Stein demonstrated that a spanning tree can be found in O(log n) time with O(n+m) processors on the CRCW PRAM. In general, it is known that more efficient parallel algorithms can be developed by restricting classes of graphs. Circular permutation graphs properly contain the set of permutation graphs as a subclass and were first introduced by Rotem and Urrutia, who provided an O(n^2.376)-time recognition algorithm. Circular permutation graphs and their models find several applications in VLSI layout. In this paper, we propose an optimal parallel algorithm for constructing a spanning tree on circular permutation graphs. It runs in O(log n) time with O(n/log n) processors on the EREW PRAM.
An Optimal Parallel Algorithm for Constructing a Spanning Forest on Trapezoid Graphs
Hirotoshi HONMA Shigeru MASUYAMA

PAPER

Vol:
E91-A No:9
Page(s):
2296-2300
Given a simple graph G with n vertices, m edges and k connected components, the spanning forest problem is to find a spanning tree for each connected component of G. This problem has applications to the electrical power demand problem, computer network design, circuit analysis, etc. An optimal parallel algorithm for finding a spanning tree on a trapezoid graph was given by Bera et al.; it takes O(log n) time with O(n/log n) processors on the EREW (Exclusive-Read Exclusive-Write) PRAM. Bera et al.'s algorithm is very simple and elegant. Moreover, it correctly constructs a spanning tree when the graph is connected. However, their algorithm cannot accept a disconnected graph as input. When their algorithm is applied to a disconnected graph, a concurrent write occurs once for each connected component, so it cannot run on the EREW PRAM. In this paper we present an O(log n) time parallel algorithm with O(n/log n) processors for constructing a spanning forest on a trapezoid graph G on the EREW PRAM even if G is a disconnected graph.
An Optimal Parallel Algorithm for Finding All Hinge Vertices of a Circular-Arc Graph
Hirotoshi HONMA Shigeru MASUYAMA

PAPER-Algorithms and Data Structures

Vol:
E91-A No:1
Page(s):
383-391
Let G =(V, E) be an undirected simple graph with u ∈ V. If there exist any two vertices in G whose distance becomes longer when a vertex u is removed, then u is defined as a hinge vertex. Finding the set of hinge vertices in a graph is useful for identifying critical nodes in an actual network. A number of studies concerning hinge vertices have been made in recent years. In a number of graph problems, it is known that more efficient sequential or parallel algorithms can be developed by restricting classes of graphs. In this paper, we shall propose a parallel algorithm which runs in O(log n) time with O(n/log n) processors on EREW PRAM for finding all hinge vertices of a circular-arc graph.
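The definition of a hinge vertex can be checked directly by brute force, as in the sketch below. It is only meant to make the definition concrete for small graphs; it is not the O(log n)-time parallel algorithm of the paper, and it works on arbitrary graphs rather than circular-arc graphs specifically. The function names and the adjacency-dictionary encoding are assumptions, and a pair of vertices that becomes disconnected is treated as having infinite distance.

```python
from collections import deque


def bfs_dist(adj, source, removed=None):
    """Breadth-first distances from `source`, ignoring the vertex `removed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w != removed and w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def hinge_vertices(adj):
    """Return the set of vertices u whose removal lengthens the distance
    between some pair of other vertices (disconnection counts as longer)."""
    base = {v: bfs_dist(adj, v) for v in adj}
    hinges = set()
    for u in adj:
        for x in adj:
            if x == u:
                continue
            after = bfs_dist(adj, x, removed=u)
            if any(
                y != u and after.get(y, float("inf")) > d
                for y, d in base[x].items()
            ):
                hinges.add(u)
                break
    return hinges


path = {0: [1], 1: [0, 2], 2: [1]}
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(hinge_vertices(path), hinge_vertices(cycle4))
```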
The Fault-Tolerant Early Bird Problem
Bjorn FAY Martin KUTRIB

PAPER

Vol:
E87-D No:3
Page(s):
687-693
The capabilities of reliable computations in one-dimensional cellular automata are investigated by means of the Early Bird Problem. The problem is typical for situations in massively parallel systems where a global behavior must be achieved by only local interactions between the single elements. The cells that cause the misoperations are assumed to behave as follows. They run a self-diagnosis before the actual computation once. The result is stored locally such that the working state of a cell becomes visible to its neighbors. A non-working (defective) cell cannot modify information but is able to transmit it unchanged with unit speed. We present an O(n log (n) log (n))-time fault-tolerant solution of the Early Bird Problem.
Parallel Algorithms for Finding the Center of Interval and Circular-Arc Graphs
Fang Rong HSU Man Kwan SHAN

LETTER-Graphs and Networks

Vol:
E86-A No:10
Page(s):
2704-2709
The center problem of a graph is motivated by a number of facility location problems. In this paper, we propose parallel algorithms for finding the center of interval graphs and circular-arc graphs. Our algorithms run in O(log n) time using O(n/log n) processors on the EREW PRAM model when the intervals and arcs are given in sorted order.
BPL: A Language for Parallel Algorithms on the Butterfly Network
Fattaneh TAGHIYAREH Hiroshi NAGAHASHI

PAPER-Algorithms

Vol:
E83-D No:7
Page(s):
1488-1496
A number of parallel algorithms have been developed to solve large-scale real world problems. Although there has been much work on the design of parallel algorithms, there has been little on the design of languages for expressing these algorithms. This paper describes BPL, a new parallel language designed for butterfly networks. The purpose of this language is to help designers hide the complexity of the algorithm and leave the details of the mapping between data and processors to a lower level. BPL provides a simpler virtual machine for the designer, who no longer needs to think about the control of processors and data. From another point of view, BPL helps the designer to logically check the algorithm and correct any possible errors in it. The paper gives some examples implemented in this language. In addition, we have also implemented a software tool which simulates the running of the algorithm on the network. The results lead us to believe that this language would be useful in representing all kinds of algorithms on this network including normal algorithms and others.
Parallel Algorithms for Convex Hull Problems and Their Paradigm
Wei CHEN Koji NAKANO Koichi WADA

INVITED SURVEY PAPER-Parallel and Distributed Algorithms

Vol:
E83-D No:3
Page(s):
519-529
A convex hull is one of the most fundamental and interesting geometric constructs in computational geometry. Considerable research effort has focused on developing algorithms, both serial and parallel, for computing convex hulls. In particular, there are few problems whose parallel algorithms have been as thoroughly studied as those for convex hull problems. In this paper, we review the convex hull parallel algorithms and their paradigm. We provide a summary of results and introduce several interesting topics including typical techniques, output-size sensitive methods, randomized approaches, and robust algorithms for convex hull problems, through which we may see the highlights of the whole body of research on parallel algorithms. Most of our discussion uses the PRAM (Parallel Random Access Machine) computational model, but we also glance at the results for other parallel computational models such as mesh, mesh-of-trees, hypercube, reconfigurable array, and models of coarse grained multicomputers like BSP and LogP.
Distributed Concurrency Control with Local Wait-Depth Control Policy
Jiahong WANG Jie LI Hisao KAMEDA

PAPER-Databases

Vol:
E81-D No:6
Page(s):
513-520
Parallel Transaction Processing (TP) systems have great potential to serve the ever-increasing demands for high transaction processing rate. This potential, however, may not be reached due to the data contention and the widely-used two-phase locking (2PL) Concurrency Control (CC) method. In this paper, a distributed locking-based CC policy called LWDC (Local Wait-Depth Control) was proposed for dealing with this problem for the shared-nothing parallel TP system. On the basis of the LWDC policy, an algorithm called LWDCk was designed. Using simulation LWDCk was compared with the 2PL and the base-line Distributed Wait-Depth Limited (DWDL) CC methods. Simulation studies show that the new algorithm offers better system performance than those compared.
Parallel Algorithms for Finding a Hamiltonian Path and a Hamiltonian Cycle in an In-Tournament Graph
Shin-ichi NAKAYAMA Shigeru MASUYAMA

PAPER

Vol:
E81-A No:5
Page(s):
757-767
As a super class of tournament digraphs, Bang-Jensen, Huang and Prisner defined an in-tournament digraph (in-tournament for short) and investigated a number of its nice properties. The in-tournament is a directed graph in which the set of in-neighbors of every vertex induces a tournament digraph. In other words, the presence of arcs (x,z) and (y,z) implies that exactly one of (x,y) or (y,x) exists. In this paper, we propose, for in-tournaments, parallel algorithms for examining the existence of a Hamiltonian path and a Hamiltonian cycle and for constructing them, if they exist.
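The defining property of an in-tournament can be tested directly from the arc set, as in the following sketch. It only checks the definition given above and is unrelated to the Hamiltonian path and cycle algorithms of the paper; the function name and the encoding of arcs as pairs are assumptions.

```python
from itertools import combinations


def is_in_tournament(arcs):
    """arcs: iterable of (tail, head) pairs of a directed graph.
    Returns True if, for every vertex z, every two distinct in-neighbors
    x and y of z are joined by exactly one of the arcs (x, y) and (y, x)."""
    arcs = set(arcs)
    in_neighbors = {}
    for x, z in arcs:
        in_neighbors.setdefault(z, set()).add(x)
    for preds in in_neighbors.values():
        for x, y in combinations(preds, 2):
            # equal truth values mean either no arc or both arcs are present
            if ((x, y) in arcs) == ((y, x) in arcs):
                return False
    return True


print(is_in_tournament({(0, 2), (1, 2), (0, 1)}))
print(is_in_tournament({(0, 2), (1, 2)}))
```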
A New State Space-Based Approach for the Estimation of Two-Dimensional Frequencies and Its Parallel Implementations
Yi CHU Wen-Hsien FANG Shun-Hsyung CHANG

PAPER-Digital Signal Processing

Vol:
E80-A No:6
Page(s):
1099-1108
In this paper, we present a new state space-based approach for the two-dimensional (2-D) frequency estimation problem which occurs in various areas of signal processing and communication problems. The proposed method begins with the construction of a state space model associated with the noiseless data which contains a summation of 2-D harmonics. Two auxiliary Hankel-block-Hankel-like matrices are then introduced, from which the two frequency components can be derived via matrix factorizations along with frequency shifting properties. Although the algorithm can render high resolution frequency estimates, it also calls for a large amount of computation. To alleviate the high computational overhead, a highly parallelizable implementation via the principal subband component (PSC) of some appropriately chosen transforms is addressed as well. Such a PSC-based transform domain implementation not only reduces the size of the data to be processed, but also suppresses the contaminating noise outside the subband of interest. To reduce the computational complexity induced in the transformation process, we also suggest that either the discrete Fourier transform (DFT) or the Haar wavelet transform (HWT) be employed. As a consequence, this implementation approach can achieve substantial computational savings; meanwhile, as demonstrated by the provided simulation results, it still retains roughly the same performance as that of the original algorithm.
Parallel Algorithms for Maximal Linear Forests
Ryuhei UEHARA Zhi-Zhong CHEN

PAPER

Vol:
E80-A No:4
Page(s):
627-634
The maximal linear forest problem is to find, given a graph G = (V, E), a maximal subset of V that induces a linear forest. Three parallel algorithms for this problem are presented. The first one is randomized and runs in O(log n) expected time using n^2 processors on a CRCW PRAM. The second one is deterministic and runs in O(log^2 n) time using n^4 processors on an EREW PRAM. The last one is deterministic and runs in O(log^5 n) time using n^3 processors on an EREW PRAM. The results put the problem in the class NC.
Factoring Hard Integers on a Parallel Machine
Rene PERALTA Masahiro MAMBO Eiji OKAMOTO

PAPER

Vol:
E80-A No:4
Page(s):
658-662
We describe our implementation of the Hypercube variation of the Multiple Polynomial Quadratic Sieve (HMPQS) integer factorization algorithm on a Parsytec GC computer with 128 processors. HMPQS is a variation on the Quadratic Sieve (QS) algorithm which inspects many quadratic polynomials looking for quadratic residues with small prime factors. The polynomials are organized as the nodes of an n-dimensional cube. We report on the performance of our implementations on factoring several large numbers for the Cunningham Project.
Parallel Parsing on a Loosely Coupled Multiprocessor
Dong-Yul RA Jong-Hyun KIM

PAPER-Algorithm and Computational Complexity

Vol:
E79-D No:12
Page(s):
1620-1628
In this paper, we introduce a parallel algorithm for parsing context-free languages. Our algorithm can handle arbitrary context-free grammars since it is based on Earley's algorithm. It can operate on any loosely coupled multiprocessor which can provide a topology of a one-way ring. It uses p processors to parse an input string of length n, where 1 ≤ p ≤ n, and is shown to require O(n^3/p) time. The algorithm uses a simple job allocation strategy. Nevertheless, it achieves good load balancing and uses the processors efficiently.
Algorithm Transformation for Cube-Type Networks
Masaru TAKESUE

PAPER-Algorithms

Vol:
E79-D No:8
Page(s):
1031-1037
This paper presents a method for mechanically transforming a parallel algorithm on an original network so that the algorithm can work on a target network. It is assumed that the networks are of cube-type such as the shuffle-exchange network, omega network, and hypercube. Were those networks isomorphic to each other, the algorithm transformation would be an easy task. The proposed transformation method is based on a novel graph-embedding scheme <φ: δ, κ, π, ψ>. In addition to the dilating operation δ of the usual embedding scheme <φ: δ>, the novel scheme uses three primitive graph-transformation operations: κ (= δ^-1) for contracting a path into a node, π for pipelining a graph, and ψ (= π^-1) for folding a pipelined graph. By applying the primitive operations, the cube-type networks can be transformed so as to be isomorphic to each other. Relationships between the networks are represented by the composition of applied operations. With the isomorphic mapping φ, an algorithm in a node of the original network can be simulated in the corresponding node(s) of the target network. Thus the algorithm transformation is reduced to routine work.
