Scalable Detection of Frequent Substrings by Grammar-Based Compression

Masaya NAKAHARA; Shirou MARUYAMA; Tetsuji KUBOYAMA; Hiroshi SAKAMOTO

doi:10.1587/transinf.E96.D.457

IEICE TRANSACTIONS on Information

Summary :

A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar deriving the string deterministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a grammar as possible. Beyond that, the optimization problem is approximately solvable. In such approximation algorithms, the compressor based on edit-sensitive parsing (ESP) is especially suitable for detecting maximal common substrings as well as long frequent substrings. Based on ESP, we design a linear time algorithm to find all frequent patterns in a string approximately and prove several lower bounds to guarantee the length of extracted patterns. We also examine the performance of our algorithm by experiments in biological sequences and other compressible real world texts. Compared to other practical algorithms, our algorithm is faster and more scalable with large and repetitive strings.
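As an illustration of the statement that a string can be represented by a context-free grammar deriving it deterministically, the following is a minimal sketch of such a grammar and its expansion. It is not the ESP-based algorithm of the paper; the rule set `RULES` and the helper names `expand` and `count_uses` are hypothetical, chosen only for this example. It also shows why a small grammar exposes repetition: a nonterminal that occurs k times in the derivation tree marks a substring that occurs at least k times in the text.

```python
# Hypothetical grammar for illustration: every nonterminal has exactly one
# rule, so the grammar derives exactly one string.
RULES = {
    "S": ["B", "B", "A"],
    "B": ["A", "A"],
    "A": ["a", "b"],
}


def expand(symbol, rules):
    """Return the unique string derived from `symbol`."""
    if symbol not in rules:  # terminal symbol
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])


def count_uses(start, rules):
    """Count how often each nonterminal occurs in the derivation tree."""
    counts = {}

    def visit(symbol):
        if symbol in rules:
            counts[symbol] = counts.get(symbol, 0) + 1
            for s in rules[symbol]:
                visit(s)

    visit(start)
    return counts


text = expand("S", RULES)
uses = count_uses("S", RULES)
print(text)
print(uses)
```

Here three rules stand in for a ten-character string, and the occurrence counts of `A` and `B` give lower bounds on how often the substrings they derive appear in the text.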

Publication: IEICE TRANSACTIONS on Information Vol.E96-D No.3 pp.457-464

Publication Date: 2013/03/01

Online ISSN: 1745-1361

DOI: 10.1587/transinf.E96.D.457

Type of Manuscript: Special Section PAPER (Special Section on Foundations of Computer Science — New Trends in Algorithms and Theory of Computation —)

Masaya NAKAHARA, Shirou MARUYAMA, Tetsuji KUBOYAMA, Hiroshi SAKAMOTO, "Scalable Detection of Frequent Substrings by Grammar-Based Compression" in IEICE TRANSACTIONS on Information, vol. E96-D, no. 3, pp. 457-464, March 2013, doi: 10.1587/transinf.E96.D.457.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E96.D.457/_p

@ARTICLE{e96-d_3_457,
author={Masaya NAKAHARA and Shirou MARUYAMA and Tetsuji KUBOYAMA and Hiroshi SAKAMOTO},
journal={IEICE TRANSACTIONS on Information},
title={Scalable Detection of Frequent Substrings by Grammar-Based Compression},
year={2013},
volume={E96-D},
number={3},
pages={457-464},
abstract={A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar deriving the string deterministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a grammar as possible. Beyond that, the optimization problem is approximately solvable. In such approximation algorithms, the compressor based on edit-sensitive parsing (ESP) is especially suitable for detecting maximal common substrings as well as long frequent substrings. Based on ESP, we design a linear time algorithm to find all frequent patterns in a string approximately and prove several lower bounds to guarantee the length of extracted patterns. We also examine the performance of our algorithm by experiments in biological sequences and other compressible real world texts. Compared to other practical algorithms, our algorithm is faster and more scalable with large and repetitive strings.},
keywords={},
doi={10.1587/transinf.E96.D.457},
ISSN={1745-1361},
month={March}
}

TY - JOUR
TI - Scalable Detection of Frequent Substrings by Grammar-Based Compression
T2 - IEICE TRANSACTIONS on Information
SP - 457
EP - 464
AU - Masaya NAKAHARA
AU - Shirou MARUYAMA
AU - Tetsuji KUBOYAMA
AU - Hiroshi SAKAMOTO
PY - 2013
DO - 10.1587/transinf.E96.D.457
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E96-D
IS - 3
JA - IEICE TRANSACTIONS on Information
Y1 - March 2013
AB - A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar deriving the string deterministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a grammar as possible. Beyond that, the optimization problem is approximately solvable. In such approximation algorithms, the compressor based on edit-sensitive parsing (ESP) is especially suitable for detecting maximal common substrings as well as long frequent substrings. Based on ESP, we design a linear time algorithm to find all frequent patterns in a string approximately and prove several lower bounds to guarantee the length of extracted patterns. We also examine the performance of our algorithm by experiments in biological sequences and other compressible real world texts. Compared to other practical algorithms, our algorithm is faster and more scalable with large and repetitive strings.
ER -
