Efficient Substructure Discovery from Large Semi-Structured Data

Tatsuya ASAI; Kenji ABE; Shinji KAWASOE; Hiroshi SAKAMOTO; Hiroki ARIMURA; Setsuo ARIKAWA

Summary:

In this paper, we consider a data mining problem for semi-structured data. Modeling semi-structured data as labeled ordered trees, we present an efficient algorithm for discovering frequent substructures in a large collection of semi-structured data. By extending the enumeration technique developed by Bayardo (SIGMOD'98) for discovering long itemsets, our algorithm scales almost linearly in the total size of the maximal tree patterns contained in an input collection, depending only mildly on the size of the longest pattern. We also develop several pruning techniques that significantly speed up the search. Experiments on Web data show that our algorithm, combined with the proposed pruning techniques, runs efficiently on real-life datasets over a wide range of parameters.
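
To make the problem setting concrete, here is a minimal Python sketch of frequent ordered-subtree mining. It is not the authors' algorithm and includes none of their pruning techniques. Everything in it is an assumption made for illustration: trees are encoded as preorder lists of (depth, label) pairs, the support of a pattern is taken to be the number of trees in the collection that contain it, candidates are grown by attaching one new node to the rightmost path of a frequent pattern, and matching is done naively. The function names are hypothetical.

```python
# Illustrative sketch only; the encoding, support definition, and
# enumeration scheme below are assumptions, not taken from the paper.

def to_nested(seq):
    """Convert a preorder [(depth, label), ...] list to (label, children)."""
    root = (seq[0][1], [])
    stack = [root]  # stack[d] is the most recent node seen at depth d
    for depth, label in seq[1:]:
        node = (label, [])
        del stack[depth:]                 # drop nodes at this depth or deeper
        stack[depth - 1][1].append(node)  # attach to the current parent
        stack.append(node)
    return root

def matches_at(pattern, node):
    """True if pattern embeds with its root at node, keeping labels,
    parent-child links, and left-to-right sibling order."""
    if pattern[0] != node[0]:
        return False
    i = 0
    for pchild in pattern[1]:
        # Greedily take the leftmost unused child of node that matches.
        while i < len(node[1]) and not matches_at(pchild, node[1][i]):
            i += 1
        if i == len(node[1]):
            return False
        i += 1
    return True

def all_nodes(tree):
    yield tree
    for child in tree[1]:
        yield from all_nodes(child)

def support(pattern_seq, trees):
    """Number of trees containing the pattern (assumed support measure)."""
    pattern = to_nested(pattern_seq)
    return sum(any(matches_at(pattern, n) for n in all_nodes(t)) for t in trees)

def frequent_subtrees(data, min_support):
    """Level-wise search; returns {pattern (as a tuple): support}."""
    trees = [to_nested(seq) for seq in data]
    labels = sorted({label for seq in data for _, label in seq})
    frontier = [[(0, label)] for label in labels]
    result = {}
    while frontier:
        next_frontier = []
        for pat in frontier:
            sup = support(pat, trees)
            if sup < min_support:
                continue  # extensions of an infrequent pattern are infrequent
            result[tuple(pat)] = sup
            # Attach one new node below some node on the rightmost path.
            for depth in range(1, pat[-1][0] + 2):
                for label in labels:
                    next_frontier.append(pat + [(depth, label)])
        frontier = next_frontier
    return result

if __name__ == "__main__":
    data = [
        [(0, "a"), (1, "b"), (1, "c")],
        [(0, "a"), (1, "b"), (2, "d"), (1, "c")],
        [(0, "a"), (1, "c"), (1, "b")],
    ]
    for pattern, sup in frequent_subtrees(data, min_support=2).items():
        print(pattern, sup)
```

Because sibling order matters, the pattern with children b then c is counted in the first two trees only, while the pattern with children c then b occurs in just the third tree and falls below the threshold.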

Publication: IEICE TRANSACTIONS on Information Vol.E87-D No.12 pp.2754-2763

Publication Date: 2004/12/01

Type of Manuscript: PAPER

Category: Data Mining

Cite this

Tatsuya ASAI, Kenji ABE, Shinji KAWASOE, Hiroshi SAKAMOTO, Hiroki ARIMURA, Setsuo ARIKAWA, "Efficient Substructure Discovery from Large Semi-Structured Data," in IEICE TRANSACTIONS on Information, vol. E87-D, no. 12, pp. 2754-2763, December 2004.
URL: https://global.ieice.org/en_transactions/information/10.1587/e87-d_12_2754/_p

@ARTICLE{e87-d_12_2754,
author={Tatsuya ASAI and Kenji ABE and Shinji KAWASOE and Hiroshi SAKAMOTO and Hiroki ARIMURA and Setsuo ARIKAWA},
journal={IEICE TRANSACTIONS on Information},
title={Efficient Substructure Discovery from Large Semi-Structured Data},
year={2004},
volume={E87-D},
number={12},
pages={2754-2763},
abstract={In this paper, we consider a data mining problem for semi-structured data. Modeling semi-structured data as labeled ordered trees, we present an efficient algorithm for discovering frequent substructures in a large collection of semi-structured data. By extending the enumeration technique developed by Bayardo (SIGMOD'98) for discovering long itemsets, our algorithm scales almost linearly in the total size of the maximal tree patterns contained in an input collection, depending only mildly on the size of the longest pattern. We also develop several pruning techniques that significantly speed up the search. Experiments on Web data show that our algorithm, combined with the proposed pruning techniques, runs efficiently on real-life datasets over a wide range of parameters.},
month={December}
}

TY - JOUR
TI - Efficient Substructure Discovery from Large Semi-Structured Data
T2 - IEICE TRANSACTIONS on Information
SP - 2754
EP - 2763
AU - Tatsuya ASAI
AU - Kenji ABE
AU - Shinji KAWASOE
AU - Hiroshi SAKAMOTO
AU - Hiroki ARIMURA
AU - Setsuo ARIKAWA
PY - 2004
JO - IEICE TRANSACTIONS on Information
VL - E87-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2004
AB - In this paper, we consider a data mining problem for semi-structured data. Modeling semi-structured data as labeled ordered trees, we present an efficient algorithm for discovering frequent substructures in a large collection of semi-structured data. By extending the enumeration technique developed by Bayardo (SIGMOD'98) for discovering long itemsets, our algorithm scales almost linearly in the total size of the maximal tree patterns contained in an input collection, depending only mildly on the size of the longest pattern. We also develop several pruning techniques that significantly speed up the search. Experiments on Web data show that our algorithm, combined with the proposed pruning techniques, runs efficiently on real-life datasets over a wide range of parameters.
ER -