Content extraction from deep Web pages has received considerable attention in recent years. However, the increasingly complicated HTML structure of Web documents makes it difficult to recognize data records by analyzing the HTML source code alone. In this paper, we propose a method named LTDE to extract data records from a deep Web page. Instead of analyzing the HTML source code, LTDE uses the visual features of data records in deep Web pages. A Web page is treated as a finite set of visual blocks, and the data records are the visual blocks that share a similar layout. We also propose a pattern recognition method, the layout tree, to cluster visual blocks with similar layouts. A weight is calculated for every cluster, and the visual blocks in the highest-weight cluster are chosen as the data records to be extracted. Experimental results show that LTDE is more effective and more robust for Web data extraction than previous work.
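The pipeline outlined in the abstract (group visual blocks by layout, weight each cluster, keep the heaviest one) can be illustrated with a rough Python sketch. This is not the authors' implementation: the abstract does not define the layout tree or the weight function, so `VisualBlock`, `layout_signature` (a simple stand-in for the layout tree) and `cluster_weight` (total covered area) are all hypothetical choices made only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualBlock:
    """A rendered rectangle on the page (hypothetical representation)."""
    x: int
    y: int
    width: int
    height: int
    child_kinds: tuple  # e.g. ("image", "text", "text")


def layout_signature(block):
    # Assumption: blocks with the same width and the same sequence of
    # child kinds count as having a "similar layout". The paper's layout
    # tree is a richer structure that the abstract does not specify.
    return (block.width, block.child_kinds)


def cluster_by_layout(blocks):
    # Group blocks that share a layout signature.
    clusters = defaultdict(list)
    for block in blocks:
        clusters[layout_signature(block)].append(block)
    return list(clusters.values())


def cluster_weight(cluster):
    # Assumption: weight is the total page area covered by the cluster.
    return sum(block.width * block.height for block in cluster)


def extract_data_records(blocks):
    # Return the blocks of the highest-weight cluster, or [] for no input.
    clusters = cluster_by_layout(blocks)
    if not clusters:
        return []
    return max(clusters, key=cluster_weight)


# Toy page: one header, three result records, one sidebar.
header = VisualBlock(0, 0, 800, 100, ("text",))
records = [
    VisualBlock(0, 100 + i * 120, 600, 120, ("image", "text", "text"))
    for i in range(3)
]
sidebar = VisualBlock(600, 100, 200, 360, ("text", "text"))
page = [header, *records, sidebar]

data_records = extract_data_records(page)
```

On this toy page the three result blocks form the cluster with the largest total area, so they are the ones returned. A different weight function, for example one that also rewards the number of blocks or their position on the page, would slot into `cluster_weight` without changing the rest of the sketch.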
Jun ZENG
Chongqing University
Feng LI
Chongqing University
Brendan FLANAGAN
Kyushu University
Sachio HIROKAWA
Kyushu University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Jun ZENG, Feng LI, Brendan FLANAGAN, Sachio HIROKAWA, "LTDE: A Layout Tree Based Approach for Deep Page Data Extraction," in IEICE TRANSACTIONS on Information, vol. E100-D, no. 5, pp. 1067-1078, May 2017, doi: 10.1587/transinf.2016EDP7375.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016EDP7375/_p
@ARTICLE{e100-d_5_1067,
author={Jun ZENG and Feng LI and Brendan FLANAGAN and Sachio HIROKAWA},
journal={IEICE TRANSACTIONS on Information},
title={LTDE: A Layout Tree Based Approach for Deep Page Data Extraction},
year={2017},
volume={E100-D},
number={5},
pages={1067--1078},
keywords={},
doi={10.1587/transinf.2016EDP7375},
ISSN={1745-1361},
month={May},}
TY - JOUR
TI - LTDE: A Layout Tree Based Approach for Deep Page Data Extraction
T2 - IEICE TRANSACTIONS on Information
SP - 1067
EP - 1078
AU - Jun ZENG
AU - Feng LI
AU - Brendan FLANAGAN
AU - Sachio HIROKAWA
PY - 2017
DO - 10.1587/transinf.2016EDP7375
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E100-D
IS - 5
JA - IEICE TRANSACTIONS on Information
Y1 - May 2017
ER -