HAIF: A Hierarchical Attention-Based Model of Filtering Invalid Webpage

Chaoran ZHOU, Jianping ZHAO, Tai MA, Xin ZHOU

Summary:

When users search for information on the Internet, search engines invariably return some invalid webpages that contain no useful information. These invalid webpages interfere with users' access to useful information, reduce the efficiency of their queries, and occupy Internet resources. Accurate and fast filtering of invalid webpages can clean up the Internet environment and make searching more convenient for users. This paper proposes HAIF, an invalid webpage filtering model based on deep learning and a hierarchical attention mechanism. HAIF improves the semantic and sequence information representation of webpage text by concatenating lexical-level embeddings and paragraph-level embeddings. It introduces a hierarchical attention mechanism to optimize the extraction of text sequence features and webpage tag features. Within this mechanism, the local-level attention layer optimizes the local information in the plain text; concatenating the input embeddings with the feature matrix produced by the local-level attention calculation enriches the representation of information. The tag-level attention layer introduces webpage structural feature information into the attention calculation over different HTML tags, making HAIF better suited to the Internet resource field. To evaluate the effectiveness of HAIF in filtering invalid pages, we conducted various experiments. Experimental results demonstrate that, compared with the baseline models, HAIF improves to varying degrees on the evaluation criteria.
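
The two concatenation steps described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not specify the attention formulation, so plain dot-product attention with a fixed query vector is assumed here, and every name (`concat_embeddings`, `local_attention`, `enrich`, the toy vectors) is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def concat_embeddings(word_vecs, paragraph_vec):
    """Append the paragraph-level vector to every lexical-level vector."""
    return [w + paragraph_vec for w in word_vecs]


def local_attention(vectors, query):
    """Assumed dot-product attention: weight each vector by its score against `query`."""
    scores = [sum(x * q for x, q in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    attended = [[w * x for x in v] for w, v in zip(weights, vectors)]
    return weights, attended


def enrich(vectors, attended):
    """Concatenate each input vector with its attention-weighted counterpart."""
    return [v + a for v, a in zip(vectors, attended)]


# Toy data: three 2-d word vectors and one 1-d paragraph vector (illustrative only).
word_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
paragraph_vec = [0.5]

inputs = concat_embeddings(word_vecs, paragraph_vec)   # each row now has 3 values
query = [1.0, 1.0, 0.0]                                # hypothetical query vector
weights, attended = local_attention(inputs, query)
features = enrich(inputs, attended)                    # each row now has 6 values
```

In this toy setup the third token scores highest against the query and therefore receives the largest attention weight. In a trained model the query and the embeddings would be learned parameters rather than fixed values.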

Publication
IEICE TRANSACTIONS on Information Vol.E104-D No.5 pp.659-668
Publication Date
2021/05/01
Publicized
2021/02/25
Online ISSN
1745-1361
DOI
10.1587/transinf.2020NTP0001
Type of Manuscript
Special Section PAPER (Special Section on the Architectures, Protocols, and Applications for the Future Internet)

Authors

Chaoran ZHOU
  Changchun University of Science and Technology
Jianping ZHAO
  Changchun University of Science and Technology
Tai MA
  Changchun University of Science and Technology
Xin ZHOU
  Changchun University of Science and Technology