Automatic Web Page Segmentation and Noise Removal for Structured Extraction using Tag Path Sequences

Roberto Panerai Velloso; Carina F. Dorneles

Authors

Roberto Panerai Velloso, UFSC
Carina F. Dorneles, UFSC

Keywords:

noise removal, page segmentation, structured extraction, web mining

Abstract

Web page segmentation and data cleaning are essential steps in structured web data extraction. Identifying a web page's main content region and removing what is not important (menus, ads, etc.) can greatly improve the performance of the extraction process. For this task, we propose a novel and fully automatic algorithm that uses a tag path sequence (TPS) representation of the web page. The TPS is a sequence of symbols (a string), each symbol representing a distinct tag path. The proposed technique searches for positions in the TPS where the sequence can be split into two regions whose alphabets do not intersect, which means that the regions have completely different sets of tag paths and are therefore different regions. The results show that the algorithm is very effective at identifying the main content block of several major websites and that it improves the precision of the extraction step by removing irrelevant results.
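
The splitting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the integer symbol encoding, and the sample tag paths are assumptions made for illustration, and the sketch only finds candidate split positions; it does not choose which region is the main content.

```python
def tag_path_sequence(paths):
    """Encode each distinct tag path as an integer symbol (order of first appearance).

    Hypothetical helper: the input is assumed to be the list of tag paths
    of a page's nodes in document order.
    """
    symbols = {}
    return [symbols.setdefault(p, len(symbols)) for p in paths]


def split_positions(tps):
    """Return every index i where set(tps[:i]) and set(tps[i:]) are disjoint."""
    # Last index at which each symbol occurs.
    last = {s: i for i, s in enumerate(tps)}
    positions = []
    reach = 0  # furthest last occurrence among the symbols seen so far
    for i, s in enumerate(tps[:-1]):
        reach = max(reach, last[s])
        if reach == i:
            # No symbol seen up to i occurs again later, so the alphabets
            # of the two sides do not intersect.
            positions.append(i + 1)
    return positions


# Illustrative tag paths (assumed, not taken from the paper).
paths = [
    "html/body/nav/a",
    "html/body/nav/a",
    "html/body/div/p",
    "html/body/div/span",
    "html/body/div/p",
    "html/body/footer/a",
]
tps = tag_path_sequence(paths)
cuts = split_positions(tps)
```

In this example the sequence is cut after the navigation links and before the footer link, leaving the `div` paths as a separate region, because no tag path is shared across those boundaries.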
