Automatic Web Page Segmentation and Noise Removal for Structured Extraction using Tag Path Sequences

  • Roberto Panerai Velloso UFSC
  • Carina F. Dorneles UFSC
Keywords: noise removal, page segmentation, structured extraction, web mining

Abstract

Web page segmentation and data cleaning are essential steps in structured web data extraction. Identifying a web page's main content region and removing what is not important (menus, ads, etc.) can greatly improve the performance of the extraction process. For this task, we propose a novel, fully automatic algorithm that uses a tag path sequence (TPS) representation of the web page. The TPS is a sequence of symbols (a string), each representing a different tag path. The proposed technique searches for positions in the TPS where the sequence can be split into two regions whose alphabets do not intersect, which means that the regions have completely different sets of tag paths and are therefore distinct regions. The results show that the algorithm is very effective in identifying the main content block of several major websites and improves the precision of the extraction step by removing irrelevant results.
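
The splitting criterion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `to_tps` and `split_positions`, the integer symbol encoding, and the sample tag paths are assumptions made for illustration only. The abstract does not say how tag paths are extracted from the page or how the main content region is chosen among the candidate regions, so the sketch covers only the search for positions where the two sides share no symbol.

```python
from typing import Dict, List


def to_tps(tag_paths: List[str]) -> List[int]:
    """Encode tag paths as a TPS: one integer symbol per distinct path.

    Assumption: symbols are numbered in order of first appearance.
    """
    symbols: Dict[str, int] = {}
    return [symbols.setdefault(path, len(symbols)) for path in tag_paths]


def split_positions(tps: List[int]) -> List[int]:
    """Return every index i where tps[:i] and tps[i:] share no symbol."""
    # Last position at which each symbol occurs.
    last = {symbol: i for i, symbol in enumerate(tps)}
    cuts: List[int] = []
    reach = 0  # furthest last-occurrence of any symbol seen so far
    for i, symbol in enumerate(tps[:-1]):
        reach = max(reach, last[symbol])
        if reach == i:
            # No symbol of the prefix reappears after i: alphabets are disjoint.
            cuts.append(i + 1)
    return cuts


# Hypothetical tag paths, in document order (illustrative only).
paths = [
    "html/body/nav/a",
    "html/body/nav/a",
    "html/body/div/p",
    "html/body/div/p/b",
    "html/body/div/p",
    "html/body/footer/span",
]
tps = to_tps(paths)
cuts = split_positions(tps)
```

For these sample paths the sequence can be cut after the navigation links and before the footer, giving three regions with pairwise disjoint alphabets. A single left-to-right pass is enough because a cut is valid exactly when no symbol seen so far occurs again later.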

Published: 2013-09-11
Section: SBBD Articles