Siphoning Hidden-Web Data through Keyword-Based Interfaces: Retrospective

Luciano Barbosa; Juliana Freire

Authors

Luciano Barbosa, AT&T Labs - Research
Juliana Freire, University of Utah

Abstract

In this paper, we proposed the first fully automatic approach to crawling the Hidden Web through
keyword-based interfaces. Our crawler uses an algorithm that automatically derives a series of
keyword-based queries designed to obtain high coverage at low cost: the goal is to retrieve as much
of the hidden content as possible while minimizing the number of queries issued. The intuition
behind the algorithm is that, by sampling the hidden contents of an online database or document
collection, we can discover keywords that occur with high frequency. Queries built from these
high-frequency keywords then return a large number of answers.
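
The intuition above can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory collection, the `search` and `siphon` functions, the seed keyword, the whitespace tokenization, and the query-budget stopping rule are all assumptions made for illustration. The sketch only shows the sample-then-query loop, in which each query's results serve as a sample and the most frequent unused term in that sample becomes the next query.

```python
from collections import Counter


def search(collection, keyword):
    """Hypothetical keyword interface: ids of documents containing the keyword."""
    return {
        doc_id
        for doc_id, text in collection.items()
        if keyword in text.lower().split()
    }


def siphon(collection, seed, max_queries=10):
    """Greedy sketch: query with the most frequent unused term seen so far.

    Returns the set of retrieved document ids and the list of issued queries.
    The query budget (max_queries) is an assumed stopping rule.
    """
    retrieved = set()   # ids of documents obtained so far (the sample)
    issued = []         # keywords already used as queries
    freq = Counter()    # number of sampled documents containing each term
    keyword = seed

    while keyword is not None and len(issued) < max_queries:
        issued.append(keyword)
        new_ids = search(collection, keyword) - retrieved
        for doc_id in new_ids:
            # Count each term once per document.
            freq.update(set(collection[doc_id].lower().split()))
        retrieved |= new_ids

        # Pick the highest-frequency term that has not been issued yet;
        # ties are broken by the lexicographically largest term.
        candidates = [(n, term) for term, n in freq.items() if term not in issued]
        keyword = max(candidates)[1] if candidates else None

    return retrieved, issued


# Toy collection standing in for a hidden database (illustrative data only).
DOCS = {
    1: "hidden web crawler",
    2: "web database query",
    3: "keyword query interface",
    4: "protein sequence database",
}

if __name__ == "__main__":
    found, queries = siphon(DOCS, seed="web", max_queries=10)
    print(f"coverage: {len(found)}/{len(DOCS)} using {len(queries)} queries")
```

A real crawler would additionally have to parse result pages, filter stop words, and decide when further queries stop paying off; none of those steps is modeled here.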
