Siphoning Hidden-Web Data through Keyword-Based Interfaces: Retrospective


  • Luciano Barbosa, AT&T Labs - Research
  • Juliana Freire, University of Utah


In this paper, we proposed the first fully automatic approach to crawling the Hidden Web through
keyword-based interfaces. Our crawler uses an algorithm that automatically derives a series of
keyword-based queries designed to achieve high coverage at low cost: the goal is to retrieve as much
of the hidden content as possible while minimizing the number of queries required. The intuition
behind the algorithm is that, by sampling the hidden contents of an online database or document
collection, we can discover keywords that occur with high frequency. Queries built from these
high-frequency keywords then return a large number of answers.
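
The sampling intuition can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: the in-memory collection DOCS, the search function, and the siphon helper are hypothetical stand-ins, and the greedy rule used here (always issue the most frequent unused term seen in the documents retrieved so far) is an assumption made for illustration. A real crawler would submit queries through a site's search form rather than scan a local dictionary.

```python
from collections import Counter

# Toy stand-in for a hidden collection (assumption: not data from the paper).
DOCS = {
    1: "hidden web crawler keyword query",
    2: "keyword search interface database",
    3: "database query optimization cost",
    4: "web form interface crawler",
    5: "protein sequence database",
}


def search(keyword):
    """Hypothetical keyword interface: ids of documents containing the keyword."""
    return {doc_id for doc_id, text in DOCS.items() if keyword in text.split()}


def siphon(seed, max_queries):
    """Greedy sketch: query with the most frequent unused term in the sample so far."""
    retrieved = set()   # ids of documents obtained so far (the sample)
    used = []           # keywords already issued, in order
    freq = Counter()    # document frequency of each term within the sample
    keyword = seed
    while keyword is not None and len(used) < max_queries:
        used.append(keyword)
        new_ids = search(keyword) - retrieved
        retrieved |= new_ids
        for doc_id in new_ids:
            # Count each term once per newly retrieved document.
            freq.update(set(DOCS[doc_id].split()))
        # Next keyword: highest count, ties broken by the alphabetically last term.
        candidates = [(count, term) for term, count in freq.items() if term not in used]
        keyword = max(candidates)[1] if candidates else None
    return retrieved, used


if __name__ == "__main__":
    docs_found, queries = siphon("web", max_queries=5)
    print(f"{len(docs_found)} of {len(DOCS)} documents after queries: {queries}")
```

The query budget max_queries stands in for the cost side of the trade-off, and the size of the retrieved set stands in for coverage. The sketch also shows a weakness of the naive rule: a term that is frequent in the current sample may return only documents that have already been retrieved, which wastes a query.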

