Automatic Document Classification Temporally Robust

Authors

  • Thiago Salles, Universidade Federal de Minas Gerais
  • Leonardo Rocha, Universidade Federal de São João Del Rei
  • Fernando Mourão, Universidade Federal de Minas Gerais
  • Gisele L. Pappa, Universidade Federal de Minas Gerais
  • Lucas Cunha, Universidade Federal de Minas Gerais
  • Marcos André Gonçalves, Universidade Federal de Minas Gerais
  • Wagner Meira Jr., Universidade Federal de Minas Gerais

Abstract

The widespread use of the Internet has increased the amount of information stored on and accessed through the Web. This information is frequently organized as textual documents and is the main target of search engines and other retrieval tools, which have to classify documents, among other tasks. Automatic Document Classification (ADC) associates documents with semantically meaningful categories and usually employs a supervised learning strategy: we first build a classification model from pre-classified documents and then use the model to classify new documents. One major challenge in building classifiers is dealing with the temporal evolution of the characteristics of the documents and of the classes to which they belong. However, most current ADC techniques do not consider this evolution while building and using the models. Previous studies show that the performance of classifiers may be affected by three different temporal effects (class distribution, term distribution, and class similarity). In this paper, we propose a new approach that aims to minimize the impact of these temporal effects through a Temporal Adjustment Factor, in order to devise temporally robust versions of traditional classifiers (Rocchio and KNN). Experimental results obtained using two large, real textual collections point to significant gains of up to 11% for the temporally aware versions of the classifiers over their traditional counterparts, and of up to 4% over SVM (with a significantly lower runtime).
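
To make the idea of temporally weighted training more concrete, the following is a minimal illustrative sketch of a Rocchio-style (centroid-based) classifier in which each training document is weighted by how close its creation year is to the year of the document being classified. The abstract does not define the Temporal Adjustment Factor, so the weighting function `temporal_weight`, its `decay` parameter, and the toy data are all assumptions made for illustration only and should not be read as the authors' formulation.

```python
import math
from collections import defaultdict


def temporal_weight(doc_year, ref_year, decay=0.5):
    # Hypothetical weighting (an assumption, not the paper's factor):
    # 1.0 when the years match, shrinking as the gap grows.
    return 1.0 / (1.0 + decay * abs(doc_year - ref_year))


def train_centroids(docs, ref_year, decay=0.5):
    # docs: list of (term_counts, label, year) tuples.
    # Returns one weighted-average term vector (centroid) per label.
    sums = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for terms, label, year in docs:
        w = temporal_weight(year, ref_year, decay)
        totals[label] += w
        for term, count in terms.items():
            sums[label][term] += w * count
    return {
        label: {t: v / totals[label] for t, v in vec.items()}
        for label, vec in sums.items()
    }


def cosine(a, b):
    # Cosine similarity between two sparse term vectors.
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(terms, centroids):
    # Assign the label whose centroid is most similar to the document.
    return max(centroids, key=lambda label: cosine(terms, centroids[label]))


# Toy collection in which the term "cloud" drifts from one class to another.
training = [
    ({"mainframe": 6}, "computing", 1990),
    ({"cloud": 2}, "computing", 2009),
    ({"cloud": 3, "rain": 3}, "weather", 1990),
    ({"rain": 1}, "weather", 2009),
]
query = {"cloud": 1}  # a document created in 2010

plain = classify(query, train_centroids(training, ref_year=2010, decay=0.0))
aware = classify(query, train_centroids(training, ref_year=2010, decay=0.5))
print(plain, aware)
```

With `decay=0.0` every training document counts equally, which corresponds to a traditional centroid classifier; with a positive `decay`, recent documents dominate each centroid, so a term whose class association has shifted over time is judged mostly by its recent usage.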

Published

2010-09-10