Automatic Selection of Training Examples for a Record Deduplication Method Based on Genetic Programming

Gabriel Silva Gonçalves; Moisés G de Carvalho; Alberto H. F. Laender; Marcos A. Gonçalves

Authors

Gabriel Silva Gonçalves, UFMG
Moisés G de Carvalho, UFMG
Alberto H. F. Laender, UFMG
Marcos A. Gonçalves, UFMG

Keywords

Information Storage and Retrieval, Artificial Intelligence

Abstract

Recently, machine learning techniques have been used to solve the record deduplication problem. However, these techniques require training examples, which in most cases are generated manually. The cost of creating this set of examples hinders the adoption of such techniques. In this article, we propose an approach based on a deterministic technique to automatically suggest training examples for a deduplication method based on genetic programming. Our experiments with synthetic datasets show that, by using only 15% of the examples suggested by our approach, it is possible to achieve F1 results equivalent to those obtained when using all the examples, leading to savings in training time of up to 85%.
