KSample: Dynamic Sampling Over Unbounded Data Streams

  • Tiago Rodrigo Kepe Universidade Federal do Paraná
  • Eduardo Cunha de Almeida Universidade Federal do Paraná, Brazil
  • Thomas Cerqueus Université de Lyon, CNRS, INSA-Lyon, LIRIS, France
Keywords: Data Analytics, Data Sampling, Data Stream

Abstract

Data sampling over data streams is common practice to allow the analysis of data in real-time. However, sampling over data streams becomes complex when the stream does not fit in memory, and worse yet, when the length of the stream is unknown. A well-known technique for sampling data streams is the Reservoir Sampling. It requires a fixed-size reservoir that corresponds to the resulting sample size. But, defining the reservoir size is challenging: huge samples may waste computing resources and may not fit in memory; whereas tiny samples may be inadequate and prevent from drawing meaningful conclusions. This article presents KSample, a novel data sampling algorithm over unbounded data streams. It does not require to know the length of the stream or the size of the sample. The key idea of KSample is based on an invariant that keeps the percentage of the stream regardless of its length. That is the reservoir invariably represents at least the target percentage of the stream. KSample eliminates the problem of memory space by defining the concept of distributed mini-reservoirs grounded on the same invariant. Experiments show that KSample is substantially faster than the Reservoir Sampling algorithm to generate samples. Finally, KSample was put in practice to speed up data analytics over MapReduce jobs, reducing their response times by up to a factor of 20.

Downloads

Download data is not yet available.
Published
2015-10-12