Fast Parallel Set Similarity Joins on Many-core Architectures
Keywords:
Advanced Query Processing, GPU, High Performance Computing, Parallel Set Similarity JoinAbstract
Set similarity join is a core operation for text data integration, cleaning, and mining. Previous research work on improving the performance of set similarity joins mostly focused on sequential, CPU-based algorithms. Main optimizations of such algorithms exploit high threshold values and the underlying data characteristics to derive efficient filters. In this article, we investigate strategies to accelerate set similarity join by exploiting massive parallelism available in modern Graphics Processing Units (GPUs). We develop two new parallel set similarity join algorithms, which implement an inverted index on the GPU to quickly identify similar sets. The first algorithm, called \texttt{gSSJoin}, does not rely on any filtering scheme and, thus, exhibits much better robustness to variations of threshold values and data distributions. Moreover, gSSJoin adopts a load balancing strategy to evenly distribute the similarity calculations among the GPU's threads. The second algorithm, called sf-gSSJoin, applies a block division scheme for dealing with large datasets that do not fit in GPU memory. This scheme also enables a substantial reduction of the comparison space by discarding entire blocks based on size limits. We present variants of both algorithms for a multi-GPU platform to further exploit task parallelism in addition to data parallelism. Experimental evaluation on real-world datasets shows that we obtain up to 109x and 19.5x speedups over state-of-the-art algorithms for CPU and GPU, respectively.