Seamless Integration of Distance Functions and Feature Vectors for Similarity-Queries Processing
Keywords: distance functions, extended-SQL, feature extractors, similarity queries
Abstract: Searching complex data, such as medical images and financial time series, has attracted considerable research effort in recent years. Feature extraction methods (FEM) and distance functions (DF) play a crucial role when comparing such data, since they can help bridge the semantic gap between the users' intent and Content-Based Retrieval Systems (CBRS). Metric access methods (MAM), in turn, are important structures aimed at reducing the CBRS answer time. Integrating all these components into a single framework is quite challenging. In fact, such a framework should be flexible enough to let the database administrator (DBA) add new FEMs, DFs and MAMs on the fly, as they are required and developed, and combine them according to the requirements of each CBRS application. An adequate way to create such a framework is to extend Relational Database Management Systems (RDBMS) and the SQL language so that the definitions of FEMs, DFs and MAMs are integrated into the core search engine. In this paper we extend and improve the original strategy of the existing middleware SIREN into a new, complete similarity-enabled framework called SimbA. Based on a new approach, SimbA allows much tighter and more flexible handling of new complex data types, FEMs, DFs and MAMs in a seamless abstraction. We adopted a SQL extension to represent the users' queries. Our extension of the SELECT command for similarity comparisons also provides a canonical query execution plan, which allows data distribution estimates to be used to generate cost-based optimization plans. To illustrate our architecture, we exemplify the developed concepts by showing how to add a new complex data type (the "financial time series") as well as its corresponding FEM to our framework. Through the extended-SQL and by using complex predicates, we provide example applications that store and query two real datasets as relational tables in commercial RDBMS.
The first one (DDSM_ROI) is composed of mammograms, while the second (BM&F_BOVESPA) contains weeks of the Brazilian stock exchange index. The experiments performed confirm that our approach can be used as a generic backend for CBRS while maintaining all the desirable features of an RDBMS.
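
As a rough illustration of how a feature extraction method and a distance function combine in a similarity query, the following minimal Python sketch registers one FEM and one DF by name and answers a k-nearest-neighbor query over a small table of time series. It is not SimbA's interface or its extended-SQL syntax: the registries, the `mean_std` extractor, the `knn_query` function and the toy data are all hypothetical, and a linear scan stands in for the metric access method that a real system would use.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]

# Hypothetical registries: a DBA would add entries here "on the fly".
feature_extractors: Dict[str, Callable[[Sequence[float]], Vector]] = {}
distance_functions: Dict[str, Callable[[Vector, Vector], float]] = {}


def mean_std_extractor(series: Sequence[float]) -> Vector:
    """Toy FEM: summarize a series by its mean and population std. deviation."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series) / n
    return (mean, math.sqrt(variance))


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


feature_extractors["mean_std"] = mean_std_extractor
distance_functions["L2"] = euclidean


def knn_query(table: Dict[str, Sequence[float]], query: Sequence[float],
              k: int, fem: str, df: str) -> List[str]:
    """Return the keys of the k rows closest to the query object.

    Linear scan for clarity; a metric access method would prune
    most of these distance computations.
    """
    extract = feature_extractors[fem]
    dist = distance_functions[df]
    q = extract(query)
    scored = sorted((dist(extract(series), q), key)
                    for key, series in table.items())
    return [key for _, key in scored[:k]]


# Toy data, invented for illustration only (not taken from BM&F_BOVESPA).
weeks = {
    "w1": [1.0, 1.0, 1.0, 1.0, 1.0],
    "w2": [1.0, 2.0, 3.0, 4.0, 5.0],
    "w3": [10.0, 10.0, 10.0, 10.0, 10.0],
}

print(knn_query(weeks, [1.0, 1.0, 1.0, 1.0, 2.0], k=2, fem="mean_std", df="L2"))
# prints ['w1', 'w2']
```

Looking up the FEM and DF by name at query time is what lets new extractors and distance functions be added without changing the query code. This is the kind of flexibility the framework aims to provide inside the RDBMS.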