Extracting and Semantically Integrating Implicit Schemas from Multiple Spreadsheets of Biology based on the Recognition of their Nature
Keywords: biology, interoperability, semantic web, spreadsheets
Spreadsheets are popular among users and organizations and have become an essential data management tool. The ease of handling spreadsheets, combined with the creative freedom they offer, has increased the volume of data available in this format. However, spreadsheets were not conceived to integrate data from distinct sources, and challenges arise in systematizing processes to reuse and combine their data. Many related initiatives address the problem of integrating data inside spreadsheets, focusing on lexical and syntactic aspects. However, properly exploiting the semantics of this data remains an open opportunity. In this sense, some related work proposes mapping spreadsheet contents to open interoperability standards, mainly Semantic Web standards. The main limitation of such proposals is the assumption that the schema and semantics of spreadsheets can be recognized and made explicit automatically, regardless of their domain. This work differs from related work by assuming the essential role of context -- mainly the domain in which the spreadsheet was conceived -- in delineating the shared practices of the biology community, which establish construction patterns that our system recognizes automatically during data extraction and schema recognition. In this article, we present the results of a practical experiment involving such a system, in which we integrate hundreds of spreadsheets belonging to the biology domain and available on the Web. This integration was possible due to the observation that the nature of a spreadsheet can be recognized from its tabular organization.