Journal of Information and Data Management

Cover and Frontmatter

Alberto H. F. Laender, Mirella M Moro — Tue, 04 Oct 2011 00:00:00 -0300

Editorial

Angelo Brayner; Maristela Holanda; Carina Friedrich Dorneles — Wed, 30 Dec 2020 00:00:00 -0300

Frontmatter

Angelo Brayner; Maristela Holanda — Wed, 30 Dec 2020 00:00:00 -0300

Editorial

Angelo Brayner; Maristela Holanda; Thales Sehn Körting — Fri, 30 Oct 2020 00:00:00 -0300

Frontmatter

Angelo Brayner; Maristela Holanda — Fri, 30 Oct 2020 00:00:00 -0300

Editorial

Angelo Brayner; Maristela Holanda; Elaine Ribeiro de Faria Paiva — Tue, 30 Jun 2020 00:00:00 -0300

Frontmatter

Angelo Brayner; Maristela Holanda — Tue, 30 Jun 2020 00:00:00 -0300

Analysis of ENEM’s attendants between 2012 and 2017 using a clustering approach

Sun, 14 Feb 2021 00:00:00 -0300

Data analysis is increasingly being used as an unbiased and accurate way to evaluate many aspects of society and their evolution over the years. This article presents an analysis of students' characteristics, between 2012 and 2017, in the most important exam for entry into higher education in Brazil, the Exame Nacional do Ensino Médio (Enem). The intention is to gain insights into Brazilian regions, Enem’s areas of knowledge, type of school and accessibility, using a clustering method (K-means). The database was extensively and carefully cleaned in order to homogenize it and avoid statistical bias. The results are presented objectively, so that they can serve as a numerical basis for work in socio-educational disciplines or for studies interested in better understanding the evolution of Enem in recent years. Finally, some discussion of the clustering results and their limitations is presented.
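As a rough illustration of the clustering method named in this abstract, the sketch below runs plain K-means (Lloyd's algorithm) on a handful of invented score pairs. The data, the function name `kmeans`, and the explicit initial centroids are assumptions made for the example; nothing here reproduces the article's dataset, features, or preprocessing.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, iterations=50):
    """Plain Lloyd's algorithm; returns (centroids, cluster index per point)."""
    centroids = list(initial_centroids)
    k = len(centroids)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment


# Hypothetical (score in area 1, score in area 2) pairs, invented for illustration.
scores = [
    (450.0, 480.0), (460.0, 470.0), (455.0, 490.0),
    (700.0, 720.0), (710.0, 690.0), (695.0, 705.0),
]
centroids, labels = kmeans(scores, initial_centroids=[scores[0], scores[3]])
print(labels)
print(centroids)
```

The initial centroids are passed in explicitly so that the run is deterministic; a real analysis would more likely use random or k-means++ initialization with several restarts.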

Weighted Linking Decomposition: Mining Denser and More Compact Hierarchies for Bipartite Graphs

Edré Moreira, Guilherme Oliveira Campos, Wagner Meira Jr. — Tue, 30 Jun 2020 00:00:00 -0300

Dense subgraph detection is a well-known problem in graph theory.
The hierarchical organization of graphs as dense subgraphs, however, goes beyond simple clustering, as it allows the analysis of the network at different scales.
Although there are several hierarchical decomposition methods for unipartite graphs, only a few approaches for the bipartite case have been proposed.
In this work, we explore the problem of hierarchical decomposition for bipartite graphs.
We propose an algorithm called Weighted Linking that identifies denser and more compact hierarchies than the state-of-the-art approach.
We also propose a new score to help choose the best between two hierarchical decompositions of the same graph.
The proposed algorithm was evaluated experimentally using six real-world datasets and identified smaller and denser hierarchies on most of them.

Overcoming Bias in Community Detection Evaluation

Jeancarlo Campos Leão, Alberto H. F. Laender, Pedro O. S. Vaz de Melo — Wed, 30 Dec 2020 00:00:00 -0300

Community detection is a key task to further understand the function and the structure of complex networks. Therefore, a strategy used to assess this task must be able to avoid biased and incorrect results that might invalidate further analyses or applications that rely on such communities. Two widely used strategies to assess this task are generally known as structural and functional. The structural strategy basically consists in detecting and assessing such communities by using multiple methods and structural metrics. On the other hand, the functional strategy might be used when ground truth data are available to assess the detected communities. However, the evaluation of communities based on such strategies is usually done in experimental configurations that are largely susceptible to biases, a situation that is inherent to algorithms, metrics and network data used in this task. Furthermore, such strategies are not systematically combined in a way that allows for the identification and mitigation of bias in their algorithms, metrics or network data to converge into more consistent results. In this context, the main contribution of this article is an approach that supports a robust quality evaluation when detecting communities in real-world networks. In our approach, we measure the quality of a community by applying the structural and functional strategies, and the combination of both, to obtain different pieces of evidence. Then, we consider the divergences and the consensus among the pieces of evidence to identify and overcome possible sources of bias in community detection algorithms, evaluation metrics, and network data. Experiments conducted with several real and synthetic networks provided results that show the effectiveness of our approach to obtain more consistent conclusions about the quality of the detected communities.

PrivLBS Preserving Privacy in Location Based Services

Eduardo Rodrigues Duarte Neto, Javam C. Machado, André Luis Mendonça — Wed, 19 Feb 2020 00:00:00 -0300

Location-based services have been increasingly integrated into people’s daily activities. However, some of these services may not be trustworthy and lead to serious privacy breaches. While spatial transformation techniques such as location perturbation or generalization have been studied extensively, many of them only consider the location at single timestamps without considering temporal correlations of a moving user’s locations, leaving the user’s location with no guarantees of privacy protection against attacks that would exploit this vulnerability.

This work proposes a new technique for preserving data privacy, named PrivLBS, which ensures that the individual’s location will not be easily re-identified by malicious services. Extensive simulation experiments have been carried out to evaluate the efficiency of PrivLBS. Experimental results show that PrivLBS achieves higher protection than other related approaches against different kinds of attack.

Mining Temporal Exception Rules from Multivariate Time Series Using a new Support Measure

Thabata Amaral, Elaine Parros Machado de Sousa — Wed, 30 Dec 2020 00:00:00 -0300

Association rule mining is a common task for discovering useful and comprehensive relationships among frequent and infrequent data. Frequent patterns describe a usual behavior, while infrequent ones represent uncommon knowledge. Our interest lies in finding exception rules, a class of infrequent patterns that may have critical effects as a consequence. Existing approaches for exception rule mining usually handle “itemsets databases”, where transactions are organized with no temporal information. However, temporality may be inherent to some real contexts and should be considered to improve the semantic quality of results. Moreover, these approaches implement a non-discriminatory support measure to estimate the relevance of an item, thus interpreting a large volume of data that may be merely occasional as patterns. Aiming to overcome these drawbacks, we propose TRiER (TempoRal Exception Ruler), an efficient method for mining temporal exception rules that not only discovers exceptional behaviors and their causative agents, but also identifies how long consequences take to appear. We also present a new support measure to manipulate time series. This measure considers the context in which a pattern occurs, thus incorporating more semantics into the results obtained. We performed an extensive experimental analysis on real multivariate time series to verify the practical applicability of TRiER. Our results show that TRiER has lower computational cost and is more scalable than existing approaches while finding a succinct and relevant set of patterns.

SAVIME: An Array DBMS for Simulation Analysis and ML Models Prediction

Sun, 14 Feb 2021 00:00:00 -0300

Limitations in current DBMSs prevent their wide adoption in scientific applications. In order to let such applications benefit from DBMS support, enabling declarative data analysis and visualization over scientific data, we present an in-memory array DBMS called SAVIME. In this work we describe SAVIME, along with its data model. Our preliminary evaluation shows how SAVIME, by using a simple storage definition language (SDL), can outperform the state-of-the-art array database system, SciDB, during the process of data ingestion. We also show that it is possible to use SAVIME as a storage alternative for a numerical solver without affecting its scalability, making it useful for modern ML-based applications.

An Experimental Analysis of the Use of Different Storage Technologies on a Relational DBMS

Francisco D. B. S. Praciano, Italo C. Abreu, Javam C. Machado — Wed, 30 Dec 2020 00:00:00 -0300

The most traditional Database Management Systems (DBMS) are built on the premise that the data is stored on magnetic disks such as hard disk drives (HDD). Recently, several alternatives to HDDs have emerged, such as solid state drives (SSDs) based on non-volatile memory (NVM) technology such as 3D X-point, and the new generations of dynamic random access memories (DRAM). The different characteristics of these devices may impact the performance of DBMSs. In this work, we analyze the performance of a DBMS that stores its databases in four different ways: in HDD, in SSD NVM, in DRAM, and in a hybrid way, using the three storage devices together. To do this, we use two workloads, analytical and transactional, and we observe the throughput as well as the latency. We then discuss the reasons that give rise to the results obtained for each type of storage. We also show that query processing can benefit from the different characteristics of each storage device to perform faster queries and, finally, we analyze the benefits of using a hybrid storage system.

Auditing Government Purchases with a Multicriteria Anomaly Detection Strategy

Tue, 30 Jun 2020 00:00:00 -0300

Government purchases are the usual instrument for public acquisition of goods and services. Despite extensive legislation and several control and auditing mechanisms, fraud is still diverse and commonplace at all levels of public administration, wasting public resources. Through the use of frequent patterns, temporal correlation and combined analysis of multiple criteria, this work proposes a methodology for detecting anomalies in government purchases. The methodology applies several levels of filtering with respect to the entities involved, and purchases are considered fraudulent based on diverse criteria. The applicability and effectiveness of the methodology are demonstrated through a real case study in which we were able to identify a long-term provider collusion.

CNN-DFT Based Approach Applied to Image Inspection of Railcar Component: A Comparison with Machine Learning Methods

Sun, 14 Feb 2021 00:00:00 -0300

Railcar component inspection is one of the most critical tasks in railway maintenance. The use of image processing, coupled with machine learning, has emerged as a solution for replacing current standard methodologies. Spectral analysis gives the frequency representation of a signal and has been largely used in signal processing tasks. In this sense, this work evaluates the use of the discrete Fourier transform (DFT) in addition to the spatial image representation of a railcar component in an automatic detector of defective parts based on convolutional neural network (CNN) classification. The most appropriate combination of images of the spatial and frequency domains is compared to the HOG feature descriptor linked to multilayer perceptron (MLP) and support vector machine (SVM) classification, where data augmentation is investigated to improve the classification performed by all approaches. The results are reported as accuracy values and accuracy boxplots, and they are encouraging for the combination of spatial image and DFT magnitude with data augmentation as CNN inputs, reaching an accuracy of .

A New Approach for Measuring Subjectivity in Brazilian News

Diogo Florêncio de Lima; A. S. C. Melo, D. L. Carvalho, L. B. Marinho — Tue, 30 Jun 2020 00:00:00 -0300

With the advent of digital journalism, information democratization has become a reality, since news articles are published as soon as the facts occur and are accessible from any device connected to the internet. It is commonly perceived that some media outlets are more biased than others in the way they expose the facts. However, measuring such biases automatically is still an open research challenge. Under the assumption that journalistic texts must have an objective and impartial language, high levels of subjectivity in these texts may indicate bias. This paper proposes an initial analysis of the usage of subjectivity lexicons to characterize subjectivity in seven popular media outlets in Brazil. To better understand the obtained results, we carried out a correlation analysis between the levels of subjectivity, readability, and news popularity metrics. The adopted methods, along with the findings obtained from this research, may contribute to a better understanding of the linguistic characteristics of the news that readers consume daily in Brazil.

Evaluating Edge-Cloud Computing Trade-Offs for Mobile Object Detection and Classification with Deep Learning

Tue, 30 Jun 2020 00:00:00 -0300

Internet-of-Things (IoT) applications based on Artificial Intelligence, such as mobile object detection and recognition from images and videos, may greatly benefit from inferences made by state-of-the-art Deep Neural Network (DNN) models. However, adopting such models in IoT applications poses an important challenge, since DNNs usually require lots of computational resources (i.e. memory, disk, CPU/GPU, and power), which may prevent them from running on resource-limited edge devices. On the other hand, moving the heavy computation to the Cloud may significantly increase running costs and latency of IoT applications. Among the possible strategies to tackle this challenge are: (i) DNN model partitioning between edge and cloud; and (ii) running simpler models in the edge and more complex ones in the cloud, with information exchange between models, when needed. Variations of strategy (i) also include: running the entire DNN on the edge device (sometimes not feasible) and running the entire DNN on the cloud. All these strategies involve trade-offs in terms of latency, communication, and financial costs. In this article we investigate such trade-offs in real-world scenarios. We conduct several experiments in object detection and image classification models. Our experimental setup includes a Raspberry PI 3 B+ and a cloud server equipped with a GPU. Experiments using different network bandwidths are performed. Our results provide useful insights about the aforementioned trade-offs.

IT Incident Solving Domain Experiment on Business Process Failure Prediction

Pedro Mello, Flavia Santoro, Kate Revoredo — Sun, 14 Feb 2021 00:00:00 -0300

Business process monitoring aims at maintaining the reliability of process executions. Nevertheless, the dynamic nature of business processes hinders a proactive scenario in which risk mitigation actions can occur before the facts that put the process at risk. We argue that understanding failure behavior allows proactive actions. Analysing historical data of process executions supports the identification of situations and patterns of failure behavior. In this paper, we present an experiment in which a combination of well-established techniques from the Data Mining and Process Mining fields is applied to an incident management process. The results obtained show that it is possible to identify failures in order to reach a proactive risk mitigation scenario.

Polyflow: a Polystore-compliant Mechanism to Provide Interoperability to Heterogeneous Provenance Graphs

Yan Mendes, Daniel de Oliveira, Victor Ströele — Wed, 30 Dec 2020 00:00:00 -0300

Many scientific experiments are modeled as workflows. Data from a workflow is captured by Workflow Management Systems (WfMS). Each WfMS has its own format to represent provenance (metadata that describes the history of the generated data), and stores it at a different granularity in the form of a graph. Provenance allows scientists to analyze and evaluate results produced by a workflow. However, in more complex scenarios in which the scientist needs to analyze provenance graphs generated by multiple WfMSs and workflows, a challenge arises. To solve this problem, we propose a tool called Polyflow, which is based on the concept of Polystore systems and is able to integrate several databases of heterogeneous origin by adopting a global ProvONE schema. Polyflow allows scientists to query multiple provenance graphs in an integrated way. We evaluate Polyflow with experts using provenance data collected from real phylogenetic data analysis workflows.

Efficient Processing of Analytical Queries Extended with Similarity Search Predicates over Images in Spark

Cristina Dutra de Aguiar Ciferri, Guilherme Muzzi da Rocha — Wed, 30 Dec 2020 00:00:00 -0300

Image data warehousing extends conventional data warehousing to also manipulate images represented by feature vectors and attributes for similarity search. A challenge that arises is the efficient processing of analytical queries extended with a similarity search predicate. These queries have a high computational cost since they require the processing of costly star join operations and distance calculations in the same setting. We consider applications that manage huge volumes of data, where the use of parallel and distributed data processing frameworks is needed. In this article, we introduce two methods to efficiently solve this challenge in Spark. BrOmnImg is based on the integration of the broadcast join and the Omni techniques for the processing of the star join operation and the distance calculations, respectively. BrOmnImgCF extends BrOmnImg by using the conventional predicate to further reduce the number of distance calculations. Compared with the closest method available in the literature, BrOmnImg reduced the time spent on query processing by up to about 65%. Compared with BrOmnImg, BrOmnImgCF improved the performance by up to about 54%.

Exploratory Analysis of Electronic Health Records using Topic Modeling

Denio, Ivair Puerari, Guilherme, Julyane — Sun, 14 Feb 2021 00:00:00 -0300

The rapid growth of electronic health record (EHR) systems has increased the amount of information available about patients in hospitals. This massive amount of text information represents an opportunity to extract unknown information about medical history, medication, diseases, allergies, among others. Extracting the main topics that represent the subjects covered by a text collection can give valuable insights. To this end, approaches for topic modeling have been used to tackle such problems of information discovery and to extract topics with thematic information. In this sense, this work presents an exploratory analysis of a collection of electronic health records from an intensive care unit (ICU). The collection is split into two sub-collections: discharged patients and patients who progressed to death. We apply an LDA-based approach to discover the latent topics in the collections. The analyses show that some topics are more recurrent in the death collection, like renal diseases, and others are more recurrent in the discharge collection, like diabetes. The results of the analyses can be useful for improving intensive care services, since the topics can be a guide to understanding the patterns in discharge and death situations.

Spatial Analysis and Data Mining of Urban Trees

Gabriel de Oliveira Campos Pacheco, Clodoveu Davis — Fri, 30 Oct 2020 00:00:00 -0300

Tree coverage in urban spaces is a theme of great importance for current societies, given all the benefits that green spaces provide to the population, especially in large cities. Trees fulfill a very important role in ensuring quality of urban living and urban environmental quality, and as a result trees are considered to be an element of urban infrastructure. In spite of the recognition of the importance of tree coverage, events in which a street tree falls or needs to be preventively cut down are quite frequent, damaging property and causing disturbances in the routine of the population. From a rich dataset on urban trees for the city of Belo Horizonte (MG, Brazil), this paper proposes contributions towards the identification and solution of problems related to tree coverage, with special emphasis on felled trees. Data mining techniques are employed in search of consistent patterns, expressed as association rules or temporal sequences, that are related to felling events. We also show a VGI tool for updating and expanding the original dataset.

Weakly Supervised Learning Algorithm to Eliminate Irrelevant Association Rules in Large Knowledge Bases

Bruno, Rafael Garcia Leonel Miani — Sun, 14 Feb 2021 00:00:00 -0300

The construction and population of large knowledge bases have been widely explored in the past few years. Many techniques were developed in order to accomplish this purpose. Association rule mining algorithms can also be used to help populate these knowledge bases. Nevertheless, analyzing the large number of association rules extracted can be a hard and time-consuming task. This article therefore presents a weakly supervised association rule mining algorithm to eliminate irrelevant extracted association rules. The proposed method uses rules already discovered in past iterations and prunes those with the same pattern. Experiments showed that the new technique can reduce the number of rules by about 60%, decreasing the effort spent on evaluating them.

Editorial

Angelo Brayner; Maristela Holanda; Andre L. D. Rossi — Mon, 30 Dec 2019 00:00:00 -0300

Frontmatter

Angelo Brayner; Maristela Holanda — Mon, 30 Dec 2019 00:00:00 -0300

Spatial operations on uncertain positional data

Sun, 05 Jul 2020 00:00:00 -0300

Positional errors in spatial data affect spatial join accuracy in an unexpected and undesirable way. Furthermore, current probabilistic solutions barely achieve reasonable computational performance, unless they are employed in special cases such as when the errors follow a Circular Normal distribution. This paper presents a general framework for spatial operations that are robust to positional imprecision in geographic coordinates. The framework is designed to be i) generalist, ii) accurate, and iii) efficient. Two spatial operations are presented as case studies for the proposed framework. We developed some new procedures concerning spatial joins: an adaptation of the Monte Carlo method to be used as a probabilistic filtering step, and an efficient probabilistic alternative to Minimum Bounding Rectangles, which we call Confidence Rectangles. Empirical evidence suggests that our solution is Pareto efficient concerning these requirements, i.e., it is not outperformed by any competing solution. Moreover, the parameters of our solution corresponding to accuracy and efficiency may be adjusted to maximize the gain in one while relaxing the other according to the user's demand.
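To make the idea of a Monte Carlo probabilistic filter concrete, the sketch below estimates the probability that two uncertain points lie within a given distance of each other. It is a generic illustration, not the adaptation proposed in the paper: the isotropic Gaussian error model, the function name `within_distance_probability`, and all parameter values are assumptions made for the example.

```python
import math
import random


def within_distance_probability(p, q, sigma, max_dist, samples=2000, seed=0):
    """Estimate P(distance(p, q) <= max_dist) when both reported positions
    carry independent Gaussian noise with standard deviation sigma
    (an assumed error model, chosen only for illustration)."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    hits = 0
    for _ in range(samples):
        # Draw one plausible true position for each point.
        px, py = rng.gauss(p[0], sigma), rng.gauss(p[1], sigma)
        qx, qy = rng.gauss(q[0], sigma), rng.gauss(q[1], sigma)
        if math.hypot(px - qx, py - qy) <= max_dist:
            hits += 1
    return hits / samples


# Hypothetical coordinates: one pair far apart, one pair reported at the same place.
far_apart = within_distance_probability((0.0, 0.0), (100.0, 0.0), sigma=1.0, max_dist=5.0)
co_located = within_distance_probability((0.0, 0.0), (0.0, 0.0), sigma=1.0, max_dist=50.0)
print(far_apart, co_located)
```

A join could then keep only the pairs whose estimated probability exceeds a chosen threshold; increasing `samples` trades running time for a more precise estimate.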

World Cups Impact Analysis in the Soccer Players Transaction and Soccer Globalization using Complex Network Techniques

Mon, 30 Dec 2019 00:00:00 -0300

In this work we propose an analysis of the relationship between the World Cup and the transfer of soccer players, together with a quantitative evaluation of theories that associate globalization with this player transfer market. For this analysis, networks are generated for the periods that precede each World Cup since 1966, and the effects of the event are evaluated through the relation between the network of transfers and the best-placed teams of each edition. We also investigated sociological theories that associate globalization with the transfer network in soccer, supporting the raised hypotheses with quantitative data and updating these proposals by showing the rise of new markets, such as those from Asia. In order to carry out the analyses, complex networks and data mining techniques were combined, and this evaluation showed that countries that do many transactions do not necessarily perform well in the World Cups. However, some of the countries involved in a large number of transfers can have a good performance, finishing the event in good positions.

Feature Selection and Comparison of Classifiers for Protein Function Prediction

Mon, 30 Dec 2019 00:00:00 -0300

Knowing the function of proteins is essential in several areas such as bioinformatics, agriculture, and others. Laboratory procedures for determining protein function are costly and time-consuming. Therefore, it is necessary to provide efficient computational models that aim to find the function of a protein. There are currently several research efforts that deal with the problem of protein function prediction. However, each of them presents a different methodology, employing different classifiers as well. Based on this problem, we propose a methodology using a multi-objective genetic algorithm with the k-NN classifier to select the best characteristics and then apply several classifiers, such as Artificial Neural Network (ANN), Support Vector Machine (SVM), Random Forest, and k-NN, in order to compare their performance under the same methodology. In our methodology, the Random Forest classifier achieved the best performance, with an F-Measure of 75.47%.

Investigating the Relation Between Companies with Topological Analysis of a Network of Stock Exchange in Brazil

Mon, 30 Dec 2019 00:00:00 -0300

B3 (Brasil, Bolsa, Balcão) is the official stock exchange in Brazil and plays a key role in the world financial market. A stock exchange allows people and companies to relate through shareholding and the purchase and sale of shares. The study of the relationship between people and companies can reveal valuable information about the operation of the stock exchange and, consequently, the financial market as a whole. In this work, the relations in B3 are modeled through a network, in which the vertices represent companies and people and the edges represent shareholdings. From the built network, several analyses are performed with the objective of understanding and characterizing the patterns found in relationships. The topology of the network is investigated from different perspectives, such as the centrality of the vertices, the organization of vertices in communities, the robustness, and the diffusion of influence.
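The network model described in this abstract (vertices for companies and people, edges for shareholdings) can be sketched with a simple degree centrality computation. The vertex names and edges below are invented for illustration, and degree centrality is only one of several centrality measures; the abstract does not say which ones the article uses.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Fraction of the other vertices each vertex is directly connected to,
    treating shareholdings as undirected edges."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical shareholdings: (shareholder, company). Names are placeholders.
shareholdings = [
    ("Person A", "Company X"),
    ("Person A", "Company Y"),
    ("Person B", "Company X"),
    ("Company X", "Company Z"),
]
centrality = degree_centrality(shareholdings)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

Storing neighbors in sets means a shareholding listed twice is counted once, which suits a sketch but would hide stake sizes; a weighted graph would be needed to keep them.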