Automatic grouping of news from online newspapers using Machine Learning techniques for clustering texts in Portuguese

Authors

  • Lúcia Helena de Magalhães, Universidade Federal de Minas Gerais
  • Renato Rocha Souza, Universidade Federal de Minas Gerais

Keywords:

Grouping of news, Natural Language Processing, Machine Learning, Text analysis

Abstract

Clustering is a technique for organizing data into groups whose members are similar to one another. This research applied Natural Language Processing, Machine Learning and clustering techniques to group news items from a sample collected from the main online newspapers. The pre-processing step was found to require effort in order to guarantee the quality of the results. The complexity of the Portuguese language, the need to keep the stopword list up to date, the difficulty of detecting the most important features and the high dimensionality of the data were evident at every stage of the study. The k-means clustering algorithm obtained the best results for this type of information, whereas Hierarchical Clustering had difficulties, since similar news items were allocated to different groups. Affinity Propagation, on the other hand, disagreed on the ideal number of clusters but performed well when grouping by similarity.
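To make the kind of pipeline summarized in the abstract concrete, the following is a minimal sketch in pure Python, not the authors' implementation. It assumes a tiny hypothetical stopword list, four made-up Portuguese headlines, a smoothed TF-IDF weighting, and a cosine-similarity (spherical) variant of k-means seeded with the first k documents so the run is reproducible. All function names are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption, not from the study).
STOPWORDS = {"a", "o", "de", "do", "da", "em", "e", "para", "no", "na", "os", "as", "um", "uma"}

def tokenize(text):
    """Lowercase, keep runs of (accented) letters, and drop stopwords."""
    words = re.findall(r"[a-záàâãéêíóôõúç]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed inverse document frequency.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors

def cosine(u, v):
    """Dot product of two sparse vectors (cosine similarity when both are unit length)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def kmeans(vectors, k, iterations=10):
    """Spherical k-means; the first k documents serve as deterministic seeds."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Recompute each centroid as the normalised sum of its members.
        for c in range(k):
            total = Counter()
            for v, lab in zip(vectors, labels):
                if lab == c:
                    for t, w in v.items():
                        total[t] += w
            if total:
                norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
                centroids[c] = {t: w / norm for t, w in total.items()}
    return labels

# Made-up headlines: two about football, two about politics.
docs = [
    "O time venceu o jogo de futebol no estádio",
    "O governo aprovou a reforma no congresso",
    "Futebol: time perde jogo e técnico é demitido",
    "Congresso vota reforma proposta pelo governo",
]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)  # documents 0 and 2 share one label, documents 1 and 3 the other
```

Even at this scale the sketch shows why pre-processing matters: with an incomplete stopword list, function words such as "pelo" or "é" remain as features and add dimensions that carry no topical signal.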


Published

2020-02-03

How to Cite

MAGALHÃES, L. H. de; SOUZA, R. R. Automatic grouping of news from online newspapers using Machine Learning techniques for clustering texts in Portuguese. Múltiplos Olhares em Ciência da Informação, Belo Horizonte, v. 9, n. 2, 2020. Available at: https://periodicos.ufmg.br/index.php/moci/article/view/19170. Accessed on: 21 Nov. 2024.