Topic modeling

Summarize and organize data corpus using machine learning algorithms

Authors

  • Marcos de Souza, Universidade Federal de Minas Gerais
  • Renato Rocha Souza, Universidade Federal de Minas Gerais

Keywords:

Topic modeling, Machine learning, Latent Dirichlet allocation, Latent semantic indexing

Abstract

This research compares the results and performance of two machine learning models, Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), when applied to topic modeling of documents from formal channels of scientific communication. The corpus consists of 2006 scientific articles and expanded abstracts from the XIII to the XVII National Meeting of Research in Information Science (ENANCIB). The empirical steps were data collection to build the corpus, followed by cleaning, manipulation, combination, normalization, treatment, and transformation of the corpus data so that it could be fed to the machine learning models. The models summarized and organized the corpus into topics, each made up of terms and weights. The LSI model showed greater variety among the terms and weights in each topic, unlike the LDA model, whose results were more similar to one another, thus making it easier for the domain specialist to propose names for the topics.
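
To make the idea of "topics made up of terms and weights" concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' pipeline: the toy corpus, the stopword list, the function names, and every parameter value (`alpha`, `beta`, `iters`, `seed`) are assumptions chosen for illustration. The LSI part extracts only the first latent dimension by power iteration instead of a full truncated SVD, and the LDA part uses collapsed Gibbs sampling, which is one of several possible inference methods.

```python
import math
import random
import re
from collections import Counter

# Hypothetical toy corpus standing in for the ENANCIB documents.
DOCS = [
    "Information retrieval and indexing of library catalog records",
    "Library catalog indexing improves information retrieval",
    "Neural network training for machine learning models",
    "Machine learning models and neural network training data",
]
STOPWORDS = {"and", "of", "for"}  # assumed, deliberately tiny


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def lsi_first_topic(docs, iters=100):
    """First LSI dimension: top left singular vector of the term-document
    count matrix, approximated by power iteration."""
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    v = [1.0] * len(vocab)
    for _ in range(iters):
        # Project every document onto the current term vector ...
        scores = [sum(c[w] * v[i] for i, w in enumerate(vocab)) for c in counts]
        # ... then map the document scores back to term space and normalize.
        v = [sum(c[w] * s for c, s in zip(counts, scores)) for w in vocab]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return sorted(zip(vocab, v), key=lambda kv: -abs(kv[1]))


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """LDA fitted by collapsed Gibbs sampling; returns one list of
    (term, probability) pairs per topic, sorted by probability."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    def bump(d, w, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            bump(d, w, z, +1)
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, w, assign[d][i], -1)  # remove the token's current topic
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + len(vocab) * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                bump(d, w, z, +1)

    topics = []
    for k in range(num_topics):
        denom = topic_total[k] + len(vocab) * beta
        dist = [(w, (topic_word[k][w] + beta) / denom) for w in vocab]
        topics.append(sorted(dist, key=lambda kv: -kv[1]))
    return topics


tokens = [preprocess(d) for d in DOCS]
lsi_topic = lsi_first_topic(tokens)
lda_topics = lda_gibbs(tokens)

print("LSI topic 0:", [(w, round(x, 3)) for w, x in lsi_topic[:5]])
for k, topic in enumerate(lda_topics):
    print("LDA topic", k, [(w, round(p, 3)) for w, p in topic[:5]])
```

In this sketch the LDA weights of each topic form a probability distribution over the vocabulary, while the LSI weights are coordinates of a unit-length vector, so the two sets of weights are not on the same scale when the topics are compared.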

Published

2020-01-31

How to Cite

SOUZA, M. de; SOUZA, R. R. Topic modeling: Summarize and organize data corpus using machine learning algorithms. Múltiplos Olhares em Ciência da Informação, Belo Horizonte, v. 9, n. 2, 2020. Available at: https://periodicos.ufmg.br/index.php/moci/article/view/19138. Accessed: 21 Nov. 2024.