GENERIC FRAMEWORK FOR AUTOMATIC SUBJECT GENERATION AND INDEXING IN A DIGITAL REPOSITORY

Authors

Keywords:

Automatic Subject Generation, Indexing, Collections, Digital Repository, Faceted Search

Abstract

This study presents a generic framework for automatic subject generation using machine learning techniques in the Annif tool, followed by the indexing of data and metadata in a digital repository so that records can be retrieved through faceted search. To this end, the framework was applied to the field of Information Science: a knowledge corpus was built from the metadata of 438 articles from the Brazilian Information Science Base (BRAPCI), and the Brazilian Thesaurus in Information Science (TBCI) served as the controlled vocabulary. The “collector” application, developed in Python, was used to download metadata and full-text files of dissertations and theses from existing collections in the Institutional Repository of the University of Brasília (RiUnB). After the model was trained with Annif, subjects were automatically generated and indexed in the Tainacan digital repository, where taxonomies were created from the controlled vocabulary. Finally, faceted searches could be configured so that users can enter labels and, at the same time, browse by selecting terms from the faceted taxonomy. The study concludes that the proposed generic framework can be applied in any area of knowledge, supporting the automatic generation of subjects, indexing in a digital repository, and the configuration of faceted taxonomies for information retrieval.
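
As a rough illustration of the last two stages described in the abstract (attaching generated subjects to records, then filtering by facet), the following Python sketch may help. It is not the authors' implementation and does not use Annif or Tainacan: the vocabulary entries, the sample records, and the functions `suggest_subjects`, `index_records`, `facet_counts` and `faceted_search` are all hypothetical, and a naive substring match stands in for the trained model.

```python
from collections import Counter

# Hypothetical controlled vocabulary: label -> identifier (not real TBCI data).
VOCABULARY = {
    "information retrieval": "term:ir",
    "indexing": "term:idx",
    "metadata": "term:md",
}


def suggest_subjects(text, vocabulary=VOCABULARY):
    """Toy stand-in for a trained model: return vocabulary labels found in the text."""
    lowered = text.lower()
    return sorted(label for label in vocabulary if label in lowered)


def index_records(records):
    """Attach generated subjects to each record (a dict with 'title' and 'abstract')."""
    return [
        {**rec, "subjects": suggest_subjects(rec["title"] + " " + rec["abstract"])}
        for rec in records
    ]


def facet_counts(indexed):
    """Count how many records carry each subject, as a facet panel might display."""
    return Counter(s for rec in indexed for s in rec["subjects"])


def faceted_search(indexed, selected_terms=(), query=""):
    """Keep records that carry every selected facet term and contain the free-text query."""
    wanted = set(selected_terms)
    q = query.lower()
    return [
        rec for rec in indexed
        if wanted <= set(rec["subjects"])
        and q in (rec["title"] + " " + rec["abstract"]).lower()
    ]


# Invented sample records, for illustration only.
RECORDS = [
    {"title": "Automatic indexing of theses",
     "abstract": "Metadata quality and indexing practice."},
    {"title": "Search interfaces",
     "abstract": "Faceted navigation for information retrieval."},
    {"title": "Repository metadata",
     "abstract": "Harvesting metadata for information retrieval."},
]

indexed = index_records(RECORDS)
```

Calling `faceted_search(indexed, ["metadata"], "harvesting")` combines a facet selection with typed text, which mirrors the combined labeling and browsing behaviour the abstract describes; in the framework itself, the subject suggestions would come from the trained Annif model rather than from string matching.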

Author Biographies

Jean Carlos Borges Brito, Universidade de Brasília (UNB)

PhD student in Information Science at UnB; Master's degree in Knowledge Management and Information Technology from the Catholic University of Brasília (UCB, 2010); postgraduate degree in Strategic Management from Cândido Mendes University (UCAM, 2014); postgraduate degree in Project Management with an emphasis on Information Systems (FAST, 2005); Bachelor's degree in Information Systems (FACEB, 2004). He has worked in IT for 22 years, with experience in IT infrastructure, systems development and programming, telecommunications, IT audit, training, and project management. He has experience in IT governance and in strategic alignment between IT and business, holds ITIL and COBIT certifications, and has taught 11 subjects in the area of technology as a professor at a university center. He currently works as Information and Communication Technology Coordinator at the Brazilian Space Agency (AEB).

Dalton Lopes Martins, Universidade de Brasília (UNB)

Professor in the Library Science course and currently coordinator (2022-2024) of the Graduate Program in Information Science (PPGCinf) at the Faculty of Information Science (FCI) of the University of Brasília (UnB). He is also a permanent professor in the Graduate Program in Studies of the Human Condition (PPGECH) at the Federal University of São Carlos. He holds a degree in Electrical Engineering (2002) and a Master's degree in Computer Engineering (2004), both from the State University of Campinas, and a PhD in Information Sciences from ECA-USP (2009-2012), where he worked on the mapping, structural analysis and dynamics of social networks in distributed digital environments. His research covers digital objects and repositories, digital collections, interoperability strategies for information systems, linked open data, and data science and machine learning with an emphasis on the analysis of digital objects. He coordinates the Tainacan research project, free software for the social construction of digital repositories, in partnership with the Brazilian Institute of Museums (IBRAM), the state government of Espírito Santo, the National Arts Foundation (FUNARTE) and the National Institute of Historic and Artistic Heritage (IPHAN).

Published

2023-11-24

How to Cite

Borges Brito, J. C., & Martins, D. L. (2023). GENERIC FRAMEWORK FOR AUTOMATIC SUBJECT GENERATION AND INDEXING IN A DIGITAL REPOSITORY. Perspectivas Em Ciência Da Informação, 28(Fluxo Contínuo), e46629. Retrieved from https://periodicos.ufmg.br/index.php/pci/article/view/46629