Representation of structured data of the text genre as a technique for automatic text processing

Claudia Aparecida Fonseca; Marcus Vinícius Carvalho Guelpeli; Rafael Santiago de Souza Netto

doi:10.35699/1983-3652.2022.35445

Authors

Claudia Aparecida Fonseca Universidade Federal dos Vales do Jequitinhonha e Mucuri, Departamento de Letras, Diamantina, MG, Brasil https://orcid.org/0000-0003-1945-0872
Marcus Vinícius Carvalho Guelpeli Universidade Federal dos Vales do Jequitinhonha e Mucuri, Departamento de sistema de informação, Diamantina, MG, Brasil https://orcid.org/0000-0001-5724-1081
Rafael Santiago de Souza Netto Centro Universitário de Barra Mansa, Departamento de Ciência da Computação, Barra Mansa, RJ, Brasil https://orcid.org/0000-0003-1231-5387

Keywords:

Corpus linguistics, Natural language processing, Scientific article, Text genre, Corpora annotation

Abstract

This article is situated in the fields of Natural Language Processing (NLP) and Language Studies and is based on a corpus compiled with computational tools. It starts from the assumption that it is useful to relate corpus generation and annotation closely to the analysis of the constitutive elements of the source text genre. It aims to demonstrate, through specific studies of structured data from the text genre ‘scientific article’, alternative techniques for automatic text processing. To reach this goal, the authors created a computational model for compiling CorpACE, a specialized linguistic corpus representative of the scientific article genre. The object of study comprises the constitutive elements of scientific articles, marked up in XML and collected from the SciELO (Scientific Electronic Library Online) database. The final product is a database of information extracted and structured in XML format, which names and identifies the markup of the genre under analysis and can be used by many tools and applications. The results show how representing the constitutive elements of the genre can condense the available information through hierarchical and dynamic processes built during compilation. The study concludes that more research is needed to bring Language Science and Computer Science closer together, with emphasis on NLP, in the attempt to represent and manipulate linguistic knowledge at its many levels (morphological, syntactic, semantic and discursive) and so improve the implementation and manipulation of automatic text processing.
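As a rough illustration of what extracting the constitutive elements of a scientific article from XML markup can look like, the sketch below parses a small hand-written document with Python's standard library. It is not the authors' CorpACE pipeline. The sample document and the function name `extract_elements` are hypothetical, and the tag names (`article-title`, `abstract`, `kwd`, `sec`) are assumed to follow the JATS-style vocabulary on which the SciELO Publishing Schema is based.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; real SciELO XML is far richer.
SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group><article-title>A sample title</article-title></title-group>
      <abstract><p>A short abstract.</p></abstract>
      <kwd-group><kwd>Corpus linguistics</kwd><kwd>Text genre</kwd></kwd-group>
    </article-meta>
  </front>
  <body>
    <sec sec-type="intro"><title>Introduction</title><p>First paragraph.</p></sec>
    <sec sec-type="methods"><title>Methods</title><p>Second paragraph.</p></sec>
  </body>
</article>"""


def paragraph_text(parent, path):
    """Join the text of every paragraph matched by `path` under `parent`."""
    return " ".join("".join(p.itertext()).strip() for p in parent.findall(path))


def extract_elements(xml_text):
    """Return the genre elements of one article as a plain dictionary."""
    root = ET.fromstring(xml_text)
    sections = {}
    # Only top-level sections of the body are collected in this sketch.
    for sec in root.findall("./body/sec"):
        heading = sec.findtext("title", default="")
        sections[heading] = paragraph_text(sec, "p")
    return {
        "title": root.findtext(".//article-title", default=""),
        "abstract": paragraph_text(root, ".//abstract/p"),
        "keywords": [k.text.strip() for k in root.findall(".//kwd") if k.text],
        "sections": sections,
    }


record = extract_elements(SAMPLE)
print(record["title"], record["keywords"], list(record["sections"]))
```

Returning a flat dictionary per article keeps each genre element (title, abstract, keywords, sections) separately addressable, which is the kind of structure that downstream tools such as summarizers or taggers could consume.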
