Challenges for generative large language models in reproducing textual complexity: a study with newspaper editorials

André Luis Antonelli

doi:10.1590/1983-3652.2025.58530

Authors

André Luis Antonelli, Universidade Estadual de Maringá, Departamento de Língua Portuguesa, Maringá, PR, Brazil. https://orcid.org/0000-0002-7896-5465

Keywords:

Generative language models, Textual complexity, Discourse genres, Artificial intelligence and language, Human-AI comparative analysis

Abstract

This article evaluates how well the generative language model Sabiá-3 reproduces aspects of textual complexity in the editorial genre, taking human-written editorials as the point of reference. For this task, we used metrics from the computational tool NILC-Metrix. Our results revealed differences in four of the five metrics analyzed. The human texts showed greater complexity on the measures “type-to-token ratio” and “cross-entropy”. We argue that this result may be linked, for example, to the human capacity to select words or form lexical combinations without being limited by probabilistic parameters. The texts generated by Sabiá-3, in turn, showed greater complexity on the metrics “syllables per word” and “subordinate clauses”, possibly because, among other factors, tools of this kind are not subject to cognitive processing constraints. The only metric with no statistically significant difference was “difficult conjunctions”. We attribute this result to the closed nature of this grammatical class, which would limit variation. The study reinforces the importance of considering multiple dimensions of textual complexity when evaluating the output of generative large language models, especially for genres that demand refined linguistic command, such as the editorial.
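To make the first of these measures concrete, the following is a minimal sketch of a type-to-token ratio, assuming simple regex-based word tokenization and lowercasing. It is not the NILC-Metrix implementation, whose tokenization and normalization may differ; the function name and the two example sentences are illustrative assumptions.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word forms (types) divided by word occurrences (tokens).

    Assumption: tokens are maximal runs of word characters, compared
    case-insensitively. Real tools may tokenize differently.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0  # avoid division by zero on empty input
    return len(set(tokens)) / len(tokens)


# Hypothetical sentences: one repeats its words, the other does not.
repetitive = "o jornal critica o governo e o governo critica o jornal"
varied = "o editorial questiona decisões recentes da administração federal"

print(type_token_ratio(repetitive))  # 5 types over 11 tokens
print(type_token_ratio(varied))      # every token is a distinct type
```

A higher ratio indicates less lexical repetition, which is the sense in which the human-written editorials scored as more complex on this measure. Because the ratio tends to fall as texts get longer, comparisons are most meaningful between texts of similar length.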

Author Biography

André Luis Antonelli, Universidade Estadual de Maringá, Departamento de Língua Portuguesa, Maringá, PR, Brazil

He holds a PhD in Linguistics (2011) from the Universidade Estadual de Campinas, where he also earned his undergraduate degree in Language and Literature (Letras, 2001) and his master's degree in Linguistics (2007). He is currently an associate professor at the Universidade Estadual de Maringá. His main work addresses morphosyntactic aspects of Portuguese and other Romance languages, from both synchronic and diachronic perspectives. More recently, he has also worked on natural language processing.

References

ABONIZIO, Hugo; ALMEIDA, Thales Sales; LAITZ, Thiago; MALAQUIAS JUNIOR, Roseval; BONÁS, Giovana Kerche; NOGUEIRA, Rodrigo; PIRES, Ramon. Sabiá-3 technical report. [S. l.: s. n.], 2025. arXiv: 2410.12049. Available at: https://arxiv.org/abs/2410.12049.

ALMEIDA, Erica; CALLOU, Dinah. Sobre o uso variável do subjuntivo em português: um estudo de tendência. In: BRITO, Ana Maria; SILVA, Maria de Fátima Henriques da; VELOSO, João; FIÉIS, Alexandra (ed.). XXV Encontro Nacional da Associação Portuguesa de Linguística: Textos seleccionados. Porto: APL, 2010. p. 143–152.

BAKHTIN, Mikhail. Estética da criação verbal. São Paulo: Martins Fontes, 1997.

BIBER, Douglas; CONRAD, Susan. Register, genre, and style. Cambridge: Cambridge University Press, 2009.

FRANCESCHELLI, Giorgio; MUSOLESI, Mirco. On the creativity of large language models. AI & SOCIETY, p. 1–11, 2024.

FRANTZ, Roger; STARR, Laura; BAILEY, Alison. Syntactic complexity as an aspect of text complexity. Educational Researcher, v. 44, n. 7, p. 387–393, 2015.

GIBSON, Edward. The dependency locality theory: a distance-based theory of linguistic complexity. In: MARANTZ, Alec; MIYASHITA, Yasushi; O’NEIL, Wayne (ed.). Image, Language, Brain: Papers from the First Mind Articulation Project Symposium. Cambridge, MA: The MIT Press, 2000. p. 95–126.

GULORDAVA, Kristina; BOJANOWSKI, Piotr; GRAVE, Edouard; LINZEN, Tal; BARONI, Marco. Colorless green recurrent networks dream hierarchically. [S. l.: s. n.], 2018. arXiv: 1803.11138. Available at: https://arxiv.org/abs/1803.11138.

HOLTZMAN, Ari; BUYS, Jan; DU, Li; FORBES, Maxwell; CHOI, Yejin. The curious case of neural text degeneration. [S. l.: s. n.], 2020. arXiv: 1904.09751. Available at: https://arxiv.org/abs/1904.09751.

LEAL, Sidney Evaldo; DURAN, Magali Sanches; SCARTON, Carolina Evaristo; HARTMANN, Nathan Siegle; ALUÍSIO, Sandra Maria. NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese. Language Resources and Evaluation, v. 58, n. 1, p. 73–110, 2024.

LEVELT, Willem. Speaking: from intention to articulation. Cambridge, MA: The MIT Press, 1989.

MCNAMARA, Danielle; LOUWERSE, Max; GRAESSER, Arthur. Coh-Metrix: automated cohesion and coherence scores to predict text readability and facilitate comprehension. [S. l.: s. n.], 2002. Technical report, Institute for Intelligent Systems, University of Memphis, TN.

NAIK, Dishita; NAIK, Ishita; NAIK, Nitin. Applications of AI chatbots based on generative AI, large language models and large multimodal models. In: NAIK, Nitin; JENKINS, Paul; PRAJAPAT, Shaligram; GRACE, Paul (ed.). Contributions Presented at The International Conference on Computing, Communication, Cybersecurity and AI, July 3–4, 2024, London, UK. Cham: Springer Nature Switzerland, 2024. p. 668–690.

PIANTADOSI, Steven T. Zipf’s word frequency law in natural language: a critical review and future directions. Psychonomic Bulletin & Review, v. 21, p. 1112–1130, 2014.

RADFORD, Alec; WU, Jeffrey; CHILD, Rewon; LUAN, David; AMODEI, Dario; SUTSKEVER, Ilya. Language models are unsupervised multitask learners. OpenAI blog, v. 1, n. 8, p. 1–9, 2019.

REISENBICHLER, Martin; REUTTERER, Thomas; SCHWEIDEL, David A.; DAN, Daniel. Frontiers: supporting content marketing with natural language generation. Marketing Science, v. 41, n. 3, p. 441–452, 2022.

R CORE TEAM. R: A Language and Environment for Statistical Computing. Vienna, Austria, 2021. Available at: https://www.R-project.org/.

TOUVRON, Hugo; LAVRIL, Thibaut; IZACARD, Gautier; MARTINET, Xavier; LACHAUX, Marie-Anne; LACROIX, Timothée; ROZIÈRE, Baptiste; GOYAL, Naman; HAMBRO, Eric; AZHAR, Faisal; RODRIGUEZ, Aurelien; JOULIN, Armand; GRAVE, Edouard; LAMPLE, Guillaume. LLaMA: open and efficient foundation language models. [S. l.: s. n.], 2023. arXiv: 2302.13971. Available at: https://arxiv.org/abs/2302.13971.

VASWANI, Ashish; SHAZEER, Noam; PARMAR, Niki; USZKOREIT, Jakob; JONES, Llion; GOMEZ, Aidan; KAISER, Lukasz; POLOSUKHIN, Illia. Attention is all you need. In: LUXBURG, Ulrike von; GUYON, Isabelle; BENGIO, Samy; WALLACH, Hanna; FERGUS, Rob; VISHWANATHAN, S.; GARNETT, Roman (ed.). Advances in Neural Information Processing Systems 30. Red Hook: Curran Associates, Inc., 2018. p. 5999–6009.

VIEIRA, Rosaura Maria Marques. O editorial de jornal. In: DELL’ISOLA, Regina Lúcia Péret (ed.). Nos domínios dos gêneros textuais. Belo Horizonte: FALE/UFMG, 2009. v. 2. p. 15–20.

WELLECK, Sean; KULIKOV, Ilia; ROLLER, Stephen; DINAN, Emily; CHO, Kyunghyun; WESTON, Jason. Neural text generation with unlikelihood training. [S. l.: s. n.], 2019. arXiv: 1908.04319. Available at: https://arxiv.org/abs/1908.04319.