Multimodality

Cognitive Approaches and Computational Representation

Authors

DOI:

https://doi.org/10.35699/2317-2096.2025.58837

Keywords:

Multimodality, computer vision, annotated datasets, frame semantics

Abstract

This article introduces the concept of multimodality, discussed from two main perspectives: the metatheoretical, which understands multimodality as a field of inquiry into the production of meaning through multiple semiotic forms; and the phenomenological, which views it as the integration of different expressive modalities (speech, gesture, image, among others) in communicative practices. Based on this conceptual foundation, the text highlights the historical lack of attention to multimodality in the fields of Linguistics and Computer Science, reflected in theoretical and computational models that favor isolated and conventionalized linguistic forms. In response to these challenges, the article presents projects developed within the scope of ReINVenTA, a research network dedicated to the construction and annotation of multimodal datasets based on Frame Semantics, aiming to integrate cognitive linguistics and computational models. The conclusion emphasizes the need for interdisciplinary approaches that recognize language as a social, interactional, and inherently multimodal phenomenon.

References

ADAM, J. M. Textos: tipos e protótipos. Translated by Monica Cavalcante. São Paulo: Editora Contexto, 2018.

ADAMI, E.; KRESS, G. Introduction: multimodality, meaning making, and the issue of “text”. Text & Talk, [S. l.], v. 34, n. 3, p. 231-237, 2014.

BATEMAN, J. A.; WILDEFEUER, J.; HIIPPALA, T. Multimodality: Foundations, Research and Analysis – A Problem-Oriented Introduction. Berlin: De Gruyter Mouton, 2017.

BAVELAS, J. B. Face-to-Face Dialogue: Theory, Research, and Applications. Oxford: Oxford University Press, 2022.

BELCAVELLO, F. et al. Frame2: A FrameNet-Based Multimodal Dataset for Tackling Text-image Interactions in Video. In: JOINT INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS, LANGUAGE RESOURCES AND EVALUATION (LREC-COLING 2024), 2024, Torino. Proceedings […]. Torino: European Language Resources Association (ELRA)/ ICCL, 2024. p. 7429-7437.

BOMMASANI, R. et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. DOI: https://doi.org/10.48550/arXiv.2108.07258.

CAFFAGNI, D.; COCCHI, F.; BARSELLOTTI, L.; MORATELLI, N.; SARTO, S.; BARALDI, L.; CORNIA, M.; CUCCHIARA, R. The Revolution of Multimodal Large Language Models: A Survey. In: FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024. Findings of the Association for Computational Linguistics: ACL 2024. Bangkok: Association for Computational Linguistics, 2024. p. 13590-13618.

CHAFE, W. Creativity on Verbalization as Evidence for Analogic Knowledge. In: DEPARTMENT OF LINGUISTICS. Proceedings TNLAP’75, p. 144-145, 1975. Available at: https://aclanthology.org/T75-2029.pdf. Accessed: 30 Apr. 2025.

COHN, N.; SCHILPEROORD, J. A Multimodal Language Faculty: A Cognitive Framework for Human Communication. London: Bloomsbury Academic, 2024.

CROFT, W. The Origins of Grammar in the Verbalization of Experience. Cognitive Linguistics, v. 18, n. 3, p. 339-382, 2007.

CROFT, William; CRUSE, D. Alan. Cognitive Linguistics. Cambridge: Cambridge University Press, 2004.

CZULO, O.; ZIEM, A.; TORRENT, T. T. Beyond Lexical Semantics: Notes on Pragmatic Frames. In: LREC INTERNATIONAL FRAMENET WORKSHOP. Proceedings […]. Marseille: ELRA, 2020. p. 1-7.

DANNÉLLS, D.; TORRENT, T. T.; SIGILIANO, N. S.; DOBNIK, S. Beyond Strings of Characters: Resources Meet NLP – Again. In: VOLODINA, E.; DANNÉLLS, D.; BERDICEVSKIS, A.; FORSBERG, M.; VIRK, S. (ed.). Live and Learn: Festschrift in Honor of Lars Borin. Gothenburg: Institutionen för Svenska, Flerspråkighet och Språkteknologi – Göteborgs Universitet, 2022. p. 29-36.

DEVLIN, J.; CHANG, M-W.; LEE, K.; TOUTANOVA, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL). Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, v. 1 (Long and Short Papers), 2019. p. 4171-4186.

DORNELAS, L. G.; GAMONAL, M. A.; PAGANO, A. S. Semantic analysis of audio description in short films: a multimodal approach based on Frame Semantics. Domínios de Lingu@gem, 1866, e1801, p. 1-30, 2024.

ENFIELD, Nick. The Anatomy of Meaning: Speech, Gesture, and Compositionality. Cambridge: Cambridge University Press, 2009.

ENGLE, R. A. Not channels but composite signals: speech, gesture, diagrams and object demonstrations are integrated in multimodal explanations. In: GERNSBACHER, M. A.; DERRY, S. J. (org.). Proceedings of the Twentieth Annual Conference of the Cognitive Science Society. Mahwah, NJ: Lawrence Erlbaum Associates, 1998. p. 321-326.

FILLMORE, C. J. Frame Semantics. In: LINGUISTIC SOCIETY OF KOREA (ed.). Linguistics in the Morning Calm. Seoul: Hanshin Publishing, 1982. p. 111-137.

FILLMORE, C. J.; BAKER, C. F. A frames approach to semantic analysis. In: HEINE, B.; NARROG, H. (org.). The Oxford Handbook of Linguistic Analysis. Oxford: Oxford University Press, 2009. p. 313-340.

GRICE, H. P. Logic and conversation. In: COLE, P.; MORGAN, J. (ed.). Syntax and Semantics, Volume 3. New York: Academic Press, 1975. p. 41-58.

JEWITT, C. Multimodal approaches. In: NORRIS, S.; MAIER, C. D. (org.). Interactions, Images and Texts: A Reader in Multimodality. Berlin; München; Boston: De Gruyter Mouton, 2014. p. 127-136.

KOCKELMAN, P. The semiotic stance. Semiotica, Berlin, v. 2005, n. 157, p. 233-304, 2005.

KUZNETSOVA, A.; ROM, H.; ALLDRIN, N.; UIJLINGS, J.; KRASIN, I.; PONT-TUSET, J.; KAMALI, S.; POPOV, S.; MALLOCI, M.; KOLESNIKOV, A.; DUERIG, T.; FERRARI, V. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, v. 128, n. 7, p. 1956-1981, 2020.

LI, L. H.; YATSKAR, M.; YIN, D.; HSIEH, C-H.; CHANG, K-W. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv preprint arXiv:1908.03557, 2019. DOI: https://doi.org/10.48550/arXiv.1908.03557.

LINELL, P. The Written Language Bias in Linguistics: its Nature, Origins and Transformations. London: Routledge, 2005.

LINELL, P. The Written Language Bias (WLB) in linguistics 40 years after. Language Sciences, [S. l.], v. 76, p. 101-109, Jun. 2019.

OPENAI. ChatGPT: comida a chute explicada. Available at: https://chatgpt.com/share/6806cc07-d5c8-8000-b47e-c30029bc8849. Accessed: 21 Apr. 2025.

ROJO, A. Applying Frame Semantics to Translation: A Practical Example. Meta, 47(3), p. 312-350. 2002.

ROMBACH, R.; BLATTMANN, A.; LORENZ, D.; ESSER, P.; OMMER, B. High-resolution Image Synthesis with Latent Diffusion Models. In: CONFERENCE: 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR). Proceedings [...]. [S. l.], 2022. p. 10684-10695.

SALOMÃO, M. M. M. Gramática das construções: a questão da integração entre sintaxe e léxico. Veredas, n. 6, v. 1, p. 63-74. 2002.

SALOMÃO, M. M. M. Teorias da linguagem: a perspectiva sociocognitiva. In: MIRANDA, N. S.; SALOMÃO, M. M. M. (ed.). Construções do português do Brasil: da gramática ao discurso. Belo Horizonte: Editora UFMG, 2009. p. 20-32.

TOMASELLO, M. Origins of Human Communication. Cambridge, Mass.: MIT Press, 2008.

TORRENT, T. T.; MATOS, E. E. D. S.; COSTA, A. D. D.; GAMONAL, M. A.; PERON-CORRÊA, S.; PAIVA, V. M. R. L. A Flexible Tool for a Qualia-enriched FrameNet: the FrameNet Brasil WebTool. Language Resources and Evaluation, p. 1-29. 2024.

VIDIRIANO, M. et al. Framed Multi30k: A Frame-based Multimodal Multilingual Dataset. In: THE 2024 JOINT INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS, LANGUAGE RESOURCES AND EVALUATION (LREC-COLING 2024). Proceedings [...]. [S. l.] p. 7438-7449, 2024.

Published

2025-04-30

How to Cite

Multimodality: Cognitive Approaches and Computational Representation. (2025). Caligrama: Revista De Estudos Românicos, 30(1), 8-22. https://doi.org/10.35699/2317-2096.2025.58837