An investigation of linguistic problems in automatic multi-document summaries

Authors

  • Márcio de Souza Dias, Universidade Federal de Goiás
  • Ariani Di Felippo, Universidade Federal de São Carlos
  • Amanda Pontes Rassi, Redação Nota 1000 Ltda
  • Paula Christina Figueira Cardoso, Universidade Federal de Lavras
  • Fernando Antônio Asevedo Nóbrega, Samsung, São Paulo
  • Thiago Alexandre Salgueiro Pardo, Universidade de São Paulo

DOI:

https://doi.org/10.17851/2237-2083.29.2.859-907

Keywords:

automatic summarization, multi-document summary, linguistic problem, corpus annotation

Abstract

Automatic summaries commonly present diverse linguistic problems that affect textual quality and thus their understanding by users. Few studies have tried to characterize such problems and their relation to the performance of summarization systems. In this paper, we investigated the problems in multi-document extracts (i.e., summaries produced by concatenating several sentences taken exactly as they appear in the source texts) generated by systems for Brazilian Portuguese that have different approaches (i.e., superficial and deep) and performances (i.e., baseline and state-of-the-art methods). To that end, we first reviewed the main characterization studies, resulting in a typology of linguistic problems more suitable for multi-document summarization. Then, we manually annotated a corpus of automatic multi-document extracts in Portuguese based on the typology, which showed that some linguistic problems are significantly more recurrent than others. Thus, this corpus annotation may support research on the detection and correction of linguistic problems for summary improvement, allowing the production of automatic summaries that are not only informative (i.e., they convey the content of the source material), but also linguistically well structured.

Published

2024-10-06

How to Cite

DIAS, M. de S.; DI FELIPPO, A.; RASSI, A. P.; CARDOSO, P. C. F.; NÓBREGA, F. A. A.; PARDO, T. A. S. An investigation of linguistic problems in automatic multi-document summaries. Revista de Estudos da Linguagem, [S. l.], v. 29, n. 2, p. 859–907, 2024. DOI: 10.17851/2237-2083.29.2.859-907. Available at: https://periodicos.ufmg.br/index.php/relin/article/view/54259. Accessed on: 22 Nov. 2024.

Issue

Thematic issue 29:2 (2021): Corpus Linguistics: Achievements and Challenges