The role of the study corpus in the descriptive improvement of multi-document informational complementarity
DOI:
https://doi.org/10.17851/2237-2083.29.2.1059-1087Keywords:
Multi-document informational Complementarity, Processing of Natural Languages, Study corpusAbstract
In sub-areas of Natural Language Processing (NLP), such as Automatic Multidocument Summarization (AMS), it is necessary to understand the linguistic behavior of certain phenomena, especially those of a semantic nature. Cross-document Structure Theory (CST) is widely used in NLP studies because it provides a set of semantic relations that organize information between units of analysis (commonly, pairs of sentences) organized between content (namely, redundancy, complementarity and contradiction) and presentation (namely, source/authorship and style). Until then, the characterization of CST relationships was based on generic attributes (such as the number of words in common between sentences of a pair) and specific attributes (such as the presence of temporal adverbs) for the relationships of Redundancy and Complementarity. However, the delimitation of such attributes is still incipient, as they do not include semantic and pragmatic attributes, linguistic levels that are possible to recover between the CST units of analysis. In this sense, the aim of this paper is to reconstruct the methodological path of Souza (2019) with regard to the study in corpus of CST relations in Portuguese journalistic texts, since the set of available attributes, until then, still produced mistakes in the identification of multi-document complementarity subtypes, namely temporal and timeless. Based on the CSTNews corpus, a subset of studies was organized with the first 10 clusters, that are represented by 204 pairs of sentences. As a result, a detailed description of CST complementarity was obtained, as well as the creation of a typology of signaling relationships that translate this phenomenon, in addition to proposing a specific methodology for the study of CST relations.