DANTEStocks

A Multi-Layered Annotated Corpus of Stock Market Tweets for Brazilian Portuguese

Keywords:

Brazilian Portuguese, stock market, Twitter, multi-layer annotation

Abstract

This paper introduces DANTEStocks, a pioneering multi-layer annotated corpus intended to foster research in Natural Language Processing (NLP) of user-generated content in Portuguese. The corpus comprises 4,048 tweets (X posts) from the stock market domain written in Brazilian Portuguese and carries three standoff annotation layers: emotion categories (following Plutchik’s wheel of emotions); part-of-speech tags and dependency relations (under the Universal Dependencies framework); and named entities (according to the Second HAREM taxonomy). DANTEStocks was built within the POeTiSA Project, which aims to increase the amount of linguistic resources (lingwares) and to foster the development of tools and NLP applications for Brazilian Portuguese. In this article, we describe the design of the annotation tasks carried out on DANTEStocks and report the annotation results available so far.

References

BIRD, S.; KLEIN, E.; LOPER, E. Natural Language Toolkit – NLTK: TweetTokenizer. [S. l.]: NLTK Project, 2013. Available at: https://www.nltk.org/api/nltk.tokenize.casual.html. Accessed on: June 21, 2021.

BRUM, H. B.; NUNES, M. G. V. Building a Sentiment Corpus of Tweets in Brazilian Portuguese. In: INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 11., 2018, Miyazaki. Proceedings […]. Miyazaki: ELRA, 2018. p. 4167-4172.

CORREA JUNIOR, E. et al. PELESent: Cross-Domain Polarity Classification Using Distant Supervision. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS, 6., 2017, Uberlândia. Proceedings […]. Uberlândia: SBC, 2017. p. 49-54.

DAMERAU, F. J. A Technique for Computer Detection and Correction of Spelling Errors. Communications of the ACM, v. 7, n. 3, p. 171-176, 1964.

DERCZYNSKI, L. et al. Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition. In: WORKSHOP ON NOISY USER-GENERATED TEXT, 3., 2017, Copenhagen. Proceedings […]. Copenhagen: ACL, 2017. p. 140-147.

DERCZYNSKI, L.; BONTCHEVA, K.; ROBERTS, I. Broad Twitter Corpus: A Diverse Named Entity Recognition Resource. In: INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS, 26., 2016, Osaka. Proceedings […]. Osaka: ACL, 2016. p. 1169-1179.

DEVEIKYTE, J. et al. A Sentiment Analysis Approach to the Prediction of Market Volatility. Frontiers in Artificial Intelligence, v. 5, p. 1-10, 2022.

DI-FELIPPO, A. et al. Descrição preliminar do corpus DANTEStocks: Diretrizes de segmentação para anotação segundo Universal Dependencies. In: WORKSHOP ON PORTUGUESE DESCRIPTION, 7., 2021, Online. Proceedings […]. Porto Alegre: SBC, 2021. p. 335-343.

DI-FELIPPO, A. et al. Diretrizes de anotação de PoS Tags em tweets do mercado financeiro: orientações para anotação em Língua Portuguesa segundo a abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 438. São Carlos, SP: ICMC/USP, 2022. 24 p.

DI-FELIPPO, A.; NUNES, M. G. V.; BARBOSA, B. K. S. A Dependency Treebank of Tweets in Brazilian Portuguese: Syntactic Annotation Issues and Approach. In: SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY, 15., 2024, Belém. Proceedings […]. Porto Alegre: SBC, 2024. p. 192-201.

DI-FELIPPO, A.; NUNES, M. G. V.; BARBOSA, B. K. S. Diretrizes de anotação de relações de dependência em tweets do mercado financeiro. Relatório Técnico do ICMC 446. São Carlos, SP: ICMC/USP, 2024. 70 p.

DURAN, M. S. et al. The Dawn of the Porttinari Multigenre Treebank: Introducing Its Journalistic Portion. In: SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY, 14., 2023, Belo Horizonte. Proceedings […]. Porto Alegre: SBC, 2023. p. 115-124.

DURAN, M. S. Manual de Anotação de PoS tags: orientações para anotação de etiquetas morfossintáticas em Língua Portuguesa, seguindo as diretrizes da abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 434. São Carlos, SP: ICMC/USP, 2021. 55 p.

DURAN, M. S. Manual de Anotação de Relações de Dependência - versão revisada e estendida: orientações para anotação de relações de dependência sintática em Língua Portuguesa, seguindo as diretrizes da abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 440. São Carlos, SP: ICMC/USP, 2022. 166 p.

FREITAS, C.; MOTA, C.; SANTOS, D.; OLIVEIRA, H. G.; CARVALHO, P. Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese. In: INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 7., 2010, Valletta. Proceedings […]. Valletta: ELRA, 2010. p. 3630-3637.

GIMENES, P. A.; ROMAN, N. T.; CARVALHO, A. M. B. R. Spelling Error Patterns in Brazilian Portuguese. Computational Linguistics, v. 41, n. 1, p. 175-183, 2015. ISSN 0891-2017.

ISLAM, J.; MERCER, R. E.; XIAO, L. Multi-Channel Convolutional Neural Network for Twitter Emotion and Sentiment Recognition. In: NAACL-HLT, 17., 2019, Minneapolis. Proceedings […]. Minneapolis: ACL, 2019. p. 1355-1365.

JURAFSKY, D.; MARTIN, J. H. Speech and Language Processing. 3. ed. (draft). Available at: https://web.stanford.edu/~jurafsky/slp3/. Accessed on: Apr. 13, 2024.

KRIPPENDORFF, K. Reliability in Content Analysis: Some Common Misconceptions. Human Communication Research, v. 30, p. 411-433, 2004.

KRUMM, J.; DAVIES, N.; NARAYANASWAMI, C. User-Generated Content. IEEE Pervasive Computing, v. 7, p. 10-11, 2009. DOI: 10.1109/MPRV.2008.85.

LIU, Y. et al. Parsing Tweets into Universal Dependencies. In: NAACL-HLT, 16., 2018, New Orleans. Proceedings […]. New Orleans: ACL, 2018. p. 965-975.

LOPES, L. et al. PortiLexicon-UD: A Portuguese Lexical Resource According to Universal Dependencies Model. In: LANGUAGE RESOURCES AND EVALUATION CONFERENCE, 13., 2022, Marseille. Proceedings […]. Marseille: ELRA, 2022. p. 6635-6643.

LOPES, L.; PARDO, T. A. S. Towards Portparser – A Highly Accurate Parsing System for Brazilian Portuguese Following the Universal Dependencies Framework. In: INTERNATIONAL CONFERENCE ON COMPUTATIONAL PROCESSING OF PORTUGUESE, 16., 2024, Santiago de Compostela. Proceedings […]. Santiago de Compostela: ACL, 2024. p. 401-410.

MARNEFFE, M-C. de et al. Universal Dependencies. Computational Linguistics, v. 47, n. 2, p. 255-308, 2021. ISSN 1530-9312.

MIRANDA, L. G. M.; PARDO, T. A. S. An Improved and Extended Annotation Tool for Universal Dependencies-Based Treebank Construction. In: PROPOR DEMONSTRATIONS WORKSHOP, 2022, Fortaleza. Proceedings […]. Fortaleza: ACL, 2022. p. 1-3.

MOHAMMAD, S.; BRAVO-MARQUEZ, F. WASSA-2017 Shared Task on Emotion Intensity. In: WORKSHOP ON COMPUTATIONAL APPROACHES TO SUBJECTIVITY, SENTIMENT AND SOCIAL MEDIA ANALYSIS, 8., 2017, Copenhagen. Proceedings […]. Copenhagen: ACL, 2017. p. 34-49.

MOHAMMAD, S. M. et al. Sentiment, Emotion, Purpose, and Style in Electoral Tweets. Information Processing and Management, v. 51, n. 4, p. 480-499, 2015.

MORAES, S. M. W. et al. 7x1-PT: um corpus extraído do Twitter para análise de sentimentos em língua portuguesa. In: SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY, 10., 2015, Natal. Proceedings […]. Porto Alegre: SBC, 2015. p. 21-25.

MOTA, C.; SANTOS, D. (ed.). Desafios na avaliação conjunta do reconhecimento de entidades mencionadas: O Segundo HAREM. [S. l.]: Linguateca, 2008. Available at: http://www.linguateca.pt/LivroSegundoHAREM/. Accessed on: May 20, 2024. ISBN: 978-989-20-1656-6.

NIVRE, J. et al. Universal Dependencies V2: An Evergrowing Multilingual Treebank Collection. In: LANGUAGE RESOURCES AND EVALUATION CONFERENCE, 12., 2020, Marseille. Proceedings […]. Marseille: ELRA, 2020. p. 4034-4043.

NIVRE, J. Towards a Universal Grammar for Natural Language Processing. In: INTERNATIONAL CONFERENCE ON INTELLIGENT TEXT PROCESSING AND COMPUTATIONAL LINGUISTICS, 16., 2015, Cairo. Proceedings […]. Cairo: Springer, 2015. p. 3-16.

PARDO, T. A. S. et al. Porttinari – A Large Multi-Genre Treebank for Brazilian Portuguese. In: SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY, 13., 2021, Online. Proceedings […]. Porto Alegre: SBC, 2021. p. 1-10.

PERES, R.; ESTEVES, D.; MAHESHWARI, G. Bidirectional LSTM with a Context Input Window for Named Entity Recognition in Tweets. In: KNOWLEDGE CAPTURE CONFERENCE, 9., 2017, Austin. Proceedings […]. New York: Association for Computing Machinery, 2017. p. 1-4.

PLUTCHIK, R.; KELLERMAN, H. (ed.). Emotion: Theory, Research and Experience. New York: Acad. Press, 1986.

PLUTCHIK Wheel. In: WIKIPEDIA. [S. l.], 2011. Available at: https://en.m.wikipedia.org/wiki/File:Plutchik-wheel.svg. Accessed on: May 25, 2024.

POETISA. Resources and Tools. [S. l.], [ca. 2021]. Available at: https://sites.google.com/icmc.usp.br/poetisa/resources-and-tools?authuser=0. Accessed on: May 15, 2024.

QI, P. et al. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In: ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (SYSTEM DEMONSTRATIONS), 58., 2020, Online. Proceedings […]. [S. l.]: ACL, 2020. p. 101-108.

RIJHWANI, S.; PREOTIUC-PIETRO, D. Temporally-Informed Analysis of Named Entity Recognition. In: ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 58., 2020, Online. Proceedings […]. [S. l.]: ACL, 2020. p. 7605-7617.

ROBERTS, K. et al. EmpaTweet: Annotating and Detecting Emotions on Twitter. In: INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 8., 2012, Istanbul. Proceedings […]. Istanbul: ELRA, 2012. p. 3806-3813.

SANGUINETTI, M. et al. Treebanking User-Generated Content: A UD Based Overview of Guidelines, Corpora and Unified Recommendations. Language Resources and Evaluation, v. 57, n. 2, p. 493-544, 2023. ISSN 1574-020X.

SCHUFF, H. et al. Annotation, Modelling and Analysis of Fine-Grained Emotions on a Stance and Sentiment Detection Corpus. In: WASSA-2017 SHARED TASK ON EMOTION INTENSITY, 2017, Stroudsburg. Proceedings […]. Stroudsburg: ACL, 2017. p. 13-23.

SILVA, E. H. et al. Universal Dependencies for Tweets in Brazilian Portuguese: Tokenization and Part of Speech Tagging. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE, 18., 2020, Online. Proceedings […]. [S. l.]: SBC, 2020. p. 434-445.

SILVA, F. J. V. Brazilian Stock Market Tweets with Emotions. In: KAGGLE. [S. l.], 2021. Available at: https://www.kaggle.com/fernandojvdasilva/stock-tweets-ptbr-emotions/data. Accessed on: May 25, 2024.

SILVA, F. J. V.; ROMAN, N. T.; CARVALHO, A. M. B. R. Stock Market Tweets Annotated with Emotions. Corpora, v. 15, n. 3, p. 343-354, 2020. ISSN 1755-1676.

SILVA, I. S. et al. Effective Sentiment Stream Analysis with Self-Augmenting Training and Demand-Driven Projection. In: INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 34., 2011, Beijing. Proceedings […]. Beijing: ACM, 2011. p. 475-484.

SUTTLES, J.; IDE, N. Distant Supervision for Emotion Classification with Discrete Binary Values. In: GELBUKH, A. (ed.). Computational Linguistics and Intelligent Text Processing. CICLing 2013. LNCS, v. 7817. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. p. 121-136.

TESNIÈRE, L. Elements of Structural Syntax. Translated by Timothy Osborne; Sylvain Kahane. Amsterdam; Philadelphia: John Benjamins, 2015.

UNIVERSAL DEPENDENCIES. CoNLL-U Format. [S. l.], c2014-2024. Available at: https://universaldependencies.org/format.html. Accessed on: Apr. 10, 2024.

USHIO, A. et al. Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts. In: CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2., 2022, Online. Proceedings […]. [S. l.]: ACL, 2022. p. 309-319.

VOSKAKI, R.; TZIAFA, E.; IOANNIDOU, K. Description of Predicative Nouns in a Modern Greek Financial Corpus. In: INTERNATIONAL SYMPOSIUM ON THEORETICAL AND APPLIED LINGUISTICS, 21., 2016, New Orleans. Proceedings […]. New Orleans: Greek Applied Linguistics Association, 2016. p. 488-503.

ZERBINATI, M. M.; ROMAN, N. T.; DI-FELIPPO, A. A Corpus of Stock Market Tweets Annotated with Named Entities. In: INTERNATIONAL CONFERENCE ON COMPUTATIONAL PROCESSING OF PORTUGUESE, 16., 2024, Santiago de Compostela. Proceedings […]. Santiago de Compostela: ACL, 2024. p. 276-284.

Published

30-10-2025

Section

Thematic Issue - Corpus Linguistics: Studies and Applications (2025 release)