Item creation and judging

ChatGPT as designer and judge

DOI:

https://doi.org/10.1590/1983-3652.2024.51222

Keywords:

Artificial Intelligence, Educational assessment, ChatGPT, Item design, Judging Process

Abstract

The purpose of this study was to evaluate the effectiveness of artificial intelligence (AI), represented by ChatGPT 4.0, compared to human designers in creating items for a higher-education entrance exam in the area of Written Language. A mixed-methods approach was used, combining classical and contemporary methodologies in educational evaluation, including expert judgment. ChatGPT and four human designers developed 84 items, following Anderson and Krathwohl's Taxonomy to establish the level of cognitive demand. The items were evaluated by two human judges and ChatGPT, using a detailed rubric covering clarity, neutrality, format, curricular alignment, and writing. The results showed a high rate of acceptance without changes for both the ChatGPT and the human items, indicating good alignment with the evaluation standards. However, the two groups differed in the proportion of items for which the judges proposed minor or major revisions under the rubric. The study concludes that both AI and human designers are capable of generating high-quality items, highlighting the potential of AI in the design of educational items.

References

AMERICAN EDUCATIONAL RESEARCH ASSOCIATION; AMERICAN PSYCHOLOGICAL ASSOCIATION and NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION. Standards for Educational and Psychological Testing. [S. l.]: American Educational Research Association, 2014.

ANDERSON, L. W. and KRATHWOHL, D. (ed.). A Taxonomy for Learning, Teaching and Assessing: a Revision of Bloom's Taxonomy of Educational Objectives. [S. l.]: Longman, 2001.

BLOOM, B. S. Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc, 1956.

CHAPELLE, C. A. Argument-based validation in testing and assessment. [S. l.]: SAGE Publications, 2021.

CHOMSKY, N.; ROBERTS, I. and WATUMULL, J. Noam Chomsky: The False Promise of ChatGPT. The New York Times, March 2023. Available at: https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html.

DENZIN, N. K. The Research Act: A Theoretical Introduction to Sociological Methods. [S. l.]: McGraw-Hill, 1978.

DIMITRIADOU, E. y LANITIS, A. A critical evaluation, challenges, and future perspectives of using artificial intelligence and emerging technologies in smart classrooms. Smart Learning Environments, v. 10, n. 12, 2023. DOI: 10.1186/s40561-023-00231-3.

DOWNING, S. M. Validity: On the meaningful interpretation of assessment data. Medical Education, v. 37, n. 9, p. 830-837, 2003. DOI: 10.1046/j.1365-2923.2003.01594.x.

FEUERRIEGEL, S. et al. Generative AI. Bus Inf Syst Eng, v. 66, p. 111-126, 2024. DOI: 10.1007/s12599-023-00834-7.

FIELD, A. Discovering statistics using IBM SPSS statistics. 4th. [S. l.]: Sage, 2013.

GALICIA ALARCÓN, L. A. et al. Validez de contenido por juicio de expertos: propuesta de una herramienta virtual. Apertura, v. 9, n. 2, p. 42-53, 2017. DOI: 10.32870/Ap.v9n2.993.

HALADYNA, T. M. Developing and Validating Multiple-choice Test Items. [S. l.]: Lawrence Erlbaum Associates, 2004.

HALADYNA, T. M.; DOWNING, S. M. and RODRÍGUEZ, M. C. A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, v. 15, n. 3, p. 309-333, 2002. DOI: 10.1207/S15324818AME1503_5.

HAYES, A. F. and KRIPPENDORFF, K. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, v. 1, n. 1, p. 77-89, 2007.

HOSSEINI, M.; RASMUSSEN, L. M. y RESNIK, D. B. Using AI to write scholarly publications. Accountability in Research, p. 1-9, 2023. DOI: 10.1080/08989621.2023.2168535.

HOWELL, D. C. Statistical methods for psychology. Wadsworth, NY: Cengage Learning, 2012.

KANE, M. T. Current Concerns in Validity Theory. Journal of Educational Measurement, v. 38, n. 4, p. 319-342, 2001. DOI: 10.1111/j.1745-3984.2001.tb01130.x.

KANE, M. T. Validating the interpretations and Uses of Test Scores. Journal of Educational Measurement, v. 50, n. 1, p. 1-73, 2013. DOI: 10.1111/jedm.12000.

LÓPEZ, A. T. Análisis de Rasch para todos. Una guía simplificada para evaluadores educativos. [S. l.]: Instituto de Evaluación e Ingeniería Avanzada, 1998. ISBN 9709225103.

LYNN, M. R. Determination and Quantification of Content Validity. Nursing Research, v. 35, n. 6, p. 382-385, 1986.

MCHUGH, M. L. Interrater reliability: the kappa statistic. Biochemia Medica, v. 22, n. 3, p. 276-282, 2012.

MESSICK, S. Validity. In: Educational Measurement. Edited by R. L. Linn. 3rd ed. [S. l.]: American Council on Education/Macmillan, 1989. p. 13-103.

NASUTION, N. E. A. Using artificial intelligence to create biology multiple choice questions for higher education. Agricultural and Environmental Education, v. 2, n. 1, em002, 2023. DOI: 10.29333/agrenvedu/13071.

NITKO, A. J. and BROOKHART, S. M. Educational Assessment of Students. Boston, MA: Pearson, 2011.

OPENAI. ChatGPT (March 14 version) [Large Language Model]. 2023.

POPHAM, W. J. Educational Evaluation. Boston, MA: Allyn and Bacon, 1990.

RAUBER, M. F. et al. Reliability and validity of an automated model for assessing the learning of machine learning in middle and high school: Experiences from the “ML for All!” course. Informatics in Education, v. 00, n. 00, 2024. DOI: 10.15388/infedu.2024.10.

RUIZ MENDOZA, K. K. El uso del ChatGPT 4.0 para la elaboración de exámenes: crear el prompt adecuado. LATAM Revista Latinoamericana de Ciencias Sociales y Humanidades, v. 4, n. 2, p. 6142-6157, 2023. DOI: 10.56712/latam.v4i2.1040.

SADIKU, M. N. O. et al. Artificial Intelligence in Education. International Journal of Scientific Advances, v. 2, n. 1, 2021.

STIGGINS, R. J. Student-involved classroom assessment. [S. l.]: Prentice Hall, 2001.

TLILI, A. et al. What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learning Environments, v. 10, n. 15, 2023. DOI: 10.1186/s40561-023-00237-x.

YELL, M. M. Social studies, ChatGPT, and lateral reading. Social Education, v. 87, n. 3, p. 138-141, 2023.

Published

2024-05-31

How to Cite

Item creation and judging: ChatGPT as designer and judge. Texto Livre, Belo Horizonte-MG, v. 17, p. e51222, 2024. DOI: 10.1590/1983-3652.2024.51222. Available at: https://periodicos.ufmg.br/index.php/textolivre/article/view/51222. Accessed: 19 Dec. 2024.
