Item creation and judging: ChatGPT as designer and judge

Karla Karina Ruiz Mendoza; Luis Horacio Pedroza Zúñiga; Alma Yadhira López García

doi:10.1590/1983-3652.2024.51222

Authors

Karla Karina Ruiz Mendoza, Universidad Autónoma de Baja California, IIDE, Ensenada, Baja California, Mexico. https://orcid.org/0000-0001-8978-8364
Luis Horacio Pedroza Zúñiga, Universidad Autónoma de Baja California, IIDE, Ensenada, Baja California, Mexico. https://orcid.org/0000-0002-5256-2967
Alma Yadhira López García, The Learning Bar, Canada. https://orcid.org/0000-0002-7474-5799

Keywords:

Artificial intelligence, Educational assessment, ChatGPT, Item design, Expert judging

Abstract

This study evaluated the effectiveness of artificial intelligence (AI), represented by ChatGPT 4.0, compared with human designers in creating items for a higher-education entrance exam in the area of Written Language. A mixed-methods approach was used, combining classical and contemporary methodologies in educational assessment, including expert judgment. ChatGPT and four human designers developed 84 items, following Anderson and Krathwohl's Taxonomy to establish the level of cognitive demand. The items were evaluated by two human judges and by ChatGPT, using a detailed rubric covering clarity, neutrality, format, curricular alignment, and wording. The results showed a high rate of acceptance without changes for both the ChatGPT items and the human-written items, indicating good alignment with the assessment standards. However, differences were observed in the need for the minor and major changes proposed by the rubric. The study concludes that both AI and human designers are capable of generating high-quality items, highlighting the potential of AI in the design of educational items.
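As a rough illustration of how rubric verdicts such as those described in the abstract could be summarized, the sketch below tallies the share of each verdict by item source. The verdict labels, the `ratings` list, and the `verdict_shares` helper are all hypothetical; the data are made up for illustration and are not the study's results.

```python
from collections import Counter

# Hypothetical verdict labels (assumed, not taken from the study's rubric).
VERDICTS = ("accept", "minor", "major")

# Made-up example data, NOT results from the study: (item source, judge verdict).
ratings = [
    ("chatgpt", "accept"),
    ("chatgpt", "accept"),
    ("chatgpt", "minor"),
    ("human", "accept"),
    ("human", "major"),
    ("human", "accept"),
]


def verdict_shares(ratings):
    """Return {source: {verdict: share}} for a list of (source, verdict) pairs."""
    counts = {}
    for source, verdict in ratings:
        counts.setdefault(source, Counter())[verdict] += 1
    shares = {}
    for source, counter in counts.items():
        total = sum(counter.values())
        # Counter returns 0 for verdicts that never occurred for this source.
        shares[source] = {v: counter[v] / total for v in VERDICTS}
    return shares


shares = verdict_shares(ratings)
for source, dist in shares.items():
    print(source, {v: round(p, 2) for v, p in dist.items()})
```

Comparing the "accept" share across sources mirrors the abstract's comparison of acceptance without changes; an agreement statistic between judges would be a separate step.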

References

AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, AMERICAN PSYCHOLOGICAL ASSOCIATION and NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION. Standards for Educational and Psychological Testing. [S. l.]: American Educational Research Association, 2014.

ANDERSON, L. W. and KRATHWOHL, D. (ed.). A Taxonomy for Learning, Teaching and Assessing: a Revision of Bloom’s Taxonomy of Educational Objectives. [S. l.]: Longman, 2001.

BLOOM, B. S. Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc, 1956.

CHAPELLE, C. A. Argument-based validation in testing and assessment. [S. l.]: SAGE Publications, 2021.

CHOMSKY, N.; ROBERTS, I. and WATUMULL, J. Noam Chomsky: The False Promise of ChatGPT. The New York Times, March 2023. Available at: https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html.

DENZIN, N. K. The Research Act: A Theoretical Introduction to Sociological Methods. [S. l.]: McGraw-Hill, 1978.

DIMITRIADOU, E. and LANITIS, A. A critical evaluation, challenges, and future perspectives of using artificial intelligence and emerging technologies in smart classrooms. Smart Learning Environments, v. 10, n. 12, 2023. DOI: 10.1186/s40561-023-00231-3.

DOWNING, S. M. Validity: On the meaningful interpretation of assessment data. Medical Education, v. 37, n. 9, p. 830-837, 2003. DOI: 10.1046/j.1365-2923.2003.01594.x.

FEUERRIEGEL, S. et al. Generative AI. Business & Information Systems Engineering, v. 66, p. 111-126, 2024. DOI: 10.1007/s12599-023-00834-7.

FIELD, A. Discovering statistics using IBM SPSS statistics. 4th ed. [S. l.]: Sage, 2013.

GALICIA ALARCÓN, Liliana Aidé et al. Validez de contenido por juicio de expertos: propuesta de una herramienta virtual. Apertura, v. 9, n. 2, p. 42-53, 2017. DOI: 10.32870/Ap.v9n2.993.

HALADYNA, T. M. Developing and Validating Multiple-choice Test Items. [S. l.]: Lawrence Erlbaum Associates, 2004.

HALADYNA, T. M.; DOWNING, S. M. and RODRÍGUEZ, M. C. A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, v. 15, n. 3, p. 309-333, 2002. DOI: 10.1207/S15324818AME1503_5.

HAYES, A. F. and KRIPPENDORFF, K. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, v. 1, n. 1, p. 77-89, 2007.

HOSSEINI, M.; RASMUSSEN, L. M. and RESNIK, D. B. Using AI to write scholarly publications. Accountability in Research, p. 1-9, 2023. DOI: 10.1080/08989621.2023.2168535.

HOWELL, D. C. Statistical methods for psychology. Wadsworth, NY: Cengage Learning, 2012.

KANE, M. T. Current Concerns in Validity Theory. Journal of Educational Measurement, v. 38, n. 4, p. 319-342, 2001. DOI: 10.1111/j.1745-3984.2001.tb01130.x.

KANE, M. T. Validating the interpretations and Uses of Test Scores. Journal of Educational Measurement, v. 50, n. 1, p. 1-73, 2013. DOI: 10.1111/jedm.12000.

LÓPEZ, A. T. Análisis de Rasch para todos. Una guía Simplificada para evaluadores educativos. [S. l.]: Instituto de Evaluación e Ingeniería Avanzada, 1998. ISBN 9709225103.

LYNN, M. R. Determination and Quantification of Content Validity. Nursing Research, v. 35, n. 6, p. 382-385, 1986.

MCHUGH, M. L. Interrater reliability: the kappa statistic. Biochemia Medica, v. 22, n. 3, p. 276-282, 2012.

MESSICK, S. Validity. In: Educational Measurement. Edited by R. L. Linn. 3rd ed. [S. l.]: American Council on Education/Macmillan, 1989. p. 13-103.

NASUTION, N. E. A. Using artificial intelligence to create biology multiple choice questions for higher education. Agricultural and Environmental Education, v. 2, n. 1, em002, 2023. DOI: 10.29333/agrenvedu/13071.

NITKO, A. J. and BROOKHART, S. M. Educational Assessment of Students. Boston, MA: Pearson, 2011.

OPENAI. ChatGPT (March 14 version) [Large language model]. 2023.

POPHAM, W. J. Educational Evaluation. Boston, MA: Allyn and Bacon, 1990.

RAUBER, M. F. et al. Reliability and validity of an automated model for assessing the learning of machine learning in middle and high school: Experiences from the “ML for All!” course. Informatics in Education, v. 00, n. 00, 2024. DOI: 10.15388/infedu.2024.10.

RUIZ MENDOZA, K. K. El uso del ChatGPT 4.0 para la elaboración de exámenes: crear el prompt adecuado. LATAM Revista Latinoamericana de Ciencias Sociales y Humanidades, v. 4, n. 2, p. 6142-6157, 2023. DOI: 10.56712/latam.v4i2.1040.

SADIKU, M. N. O. et al. Artificial Intelligence in Education. International Journal of Scientific Advances, v. 2, n. 1, 2021.

STIGGINS, R. J. Student-involved classroom assessment. [S. l.]: Prentice Hall, 2001.

TLILI, A. et al. What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learning Environments, v. 10, n. 15, 2023. DOI: 10.1186/s40561-023-00237-x.

YELL, M. M. Social studies, ChatGPT, and lateral reading. Social Education, v. 87, n. 3, p. 138-141, 2023.