Research Article

Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments

by Augustine O. Ugbari, Clement Ndeekor, Echebiri Wobidi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 21
Year of Publication: 2025
DOI: 10.5120/ijca2025925255

Augustine O. Ugbari, Clement Ndeekor, Echebiri Wobidi. Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments. International Journal of Computer Applications. 187, 21 (Jul 2025), 32-36. DOI=10.5120/ijca2025925255

@article{ 10.5120/ijca2025925255,
author = { Augustine O. Ugbari and Clement Ndeekor and Echebiri Wobidi },
title = { Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments },
journal = { International Journal of Computer Applications },
issue_date = { Jul 2025 },
volume = { 187 },
number = { 21 },
month = { Jul },
year = { 2025 },
issn = { 0975-8887 },
pages = { 32-36 },
numpages = { 5 },
url = { https://ijcaonline.org/archives/volume187/number21/optimizing-gpt-4-for-automated-short-answer-grading-in-educational-assessments/ },
doi = { 10.5120/ijca2025925255 },
publisher = { Foundation of Computer Science (FCS), NY, USA },
address = { New York, USA }
}
%0 Journal Article
%A Augustine O. Ugbari
%A Clement Ndeekor
%A Echebiri Wobidi
%T Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 21
%P 32-36
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Automated Short Answer Grading Systems (ASAGS) have witnessed significant advancement with the integration of large language models (LLMs), particularly GPT-4. This paper explores methodologies to optimize GPT-4 for the purpose of grading short answer questions in educational assessments. The focus is on aligning GPT-4’s natural language processing capabilities with human grading rubrics to enhance accuracy, consistency, and fairness. We examine techniques including prompt engineering, rubric-based scoring, and fine-tuning strategies. The research also assesses the model’s performance across various domains, evaluates inter-rater reliability with human graders, and addresses concerns related to bias, explainability, and scalability. This paper proposes a framework that leverages GPT-4 as a co-grader, ensuring human-in-the-loop moderation to improve educational outcomes.
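To make the rubric-based prompting idea concrete, the sketch below is a minimal illustration (not the authors' implementation) of how a question, a reference answer, and a rubric might be combined into a single grading prompt sent to GPT-4 via the OpenAI Python SDK. The rubric wording, the 0-5 scale, the grade_short_answer helper, and the sample inputs are illustrative assumptions; a human-in-the-loop workflow would treat the returned score as a co-grader suggestion to be moderated.

# Hypothetical sketch of rubric-based short-answer grading with GPT-4.
# Rubric text, scale, and helper names are assumptions, not the paper's code.
# Requires the `openai` Python package (v1.x) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Award 0-5 points: 5 = fully correct and complete, "
    "3 = partially correct, 0 = incorrect or off-topic. "
    "Justify the score in one sentence."
)

def grade_short_answer(question: str, model_answer: str, student_answer: str) -> str:
    """Ask GPT-4 to grade one student answer against a rubric and reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {model_answer}\n"
        f"Student answer: {student_answer}\n\n"
        f"Rubric: {RUBRIC}\n"
        "Respond as 'Score: <0-5>. Rationale: <one sentence>'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output helps score consistency across runs
        messages=[
            {"role": "system", "content": "You are a careful exam grader."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(grade_short_answer(
        "What does a CPU cache primarily improve?",
        "It reduces average memory access latency by keeping recently used data close to the processor.",
        "It makes memory reads faster by storing frequently used data near the CPU.",
    ))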

References
  1. Anderson, L. W., & Krathwohl, D. R. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. Longman.
  2. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
  3. Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., ... & Zhang, Y. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint. https://doi.org/10.48550/arXiv.2303.12712
  4. Burrows, S., Gurevych, I., & Stein, B. (2015). The efficacy of machine learning for automated essay grading. IEEE Transactions on Learning Technologies, 9(4), 532–544.
  5. Clark, E., Tafjord, O., & Richardson, K. B. (2021). What can large language models do with syntax?. arXiv preprint arXiv:2103.08505.
  6. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  7. Dzikovska, M., Heilman, M., Collins, A., & Core, M. (2013). BEA: A large corpus of learner essays. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 1–9).
  8. Floridi, L., Cowls, J., Beltrametti, M., Chatila, R., Chazerand, P., Dignum, V., ... & Vayena, E. (2018). AI4People—An ethical framework for a good AI society: Opportunities, risks, principles, and recommendations. Minds and Machines, 28(4), 689–707. https://doi.org/10.1007/s11023-018-9482-5
  9. Guidotti, R., Monreale, A., Rossi, F., Giannotti, F., & Pedreschi, D. (2018). A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5), 1–42.
  10. Kaggle. (2012). Automated student assessment prize (ASAP). https://www.kaggle.com/c/asap-aes
  11. Kasneci, E., Sessler, K., Küchenhoff, L., Bannert, M., Dementieva, D., Fischer, F., ... & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274
  12. Mohler, J., Bunescu, R., & Mihalcea, R. (2011). Lexical methods for measuring the semantic content similarity of text. In Proceedings of the conference on empirical methods in natural language processing (pp. 1416–1426).
  13. OpenAI. (2023). GPT-4 Technical Report.
  14. Ouyang, A., Wu, J., Jiang, X., Almeida, D., Wainwright, C. J., Sutskever, I., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
  15. Riordan, B., Xue, Z., Cruz, N., & Warschauer, M. (2017). Assessing automated scoring of student-written short answers using deep learning. Journal of Educational Data Mining, 9(1), 25–47.
  16. Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. Routledge.
  17. Sukkarieh, J. Z., & Pulman, S. G. (2005). Issues in the automated evaluation of reading comprehension exercises. In Proceedings of the ACL student research workshop (pp. 9–16).
  18. Wang, Y., Liang, N., She, D., Liu, K., Xiao, X., & Zhu, J. (2023). Large language models are few-shot graders for multi-aspect feedback. arXiv preprint arXiv:2305.10775.
  19. Zhao, Y. E., Prasad, A., Eschweiler, K. M., & Chai, J. (2023). Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
  20. Zupanc, B., & Bosnić, Z. (2015). Text similarity based on latent semantic analysis. Informatica, 39(3).
Index Terms

Computer Science
Information Sciences

Keywords

Automated Short Answer Grading Systems (ASAGS), Large Language Models (LLMs), GPT-4, Short Answer Questions (SAQs), Prompt Engineering, Rubric-Based Scoring, Few-Shot Learning, Fine-Tuning, Inter-Rater Reliability, Natural Language Processing (NLP), Chain-of-Thought Prompting, Feedback Generation.