Research Article

Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments

by Augustine O. Ugbari, Clement Ndeekor, Echebiri Wobidi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 21
Year of Publication: 2025
DOI: 10.5120/ijca2025925255

Augustine O. Ugbari, Clement Ndeekor, Echebiri Wobidi. Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments. International Journal of Computer Applications. 187, 21 (Jul 2025), 32-36. DOI=10.5120/ijca2025925255

@article{ 10.5120/ijca2025925255,
author = { Augustine O. Ugbari and Clement Ndeekor and Echebiri Wobidi },
title = { Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments },
journal = { International Journal of Computer Applications },
issue_date = { Jul 2025 },
volume = { 187 },
number = { 21 },
month = { Jul },
year = { 2025 },
issn = { 0975-8887 },
pages = { 32-36 },
numpages = { 5 },
url = { https://ijcaonline.org/archives/volume187/number21/optimizing-gpt-4-for-automated-short-answer-grading-in-educational-assessments/ },
doi = { 10.5120/ijca2025925255 },
publisher = { Foundation of Computer Science (FCS), NY, USA },
address = { New York, USA }
}
%0 Journal Article
%A Augustine O. Ugbari
%A Clement Ndeekor
%A Echebiri Wobidi
%T Optimizing GPT-4 for Automated Short Answer Grading in Educational Assessments
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 21
%P 32-36
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Automated Short Answer Grading Systems (ASAGS) have witnessed significant advancement with the integration of large language models (LLMs), particularly GPT-4. This paper explores methodologies to optimize GPT-4 for the purpose of grading short answer questions in educational assessments. The focus is on aligning GPT-4’s natural language processing capabilities with human grading rubrics to enhance accuracy, consistency, and fairness. We examine techniques including prompt engineering, rubric-based scoring, and fine-tuning strategies. The research also assesses the model’s performance across various domains, evaluates inter-rater reliability with human graders, and addresses concerns related to bias, explainability, and scalability. This paper proposes a framework that leverages GPT-4 as a co-grader, ensuring human-in-the-loop moderation to improve educational outcomes.
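To make the rubric-based prompting idea concrete, the sketch below is a minimal illustration (not the authors' implementation) of how a question, a reference answer, and a rubric might be combined into a single grading prompt sent to GPT-4 via the OpenAI Python SDK. The rubric wording, the 0-5 scale, the grade_short_answer helper, and the sample inputs are illustrative assumptions; a human-in-the-loop workflow would treat the returned score as a co-grader suggestion to be moderated.

# Hypothetical sketch of rubric-based short-answer grading with GPT-4.
# Rubric text, scale, and helper names are assumptions, not the paper's code.
# Requires the `openai` Python package (v1.x) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Award 0-5 points: 5 = fully correct and complete, "
    "3 = partially correct, 0 = incorrect or off-topic. "
    "Justify the score in one sentence."
)

def grade_short_answer(question: str, model_answer: str, student_answer: str) -> str:
    """Ask GPT-4 to grade one student answer against a rubric and reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {model_answer}\n"
        f"Student answer: {student_answer}\n\n"
        f"Rubric: {RUBRIC}\n"
        "Respond as 'Score: <0-5>. Rationale: <one sentence>'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output helps score consistency across runs
        messages=[
            {"role": "system", "content": "You are a careful exam grader."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(grade_short_answer(
        "What does a CPU cache primarily improve?",
        "It reduces average memory access latency by keeping recently used data close to the processor.",
        "It makes memory reads faster by storing frequently used data near the CPU.",
    ))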

References
  1. Anderson, L. W., & Krathwohl, D. R. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. Longman.
  2. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
  3. Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., ... & Zhang, Y. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint. https://doi.org/10.48550/arXiv.2303.12712
  4. Burrows, S., Gurevych, I., & Stein, B. (2015). The efficacy of machine learning for automated essay grading. IEEE Transactions on Learning Technologies, 9(4), 532–544.
  5. Clark, E., Tafjord, O., & Richardson, K. B. (2021). What can large language models do with syntax?. arXiv preprint arXiv:2103.08505.
  6. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  7. Dzikovska, M., Heilman, M., Collins, A., & Core, M. (2013). BEA: A large corpus of learner essays. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 1–9).
  8. Floridi, L., Cowls, J., Beltrametti, M., Chatila, R., Chazerand, P., Dignum, V., ... & Vayena, E. (2018). AI4People—An ethical framework for a good AI society: Opportunities, risks, principles, and recommendations. Minds and Machines, 28(4), 689–707. https://doi.org/10.1007/s11023-018-9482-5
  9. Guidotti, R., Monreale, A., Rossi, F., Giannotti, F., & Pedreschi, D. (2018). A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5), 1–42.
  10. Kaggle. (2012). Automated student assessment prize (ASAP). https://www.kaggle.com/c/asap-aes
  11. Kasneci, E., Sessler, K., Küchenhoff, L., Bannert, M., Dementieva, D., Fischer, F., ... & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274
  12. Mohler, J., Bunescu, R., & Mihalcea, R. (2011). Lexical methods for measuring the semantic content similarity of text. In Proceedings of the conference on empirical methods in natural language processing (pp. 1416–1426).
  13. OpenAI. (2023). GPT-4 Technical Report.
  14. Ouyang, A., Wu, J., Jiang, X., Almeida, D., Wainwright, C. J., Sutskever, I., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
  15. Riordan, B., Xue, Z., Cruz, N., & Warschauer, M. (2017). Assessing automated scoring of student-written short answers using deep learning. Journal of Educational Data Mining, 9(1), 25–47.
  16. Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. Routledge.
  17. Sukkarieh, J. Z., & Pulman, S. G. (2005). Issues in the automated evaluation of reading comprehension exercises. In Proceedings of the ACL student research workshop (pp. 9–16).
  18. Wang, Y., Liang, N., She, D., Liu, K., Xiao, X., & Zhu, J. (2023). Large language models are few-shot graders for multi-aspect feedback. arXiv preprint arXiv:2305.10775.
  19. Zhao, Y. E., Prasad, A., Eschweiler, K. M., & Chai, J. (2023). Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
  20. Zupanc, B., & Bosnić, Z. (2015). Text similarity based on latent semantic analysis. Informatica, 39(3).
Index Terms

Computer Science
Information Sciences

Keywords

Automated Short Answer Grading Systems (ASAGS), Large Language Models (LLMs), GPT-4, Short Answer Questions (SAQs), Prompt Engineering, Rubric-Based Scoring, Few-Shot Learning, Fine-Tuning, Inter-Rater Reliability, Natural Language Processing (NLP), Chain-of-Thought Prompting, Feedback Generation.