Research Article

Error Analysis of BERT model for Chatbot using various Performance Measures

by Bertilla Fernandes, Snehalata B. Shirude
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 116
Year of Publication: 2026
Authors: Bertilla Fernandes, Snehalata B. Shirude
10.5120/ijca037ed09d2626

Bertilla Fernandes, Snehalata B. Shirude. Error Analysis of BERT model for Chatbot using various Performance Measures. International Journal of Computer Applications. 187, 116 (Jun 2026), 18-23. DOI=10.5120/ijca037ed09d2626

@article{ 10.5120/ijca037ed09d2626,
author = { Bertilla Fernandes and Snehalata B. Shirude },
title = { Error Analysis of BERT model for Chatbot using various Performance Measures },
journal = { International Journal of Computer Applications },
issue_date = { Jun 2026 },
volume = { 187 },
number = { 116 },
month = { Jun },
year = { 2026 },
issn = { 0975-8887 },
pages = { 18-23 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number116/error-analysis-of-bert-model-for-chatbot-using-various-performance-measures/ },
doi = { 10.5120/ijca037ed09d2626 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2026-06-25T12:52:25.819100+05:30
%A Bertilla Fernandes
%A Snehalata B. Shirude
%T Error Analysis of BERT model for Chatbot using various Performance Measures
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 116
%P 18-23
%D 2026
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Conversational Agents (CAs) offer many opportunities to make interaction with information and computer systems more natural and accessible. These agents can nevertheless fall short of human expectations and "fail" their users. BERT, developed by Google, is a notable advance in natural language processing (NLP), with impressive results on a variety of tasks, including chatbots. BERT models are designed to capture the intricate contextual relationships among the words in a statement. The evaluation metrics of the question answering task can be used to assess the factuality of large language models (LLMs). This study explains the evaluation measures used for error analysis of the BERT transformer model in conversational agents and details the strengths and limitations of these measures for chatbot response generation. The impact of six different types of conversational errors is systematically analyzed. Several variants of the BERT model are evaluated, and the evaluation measures are analyzed in detail on a Python FAQ dataset that includes the question, answer, and context. The analysis finds that BERTScore agrees more closely with human judgments and yields better model selection performance than existing metrics. Finally, the paper concludes with a discussion of the strengths and limitations of the various metrics for error analysis of conversational agents.
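
To make the BERTScore idea concrete: the metric compares a generated answer with a reference answer by greedily matching token embeddings on cosine similarity, then combining the resulting precision and recall into an F1 value. The sketch below is illustrative only and is not the authors' implementation. The function names (`cosine`, `greedy_match_score`) are hypothetical, the two-dimensional vectors are toy stand-ins for contextual BERT embeddings, and refinements such as idf weighting and baseline rescaling are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) using BERTScore-style greedy matching.

    Each argument is a list of token embeddings (one vector per token).
    In practice these would come from a BERT encoder; here they are
    assumed to be precomputed.
    """
    if not candidate_vecs or not reference_vecs:
        raise ValueError("both token lists must be non-empty")

    # Precision: every candidate token is paired with its most similar
    # reference token, and the best similarities are averaged.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)

    # Recall: every reference token is paired with its most similar
    # candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)

    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy example: the candidate covers only one of two reference tokens.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0]]
p, r, f1 = greedy_match_score(candidate, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case the single candidate token matches one reference token exactly, so precision is high while recall is penalized for the reference token that has no counterpart. Because matching works on embedding similarity instead of exact word overlap, a paraphrased answer can still score well, which is one reason such a metric may track human judgments more closely than n-gram measures.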

Index Terms

Computer Science
Information Sciences

Keywords

Conversational-agent based error analysis, chatbots, education, error analysis, response generation, large language models