International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 4
Year of Publication: 2025
Authors: Ankush Ramprakash Gautam
Ankush Ramprakash Gautam. Impact of High Data Quality on LLM Hallucinations. International Journal of Computer Applications 187, 4 (May 2025), 35-39. DOI=10.5120/ijca2025924909
Large Language Models (LLMs) have shown surprising efficacy in natural language understanding and generation, but they are prone to hallucinations, in which the model generates content that is factually incorrect. The quality of the training and reference data is crucial in reducing these hallucinations. This paper investigates the effect of data quality on LLM hallucination rates and how structured, accurate, and contextually relevant datasets affect model reliability. Through empirical evaluation, we examine how varying levels of data noise, incompleteness, and bias affect the frequency of hallucinations in state-of-the-art LLM architectures. We also discuss potential mitigations, including better dataset curation, data augmentation, and Reinforcement Learning from Human Feedback (RLHF), to improve model factuality. Our results show the need for strict data governance and high-quality data pipelines in the creation of reliable AI models. By improving data quality, we can decrease the occurrence of hallucinations and thus improve the reliability of LLMs for practical applications in areas including healthcare, finance, and law.
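To make the noise-injection protocol sketched in the abstract concrete, the following minimal Python example corrupts a toy QA dataset at several noise rates and measures how often a stand-in model that simply parrots its training data produces wrong answers. The dataset, corruption scheme, and stand-in model are illustrative assumptions for exposition, not the paper's actual experimental setup or models.

```python
import random

# Toy reference QA pairs (an illustrative stand-in for a real training corpus).
CLEAN_DATA = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What year did WWII end?", "answer": "1945"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]

def corrupt(dataset, noise_rate, rng):
    """Randomly replace answers to simulate label noise in training data."""
    noisy = []
    for item in dataset:
        if rng.random() < noise_rate:
            # Swap in an answer from another record: a simple noise model.
            wrong = rng.choice([d["answer"] for d in dataset if d is not item])
            noisy.append({"question": item["question"], "answer": wrong})
        else:
            noisy.append(dict(item))
    return noisy

def hallucination_rate(model_answers, references):
    """Fraction of model outputs that contradict the reference answers."""
    wrong = sum(1 for out, ref in zip(model_answers, references)
                if out.strip().lower() != ref["answer"].strip().lower())
    return wrong / len(references)

if __name__ == "__main__":
    rng = random.Random(0)
    for rate in (0.0, 0.2, 0.5):
        noisy = corrupt(CLEAN_DATA, rate, rng)
        # Stand-in "model": parrots the (possibly corrupted) training answers.
        outputs = [item["answer"] for item in noisy]
        print(f"noise={rate:.1f} -> hallucination rate "
              f"{hallucination_rate(outputs, CLEAN_DATA):.2f}")
```

The parroting stand-in makes the dependence explicit: every corrupted training answer surfaces directly as a wrong output, so the measured hallucination rate tracks the injected noise rate. Real LLM evaluations replace the stand-in with an actual trained model and the exact-match check with a factuality judgment.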