Research Article

Impact of High Data Quality on LLM Hallucinations

by Ankush Ramprakash Gautam
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 4
Year of Publication: 2025
Authors: Ankush Ramprakash Gautam
10.5120/ijca2025924909

Ankush Ramprakash Gautam. Impact of High Data Quality on LLM Hallucinations. International Journal of Computer Applications 187, 4 (May 2025), 35-39. DOI=10.5120/ijca2025924909

@article{10.5120/ijca2025924909,
author = {Ankush Ramprakash Gautam},
title = {Impact of High Data Quality on LLM Hallucinations},
journal = {International Journal of Computer Applications},
issue_date = {May 2025},
volume = {187},
number = {4},
month = {May},
year = {2025},
issn = {0975-8887},
pages = {35-39},
numpages = {5},
url = {https://ijcaonline.org/archives/volume187/number4/impact-of-high-data-quality-on-llm-hallucinations/},
doi = {10.5120/ijca2025924909},
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Ankush Ramprakash Gautam
%T Impact of High Data Quality on LLM Hallucinations
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 4
%P 35-39
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

[1] Large Language Models (LLMs) have shown remarkable efficacy in natural language understanding and generation, but they are prone to hallucinations, in which the model produces content that is factually incorrect. The quality of the training and reference data is crucial in reducing these hallucinations. This paper investigates the effect of data quality on LLM hallucination rates and examines how structured, accurate, and contextually relevant datasets affect model reliability. Through empirical evaluation, we study how varying levels of data noise, incompleteness, and bias influence the frequency of hallucinations in state-of-the-art LLM architectures. We also discuss potential mitigations, including better dataset curation, data augmentation, and [2] Reinforcement Learning from Human Feedback (RLHF), to improve model factuality. Our results underscore the need for strict data governance and high-quality data pipelines in the creation of reliable AI models: by improving data quality, we can decrease the occurrence of hallucinations and thus improve the reliability of LLMs for practical applications in areas such as healthcare, finance, and law.
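
As a concrete illustration of the noise-injection protocol described above, the following minimal sketch corrupts a controlled fraction of a toy QA reference set and estimates the resulting hallucination rate against clean ground truth. It is illustrative only and makes no claim about the paper's actual experimental code: `QA_PAIRS`, `corrupt`, and the stub model are hypothetical stand-ins, and `generate` can be any callable wrapping an LLM.

```python
# Minimal, illustrative sketch of the noise-vs-hallucination protocol in the
# abstract. All names (QA_PAIRS, corrupt, the stub model) are hypothetical
# stand-ins, not the paper's actual experimental code.
import random

# Toy ground-truth QA pairs standing in for a curated evaluation set.
QA_PAIRS = [
    ("What is the capital of France?", "Paris"),
    ("How many bits are in a byte?", "eight"),
    ("Which planet is closest to the Sun?", "Mercury"),
]

def corrupt(answer: str) -> str:
    """Simulate a low-quality record by replacing the answer with junk."""
    return "[corrupted: " + answer[::-1] + "]"

def make_noisy_dataset(pairs, noise_level: float, seed: int = 0):
    """Copy `pairs`, corrupting roughly `noise_level` of the answers."""
    rng = random.Random(seed)
    return [(q, corrupt(a) if rng.random() < noise_level else a)
            for q, a in pairs]

def hallucination_rate(generate, pairs) -> float:
    """Fraction of answers from `generate` that miss the clean ground truth.

    `generate` is any callable mapping a question string to an answer string,
    e.g. a thin wrapper around an LLM API (an assumption, not the paper's setup).
    """
    wrong = sum(1 for q, a in pairs if a.lower() not in generate(q).lower())
    return wrong / len(pairs)

if __name__ == "__main__":
    # The stub "memorizes" a (possibly noisy) training set, standing in for a
    # model fine-tuned on data of varying quality.
    for noise in (0.0, 0.5, 1.0):
        lookup = dict(make_noisy_dataset(QA_PAIRS, noise))
        rate = hallucination_rate(lambda q: lookup.get(q, ""), QA_PAIRS)
        print(f"noise={noise:.2f} -> hallucination rate={rate:.2f}")
```

In a real evaluation, the stub would be replaced by a model fine-tuned on each noisy variant and scored on a held-out factuality benchmark such as TruthfulQA [13].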

References
  1. Large Language Models (LLMs) [Online] - https://en.wikipedia.org/wiki/Large_language_model
  2. Reinforcement Learning from Human Feedback (RLHF) [Online] - https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback
  3. Islam, Saad Obaid ul, Anne Lauscher, and Goran Glavaš. "How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild." arXiv preprint arXiv:2502.12769, 2025.
  4. Chen, Hao, et al. "On the Diversity of Synthetic Data and its Impact on Training Large Language Models." arXiv preprint arXiv:2410.15226, 2024.
  5. Dahl, Matthew, et al. "Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models." Journal of Law and Artificial Intelligence, vol. 16, no. 1, 2024, pp. 64-102.
  6. Chan, Willy, et al. "Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization." arXiv preprint arXiv:2502.15795, 2025.
  7. Wettig, Alexander, et al. "QuRating: Selecting High-Quality Data for Training Language Models." arXiv preprint arXiv:2402.09739, 2024.
  8. Sun, Jianwei, et al. "Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse." arXiv preprint arXiv:2403.09167, 2024.
  9. Bagheri Nezhad, Sina, Ameeta Agrawal, and Rhitabrat Pokharel. "Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models." arXiv preprint arXiv:2412.12500, 2024.
  10. Nahar, Mahjabin, et al. "Fakes of Varying Shades: How Warning Affects Human Perception and Engagement Regarding LLM Hallucinations." arXiv preprint arXiv:2404.03745, 2024.
  11. Li, Johnny, et al. "Banishing LLM Hallucinations Requires Rethinking Generalization." arXiv preprint arXiv:2406.17642, 2024.
  12. Thelwall, Mike. "Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs." Journal of Data and Information Science, vol. 10, no. 1, 2025, pp. 1-15.
  13. Lin, Stephanie, et al. "TruthfulQA: Measuring How Models Mimic Human Falsehoods." arXiv preprint arXiv:2109.07958, 2021.
  14. Wettig, Alexander, et al. "QuRating: Selecting High-Quality Data for Training Language Models." arXiv preprint arXiv:2402.09739, 2024.
Index Terms

Computer Science
Information Sciences
Large Language Models
Hallucinations
Data Quality
Artificial Intelligence
Reliability
Machine Learning
Data Governance

Keywords

Large Language Models, Hallucinations, Data Quality, Artificial Intelligence, Reliability, Machine Learning, Data Governance