Research Article

Impact of High Data Quality on LLM Hallucinations

by Ankush Ramprakash Gautam
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 4
Year of Publication: 2025
Authors: Ankush Ramprakash Gautam
10.5120/ijca2025924909

Ankush Ramprakash Gautam. Impact of High Data Quality on LLM Hallucinations. International Journal of Computer Applications 187, 4 (May 2025), 35-39. DOI=10.5120/ijca2025924909

@article{10.5120/ijca2025924909,
author = {Ankush Ramprakash Gautam},
title = {Impact of High Data Quality on LLM Hallucinations},
journal = {International Journal of Computer Applications},
issue_date = {May 2025},
volume = {187},
number = {4},
month = {May},
year = {2025},
issn = {0975-8887},
pages = {35-39},
numpages = {5},
url = {https://ijcaonline.org/archives/volume187/number4/impact-of-high-data-quality-on-llm-hallucinations/},
doi = {10.5120/ijca2025924909},
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Ankush Ramprakash Gautam
%T Impact of High Data Quality on LLM Hallucinations
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 4
%P 35-39
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

[1] Large Language Models (LLMs) have shown remarkable efficacy in natural language understanding and generation, but they are prone to hallucinations, in which the model produces content that is factually incorrect. The quality of the training and reference data is crucial in reducing these hallucinations. This paper investigates the effect of data quality on LLM hallucination rates and examines how structured, accurate, and contextually relevant datasets affect model reliability. Through empirical evaluation, we study how varying levels of data noise, incompleteness, and bias influence the frequency of hallucinations in state-of-the-art LLM architectures. We also discuss potential mitigations, including better dataset curation, data augmentation, and [2] Reinforcement Learning from Human Feedback (RLHF), to improve model factuality. Our results underscore the need for strict data governance and high-quality data pipelines in the creation of reliable AI models: by improving data quality, we can decrease the occurrence of hallucinations and thus improve the reliability of LLMs for practical applications in areas such as healthcare, finance, and law.
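
As a concrete illustration of the noise-injection protocol described above, the following minimal sketch corrupts a controlled fraction of a toy QA reference set and estimates the resulting hallucination rate against clean ground truth. It is illustrative only and makes no claim about the paper's actual experimental code: `QA_PAIRS`, `corrupt`, and the stub model are hypothetical stand-ins, and `generate` can be any callable wrapping an LLM.

```python
# Minimal, illustrative sketch of the noise-vs-hallucination protocol in the
# abstract. All names (QA_PAIRS, corrupt, the stub model) are hypothetical
# stand-ins, not the paper's actual experimental code.
import random

# Toy ground-truth QA pairs standing in for a curated evaluation set.
QA_PAIRS = [
    ("What is the capital of France?", "Paris"),
    ("How many bits are in a byte?", "eight"),
    ("Which planet is closest to the Sun?", "Mercury"),
]

def corrupt(answer: str) -> str:
    """Simulate a low-quality record by replacing the answer with junk."""
    return "[corrupted: " + answer[::-1] + "]"

def make_noisy_dataset(pairs, noise_level: float, seed: int = 0):
    """Copy `pairs`, corrupting roughly `noise_level` of the answers."""
    rng = random.Random(seed)
    return [(q, corrupt(a) if rng.random() < noise_level else a)
            for q, a in pairs]

def hallucination_rate(generate, pairs) -> float:
    """Fraction of answers from `generate` that miss the clean ground truth.

    `generate` is any callable mapping a question string to an answer string,
    e.g. a thin wrapper around an LLM API (an assumption, not the paper's setup).
    """
    wrong = sum(1 for q, a in pairs if a.lower() not in generate(q).lower())
    return wrong / len(pairs)

if __name__ == "__main__":
    # The stub "memorizes" a (possibly noisy) training set, standing in for a
    # model fine-tuned on data of varying quality.
    for noise in (0.0, 0.5, 1.0):
        lookup = dict(make_noisy_dataset(QA_PAIRS, noise))
        rate = hallucination_rate(lambda q: lookup.get(q, ""), QA_PAIRS)
        print(f"noise={noise:.2f} -> hallucination rate={rate:.2f}")
```

In a real evaluation, the stub would be replaced by a model fine-tuned on each noisy variant and scored on a held-out factuality benchmark such as TruthfulQA [13].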

References
  1. Large Language Models (LLMs) [Online] - https://en.wikipedia.org/wiki/Large_language_model
  2. Reinforcement Learning from Human Feedback (RLHF) [Online] - https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback
  3. Islam, Saad Obaid ul, Anne Lauscher, and Goran Glavaš. "How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild." arXiv preprint arXiv:2502.12769, 2025.
  4. Chen, Hao, et al. "On the Diversity of Synthetic Data and its Impact on Training Large Language Models." arXiv preprint arXiv:2410.15226, 2024.
  5. Dahl, Matthew, et al. "Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models." Journal of Law and Artificial Intelligence, vol. 16, no. 1, 2024, pp. 64-102.
  6. Chan, Willy, et al. "Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization." arXiv preprint arXiv:2502.15795, 2025.
  7. Wettig, Alexander, et al. "QuRating: Selecting High-Quality Data for Training Language Models." arXiv preprint arXiv:2402.09739, 2024.
  8. Sun, Jianwei, et al. "Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse." arXiv preprint arXiv:2403.09167, 2024.
  9. Bagheri Nezhad, Sina, Ameeta Agrawal, and Rhitabrat Pokharel. "Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models." arXiv preprint arXiv:2412.12500, 2024.
  10. Nahar, Mahjabin, et al. "Fakes of Varying Shades: How Warning Affects Human Perception and Engagement Regarding LLM Hallucinations." arXiv preprint arXiv:2404.03745, 2024.
  11. Li, Johnny, et al. "Banishing LLM Hallucinations Requires Rethinking Generalization." arXiv preprint arXiv:2406.17642, 2024.
  12. Thelwall, Mike. "Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs." Journal of Data and Information Science, vol. 10, no. 1, 2025, pp. 1-15.
  13. Lin, Stephanie, et al. "TruthfulQA: Measuring How Models Mimic Human Falsehoods." arXiv preprint arXiv:2109.07958, 2021.
  14. Wettig, Alexander, et al. "QuRating: Selecting High-Quality Data for Training Language Models." arXiv preprint arXiv:2402.09739, 2024.
Index Terms

Computer Science
Information Sciences
Large Language Models
Hallucinations
Data Quality
Artificial Intelligence
Reliability
Machine Learning
Data Governance

Keywords

Large Language Models, Hallucinations, Data Quality, Artificial Intelligence, Reliability, Machine Learning, Data Governance