Research Article

Generative AI-Powered Framework for Scalable and Real-Time Data Quality Management in Databricks

by Maria Anurag Reddy Basani
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 80
Year of Publication: 2025
Authors: Maria Anurag Reddy Basani
10.5120/ijca2025924727

Maria Anurag Reddy Basani. Generative AI-Powered Framework for Scalable and Real-Time Data Quality Management in Databricks. International Journal of Computer Applications. 186, 80 (Apr 2025), 1-10. DOI=10.5120/ijca2025924727

@article{ 10.5120/ijca2025924727,
author = { Maria Anurag Reddy Basani },
title = { Generative AI-Powered Framework for Scalable and Real-Time Data Quality Management in Databricks },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2025 },
volume = { 186 },
number = { 80 },
month = { Apr },
year = { 2025 },
issn = { 0975-8887 },
pages = { 1-10 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume186/number80/generative-ai-powered-framework-for-scalable-and-real-time-data-quality-management-in-databricks/ },
doi = { 10.5120/ijca2025924727 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2025-04-26T02:19:35.269645+05:30
%A Maria Anurag Reddy Basani
%T Generative AI-Powered Framework for Scalable and Real-Time Data Quality Management in Databricks
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 80
%P 1-10
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

In today’s data-driven landscape, high data quality is essential for accurate analysis and informed decision-making. Traditional data quality management methods are often labor-intensive and difficult to scale, and they struggle to handle the vast and complex datasets prevalent in modern organizations. This paper proposes a novel framework that integrates Generative AI within the Databricks platform to enhance data quality management across four critical dimensions: accuracy, consistency, completeness, and timeliness. Leveraging the scalable infrastructure of Databricks, the framework employs Generative AI to automatically detect and correct data anomalies, impute missing values, and generate validation rules from natural language commands, significantly reducing the need for manual intervention. Extensive experiments compare the proposed approach with industry-standard data quality tools, including Ataccama ONE, Informatica Data Quality, IBM InfoSphere QualityStage, Talend Data Quality, and Soda SQL. Results demonstrate substantial improvements in data quality metrics, with the framework achieving up to 9.41% higher accuracy, 9.09% better timeliness, and a 7.78% increase in completeness over baseline scores. Additionally, the system’s ability to operate in real time, coupled with its seamless integration with Databricks, makes it a powerful, adaptive, and cost-effective solution for large-scale, dynamic data environments. This research provides insight into the capabilities of Generative AI in data quality management and sets the stage for future advances in automated data integrity solutions.
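
To make the quality dimensions and operations named in the abstract concrete, the following is a minimal Python sketch of a completeness score, missing-value imputation, and anomaly flagging. It is not the paper's Generative AI framework and does not use any Databricks API: the function names (`completeness`, `impute_median`, `flag_anomalies`), the sample `readings`, and the threshold `k` are all hypothetical, and simple median-based rules stand in for the model-driven steps the abstract describes.

```python
from statistics import median


def completeness(values):
    """Fraction of entries that are not missing (None)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def impute_median(values):
    """Replace missing entries with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn a fill value from
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_anomalies(values, k=3.0):
    """Flag values more than k median absolute deviations from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)  # no spread, so nothing stands out
    return [abs(v - med) / mad > k for v in values]


# Hypothetical column with one missing value and one obvious outlier.
readings = [10.0, 11.0, None, 9.0, 250.0, 10.0]
filled = impute_median(readings)

print("completeness:", completeness(readings))
print("imputed:", filled)
print("anomalies:", flag_anomalies(filled))
```

The median and median absolute deviation are used here only because a single extreme value does not distort them much; a production system would choose its own rules or, as the abstract proposes, generate them automatically.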

Index Terms

Computer Science
Information Sciences

Keywords

Data Quality Management, Generative AI, Databricks, Real-Time Data Processing, Automated Data Cleansing, Anomaly Detection, Data Imputation