International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 80
Year of Publication: 2025
Authors: Maria Anurag Reddy Basani
Maria Anurag Reddy Basani. Generative AI-Powered Framework for Scalable and Real-Time Data Quality Management in Databricks. International Journal of Computer Applications. 186, 80 (Apr 2025), 1-10. DOI=10.5120/ijca2025924727
In today’s data-driven landscape, high data quality is essential for accurate analysis and informed decision-making. Traditional data quality management methods are often labor-intensive and difficult to scale, and they struggle with the large, complex datasets common in modern organizations. This paper proposes a novel framework that integrates Generative AI within the Databricks platform to improve data quality management across critical dimensions, including accuracy, consistency, completeness, and timeliness. Building on the scalable infrastructure of Databricks, our solution uses Generative AI to automatically detect and correct data anomalies, impute missing values, and generate validation rules from natural language commands, significantly reducing the need for manual intervention. Extensive experiments compared the proposed approach with industry-standard data quality tools: Ataccama ONE, Informatica Data Quality, IBM InfoSphere QualityStage, Talend Data Quality, and Soda SQL. The results show substantial improvements in data quality metrics, with our framework achieving up to 9.41% higher accuracy, 9.09% better timeliness, and a 7.78% increase in completeness over baseline scores. In addition, the system operates in real time and integrates seamlessly with Databricks, which makes it an adaptive and cost-effective solution for large-scale, dynamic data environments. This research provides insight into the capabilities of Generative AI in data quality management and sets the stage for future advances in automated data integrity solutions.
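To make two of the dimensions named above concrete, the following is a minimal sketch of how completeness and timeliness scores might be computed over a batch of records. It is not the paper's implementation: it uses plain Python rather than Spark, the function names (`completeness`, `timeliness`, `impute_mean`) and record layout are hypothetical, and simple mean imputation stands in for the Generative AI imputation described in the abstract.

```python
from datetime import datetime, timedelta


def completeness(records, required_fields):
    """Fraction of required field slots that are populated (not None or empty string)."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for rec in records
        for field in required_fields
        if rec.get(field) not in (None, "")
    )
    return filled / total


def timeliness(records, ts_field, now, max_age):
    """Fraction of records whose timestamp is no older than max_age relative to now."""
    if not records:
        return 1.0
    fresh = sum(
        1
        for rec in records
        if rec.get(ts_field) is not None and now - rec[ts_field] <= max_age
    )
    return fresh / len(records)


def impute_mean(records, field):
    """Return copies of the records with missing numeric values replaced by the observed mean.

    Assumption: a deliberately simple stand-in for model-based imputation.
    """
    observed = [r[field] for r in records if r.get(field) is not None]
    if not observed:
        return [dict(r) for r in records]
    mean = sum(observed) / len(observed)
    return [{**r, field: mean if r.get(field) is None else r[field]} for r in records]


if __name__ == "__main__":
    now = datetime(2025, 4, 1)
    rows = [
        {"id": 1, "amount": 10.0, "updated": now},
        {"id": 2, "amount": None, "updated": now - timedelta(days=2)},
    ]
    print(completeness(rows, ["id", "amount"]))
    print(timeliness(rows, "updated", now, timedelta(days=1)))
    print(impute_mean(rows, "amount"))
```

Scores like these are only one possible definition of each dimension; a production pipeline on Databricks would more likely express them as DataFrame aggregations so they scale with the data.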