Trends and Challenges in Data Cleaning for Large-Scale Systems: A Survey

Srikanth Kamatala; Anil Kumar Jonnalagadda; Praveen Kumar Myakala

doi:10.63412/53kczv76

Authors

Srikanth Kamatala Independent Researcher Author https://orcid.org/0009-0000-2375-7119
Anil Kumar Jonnalagadda Independent Researcher Author https://orcid.org/0009-0000-8207-4131
Praveen Kumar Myakala Independent Researcher Author https://orcid.org/0009-0009-6988-5592

Keywords:

Automation in Data Cleaning, Cloud-Based Data Cleaning, Data Heterogeneity, Data Provenance, Explainable AI, Federated Data Cleaning, IoT Data Cleaning, Machine Learning in Data Cleaning, Metrics for Data Quality, Privacy-Preserving Data Cleaning, Real-Time Data Cleaning, Resource-Efficient Algorithms, Scalability, Standardized Benchmarks, Trends in Data Cleaning, Universal Metrics.

Abstract

Data cleaning is a critical process to maintain the integrity and usability of large-scale systems that process massive, diverse, and dynamic datasets. As the scale and complexity of data ecosystems grow, traditional cleaning techniques face limitations in addressing challenges such as data heterogeneity, real-time processing demands, and resource constraints. This paper presents a comprehensive survey of the latest trends and persistent challenges in data cleaning for large-scale systems. It examines advancements in automated and AI-driven methods, distributed and cloud-based cleaning frameworks, and real-time error detection techniques for streaming data. Additionally, the survey highlights domain-specific cleaning approaches in sectors like healthcare and finance, where data quality significantly impacts decision-making and operational efficiency. Key challenges, including scalability bottlenecks, the lack of standardized benchmarks, and ethical considerations, are discussed in detail. Finally, the paper identifies open research directions, such as explainable AI in data cleaning, universal metrics development, and sustainable algorithms for resource-efficient processing. By synthesizing recent developments and emphasizing their role in improving decision-making, system performance, and user experience, this survey aims to guide researchers and practitioners toward innovative solutions for enhancing data quality in large-scale systems.

Author Biography

Praveen Kumar Myakala, Independent Researcher

Praveen Kumar Myakala is a dedicated lifelong learner with a passion for innovation and education. He holds a Master's in Data Science from the University of Colorado Boulder, has published influential research, and seeks to inspire through writing and mentoring. Committed to transformative learning and creative problem-solving, he emphasizes balancing professional growth with personal fulfillment while crafting meaningful solutions.
