Trends and Challenges in Data Cleaning for Large-Scale Systems: A Survey

Authors

DOI:

https://doi.org/10.63412/53kczv76

Keywords:

Automation in Data Cleaning, Cloud-Based Data Cleaning, Data Heterogeneity, Data Provenance, Explainable AI, Federated Data Cleaning, IoT Data Cleaning, Machine Learning in Data Cleaning, Metrics for Data Quality, Privacy-Preserving Data Cleaning, Real-Time Data Cleaning, Resource-Efficient Algorithms, Scalability, Standardized Benchmarks, Trends in Data Cleaning, Universal Metrics.

Abstract

Data cleaning is critical to maintaining the integrity and usability of large-scale systems that process massive, diverse, and dynamic datasets. As data ecosystems grow in scale and complexity, traditional cleaning techniques struggle with data heterogeneity, real-time processing demands, and resource constraints. This paper surveys recent trends and persistent challenges in data cleaning for large-scale systems. It examines advances in automated and AI-driven methods, distributed and cloud-based cleaning frameworks, and real-time error detection techniques for streaming data. The survey also highlights domain-specific cleaning approaches in sectors such as healthcare and finance, where data quality significantly affects decision-making and operational efficiency. Key challenges, including scalability bottlenecks, the lack of standardized benchmarks, and ethical considerations, are discussed in detail. Finally, the paper identifies open research directions, such as explainable AI in data cleaning, the development of universal metrics, and sustainable algorithms for resource-efficient processing. By synthesizing recent developments and emphasizing their role in improving decision-making, system performance, and user experience, this survey aims to guide researchers and practitioners toward innovative solutions for enhancing data quality in large-scale systems.

Author Biography

  • Praveen Kumar Myakala, Independent Researcher

    Praveen Kumar Myakala is a dedicated lifelong learner with a passion for innovation and education. He holds a Master's in Data Science from the University of Colorado Boulder, has published influential research, and seeks to inspire through writing and mentoring. Committed to transformative learning and creative problem-solving, he emphasizes balancing professional growth with personal fulfillment while crafting meaningful solutions.

Published

2025-05-07

How to Cite

[1] S. Kamatala, A. K. Jonnalagadda, and P. K. Myakala, “Trends and Challenges in Data Cleaning for Large-Scale Systems: A Survey,” IJGIS, vol. 2, no. 2, May 2025, doi: 10.63412/53kczv76.