Research Article

Hybrid Technique for Data Cleaning

Published on June 2014 by Ashwini M. Save, Seema Kolkur
National Conference on Role of Engineers in National Building
Foundation of Computer Science USA
NCRENB - Number 1
June 2014
Authors: Ashwini M. Save, Seema Kolkur

Ashwini M. Save, Seema Kolkur. Hybrid Technique for Data Cleaning. National Conference on Role of Engineers in National Building. NCRENB, 1 (June 2014), 4-8.

@article{
author = { Ashwini M. Save, Seema Kolkur },
title = { Hybrid Technique for Data Cleaning },
journal = { National Conference on Role of Engineers in National Building },
issue_date = { June 2014 },
volume = { NCRENB },
number = { 1 },
month = { June },
year = { 2014 },
issn = { 0975-8887 },
pages = { 4-8 },
numpages = 5,
url = { /proceedings/ncrenb/number1/16965-1402/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 National Conference on Role of Engineers in National Building
%A Ashwini M. Save
%A Seema Kolkur
%T Hybrid Technique for Data Cleaning
%J National Conference on Role of Engineers in National Building
%@ 0975-8887
%V NCRENB
%N 1
%P 4-8
%D 2014
%I International Journal of Computer Applications
Abstract

A data warehouse contains a large volume of data, and data quality is an important issue in data warehousing projects because many business decision processes rely on the data stored in the warehouse. Improving data quality is therefore necessary for accurate decisions. Data may contain text errors, quantitative errors, or duplicate records, and there are several ways to remove such errors and inconsistencies. Data cleaning is the process of detecting and correcting inaccurate data. Algorithms such as the Improved PNRS algorithm, the Quantitative algorithm, and the Transitive algorithm are used for data cleaning. This paper attempts to clean the data in a data warehouse by combining these approaches: text data is cleaned by the Improved PNRS algorithm, quantitative data is cleaned by special rules (the Enhanced technique), and duplicate records are removed by the Transitive closure algorithm. Applying these algorithms one after another to a data set increases the accuracy of the data set.
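The three-stage idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: a plain dictionary lookup with Levenshtein edit distance stands in for the Improved PNRS step, a single range rule stands in for the Enhanced technique, and a union-find structure computes the transitive closure of pairwise matches. All function names, field names (`name`, `age`, `phone`, `email`), thresholds, and the sample matching rule are assumptions chosen for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def clean_text(value: str, dictionary: list[str], max_dist: int = 2) -> str:
    """Stage 1 (stand-in for Improved PNRS): snap a value to its closest
    dictionary entry when the edit distance is at most max_dist."""
    best = min(dictionary, key=lambda w: edit_distance(value.lower(), w.lower()))
    close_enough = edit_distance(value.lower(), best.lower()) <= max_dist
    return best if close_enough else value


def clean_quantity(value, low, high):
    """Stage 2 (stand-in for the rule-based step): values outside
    [low, high] are treated as errors and replaced by None."""
    return value if value is not None and low <= value <= high else None


def group_duplicates(n: int, pairs: list[tuple[int, int]]) -> list[list[int]]:
    """Stage 3: transitive closure of pairwise matches using union-find."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups: dict[int, list[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def hybrid_clean(records, name_dictionary, age_range=(0, 120)):
    """Run the three stages one after another and keep one record per group."""
    cleaned = [
        {**r,
         "name": clean_text(r["name"], name_dictionary),
         "age": clean_quantity(r["age"], *age_range)}
        for r in records
    ]
    # Hypothetical matching rule: two records match if they share a phone or an email.
    pairs = [
        (i, j) for i, j in combinations(range(len(cleaned)), 2)
        if cleaned[i]["phone"] == cleaned[j]["phone"]
        or cleaned[i]["email"] == cleaned[j]["email"]
    ]
    return [cleaned[g[0]] for g in group_duplicates(len(cleaned), pairs)]


records = [
    {"name": "Jhon", "age": 34, "phone": "111", "email": "j@x.org"},
    {"name": "John", "age": 340, "phone": "111", "email": "john@x.org"},
    {"name": "John", "age": 34, "phone": "999", "email": "john@x.org"},
    {"name": "Mary", "age": 29, "phone": "555", "email": "m@x.org"},
]
result = hybrid_clean(records, ["John", "Mary"])
```

In this sample, records 0 and 1 share a phone number and records 1 and 2 share an email address, so the transitive closure places all three in one group even though records 0 and 2 have no field in common. Keeping the first record of each group is a simplification; a real pipeline would need a rule for merging the surviving fields.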

Index Terms

Computer Science
Information Sciences

Keywords

Data Cleaning, PNRS, Improved PNRS, Enhanced Technique, Transitive