Research Article

An Efficient Duplication Record Detection Algorithm for Data Cleansing

by Arfa Skandar, Mariam Rehman, Maria Anjum
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 127 - Number 6
Year of Publication: 2015
DOI: 10.5120/ijca2015906401

Arfa Skandar, Mariam Rehman, and Maria Anjum. An Efficient Duplication Record Detection Algorithm for Data Cleansing. International Journal of Computer Applications 127, 6 (October 2015), 28-37. DOI=10.5120/ijca2015906401

@article{ 10.5120/ijca2015906401,
author = { Arfa Skandar, Mariam Rehman, Maria Anjum },
title = { An Efficient Duplication Record Detection Algorithm for Data Cleansing },
journal = { International Journal of Computer Applications },
issue_date = { October 2015 },
volume = { 127 },
number = { 6 },
month = { October },
year = { 2015 },
issn = { 0975-8887 },
pages = { 28-37 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume127/number6/22736-2015906401/ },
doi = { 10.5120/ijca2015906401 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
Abstract

The purpose of this research was to review, analyze, and compare the algorithms that fall under the empirical technique, in order to identify the most effective algorithm in terms of efficiency and accuracy. The research process began by collecting relevant research papers from the IEEE digital library using the query "duplication record detection". The papers were then categorized according to the techniques proposed in the literature, and the focus was placed on the empirical technique. The papers in this category were analyzed further to extract the underlying algorithms, and a comparison was performed to select the best one, namely DCS++. The selected algorithm was critically analyzed in order to improve its working; on the basis of its limitations, a variation of the algorithm was proposed and validated through a developed prototype. After implementing both the existing DCS++ and the proposed variation, it was found that the proposed variation of DCS++ produces better results in terms of efficiency and accuracy. The scope of the research was limited to algorithms under the empirical technique of duplicate record detection, and the research material was gathered from a single digital library, IEEE. A restaurant dataset was selected and the results were evaluated on that dataset alone, which can be considered a limitation of the research. The existing DCS++ algorithm and the proposed variation were implemented in C#. It was concluded that the proposed algorithm clearly outperforms the existing one.
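The DCS++ algorithm selected above (the Duplicate Count Strategy with adaptive windows, reference 23) sorts the records by a key, compares each record against the others inside a sliding window, extends the window past every detected duplicate, and skips the windows of records already placed in a cluster. The following is a minimal sketch of that idea in Python rather than the paper's C# implementation; the function name, the window size `w`, and the `is_duplicate` predicate are placeholders for whatever sort key and similarity measure an experiment would actually use.

```python
# Hypothetical sketch of the DCS++ adaptive-window strategy, not the
# authors' implementation: sort, slide a window, extend it past each
# duplicate, and skip records already assigned to a cluster.
def dcs_plus_plus(records, key, is_duplicate, w=4):
    recs = sorted(records, key=key)        # sorting phase
    n = len(recs)
    skip = set()                           # indices already clustered
    pairs = set()                          # detected duplicate pairs (i, j)
    i = 0
    while i < n:
        if i in skip:                      # DCS++: known duplicates get
            i += 1                         # no window of their own
            continue
        win_end = min(i + w, n)            # window covers recs[i+1:win_end]
        j = i + 1
        while j < win_end:
            if j not in skip and is_duplicate(recs[i], recs[j]):
                pairs.add((i, j))
                skip.add(j)
                # extend the window by the w-1 successors of the duplicate
                win_end = min(max(win_end, j + w), n)
            j += 1
        i += 1
    return pairs, recs
```

With a toy similarity predicate such as "same first letter and lengths within one", passing `["smith", "smth", "jones", "joness", "zed"]` with `w=3` sorts the list and reports the two near-duplicate pairs while never opening a window on a record already marked as a duplicate.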

References
  1. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate Record Detection: A Survey," IEEE Trans. on Knowl. and Data Eng., vol. 19, pp. 1-16, 2007.
  2. P. Ying, X. Jungang, C. Zhiwang, and S. Jian, "IKMC: An Improved K-Medoids Clustering Method for Near-Duplicated Records Detection," in Computational Intelligence and Software Engineering, 2009. CiSE 2009. International Conference on, Wuhan, 2009, pp. 1 - 4.
  3. M. Rehman and V. Esichaikul, "Duplicate Record Detection for Database Cleansing," in Machine Vision, 2009. ICMV '09. Second International Conference on, Dubai, 2009, pp. 333-338.
  4. X. Mansheng, L. Yoush, and Z. Xiaoqi, "A Property Optimization Method in Support of Approximately Duplicated Records Detecting," in Intelligent Computing and Intelligent Systems, 2009. ICIS 2009. IEEE International Conference on, 2009.
  5. Q. Hua, M. Xiang, and F. Sun, "An Optimal Feature Selection Method for Approximately Duplicate Records," in Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference on, Chengdu, 2010.
  6. D. Bhalodiya, K. M. Patel, and C. Patel, "An Efficient Way to Find Frequent Pattern with Dynamic Programming Approach," in Nirma University International Conference on Engineering, 2013.
  7. L. Huang, P. Yuan, and F. Chu, "Duplicate Records Cleansing with Length Filtering and Dynamic Weighting," in Semantics, Knowledge and Grid, 2008. SKG '08. Fourth International Conference on, Beijing, 2008, pp. 95 - 102.
  8. M. Gollapalli, X. Li, I. Wood, and G. Governatori, "Approximate Record Matching Using Hash Grams," in 11th IEEE International Conference on Data Mining Workshops, 2011.
  9. Z. Wei, W. Feng, and L. Peipei, "Research on Cleaning Inaccurate Data in Production," in Service Systems and Service Management (ICSSSM), 2012 9th International Conference on, Shanghai, 2012.
  10. L. Zhe and Z. Zhi-gang, "An Algorithm of Detection Duplicate Information Based on Segment," in International Conference on Computational Aspects of Social Networks, 2010.
  11. H. H. Shahri and A. A. Barforush, "Data Mining for Removing Fuzzy Duplicates Using Fuzzy Inference," in IEEE Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS '04), vol. 1, 2004.
  12. W. Su, J. Wang, and F. H. Lochovsky, "Record Matching over Query Results from Multiple Web Databases," IEEE Transactions on Knowledge and Data Engineering, 2010.
  13. R. Naseem, S. Anees, and S. Farook, "Near Duplicate Web Page Detection With Analytic Feature Weighting," in Third International Conference on Advances in Computing and Communications, 2013.
  14. W.-L. Zhao and C.-W. Ngo, "Scale-Rotation Invariant Pattern Entropy for Keypoint-Based Near-Duplicate Detection," IEEE Transactions on Image Processing, 2009.
  15. G. Beskales, M. A. Soliman, I. F. Ilyas, S. Ben-David, and Y. Kim, "ProbClean: A Probabilistic Duplicate Detection System," in Data Engineering (ICDE), 2010 IEEE 26th International Conference on, 2010.
  16. J. Kim and H. Lee, "Efficient Exact Similarity Searches Using Multiple Token Orderings," in IEEE 28th International Conference on Data Engineering, 2012.
  17. M. Ektefa, F. Sidi, H. Ibrahim, and M.,A. Jabar, "A Threshold-based Similarity Measure for Duplicate Detection," in Open Systems (ICOS), 2011 IEEE Conference on, Langkawi, 2011, pp. 37 - 41.
  18. M. Herschel, F. Naumann, S. Szott, and M. Taubert, "Scalable Iterative Graph Duplicate Detection," IEEE Transactions on Knowledge and Data Engineering, 2012.
  19. Q. Kan, Y. Yang, S. Zhen, and W. Liu, "A Unified Record Linkage Strategy for Web Service," in Third International Conference on Knowledge Discovery and Data Mining, 2010.
  20. U. Draisbach and F. Naumann, "A Generalization of Blocking and Windowing Algorithms for Duplicate Detection," in Data and Knowledge Engineering (ICDKE), 2011 International Conference on , Milan, 2011, pp. 18 - 24.
  21. A. Bilke and F. Naumann, "Schema Matching using Duplicates," in Proceedings of the 21st International Conference on Data Engineering, 2005.
  22. Q. Kan, Y. Yang, W. Liu, and X. Liu, "An Integrated Approach for Detecting Approximate Duplicate Records," in Second Asia-Pacific Conference on Computational Intelligence and Industrial Applications, 2009.
  23. U. Draisbach, F. Naumann, S. Szott, and O. Wonneberg, "Adaptive Windows for Duplicate Detection," in 28th International Conference on Data Engineering, 2012.
Index Terms

Computer Science
Information Sciences

Keywords

Duplication Records Detection Algorithm, DCS++, Windowing, Blocking