Research Article

Validation of Deduplication in Data using Similarity Measure

by Varsha Wandhekar, Arti Mohanpurkar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 116 - Number 21
Year of Publication: 2015
Authors: Varsha Wandhekar, Arti Mohanpurkar
10.5120/20460-2819

Varsha Wandhekar, Arti Mohanpurkar. Validation of Deduplication in Data using Similarity Measure. International Journal of Computer Applications. 116, 21 (April 2015), 18-22. DOI=10.5120/20460-2819

@article{ 10.5120/20460-2819,
author = { Varsha Wandhekar and Arti Mohanpurkar },
title = { Validation of Deduplication in Data using Similarity Measure },
journal = { International Journal of Computer Applications },
issue_date = { April 2015 },
volume = { 116 },
number = { 21 },
month = { April },
year = { 2015 },
issn = { 0975-8887 },
pages = { 18-22 },
numpages = {5},
url = { https://ijcaonline.org/archives/volume116/number21/20460-2819/ },
doi = { 10.5120/20460-2819 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:57:45.875635+05:30
%A Varsha Wandhekar
%A Arti Mohanpurkar
%T Validation of Deduplication in Data using Similarity Measure
%J International Journal of Computer Applications
%@ 0975-8887
%V 116
%N 21
%P 18-22
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Deduplication is the process of identifying all records within a data set that refer to the same real-world entity. Data gathered from various sources may have data quality issues. The approach identifies duplicates using windowing and blocking strategies. The objective is to achieve better precision and good efficiency, and to reduce the false positive rate, all in accordance with the estimated similarities of the records. Various similarity metrics are commonly used to recognize similar field entries. The main focus of this paper is therefore on applying an appropriate similarity measure to the appropriate data in order to identify duplicates correctly.
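The windowing idea mentioned in the abstract can be illustrated with a minimal sketch of the Sorted Neighborhood Method: records are sorted by a key, and only records that fall inside a sliding window are compared. The abstract does not say which similarity measure the paper uses, so the Jaccard similarity over character bigrams below is an assumption chosen for illustration. The function names (`bigrams`, `jaccard`, `sorted_neighborhood`), the window size, and the threshold are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Set, Tuple


def bigrams(text: str) -> Set[str]:
    """Return the set of character bigrams of a lowercased, whitespace-free string."""
    s = "".join(text.lower().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the bigram sets of two strings (assumed measure)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two strings with no bigrams are treated as identical
    return len(x & y) / len(x | y)


def sorted_neighborhood(
    records: List[str],
    key: Callable[[str], str],
    window: int = 3,
    threshold: float = 0.5,
) -> List[Tuple[int, int]]:
    """Return index pairs of likely duplicates.

    Records are sorted by `key`; each record is compared only with the
    `window - 1` records that follow it in sorted order. `window` and
    `threshold` are illustrative defaults, not values from the paper.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Mary Jones", "Marie Jones", "Alan Turing"]
    print(sorted_neighborhood(names, key=str.lower))
```

A larger window finds more duplicates at the cost of more comparisons, while a higher threshold trades recall for a lower false positive rate. Which similarity measure suits which kind of field is the question the paper addresses.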

References
  1. M. Rehman, V. Esichaikul, "Duplicate Record Detection For Database Cleansing", Second International Conference on Machine Vision, 2009.
  2. E. Rahm and H. Hai Do, "Data Cleaning: Problems and Current Approaches", IEEE Computer Society Technical Committee on Data Engineering, 2000, pp:3-13.
  3. L. Gu and R. Baxter, "Adaptive filtering for efficient record linkage," in Proceedings of the SIAM International Conference on Data Mining, 2004, pp. 477–481.
  4. A. Elmagarmid, P. Ipeirotis, and V. Verykios, "Duplicate record detection: A survey", IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007, pp:1-16.
  5. S. Yan, D. Lee, M. Kan, C. Lee Giles, "Adaptive Sorted Neighborhood Methods for Efficient Record Linkage", ACM, JCDL, June 2007, pp:17-22.
  6. L. Leitao, P. Calado, and M. Herschel, "Efficient and Effective Duplicate Detection in Hierarchical Data", IEEE Transactions On Knowledge And Data Engineering, Vol. 25, No. 5, May 2013.
  7. N. Koudas, S. Sarawagi, D. Srivastava, "Record Linkage: Similarity Measures and Algorithms", ACM, SIGMOD 2006, pp:802-804.
  8. V. Wandhekar, A. Mohanpurkar, "A Review on Efficient and Effective Duplicate Detection in Data", International Journal for Research in Applied Science and Engineering Technology (IJRASET), ISSN: 2321-9653, Volume 2 Issue XI, November 2014, pp: 103-107.
  9. U. Draisbach, F. Naumann, "A Generalization of Blocking and Windowing Algorithms for Duplicate Detection", IEEE, 2011, pp: 18-24.
  10. M. Bilenko, B. Kamath, R. Mooney, "Adaptive Blocking: Learning to Scale Up Record Linkage", In Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM-06), Hong Kong, December 2006, pp. 87-96.
  11. U. Draisbach, F. Naumann, S. Szott, and O. Wonneberg, "Adaptive windows for duplicate detection," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, NY, USA, 2011.
  12. K. Prasad, S. Chaturvedi, T. Faruquie, L. Subramaniam, "Automated Selection of Blocking Columns for Record Linkage", IEEE, 2012.
  13. J. Nin, V. Mulero, N. Bazan, Josep-L. L. Pey, "On the Use of Semantic Blocking Techniques for Data Cleansing and Integration", 11th International Database Engineering and Applications Symposium, 2007.
  14. U. Draisbach and F. Naumann, "A comparison and generalization of blocking and windowing algorithms for duplicate detection," in Proceedings of the International Workshop on Quality in Databases (QDB), 2009.
  15. R. Baxter and P. Christen, "A comparison of fast blocking methods for record linkage," in ACM SIGKDD Workshop on Data Cleansing, Record Linkage and Object Consolidation, pages 25-27, Washington DC, 2003.
  16. D. Bharambe, S. Jain, A. Jain, "A Survey: Detection of Duplicate Record", International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 11, November 2012.
  17. W. Winkler, "Overview of Record Linkage and Current Research Directions", Statistical Research Report, February 8, 2006.
  18. V. Raisinghani, S. Sarawagi, "Cleaning Methods in Data Warehouse", School of Information Technology, IIT Bombay, 1999.
Index Terms

Computer Science
Information Sciences

Keywords

Deduplication, Similarity Measure, Sorted Neighborhood Method (SNM), Windowing, Blocking