Research Article

Proof of Duplication Detection in Data by Applying Similarity Strategies

Published in December 2015 by Varsha Wandhekar, Arti Mohanpurkar
National Conference on Advances in Computing
Foundation of Computer Science USA
NCAC2015 - Number 3
December 2015
Authors: Varsha Wandhekar, Arti Mohanpurkar

Varsha Wandhekar, Arti Mohanpurkar. Proof of Duplication Detection in Data by Applying Similarity Strategies. National Conference on Advances in Computing. NCAC2015, 3 (December 2015), 14-19.

@article{wandhekar2015proof,
author = { Varsha Wandhekar and Arti Mohanpurkar },
title = { Proof of Duplication Detection in Data by Applying Similarity Strategies },
journal = { National Conference on Advances in Computing },
issue_date = { December 2015 },
volume = { NCAC2015 },
number = { 3 },
month = { December },
year = { 2015 },
issn = { 0975-8887 },
pages = { 14-19 },
numpages = { 6 },
url = { /proceedings/ncac2015/number3/23372-5034/ },
publisher = { Foundation of Computer Science (FCS), NY, USA },
address = { New York, USA }
}
%0 Proceeding Article
%1 National Conference on Advances in Computing
%A Varsha Wandhekar
%A Arti Mohanpurkar
%T Proof of Duplication Detection in Data by Applying Similarity Strategies
%J National Conference on Advances in Computing
%@ 0975-8887
%V NCAC2015
%N 3
%P 14-19
%D 2015
%I International Journal of Computer Applications
Abstract

Deduplication is the process of identifying all records within a data set that refer to the same real-world entity. Data gathered from various sources often suffers from quality problems, so duplicates are identified here by combining windowing and blocking strategies. Various similarity metrics are commonly used to recognize similar field entries, and the main focus of this paper is applying the appropriate similarity measure to the appropriate data so that duplicates are identified correctly. Deduplication also provides additional information about the similarity between two entities. In today's data-centric environment, however, similarity measures have many shortcomings, which makes identifying duplicates a challenging task. This paper therefore concentrates on the exact identification of duplicates in a database by applying the concepts of windowing and blocking. The objective is to achieve better precision and good efficiency, and to reduce the false positive rate, all in accordance with the estimated similarities of the records.
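The authors' exact algorithm is not reproduced on this page. As a minimal illustrative sketch (not their implementation), the combination of blocking, a sorted-neighborhood window, and a field similarity measure could look roughly as follows; the record layout, blocking key, window size, and threshold are all assumptions chosen for the example, and difflib's ratio stands in for whatever similarity metric the paper actually uses.

```python
# Illustrative sketch only: blocking + sorted-neighborhood windowing with a
# string similarity measure. Record fields, blocking key, window size, and
# threshold are assumptions, not the authors' implementation.
from difflib import SequenceMatcher


def similarity(a, b):
    """Normalized string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def blocking_key(record):
    """Assumed blocking key: first three letters of the surname."""
    return record["surname"][:3].lower()


def detect_duplicates(records, window=4, threshold=0.85):
    """Return (id, id, score) triples for likely duplicate record pairs.

    Records are first partitioned into blocks, each block is sorted
    (the sorted-neighborhood idea), and only records inside a sliding
    window are compared, which limits the number of comparisons.
    """
    blocks = {}
    for rec in records:
        blocks.setdefault(blocking_key(rec), []).append(rec)

    duplicates = []
    for block in blocks.values():
        block.sort(key=lambda r: (r["surname"], r["name"]))
        for i, left in enumerate(block):
            # Compare only with the next (window - 1) records in the block.
            for right in block[i + 1 : i + window]:
                score = similarity(left["surname"] + left["name"],
                                   right["surname"] + right["name"])
                if score >= threshold:
                    duplicates.append((left["id"], right["id"], round(score, 2)))
    return duplicates


if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "Varsha", "surname": "Wandhekar"},
        {"id": 2, "name": "Varsh",  "surname": "Wandhekar"},
        {"id": 3, "name": "Arti",   "surname": "Mohanpurkar"},
    ]
    # The two "Wandhekar" records are reported as a duplicate pair.
    print(detect_duplicates(sample))
```

Blocking keeps records with different keys from ever being compared, while the window bounds the comparisons inside each sorted block; precision and the false positive rate are then governed by the similarity threshold.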

Index Terms

Computer Science
Information Sciences

Keywords

Deduplication, Similarity Measure, Sorted Neighborhood Method (SNM), Windowing, Blocking