Validation of Deduplication in Data using Similarity Measure

Varsha Wandhekar; Arti Mohanpurkar

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 116 - Number 21

Year of Publication: 2015

Authors: Varsha Wandhekar, Arti Mohanpurkar

10.5120/20460-2819

Varsha Wandhekar, Arti Mohanpurkar. Validation of Deduplication in Data using Similarity Measure. International Journal of Computer Applications. 116, 21 (April 2015), 18-22. DOI=10.5120/20460-2819

@article{10.5120/20460-2819,
  author = {Varsha Wandhekar and Arti Mohanpurkar},
  title = {Validation of Deduplication in Data using Similarity Measure},
  journal = {International Journal of Computer Applications},
  issue_date = {April 2015},
  volume = {116},
  number = {21},
  month = {April},
  year = {2015},
  issn = {0975-8887},
  pages = {18-22},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume116/number21/20460-2819/},
  doi = {10.5120/20460-2819},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T22:57:45.875635+05:30
%A Varsha Wandhekar
%A Arti Mohanpurkar
%T Validation of Deduplication in Data using Similarity Measure
%J International Journal of Computer Applications
%@ 0975-8887
%V 116
%N 21
%P 18-22
%D 2015
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Deduplication is the process of identifying all pieces of information within a data set that refer to the same real-world entity. Data gathered from various sources may suffer from data quality issues. The approach identifies duplicates by using windowing and blocking strategies. The objective is to achieve better precision and good efficiency while reducing the false positive rate, all in accordance with the estimated similarities of records. Various similarity metrics are commonly used to recognize similar field entries. The main focus of this paper is therefore to apply an appropriate similarity measure to the appropriate data in order to identify duplicates correctly.
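The windowing idea mentioned in the abstract can be illustrated with a minimal sketch of a sorted-neighborhood pass. This is not the authors' implementation: the function names `similarity` and `sorted_neighborhood`, the window size, the threshold, and the sample records are all hypothetical, and Python's `difflib.SequenceMatcher` ratio is used only as a stand-in for whichever similarity measure suits the data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Return index pairs of likely duplicates found within a sliding window.

    `window` and `threshold` are illustrative defaults, not values from the paper.
    """
    # Sort record indices by the sorting key so similar records end up adjacent.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records.
        for j in order[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data: records 0 and 2 describe the same person.
records = [
    "John Smith, 12 Main Street",
    "Alice Brown, 4 Park Avenue",
    "Jon Smith, 12 Main Street",
    "Bob Stone, 9 Hill Road",
]
duplicates = sorted_neighborhood(records, key=lambda r: r.lower())
print(duplicates)  # [(0, 2)]
```

Restricting comparisons to a window over the sorted order avoids comparing every pair of records, which is where the efficiency gain comes from. The threshold trades false positives against missed duplicates, so it would need tuning for each data set and similarity measure.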

Index Terms

Computer Science

Information Sciences

Keywords

Deduplication, Similarity Measure, Sorted Neighborhood Method (SNM), Windowing, Blocking