A Survey on Data Deduplication in Large Scale Data

Saniya Sudhakaran; Meera Treesa Mathews

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 165 - Number 1

Year of Publication: 2017

Authors: Saniya Sudhakaran, Meera Treesa Mathews

10.5120/ijca2017913696

Saniya Sudhakaran, Meera Treesa Mathews. A Survey on Data Deduplication in Large Scale Data. International Journal of Computer Applications. 165, 1 (May 2017), 1-4. DOI=10.5120/ijca2017913696

@article{10.5120/ijca2017913696,
author = { Saniya Sudhakaran and Meera Treesa Mathews },
title = { A Survey on Data Deduplication in Large Scale Data },
journal = { International Journal of Computer Applications },
issue_date = { May 2017 },
volume = { 165 },
number = { 1 },
month = { May },
year = { 2017 },
issn = { 0975-8887 },
pages = { 1-4 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume165/number1/27534-2017913696/ },
doi = { 10.5120/ijca2017913696 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article

%1 2024-02-07T00:11:10.357007+05:30

%A Saniya Sudhakaran

%A Meera Treesa Mathews

%T A Survey on Data Deduplication in Large Scale Data

%J International Journal of Computer Applications

%@ 0975-8887

%V 165

%N 1

%P 1-4

%D 2017

%I Foundation of Computer Science (FCS), NY, USA

Abstract

This paper presents a survey of data deduplication in large-scale data. Deduplication is the task of finding duplicate records within a single database or data set, or across several of them. The task has attracted considerable attention from the research community, which seeks effective and efficient solutions. Matching records from several databases is known as record linkage. The matched data contain important and usable information that is costly to acquire, which is why data deduplication is receiving more attention day by day. Removing duplicate records from a single database is a critical step in data cleaning, because duplicates can strongly influence the outcomes of subsequent data processing or data mining. As databases grow, the complexity of the matching process becomes one of the major challenges for data deduplication. To overcome this problem, we propose a Two Stage Sampling Selection (T3S) model. In the first stage, a strategy produces balanced subsets of candidate pairs to be labeled. In the second stage, an active selection is incrementally invoked to remove the redundant pairs created in the first stage, which yields a training set that is smaller and more informative than the first-stage output. This training set can be used effectively to identify where the most ambiguous pairs lie and to configure the classification approaches. Our evaluation shows that, compared with state-of-the-art deduplication methods on large datasets, T3S reduces the labeling effort substantially while achieving competitive or superior matching quality.
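The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' T3S implementation, since the abstract gives no algorithmic detail. The names `jaccard`, `stage_one`, `stage_two`, `n_bins`, `per_bin`, and `oracle` are hypothetical. The sketch also assumes that token-level Jaccard similarity is the pair score and that match labels are roughly monotone in that score, so a pair is treated as redundant once a labeled pair already implies its label.

```python
import random
from collections import defaultdict
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def stage_one(pairs, n_bins=10, per_bin=2, seed=0):
    """Stage 1 (assumed form): bin candidate pairs by similarity and draw
    the same number from every non-empty bin, giving a balanced subset."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for a, b in pairs:
        s = jaccard(a, b)
        bins[min(int(s * n_bins), n_bins - 1)].append((s, a, b))
    sample = []
    for level in sorted(bins):
        members = bins[level]
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample


def stage_two(sample, oracle):
    """Stage 2 (assumed form): label pairs one at a time, always picking the
    pair closest to the middle of the still-ambiguous similarity interval,
    and drop pairs whose label is already implied (the redundant ones)."""
    lo, hi = -1.0, 2.0  # highest labeled non-match, lowest labeled match
    labeled = []
    remaining = list(sample)
    while True:
        # Keep only pairs strictly inside the ambiguous interval.
        remaining = [t for t in remaining if lo < t[0] < hi]
        if not remaining:
            break
        mid = (max(lo, 0.0) + min(hi, 1.0)) / 2
        pick = min(remaining, key=lambda t: abs(t[0] - mid))
        remaining.remove(pick)
        s, a, b = pick
        is_match = oracle(a, b)  # one unit of human labeling effort
        labeled.append((s, a, b, is_match))
        if is_match:
            hi = min(hi, s)
        else:
            lo = max(lo, s)
    return labeled, (lo, hi)


# Toy usage: the oracle stands in for a human labeler (hypothetical rule).
records = [
    "john smith 12 main st",
    "john smyth 12 main st",
    "mary jones 4 oak ave",
    "m jones 4 oak ave",
]
candidates = list(combinations(records, 2))
sample = stage_one(candidates)
labeled, bounds = stage_two(
    sample, oracle=lambda a, b: a.split()[-3:] == b.split()[-3:]
)
print(len(sample), len(labeled), bounds)
```

Under the monotonicity assumption, the second stage behaves like a binary search over similarity, so the number of labels requested grows far more slowly than the size of the first-stage sample. The interval between the two returned bounds marks where the most ambiguous pairs lie.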

Index Terms

Computer Science

Information Sciences

Keywords

Deduplication, T3S, FS-Dedup