Research Article

A Survey on Data Deduplication in Large Scale Data

by Saniya Sudhakaran, Meera Treesa Mathews
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 165 - Number 1
Year of Publication: 2017
10.5120/ijca2017913696

Saniya Sudhakaran, Meera Treesa Mathews. A Survey on Data Deduplication in Large Scale Data. International Journal of Computer Applications. 165, 1 (May 2017), 1-4. DOI=10.5120/ijca2017913696

@article{ 10.5120/ijca2017913696,
author = { Saniya Sudhakaran and Meera Treesa Mathews },
title = { A Survey on Data Deduplication in Large Scale Data },
journal = { International Journal of Computer Applications },
issue_date = { May 2017 },
volume = { 165 },
number = { 1 },
month = { May },
year = { 2017 },
issn = { 0975-8887 },
pages = { 1-4 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume165/number1/27534-2017913696/ },
doi = { 10.5120/ijca2017913696 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T00:11:10.357007+05:30
%A Saniya Sudhakaran
%A Meera Treesa Mathews
%T A Survey on Data Deduplication in Large Scale Data
%J International Journal of Computer Applications
%@ 0975-8887
%V 165
%N 1
%P 1-4
%D 2017
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper presents a survey of data deduplication in large-scale data. Deduplication is the task of finding duplicate records or duplicate data within a single database or data set, or across several of them. The task has attracted considerable attention from the research community, which seeks effective and efficient solutions. Matching records from several databases is known as record linkage. The matched data contain important and usable information that is costly to acquire, which is why data deduplication is receiving increasing attention. Removing duplicate records from a single database during data cleaning is a critical step, because duplicates can greatly influence the outcomes of subsequent data processing or data mining. As databases grow, the complexity of the matching process becomes one of the major challenges for data deduplication. To overcome this problem, we propose a Two Stage Sampling Selection (T3S) model. In the first stage, a strategy produces balanced subsets of candidate pairs to be labeled. In the second stage, active selection is invoked incrementally to remove the redundant pairs created in the first stage, producing a smaller and more informative training set. This training set can be used effectively to identify where the most ambiguous pairs lie and to configure the classification approaches. Our evaluation shows that, compared with state-of-the-art deduplication methods on large datasets, T3S substantially reduces the labeling effort while achieving competitive or superior matching quality.
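
The two stages described in the abstract can be illustrated with a toy Python sketch. It is not the T3S implementation: the abstract does not give the algorithm's details, so everything below is an assumption made for illustration. The token-based Jaccard similarity, the function names (`jaccard`, `candidate_pairs`, `stage_one`, `stage_two`), and the parameters `levels`, `per_level`, and `min_gap` are all hypothetical. In particular, `stage_two` uses a simple similarity-gap filter as a stand-in for the active selection step, which in a real system would rely on labels supplied by a user.

```python
import random
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(records):
    """Score every record pair (no blocking; acceptable for a toy example)."""
    return [(i, j, jaccard(records[i], records[j]))
            for i, j in combinations(range(len(records)), 2)]


def stage_one(pairs, levels=5, per_level=2, seed=0):
    """Stage 1 (assumed form): sample up to `per_level` pairs from each
    similarity level so that the subset is balanced across levels."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(levels)]
    for pair in pairs:
        # Clamp so that a similarity of exactly 1.0 falls in the top level.
        level = min(int(pair[2] * levels), levels - 1)
        buckets[level].append(pair)
    sample = []
    for bucket in buckets:
        sample.extend(rng.sample(bucket, min(per_level, len(bucket))))
    return sample


def stage_two(sample, min_gap=0.05):
    """Stage 2 (simplified stand-in for active selection): treat a pair as
    redundant when its score is within `min_gap` of the last kept pair."""
    kept = []
    for pair in sorted(sample, key=lambda p: p[2]):
        if not kept or pair[2] - kept[-1][2] >= min_gap:
            kept.append(pair)
    return kept


# Hypothetical records, invented for this example.
records = [
    "john smith 12 main street",
    "jon smith 12 main street",
    "mary jones 4 oak avenue",
    "mary jones 4 oak ave",
    "peter brown 9 hill road",
]
pairs = candidate_pairs(records)
balanced = stage_one(pairs)
to_label = stage_two(balanced)
```

Here `to_label` holds the reduced set of pairs that would be passed to a human labeler. The point of the sketch is only the shape of the pipeline: sample evenly across similarity levels first, then prune pairs that add little new information.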

Index Terms

Computer Science
Information Sciences

Keywords

Deduplication, T3S, FS-Dedup