Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm

Research Article

Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm

Published on December 2014 by Amita Fulsundar

Innovations and Trends in Computer and Communication Engineering

Foundation of Computer Science USA

ITCCE - Number 1

December 2014

Authors: Amita Fulsundar

Amita Fulsundar . Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm. Innovations and Trends in Computer and Communication Engineering. ITCCE, 1 (December 2014), 1-4.

@article{

author = { Amita Fulsundar },

title = { Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm },

journal = { Innovations and Trends in Computer and Communication Engineering },

issue_date = { December 2014 },

volume = { ITCCE },

number = { 1 },

month = { December },

year = { 2014 },

issn = { 0975-8887 },

pages = { 1-4 },

numpages = 4,

url = { /proceedings/itcce/number1/19037-2001/ },

publisher = {Foundation of Computer Science (FCS), NY, USA},

address = {New York, USA}

}

%0 Proceeding Article

%1 Innovations and Trends in Computer and Communication Engineering

%A Amita Fulsundar

%T Review on Duplicate Detection in Hierarchical Data Using Network Pruning Algorithm

%J Innovations and Trends in Computer and Communication Engineering

%@ 0975-8887

%V ITCCE

%N 1

%P 1-4

%D 2014

%I International Journal of Computer Applications

Abstract

The goal of the data mining process is to extract information from various data sources. Documents from different sources may structure their data differently and still represent the same conceptual information. Duplicate detection addresses this problem: it identifies records in the data sources that refer to the same real-world entity, and it is a necessary task in data cleansing. Many algorithms have been proposed for detecting duplicates in relational data, but very few solutions focus on hierarchical data such as XML. A novel method, XMLDup, has been introduced for duplicate detection in XML data. XMLDup uses a Bayesian network to evaluate the probability that two XML elements are duplicates. It considers not only the content within the elements but also the way that content is structured. To improve the run-time efficiency of network evaluation, a lossless pruning strategy is used, meaning that pruning does not discard any pair that full evaluation would have reported as a duplicate. The algorithm achieves high accuracy and recall scores on several data sets. XMLDup delivers state-of-the-art performance in duplicate detection in terms of both effectiveness and efficiency.
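The pruning idea can be illustrated with a minimal Python sketch. It is not the XMLDup implementation: it assumes the duplicate probability of two elements is simply the product of the similarities of their matching children, uses `difflib.SequenceMatcher` as a stand-in for a real per-field probability, and compares only the first child of each shared tag. The function names `text_similarity` and `pruned_duplicate_score` are hypothetical. Because every factor is at most 1, the running product is an upper bound on the final score, so evaluation can stop as soon as that bound drops below the threshold without losing any true duplicate.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity in [0, 1]; a stand-in for a per-field duplicate probability."""
    return SequenceMatcher(None, (a or "").strip(), (b or "").strip()).ratio()


def pruned_duplicate_score(elem_a, elem_b, threshold, bound=1.0):
    """Return the duplicate score of two elements, or None if the pair was pruned.

    `bound` is the product of all similarities evaluated so far. Remaining
    factors can only lower it, so once it is below `threshold` the pair
    cannot be a duplicate and the rest of the tree is skipped.
    """
    # If either element is a leaf, compare the text content directly.
    if len(elem_a) == 0 or len(elem_b) == 0:
        bound *= text_similarity(elem_a.text, elem_b.text)
        return bound if bound >= threshold else None

    # Otherwise descend into the children whose tags occur in both elements
    # (simplification: only the first child with each tag is compared, and
    # tags present on one side only contribute no evidence).
    shared_tags = sorted({c.tag for c in elem_a} & {c.tag for c in elem_b})
    for tag in shared_tags:
        bound = pruned_duplicate_score(
            elem_a.find(tag), elem_b.find(tag), threshold, bound
        )
        if bound is None:  # pruned further down the hierarchy
            return None
    return bound


if __name__ == "__main__":
    first = ET.fromstring(
        "<movie><title>Alien</title><year>1979</year>"
        "<cast><actor>Sigourney Weaver</actor></cast></movie>"
    )
    second = ET.fromstring(
        "<movie><title>Zodiac</title><year>2007</year>"
        "<cast><actor>Sigourney Weaver</actor></cast></movie>"
    )
    print(pruned_duplicate_score(first, first, threshold=0.8))
    print(pruned_duplicate_score(first, second, threshold=0.8))
```

In this sketch the second comparison stops at the `title` field and never evaluates `year`, which is the source of the run-time saving. A Bayesian network as used by XMLDup combines node probabilities with richer rules than a plain product, so the sketch only conveys why an upper-bound test keeps the pruning lossless.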

Index Terms

Computer Science

Information Sciences

Keywords

Duplicate Detection, XML, Bayesian Networks, Data Cleaning, Optimization