Efficient Duplicate Detection and Elimination in Hierarchical Multimedia Data

Manjusha R. Pawar; J. V. Shinde

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 122 - Number 12

Year of Publication: 2015

DOI: 10.5120/21751-5018

Manjusha R. Pawar, J. V. Shinde. Efficient Duplicate Detection and Elimination in Hierarchical Multimedia Data. International Journal of Computer Applications. 122, 12 (July 2015), 15-21. DOI=10.5120/21751-5018

@article{ 10.5120/21751-5018,
author = { Manjusha R. Pawar and J. V. Shinde },
title = { Efficient Duplicate Detection and Elimination in Hierarchical Multimedia Data },
journal = { International Journal of Computer Applications },
issue_date = { July 2015 },
volume = { 122 },
number = { 12 },
month = { July },
year = { 2015 },
issn = { 0975-8887 },
pages = { 15-21 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume122/number12/21751-5018/ },
doi = { 10.5120/21751-5018 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article

%1 2024-02-06T23:10:21.750678+05:30

%A Manjusha R. Pawar

%A J. V. Shinde

%T Efficient Duplicate Detection and Elimination in Hierarchical Multimedia Data

%J International Journal of Computer Applications

%@ 0975-8887

%V 122

%N 12

%P 15-21

%D 2015

%I Foundation of Computer Science (FCS), NY, USA

Abstract

Cleaning data in data warehouses that have a complex hierarchical structure is an important task today. A key part of this cleaning is detecting duplicates in large databases, which makes subsequent data mining more efficient and effective. Recently proposed algorithms consider relations in a single table and can therefore find duplicates easily by comparing records pairwise. However, data is now increasingly stored in more complex, semi-structured or hierarchical forms, which raises the problem of how to detect duplicates in XML data. Because the data models differ, algorithms designed for single relations cannot be applied to XML data. The objective of this project is to detect duplicates in hierarchical data that contains both textual data and multimedia data such as images, audio and video. It also focuses on eliminating the detected duplicates using an elimination technique such as deletion. A Bayesian network is used with a modified pruning algorithm for duplicate detection, and experiments are performed on both artificial and real-world datasets. The new XMLMultiDup method is able to perform duplicate detection with high efficiency and effectiveness on multimedia datasets. The method compares each level of the XML tree from the root to the leaves, computing similarity probabilities by assigning weights. It compares the structure and each descendant across both datasets and finds duplicates despite differences in the data.
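As a rough illustration of the root-to-leaf weighted comparison described in the abstract, the following Python sketch scores two XML elements recursively. It is not the XMLMultiDup method: it substitutes a plain weighted average for the Bayesian network, omits the pruning step and any multimedia comparison, and pairs children greedily. The names `text_similarity`, `node_similarity`, `w_value`, `w_children` and the sample movie records are assumptions introduced only for this example.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """String similarity in [0, 1]; two empty values count as identical."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def node_similarity(x, y, w_value=0.5, w_children=0.5):
    """Score two XML elements level by level, from the root to the leaves.

    Hypothetical simplification: a weighted average of the similarity of
    the nodes' own text and the average similarity of their children.
    """
    if x.tag != y.tag:
        return 0.0
    kids_x, kids_y = list(x), list(y)
    has_text = bool((x.text or "").strip() or (y.text or "").strip())

    parts = []  # (weight, score) pairs that apply to this node
    if has_text or not (kids_x or kids_y):
        parts.append((w_value, text_similarity(x.text, y.text)))
    if kids_x or kids_y:
        # Greedy matching: each child of x takes its best-scoring child of y.
        best = [
            max((node_similarity(cx, cy, w_value, w_children) for cy in kids_y),
                default=0.0)
            for cx in kids_x
        ]
        # Dividing by the larger child count penalizes unmatched children.
        parts.append((w_children, sum(best) / max(len(kids_x), len(kids_y))))

    total_weight = sum(w for w, _ in parts)
    return sum(w * s for w, s in parts) / total_weight


# Sample records (made up for illustration).
a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
b = ET.fromstring("<movie><title>The Matrx</title><year>1999</year></movie>")
c = ET.fromstring("<movie><title>Casablanca</title><year>1942</year></movie>")

if __name__ == "__main__":
    print("a vs b:", round(node_similarity(a, b), 3))
    print("a vs c:", round(node_similarity(a, c), 3))
```

In this sketch, a pair would be flagged as a duplicate when its score exceeds a chosen threshold; both the threshold and the weights would need tuning on labelled data.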

Index Terms

Computer Science

Information Sciences

Keywords

XML Data, Bayesian network, Pruning