Efficient Duplicate Detection and Elimination in Hierarchical Multimedia Data

International Journal of Computer Applications
© 2015 by IJCA Journal
Volume 122 - Number 12
Year of Publication: 2015
Manjusha R. Pawar
J. V. Shinde

Manjusha R Pawar and J V Shinde. Article: Efficient Duplicate Detection and Elimination in Hierarchical Multimedia Data. International Journal of Computer Applications 122(12):15-21, July 2015. Full text available. BibTeX

	author = {Manjusha R. Pawar and J. V. Shinde},
	title = {Article: Efficient Duplicate Detection and Elimination in Hierarchical Multimedia Data},
	journal = {International Journal of Computer Applications},
	year = {2015},
	volume = {122},
	number = {12},
	pages = {15-21},
	month = {July},
	note = {Full text available}


Cleaning data in data warehouses that have a complex hierarchical structure is an important task today. One way to do this is to detect duplicates in large databases, which makes data mining more efficient and effective. Recently proposed algorithms consider relations in a single table, so they can find duplicates by comparing records pairwise. Data is now increasingly stored in more complex, semi-structured or hierarchical forms, which raises the problem of how to detect duplicates in XML data. Because of the differences between data models, algorithms designed for single relations cannot be applied directly to XML data. The objective of this project is to detect duplicates in hierarchical data that contains textual data and multimedia data such as images, audio and video. It also eliminates the detected duplicates using an elimination technique such as deletion. A Bayesian network is used with a modified pruning algorithm for duplicate detection, and experiments are performed on both artificial and real-world datasets. The new XMLMultiDup method performs duplicate detection with high efficiency and effectiveness on multimedia datasets. The method compares each level of the XML tree from the root to the leaves, computing similarity probabilities by assigning weights. It compares the structure and each descendant of both datasets and finds duplicates despite differences in the data.
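
The root-to-leaf weighted comparison described in the abstract can be illustrated with a minimal Python sketch. This is not the XMLMultiDup implementation: it replaces the Bayesian network and the pruning step with a plain recursive weighted average over text values only, and the names `text_similarity`, `node_similarity`, `is_duplicate`, the `text_weight` parameter and the 0.8 threshold are all assumptions introduced for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity in [0, 1] between two text values (hypothetical helper)."""
    a, b = (a or "").strip().lower(), (b or "").strip().lower()
    if not a and not b:
        return 1.0  # two empty values are treated as equal
    return SequenceMatcher(None, a, b).ratio()


def node_similarity(x, y, text_weight=0.5):
    """Recursively score two XML elements from the root down to the leaves."""
    if x.tag != y.tag:
        return 0.0  # different structure at this level
    own = text_similarity(x.text, y.text)
    x_children, y_children = list(x), list(y)
    if not x_children and not y_children:
        return own  # leaf: only the text value matters
    # Pair each child of x with its best-scoring child of y.
    scores = [
        max((node_similarity(cx, cy, text_weight) for cy in y_children), default=0.0)
        for cx in x_children
    ]
    # Divide by the larger child count so missing children lower the score.
    child_score = sum(scores) / max(len(x_children), len(y_children))
    has_text = bool((x.text or "").strip() or (y.text or "").strip())
    if not has_text:
        return child_score  # purely structural node: rely on descendants
    # Assumed weighting between a node's own text and its descendants.
    return text_weight * own + (1 - text_weight) * child_score


def is_duplicate(x, y, threshold=0.8):
    """Flag two elements as duplicates when their score reaches the threshold."""
    return node_similarity(x, y) >= threshold


a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
b = ET.fromstring("<movie><title>The Matrx</title><year>1999</year></movie>")
c = ET.fromstring("<movie><title>Toy Story</title><year>1995</year></movie>")
print(is_duplicate(a, b), is_duplicate(a, c))
```

In this sketch the misspelled title in `b` still scores above the threshold against `a`, while `c` does not. Multimedia leaves (images, audio, video) would need their own similarity function in place of `text_similarity`, which this example does not attempt.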

