Research Article

Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling

by V. A. Narayana, P. Premchand, A. Govardhan
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 59 - Number 3
Year of Publication: 2012
DOI: 10.5120/9530-3954

V. A. Narayana, P. Premchand, A. Govardhan. Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling. International Journal of Computer Applications. 59, 3 (December 2012), 22-29. DOI=10.5120/9530-3954

@article{ 10.5120/9530-3954,
author = { V. A. Narayana and P. Premchand and A. Govardhan },
title = { Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling },
journal = { International Journal of Computer Applications },
issue_date = { December 2012 },
volume = { 59 },
number = { 3 },
month = { December },
year = { 2012 },
issn = { 0975-8887 },
pages = { 22-29 },
numpages = {8},
url = { https://ijcaonline.org/archives/volume59/number3/9530-3954/ },
doi = { 10.5120/9530-3954 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:05:09.989229+05:30
%A V. A. Narayana
%A P. Premchand
%A A. Govardhan
%T Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling
%J International Journal of Computer Applications
%@ 0975-8887
%V 59
%N 3
%P 22-29
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Recent years have witnessed the rapid growth of the World Wide Web (WWW). Information is now accessible at one's fingertips, anytime and anywhere, through this massive web repository. The sheer volume of web data poses serious problems for the performance and reliability of web search engines, and search results are often less relevant to the user as a consequence. In addition, the presence of duplicate and near-duplicate web documents creates extra overhead for search engines and critically affects their performance. The demand for integrating data from heterogeneous sources also gives rise to near-duplicate web pages, and the detection of near-duplicate documents within a collection has recently become an area of great interest. In this research, we present an efficient approach for detecting near-duplicate web pages in web crawling that uses keywords and a distance measure. The fingerprint-based approach proposed by G. S. Manku et al. in 2007 is regarded as one of the "state-of-the-art" algorithms for finding near-duplicate web pages. We implemented both approaches and conducted an extensive comparative study between our similarity-score-based approach and the fingerprint-based approach of Manku et al. We analyzed the results in terms of time complexity, space complexity, memory usage, and the confusion matrix parameters. Taking these performance factors into account, the comparison shows our approach to be the better (less complex) of the two on the factors considered.
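
For readers unfamiliar with the fingerprint-based side of this comparison, the following is a minimal sketch of a Charikar-style simhash fingerprint compared by Hamming distance, the general technique that the approach of Manku et al. builds on. It is not the code evaluated in the paper. The tokenization, the term-frequency weighting, the use of truncated MD5 as the per-token hash, the function names, and the default threshold k = 3 are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Return a simhash fingerprint of `text` as an integer of `bits` bits."""
    # Assumption: lowercase word tokens, weighted by term frequency.
    weights = Counter(re.findall(r"\w+", text.lower()))
    totals = [0] * bits
    for token, weight in weights.items():
        # Assumption: MD5 truncated to `bits` bits serves as the token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for i in range(bits):
            # Each token votes +weight or -weight on every bit position.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(doc_a: str, doc_b: str, k: int = 3) -> bool:
    """Treat two documents as near-duplicates if their fingerprints
    differ in at most k bits (k is an illustrative threshold)."""
    return hamming_distance(simhash(doc_a), simhash(doc_b)) <= k
```

Calling `is_near_duplicate(page_a, page_b)` on two page texts returns a boolean. Counting such predictions against known labels would give the true/false positive and negative counts of a confusion matrix like the one used in the comparison.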

References
  1. A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan, "Searching the web", ACM Transactions on Internet Technology, vol. 1, no. 1: pp. 2-43, 2001.
  2. M. Charikar. "Similarity estimation techniques from rounding algorithms". In Proceedings of the 34th Annual Symposium on Theory of Computing (STOC 2002), pp: 380-388, 2002.
  3. R. J. Bayardo, Y. Ma and R. Srikant, "Scaling up all pairs similarity search". In Proceedings of the 16th international conference on World Wide Web, pp. 131 – 140, Banff, Alberta, Canada, 2007.
  4. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. "Syntactic clustering of the web", Computer Networks and ISDN Systems, vol. 29, pp. 1157–1166, 1997.
  5. J. Cho, N. Shivakumar, and H. Garcia-Molina. "Finding replicated web collections". ACM SIGMOD Record, vol. 29, no. 2, pp. 355 – 366, June 2000.
  6. J. G. Conrad, X. S. Guo, and C. P. Schriber. "Online duplicate document detection: signature reliability in a dynamic retrieval environment", in Proceedings of the twelfth international conference on Information and knowledge management, pp. 443 - 452 New Orleans, LA, USA, 2003.
  7. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, "Efficient Similarity Joins for Near-Duplicate Detection", in Proceedings of the 17th international conference on World Wide Web, pp. 131-140, 2008.
  8. M. Henzinger, "Finding near-duplicate web pages: a large-scale evaluation of algorithms", in Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 284-291, 2006.
  9. Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma, "Detecting near-duplicates for web crawling", in Proceedings of the 16th international conference on World Wide Web, pp: 141 - 150, 2007.
  10. D. Fetterly, M. Manasse, and M. Najork. "On the evolution of clusters of near-duplicate web pages". In Proceedings of the First Conference on Latin American Web Congress, 2003.
  11. D. Gibson, R. Kumar, and A. Tomkins. "Discovering large dense subgraphs in massive graphs", In Proceedings of the 31st international conference on Very large data bases, pp. 721 – 732, Trondheim, Norway, 2005.
  12. T. C. Hoad and J. Zobel. "Methods for identifying versioned and plagiarized documents", Journal of the American Society for Information Science and Technology, vol. 54, no. 3, pp. 203–215, 2003.
  13. E. Spertus, M. Sahami, and O. Buyukkokten. "Evaluating similarity measures: a large-scale study in the orkut social network", in proceedings of International Conference on Knowledge Discovery and Data Mining, pp. 678 – 684, Chicago, Illinois, USA, 2005.
  14. Ziv Bar-Yossef, Idit Keidar, Uri Schonfeld, "Do not crawl in the dust: different urls with similar text," in Proceedings of the 16th international conference on World Wide Web, pp: 111 - 120, 2007.
  15. Hui Yang, Jamie Callan, Stuart Shulman, "Next steps in near-duplicate detection for eRulemaking," Proceedings of the 2006 international conference on Digital government research, vol. 151, pp: 239 - 248, 2006.
  16. S. Brin, J. Davis, and H. Garcia-Molina, "Copy detection mechanisms for digital documents", In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), pp. 398–409. ACM Press, May 1995.
  17. D. Metzler, Y. Bernstein and W. Bruce Croft. "Similarity Measures for Tracking Information Flow", in Proceedings of the fourteenth international conference on Information and knowledge management, Bremen, Germany, 2005.
  18. H. Yang and J. Callan, "Near-duplicate detection for eRulemaking", in Proceedings of the 2005 international conference on Digital government research, pp: 78 - 86, 2005.
  19. V. A. Narayana, P. Premchand and A. Govardhan, "Effective Detection of Near-Duplicate Web Documents in Web Crawling", International Journal of Computational Intelligence Research, vol. 5, no. 1, pp. 83–96, 2009.
  20. V. A. Narayana, P. Premchand and A. Govardhan, (2010) "Fixing the Threshold for Effective Detection of Near Duplicate Web Documents in Web Crawling". Proceedings of 6th International Conference on Advanced Data Mining and Applications, Chongqing University, China Published in LNCS of SPRINGER in volume 6440/2010 pp 169-180.
  21. V. A. Narayana, P. Premchand and A. Govardhan, "Near-Duplicate Web Page Detection: A Comparative Study of Two Contrary Approaches" Paper published in proceedings of 6th International Conference on Computer Sciences and Convergence Information Technology, Jeju Island, Korea from 29 Nov - 01 Dec 2011. Indexed in IEEE XPLORE. pp 769-776
  22. V. A. Narayana, P. Premchand and A. Govardhan, "To Create A Confusion Matrix in Respect of Threshold Being Fixed for Effective Detection of Near Duplicate Web Documents in Web Crawling" Paper published in proceedings of 6th International Conference on Computer Sciences and Convergence Information Technology, Jeju Island, Korea from 29 Nov - 01 Dec 2011. Indexed in IEEE XPLORE. pp 763-768
Index Terms

Computer Science
Information Sciences

Keywords

Near Duplicate Documents, Similarity Score Measure, Confusion Matrix, Storage Space Complexity, Memory Usage Analysis, Computation Time Analysis