
Near Duplicate Web Page Detection using NDupDet Algorithm

International Journal of Computer Applications
© 2013 by IJCA Journal
Volume 61 - Number 4
Year of Publication: 2013
Authors:
Nilakshi Joshi
Jayant Gadge
DOI: 10.5120/9920-4537

Nilakshi Joshi and Jayant Gadge. Near Duplicate Web Page Detection using NDupDet Algorithm. International Journal of Computer Applications 61(4):56-59, January 2013. Full text available.

BibTeX

@article{key:article,
	author = {Nilakshi Joshi and Jayant Gadge},
	title = {Near Duplicate Web Page Detection using NDupDet Algorithm},
	journal = {International Journal of Computer Applications},
	year = {2013},
	volume = {61},
	number = {4},
	pages = {56--59},
	month = {January},
	note = {Full text available}
}

Abstract

The Web is a system of interlinked hypertext documents accessed via the Internet, a global system of interconnected computer networks that serves billions of users worldwide. The sheer number of documents on the Web poses a challenge for web search engines. The Web contains multiple copies of the same content, and many pages are duplicates or near duplicates of other pages. Such pages cause substantial problems for search engines: they enlarge the space required to store the index, increase the cost of serving results, and frustrate users. Duplicate and near duplicate detection is therefore required to help search engines return results free of redundancy and present distinct, useful results on the first page. The proposed approach detects near duplicate web pages to increase the search effectiveness and storage efficiency of a search engine.
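
To make the notion of a "near duplicate" concrete, the following is a minimal sketch of one common general technique: comparing the sets of word shingles of two pages with Jaccard similarity. It is not the NDupDet algorithm, which this abstract does not describe. The function names, the shingle size k=3, and the 0.8 threshold are all illustrative assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) of text.

    Assumption: words are lowercased alphanumeric runs; k=3 is illustrative.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (1.0 if both empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(page_a, page_b, k=3, threshold=0.8):
    """Flag two pages as near duplicates when shingle overlap is high.

    Assumption: the 0.8 threshold is a hypothetical choice, not a value
    taken from the paper.
    """
    return jaccard(shingles(page_a, k), shingles(page_b, k)) >= threshold


if __name__ == "__main__":
    a = "a b c d e"
    b = "a b c d f"
    # The two texts share 2 of 4 distinct 3-word shingles.
    print(jaccard(shingles(a), shingles(b)))
    print(is_near_duplicate(a, b))
```

A search engine could, under these assumptions, index only one page from each group of pages flagged this way, which is the storage and result-quality benefit the abstract argues for. Comparing every pair of pages does not scale to the Web, so practical systems typically replace the exact set comparison with compact fingerprints.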
