Research Article

Near Duplicate Web Page Detection using NDupDet Algorithm

by Nilakshi Joshi, Jayant Gadge
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 61 - Number 4
Year of Publication: 2013
Authors: Nilakshi Joshi, Jayant Gadge
10.5120/9920-4537

Nilakshi Joshi, Jayant Gadge. Near Duplicate Web Page Detection using NDupDet Algorithm. International Journal of Computer Applications. 61, 4 (January 2013), 56-59. DOI=10.5120/9920-4537

@article{ 10.5120/9920-4537,
author = { Nilakshi Joshi and Jayant Gadge },
title = { Near Duplicate Web Page Detection using NDupDet Algorithm },
journal = { International Journal of Computer Applications },
issue_date = { January 2013 },
volume = { 61 },
number = { 4 },
month = { January },
year = { 2013 },
issn = { 0975-8887 },
pages = { 56-59 },
numpages = {4},
url = { https://ijcaonline.org/archives/volume61/number4/9920-4537/ },
doi = { 10.5120/9920-4537 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:08:14.781659+05:30
%A Nilakshi Joshi
%A Jayant Gadge
%T Near Duplicate Web Page Detection using NDupDet Algorithm
%J International Journal of Computer Applications
%@ 0975-8887
%V 61
%N 4
%P 56-59
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The Web is a system of interlinked hypertext documents accessed via the Internet, a global system of interconnected computer networks that serves billions of users worldwide. The sheer volume of documents on the Web is a challenge for web search engines, in part because the Web contains multiple copies of the same content or the same page: many pages are duplicates or near duplicates of other pages. Such pages cause substantial problems for search engines. They enlarge the space required to store the index, increase the cost of serving results, and frustrate users. Duplicate and near duplicate detection is therefore required if search engines are to return results free of redundancy and to show distinct, useful results on the first page. The proposed approach detects near duplicate web pages in order to improve the search effectiveness and storage efficiency of a search engine.

Index Terms

Computer Science
Information Sciences

Keywords

NDupDet algorithm, Near duplicate web pages, search engine, Web URL