Near Duplicate Web Page Detection using NDupDet Algorithm

Nilakshi Joshi; Jayant Gadge

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 61 - Number 4

Year of Publication: 2013

Authors: Nilakshi Joshi, Jayant Gadge

10.5120/9920-4537

Nilakshi Joshi, Jayant Gadge. Near Duplicate Web Page Detection using NDupDet Algorithm. International Journal of Computer Applications. 61, 4 (January 2013), 56-59. DOI=10.5120/9920-4537

@article{10.5120/9920-4537,
  author = {Nilakshi Joshi and Jayant Gadge},
  title = {Near Duplicate Web Page Detection using NDupDet Algorithm},
  journal = {International Journal of Computer Applications},
  issue_date = {January 2013},
  volume = {61},
  number = {4},
  month = {January},
  year = {2013},
  issn = {0975-8887},
  pages = {56-59},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume61/number4/9920-4537/},
  doi = {10.5120/9920-4537},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T21:08:14.781659+05:30
%A Nilakshi Joshi
%A Jayant Gadge
%T Near Duplicate Web Page Detection using NDupDet Algorithm
%J International Journal of Computer Applications
%@ 0975-8887
%V 61
%N 4
%P 56-59
%D 2013
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The Web is a system of interlinked hypertext documents accessed via the Internet, a global system of interconnected computer networks that serves billions of users worldwide. The huge number of documents on the Web poses a challenge for web search engines. The Web contains multiple copies of the same content or the same web page, and many pages are duplicates or near duplicates of other pages. Such pages cause substantial problems for search engines: they enlarge the space required to store the index, increase the cost of serving results, and frustrate users. Duplicate and near duplicate detection is therefore required to help search engines return distinct, useful, redundancy-free results on the first page. The proposed approach detects near duplicate web pages in order to increase the search effectiveness and storage efficiency of a search engine.

Index Terms

Computer Science

Information Sciences

Keywords

NDupDet algorithm, Near duplicate web pages, search engine, Web, URL