Near Duplicate Web Page Detection using NDupDet Algorithm

Nilakshi Joshi; Jayant Gadge

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 61 - Number 4

Year of Publication: 2013

Authors: Nilakshi Joshi, Jayant Gadge

DOI: 10.5120/9920-4537

Nilakshi Joshi, Jayant Gadge. Near Duplicate Web Page Detection using NDupDet Algorithm. International Journal of Computer Applications. 61, 4 (January 2013), 56-59. DOI=10.5120/9920-4537

@article{10.5120/9920-4537,
  author = {Nilakshi Joshi and Jayant Gadge},
  title = {Near Duplicate Web Page Detection using NDupDet Algorithm},
  journal = {International Journal of Computer Applications},
  issue_date = {January 2013},
  volume = {61},
  number = {4},
  month = {January},
  year = {2013},
  issn = {0975-8887},
  pages = {56-59},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume61/number4/9920-4537/},
  doi = {10.5120/9920-4537},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T21:08:14.781659+05:30
%A Nilakshi Joshi
%A Jayant Gadge
%T Near Duplicate Web Page Detection using NDupDet Algorithm
%J International Journal of Computer Applications
%@ 0975-8887
%V 61
%N 4
%P 56-59
%D 2013
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The Web is a system of interlinked hypertext documents accessed via the Internet, a global system of interconnected computer networks that serves billions of users worldwide. The huge number of documents on the Web is a challenge for web search engines. The Web contains multiple copies of the same content, and many pages are duplicates or near duplicates of other pages. Such pages cause substantial problems for search engines: they enlarge the space required to store the index, increase the cost of serving results, and frustrate users. Duplicate and near duplicate detection is therefore required so that search engines can return results free of redundancy and show distinct, useful results on the first page. The proposed approach detects near duplicate web pages to increase the search effectiveness and storage efficiency of a search engine.
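The abstract does not describe how NDupDet works, so the following is only a generic illustration of what near duplicate detection can look like, not the authors' algorithm. It is a minimal sketch that assumes word shingling with Jaccard similarity, a standard technique for this task; the function names, the shingle size `k`, and the `threshold` value are all hypothetical choices made for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped).

    Illustrative helper; k=3 is an assumed default, not a value from the paper.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Short texts yield one shingle holding all words, or none if empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(page_a, page_b, k=3, threshold=0.5):
    """Flag two page texts as near duplicates when shingle overlap reaches threshold.

    The threshold of 0.5 is an assumption chosen for illustration only.
    """
    return jaccard(shingles(page_a, k), shingles(page_b, k)) >= threshold
```

Two pages that differ in only a few words share most of their shingles and score close to 1, while unrelated pages score close to 0. A search engine could use such a score to keep one representative of each group of near duplicates in its index, which is the storage and result-quality benefit the abstract describes.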

Index Terms

Computer Science

Information Sciences

Keywords

NDupDet algorithm, Near duplicate web pages, Search engine, Web URL