Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results

Kavita Garg; Jayshankar Prasad; Saba Hilal

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 163 - Number 5

Year of Publication: 2017

Authors: Kavita Garg, Jayshankar Prasad, Saba Hilal

10.5120/ijca2017913526

Kavita Garg, Jayshankar Prasad, Saba Hilal. Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results. International Journal of Computer Applications. 163, 5 (Apr 2017), 20-23. DOI=10.5120/ijca2017913526

@article{ 10.5120/ijca2017913526,
author = { Kavita Garg and Jayshankar Prasad and Saba Hilal },
title = { Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2017 },
volume = { 163 },
number = { 5 },
month = { Apr },
year = { 2017 },
issn = { 0975-8887 },
pages = { 20-23 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume163/number5/27392-2017913526/ },
doi = { 10.5120/ijca2017913526 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-07T00:09:21.804779+05:30
%A Kavita Garg
%A Jayshankar Prasad
%A Saba Hilal
%T Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results
%J International Journal of Computer Applications
%@ 0975-8887
%V 163
%N 5
%P 20-23
%D 2017
%I Foundation of Computer Science (FCS), NY, USA

Abstract

This study identifies the search categories that return the same URL more than once in a query result. Identifying these categories is necessary so that results can be improved by removing the duplicate URLs. Repeated URLs irritate the user and lower the rank of other URLs, which are pushed to the second or third page, where users rarely look. Near-duplicate content therefore sometimes hides better results from the user and makes the search results less effective. Many algorithms, procedures, and filters exist to reduce duplication, but reducing it first requires identifying the duplicates: which categories generate the most duplicate results, in what form the redundancy exists, which search engines generate these duplicates, and so on. This paper presents an effort to identify the categories with the most duplicates, measured as repeated URLs.
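As a rough illustration of counting repeated URLs per search category, the sketch below ranks categories by how many results repeat a URL already present in the same result list. It is not the authors' procedure: the function names, the light URL normalization (lowercased scheme and host, no fragment, no trailing slash), and the sample data are all assumptions made for this example.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash.

    This normalization is an assumption of the sketch, not taken from the paper.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def duplicate_count(urls: list[str]) -> int:
    """Count results that repeat a URL already seen in the same list."""
    counts = Counter(normalize_url(u) for u in urls)
    return sum(n - 1 for n in counts.values() if n > 1)


def rank_categories(results_by_category: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Return (category, duplicate count) pairs, highest count first."""
    ranked = [
        (category, duplicate_count(urls))
        for category, urls in results_by_category.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical result lists; real input would come from search engine queries.
sample = {
    "news": [
        "http://example.com/a",
        "http://example.com/a/",
        "http://example.com/b",
    ],
    "shopping": ["http://example.org/x", "http://example.org/y"],
}
ranking = rank_categories(sample)
print(ranking)
```

With this hypothetical input, the "news" category has one repeated URL and "shopping" has none, so "news" ranks first. A stricter or looser notion of "same URL" (for example, ignoring query parameters) would only require changing `normalize_url`.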

Index Terms

Computer Science

Information Sciences
