Research Article

Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results

by Kavita Garg, Jayshankar Prasad, Saba Hilal
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 163 - Number 5
Year of Publication: 2017
Authors: Kavita Garg, Jayshankar Prasad, Saba Hilal
10.5120/ijca2017913526

Kavita Garg, Jayshankar Prasad, Saba Hilal. Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results. International Journal of Computer Applications. 163, 5 (Apr 2017), 20-23. DOI=10.5120/ijca2017913526

@article{ 10.5120/ijca2017913526,
author = { Kavita Garg and Jayshankar Prasad and Saba Hilal },
title = { Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2017 },
volume = { 163 },
number = { 5 },
month = { Apr },
year = { 2017 },
issn = { 0975-8887 },
pages = { 20-23 },
numpages = {4},
url = { https://ijcaonline.org/archives/volume163/number5/27392-2017913526/ },
doi = { 10.5120/ijca2017913526 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T00:09:21.804779+05:30
%A Kavita Garg
%A Jayshankar Prasad
%A Saba Hilal
%T Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results
%J International Journal of Computer Applications
%@ 0975-8887
%V 163
%N 5
%P 20-23
%D 2017
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The study of near-duplicate content involves identifying the search categories that return the same URL repeatedly in a query result. These categories need to be identified so that results can be improved by removing duplicate URLs. Repeating the same URL in the results irritates the user and lowers the ranking of other URLs, which are pushed to the second or third page, where users seldom look. Near-duplicate content can therefore hide better results from the user and make the search results less effective. Many algorithms, procedures, and filters exist to reduce duplication, but before duplication can be reduced, the duplicates themselves must be identified: which categories generate the most duplicate results, in what form the redundancy exists, which search engines generate these duplicates, and so on. This paper presents an effort to identify the categories with the maximum number of duplicates in terms of identical URLs.
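
The idea of ranking categories by duplicate URLs can be illustrated with a minimal Python sketch. It is not the authors' procedure, which the abstract does not describe. The normalization rule, the function names (`normalize`, `duplicate_count`, `rank_categories`), and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to a comparable form.

    Assumed rule for this sketch: ignore the scheme, a leading "www.",
    a trailing slash, and any query string or fragment.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return f"{host}{path}"


def duplicate_count(urls: list[str]) -> int:
    """Count results whose normalized URL repeats one already in the list."""
    counts = Counter(normalize(u) for u in urls)
    return sum(n - 1 for n in counts.values())


def rank_categories(results: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Sort categories by number of duplicate URLs, highest first."""
    ranked = [(cat, duplicate_count(urls)) for cat, urls in results.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical result lists, keyed by search category.
sample = {
    "news": [
        "https://example.com/story",
        "http://www.example.com/story/",
        "https://example.org/other",
    ],
    "shopping": [
        "https://shop.example.com/item1",
        "https://shop.example.com/item2",
    ],
}

print(rank_categories(sample))
```

With this sample, the "news" category ranks first because two of its entries normalize to the same URL. A real study would need a more careful notion of URL equivalence and actual search engine output.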

Index Terms

Computer Science
Information Sciences
