Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2017
Authors:
Kavita Garg, Jayshankar Prasad, Saba Hilal
DOI: 10.5120/ijca2017913526

Kavita Garg, Jayshankar Prasad and Saba Hilal. Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results. International Journal of Computer Applications 163(5):20-23, April 2017. BibTeX

@article{10.5120/ijca2017913526,
	author = {Kavita Garg and Jayshankar Prasad and Saba Hilal},
	title = {Study of Near Duplicate Content: Identification of Categories Generating Maximum Duplicate URL in Results},
	journal = {International Journal of Computer Applications},
	issue_date = {April 2017},
	volume = {163},
	number = {5},
	month = {Apr},
	year = {2017},
	issn = {0975-8887},
	pages = {20-23},
	numpages = {4},
	url = {http://www.ijcaonline.org/archives/volume163/number5/27392-2017913526},
	doi = {10.5120/ijca2017913526},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}
}

Abstract

The study of near-duplicate content involves identifying the search categories that generate the same URL repeatedly in a query result. These categories need to be identified so that results can be improved by removing duplicate URLs. Repeating the same URL in the results irritates the user and lowers the rank of other URLs, which are pushed to the second or third page, where users rarely bother to look. Near-duplicate content therefore sometimes hides better results from the user and makes the search results less effective. Many algorithms, procedures, and filters exist to reduce duplication, but reducing it first requires identifying the duplicates: which categories generate the most duplicate results, in what form the redundancy exists, which search engine generates these duplicates, and so on. This paper presents an effort to identify the categories with the most duplicates in terms of repeated URLs.
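
The counting task described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the input layout (a mapping from category name to the list of result URLs returned for it), and the URL normalization rule (ignore scheme, a leading "www.", letter case of the host, and a trailing slash) are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a comparable form (an assumed rule, not the paper's)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def duplicate_counts(results_by_category):
    """Count, per category, the result entries that repeat an earlier URL."""
    counts = {}
    for category, urls in results_by_category.items():
        seen = Counter(normalize_url(u) for u in urls)
        # Each URL seen n times contributes n - 1 redundant entries.
        counts[category] = sum(n - 1 for n in seen.values())
    return counts


def category_with_most_duplicates(results_by_category):
    """Return the category whose results contain the most repeated URLs."""
    counts = duplicate_counts(results_by_category)
    return max(counts, key=counts.get)


# Hypothetical sample data, for illustration only.
sample = {
    "news": [
        "http://www.example.com/a",
        "https://example.com/a/",
        "http://example.com/b",
    ],
    "sports": ["http://example.org/x", "http://example.org/y"],
}
print(duplicate_counts(sample))
print(category_with_most_duplicates(sample))
```

Counting redundant entries rather than distinct duplicated URLs reflects how many result slots are wasted; a different measure could be substituted in `duplicate_counts` without changing the rest of the sketch.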

