Smart Approach to Reduce the Web Crawling Traffic of Existing System using HTML based Update File at Web Server

International Journal of Computer Applications
© 2010 by IJCA Journal
Number 7 - Article 6
Year of Publication: 2010
Authors:
Shekhar Mishra
Anurag Jain
Dr. A.K. Sachan
DOI: 10.5120/1593-2140

Shekhar Mishra, Anurag Jain and Dr. A.K. Sachan. Smart Approach to Reduce the Web Crawling Traffic of Existing System using HTML based Update File at Web Server. International Journal of Computer Applications 11(7):34–38, December 2010. Published By Foundation of Computer Science.

BibTeX

@article{key:article,
	author = {Shekhar Mishra and Anurag Jain and Dr. A.K. Sachan},
	title = {Smart Approach to Reduce the Web Crawling Traffic of Existing System using HTML based Update File at Web Server},
	journal = {International Journal of Computer Applications},
	year = {2010},
	volume = {11},
	number = {7},
	pages = {34--38},
	month = {December},
	note = {Published By Foundation of Computer Science}
}

Abstract

A web crawler downloads information from the web. Because web pages change without notice, a crawler must frequently revisit websites to check for updates. Web crawling is estimated to account for 40% of current internet traffic. In this paper we propose a file that maintains the list of updated URLs of a website's pages. The file format is based on HTML. The crawler visits only this UPDATE file and need not revisit the full website to learn what has changed. The scheme can be implemented easily on today's systems with little modification to the web application and the web crawler. We test the proposed method in a simulator, using a website of 13 pages for the experiment. The experimental results show that the scheme is very promising.
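
The crawler side of this idea can be sketched in a few lines of Python. The abstract does not specify the layout of the UPDATE file, so everything concrete here is an assumption made for illustration: the file is taken to be an HTML page of anchor tags, each carrying a hypothetical `data-modified` attribute with an ISO timestamp, and the names `UpdateFileParser` and `urls_to_recrawl` are invented. It is not the authors' implementation.

```python
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical UPDATE file. The paper's actual HTML layout is not given in the
# abstract, so the <a href=... data-modified=...> convention is an assumption.
SAMPLE_UPDATE_FILE = """
<html><body>
  <a href="/index.html" data-modified="2010-11-02T09:30:00">index</a>
  <a href="/news.html" data-modified="2010-12-01T18:05:00">news</a>
  <a href="/about.html" data-modified="2010-06-15T12:00:00">about</a>
</body></html>
"""


class UpdateFileParser(HTMLParser):
    """Collects (url, last_modified) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.entries = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        modified = attr.get("data-modified")
        # Skip anchors that lack either piece of information.
        if href and modified:
            self.entries.append((href, datetime.fromisoformat(modified)))


def urls_to_recrawl(update_html, last_visit):
    """Return URLs whose recorded modification time is after last_visit."""
    parser = UpdateFileParser()
    parser.feed(update_html)
    parser.close()
    return [url for url, modified in parser.entries if modified > last_visit]


if __name__ == "__main__":
    # The crawler last visited on 15 November 2010; only pages changed since
    # then need to be fetched again.
    print(urls_to_recrawl(SAMPLE_UPDATE_FILE, datetime(2010, 11, 15)))
```

With this sample, the crawler fetches one small file and then re-downloads only the pages listed as changed since its last visit, rather than all pages of the site.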
