Research Article

Article:Smart Approach to Reduce the Web Crawling Traffic of Existing System using HTML based Update File at Web Server

by Shekhar Mishra, Anurag Jain, Dr. A.K. Sachan
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 11 - Number 7
Year of Publication: 2010
Authors: Shekhar Mishra, Anurag Jain, Dr. A.K. Sachan
10.5120/1593-2140

Shekhar Mishra, Anurag Jain, Dr. A.K. Sachan. Article:Smart Approach to Reduce the Web Crawling Traffic of Existing System using HTML based Update File at Web Server. International Journal of Computer Applications. 11, 7 (December 2010), 34-38. DOI=10.5120/1593-2140

@article{ 10.5120/1593-2140,
author = { Shekhar Mishra and Anurag Jain and Dr. A.K. Sachan },
title = { Article:Smart Approach to Reduce the Web Crawling Traffic of Existing System using HTML based Update File at Web Server },
journal = { International Journal of Computer Applications },
issue_date = { December 2010 },
volume = { 11 },
number = { 7 },
month = { December },
year = { 2010 },
issn = { 0975-8887 },
pages = { 34-38 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume11/number7/1593-2140/ },
doi = { 10.5120/1593-2140 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T19:59:57.322556+05:30
%A Shekhar Mishra
%A Anurag Jain
%A Dr. A.K. Sachan
%T Article:Smart Approach to Reduce the Web Crawling Traffic of Existing System using HTML based Update File at Web Server
%J International Journal of Computer Applications
%@ 0975-8887
%V 11
%N 7
%P 34-38
%D 2010
%I Foundation of Computer Science (FCS), NY, USA
Abstract

A web crawler downloads information from the web. Because web pages change without notice, a crawler must frequently revisit websites to check for updates. Web crawling is estimated to account for 40% of current internet traffic. In this paper we propose a file that maintains the list of URLs of a website's updated pages. The file format is based on HTML. The crawler visits only the UPDATE file and does not need to revisit the full website to learn what has changed. This scheme can easily be implemented on today's systems with little modification to the web application and the web crawler. We test the proposed method in a simulator, using a website of 13 pages for the experiment. The experimental results show that this scheme is very promising.
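
As an illustration of the crawler side of this scheme, the following Python sketch parses an UPDATE file and derives the list of pages to re-fetch. The abstract does not specify the file's layout, so the sketch assumes it is an HTML document whose `<a href>` links name the updated pages. The class `UpdateFileParser`, the function `updated_urls`, the sample markup, and the `example.com` URLs are all hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class UpdateFileParser(HTMLParser):
    """Collects the href of every <a> tag in a hypothetical UPDATE file."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)


def updated_urls(update_html, base_url):
    """Return the absolute URLs listed in the UPDATE file, without duplicates."""
    parser = UpdateFileParser()
    parser.feed(update_html)
    parser.close()
    seen = set()
    result = []
    for href in parser.urls:
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Assumed layout: a plain HTML page whose links point at the changed pages.
SAMPLE_UPDATE_FILE = """
<html><body>
  <a href="/news.html">news</a>
  <a href="/products/list.html">products</a>
  <a href="/news.html">news (listed twice)</a>
</body></html>
"""

changed = updated_urls(SAMPLE_UPDATE_FILE, "http://example.com/")
# One request for the UPDATE file itself, plus one per changed page.
requests_made = 1 + len(changed)
print(changed, requests_made)
```

Under these assumptions, a crawler revisiting the 13-page test site would issue 3 requests (the UPDATE file plus the two changed pages) instead of 13.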

Index Terms

Computer Science
Information Sciences

Keywords

Web Search Engine, Web, Web Crawler, Web Crawling Traffic