Web Crawler: A Review

International Journal of Computer Applications
© 2013 by IJCA Journal
Volume 63 - Number 2
Year of Publication: 2013
Authors:
Md. Abu Kausar
V. S. Dhaka
Sanjeev Kumar Singh
DOI: 10.5120/10440-5125

Md. Abu Kausar, V. S. Dhaka and Sanjeev Kumar Singh. Article: Web Crawler: A Review. International Journal of Computer Applications 63(2):31-36, February 2013.

BibTeX

@article{key:article,
	author = {Md. Abu Kausar and V. S. Dhaka and Sanjeev Kumar Singh},
	title = {Article: Web Crawler: A Review},
	journal = {International Journal of Computer Applications},
	year = {2013},
	volume = {63},
	number = {2},
	pages = {31-36},
	month = {February},
	note = {Full text available}
}

Abstract

Information retrieval deals with searching for and retrieving information within documents, as well as searching online databases and the Internet. A web crawler is a program that traverses the Web and downloads web documents in a methodical, automated manner. Based on the type of knowledge, web crawlers are usually divided into three types of crawling techniques: general-purpose crawling, focused crawling, and distributed crawling. This paper discusses the applicability of web crawlers in the field of web search and reviews the application of web crawlers to different problem domains in web search.
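
To make the definition above concrete, the following is a minimal sketch of a general-purpose crawler in Python. It is not taken from the paper. The names `LinkExtractor`, `crawl`, `fetch`, and `TOY_WEB`, as well as the `example.test` URLs, are assumptions made for illustration, and the sketch reads pages from an in-memory dictionary instead of the network so that it runs anywhere.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first traversal from `seed`; returns {url: html} for downloaded pages.

    `fetch` is a caller-supplied function (an assumption of this sketch) that
    takes a URL and returns the page's HTML as a string, or None on failure.
    """
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid revisiting
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before de-duplicating.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages


# A tiny in-memory "web" (made-up URLs) so the sketch runs without network access.
TOY_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c#top">C</a>',
    "http://example.test/c": "no links here",
}

pages = crawl("http://example.test/", TOY_WEB.get)
print(list(pages))
```

The sketch deliberately omits what a production crawler needs, such as robots.txt handling, rate limiting, and error recovery. A focused crawler would differ mainly in how it filters or orders the frontier, and a distributed crawler in how the frontier is partitioned across machines.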
