Research Article

A Novel Architecture of a Parallel Web Crawler

by Shruti Sharma, A.K.Sharma, J.P.Gupta
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 14 - Number 4
Year of Publication: 2011
Authors: Shruti Sharma, A.K.Sharma, J.P.Gupta
10.5120/1846-2476

Shruti Sharma, A.K.Sharma, J.P.Gupta. A Novel Architecture of a Parallel Web Crawler. International Journal of Computer Applications. 14, 4 (January 2011), 38-42. DOI=10.5120/1846-2476

@article{ 10.5120/1846-2476,
author = { Shruti Sharma and A.K.Sharma and J.P.Gupta },
title = { A Novel Architecture of a Parallel Web Crawler },
journal = { International Journal of Computer Applications },
issue_date = { January 2011 },
volume = { 14 },
number = { 4 },
month = { January },
year = { 2011 },
issn = { 0975-8887 },
pages = { 38-42 },
numpages = {5},
url = { https://ijcaonline.org/archives/volume14/number4/1846-2476/ },
doi = { 10.5120/1846-2476 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:02:34.437209+05:30
%A Shruti Sharma
%A A.K.Sharma
%A J.P.Gupta
%T A Novel Architecture of a Parallel Web Crawler
%J International Journal of Computer Applications
%@ 0975-8887
%V 14
%N 4
%P 38-42
%D 2011
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Due to the explosion in the size of the WWW [1,4,5], it has become essential to make the crawling process parallel. In this paper we present an architecture for a parallel crawler that consists of multiple crawling processes, called C-procs, which can run on a network of workstations. The proposed crawler is scalable and is resilient against system crashes and other events. The aim of this architecture is to crawl the current set of publicly indexable web pages efficiently and effectively, maximizing the download rate while minimizing the overhead from parallelization.
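
As a rough illustration of the idea of several C-procs sharing one crawl, the following Python sketch simulates the processes inside a single program. It is not the authors' design: the abstract does not say how URLs are divided among C-procs, so the host-hash partitioning in `owner`, the in-memory `FAKE_WEB` link graph that stands in for real downloads, and all function names are assumptions made only for this example.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit

# Hypothetical in-memory "web": URL -> outgoing links (replaces real HTTP fetches).
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/x"],
    "http://c.example/": [],
    "http://c.example/x": ["http://b.example/"],
}


def owner(url, n_procs):
    """Assumed partitioning rule: a stable hash of the hostname picks the C-proc."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_procs


def crawl(seeds, n_procs):
    """Simulate n_procs C-procs, each with its own frontier and seen-set."""
    frontiers = [deque() for _ in range(n_procs)]
    seen = [set() for _ in range(n_procs)]
    downloads = [[] for _ in range(n_procs)]

    for url in seeds:
        frontiers[owner(url, n_procs)].append(url)

    # Take turns (round-robin) until every frontier is empty.
    while any(frontiers):
        for pid in range(n_procs):
            if not frontiers[pid]:
                continue
            url = frontiers[pid].popleft()
            if url in seen[pid]:
                continue  # this C-proc already downloaded the page
            seen[pid].add(url)
            downloads[pid].append(url)
            # Forward each discovered link to the C-proc that owns its host.
            for link in FAKE_WEB.get(url, []):
                frontiers[owner(link, n_procs)].append(link)
    return downloads


if __name__ == "__main__":
    for pid, urls in enumerate(crawl(["http://a.example/"], 3)):
        print(f"C-proc {pid}: {urls}")
```

Because every URL has exactly one owner under this assumed rule, no page is downloaded twice, and the only coordination between C-procs is the forwarding of newly discovered links. A real deployment would run the C-procs as separate processes on different workstations and fetch pages over the network.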

References
  1. Mike Burner, “Crawling towards Eternity: Building an archive of the World Wide Web”, Web Techniques Magazine, 2(5), May 1997.
  2. Berners-Lee and Daniel Connolly, “Hypertext Markup Language. Internetworking draft”, published on the WWW at http://www.w3.org/hypertext/WWW/MarkUp/HTML.html.
  3. Junghoo Cho and Hector Garcia-Molina, “The evolution of the Web and implications for an incremental crawler”, Proc. of VLDB Conf., 2000.
  4. Allan Heydon and Marc Najork, “Mercator: A Scalable, Extensible Web Crawler”, http://research.compaq.com/SRC/mercator/papers/www/paper.html.
  5. Junghoo Cho, “Parallel Crawlers”, Proceedings of WWW2002, Honolulu, Hawaii, USA, May 7-11, 2002. ACM 1-58113-449-5/02/005.
  6. A.K.Sharma, J. P. Gupta, D. P. Agarwal, “Augment Hypertext Documents suitable for parallel crawlers”, Proc. of WITSA-2003, a National workshop on Information Technology Services and Applications, Feb. 2003, New Delhi.
  7. Jonathan Vincent, Graham King, Mark Udall, “General Principles of Parallelism in Search/Optimisation Heuristics”.
  8. Vladislav Shkapenyuk and Torsten Suel, “Design and Implementation of a High-Performance Distributed Web Crawler”, Technical Report, Department of Computer and Information Science, Polytechnic University, Brooklyn, July 2001.
  9. Brian Pinkerton, “Finding what people want: Experiences with the WebCrawler”, Proc. of WWW Conf., 1994.
  10. Junghoo Cho and Hector Garcia-Molina, “The evolution of the Web and implications for an incremental crawler”, Proc. of VLDB Conf., 2000.
  11. Sergey Brin and Lawrence Page, “The anatomy of a large-scale hypertextual Web search engine”, Proc. of 7th International World Wide Web Conference, volume 30, Computer Networks and ISDN Systems, pp. 107-117, April 1998.
  12. Junghoo Cho and Hector Garcia-Molina, “Incremental crawler and evolution of web”, Technical Report, Department of Computer Science, Stanford University.
  13. Alexandros Ntoulas, Junghoo Cho, Christopher Olston, “What's New on the Web? The Evolution of the Web from a Search Engine Perspective”, in Proceedings of the World-Wide Web Conference (WWW), May 2004.
  14. Michael K. Bergman, “The deep web: Surfacing hidden value”, Journal of Electronic Publishing, 7(1), 2001.
  15. V. Crescenzi, G. Mecca, and P. Merialdo. “Roadrunner: Towards Automatic Data Extraction from Large Web Sites,” VLDB Journal, 2001, pp. 109-118.
  16. P. G. Ipeirotis and L. Gravano, “Distributed search over the hidden-web: Hierarchical sampling and selection,” In Proceedings of VLDB ‘02, 2002, pp. 394-405.
  17. Robots exclusion protocol. http://info.webcrawler.com/mak/projects/robots/exclusion.html.
  18. M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and D. C. Steere. Coda: a highly available file system for a distributed workstation environment. IEEE Transactions on Computers, 39(4):447–459, April 1990.
  19. D. Hirschberg. Parallel algorithms for the transitive closure and the connected component problem. In Proceedings of the 8th Annual ACM Symposium on the Theory of Computing, 1976.
Index Terms

Computer Science
Information Sciences

Keywords

WWW, Search Engines, Crawlers, Parallel Crawlers