A Novel Architecture of a Parallel Web Crawler

Shruti Sharma; A.K.Sharma; J.P.Gupta

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 14 - Number 4

Year of Publication: 2011

Authors: Shruti Sharma, A.K.Sharma, J.P.Gupta

10.5120/1846-2476

Shruti Sharma, A.K.Sharma, J.P.Gupta. A Novel Architecture of a Parallel Web Crawler. International Journal of Computer Applications. 14, 4 (January 2011), 38-42. DOI=10.5120/1846-2476

@article{ 10.5120/1846-2476,

author = { Shruti Sharma, A.K.Sharma, J.P.Gupta },

title = { A Novel Architecture of a Parallel Web Crawler },

journal = { International Journal of Computer Applications },

issue_date = { January 2011 },

volume = { 14 },

number = { 4 },

month = { January },

year = { 2011 },

issn = { 0975-8887 },

pages = { 38-42 },

numpages = {9},

url = { https://ijcaonline.org/archives/volume14/number4/1846-2476/ },

doi = { 10.5120/1846-2476 },

publisher = {Foundation of Computer Science (FCS), NY, USA},

address = {New York, USA}

}

%0 Journal Article

%1 2024-02-06T20:02:34.437209+05:30

%A Shruti Sharma

%A A.K.Sharma

%A J.P.Gupta

%T A Novel Architecture of a Parallel Web Crawler

%J International Journal of Computer Applications

%@ 0975-8887

%V 14

%N 4

%P 38-42

%D 2011

%I Foundation of Computer Science (FCS), NY, USA

Abstract

Due to the explosion in the size of the WWW [1,4,5], it has become essential to make the crawling process parallel. In this paper we present an architecture for a parallel crawler that consists of multiple crawling processes, called C-procs, which can run on a network of workstations. The proposed crawler is scalable and resilient against system crashes and other events. The aim of this architecture is to crawl the current set of publicly indexable web pages efficiently and effectively, maximizing the download rate while minimizing the overhead from parallelization.
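
The abstract does not say how work is divided among the C-procs, so the following is only a minimal sketch under stated assumptions rather than the authors' design. It assumes that each URL is owned by exactly one C-proc, chosen by hashing the URL's host name, and that a C-proc forwards every extracted link to its owner. The names `owner` and `crawl`, the in-memory `links` dictionary that stands in for real downloads, and the example hosts are all hypothetical.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit


def owner(url: str, n_procs: int) -> int:
    """Map a URL to a C-proc index by hashing its host name.

    Assumption: host-based partitioning; the abstract does not specify this.
    """
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_procs


def crawl(seeds, links, n_procs):
    """Simulate n_procs C-procs over an in-memory link graph.

    `links` maps a URL to the URLs found on that page and stands in for
    real HTTP downloads. Returns one set of downloaded URLs per C-proc.
    """
    queues = [deque() for _ in range(n_procs)]
    seen = [set() for _ in range(n_procs)]
    for url in seeds:
        queues[owner(url, n_procs)].append(url)

    # Serve the C-procs round-robin until every queue is empty.
    while any(queues):
        for pid in range(n_procs):
            if not queues[pid]:
                continue
            url = queues[pid].popleft()
            if url in seen[pid]:
                continue  # already downloaded by this C-proc
            seen[pid].add(url)  # "download" the page
            for out in links.get(url, []):
                # Forward each extracted link to the C-proc that owns its host.
                queues[owner(out, n_procs)].append(out)
    return seen


if __name__ == "__main__":
    graph = {
        "http://a.example/": ["http://a.example/x", "http://b.example/"],
        "http://b.example/": ["http://c.example/", "http://a.example/"],
        "http://a.example/x": ["http://c.example/"],
        "http://c.example/": [],
    }
    for pid, pages in enumerate(crawl(["http://a.example/"], graph, 3)):
        print(pid, sorted(pages))
```

Because every URL has a single owner in this sketch, no page is downloaded twice, and the only coordination cost is forwarding links between C-procs. That is one simple way to keep the parallelization overhead low. A real deployment would run the C-procs as separate processes on different workstations instead of in one loop.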

References

Mike Burner, “Crawling towards Eternity: Building an archive of the World Wide Web”, Web Techniques Magazine, 2(5), May 1997.
Tim Berners-Lee and Daniel Connolly, “Hypertext Markup Language”, Internet working draft, published on the WWW at http://www.w3.org/hypertext/WWW/MarkUp/HTML.html.
Junghoo Cho and Hector Garcia-Molina, “The evolution of the Web and implications for an incremental crawler”, Proc. of VLDB Conf., 2000.
Allan Heydon and Marc Najork, “Mercator: A Scalable, Extensible Web Crawler”, http://research.compaq.com/SRC/mercator/papers/www/paper.html
Junghoo Cho, “Parallel Crawlers”, Proceedings of WWW2002, Honolulu, Hawaii, USA, May 7-11, 2002. ACM 1-58113-449-5/02/005.
A.K.Sharma, J. P. Gupta, D. P. Agarwal, “Augment Hypertext Documents suitable for parallel crawlers”, Proc. of WITSA-2003, a National workshop on Information Technology Services and Applications, Feb. 2003, New Delhi.
Jonathan Vincent, Graham King, Mark Udall, “General Principles of Parallelism in Search/Optimisation Heuristics”.
Vladislav Shkapenyuk and Torsten Suel, “Design and Implementation of a High-Performance Distributed Web Crawler”, Technical Report, Department of Computer and Information Science, Polytechnic University, Brooklyn, July 2001.
Brian Pinkerton, “Finding what people want: Experiences with the WebCrawler”, Proc. of WWW Conf., 1994.
Junghoo Cho and Hector Garcia-Molina, “The evolution of the Web and implications for an incremental crawler”, Proc. of VLDB Conf., 2000.
Sergey Brin and Lawrence Page, “The anatomy of a large-scale hypertextual Web search engine”, Proc. of 7th International World Wide Web Conference, volume 30, Computer Networks and ISDN Systems, pp. 107-117, April 1998.
Junghoo Cho and Hector Garcia-Molina, “Incremental crawler and evolution of web”, Technical Report, Department of Computer Science, Stanford University.
Alexandros Ntoulas, Junghoo Cho, Christopher Olston, “What's New on the Web? The Evolution of the Web from a Search Engine Perspective”, Proceedings of the World-Wide Web Conference (WWW), May 2004.
Michael K. Bergman, “The deep web: Surfacing hidden value”, Journal of Electronic Publishing, 7(1), 2001.
V. Crescenzi, G. Mecca, and P. Merialdo. “Roadrunner: Towards Automatic Data Extraction from Large Web Sites,” VLDB Journal, 2001, pp. 109-118.
P. G. Ipeirotis and L. Gravano, “Distributed search over the hidden-web: Hierarchical sampling and selection,” In Proceedings of VLDB ‘02, 2002, pp. 394-405.
Robots exclusion protocol. http://info.webcrawler.com/mak/projects/robots/exclusion.html.
M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and D. C. Steere. Coda: a highly available file system for a distributed workstation environment. IEEE Transactions on Computers, 39(4):447–459, April 1990.
D. Hirschberg. Parallel algorithms for the transitive closure and the connected component problem. In Proceedings of the 8th Annual ACM Symposium on the Theory of Computing, 1976.

Index Terms

Computer Science

Information Sciences

Keywords

WWW, Search Engines, Crawlers, Parallel Crawlers