Improvised Architecture for Distributed Web Crawling

Tilak Patidar; Aditya Ambasth

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 151 - Number 9

Year of Publication: 2016

Authors: Tilak Patidar, Aditya Ambasth

10.5120/ijca2016911857

Tilak Patidar, Aditya Ambasth. Improvised Architecture for Distributed Web Crawling. International Journal of Computer Applications. 151, 9 (Oct 2016), 14-20. DOI=10.5120/ijca2016911857

@article{ 10.5120/ijca2016911857,
  author = { Tilak Patidar and Aditya Ambasth },
  title = { Improvised Architecture for Distributed Web Crawling },
  journal = { International Journal of Computer Applications },
  issue_date = { Oct 2016 },
  volume = { 151 },
  number = { 9 },
  month = { Oct },
  year = { 2016 },
  issn = { 0975-8887 },
  pages = { 14-20 },
  numpages = {9},
  url = { https://ijcaonline.org/archives/volume151/number9/26260-2016911857/ },
  doi = { 10.5120/ijca2016911857 },
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T23:56:38.354287+05:30
%A Tilak Patidar
%A Aditya Ambasth
%T Improvised Architecture for Distributed Web Crawling
%J International Journal of Computer Applications
%@ 0975-8887
%V 151
%N 9
%P 14-20
%D 2016
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Web crawlers are programs designed to fetch web pages for information retrieval systems. Crawlers facilitate this process by following hyperlinks in web pages to automatically download new pages or update existing pages in the repository. A web crawler interacts with millions of hosts, fetches millions of pages per second, and writes these pages to a database, so it must sustain I/O performance and keep network resource usage within operating system limits in order to achieve high performance at a reasonable cost. This paper showcases efficient techniques for developing a scalable web crawling system, addressing challenges related to the structure of the web, distributed computing, job scheduling, spider traps, URL canonicalization, and inconsistent data formats on the web. It also briefly discusses a new web crawler architecture.
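As an illustration of two of the techniques named in the abstract and keywords (canonicalizing URLs and Bloom filters), the following is a minimal Python sketch. It is not the authors' implementation, which this page does not show. The names `canonicalize`, `BloomFilter`, and `should_crawl`, the filter size, the number of hash functions, and the specific normalization rules are all assumptions chosen for the example.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal.

    Assumed rules (illustrative only): lowercase scheme and host, drop
    default ports, use "/" for an empty path, sort query parameters,
    and drop the fragment. Userinfo is discarded and IPv6 literals are
    not handled.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Sorting parameters stops ordering from creating distinct URLs.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, query, ""))


class BloomFilter:
    """Hypothetical fixed-size Bloom filter for 'URL already seen' checks.

    It can report false positives but never false negatives.
    """

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


def should_crawl(url: str, seen: BloomFilter) -> bool:
    """Return True the first time a canonical URL is offered."""
    canonical = canonicalize(url)
    if canonical in seen:
        return False
    seen.add(canonical)
    return True
```

In this sketch, `should_crawl("HTTP://Example.com:80/a?b=2&a=1#top", seen)` and `should_crawl("http://example.com/a?a=1&b=2", seen)` refer to the same canonical URL, so only the first call would schedule a fetch. The trade-off of a Bloom filter is that a small fraction of unseen URLs may be skipped as false positives in exchange for a compact, fixed-size memory footprint.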

Index Terms

Computer Science

Information Sciences

Keywords

Web Crawler, Distributed Computing, Bloom Filter, Batch Crawling, Selection Policy, Politeness Policy