Research Article

Unvisited URL Relevancy Calculation in Focused Crawling Based on Naïve Bayesian Classification

by Lizashree Mishra, Amritesh Kumar, Debashis Hati
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 3 - Number 9
Year of Publication: 2010
Authors: Lizashree Mishra, Amritesh Kumar, Debashis Hati
DOI: 10.5120/767-1074

Lizashree Mishra, Amritesh Kumar, Debashis Hati. Unvisited URL Relevancy Calculation in Focused Crawling Based on Naïve Bayesian Classification. International Journal of Computer Applications 3, 9 (July 2010), 23-30. DOI=10.5120/767-1074

@article{ 10.5120/767-1074,
author = { Lizashree Mishra, Amritesh Kumar, Debashis Hati },
title = { Unvisited URL Relevancy Calculation in Focused Crawling Based on Naïve Bayesian Classification },
journal = { International Journal of Computer Applications },
issue_date = { July 2010 },
volume = { 3 },
number = { 9 },
month = { July },
year = { 2010 },
issn = { 0975-8887 },
pages = { 23-30 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume3/number9/767-1074/ },
doi = { 10.5120/767-1074 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Lizashree Mishra
%A Amritesh Kumar
%A Debashis Hati
%T Unvisited URL Relevancy Calculation in Focused Crawling Based on Naïve Bayesian Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 3
%N 9
%P 23-30
%D 2010
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Vertical search engines use a focused crawler as their key component and rely on specific algorithms to select web pages relevant to a pre-defined set of topics. Crawlers are programs that traverse the internet and retrieve web pages by following hyperlinks. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to explore all regions of the Web. Maintaining the currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size of the web. A focused crawler searches only the subset of the web related to a specific topic and thus offers a potential solution to this problem: it targets a particular topic and visits and gathers only a relevant, narrow web segment while trying not to waste resources on irrelevant material. Because the crawler is only a computer program, it cannot by itself determine how relevant a web page is; the major problem is how to retrieve the maximal set of relevant, high-quality pages. In our proposed approach, we classify each unvisited URL as relevant to the topic or not based on the attribute scores of visited URLs, and then decide based on the seed page attribute scores. Based on each score, we put a “Yes” or “No” value in the attribute table. The URL attributes are: the relevancy of its anchor text, the similarity of its description in the Google search engine with the topic keywords, the similarity of its cohesive text with the topic keywords, and the relevancy score of its parent pages. Relevancy scores are calculated with the vector space model, and classification is done with the Naïve Bayesian classification method.
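
The sketch below is a minimal illustration of the pipeline the abstract describes, not the authors' implementation: each attribute of an unvisited URL is scored with a vector-space (cosine) similarity against the topic keywords, the scores are discretised into “Yes”/“No” values, and the URL is classified with a Naïve Bayesian classifier trained on the attribute values of already-visited URLs. All helper names, thresholds, and training rows here are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(text: str, topic_keywords: str) -> float:
    """Vector space model relevancy score: cosine of term-frequency vectors."""
    a, b = Counter(text.lower().split()), Counter(topic_keywords.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def to_label(score: float, threshold: float = 0.1) -> str:
    """Discretise an attribute score into the "Yes"/"No" table values (threshold is assumed)."""
    return "Yes" if score >= threshold else "No"


class NaiveBayes:
    """Naive Bayesian classifier over categorical ("Yes"/"No") attributes."""

    def fit(self, rows, labels):
        self.classes = set(labels)
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.class_counts = Counter(labels)
        self.cond = defaultdict(lambda: defaultdict(int))
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.cond[label][(i, value)] += 1

    def predict(self, row):
        best, best_p = None, -1.0
        for c in self.classes:
            # Laplace smoothing so unseen attribute values do not zero the product.
            p = self.priors[c]
            for i, value in enumerate(row):
                p *= (self.cond[c][(i, value)] + 1) / (self.class_counts[c] + 2)
            if p > best_p:
                best, best_p = c, p
        return best


if __name__ == "__main__":
    topic = "focused crawler web search"

    # Hypothetical visited URLs: "Yes"/"No" values for
    # (anchor text, search-engine description, cohesive text, parent page score).
    visited = [
        (["Yes", "Yes", "Yes", "Yes"], "relevant"),
        (["Yes", "No", "Yes", "Yes"], "relevant"),
        (["No", "No", "No", "Yes"], "irrelevant"),
        (["No", "No", "No", "No"], "irrelevant"),
    ]
    nb = NaiveBayes()
    nb.fit([r for r, _ in visited], [l for _, l in visited])

    # An unvisited URL: score its attributes against the topic keywords first.
    anchor = "focused crawler tutorial"
    description = "notes on cooking recipes"
    cohesive_text = "web crawler search engine design"
    parent_score = 0.4
    row = [
        to_label(cosine_similarity(anchor, topic)),
        to_label(cosine_similarity(description, topic)),
        to_label(cosine_similarity(cohesive_text, topic)),
        to_label(parent_score),
    ]
    print(row, "->", nb.predict(row))
```

Running the example prints the “Yes”/“No” attribute row for the unvisited URL followed by the predicted class; the discretisation threshold and the tiny training set are placeholders, not values from the paper.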

Index Terms

Computer Science
Information Sciences

Keywords

Crawler, Focused crawler, Vector space model, Naïve Bayesian classification methods