Web Information Extraction: Tag Density and Keyword Approach

Shikha Shukla; Nitin; Sitendra Tamrakar

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 61 - Number 12

Year of Publication: 2013

Authors: Shikha Shukla, Nitin, Sitendra Tamrakar

10.5120/9981-4811

Shikha Shukla, Nitin, Sitendra Tamrakar. Web Information Extraction: Tag Density and Keyword Approach. International Journal of Computer Applications. 61, 12 (January 2013), 28-30. DOI=10.5120/9981-4811

@article{ 10.5120/9981-4811,
author = { Shikha Shukla and Nitin and Sitendra Tamrakar },
title = { Web Information Extraction: Tag Density and Keyword Approach },
journal = { International Journal of Computer Applications },
issue_date = { January 2013 },
volume = { 61 },
number = { 12 },
month = { January },
year = { 2013 },
issn = { 0975-8887 },
pages = { 28-30 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume61/number12/9981-4811/ },
doi = { 10.5120/9981-4811 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T21:08:55.919474+05:30
%A Shikha Shukla
%A Nitin
%A Sitendra Tamrakar
%T Web Information Extraction: Tag Density and Keyword Approach
%J International Journal of Computer Applications
%@ 0975-8887
%V 61
%N 12
%P 28-30
%D 2013
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Web pages contain a great deal of noise in the form of advertisements, irrelevant information, copyright notices, and menus. To extract information from the web, we use two concepts: text density and the title of the page. Generally, the main content of a page is denser than the rest, and noise carries less text. The title is the most important information on the page, since it tells us what the page is about. We therefore extract all content that is denser than a particular threshold or that contains at least one of the keywords derived from the title of the page. Using this approach, more false negatives can be avoided. The approach gives very satisfactory results.
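
The selection rule described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define density, so the sketch assumes it is the number of text characters per tag in a block. The block tag set, the stopword list, the threshold of 25, and the names `BlockCollector`, `title_keywords`, and `extract_main_content` are all hypothetical choices made for the example.

```python
import re
from html.parser import HTMLParser

# Assumptions for this sketch (not taken from the paper):
BLOCK_TAGS = {"p", "div", "td", "li", "section", "article"}
STOPWORDS = {"a", "an", "the", "of", "and", "for", "to", "in", "on"}


class BlockCollector(HTMLParser):
    """Splits a page into (text, tag_count) blocks and records the title."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []
        self._in_title = False
        self._skip = 0          # nesting depth inside <script>/<style>
        self._text = []
        self._tags = 0

    def _flush(self):
        # Normalize whitespace; store the block only if it has text.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append((text, self._tags))
        self._text, self._tags = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag in BLOCK_TAGS:
            self._flush()       # a new block starts here
            self._tags = 1
        else:
            self._tags += 1     # inline tag inside the current block

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_title:
            self.title += data
        else:
            self._text.append(data)

    def close(self):
        super().close()
        self._flush()           # keep any trailing text


def title_keywords(title):
    """Lowercased title words, minus stopwords and very short tokens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 2}


def extract_main_content(html, density_threshold=25.0):
    """Keep blocks that are dense enough OR share a word with the title."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    keywords = title_keywords(parser.title)
    kept = []
    for text, tags in parser.blocks:
        density = len(text) / max(tags, 1)   # assumed: characters per tag
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if density > density_threshold or words & keywords:
            kept.append(text)
    return kept


SAMPLE = """<html><head><title>Solar Panel Efficiency Guide</title></head><body>
<div><a href="/">Home</a> <a href="/about">About</a> <a href="/c">Contact</a></div>
<p>Modern photovoltaic cells convert sunlight into electricity, and their output depends on temperature, shading, and orientation.</p>
<p>Panel tilt matters.</p>
<div><a href="/ads">Buy now</a> <a href="/ads2">Cheap deals</a></div>
</body></html>"""

if __name__ == "__main__":
    for block in extract_main_content(SAMPLE):
        print(block)
```

In the sample page, the long paragraph passes the density test, while the short sentence "Panel tilt matters." falls below the threshold and is kept only because it shares the word "panel" with the title. That second case is the kind of false negative the keyword test is meant to avoid. The link-heavy menu and advertisement blocks fail both tests and are dropped.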

Index Terms

Computer Science

Information Sciences

Keywords

Crawler, Web mining, Information extraction