A Heuristic Approach for Web Content Extraction

Neha Gupta; Dr. Saba Hilal

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 15 - Number 5

Year of Publication: 2011

Authors: Neha Gupta, Dr. Saba Hilal

DOI: 10.5120/1945-2601

Neha Gupta, Dr. Saba Hilal. A Heuristic Approach for Web Content Extraction. International Journal of Computer Applications. 15, 5 (February 2011), 20-24. DOI=10.5120/1945-2601

@article{10.5120/1945-2601,
  author = { Neha Gupta, Dr. Saba Hilal },
  title = { A Heuristic Approach for Web Content Extraction },
  journal = { International Journal of Computer Applications },
  issue_date = { February 2011 },
  volume = { 15 },
  number = { 5 },
  month = { February },
  year = { 2011 },
  issn = { 0975-8887 },
  pages = { 20-24 },
  numpages = {9},
  url = { https://ijcaonline.org/archives/volume15/number5/1945-2601/ },
  doi = { 10.5120/1945-2601 },
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T20:03:21.459744+05:30
%A Neha Gupta
%A Dr. Saba Hilal
%T A Heuristic Approach for Web Content Extraction
%J International Journal of Computer Applications
%@ 0975-8887
%V 15
%N 5
%P 20-24
%D 2011
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The internet has become central to everyday life, and almost any information can be searched for online. Web pages, however, usually carry a large amount of material that is not part of the main content and may not interest the user. Extracting the main content of a web page therefore calls for data mining techniques. Much research has already been done in this field, yet current automatic techniques remain unsatisfactory because their output often does not match the user's query. In this paper, we present an automatic approach that extracts the main content of a web page, using the tag tree and heuristics to filter out clutter and display the main content. Experimental results show that the presented technique substantially outperforms existing techniques.
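The general idea of combining a tag tree with heuristics can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the heuristics, so this example assumes two common ones (dropping subtrees of typical clutter tags, and dropping block elements whose text is mostly link text). The names SKIP_TAGS, BLOCK_TAGS, MAX_LINK_DENSITY, Node, TagTreeBuilder, text_stats and extract_main_text are all hypothetical, as is the 0.5 threshold. Only Python's standard html.parser module is used.

```python
from html.parser import HTMLParser

# All constants below are assumptions made for this sketch, not values from the paper.
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside", "form"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "area", "base",
             "col", "embed", "source", "track", "wbr"}
BLOCK_TAGS = {"div", "section", "table", "ul", "ol", "td", "p"}
MAX_LINK_DENSITY = 0.5  # assumed cutoff: share of a block's text that sits inside <a>


class Node:
    """One element of the tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order


class TagTreeBuilder(HTMLParser):
    """Builds a simple tag tree while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def text_stats(node, inside_link=False):
    """Return (total_chars, link_chars) for the subtree rooted at node."""
    inside_link = inside_link or node.tag == "a"
    total = linked = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            if inside_link:
                linked += len(child)
        elif child.tag not in SKIP_TAGS:
            sub_total, sub_linked = text_stats(child, inside_link)
            total += sub_total
            linked += sub_linked
    return total, linked


def extract_main_text(html):
    """Return the text that survives the two assumed clutter heuristics."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    kept = []

    def visit(node):
        if node.tag in SKIP_TAGS:
            return  # heuristic 1: drop typical clutter containers entirely
        if node.tag in BLOCK_TAGS:
            total, linked = text_stats(node)
            if total == 0 or linked / total > MAX_LINK_DENSITY:
                return  # heuristic 2: drop empty or link-heavy blocks (menus, link lists)
        for child in node.children:
            if isinstance(child, str):
                kept.append(child)
            else:
                visit(child)

    visit(builder.root)
    return " ".join(kept)


if __name__ == "__main__":
    sample = ("<div><ul><li><a href='/'>Home</a></li></ul></div>"
              "<div><p>Body text of the article.</p></div>")
    print(extract_main_text(sample))
```

The link-density test is applied only to block elements so that an inline link inside an ordinary paragraph is kept along with the surrounding text. A real extractor would need heuristics tuned on actual pages; the threshold and tag sets here are placeholders.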


Index Terms

Computer Science

Information Sciences

Keywords

HTML Parser, Tag Tree, Web Content Extraction, Heuristics