Research Article

A Heuristic Approach for Web Content Extraction

by Neha Gupta, Dr. Saba Hilal
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 15 - Number 5
Year of Publication: 2011
Authors: Neha Gupta, Dr. Saba Hilal
10.5120/1945-2601

Neha Gupta, Dr. Saba Hilal. A Heuristic Approach for Web Content Extraction. International Journal of Computer Applications. 15, 5 (February 2011), 20-24. DOI=10.5120/1945-2601

@article{ 10.5120/1945-2601,
author = { Neha Gupta, Dr. Saba Hilal },
title = { A Heuristic Approach for Web Content Extraction },
journal = { International Journal of Computer Applications },
issue_date = { February 2011 },
volume = { 15 },
number = { 5 },
month = { February },
year = { 2011 },
issn = { 0975-8887 },
pages = { 20-24 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume15/number5/1945-2601/ },
doi = { 10.5120/1945-2601 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:03:21.459744+05:30
%A Neha Gupta
%A Dr. Saba Hilal
%T A Heuristic Approach for Web Content Extraction
%J International Journal of Computer Applications
%@ 0975-8887
%V 15
%N 5
%P 20-24
%D 2011
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Today, people depend heavily on the internet, and almost anything can be searched for online. Web pages usually contain a large amount of information that does not interest the user because it is not part of the page's main content. Extracting the main content of a web page requires data mining techniques. Much research has already been done in this field, but current automatic techniques are unsatisfactory because their outputs do not match the user's query. In this paper, we present an automatic approach that extracts the main content of a web page, using a tag tree and heuristics to filter out the clutter and display the main content. Experimental results show that the technique presented in this paper substantially outperforms existing techniques.
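
As an illustration of the general idea only, the Python sketch below builds a tag tree with the standard library `html.parser` module and applies two simple heuristics: it skips subtrees rooted at tags that usually hold clutter, and it drops text blocks whose characters are mostly link text. The tag sets, the 0.5 link-density threshold, and all class and function names are assumptions made for this example; the abstract does not state the authors' actual heuristics, and this is not their implementation.

```python
from html.parser import HTMLParser

# Assumed heuristic parameters (hypothetical, not taken from the paper).
CLUTTER_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
CONTENT_TAGS = {"p", "h1", "h2", "h3", "li", "td", "blockquote", "pre"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}
MAX_LINK_DENSITY = 0.5


class Node:
    """One element of the tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node objects and text strings, in document order


class TagTreeBuilder(HTMLParser):
    """Builds a simple tag tree from HTML, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def measure(node, inside_link=False):
    """Return (total_chars, link_chars) for a subtree, skipping clutter tags."""
    if node.tag in CLUTTER_TAGS:
        return 0, 0
    inside_link = inside_link or node.tag == "a"
    total = link = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            link += len(child) if inside_link else 0
        else:
            sub_total, sub_link = measure(child, inside_link)
            total += sub_total
            link += sub_link
    return total, link


def gather_text(node):
    """Collect the text fragments of a subtree in document order."""
    if node.tag in CLUTTER_TAGS:
        return []
    parts = []
    for child in node.children:
        parts.extend([child] if isinstance(child, str) else gather_text(child))
    return parts


def extract_main_content(html):
    """Return the text of content blocks that pass the link-density heuristic."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    blocks = []

    def visit(node):
        if node.tag in CLUTTER_TAGS:
            return  # heuristic 1: skip clutter subtrees entirely
        if node.tag in CONTENT_TAGS:
            total, link = measure(node)
            # heuristic 2: keep blocks that are not dominated by link text
            if total > 0 and link / total <= MAX_LINK_DENSITY:
                blocks.append(" ".join(gather_text(node)))
            return
        for child in node.children:
            if isinstance(child, Node):
                visit(child)

    visit(builder.root)
    return blocks
```

Calling `extract_main_content(page_source)` returns a list of text blocks. In this sketch, text that sits directly inside a container such as `div` without a content tag around it is ignored, which is a simplification rather than a recommendation.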

Index Terms

Computer Science
Information Sciences

Keywords

HTML Parser, Tag Tree, Web Content Extraction, Heuristics