Research Article

Web Content Extraction by Integrating Textual and Visual Importance of Web Pages

by K. Nethra, J. Anitha
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 91 - Number 3
Year of Publication: 2014
Authors: K. Nethra, J. Anitha
10.5120/15861-4785

K. Nethra, J. Anitha. Web Content Extraction by Integrating Textual and Visual Importance of Web Pages. International Journal of Computer Applications. 91, 3 (April 2014), 20-24. DOI=10.5120/15861-4785

@article{ 10.5120/15861-4785,
author = { K. Nethra and J. Anitha },
title = { Web Content Extraction by Integrating Textual and Visual Importance of Web Pages },
journal = { International Journal of Computer Applications },
issue_date = { April 2014 },
volume = { 91 },
number = { 3 },
month = { April },
year = { 2014 },
issn = { 0975-8887 },
pages = { 20-24 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume91/number3/15861-4785/ },
doi = { 10.5120/15861-4785 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:11:48.633280+05:30
%A K. Nethra
%A J. Anitha
%T Web Content Extraction by Integrating Textual and Visual Importance of Web Pages
%J International Journal of Computer Applications
%@ 0975-8887
%V 91
%N 3
%P 20-24
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Web pages carry a large amount of information that is useful in real-world applications. Additional content such as links, footers, headers, and advertisements complicates content extraction. Such irrelevant content is treated as noise, so a method is needed to extract the informative content and discard the noise. This work integrates textual and visual importance to extract informative content from Web pages. A Web page is first converted into a DOM (Document Object Model) tree. For each node in the DOM tree, textual importance and visual importance are calculated, and the two are combined to form a hybrid density. A density sum is then calculated and used in the content extraction algorithm to extract the informative content. The performance of the extraction is evaluated using precision, recall, F-measure, and accuracy.
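
The four evaluation measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the unit of comparison, so this sketch assumes each page is evaluated as a set of item identifiers (for example DOM node or text-line ids); the names `Scores` and `evaluate` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    precision: float
    recall: float
    f_measure: float
    accuracy: float


def evaluate(extracted: set, relevant: set, universe: set) -> Scores:
    """Compare extracted items against a hand-labelled gold standard.

    Assumption: `extracted` and `relevant` are subsets of `universe`,
    which holds every candidate item on the page.
    """
    tp = len(extracted & relevant)             # informative and extracted
    fp = len(extracted - relevant)             # noise that was extracted
    fn = len(relevant - extracted)             # informative but missed
    tn = len(universe - extracted - relevant)  # noise correctly discarded

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return Scores(precision, recall, f_measure, accuracy)


if __name__ == "__main__":
    page_items = set(range(1, 11))   # ten candidate items on a page
    gold = {1, 2, 3, 4}              # items labelled informative
    predicted = {3, 4, 5}            # items an extractor returned
    print(evaluate(predicted, gold, page_items))
```

Precision penalises noise that leaks into the output, recall penalises informative content that is missed, and accuracy also credits noise that is correctly discarded, which is why the set of all candidate items is passed in explicitly.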

Index Terms

Computer Science
Information Sciences

Keywords

Web Content Extraction, Web Content Mining, DOM Tree, Vision-based Page Segmentation