Web Content Extraction by Integrating Textual and Visual Importance of Web Pages

K. Nethra; J. Anitha

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 91 - Number 3

Year of Publication: 2014

Authors: K. Nethra, J. Anitha

10.5120/15861-4785

K. Nethra, J. Anitha. Web Content Extraction by Integrating Textual and Visual Importance of Web Pages. International Journal of Computer Applications. 91, 3 (April 2014), 20-24. DOI=10.5120/15861-4785

@article{ 10.5120/15861-4785,
author = { K. Nethra and J. Anitha },
title = { Web Content Extraction by Integrating Textual and Visual Importance of Web Pages },
journal = { International Journal of Computer Applications },
issue_date = { April 2014 },
volume = { 91 },
number = { 3 },
month = { April },
year = { 2014 },
issn = { 0975-8887 },
pages = { 20-24 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume91/number3/15861-4785/ },
doi = { 10.5120/15861-4785 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T22:11:48.633280+05:30
%A K. Nethra
%A J. Anitha
%T Web Content Extraction by Integrating Textual and Visual Importance of Web Pages
%J International Journal of Computer Applications
%@ 0975-8887
%V 91
%N 3
%P 20-24
%D 2014
%I Foundation of Computer Science (FCS), NY, USA

Abstract

A Web page carries a large amount of information, much of which is useful in real-world applications. Additional elements such as links, footers, headers and advertisements complicate content extraction; such irrelevant content is treated as noise. A method is therefore needed to extract the informative content and discard the noisy content from Web pages. This work integrates textual and visual importance to extract the informative content. First, a Web page is converted into a DOM (Document Object Model) tree. For each node in the DOM tree, textual importance and visual importance are calculated and then combined to form a hybrid density. A density sum is then calculated and used in the content extraction algorithm to extract the informative content from Web pages. Performance of Web content extraction is evaluated using precision, recall, F-measure and accuracy.

Index Terms

Computer Science

Information Sciences

Keywords

Web Content Extraction, Web Content Mining, DOM Tree, Vision-based Page Segmentation