Research Article

Main Content Extraction from Detailed Web Pages

by Amir Masoud Rahmani, Mir Mohsen Pedram, Mohsen Asfia
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 4 - Number 11
Year of Publication: 2010
Authors: Amir Masoud Rahmani, Mir Mohsen Pedram, Mohsen Asfia
DOI: 10.5120/869-1219

Amir Masoud Rahmani, Mir Mohsen Pedram, Mohsen Asfia. Main Content Extraction from Detailed Web Pages. International Journal of Computer Applications. 4, 11 (August 2010), 18-21. DOI=10.5120/869-1219

@article{ 10.5120/869-1219,
author = { Amir Masoud Rahmani and Mir Mohsen Pedram and Mohsen Asfia },
title = { Main Content Extraction from Detailed Web Pages },
journal = { International Journal of Computer Applications },
issue_date = { August 2010 },
volume = { 4 },
number = { 11 },
month = { August },
year = { 2010 },
issn = { 0975-8887 },
pages = { 18-21 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume4/number11/869-1219/ },
doi = { 10.5120/869-1219 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T19:52:50.296889+05:30
%A Amir Masoud Rahmani
%A Mir Mohsen Pedram
%A Mohsen Asfia
%T Main Content Extraction from Detailed Web Pages
%J International Journal of Computer Applications
%@ 0975-8887
%V 4
%N 11
%P 18-21
%D 2010
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Detailed web pages on the internet contain information that is not considered primary content, such as advertisements, headers, footers, navigation links and copyright information. Search engines also prefer not to index information such as comments and reviews as informative content, so an algorithm that extracts only the main content could improve the quality of web page indexing. Almost all previously proposed algorithms are tag dependent, meaning they can only look for primary content among specific tags such as <TABLE> or <DIV>. The algorithm in this paper simulates a user's visit to a web page and how the user finds the position of the main content block in the page. The proposed method is tag independent and has two phases. First, it transforms the DOM tree obtained from the input HTML detailed web page into a block tree, based on visual representation and DOM structure, such that every node carries a specification vector. Then it traverses the resulting small block tree to find the main block, whose computed value, derived from its specification vector, is dominant in comparison with the other block nodes. The method has no learning phase and can find informative content on any random input detailed web page. It has been tested on a large variety of websites and, as we will show, it achieves better precision and recall than the compared method, K-FE.
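
The second phase described above (traversing a block tree and picking the block whose computed value dominates its siblings) can be illustrated with a minimal sketch. The abstract does not define the specification vector or the scoring formula, so everything here is an assumption made for illustration: the vector fields (text length, link text length, rendered area), the score function, the `dominance` threshold and all names are hypothetical, not the authors' method.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """A node of a (hypothetical) block tree built from the DOM."""
    name: str
    text_chars: int = 0      # visible text characters owned by this block
    link_chars: int = 0      # portion of that text that sits inside links
    area: float = 0.0        # fraction of the rendered page area (0..1)
    children: List["Block"] = field(default_factory=list)

    def spec_vector(self) -> Tuple[int, int, float]:
        """Assumed specification vector, aggregated over the subtree."""
        text, links, area = self.text_chars, self.link_chars, self.area
        for child in self.children:
            t, l, a = child.spec_vector()
            text, links, area = text + t, links + l, area + a
        return text, links, area


def score(block: Block) -> float:
    """Assumed scoring rule: non-link text weighted by rendered area."""
    text, links, area = block.spec_vector()
    return (text - links) * area


def find_main_block(root: Block, dominance: float = 0.6) -> Block:
    """Descend while one child holds a dominant share of its siblings' score."""
    node = root
    while node.children:
        scores = [score(child) for child in node.children]
        total = sum(scores)
        best = max(range(len(scores)), key=scores.__getitem__)
        if total <= 0 or scores[best] / total < dominance:
            break  # no child dominates, so the current block is the answer
        node = node.children[best]
    return node


# Toy page: header, navigation and footer are link-heavy and small,
# while the article block holds most of the non-link text.
page = Block("page", children=[
    Block("header", text_chars=120, link_chars=100, area=0.10),
    Block("nav", text_chars=300, link_chars=290, area=0.15),
    Block("content", children=[
        Block("article", text_chars=4000, link_chars=80, area=0.50),
        Block("comments", text_chars=900, link_chars=50, area=0.10),
    ]),
    Block("footer", text_chars=150, link_chars=60, area=0.05),
])

print(find_main_block(page).name)  # prints: article
```

The sketch involves no tag names, which mirrors the tag-independent claim, and no training step, which mirrors the absence of a learning phase. A real implementation would need a rendering step to obtain block areas, which is outside the scope of this example.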

References
  1. Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran Soares da Silva, Juliana S. Teixeira: A Brief Survey of Web Data Extraction Tools. SIGMOD Record 31(2): 84-93 (2002).
  2. Bing Liu: Web Data Mining. Springer (2007).
  3. C. J. Van Rijsbergen: Information Retrieval. Butterworth-Heinemann (1979).
  4. Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma: Block Based Web Search. In: Proc. 2004 Int. Conf. on Research and Development in Information Retrieval (SIGIR’04), Sheffield, UK (July 2004).
  5. Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma: Extracting Content Structure for Web Pages based on Visual Representation. In: The Fifth Asia Pacific Web Conference (APWeb2003), Springer Lecture Notes in Computer Science (2003).
  6. Deng Cai, Xiaofei He, Ji-Rong Wen and Wei-Ying Ma: Block Level Link Analysis. In: Proc. 2004 Int. Conf. on Research and Development in Information Retrieval (SIGIR’04), Sheffield, UK (July 2004).
  7. Hwanjo Yu, AnHai Doan, and Jiawei Han: Mining for Information Discovery on the Web: Overview and Illustrative Research. In: Intelligent Technologies for Information Analysis, edited by Ning Zhong, Springer-Verlag, invited paper, pp. 135-168 (2004).
  8. Jeff Pasternack, Dan Roth: Extracting Article Text from the Web with Maximum Subsequence Segmentation. In: WWW '09: Proceedings of the 18th International Conference on World Wide Web, New York, NY, USA, ACM, pp. 971-980 (2009).
  9. Lakshmish Ramaswamy, Arun Iyengar, Ling Liu and Fred Douglis: Automatic Detection of Fragments in Dynamically Generated Web Pages. In: 13th International Conference on the World Wide Web (WWW-2004), pp. 443-454 (2004).
  10. Lan Yi, Bing Liu, and Xiao-Li Li: Eliminating Noisy Information in Web Pages for Data Mining. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2003), Washington, DC, USA, August 24-27 (2003).
  11. Sandip Debnath, Prasenjit Mitra, Nirmal Pal, C. Lee Giles: Automatic Identification of Informative Sections of Web Pages. In: IEEE Transactions on Knowledge and Data Engineering, 17(9): 1233-1246 (2005).
  12. Shian-Hua Lin and Jan-Ming Ho: Discovering informative content blocks from web documents. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 588–593 (2002).
  13. Suhit Gupta, Gail Kaiser, David Neistadt, Peter Grimm: DOM-based Content Extraction of HTML Documents. In: 12th International World Wide Web Conference (May 2003).
  14. Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 109-118, September 11-14 (2001).
  15. World Wide Web Consortium. World Wide Web consortium hypertext markup language.
  16. Yanhong Zhai and Bing Liu: Web Data Extraction Based on Partial Tree Alignment. In: Proc. of the 14th International World Wide Web Conference (WWW-2005), Chiba, Japan, May 10-14 (2005).
  17. Yves Weissig, Thomas Gottron: Combinations of Content Extraction Algorithms. In: Workshop Information Retrieval (2009).
Index Terms

Computer Science
Information Sciences

Keywords

Web mining
Noise elimination
Informative content
Information retrieval
Information extraction