
Realization of Framework for Web Content Extraction and Classification

International Journal of Computer Applications
© 2011 by IJCA Journal
Number 1 - Article 1
Year of Publication: 2011
Ganesh D. Puri
Prof. Y.C. Kulkarni

Ganesh D Puri and Prof. Y C Kulkarni. Article:Realization of Framework for Web Content Extraction and Classification. International Journal of Computer Applications 32(6):22-26, October 2011. Full text available. BibTeX

	author = {Ganesh D. Puri and Prof. Y.C. Kulkarni},
	title = {Article:Realization of Framework for Web Content Extraction and Classification},
	journal = {International Journal of Computer Applications},
	year = {2011},
	volume = {32},
	number = {6},
	pages = {22-26},
	month = {October},
	note = {Full text available}


Web content extraction and classification can be viewed as a combination of different methods. Web pages today contain a lot of information in addition to their main content, so the main task is to extract the content that is of interest to the user. Text mining is a technique that helps users find useful information in large collections of digital text documents on the Web or in databases. It is therefore crucial that a good text mining model retrieve the information that meets the user's needs within a reasonable time frame. A first step in any Web-based text mining effort is to collect a significant number of Web mentions of a subject. The challenge is then not only to find all occurrences of the subject, but also to keep just those that have the desired meaning. The system described in this paper extracts the main content of a web page and classifies it, using the vector space model for classification.
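The abstract names the vector space model as the classification method but gives no further detail. The following is a minimal sketch of one common way such a classifier can work, not the authors' implementation: each text is mapped to a TF-IDF vector and assigned to the class whose centroid vector is closest by cosine similarity. The function names, the smoothed IDF formula, and the centroid rule are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_idf(token_lists):
    """Compute a smoothed inverse document frequency for every corpus term.

    The smoothing (adding 1 to both counts and to the result) is an
    assumption of this sketch, chosen so no term gets a zero weight.
    """
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Map tokens to a sparse TF-IDF vector; unseen terms are dropped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def train(labeled_texts):
    """Build one centroid per class from (label, text) pairs.

    Each centroid is the sum of the class's TF-IDF vectors; cosine
    similarity ignores vector length, so no averaging is needed.
    """
    tokenized = [(label, tokenize(text)) for label, text in labeled_texts]
    idf = build_idf([tokens for _, tokens in tokenized])
    centroids = {}
    for label, tokens in tokenized:
        centroid = centroids.setdefault(label, Counter())
        for term, weight in tfidf_vector(tokens, idf).items():
            centroid[term] += weight
    return idf, centroids


def classify(text, idf, centroids):
    """Return the label whose centroid is most similar to the text."""
    vector = tfidf_vector(tokenize(text), idf)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))
```

In this sketch, `train` would be called once on the extracted main content of labeled pages, and `classify` on the extracted content of each new page. A nearest-centroid rule is only one option; a k-nearest-neighbour vote over the same vectors would fit the vector space model equally well.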

