Research Article

An Efficient Web Content Extraction from Large Collection of Web Documents using Mining Methods

by S.mahesha, M. Giri, M.s Shashidhara
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 69 - Number 7
Year of Publication: 2013
Authors: S.mahesha, M. Giri, M.s Shashidhara

S.mahesha, M. Giri, M.s Shashidhara. An Efficient Web Content Extraction from Large Collection of Web Documents using Mining Methods. International Journal of Computer Applications. 69, 7 (May 2013), 8-13. DOI=10.5120/11852-7614

@article{ 10.5120/11852-7614,
author = { S.mahesha, M. Giri, M.s Shashidhara },
title = { An Efficient Web Content Extraction from Large Collection of Web Documents using Mining Methods },
journal = { International Journal of Computer Applications },
issue_date = { May 2013 },
volume = { 69 },
number = { 7 },
month = { May },
year = { 2013 },
issn = { 0975-8887 },
pages = { 8-13 },
numpages = {9},
url = { },
doi = { 10.5120/11852-7614 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:29:33.924132+05:30
%A S.mahesha
%A M. Giri
%A M.s Shashidhara
%T An Efficient Web Content Extraction from Large Collection of Web Documents using Mining Methods
%J International Journal of Computer Applications
%@ 0975-8887
%V 69
%N 7
%P 8-13
%D 2013
%I Foundation of Computer Science (FCS), NY, USA

Web mining is a class of data mining that distills the abundant, largely untapped textual information freely available on the web. Its importance grows with the massive volume of data generated on the web every day. Web data clustering organizes a collection of web documents into clusters based on similarity; a good clustering algorithm should yield high intra-cluster similarity and low inter-cluster similarity. The usefulness of grouping similar documents across many applications has drawn researchers to this area. In general, web data arrives in a continuous, multiple, rapid, and time-varying flow. Researchers in web mining have proposed many methods to extract web content, but these methods fail to handle dynamic data. Web content extraction algorithms are important for extracting useful content from web sources. We propose a new method for web content extraction. It consists of four phases: web document selection, web cube creation, web document preprocessing, and presentation. In the first phase, a list of web documents is selected for mining; in the second phase, the documents are used to create a web cube; in the third phase, the documents are preprocessed; and in the final phase, the results are presented to users. The experimental results of the proposed system are compared with those of existing methods, and the proposed system performs better than previous methods.
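
The clustering criterion mentioned in the abstract (high intra-cluster similarity, low inter-cluster similarity) can be illustrated with a minimal Python sketch. This is not the authors' method: the bag-of-words vectors, the cosine measure, the toy documents, and the helper names `vectorize`, `cosine`, `intra_cluster_similarity`, and `inter_cluster_similarity` are all assumptions chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def intra_cluster_similarity(clusters):
    """Mean pairwise similarity of documents that share a cluster."""
    sims = [cosine(a, b) for docs in clusters for a, b in combinations(docs, 2)]
    return sum(sims) / len(sims) if sims else 0.0


def inter_cluster_similarity(clusters):
    """Mean pairwise similarity of documents drawn from different clusters."""
    sims = [
        cosine(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Toy documents, made up for illustration only.
sports = [vectorize("football match score goal"), vectorize("football team goal win")]
finance = [vectorize("stock market price share"), vectorize("market share price fall")]
clusters = [sports, finance]

intra = intra_cluster_similarity(clusters)
inter = inter_cluster_similarity(clusters)
print(f"intra={intra:.3f} inter={inter:.3f}")
```

Under these assumptions, a clustering is judged better when `intra` is high and `inter` is low. A real system would likely use weighted term vectors and far larger document sets, so the numbers here only show the direction of the comparison.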

Index Terms

Computer Science
Information Sciences


Web Cube creation Maintenance Web document Cleaning Web Mining