Research Article

Extraction of Template using Clustering from Heterogeneous Web Documents

by Rashmi D Thakare, Manisha R Patil
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 119 - Number 11
Year of Publication: 2015
Authors: Rashmi D Thakare, Manisha R Patil
10.5120/21112-3906

Rashmi D Thakare, Manisha R Patil. Extraction of Template using Clustering from Heterogeneous Web Documents. International Journal of Computer Applications. 119, 11 (June 2015), 23-31. DOI=10.5120/21112-3906

@article{ 10.5120/21112-3906,
author = { Rashmi D Thakare and Manisha R Patil },
title = { Extraction of Template using Clustering from Heterogeneous Web Documents },
journal = { International Journal of Computer Applications },
issue_date = { June 2015 },
volume = { 119 },
number = { 11 },
month = { June },
year = { 2015 },
issn = { 0975-8887 },
pages = { 23-31 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume119/number11/21112-3906/ },
doi = { 10.5120/21112-3906 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:03:46.527635+05:30
%A Rashmi D Thakare
%A Manisha R Patil
%T Extraction of Template using Clustering from Heterogeneous Web Documents
%J International Journal of Computer Applications
%@ 0975-8887
%V 119
%N 11
%P 23-31
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Websites generally use a common template or layout to generate a set of pages. For example, Google Books lays out details such as the author name, book title, and reviews or comments in the same way on all of its book pages, while a database supplies the values that differ from page to page. This paper studies the problem of automatically extracting those database values from web pages without any human input. A template describes how the values are inserted into the pages, and an extraction algorithm at the core of the approach uses it to extract the values. The algorithm generates the template from a defined set of words that occur in common across the pages, so the extracted values are semantically similar in most cases. Our focus is on extracting templates from heterogeneous web pages. Because websites contain a large variety of web documents, an unknown number of templates must be managed, which is achieved by clustering the web documents. Three clustering methods are compared: i) TEXT Minimum Description Length (TEXT-MDL), ii) MinHash using the Jaccard coefficient, and iii) MinHash using the Dice coefficient.
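
To make the MinHash-based comparison named in the abstract more concrete, the following is a minimal sketch of how MinHash signatures can estimate the Jaccard coefficient between two pages, and how a Dice estimate can be derived from it through the identity D = 2J / (1 + J). It is not the authors' implementation: the token sets, the seeded BLAKE2b hash family, the function names, and the greedy threshold clustering are all assumptions chosen for illustration.

```python
import hashlib
from typing import List, Sequence, Set


def _hash(token: str, seed: int) -> int:
    """Seeded 64-bit hash; stands in for one random permutation (assumption)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens: Set[str], num_hashes: int = 128) -> List[int]:
    """Signature = minimum hash value of the token set under each seeded hash."""
    if not tokens:
        raise ValueError("token set must not be empty")
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Fraction of matching signature positions estimates |A n B| / |A u B|."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard_to_dice(jaccard: float) -> float:
    """Dice coefficient from Jaccard: D = 2J / (1 + J)."""
    return 2.0 * jaccard / (1.0 + jaccard)


def exact_jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b)


def exact_dice(a: Set[str], b: Set[str]) -> float:
    return 2.0 * len(a & b) / (len(a) + len(b))


def greedy_cluster(signatures: Sequence[Sequence[int]],
                   threshold: float = 0.5,
                   use_dice: bool = False) -> List[List[int]]:
    """Hypothetical one-pass clustering: a page joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters: List[List[int]] = []
    for idx, sig in enumerate(signatures):
        for cluster in clusters:
            sim = estimate_jaccard(sig, signatures[cluster[0]])
            if use_dice:
                sim = jaccard_to_dice(sim)
            if sim >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


if __name__ == "__main__":
    # Toy "pages" represented as sets of tokens (illustrative data only).
    page_a = {"html", "body", "div", "table"}
    page_b = {"html", "body", "div", "span"}
    sig_a = minhash_signature(page_a, 256)
    sig_b = minhash_signature(page_b, 256)
    j = estimate_jaccard(sig_a, sig_b)
    print("estimated Jaccard:", j, "exact:", exact_jaccard(page_a, page_b))
    print("estimated Dice:", jaccard_to_dice(j), "exact:", exact_dice(page_a, page_b))
```

Because the Dice coefficient is a monotonic function of the Jaccard coefficient, both measures rank page pairs identically; in a sketch like this they differ only in where a fixed similarity threshold falls, which can change how pages are grouped.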

References
  1. A. Arasu and H. Garcia-Molina, Extracting Structured Data from Web Pages. Proc. ACM SIGMOD, 2003.
  2. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, Min-Wise Independent Permutations. J. Computer and System Sciences, vol. 60, no. 3, pp. 630-659, 2000.
  3. D. Chakrabarti, R. Kumar, and K. Punera, Page-Level Template Detection via Isotonic Smoothing. Proc. 16th Int'l Conf. World Wide Web (WWW), 2007.
  4. Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan, Selectivity Estimation for Boolean Queries. Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), 2000.
  5. V. Crescenzi, G. Mecca, and P. Merialdo, RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Proc. 27th Int'l Conf. Very Large Data Bases (VLDB), 2001.
  6. V. Crescenzi, P. Merialdo, and P. Missier, Clustering Web Pages Based on Their Structure. Data and Knowledge Eng., vol. 54, pp. 279-299, 2005.
  7. I. S. Dhillon, S. Mallela, and D. S. Modha, Information-Theoretic Co-Clustering. Proc. ACM SIGKDD, 2003.
  8. D. Gibson, K. Punera, and A. Tomkins, The Volume and Evolution of Web Page Templates. Proc. 14th Int'l Conf. World Wide Web (WWW), 2005.
  9. B. Long, Z. Zhang, and P. S. Yu, Co-Clustering by Block Value Decomposition. Proc. ACM SIGKDD, 2005.
  10. F. Pan, X. Zhang, and W. Wang, CRD: Fast Co-Clustering on Large Data Sets Utilizing Sampling-Based Matrix Decomposition. Proc. ACM SIGMOD, 2008.
  11. C. Kim and K. Shim, TEXT: Automatic Template Extraction from Heterogeneous Web Pages. IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 4, April 2011.
  12. Hanady Abdul Salam and David B. Skillicorn, Classification Using Streaming Random Forests. IEEE Computer Society, IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 1, January 2011.
Index Terms

Computer Science
Information Sciences

Keywords

Webpage sectioning, Webpage segmentation, Template detection, Information extraction, Clustering, Web data modelling, Web data mining, Template extraction, Data mining, Information search and retrieval