Extraction of Template using Clustering from Heterogeneous Web Documents

International Journal of Computer Applications
© 2015 by IJCA Journal
Volume 119 - Number 11
Year of Publication: 2015
Authors:
Rashmi D Thakare
Manisha R Patil
10.5120/21112-3906

Rashmi D Thakare and Manisha R Patil. Article: Extraction of Template using Clustering from Heterogeneous Web Documents. International Journal of Computer Applications 119(11):23-31, June 2015. Full text available. BibTeX

@article{key:article,
	author = {Rashmi D Thakare and Manisha R Patil},
	title = {Article: Extraction of Template using Clustering from Heterogeneous Web Documents},
	journal = {International Journal of Computer Applications},
	year = {2015},
	volume = {119},
	number = {11},
	pages = {23-31},
	month = {June},
	note = {Full text available}
}

Abstract

Websites commonly generate sets of pages from a shared template or layout. For example, Google Books lays out details such as author name, book title, and reviews or comments in the same way on all of its book pages, while a database supplies the values that differ from page to page. This paper studies the problem of automatically extracting those database values from web pages without any human input. A template is defined that describes how the values are inserted into the pages, and an extraction algorithm at the core of the approach extracts the values from the web pages. The algorithm is trained to generate the template from a defined set of words that occur in common across pages. As a result, the extracted values are semantically similar in most cases. Our focus is on extracting templates from heterogeneous web pages. Because websites contain a large variety of web documents, an unknown number of templates must be managed, which is achieved by clustering the web documents. Three clustering methods are compared: i) TEXT Minimum Description Length (TEXTMDL), ii) MinHash using the Jaccard coefficient, and iii) MinHash using the Dice coefficient.
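To make the two MinHash variants named in the abstract concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes each page is reduced to a set of whitespace-separated tokens (the paper may use a different page representation), and all function names (`jaccard`, `dice`, `minhash_signature`, `estimate_jaccard`, `dice_from_jaccard`) are hypothetical. It computes the exact Jaccard and Dice coefficients, estimates the Jaccard coefficient from MinHash signatures, and derives a Dice estimate through the identity D = 2J / (1 + J).

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B| of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dice(a, b):
    """Exact Dice coefficient 2|A ∩ B| / (|A| + |B|) of two token sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def _hash(token, seed):
    """Seeded 64-bit hash; SHA-1 is an illustrative choice, not the paper's."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(token_set, num_hashes=128):
    """One minimum hash value per seeded hash function."""
    if not token_set:
        raise ValueError("token_set must be non-empty")
    return [min(_hash(t, seed) for t in token_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree estimates the Jaccard coefficient."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def dice_from_jaccard(j):
    """Convert a Jaccard value to the corresponding Dice value: D = 2J / (1 + J)."""
    return 2 * j / (1 + j)


if __name__ == "__main__":
    # Hypothetical pages: the first two share a layout, the third does not.
    page_a = set("<html> <h1> Title </h1> <p> Author </p> Dune Herbert".split())
    page_b = set("<html> <h1> Title </h1> <p> Author </p> Emma Austen".split())
    page_c = set("<table> <tr> <td> price stock </td> </tr> </table>".split())

    sig_a, sig_b, sig_c = (minhash_signature(p) for p in (page_a, page_b, page_c))
    for name, other, sig in (("b", page_b, sig_b), ("c", page_c, sig_c)):
        j_est = estimate_jaccard(sig_a, sig)
        print(name, jaccard(page_a, other), j_est, dice_from_jaccard(j_est))
```

Pages whose estimated similarity exceeds a chosen threshold could then be placed in the same cluster. Both the threshold and the clustering procedure are left open here, since the abstract does not specify them.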
