Research Article

Template Extraction from Heterogeneous Web Pages with Cosine Similarity

by Kulkarni A. H., Patil B. M.
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 87 - Number 3
Year of Publication: 2014
Authors: Kulkarni A. H., Patil B. M.
10.5120/15186-3546

Kulkarni A. H., Patil B. M. Template Extraction from Heterogeneous Web Pages with Cosine Similarity. International Journal of Computer Applications. 87, 3 (February 2014), 4-8. DOI=10.5120/15186-3546

@article{ 10.5120/15186-3546,
author = { Kulkarni A. H. and Patil B. M. },
title = { Template Extraction from Heterogeneous Web Pages with Cosine Similarity },
journal = { International Journal of Computer Applications },
issue_date = { February 2014 },
volume = { 87 },
number = { 3 },
month = { February },
year = { 2014 },
issn = { 0975-8887 },
pages = { 4-8 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume87/number3/15186-3546/ },
doi = { 10.5120/15186-3546 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:04:57.316157+05:30
%A Kulkarni A. H.
%A Patil B. M.
%T Template Extraction from Heterogeneous Web Pages with Cosine Similarity
%J International Journal of Computer Applications
%@ 0975-8887
%V 87
%N 3
%P 4-8
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Template detection across large collections of web pages has recently received considerable attention. Detecting templates improves the performance of clustering, classification, and search engines. We propose a novel template extraction algorithm based on cosine similarity. Cosine similarity is used to cluster the web documents, and the underlying structure of the documents is then used to find the template for each cluster. Our experimental evaluation shows that the approach is effective in terms of computing time and clustering cost.
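
The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes each page has already been reduced to a list of tag paths (for example "html/body/div/p"). The greedy single-pass clustering, the 0.8 threshold, and the intersection-based template are all illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_pages(pages: Dict[str, List[str]], threshold: float = 0.8) -> List[List[str]]:
    """Assumed strategy: put each page in the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []  # pairs of (representative vector, member page names)
    for name, paths in pages.items():
        vec = Counter(paths)
        for rep, members in clusters:
            if cosine_similarity(vec, rep) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]


def cluster_template(pages: Dict[str, List[str]], members: List[str]) -> Set[str]:
    """Assumed template definition: tag paths present in every page of the cluster."""
    common = set(pages[members[0]])
    for name in members[1:]:
        common &= set(pages[name])
    return common


# Hypothetical input: three pages described by their tag paths.
pages = {
    "a": ["html/body/h1", "html/body/div/p", "html/body/div/p"],
    "b": ["html/body/h1", "html/body/div/p"],
    "c": ["html/body/table/tr", "html/body/table/tr/td"],
}
clusters = cluster_pages(pages)
templates = [cluster_template(pages, members) for members in clusters]
```

Here pages "a" and "b" share the same paths with different frequencies, so they fall into one cluster, while "c" shares no path with them and forms its own. A real system would extract the paths from the DOM and would likely use a more robust clustering scheme than a single greedy pass.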

Index Terms

Computer Science
Information Sciences

Keywords

Template Extraction, TEXT_MDL, TEXT_MAX, Cosine similarity, Clustering