Template Extraction from Heterogeneous Web Pages with Cosine Similarity

Kulkarni A. H.; Patil B. M.

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 87 - Number 3

Year of Publication: 2014

Authors: Kulkarni A. H., Patil B. M.

DOI: 10.5120/15186-3546

Kulkarni A. H., Patil B. M. Template Extraction from Heterogeneous Web Pages with Cosine Similarity. International Journal of Computer Applications. 87, 3 (February 2014), 4-8. DOI=10.5120/15186-3546

@article{ 10.5120/15186-3546,
  author = { Kulkarni A. H., Patil B. M. },
  title = { Template Extraction from Heterogeneous Web Pages with Cosine Similarity },
  journal = { International Journal of Computer Applications },
  issue_date = { February 2014 },
  volume = { 87 },
  number = { 3 },
  month = { February },
  year = { 2014 },
  issn = { 0975-8887 },
  pages = { 4-8 },
  numpages = {9},
  url = { https://ijcaonline.org/archives/volume87/number3/15186-3546/ },
  doi = { 10.5120/15186-3546 },
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T22:04:57.316157+05:30
%A Kulkarni A. H.
%A Patil B. M.
%T Template Extraction from Heterogeneous Web Pages with Cosine Similarity
%J International Journal of Computer Applications
%@ 0975-8887
%V 87
%N 3
%P 4-8
%D 2014
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Template detection from large collections of web pages has recently received considerable attention. Template detection improves the performance of clustering, classification, and search engines. In this work we propose a novel template extraction algorithm based on cosine similarity. We use cosine similarity to cluster the web documents, and then use the underlying structure of the documents in each cluster to find that cluster's template. Our experimental evaluation shows that the approach is effective in terms of computing time and clustering cost.
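The abstract does not spell out the algorithm, so the following Python sketch only illustrates the general idea of clustering pages by cosine similarity and then deriving one template per cluster. It is not the authors' method. Everything in it is an assumption made for illustration: pages are represented as counts of root-to-node tag paths, clustering is a greedy single pass against each cluster's first page, the template is taken to be the set of paths shared by every page in a cluster, and the names `path_vector`, `cluster`, `template`, and the threshold of 0.8 are hypothetical.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Counts root-to-node tag paths such as 'html/body/div' (illustrative only)."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates unclosed children.
            while self.stack.pop() != tag:
                pass


def path_vector(html):
    """Represent a page as a bag of tag paths (an assumed representation)."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.paths


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold=0.8):
    """Greedy single pass: a page joins the first cluster whose first page
    is at least `threshold` similar, otherwise it starts a new cluster."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def template(vectors, members):
    """Assumed template: the tag paths present in every page of the cluster."""
    common = set(vectors[members[0]])
    for i in members[1:]:
        common &= set(vectors[i])
    return common


if __name__ == "__main__":
    pages = [
        "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
        "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>",
        "<html><body><table><tr><td>1</td></tr></table></body></html>",
    ]
    vectors = [path_vector(p) for p in pages]
    for members in cluster(vectors):
        print(members, sorted(template(vectors, members)))
```

Comparing each page only against the first page of a cluster keeps the sketch short; a centroid-based or hierarchical scheme would be an equally reasonable choice, and the abstract does not say which the paper uses.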

Index Terms

Computer Science

Information Sciences

Keywords

Template Extraction, TEXT_MDL, TEXT_MAX, Cosine similarity clustering