Web Page Structure Enhanced Feature Selection for Classification of Web Pages

B. Leela Devi; A. Sankar

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 69 - Number 2

Year of Publication: 2013

Authors: B. Leela Devi, A. Sankar

10.5120/11818-7494

B. Leela Devi, A. Sankar. Web Page Structure Enhanced Feature Selection for Classification of Web Pages. International Journal of Computer Applications. 69, 2 (May 2013), 41-47. DOI=10.5120/11818-7494

@article{ 10.5120/11818-7494,
  author = { B. Leela Devi and A. Sankar },
  title = { Web Page Structure Enhanced Feature Selection for Classification of Web Pages },
  journal = { International Journal of Computer Applications },
  issue_date = { May 2013 },
  volume = { 69 },
  number = { 2 },
  month = { May },
  year = { 2013 },
  issn = { 0975-8887 },
  pages = { 41-47 },
  numpages = {9},
  url = { https://ijcaonline.org/archives/volume69/number2/11818-7494/ },
  doi = { 10.5120/11818-7494 },
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T21:29:42.333230+05:30
%A B. Leela Devi
%A A. Sankar
%T Web Page Structure Enhanced Feature Selection for Classification of Web Pages
%J International Journal of Computer Applications
%@ 0975-8887
%V 69
%N 2
%P 41-47
%D 2013
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Web page classification is typically performed with text classification techniques, but it differs from traditional text classification because the structure of a web page carries additional information about the importance of its content. HTML tags determine the visual presentation of a page and can therefore be treated as a parameter that highlights content importance. Information retrieval systems rely on textual keywords to index and retrieve documents. Keyword-based retrieval returns inaccurate or incomplete results when documents and queries use different keywords to describe the same concept. Concept-based retrieval has tried to address this by using manual thesauri with term co-occurrence data, or by extracting latent word relationships and concepts from a corpus. Semantic search has been a motivation for the Semantic Web since its inception, for both classification and retrieval. This paper proposes a model that exploits semantic-based feature selection to improve the search and retrieval of web pages over large document repositories. The selected features are classified using a Support Vector Machine (SVM) with different kernels. Experimental results show that the proposed method improves precision and recall relative to keyword-based search.
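The idea of letting HTML tags signal content importance can be illustrated with a minimal sketch. This is not the authors' implementation: the tag weights in `TAG_WEIGHTS`, the helper names `weighted_tf` and `tfidf_vectors`, and the smoothed IDF formula are all assumptions chosen for illustration. The sketch scales each term's count by the weight of its enclosing tag and then applies inverse document frequency.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights: illustrative values only, not taken from the paper.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}
DEFAULT_WEIGHT = 1.0


class WeightedTermCounter(HTMLParser):
    """Accumulate term counts, scaling each term by its enclosing tag's weight."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the highest weight among all enclosing tags (never below the default).
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.counts[term] += weight


def weighted_tf(html):
    """Return tag-weighted term frequencies for one HTML page."""
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def tfidf_vectors(pages):
    """Return one sparse {term: weight} vector per page."""
    tfs = [weighted_tf(page) for page in pages]
    n_docs = len(tfs)
    doc_freq = Counter(term for tf in tfs for term in tf)
    # Assumed IDF variant: log(N / df) + 1, so terms shared by all pages keep a nonzero weight.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in tf.items()} for tf in tfs]


pages = [
    "<html><title>Python tutorial</title><body><p>Learn python basics</p></body></html>",
    "<html><title>Football news</title><body><p>Latest football scores and python</p></body></html>",
]
vectors = tfidf_vectors(pages)
```

In the first page, "python" appears once in the title and once in a paragraph, so its weighted count is higher than it would be under plain term counting. Vectors like these could then be passed to an SVM, for example scikit-learn's `SVC` with its `kernel` parameter set to `"linear"`, `"poly"`, or `"rbf"`, to compare kernels as the abstract describes.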

Index Terms

Computer Science

Information Sciences

Keywords

Web Mining, Feature Extraction, Inverse Document Frequency, HTML Tag, Support Vector Machines