Approach for Dimensionality Reduction in Web Page Classification

Shraddha Sarode; Jayant Gadge

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 99 - Number 14

Year of Publication: 2014

DOI: 10.5120/17443-8245

Shraddha Sarode, Jayant Gadge. Approach for Dimensionality Reduction in Web Page Classification. International Journal of Computer Applications. 99, 14 (August 2014), 32-37. DOI=10.5120/17443-8245

@article{10.5120/17443-8245,
  author = {Shraddha Sarode and Jayant Gadge},
  title = {Approach for Dimensionality Reduction in Web Page Classification},
  journal = {International Journal of Computer Applications},
  issue_date = {August 2014},
  volume = {99},
  number = {14},
  month = {August},
  year = {2014},
  issn = {0975-8887},
  pages = {32-37},
  numpages = {6},
  url = {https://ijcaonline.org/archives/volume99/number14/17443-8245/},
  doi = {10.5120/17443-8245},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T22:28:13.097782+05:30
%A Shraddha Sarode
%A Jayant Gadge
%T Approach for Dimensionality Reduction in Web Page Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 99
%N 14
%P 32-37
%D 2014
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Dimensionality refers to the number of terms in a web page. High dimensionality causes problems when classifying web pages, and the main objective of reducing it is to improve the performance of the classifier. Processing time and accuracy are the two parameters that determine a classifier's performance. To reduce processing time, less informative and redundant terms have to be removed from web pages. This research describes a hybrid approach to dimensionality reduction in web page classification that combines a rough set method with a naïve Bayesian method. Dimensionality is reduced in two stages: information gain is used for feature selection, and the rough set based Quick Reduct algorithm is used for dimensionality reduction. The words remaining after both stages are given to the naïve Bayesian classifier, which assigns each web page to the predefined category with the maximum posterior probability.
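
The three stages named in the abstract (information gain, then Quick Reduct, then naïve Bayes) can be sketched in Python on a toy corpus. This is a minimal illustration, not the authors' implementation. The four example pages, all function names, the binary term-presence representation, the rule of keeping every term with positive information gain, and the Bernoulli form of naïve Bayes with add-one smoothing are assumptions made for the sketch.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Drop in class entropy from knowing whether `term` occurs in a page."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder


def dependency(docs, labels, terms):
    """Rough-set dependency degree: the fraction of pages whose
    presence/absence pattern over `terms` occurs with a single class."""
    patterns = [tuple(t in d for t in terms) for d in docs]
    classes_by_pattern = defaultdict(set)
    for pattern, y in zip(patterns, labels):
        classes_by_pattern[pattern].add(y)
    return sum(len(classes_by_pattern[p]) == 1 for p in patterns) / len(docs)


def quick_reduct(docs, labels, terms):
    """Greedy Quick Reduct: repeatedly add the term that raises the dependency
    degree most, until the subset matches the dependency of the full term set."""
    target = dependency(docs, labels, terms)
    reduct = []
    while dependency(docs, labels, reduct) < target:
        best = max((t for t in terms if t not in reduct),
                   key=lambda t: dependency(docs, labels, reduct + [t]))
        reduct.append(best)
    return reduct


def train_naive_bayes(docs, labels, terms):
    """Class priors and add-one smoothed P(term present | class)."""
    class_sizes = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_sizes.items()}
    likelihood = {
        c: {t: (sum(t in d for d, y in zip(docs, labels) if y == c) + 1)
               / (class_sizes[c] + 2)
            for t in terms}
        for c in class_sizes
    }
    return priors, likelihood


def classify(doc, priors, likelihood):
    """Return the class with the maximum posterior, computed in log space."""
    def log_posterior(c):
        score = math.log(priors[c])
        for term, p in likelihood[c].items():
            score += math.log(p if term in doc else 1 - p)
        return score
    return max(priors, key=log_posterior)


# Hypothetical toy corpus: each page is a list of already-tokenized terms.
docs = [
    ["football", "score", "team", "news"],
    ["team", "match", "score", "news"],
    ["stock", "market", "shares", "news"],
    ["market", "price", "shares", "news"],
]
labels = ["sports", "sports", "finance", "finance"]
vocabulary = sorted({t for d in docs for t in d})

# Stage 1: feature selection by information gain.
selected = [t for t in vocabulary if information_gain(docs, labels, t) > 0]
# Stage 2: dimensionality reduction with Quick Reduct.
reduct = quick_reduct(docs, labels, selected)
# Stage 3: naive Bayes classification on the remaining terms.
priors, likelihood = train_naive_bayes(docs, labels, reduct)

print(len(vocabulary), len(selected), len(reduct))
print(classify(["market", "shares", "today"], priors, likelihood))
```

On this toy corpus the term "news" occurs in every page, so its information gain is zero and it is dropped in the first stage. A single remaining term already separates the two classes, so the reduct is much smaller than the vocabulary.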

Index Terms

Computer Science

Information Sciences

Keywords

Dimensionality Reduction, Feature Selection, Information Gain, Naïve Bayes, Rough Set, Web Page Classification