Research Article

Approach for Dimensionality Reduction in Web Page Classification

by Shraddha Sarode, Jayant Gadge
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 99 - Number 14
Year of Publication: 2014
Authors: Shraddha Sarode, Jayant Gadge
10.5120/17443-8245

Shraddha Sarode, Jayant Gadge. Approach for Dimensionality Reduction in Web Page Classification. International Journal of Computer Applications. 99, 14 (August 2014), 32-37. DOI=10.5120/17443-8245

@article{ 10.5120/17443-8245,
author = { Shraddha Sarode and Jayant Gadge },
title = { Approach for Dimensionality Reduction in Web Page Classification },
journal = { International Journal of Computer Applications },
issue_date = { August 2014 },
volume = { 99 },
number = { 14 },
month = { August },
year = { 2014 },
issn = { 0975-8887 },
pages = { 32-37 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume99/number14/17443-8245/ },
doi = { 10.5120/17443-8245 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:28:13.097782+05:30
%A Shraddha Sarode
%A Jayant Gadge
%T Approach for Dimensionality Reduction in Web Page Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 99
%N 14
%P 32-37
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Dimensionality refers to the number of terms in a web page. High dimensionality causes problems when classifying web pages. The main objective of reducing the dimensionality of web pages is to improve the performance of the classifier. Processing time and accuracy are the two parameters that influence the performance of a classifier. To reduce processing time, less informative and redundant terms have to be removed from web pages. This research describes a hybrid approach for dimensionality reduction in web page classification using a rough set and naïve Bayesian method. Dimensionality is reduced in two stages: information gain is used for feature selection, and the rough set based Quick Reduct algorithm is used for dimensionality reduction. The words remaining after feature selection and dimensionality reduction are given to a naïve Bayesian classifier, which assigns each web page to the predefined category with the maximum posterior probability.
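The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy pages, the threshold `ig_threshold`, the treatment of each term as a binary present/absent attribute, and the Bernoulli-style naïve Bayes with add-one smoothing are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Information gain of the binary attribute 'term occurs in the page'."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = sum(len(p) / n * entropy(p) for p in (present, absent) if p)
    return entropy(labels) - conditional

def dependency(docs, labels, terms):
    """Rough-set dependency degree: share of pages whose equivalence class
    (pages with identical presence patterns over 'terms') has a single label."""
    classes = defaultdict(list)
    for d, y in zip(docs, labels):
        classes[tuple(t in d for t in terms)].append(y)
    positive = sum(len(ys) for ys in classes.values() if len(set(ys)) == 1)
    return positive / len(labels)

def quick_reduct(docs, labels, terms):
    """Greedy Quick Reduct: keep adding the term that raises the dependency
    degree most until it equals the dependency of the full term set."""
    target = dependency(docs, labels, terms)
    reduct = []
    while dependency(docs, labels, reduct) < target:
        best = max((t for t in terms if t not in reduct),
                   key=lambda t: dependency(docs, labels, reduct + [t]))
        reduct.append(best)
    return reduct

def train_nb(docs, labels, vocab):
    """Count pages per class and, per class, pages containing each term."""
    prior = Counter(labels)
    present = {c: Counter() for c in prior}
    for d, y in zip(docs, labels):
        present[y].update(t for t in vocab if t in d)
    return prior, present

def classify(doc, prior, present, vocab):
    """Return the class with the maximum (log) posterior probability."""
    n = sum(prior.values())
    def log_posterior(c):
        score = math.log(prior[c] / n)
        for t in vocab:
            p = (present[c][t] + 1) / (prior[c] + 2)  # add-one smoothing (assumed)
            score += math.log(p if t in doc else 1 - p)
        return score
    return max(prior, key=log_posterior)

# Toy, made-up pages (already tokenized) and their categories.
docs = [
    ["football", "match", "score", "news"],
    ["football", "league", "news"],
    ["stock", "market", "news"],
    ["market", "shares", "price", "news"],
]
labels = ["sports", "sports", "finance", "finance"]

ig_threshold = 0.1  # hypothetical cut-off for feature selection
vocabulary = sorted({t for d in docs for t in d})
selected = [t for t in vocabulary if information_gain(docs, labels, t) > ig_threshold]
reduct = quick_reduct(docs, labels, selected)
prior, present = train_nb(docs, labels, reduct)
predicted = classify(["football", "score", "today"], prior, present, reduct)
```

In this toy run, the term "news" appears in every page, so its information gain is zero and feature selection discards it. Quick Reduct then shrinks the surviving terms to a subset that separates the two categories as well as the full set does, and only that subset reaches the classifier.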

Index Terms

Computer Science
Information Sciences

Keywords

Dimensionality Reduction, Feature Selection, Information Gain, Naïve Bayes, Rough Set, Web Page Classification