Research Article

Empirical Study on Filter based Feature Selection Methods for Text Classification

by Subhajit Dey Sarkar, Saptarsi Goswami
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 81 - Number 6
Year of Publication: 2013
Authors: Subhajit Dey Sarkar, Saptarsi Goswami
10.5120/14018-2173

Subhajit Dey Sarkar, Saptarsi Goswami. Empirical Study on Filter based Feature Selection Methods for Text Classification. International Journal of Computer Applications. 81, 6 (November 2013), 38-43. DOI=10.5120/14018-2173

@article{ 10.5120/14018-2173,
author = { Subhajit Dey Sarkar and Saptarsi Goswami },
title = { Empirical Study on Filter based Feature Selection Methods for Text Classification },
journal = { International Journal of Computer Applications },
issue_date = { November 2013 },
volume = { 81 },
number = { 6 },
month = { November },
year = { 2013 },
issn = { 0975-8887 },
pages = { 38-43 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume81/number6/14018-2173/ },
doi = { 10.5120/14018-2173 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:55:23.713391+05:30
%A Subhajit Dey Sarkar
%A Saptarsi Goswami
%T Empirical Study on Filter based Feature Selection Methods for Text Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 81
%N 6
%P 38-43
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Text classification has become much more relevant with the increased volume of unstructured data from various sources. Several techniques have been developed for text classification. High dimensionality of the feature space is one of the established problems in text classification, and feature selection is one of the techniques used to reduce dimensionality. Feature selection helps improve classifier performance, reduce overfitting, speed up model construction and testing, and make models more interpretable. This paper presents an empirical study comparing the performance of a few feature selection techniques (Chi-squared, Information Gain, Mutual Information and Symmetrical Uncertainty) employed with different classifiers: Naive Bayes, SVM, decision tree and k-NN. The motivation of the paper is to present results of feature selection methods on various classifiers on text datasets. The study further allows comparison of the relative performance of the classifiers and the methods.
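
As a rough illustration of what a filter method does, the sketch below scores each term of a tiny corpus against the class label, independently of any classifier, using textbook definitions of Chi-squared, Information Gain and Symmetrical Uncertainty over binary term presence. It is not the authors' implementation: the toy corpus, the function names and the binary-presence representation are all assumptions made for this example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def information_gain(feature, labels):
    """IG = H(Y) - H(Y | X) for a discrete feature X and class label Y."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def symmetrical_uncertainty(feature, labels):
    """SU = 2 * IG / (H(X) + H(Y)); returns 0 when both entropies are 0."""
    denom = entropy(feature) + entropy(labels)
    return 0.0 if denom == 0 else 2.0 * information_gain(feature, labels) / denom


def chi_squared(feature, labels):
    """Pearson chi-squared statistic of the feature/class contingency table."""
    total = len(labels)
    observed = Counter(zip(feature, labels))
    feature_counts, label_counts = Counter(feature), Counter(labels)
    score = 0.0
    for x, fx in feature_counts.items():
        for y, fy in label_counts.items():
            expected = fx * fy / total
            score += (observed[(x, y)] - expected) ** 2 / expected
    return score


# Hypothetical toy corpus (assumption for illustration): (text, class label).
docs = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch meeting now", "ham"),
]
labels = [label for _, label in docs]
vocabulary = sorted({word for text, _ in docs for word in text.split()})


def presence(term):
    """Binary bag-of-words feature: 1 if the term occurs in the document."""
    return [int(term in text.split()) for text, _ in docs]


# Rank terms by chi-squared score; a filter method would keep the top k.
scores = {term: chi_squared(presence(term), labels) for term in vocabulary}
ranked = sorted(vocabulary, key=lambda term: scores[term], reverse=True)
```

In this sketch a term such as "cheap", which appears only in one class, gets a high score under all three measures, while "now", which is spread evenly across classes, scores zero. A filter method keeps the top-ranked terms and passes the reduced feature set to whichever classifier is being trained.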

Index Terms

Computer Science
Information Sciences

Keywords

Feature Selection, Filter Method, High Dimensionality, Text Classification, Text Categorization