Research Article

Feature Selection and Reduction for Persian Text Classification

by Zahra Robati, Morteza Zahedi, Najmeh Fayazi Far
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 109 - Number 17
Year of Publication: 2015
Authors: Zahra Robati, Morteza Zahedi, Najmeh Fayazi Far
10.5120/19414-9005

Zahra Robati, Morteza Zahedi, Najmeh Fayazi Far. Feature Selection and Reduction for Persian Text Classification. International Journal of Computer Applications. 109, 17 (January 2015), 1-5. DOI=10.5120/19414-9005

@article{ 10.5120/19414-9005,
author = { Zahra Robati and Morteza Zahedi and Najmeh Fayazi Far },
title = { Feature Selection and Reduction for Persian Text Classification },
journal = { International Journal of Computer Applications },
issue_date = { January 2015 },
volume = { 109 },
number = { 17 },
month = { January },
year = { 2015 },
issn = { 0975-8887 },
pages = { 1-5 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume109/number17/19414-9005/ },
doi = { 10.5120/19414-9005 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Zahra Robati
%A Morteza Zahedi
%A Najmeh Fayazi Far
%T Feature Selection and Reduction for Persian Text Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 109
%N 17
%P 1-5
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

With the rapid growth of the World Wide Web and the increasing availability of electronic documents, automatic text classification has become a general and important machine learning problem in the text mining domain. In text classification, feature selection is used to reduce the size of the feature vector and to improve classifier performance. This paper improves Dominance, a feature selection criterion, and proposes Extended Dominance (E-Dominance) as a new criterion. E-Dominance compares favorably with common feature selection methods based on document frequency (DF), information gain (IG), Entropy, χ², and Dominance on a collection of XML documents from Hamshahri2, which is commonly used in Persian text classification. The comparative study confirms the effectiveness of the proposed feature selection criterion derived from Dominance.
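
To make the idea of scoring and selecting features concrete, the following is a minimal Python sketch of two of the baseline criteria named in the abstract, document frequency (DF) and χ². It is not the paper's method: the Dominance and E-Dominance criteria are not defined in this abstract, so they are not shown. The toy corpus, the function names, and the `select_top_k` helper are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: (tokens, category) pairs, invented for this sketch.
docs = [
    (["football", "match", "goal"], "sport"),
    (["football", "team", "win"], "sport"),
    (["market", "stock", "price"], "economy"),
    (["market", "bank", "win"], "economy"),
]

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df

def chi_square(docs, term, category):
    """Chi-square statistic for a term/category pair (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document outside category
        elif in_cat:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document outside category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

def select_top_k(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _score in ranked[:k]]

df = document_frequency(docs)
chi = {term: chi_square(docs, term, "sport") for term in df}
selected = select_top_k(chi, 2)
```

In this toy setting, a term such as "win" appears equally often inside and outside the "sport" category and therefore receives a χ² score of zero, while terms confined to one category score highest. Any criterion that assigns a score per term could be plugged into `select_top_k` in the same way.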

Index Terms

Computer Science
Information Sciences

Keywords

Text Classification, E-Dominance, feature selection criterion