A Novel Text Categorization Approach based on K-means and Support Vector Machine

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2015
Authors:
Rajesh Malviya, Pranita Jain
DOI: 10.5120/ijca2015907164

Rajesh Malviya and Pranita Jain. Article: A Novel Text Categorization Approach based on K-means and Support Vector Machine. International Journal of Computer Applications 130(14):1-7, November 2015. Published by Foundation of Computer Science (FCS), NY, USA. BibTeX

@article{key:article,
	author = {Rajesh Malviya and Pranita Jain},
	title = {Article: A Novel Text Categorization Approach based on K-means and Support Vector Machine},
	journal = {International Journal of Computer Applications},
	year = {2015},
	volume = {130},
	number = {14},
	pages = {1-7},
	month = {November},
	note = {Published by Foundation of Computer Science (FCS), NY, USA}
}

Abstract

With the continuous expansion of digital libraries and online news, a huge number of text documents now exists on the web, and these documents need to be organized. Text categorization is an active research field that can be used to organize text documents. It is the process of assigning documents to predefined categories according to their content.

The CAWP algorithm was designed for text categorization, but it does not give the best results on large datasets. To improve the results, an approach that combines K-means clustering with a Support Vector Machine (SVM) is used. K-means groups the data into a number of clusters, and the samples in each cluster are then used as training samples for an SVM, so that new samples can be classified efficiently. In an experiment on the 20 Newsgroups dataset, K-means with SVM gives better results than the CAWP algorithm in terms of F-measure.
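
The cluster-then-classify idea described above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the authors' implementation: toy 2-D vectors stand in for document term vectors, labels are binary (+1/-1) rather than the 20 newsgroup classes, the linear SVM is a hand-written subgradient-descent trainer, and centering each cluster's samples on its centroid is a choice made here for numerical convenience. All function names (`kmeans`, `train_svm`, `fit`, `predict`) are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda i: sqdist(p, centroids[i]))

def kmeans(points, k, iters=20):
    """Lloyd's algorithm with deterministic farthest-first initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sqdist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members (keep it if empty).
        centroids = [[sum(col) / len(g) for col in zip(*g)] if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

def train_svm(xs, ys, lam=0.01, lr=0.1, steps=500):
    """Linear SVM via batch subgradient descent on the L2-regularised hinge loss."""
    w, b, n = [0.0] * len(xs[0]), 0.0, len(xs)
    for _ in range(steps):
        gw, gb = [lam * wj for wj in w], 0.0
        for x, y in zip(xs, ys):
            if y * (dot(w, x) + b) < 1:  # sample violates the margin
                gw = [g - y * xj / n for g, xj in zip(gw, x)]
                gb -= y / n
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def fit(points, labels, k):
    """Cluster the training set, then train one local SVM per cluster."""
    centroids = kmeans(points, k)
    models = []
    for i, c in enumerate(centroids):
        members = [(p, y) for p, y in zip(points, labels)
                   if nearest(p, centroids) == i]
        ys = [y for _, y in members]
        if len(set(ys)) < 2:
            # Single-class (or empty) cluster: fall back to a constant label.
            models.append(ys[0] if ys else 1)
        else:
            xs = [[a - cj for a, cj in zip(p, c)] for p, _ in members]
            models.append(train_svm(xs, ys))
    return centroids, models

def predict(p, centroids, models):
    """Route p to its nearest cluster and apply that cluster's model."""
    i = nearest(p, centroids)
    model = models[i]
    if not isinstance(model, tuple):
        return model
    w, b = model
    x = [a - cj for a, cj in zip(p, centroids[i])]
    return 1 if dot(w, x) + b >= 0 else -1

# Toy data: two well-separated groups with different local decision rules.
points = [[-2.0, 0.5], [-1.0, -0.5], [1.0, 0.5], [2.0, -0.5],
          [19.5, 18.0], [20.5, 19.0], [19.5, 21.0], [20.5, 22.0]]
labels = [-1, -1, 1, 1, 1, 1, -1, -1]
centroids, models = fit(points, labels, k=2)
print([predict(p, centroids, models)
       for p in ([3.0, 0.0], [-3.0, 0.0], [20.0, 17.0], [20.0, 23.0])])
```

In the toy data the first group is separated along the first coordinate and the second group along the second coordinate, so each cluster needs its own decision boundary. That is the situation in which training a separate SVM per cluster can help. A practical version would more likely rely on an established library for TF-IDF features, K-means and multi-class SVMs.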

Keywords

Classification, K-means, SVM, Document Categorization, Text mining