Research Article

Representation and Classification of Text Documents: A Brief Review

Published in 2010 by B S Harish, D S Guru, S Manjunath
Recent Trends in Image Processing and Pattern Recognition
Foundation of Computer Science USA
RTIPPR - Number 2

B S Harish, D S Guru, S Manjunath. Representation and Classification of Text Documents: A Brief Review. Recent Trends in Image Processing and Pattern Recognition. RTIPPR, 2 (2010), 110-119.

@article{
author = { B S Harish, D S Guru, S Manjunath },
title = { Representation and Classification of Text Documents: A Brief Review },
journal = { Recent Trends in Image Processing and Pattern Recognition },
issue_date = { 2010 },
volume = { RTIPPR },
number = { 2 },
year = { 2010 },
issn = { 0975-8887 },
pages = { 110-119 },
numpages = 10,
url = { /specialissues/rtippr/number2/984-107/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Special Issue Article
%1 Recent Trends in Image Processing and Pattern Recognition
%A B S Harish
%A D S Guru
%A S Manjunath
%T Representation and Classification of Text Documents: A Brief Review
%J Recent Trends in Image Processing and Pattern Recognition
%@ 0975-8887
%V RTIPPR
%N 2
%P 110-119
%D 2010
%I International Journal of Computer Applications
Abstract

Text classification is an important research problem in text mining, in which documents are assigned to predefined categories using supervised knowledge. The literature offers many text representation schemes and classifiers (learning algorithms) for this task. In this paper, we present various text representation schemes and compare different classifiers used to assign text documents to predefined classes. The existing methods are compared and contrasted on qualitative parameters, namely the criteria used for classification, the algorithms adopted, and the classification time complexities.

Index Terms

Computer Science
Information Sciences

Keywords

Text classification, Documents, Text Representation, Classifiers