
Preprocessing Techniques in Text Categorization

IJCA Proceedings on National Conference on Innovative Paradigms in Engineering & Technology 2013
© 2013 by IJCA Journal
NCIPET2013 - Number 3
Year of Publication: 2013
Pritam C. Gaigole
L. H. Patil
P. M Chaudhari

Pritam C Gaigole, L H Patil and P M Chaudhari. Article: Preprocessing Techniques in Text Categorization. IJCA Proceedings on National Conference on Innovative Paradigms in Engineering & Technology 2013 NCIPET 2013(3):1-3, December 2013.


Bulk data is generated in the era of Information Technology. If it is not stored in a properly systematic manner, the generated data cannot be reused, because navigation becomes, if not impossible, certainly very difficult. The generated data must be analyzed to maximize its benefits for intelligent decision making. Text categorization is an important and extensively studied problem in machine learning. The basic phases in text categorization include preprocessing features, extracting relevant features against the features in a database, and finally categorizing a set of documents into predefined categories. Most research in text categorization focuses on the development of algorithms and computer techniques.
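
The three phases named in the abstract (preprocessing, feature extraction, and categorization into predefined categories) can be illustrated with a minimal sketch. This is not the authors' method: the tiny stop-word list, the particular TF-IDF variant, the 1-nearest-neighbour cosine classifier, the toy training documents, and all function names are assumptions chosen only to keep the example self-contained.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "to", "for", "on"}


def preprocess(text):
    """Phase 1: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf_vectors(docs_tokens):
    """Phase 2: build one sparse TF-IDF vector (a dict) per document."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    # One common IDF variant; other weighting schemes are equally plausible.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in docs_tokens
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def categorize(text, train_vectors, train_labels, idf):
    """Phase 3: return the label of the most similar training document."""
    # Terms never seen in training carry no weight and are ignored.
    counts = Counter(t for t in preprocess(text) if t in idf)
    query = {term: count * idf[term] for term, count in counts.items()}
    scores = [cosine(query, vec) for vec in train_vectors]
    # If every score is zero, this falls back to the first training label.
    best = max(range(len(scores)), key=scores.__getitem__)
    return train_labels[best]


# Toy labelled documents (invented for this sketch).
TRAIN = [
    ("The team won the football match", "sports"),
    ("The striker scored a goal in the match", "sports"),
    ("The processor executes program instructions", "computing"),
    ("The compiler translates the program to machine code", "computing"),
]
train_tokens = [preprocess(text) for text, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
train_vectors, idf = tf_idf_vectors(train_tokens)

print(categorize("A goal decided the football match", train_vectors, train_labels, idf))
```

A realistic system would typically add stemming and a larger stop-word list in the preprocessing phase, and would replace the nearest-neighbour step with a trained classifier; the sketch only shows how the phases connect.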

