Research Article

A Naive Clustering Algorithm for Text Mining

by Aishwarya Kappala, Sudhakar Godi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 127 - Number 17
Year of Publication: 2015
Authors: Aishwarya Kappala, Sudhakar Godi
10.5120/ijca2015906717

Aishwarya Kappala, Sudhakar Godi. A Naive Clustering Algorithm for Text Mining. International Journal of Computer Applications 127, 17 (October 2015), 20-24. DOI=10.5120/ijca2015906717

@article{ 10.5120/ijca2015906717,
author = { Aishwarya Kappala, Sudhakar Godi },
title = { A Naive Clustering Algorithm for Text Mining },
journal = { International Journal of Computer Applications },
issue_date = { October 2015 },
volume = { 127 },
number = { 17 },
month = { October },
year = { 2015 },
issn = { 0975-8887 },
pages = { 20-24 },
numpages = { 5 },
url = { https://ijcaonline.org/archives/volume127/number17/22821-2015906717/ },
doi = { 10.5120/ijca2015906717 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Aishwarya Kappala
%A Sudhakar Godi
%T A Naive Clustering Algorithm for Text Mining
%J International Journal of Computer Applications
%@ 0975-8887
%V 127
%N 17
%P 20-24
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Text classification assigns predefined categories to natural language text. In the common “bag-of-words” representation, each document is represented by a vector of word values indicating how frequently each word appears in that document. Large documents, however, pose problems for this representation because they contain much irrelevant or redundant information. This paper explores the effect of another type of value, one that expresses how a word is distributed within a document; such values are called distributional features. All features are computed with a tf-idf style equation and combined with machine learning techniques. Term frequency is one of the major factors for distributional features, as it underlies weighted itemsets. When the goal is to minimize a certain score function, discovering rare data correlations can be more interesting than mining frequent ones; this paper therefore also tackles the problem of discovering rare and weighted itemsets, i.e., the infrequent weighted itemset mining problem. The classifier that gives the most accurate result is selected for categorization. Experiments show that distributional features are useful for text categorization.
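The tf-idf style weighting the abstract refers to can be sketched as follows. This is a minimal illustration only, not the authors' implementation; the function name, the toy documents, and the particular tf and idf normalizations are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights per term per tokenized document.

    tf  = term count / document length
    idf = log(total docs / docs containing the term)
    """
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["text", "mining", "text"],
        ["text", "mining"],
        ["text", "clusters"]]
w = tfidf(docs)
# A term present in every document ("text") gets idf = log(1) = 0,
# so its weight is zero; rarer terms receive positive weights.
```

Distributional features as described in the paper would replace the plain term count here with statistics about where in the document the term occurs, while keeping the same tf-idf style combination.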

References
  1. R. Bekkerman, R. Elaine, N. Tishby, and Y. Winter, “Distributional Word Clusters versus Words for Text Categorization,” J. Machine Learning Research, vol. 3, pp. 1182-1208, 2003.
  2. G. Narasimha Rao, R. Ramesh, D. Rajesh, and D. Chandra Sekhar, “An Automated Advanced Clustering Algorithm for Text Classification,” International Journal of Computer Science and Technology, vol. 3, issue 2-4, June 2012, eISSN: 0976-8491, pISSN: 2229-4333.
  3. D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, “VIPS: A Vision-Based Page Segmentation Algorithm,” Technical Report MSR-TR-2003-79, Microsoft Research, 2003.
  4. J. P. Callan, “Passage Retrieval Evidence in Document Retrieval,” Proc. ACM SIGIR ’94, pp. 302-310, 1994.
  5. Rao, Gudikandhula Narasimha, and P. Jagdeeswar Rao. "A Clustering Analysis for Heart Failure Alert System Using RFID and GPS." ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India-Vol I. Springer International Publishing, 2014.
  6. M. Craven, D. DiPasquo, D. Freitag, A. K. McCallum, T. M. Mitchell, K. Nigam, and S. Slattery, “Learning to extract symbolic knowledge from the world wide web,” in Proceedings of the 15th National Conference for Artificial Intelligence, Madison, WI, 1998, pp. 509–516.
  7. F. Debole and F. Sebastiani, “Supervised term weighting for automated text categorization,” in Proceedings of the 18th ACM Symposium on Applied Computing, Melbourne, FL, 2003, pp. 784–788.
  8. T. G. Dietterich, “Machine learning research: Four current directions,” AI Magazine, vol. 18, no. 4, pp. 97–136, 1997.
  9. D. Lewis, “Reuters-21578 text categorization test collection, dist. 1.0,” 1997.
  10. Y. Yang, “An evaluation of statistical approaches to text categorization,” Inf. Retrieval, vol. 1, pp. 69–90, 1999.
  11. S. Shankar and G. Karypis, “A Feature Weight Adjustment Algorithm for Document Classification,” Proc. SIGKDD ’00 Workshop Text Mining, 2000.
  12. K. Sun and F. Bai, “Mining Weighted Association Rules Without Preassigned Weights,” IEEE Trans. Knowledge and Data Eng., vol. 20, no. 4, pp. 489-495, Apr. 2008.
  13. S. Zhu, X. Ji, W. Xu, and Y. Gong, “Multi-labelled classification using maximum entropy method,” in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2005, pp. 1041–1048
  14. J. P. Callan, “Passage Retrieval Evidence in Document Retrieval,” Proc. ACM SIGIR ’94, pp. 302-310, 1994.
  15. X. Ling, Q. Mei, C. Zhai, and B. Schatz, “Mining multi-faceted overviews of arbitrary topics in a text collection,” in Proc. 14th ACM SIGKDD Knowl. Discovery Data Mining, 2008, pp. 497–505.
  16. I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” in J. Mach. Learn. Res., vol. 3, no. 1, pp. 1157–1182, 2003.
  17. T. Joachims, “Transductive inference for text classification using support vector machines,” in Proc. Annu. Int. Conf. Mach. Learn., 1999, pp. 200–209.
  18. X.-L. Li, B. Liu, and S.-K. Ng, “Learning to classify documents with only a small positive training set,” in Proc. 18th Eur. Conf. Mach. Learn., 2007, pp. 201–213.
  19. Y. Li, A. Algarni, S.-T. Wu, and Y. Xue, “Mining negative relevance feedback for information filtering,” in Proc. Web Intell. Intell. Agent Technol., 2009, pp. 606–613.
  20. S.-T. Wu, Y. Li, and Y. Xu, “Deploying approaches for pattern refinement in text mining,” in Proc. IEEE Conf. Data Mining, 2006, pp. 1157–1161.
Index Terms

Computer Science
Information Sciences

Keywords

Text Classification, Text Mining, Machine Learning, Compactness, tf-idf, Weighted Database