An Efficient Text Clustering Framework

International Journal of Computer Applications
© 2013 by IJCA Journal
Volume 79 - Number 8
Year of Publication: 2013
Authors:
Francis M. Kwale
DOI: 10.5120/13763-1607

Francis M Kwale. Article: An Efficient Text Clustering Framework. International Journal of Computer Applications 79(8):30-38, October 2013. Full text available. BibTeX

@article{key:article,
	author = {Francis M. Kwale},
	title = {Article: An Efficient Text Clustering Framework},
	journal = {International Journal of Computer Applications},
	year = {2013},
	volume = {79},
	number = {8},
	pages = {30-38},
	month = {October},
	note = {Full text available}
}

Abstract

The amount of data available for analysis, such as web data, is increasing at a dramatic rate. It is therefore important to improve techniques for finding relevant information in large data collections efficiently. One such technique is text clustering, which groups text documents into clusters, for example clustering web search engine results into meaningful groups. Data mining is an area of computer science that can be defined as the extraction of useful information from large structured data. Text mining, on the other hand, is an extension of data mining that deals only with unstructured text data. Text clustering is thus a text mining technique. In this paper, we give an overview of text clustering, including related text mining areas, techniques, and application areas. We also propose a framework for text clustering based on the K-means algorithm. The paper thus gives guidance to text mining researchers on the state of the art in text clustering.
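
The proposed framework itself is not described on this page. As a rough illustration of the general idea of K-means text clustering mentioned in the abstract, a minimal sketch might look like the following. It assumes TF-IDF term weighting, cosine similarity, and hand-picked seed documents as initial centroids; these choices and all function names (`tfidf_vectors`, `cosine`, `mean_vector`, `kmeans`) are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn raw strings into unit-length sparse TF-IDF vectors (term -> weight)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF (an assumed weighting scheme, one of several in use).
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Sparse dot product; equals cosine similarity for unit-length vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vectors):
    """Sum the member vectors and rescale the result to unit length."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans(vectors, seed_indices, max_iter=20):
    """Cluster vectors; seed_indices picks the documents used as initial centroids."""
    k = len(seed_indices)
    centroids = [dict(vectors[i]) for i in seed_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # No document changed cluster: converged.
            break
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


docs = [
    "cheap flights to paris and hotel deals",
    "book cheap hotel and flights online",
    "python code for k means clustering",
    "clustering text documents with python code",
]
labels = kmeans(tfidf_vectors(docs), seed_indices=[0, 2])
print(labels)  # prints [0, 0, 1, 1]
```

The seeds are fixed here only to keep the example deterministic; practical K-means implementations usually choose initial centroids randomly or with a seeding heuristic, and the result can depend on that choice.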
