Research Article

Text Document Clustering based on Semantics

by B. Drakshayani, E V Prasad
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 45 - Number 4
Year of Publication: 2012
DOI: 10.5120/6766-9046

B. Drakshayani, E V Prasad. Text Document Clustering based on Semantics. International Journal of Computer Applications. 45, 4 (May 2012), 7-12. DOI=10.5120/6766-9046

@article{ 10.5120/6766-9046,
author = { B. Drakshayani and E V Prasad },
title = { Text Document Clustering based on Semantics },
journal = { International Journal of Computer Applications },
issue_date = { May 2012 },
volume = { 45 },
number = { 4 },
month = { May },
year = { 2012 },
issn = { 0975-8887 },
pages = { 7-12 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume45/number4/6766-9046/ },
doi = { 10.5120/6766-9046 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
Abstract

Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large sets of documents into a small number of meaningful clusters. Clustering is a powerful data mining technique for topic discovery from text documents. Partitional clustering algorithms, such as the K-means family, are reported to perform well on document clustering. They treat clustering as an optimization process that groups documents into k clusters so that a particular criterion function is minimized or maximized. The bag-of-words representation used for such clustering is often unsatisfactory because it ignores relationships between important terms that do not co-occur literally. To deal with this problem, we integrate core ontologies as background knowledge into the process of clustering text documents. The model combines phrase analysis and word analysis, using WordNet as background knowledge together with NLP, to explore better ways of representing documents for clustering. The semantic-based analysis assigns semantic weights to both document words and phrases. The new weights reflect the semantic relatedness between document terms and capture the semantic information in the documents to improve web document clustering. The method has been evaluated on different data sets with standard performance measures, and the results indicate that it produces meaningful clusters.
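
The bag-of-words limitation described in the abstract can be illustrated with a minimal sketch. It is not the authors' weighting scheme: the `CONCEPTS` table is a hypothetical, hand-written stand-in for WordNet, and the function names (`bag_of_words`, `semantic_bag`, `cosine`) are assumptions introduced only for this example. It shows that two documents using different words for the same concept have zero cosine similarity under plain term counts, and a nonzero similarity once words are mapped to shared concepts.

```python
import math
from collections import Counter

# Hypothetical stand-in for WordNet: maps a surface word to a shared concept label.
# A real system would look up synsets in WordNet instead of this toy table.
CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "doctor": "physician",
    "physician": "physician",
}

def bag_of_words(text):
    """Plain term-frequency vector over lowercase whitespace tokens."""
    return Counter(text.lower().split())

def semantic_bag(text):
    """Term-frequency vector after replacing each word by its concept, if it has one."""
    return Counter(CONCEPTS.get(word, word) for word in text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = "car repair"
doc2 = "automobile maintenance"

# No literal term overlap, so the plain representation sees the documents as unrelated.
plain_sim = cosine(bag_of_words(doc1), bag_of_words(doc2))

# Both documents contain the concept "vehicle" after mapping, so similarity is nonzero.
semantic_sim = cosine(semantic_bag(doc1), semantic_bag(doc2))

print(plain_sim, semantic_sim)
```

Here the plain vectors share no terms, so their similarity is zero, while the concept-mapped vectors share one of two terms each, giving a similarity of about 0.5. A clustering algorithm such as K-means that relies on this similarity would therefore be more likely to group the two documents together under the concept-based representation.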

Index Terms

Computer Science
Information Sciences

Keywords

Document Clustering, K-means, Semantic Weights, Semantic Similarity, POS Tagging, Ontologies, WordNet, NLP, Similarity Measure