Text Document Clustering based on Semantics

B. Drakshayani; E V Prasad

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 45 - Number 4

Year of Publication: 2012

Authors: B. Drakshayani, E V Prasad

DOI: 10.5120/6766-9046

B. Drakshayani, E V Prasad. Text Document Clustering based on Semantics. International Journal of Computer Applications. 45, 4 (May 2012), 7-12. DOI=10.5120/6766-9046

@article{ 10.5120/6766-9046,
  author = { B. Drakshayani and E V Prasad },
  title = { Text Document Clustering based on Semantics },
  journal = { International Journal of Computer Applications },
  issue_date = { May 2012 },
  volume = { 45 },
  number = { 4 },
  month = { May },
  year = { 2012 },
  issn = { 0975-8887 },
  pages = { 7-12 },
  numpages = {9},
  url = { https://ijcaonline.org/archives/volume45/number4/6766-9046/ },
  doi = { 10.5120/6766-9046 },
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T20:36:42.993583+05:30
%A B. Drakshayani
%A E V Prasad
%T Text Document Clustering based on Semantics
%J International Journal of Computer Applications
%@ 0975-8887
%V 45
%N 4
%P 7-12
%D 2012
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large sets of documents into a small number of meaningful clusters. Clustering is a powerful data mining technique for topic discovery from text documents. Partitional clustering algorithms, such as the K-means family, are reported to perform well on document clustering. They treat clustering as an optimization problem: documents are grouped into k clusters so that a particular criterion function is minimized or maximized. The bag-of-words representation used by these clustering methods is often unsatisfactory because it ignores relationships between important terms that do not co-occur literally. To deal with this problem, we integrate core ontologies as background knowledge into the process of clustering text documents. The model combines phrase analysis and word analysis, using WordNet as background knowledge together with NLP, to explore better document representations for clustering. The semantic-based analysis assigns semantic weights to both document words and phrases. The new weights reflect the semantic relatedness between the document terms and capture the semantic information in the documents, improving web document clustering. The method has been evaluated on different data sets with standard performance measures, and the results indicate that it produces meaningful clusters.
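The limitation of the bag-of-words representation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's weighting formula, so everything here is an assumption made for illustration: the `RELATED` table is a hypothetical stand-in for WordNet lookups, and `semantic_weights` simply spreads each term's count onto its related terms, scaled by a relatedness score. It is not the authors' method, only one simple way semantic weights could let documents match without sharing literal words.

```python
import math
from collections import Counter

# Hypothetical relatedness table standing in for WordNet lookups.
# The entries and scores are illustrative only, not real WordNet data.
RELATED = {
    "car": {"automobile": 1.0, "vehicle": 0.5},
    "automobile": {"car": 1.0, "vehicle": 0.5},
    "doctor": {"physician": 1.0},
    "physician": {"doctor": 1.0},
}


def bag_of_words(text):
    """Plain term-frequency vector: lowercase, split on whitespace."""
    return Counter(text.lower().split())


def semantic_weights(bow, related=RELATED):
    """Spread each term's count onto its related terms, scaled by relatedness.

    This weighting rule is an assumption for illustration, not the paper's formula.
    """
    weights = {term: float(count) for term, count in bow.items()}
    for term, count in bow.items():
        for other, score in related.get(term, {}).items():
            weights[other] = weights.get(other, 0.0) + score * count
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_a = bag_of_words("fast car")
doc_b = bag_of_words("red automobile")
doc_c = bag_of_words("good doctor")

# No literal term is shared, so plain bag-of-words sees no similarity.
plain_ab = cosine(doc_a, doc_b)

# After semantic weighting, "car" and "automobile" contribute to each other.
semantic_ab = cosine(semantic_weights(doc_a), semantic_weights(doc_b))

# Unrelated documents remain dissimilar under the same weighting.
semantic_ac = cosine(semantic_weights(doc_a), semantic_weights(doc_c))

print(plain_ab, semantic_ab, semantic_ac)
```

In this toy setting the two documents about cars become similar only after the weights are expanded, while the unrelated document stays at zero similarity. Vectors weighted in this way could then be passed to a partitional algorithm such as K-means in place of raw term counts.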

Index Terms

Computer Science

Information Sciences

Keywords

Document Clustering, K-means, Semantic Weights, Semantic Similarity, POS Tagging, Ontologies, WordNet, NLP, Similarity Measure