Research Article

Clustering of Blogs with Enhanced Semantics

by A. K. Singh, R. C. Joshi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 16 - Number 7
Year of Publication: 2011
Authors: A. K. Singh, R. C. Joshi
DOI: 10.5120/2026-2741

A. K. Singh, R. C. Joshi. Clustering of Blogs with Enhanced Semantics. International Journal of Computer Applications. 16, 7 (February 2011), 12-16. DOI=10.5120/2026-2741

@article{ 10.5120/2026-2741,
author = { A. K. Singh and R. C. Joshi },
title = { Clustering of Blogs with Enhanced Semantics },
journal = { International Journal of Computer Applications },
issue_date = { February 2011 },
volume = { 16 },
number = { 7 },
month = { February },
year = { 2011 },
issn = { 0975-8887 },
pages = { 12-16 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume16/number7/2026-2741/ },
doi = { 10.5120/2026-2741 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:04:15.233489+05:30
%A A. K. Singh
%A R. C. Joshi
%T Clustering of Blogs with Enhanced Semantics
%J International Journal of Computer Applications
%@ 0975-8887
%V 16
%N 7
%P 12-16
%D 2011
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Blogs are among the fastest-growing forms of user-generated content on the internet and are fast becoming a tool for information dissemination and communication. They provide a platform for information sharing, discussion, and the expression of readers' reactions to a blog post. Clustering blogs greatly simplifies blog searching and browsing by organizing them into groups of similar blogs. Blogs are generally organized using tags. In this paper, we study the effect of considering other relevant neighborhood contexts and adding the extracted information to the original tag set carried by the blog. The added semantics are extracted by disambiguating all the synsets for the important terms or key phrases within the blog. This work reports a study of measuring similarity on enhanced blog features and subsequently grouping all blog articles based on the semantics of the tags they carry. We propose adding the semantics extracted from the title, body, and comments of a blog post to its original tagset when clustering blog documents, and we evaluate the hypothesis that adding extracted semantics from these blog constituents improves cluster quality. The k-means algorithm is used for clustering. The experimental results confirm our hypothesis that adding the semantics yields better clusters. The approach first extracts the relevant features from the target blog corpus, title, and comments. The other senses represented by the relevant keywords are discovered using a general-purpose semantics extractor. All the synsets of the relevant keywords are extracted from WordNet. The extracted keyword senses are then appended to the base tagsets. A semantic similarity measure is used to compute the similarity among the documents, and clusters are obtained based on it. The two clustering outputs are then compared.
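
The pipeline outlined in the abstract (append keyword senses to the base tagset, measure similarity, then cluster with k-means) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several pieces are assumptions made only for illustration: the `TOY_SYNSETS` table is a hypothetical stand-in for a WordNet lookup, no sense disambiguation is performed, cosine similarity over term-count vectors replaces the semantic similarity measure (which the abstract does not specify), and the example blogs, function names, and deterministic centroid seeding are all invented.

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet synset lookup (toy data, not from the paper).
TOY_SYNSETS = {
    "car": ["auto", "automobile"],
    "automobile": ["auto", "car"],
    "film": ["movie", "picture"],
    "movie": ["film", "picture"],
}

def enhance_tagset(tags, keywords):
    """Append each keyword and its senses to the base tagset, skipping duplicates."""
    enhanced = list(tags)
    for word in keywords:
        for sense in [word] + TOY_SYNSETS.get(word, []):
            if sense not in enhanced:
                enhanced.append(sense)
    return enhanced

def to_unit_vector(terms, vocabulary):
    """Term-count vector over the vocabulary, scaled to unit length."""
    counts = Counter(terms)
    vec = [counts[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    """Cosine similarity of two unit-length vectors (assumed similarity measure)."""
    return sum(x * y for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """k-means with cosine similarity and deterministic farthest-first seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Seed with the vector least similar to the centroids chosen so far.
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        # Update step: each centroid becomes the normalized mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                norm = math.sqrt(sum(x * x for x in mean)) or 1.0
                centroids[c] = [x / norm for x in mean]
    return labels

def cluster_blogs(term_lists, k):
    """Vectorize the term lists over a shared vocabulary and cluster them."""
    vocabulary = sorted({t for terms in term_lists for t in terms})
    vectors = [to_unit_vector(terms, vocabulary) for terms in term_lists]
    return kmeans(vectors, k)

# Invented example: base tags share no terms, keywords come from title/body/comments.
blogs = [
    {"tags": ["roadtrip"], "keywords": ["car"]},
    {"tags": ["driving"], "keywords": ["automobile"]},
    {"tags": ["cinema"], "keywords": ["film"]},
    {"tags": ["oscars"], "keywords": ["movie"]},
]

enhanced = [enhance_tagset(b["tags"], b["keywords"]) for b in blogs]
print(cluster_blogs(enhanced, k=2))  # prints [0, 0, 1, 1]
```

In this toy data the base tags of the four posts have no terms in common, so tag-only vectors are mutually orthogonal. After the keyword senses are appended, the two car-related posts and the two film-related posts share terms and fall into the same clusters.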

Index Terms

Computer Science
Information Sciences

Keywords

tagset, VSM (vector space model), k-means clustering, blog clustering, semantic similarity