Research Article

Improving web page clustering using Probabilistic Latent Semantic Analysis

Published in April 2012 by Lalit A. Patil, S M. Kamalapur
Emerging Trends in Computer Science and Information Technology (ETCSIT2012)
Foundation of Computer Science USA
ETCSIT - Number 4

Lalit A. Patil, S M. Kamalapur. Improving web page clustering using Probabilistic Latent Semantic Analysis. Emerging Trends in Computer Science and Information Technology (ETCSIT2012). ETCSIT, 4 (April 2012), 1-4.

@article{
author = { Lalit A. Patil, S M. Kamalapur },
title = { Improving web page clustering using Probabilistic Latent Semantic Analysis },
journal = { Emerging Trends in Computer Science and Information Technology (ETCSIT2012) },
issue_date = { April 2012 },
volume = { ETCSIT },
number = { 4 },
month = { April },
year = { 2012 },
issn = { 0975-8887 },
pages = { 1-4 },
numpages = 4,
url = { /proceedings/etcsit/number4/5982-1025/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 Emerging Trends in Computer Science and Information Technology (ETCSIT2012)
%A Lalit A. Patil
%A S M. Kamalapur
%T Improving web page clustering using Probabilistic Latent Semantic Analysis
%J Emerging Trends in Computer Science and Information Technology (ETCSIT2012)
%@ 0975-8887
%V ETCSIT
%N 4
%P 1-4
%D 2012
%I International Journal of Computer Applications
Abstract

Traditional clustering algorithms are usually based on the bag-of-words (BOW) approach. A well-known disadvantage of the BOW model is that it ignores the semantic relationships among words. As a result, if two documents use different sets of core words to represent the same topic, they may be assigned to different clusters, even though those core words are probably synonyms or otherwise semantically associated. The same disadvantage affects conventional web page clustering techniques, which are often used to reveal the functional similarity of web pages. Tagging can improve clustering performance, and several efforts have been made to exploit social tagging for clustering. However, tag-based web page clustering has drawbacks of its own. To our knowledge, all existing approaches that exploit tag information for web page clustering assume that every web page is tagged, which is a somewhat restrictive assumption. In a more realistic setting, tags can be expected for only a small number of web pages. In this paper, we propose a new web page grouping approach based on the Probabilistic Latent Semantic Analysis (PLSA) model. An iterative algorithm based on the maximum likelihood principle is employed to overcome the aforementioned computational shortcoming.
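The abstract does not spell out the iterative algorithm, so the following is only a minimal sketch of the standard EM updates for PLSA, not the authors' implementation. It assumes the asymmetric parameterisation P(w|d) = sum over z of P(w|z) P(z|d), a small document-term count matrix given as a list of lists, and a tiny smoothing constant to avoid division by zero. The function names `plsa` and `log_likelihood`, the toy vocabulary, and all parameter defaults are illustrative assumptions.

```python
import math
import random


def _normalize(row, eps=1e-12):
    """Scale non-negative values so they sum to 1 (eps avoids division by zero)."""
    total = sum(row) + eps * len(row)
    return [(x + eps) / total for x in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM; returns (P(z|d) per document, P(w|z) per topic)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random initialisation, normalised into valid probability distributions.
    p_z_d = [_normalize([rng.random() for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [_normalize([rng.random() for _ in range(n_words)]) for _ in range(n_topics)]

    for _ in range(n_iter):
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(w | z) * P(z | d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                total = sum(joint)
                for z in range(n_topics):
                    expected = n_dw * joint[z] / total
                    # Accumulate expected counts for the M-step.
                    acc_z_d[d][z] += expected
                    acc_w_z[z][w] += expected
        # M-step: re-estimate both distributions from the expected counts.
        p_z_d = [_normalize(row) for row in acc_z_d]
        p_w_z = [_normalize(row) for row in acc_w_z]
    return p_z_d, p_w_z


def log_likelihood(counts, p_z_d, p_w_z):
    """Log-likelihood of the observed counts under the fitted model."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw:
                p_w_d = sum(p_w_z[z][w] * p_z_d[d][z] for z in range(len(p_w_z)))
                total += n_dw * math.log(p_w_d)
    return total


if __name__ == "__main__":
    # Hypothetical toy corpus: pages 0 and 1 use different words ("car" vs
    # "automobile") for the same topic, as do pages 2 and 3.
    vocab = ["car", "automobile", "engine", "recipe", "cooking", "oven"]
    counts = [
        [3, 0, 2, 0, 0, 0],
        [0, 3, 2, 0, 0, 0],
        [0, 0, 0, 3, 0, 2],
        [0, 0, 0, 0, 3, 2],
    ]
    p_z_d, p_w_z = plsa(counts, n_topics=2)
    for d, dist in enumerate(p_z_d):
        cluster = max(range(len(dist)), key=dist.__getitem__)
        print(f"page {d}: cluster {cluster}, P(z|d) = {[round(p, 3) for p in dist]}")
```

Each page is assigned to the latent topic with the highest P(z|d). In the toy corpus, pages that share no core word can still end up under the same topic if their words co-occur with a common term such as "engine", which is the effect the BOW criticism above is concerned with. EM only finds a local optimum, so results may vary with the seed and the number of topics.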

Index Terms

Computer Science
Information Sciences

Keywords

web page