Article:An Empirical Selection Method for Document Clustering

P.Perumal; R. Nedunchezhian; D.Brindha

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 31 - Number 3

Year of Publication: 2011

Authors: P.Perumal, R. Nedunchezhian, D.Brindha

10.5120/3803-5249

P.Perumal, R. Nedunchezhian, D.Brindha. Article:An Empirical Selection Method for Document Clustering. International Journal of Computer Applications. 31, 3 (October 2011), 15-19. DOI=10.5120/3803-5249

@article{ 10.5120/3803-5249,
  author = { P.Perumal and R. Nedunchezhian and D.Brindha },
  title = { Article:An Empirical Selection Method for Document Clustering },
  journal = { International Journal of Computer Applications },
  issue_date = { October 2011 },
  volume = { 31 },
  number = { 3 },
  month = { October },
  year = { 2011 },
  issn = { 0975-8887 },
  pages = { 15-19 },
  numpages = {9},
  url = { https://ijcaonline.org/archives/volume31/number3/3803-5249/ },
  doi = { 10.5120/3803-5249 },
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T20:17:09.678962+05:30
%A P.Perumal
%A R. Nedunchezhian
%A D.Brindha
%T Article:An Empirical Selection Method for Document Clustering
%J International Journal of Computer Applications
%@ 0975-8887
%V 31
%N 3
%P 15-19
%D 2011
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Model selection is the task of choosing a model from a set of potential models. A latent variable model such as PLSA can establish hidden semantic relations among the observed features using a number of latent variables, so selecting the correct number of latent variables is critical. In most previous research, the number of latent topics was chosen based on the number of invoked classes. This paper presents a method, based on a backward elimination approach, that is capable of unsupervised order selection in PLSA. During the elimination process, the central problem is choosing which latent variables to delete, because this choice directly affects the final performance of the pruned model. To address this problem, we introduce a new combined pruning method that selects the best candidates for removal. The obtained results show that this algorithm leads to an optimized number of latent variables. In this paper, we propose a novel approach, namely DPMFS, to address this issue.
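The backward elimination idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the combined pruning criterion, so this sketch assumes a simple stand-in (drop the topic with the smallest expected token count) and assumes a BIC-style score to compare model orders. The function names `fit_plsa` and `backward_eliminate`, the parameter `k_max`, and the toy count matrix are all hypothetical.

```python
import math
import random

def normalise(row):
    total = sum(row)
    return [v / total for v in row]

def fit_plsa(counts, p_w_z, p_z_d, iters=50):
    """EM for PLSA, warm-started from the given P(w|z) and P(z|d)."""
    n_docs, n_words, k = len(counts), len(counts[0]), len(p_w_z)
    for _ in range(iters):
        # A small floor keeps every probability strictly positive.
        acc_wz = [[1e-12] * n_words for _ in range(k)]
        acc_zd = [[1e-12] * k for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w), weighted by the word count.
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(k)]
                total = sum(post)
                for z in range(k):
                    r = counts[d][w] * post[z] / total
                    acc_wz[z][w] += r
                    acc_zd[d][z] += r
        # M-step: renormalise the expected counts.
        p_w_z = [normalise(row) for row in acc_wz]
        p_z_d = [normalise(row) for row in acc_zd]
    log_lik = sum(
        counts[d][w] * math.log(sum(p_z_d[d][z] * p_w_z[z][w] for z in range(k)))
        for d in range(n_docs) for w in range(n_words) if counts[d][w]
    )
    return p_w_z, p_z_d, log_lik

def backward_eliminate(counts, k_max, seed=0):
    """Start with k_max topics, prune one at a time, return the best-scoring k."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    n_tokens = sum(map(sum, counts))
    p_w_z = [normalise([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(k_max)]
    p_z_d = [normalise([rng.random() + 0.1 for _ in range(k_max)])
             for _ in range(n_docs)]
    best_score, best_k = None, None
    while True:
        p_w_z, p_z_d, log_lik = fit_plsa(counts, p_w_z, p_z_d)
        k = len(p_w_z)
        # Assumed selection score: BIC with the usual PLSA parameter count.
        n_params = k * (n_words - 1) + n_docs * (k - 1)
        score = -2.0 * log_lik + n_params * math.log(n_tokens)
        if best_score is None or score < best_score:
            best_score, best_k = score, k
        if k == 1:
            return best_k
        # Assumed pruning rule: remove the topic with the fewest expected tokens.
        weight = [sum(sum(counts[d]) * p_z_d[d][z] for d in range(n_docs))
                  for z in range(k)]
        drop = weight.index(min(weight))
        p_w_z = [row for z, row in enumerate(p_w_z) if z != drop]
        p_z_d = [normalise([v for z, v in enumerate(row) if z != drop])
                 for row in p_z_d]

# Toy corpus: three documents over words 0-2, three over words 3-5.
toy_counts = [[10, 10, 10, 0, 0, 0]] * 3 + [[0, 0, 0, 10, 10, 10]] * 3
print(backward_eliminate(toy_counts, k_max=4))
```

Each pruned model is warm-started from the surviving parameters of the larger one, so the elimination path reuses earlier EM work rather than refitting from scratch. Any other pruning rule or selection score could be swapped in at the two marked places.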

Index Terms

Computer Science

Information Sciences

Keywords

Document clustering, Model selection, EM algorithm, Dirichlet Process Mixture Model, Feature Selection