
Comparing K-Value Estimation for Categorical and Numeric Data Clustering

International Journal of Computer Applications
© 2010 by IJCA Journal
Number 3 - Article 2
Year of Publication: 2010

K. Arunprabha and V. Bhuvaneswari. Comparing K-Value Estimation for Categorical and Numeric Data Clustering. International Journal of Computer Applications 11(3):4–7, December 2010. Published by Foundation of Computer Science.

BibTeX:

	author = {K. Arunprabha and V. Bhuvaneswari},
	title = {Comparing K-Value Estimation for Categorical and Numeric Data Clustering},
	journal = {International Journal of Computer Applications},
	year = {2010},
	volume = {11},
	number = {3},
	pages = {4--7},
	month = {December},
	note = {Published By Foundation of Computer Science}


In data mining, clustering is one of the major tasks. It aims to group data objects into meaningful classes (clusters) such that the similarity of objects within a cluster is maximized and the similarity of objects from different clusters is minimized. When clustering a dataset, the right number k of clusters is often not obvious, and choosing k automatically is a hard algorithmic problem. We present an improved algorithm for learning k while clustering categorical data. The algorithm, Gaussian-means (G-means), is applied within the k-means paradigm and works well for categorical features. To apply the algorithm to a categorical dataset, the dataset is first converted into a numeric dataset. We use a novel heuristic technique for this conversion and compare the results on categorical data with those on numeric data. The G-means algorithm is based on a statistical test of the hypothesis that a subset of the data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. G-means requires only one intuitive parameter, the standard statistical significance level α.
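
The split-and-test loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not specify the conversion heuristic, so `one_hot` is a hypothetical stand-in for it. The normality test is the Anderson-Darling statistic, the test used in the original G-means formulation by Hamerly and Elkan. The critical value 1.092 (commonly tabulated for α = 0.01), the minimum cluster size, and the random initialisation of the two child centers are also assumptions. For brevity the sketch tests and splits each cluster independently and omits the global k-means refinement that G-means performs after each round of splits.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def one_hot(rows):
    """Hypothetical stand-in for the paper's conversion heuristic:
    each categorical column becomes one 0/1 column per distinct value."""
    levels = [sorted(set(col)) for col in zip(*rows)]
    return [tuple(float(v == lv) for v, lvls in zip(row, levels) for lv in lvls)
            for row in rows]


def ad_statistic(xs):
    """Anderson-Darling statistic for normality, with the small-sample
    correction applied when mean and variance are estimated from the data."""
    n = len(xs)
    mu, sigma = mean(xs), stdev(xs)
    if sigma == 0:
        return 0.0  # all values identical: nothing to split
    eps = 1e-12  # keeps log() finite for extreme values
    z = sorted(min(max(NormalDist().cdf((x - mu) / sigma), eps), 1 - eps)
               for x in xs)
    s = sum((2 * i + 1) * (math.log(z[i]) + math.log(1 - z[n - 1 - i]))
            for i in range(n))
    return (-n - s / n) * (1 + 4 / n - 25 / n ** 2)


def two_means(points, rng, iters=20):
    """Plain Lloyd iterations with k = 2, started from two random points."""
    centers = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[1 if d[1] < d[0] else 0].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split; the caller keeps the cluster whole
        centers = [tuple(mean(col) for col in zip(*g)) for g in groups]
    return centers, groups


def g_means(points, critical=1.092, min_size=8, seed=0):
    """Split clusters until each one passes the normality test.
    `critical` and `min_size` are illustrative assumptions."""
    rng = random.Random(seed)
    pending, accepted = [list(points)], []
    while pending:
        cluster = pending.pop()
        if len(cluster) < min_size:
            accepted.append(cluster)  # too small to test reliably
            continue
        (c1, c2), (g1, g2) = two_means(cluster, rng)
        v = [a - b for a, b in zip(c1, c2)]  # direction joining the children
        proj = [sum(a * b for a, b in zip(p, v)) for p in cluster]
        if not g1 or not g2 or ad_statistic(proj) <= critical:
            accepted.append(cluster)  # Gaussian hypothesis accepted
        else:
            pending.extend([g1, g2])  # hypothesis rejected: keep the split
    return accepted
```

Calling `g_means(points)` on a list of numeric tuples returns a list of clusters, and its length is the estimated k. A categorical dataset would first pass through a conversion step such as `one_hot(rows)`. Because 0/1 indicator columns are far from Gaussian, a plain one-hot encoding tends to make the test reject often, which is one reason a more careful conversion heuristic matters in this setting.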

