Research Article

More work on K-Means Clustering Algorithm: The Dimensionality Problem

by Barileé Barisi Baridam
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 44 - Number 2
Year of Publication: 2012
Authors: Barileé Barisi Baridam
10.5120/6236-8332

Barileé Barisi Baridam. More work on K-Means Clustering Algorithm: The Dimensionality Problem. International Journal of Computer Applications. 44, 2 (April 2012), 23-30. DOI=10.5120/6236-8332

@article{ 10.5120/6236-8332,
author = { Barileé Barisi Baridam },
title = { More work on K-Means Clustering Algorithm: The Dimensionality Problem },
journal = { International Journal of Computer Applications },
issue_date = { April 2012 },
volume = { 44 },
number = { 2 },
month = { April },
year = { 2012 },
issn = { 0975-8887 },
pages = { 23-30 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume44/number2/6236-8332/ },
doi = { 10.5120/6236-8332 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
Abstract

The K-means clustering algorithm is a long-established algorithm that has been researched intensively owing to its simplicity of implementation. However, its performance has also been criticised, in particular for requiring the value of K a priori. Previous research indicates that providing the number of clusters a priori does not in any way assist in producing good-quality clusters. The objective of this paper is to investigate the usefulness of K-means in clustering high- and multi-dimensional data by applying it to biological sequence data, which is known to be high- and multi-dimensional. The squared Euclidean distance and the cosine measure are used as the similarity measures. The silhouette validity index is used first to show the K-means algorithm's inefficiency in clustering high- and multi-dimensional data, irrespective of the distance or similarity measure employed. A further study introduces a preprocessor scheme that automatically initializes a suitable value of K prior to the execution of the K-means algorithm. The investigation of the dimensionality problem suggests that the preprocessor significantly improves the quality of the clusters for the biological data sets considered. Furthermore, it is shown that the K-means algorithm with the preprocessor produces good-quality, compact and well-separated clusters of the biological data obtained from a high-dimension-to-low-dimension mapping scheme introduced in the paper.
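
As a rough illustration of the evaluation setup the abstract describes, the sketch below runs a plain K-means and scores the result with the silhouette index under the two measures named above. It is not the paper's implementation: the function names, the toy data, and the fixed initialization (the first K points as starting centroids) are assumptions made for the example, and the paper's preprocessor for choosing K is not reproduced here.

```python
import math


def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity; both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(points, k, max_iter=100):
    """Plain K-means with squared Euclidean distance.

    Assumption for this sketch: the first k points are the initial centroids.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def silhouette(points, labels, dist):
    """Mean silhouette width s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: s(i) is defined as 0
            continue
        # a: mean dissimilarity to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean dissimilarity to any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two small, well-separated groups.
toy_points = [(1, 2), (2, 1), (1, 1), (10, 11), (11, 10), (10, 10)]
toy_labels, toy_centroids = kmeans(toy_points, 2)
score_sq_euclidean = silhouette(toy_points, toy_labels, sq_euclidean)
score_cosine = silhouette(toy_points, toy_labels, cosine_distance)
```

A silhouette value close to 1 indicates compact, well-separated clusters, while values near 0 or below suggest overlapping or misassigned points. Scoring the same partition under both measures, as here, shows how the verdict can depend on the dissimilarity chosen.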

Index Terms

Computer Science
Information Sciences

Keywords

Clustering, Dimensionality, Categorical Data, Silhouette Validity Index