CFP last date
20 May 2024
Reseach Article

A Comprehensive Survey on Centroid Selection Strategies for Distributed K-means Clustering Algorithm

by Poonam Ghuli, Maanas Prabhakar, Rajashree Shettar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 125 - Number 5
Year of Publication: 2015
Authors: Poonam Ghuli, Maanas Prabhakar, Rajashree Shettar
10.5120/ijca2015905919

Poonam Ghuli, Maanas Prabhakar, Rajashree Shettar . A Comprehensive Survey on Centroid Selection Strategies for Distributed K-means Clustering Algorithm. International Journal of Computer Applications. 125, 5 ( September 2015), 36-42. DOI=10.5120/ijca2015905919

@article{ 10.5120/ijca2015905919,
author = { Poonam Ghuli, Maanas Prabhakar, Rajashree Shettar },
title = { A Comprehensive Survey on Centroid Selection Strategies for Distributed K-means Clustering Algorithm },
journal = { International Journal of Computer Applications },
issue_date = { September 2015 },
volume = { 125 },
number = { 5 },
month = { September },
year = { 2015 },
issn = { 0975-8887 },
pages = { 36-42 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume125/number5/22432-2015905919/ },
doi = { 10.5120/ijca2015905919 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:15:16.313597+05:30
%A Poonam Ghuli
%A Maanas Prabhakar
%A Rajashree Shettar
%T A Comprehensive Survey on Centroid Selection Strategies for Distributed K-means Clustering Algorithm
%J International Journal of Computer Applications
%@ 0975-8887
%V 125
%N 5
%P 36-42
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Extremely large data sets often known as ‘Big Data’ are analyzed for interesting patterns, trends, and associations, especially those relating to human behavior and interactions. Extraction of meaningful and useful information needs to be done in parallel using advanced clustering algorithms. In this paper, effort has been made to tweak in changes to the existing K-means algorithm so as to work in parallel using MapReduce paradigm. K-means due to its gradient descent nature is highly sensitive to the initial placement of the cluster centers. This random initialization of cluster centers results in empty clusters and slower convergence. In this paper, an overview of existing methods with emphasis on computational efficiency is presented. Comparison of three well known linear time complexity initialization methods has been presented here. These methods are analyzed on two different data sets. The experimental results are recorded and presented with insights on different initialization methods for practitioners.

References
  1. Min Chen, Shiwen Mao, Yunhao Liu, "Big Data: A Survey", Mobile Networks and Applications, Springer publication, Volume 19, (2), April 2014, pp 171-209
  2. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM, Vol 51, no. 1 (2008): 107-113.
  3. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." In Sixth Symposium on Operating System Design and Implementation (OSDI 2004), Dec 2004, pp. 137-150.
  4. Mahmud, M. S., Rahman, M. M., & Akhtar, M. N. (2012, December). Improvement of K-means clustering algorithm with better initial centroids based on weighted average. In Electrical & Computer Engineering (ICECE), 2012 7th International Conference on (pp. 647-650). IEEE.
  5. Ding, C., & He, X. (2004, July). K-means clustering via principal component analysis. In Proceedings of the twenty-first international conference on Machine learning (p. 29). ACM.
  6. Ding, C. H., & He, X. (2004, April). Principal Component Analysis and Effective K-Means Clustering. In SDM (pp. 497-501).
  7. D.Napoleon, S.Pavalakodi “A New Method for Dimensionality Reduction using KMeans Clustering Algorithm for High Dimensional Data Set” International Journal of Computer Applications , Vol. 13, (7), January 2011
  8. Nazeer, K., Kumar, S. M., & Sebastian, M. P. (2011, February). Enhancing the k-means clustering algorithm by using a O (n logn) heuristic method for finding better initial centroids. In Emerging Applications of Information Technology (EAIT), 2011 Second International Conference on (pp. 261-264). IEEE.
  9. J. M. Pena, J. A. Lozano, P. Larranaga, 1999, An Empirical Comparison of Four Initialization Methods for the K-Means Algorithm, Pattern Recognition Letters 20 (10) (1999) 1027–1040.
  10. J. He, M. Lan, C. L. Tan, S. Y. Sung, H. B. Low 2004 Initialization of Cluster Refinement Algorithms: A Review and Comparative Study, in: Proc. of the 2004 IEEE Int. Joint Conf. on Neural Networks, pp. 297–302.
  11. A. K. Jain, M. N. Murty, P. J. Flynn 1999 Data Clustering: A Review, ACM Computing Surveys, 31 (3), pp 264–323.
  12. A. K. Jain 2010 Data Clustering: 50 Years Beyond K-means, Pattern Recognition Letters 31 (8) 651–666.
  13. S. Z. Selim, M. A. Ismail, K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality, IEEE Trans. on Pattern Analysis and Machine Intelligence 6 (1) (1984) 81–87.
  14. L. Bottou, Y. Bengio, Advances in Neural Information Processing Systems 7, MIT Press, 1995, Ch. Convergence Properties of the K-Means Algorithms, pp. 585–592.
  15. Celebi, M. Emre, Hassan A. Kingravi, and Patricio A. Vela. "A comparative study of efficient initialization methods for the k-means clustering algorithm." Elsevier, Expert Systems with Applications, 40, no. 1 (2013): 200-210.
  16. E. Dede, Z. Fadika, M. Govindaraju, and L. Ramakrishnan, “Benchmarking MapReduce implementations under different application scenarios,” in Future Generation Computer Systems, Special Section: eScience Infrastructure and Applications, Vol. 36. IEEE Computer Society Washington, DC: Elsevier, 2014, pp. 389-399.
  17. S. Sakr, A. Liu, and A. G. Fayoumi. “The family of MapReduce and large-scale data processing systems,” ACM Comput. Surv. (CSUR), Vol. 46, no. 1, Oct. 2013, Article no. 11.
  18. M. Yoon, H.-il Kim, D. H. Choi, H. Jo, and J.-w. Chang, “Performance analysis of MapReduce-based distributed systems for iterative data processing applications,” Mobile Ubiquit. Intell. Comput., Vol. 274, no. 1, pp. 293_299, Mar. 2014.
  19. Ghuli, P., Shukla, A., Kiran, R., Jason, S., & Shettar, R. (2015). Multidimensional Canopy Clustering on Iterative MapReduce Framework Using Elefig Tool. IETE Journal of Research, 61(1), 14-21.
Index Terms

Computer Science
Information Sciences

Keywords

K-means Clustering Algorithm Hadoop MapReduce PCA HDFS.