Research Article

Adapting k-means for Clustering in Big Data

by Mugdha Jain, Chakradhar Verma
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 101 - Number 1
Year of Publication: 2014
Authors: Mugdha Jain, Chakradhar Verma
10.5120/17652-8457

Mugdha Jain, Chakradhar Verma. Adapting k-means for Clustering in Big Data. International Journal of Computer Applications. 101, 1 (September 2014), 19-24. DOI=10.5120/17652-8457

@article{ 10.5120/17652-8457,
author = { Mugdha Jain and Chakradhar Verma },
title = { Adapting k-means for Clustering in Big Data },
journal = { International Journal of Computer Applications },
issue_date = { September 2014 },
volume = { 101 },
number = { 1 },
month = { September },
year = { 2014 },
issn = { 0975-8887 },
pages = { 19-24 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume101/number1/17652-8457/ },
doi = { 10.5120/17652-8457 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:30:34.376212+05:30
%A Mugdha Jain
%A Chakradhar Verma
%T Adapting k-means for Clustering in Big Data
%J International Journal of Computer Applications
%@ 0975-8887
%V 101
%N 1
%P 19-24
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Big data, if used properly, can bring huge benefits to business, science and humanity. Its properties, such as volume, velocity, variety, variation and veracity, render existing data analysis techniques ineffective. Big data analysis requires fusing data mining techniques with those of machine learning. The k-means algorithm is one algorithm that is used in both fields. This paper describes an approximate algorithm based on k-means. It is a novel method for big data analysis that is very fast, scalable and highly accurate. It overcomes k-means' drawback of an uncertain number of iterations by fixing the number of iterations, without losing precision.
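
The abstract does not spell out the approximate algorithm itself, so the following is only a minimal sketch of the general idea it mentions: running standard Lloyd-style k-means for a fixed number of passes instead of iterating until convergence. It is not the authors' method. The function name `kmeans_fixed_iterations`, its parameters, and the random initialisation are assumptions made for illustration.

```python
import random


def kmeans_fixed_iterations(points, k, iterations=5, seed=0):
    """Lloyd-style k-means that always runs exactly `iterations` passes.

    Hypothetical helper for illustration only; the paper's approximate
    algorithm is not described in the abstract and may differ substantially.
    """
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    # Labels are those assigned in the final pass, before its centroid update.
    return centroids, labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, assignment = kmeans_fixed_iterations(data, k=2, iterations=5)
    print(centers, assignment)
```

Because the loop count is fixed in advance, the running time is predictable: each pass costs on the order of n * k distance computations for n points. The trade-off is that the result may not have fully converged if too few passes are allowed.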

Index Terms

Computer Science
Information Sciences

Keywords

Big data mining, big data analysis, approximate k-means, clustering.