Efficient Big Text Data Clustering Algorithms using Hadoop and Spark

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2021
Sergios Gerakidis, Sofia Megarchioti, Basilis Mamalis

Sergios Gerakidis, Sofia Megarchioti and Basilis Mamalis. Efficient Big Text Data Clustering Algorithms using Hadoop and Spark. International Journal of Computer Applications 174(15):13-21, January 2021. BibTeX

	author = {Sergios Gerakidis and Sofia Megarchioti and Basilis Mamalis},
	title = {Efficient Big Text Data Clustering Algorithms using Hadoop and Spark},
	journal = {International Journal of Computer Applications},
	issue_date = {January 2021},
	volume = {174},
	number = {15},
	month = {Jan},
	year = {2021},
	issn = {0975-8887},
	pages = {13-21},
	numpages = {9},
	url = {},
	doi = {10.5120/ijca2021921030},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}


Document clustering is a traditional, efficient, and quite effective text mining technique for gaining better insight into which documents of a collection can be grouped together. The K-Means algorithm and the Hierarchical Agglomerative Clustering (HAC) algorithm are two of the best-known and most commonly used clustering algorithms; the former for its low time cost and the latter for its accuracy. However, even K-Means can lead to unacceptable time costs when applied to text clustering over large-scale collections. In this paper we first review some of the most valuable approaches for document clustering over such 'big data' (large-scale) collections. We then present two very promising alternatives: (a) a variation of an existing K-Means-based fast clustering technique (known as BigKClustering, or BKC), adapted so that it can be applied to document clustering, and (b) a hybrid clustering approach based on a customized version of the Buckshot algorithm, which first applies a hierarchical clustering procedure to a sample of the input dataset and then uses the results as the initial centers for a K-Means-based assignment of the remaining documents, with very few iterations. We also give highly efficient adaptations of the proposed techniques to the MapReduce model, which are then experimentally tested using Apache Hadoop and Spark in a real cluster environment. The experiments show that both techniques achieve acceptable clustering quality as well as significant time improvements compared to K-Means (especially the Buckshot-based algorithm), thus constituting very promising alternatives for big document collections.
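To make the hybrid (Buckshot-style) idea in the abstract concrete, the following is a minimal single-machine sketch in Python. It is not the authors' customized or MapReduce implementation. The sample size of roughly sqrt(k*n), the centroid-based merging rule, the use of cosine similarity on dense vectors, and all function names (`cosine`, `centroid`, `hac`, `assign`, `buckshot`) are illustrative assumptions.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hac(sample, k):
    """Naive agglomerative clustering (assumed centroid linkage): repeatedly
    merge the two clusters with the most similar centroids until k remain."""
    clusters = [[v] for v in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


def assign(docs, centers):
    """Label each document with the index of its most similar center."""
    return [max(range(len(centers)), key=lambda c: cosine(d, centers[c]))
            for d in docs]


def buckshot(docs, k, iterations=2, seed=0):
    """Hierarchical clustering on a sample, then a few K-Means-style passes."""
    rng = random.Random(seed)
    # Assumption: sample about sqrt(k * n) documents, never fewer than k.
    size = min(len(docs), max(k, math.ceil(math.sqrt(k * len(docs)))))
    centers = hac(rng.sample(docs, size), k)
    labels = assign(docs, centers)
    for _ in range(iterations):
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
        labels = assign(docs, centers)
    return labels, centers


# Toy usage: two well-separated groups of 2-dimensional "documents".
docs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2],
        [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]
labels, centers = buckshot(docs, k=2)
```

The quadratic-cost hierarchical step runs only on the sample, so in this sketch its cost grows with the sample size rather than with the full collection, while the remaining documents are handled by cheap nearest-center assignments.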




Document Clustering, Big Data, K-Means, Hierarchical Clustering, MapReduce Model, Hadoop, Spark