Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means

International Journal of Computer Applications
© 2015 by IJCA Journal
Volume 113 - Number 1
Year of Publication: 2015
Satish Gopalani
Rohan Arora

Satish Gopalani and Rohan Arora. Article: Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means. International Journal of Computer Applications 113(1):8-11, March 2015. Full text available.

BibTeX

	author = {Satish Gopalani and Rohan Arora},
	title = {Article: Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means},
	journal = {International Journal of Computer Applications},
	year = {2015},
	volume = {113},
	number = {1},
	pages = {8-11},
	month = {March},
	note = {Full text available}


Big Data has long fascinated computer science enthusiasts around the world, and it has gained even more prominence in recent times with the continuing explosion of data from sources such as social media and the drive of tech giants to analyze their data more deeply. This paper compares two frameworks, Hadoop MapReduce and the more recently introduced Apache Spark, both of which provide a processing model for analyzing big data. Although both are built for Big Data workloads, their performance varies significantly with the use case being implemented. That variation is what makes the two worth analyzing in the fast-changing field of Big Data. In this paper we compare the two frameworks and provide a performance analysis using a standard machine learning algorithm for clustering (K-Means).
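For readers unfamiliar with the benchmark algorithm, the following is a minimal single-machine sketch of K-Means (Lloyd's algorithm) in plain Python. It is not the authors' MapReduce or Spark implementation, which this page does not show; the function names, the `max_iterations` default, and the seeded random initialization are all assumptions made for illustration. The assignment and update steps are commented as "map-like" and "reduce-like" only to suggest how the work might be split in a distributed setting. Each iteration makes a full pass over the data set, so how a framework handles repeated reads of the same input matters for this workload.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_center(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center nearest to the point (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points: Sequence[Point], k: int,
           max_iterations: int = 20, seed: int = 0) -> List[Point]:
    """Illustrative K-Means; returns the k cluster centers."""
    rng = random.Random(seed)
    # Assumption: initialize centers by sampling k distinct input points.
    centers = rng.sample(list(points), k)
    dim = len(points[0])

    for _ in range(max_iterations):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k

        # Map-like step: assign every point to its nearest center
        # and accumulate per-cluster coordinate sums and counts.
        for p in points:
            i = closest_center(p, centers)
            counts[i] += 1
            for d, value in enumerate(p):
                sums[i][d] += value

        # Reduce-like step: each new center is the mean of its assigned points.
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(s / counts[i] for s in sums[i]) if counts[i] else centers[i]
            for i in range(k)
        ]

        if new_centers == centers:  # assignments stopped changing
            break
        centers = new_centers

    return centers
```

A call such as `kmeans(points, k=2)` on a list of coordinate tuples returns two center tuples; the order of the centers depends on the random initialization.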

