Replication Effect over Hadoop MapReduce Performance using Regression Analysis

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2018
Aisha Shabbir, Kamalrulnizam Abu Bakar, Raja Zahilah Raja Mohd. Radzi

Aisha Shabbir, Kamalrulnizam Abu Bakar and Raja Zahilah Raja Mohd. Radzi. Replication Effect over Hadoop MapReduce Performance using Regression Analysis. International Journal of Computer Applications 181(24):33-38, October 2018. BibTeX

	author = {Aisha Shabbir and Kamalrulnizam Abu Bakar and Raja Zahilah Raja Mohd. Radzi},
	title = {Replication Effect over Hadoop MapReduce Performance using Regression Analysis},
	journal = {International Journal of Computer Applications},
	issue_date = {October 2018},
	volume = {181},
	number = {24},
	month = {Oct},
	year = {2018},
	issn = {0975-8887},
	pages = {33-38},
	numpages = {6},
	url = {},
	doi = {10.5120/ijca2018918034},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}


Hadoop MapReduce is a widely accepted platform for processing very large datasets efficiently and cost-effectively. To cope with ever-growing datasets and shrinking time to analyze them, Hadoop MapReduce parallelizes computation across large distributed clusters of many machines. Careful consideration of the factors that affect Hadoop MapReduce can improve its performance. Much research has been done on reducing the total job execution time of MapReduce by optimizing different parameters. However, the effect of the replication factor on MapReduce job completion time remains unexplored. This paper evaluates the effect of the data replication factor on MapReduce job completion time using regression analysis. The performance of a Hadoop MapReduce job, measured as total job completion time, is monitored experimentally while the replication factor is varied. The evaluation results clearly show that job completion time depends on the replication factor. This dependence has been verified both analytically and experimentally.
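The kind of regression analysis described in the abstract can be illustrated with a minimal sketch. It assumes a simple linear model of job completion time against replication factor; the abstract does not state which regression form the authors used. The function names `fit_line` and `r_squared` and all data values are hypothetical placeholders, not measurements from the paper.

```python
# Minimal sketch: ordinary least squares fit of job completion time
# against replication factor. A linear model is an assumption here.


def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Return the coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("y values must not all be equal")
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot


# Hypothetical placeholder data: replication factors and job completion
# times in seconds. The direction and size of the trend are arbitrary and
# do not reproduce the paper's results.
replication_factor = [1, 2, 3, 4, 5]
completion_time_s = [118.0, 131.0, 127.0, 142.0, 150.0]

slope, intercept = fit_line(replication_factor, completion_time_s)
fit_quality = r_squared(replication_factor, completion_time_s, slope, intercept)
print(f"slope = {slope:.2f} s per replica, intercept = {intercept:.2f} s")
print(f"R^2 = {fit_quality:.3f}")
```

In such a fit, a slope clearly different from zero together with a high R^2 would be consistent with completion time depending on the replication factor, which is the kind of relationship the paper reports.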




Hadoop MapReduce, Big Data, Regression Analysis, Data Replication, Job optimization