
Parallel Computing to Predict Breast Cancer Recurrence on SEER Dataset using Map-Reduce Approach

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2016
Umesh D. R., B. Ramachandra

Umesh D. R. and B. Ramachandra. Parallel Computing to Predict Breast Cancer Recurrence on SEER Dataset using Map-Reduce Approach. International Journal of Computer Applications 149(12):31-35, September 2016. BibTeX

	author = {Umesh D. R. and B. Ramachandra},
	title = {Parallel Computing to Predict Breast Cancer Recurrence on SEER Dataset using Map-Reduce Approach},
	journal = {International Journal of Computer Applications},
	issue_date = {September 2016},
	volume = {149},
	number = {12},
	month = {Sep},
	year = {2016},
	issn = {0975-8887},
	pages = {31-35},
	numpages = {5},
	url = {},
	doi = {10.5120/ijca2016911669},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}


Because of the recent, overwhelming growth rate of large-scale data, developing faster processing algorithms with optimal performance has become a critical need. In this paper, a parallel Map-Reduce algorithm is proposed that lets multiple computing nodes participate concurrently in building a classifier on the SEER breast cancer data set. The algorithm can induce boosted models whose generalization performance is close to that of the respective baseline classifier. By exploiting its parallel architecture, the algorithm achieves a significant speedup. In addition, the algorithm does not require individual computing nodes to communicate with each other, to share their data, or to share the knowledge derived from their data; consequently, it is also robust in preserving the privacy of the computation. The algorithm is implemented with the Map-Reduce framework and evaluated on the SEER breast cancer data set in terms of classification accuracy and speedup.
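The map/reduce split described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the base classifier, the merge rule, or any function names, so everything below is an assumption made for illustration: a one-feature threshold classifier (decision stump) as the local model, an unweighted majority vote as the reduce step, threads standing in for computing nodes, and synthetic toy data in place of SEER records.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def train_stump(partition):
    """Map step (assumed): fit a decision stump using one partition only."""
    best = None
    n_features = len(partition[0][0])
    for f in range(n_features):
        for x, _ in partition:
            t = x[f]  # candidate threshold taken from the local data
            for sign in (1, -1):
                errors = sum(
                    1 for xi, yi in partition
                    if (sign if xi[f] > t else -sign) != yi
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, sign)
    _, f, t, sign = best
    return (f, t, sign)  # only model parameters leave the worker


def predict_stump(stump, x):
    f, t, sign = stump
    return sign if x[f] > t else -sign


def reduce_vote(stumps):
    """Reduce step (assumed): merge local models by unweighted majority vote."""
    def ensemble(x):
        total = sum(predict_stump(s, x) for s in stumps)
        return 1 if total >= 0 else -1
    return ensemble


def make_toy_data(n, seed):
    """Synthetic two-feature data; the label depends on the first feature only."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        data.append((x, 1 if x[0] > 0.5 else -1))
    return data


data = make_toy_data(300, seed=0)
partitions = [data[i::3] for i in range(3)]  # one partition per simulated node

# Each worker sees only its own partition and never talks to the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    stumps = list(pool.map(train_stump, partitions))

model = reduce_vote(stumps)
accuracy = sum(model(x) == y for x, y in data) / len(data)
print(f"ensemble accuracy on toy data: {accuracy:.3f}")
```

In this sketch the workers exchange nothing during training, and the reduce step receives only the fitted parameters, never the raw records. That mirrors the privacy argument in the abstract, though the paper's actual algorithm may differ.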




Breast cancer; Big data analytics; Classification; Parallel Computing; MapReduce; SEER.