Big Data Analysis with Apache Spark

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2017
Authors:
Pallavi Singh, Saurabh Anand, Sagar B. M.
DOI: 10.5120/ijca2017915251

Pallavi Singh, Saurabh Anand and Sagar B. M. Big Data Analysis with Apache Spark. International Journal of Computer Applications 175(5):6-8, October 2017. BibTeX

@article{10.5120/ijca2017915251,
	author = {Pallavi Singh and Saurabh Anand and Sagar B. M.},
	title = {Big Data Analysis with Apache Spark},
	journal = {International Journal of Computer Applications},
	issue_date = {October 2017},
	volume = {175},
	number = {5},
	month = {Oct},
	year = {2017},
	issn = {0975-8887},
	pages = {6-8},
	numpages = {3},
	url = {http://www.ijcaonline.org/archives/volume175/number5/28482-2017915251},
	doi = {10.5120/ijca2017915251},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}
}

Abstract

Manipulating big data distributed over a cluster is one of the major challenges that most big-data-oriented companies face. This is evident from the popularity of MapReduce and Hadoop, and most recently of Apache Spark, a fast in-memory distributed-collections framework that provides a solution for big data management. This paper presents a discussion of how, technically, Apache Spark helps in big data analysis and management. To support this, an experiment was conducted comparing Spark against a plain Cassandra ResultSet, with Cassandra as the data source. The number of records in a Cassandra table was increased gradually, and the time taken to fetch the records from Cassandra using Spark was compared with the time taken using a traditional Java ResultSet. In the initial stages, when the data size was below 10 percent of the final volume, Spark's average response time was almost equal to the time taken without Spark. Once the data size exceeded 10 percent of the records, Spark's response time dropped by almost 50 percent compared with querying Cassandra directly. The final measurement was taken at 5×10^6 records, where Spark reduced the fetch time by roughly 50 percent relative to the traditional Cassandra ResultSet approach, supporting the conclusion that Spark is beneficial by almost 50 percent when working on really big datasets.
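The measurement procedure described in the abstract (gradually increase the record count, then time a fetch with and without Spark) can be sketched as a small harness. This is a minimal illustration only: the two `fetch_*` functions below are hypothetical stand-ins for the real Cassandra driver and Spark clients, which require a running cluster; the "Spark-like" variant merely mimics partitioned processing of the same work.

```python
import time

def benchmark(fetch, sizes):
    """Time fetch(n) for each record count n, as in the paper's procedure."""
    results = {}
    for n in sizes:
        start = time.perf_counter()
        fetch(n)
        results[n] = time.perf_counter() - start
    return results

# Hypothetical stand-ins for the real clients:
def fetch_resultset(n):
    # stand-in for iterating a plain Java ResultSet row by row
    return sum(range(n))

def fetch_spark(n):
    # stand-in: the same work split into chunks, the way an RDD
    # would be processed partition by partition
    chunk = max(n // 8, 1)
    return sum(sum(range(i, min(i + chunk, n))) for i in range(0, n, chunk))

if __name__ == "__main__":
    sizes = [10**4, 10**5, 10**6]
    plain = benchmark(fetch_resultset, sizes)
    spark = benchmark(fetch_spark, sizes)
    for n in sizes:
        print(f"{n} records: plain={plain[n]:.4f}s  spark-like={spark[n]:.4f}s")
```

In the actual experiment the two fetchers would be replaced by a Cassandra driver query and a Spark job reading the same table, and the timings compared at each size as above.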

Keywords

Spark, RDD, MapReduce, Hadoop, Cassandra