Research Article

BIG Data: Implementation a Scala Approach for Large Scale Classification

by Yassine Sabri, Najib El Kamoun
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 172 - Number 3
Year of Publication: 2017
Authors: Yassine Sabri, Najib El Kamoun
10.5120/ijca2017915123

Yassine Sabri, Najib El Kamoun . BIG Data: Implementation a Scala Approach for Large Scale Classification. International Journal of Computer Applications. 172, 3 ( Aug 2017), 1-6. DOI=10.5120/ijca2017915123

@article{ 10.5120/ijca2017915123,
author = { Yassine Sabri, Najib El Kamoun },
title = { BIG Data: Implementation a Scala Approach for Large Scale Classification },
journal = { International Journal of Computer Applications },
issue_date = { Aug 2017 },
volume = { 172 },
number = { 3 },
month = { Aug },
year = { 2017 },
issn = { 0975-8887 },
pages = { 1-6 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume172/number3/28228-2017915123/ },
doi = { 10.5120/ijca2017915123 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
Abstract

Many scientific investigations require data-intensive research in which big data are collected and analyzed. To gain insight from big data, we first need to develop initial hypotheses from the data and then test and validate them. We propose FS-S, a flexible and modular Scala-based implementation of the Fixed-Size Least Squares Support Vector Machine (FS-LSSVM) for large data sets. The framework consists of a set of modules for optimization (gradient-based and gradient-free), model representation, kernel functions, and evaluation of FS-LSSVM models. A kernel-based FS-LSSVM model is implemented in the proposed framework and relies heavily on the parallel computing capabilities of Apache Spark. Global optimization routines such as Coupled Simulated Annealing (CSA) and Grid Search are implemented and used to tune the hyper-parameters of the FS-LSSVM model. Finally, we carry out experiments on benchmark data sets such as Magic Gamma, Forest Cover, SUSY, and HIGGS, and evaluate the performance of various kernel-based FS-LSSVM models. Together, these components provide an effective and efficient way to perform closed-loop big data analysis with visualization and scalable computing.
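
The hyper-parameter tuning step mentioned in the abstract can be illustrated with a minimal Scala sketch. It is not taken from the authors' framework: the names `rbfKernel`, `gridSearch`, and `toyCost` are hypothetical, the choice of an RBF kernel and of the two tuned parameters (a regularization constant `gamma` and a kernel bandwidth `sigma`) is an assumption, and the cost function is a toy stand-in for a cross-validation error. The sketch runs on a single machine and does not use Apache Spark.

```scala
// Illustrative sketch only: hypothetical names, not the API of the authors' framework.
object GridSearchSketch {

  /** RBF kernel k(x, z) = exp(-||x - z||^2 / (2 * sigma^2)). */
  def rbfKernel(sigma: Double)(x: Array[Double], z: Array[Double]): Double = {
    require(x.length == z.length, "vectors must have equal length")
    val sqDist = x.zip(z).map { case (a, b) => (a - b) * (a - b) }.sum
    math.exp(-sqDist / (2.0 * sigma * sigma))
  }

  /** Evaluates `cost` at every (gamma, sigma) pair on the grid and
    * returns the pair with the lowest cost together with that cost. */
  def gridSearch(
      gammas: Seq[Double],
      sigmas: Seq[Double],
      cost: (Double, Double) => Double
  ): ((Double, Double), Double) = {
    require(gammas.nonEmpty && sigmas.nonEmpty, "grids must be non-empty")
    val candidates = for (g <- gammas; s <- sigmas) yield ((g, s), cost(g, s))
    candidates.minBy(_._2)
  }

  def main(args: Array[String]): Unit = {
    // Toy stand-in for a validation error surface (assumption, not real data).
    val toyCost = (gamma: Double, sigma: Double) =>
      math.pow(math.log10(gamma) - 1.0, 2) + math.pow(math.log10(sigma), 2)

    // Logarithmically spaced grid, a common choice for these parameters.
    val grid = Seq(0.01, 0.1, 1.0, 10.0, 100.0)
    val ((bestGamma, bestSigma), bestCost) = gridSearch(grid, grid, toyCost)
    println(s"best gamma=$bestGamma sigma=$bestSigma cost=$bestCost")
  }
}
```

In a real setting the cost function would train a model for each parameter pair and return its validation error. Because each grid point is evaluated independently, this step is a natural candidate for parallel execution.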

Index Terms

Computer Science
Information Sciences

Keywords

FS-LSSVM, Big Data, Large Scale Models, Non-linear SVMs