Misclassification in Big Data Soft Set Environment

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2017
Authors:
Jyoti Arora, Kamaljit Kaur
10.5120/ijca2017914298

Jyoti Arora and Kamaljit Kaur. Misclassification in Big Data Soft Set Environment. International Journal of Computer Applications 168(2):23-29, June 2017. BibTeX

@article{10.5120/ijca2017914298,
	author = {Jyoti Arora and Kamaljit Kaur},
	title = {Misclassification in Big Data Soft Set Environment},
	journal = {International Journal of Computer Applications},
	issue_date = {June 2017},
	volume = {168},
	number = {2},
	month = {Jun},
	year = {2017},
	issn = {0975-8887},
	pages = {23-29},
	numpages = {7},
	url = {http://www.ijcaonline.org/archives/volume168/number2/27849-2017914298},
	doi = {10.5120/ijca2017914298},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}
}

Abstract

To handle classification of large data sets, data filtering and data cleansing are used as preprocessing methods. These methods generally remove noisy data, misclassified data, errors, and inconsistent data, which can make the resulting classification unreliable, because cleaned data can sometimes also affect prediction accuracy or other testing. In this paper, we analyze misclassified data and identify how much of the data has been wrongly classified. In future work, this misclassified data needs to be rectified to obtain valuable information. To demonstrate this concept, we used the Air Traffic dataset from Statistical Computing Statistical Graphics (SCSG) to examine the misclassified content in the data set. Five supervised classifiers are used: support vector machine (SVM), decision procedure, k-nearest neighbor, random forest, and logistic regression. The results show that, of these classifiers, SVM classifies 86% of the data correctly, with only 14% of the data misclassified.
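The relationship between the reported accuracy and the misclassified share can be illustrated with a minimal sketch. It assumes only that the misclassification rate is the fraction of records whose predicted label differs from the true label; the function names and the toy labels below are hypothetical and are not taken from the paper or from the Air Traffic dataset.

```python
from typing import Hashable, List, Sequence, Tuple


def misclassified_indices(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> List[int]:
    """Return the positions where the predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]


def accuracy_and_error(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (accuracy, misclassification rate); the two values sum to 1."""
    if len(y_true) == 0:
        raise ValueError("need at least one labelled record")
    wrong = len(misclassified_indices(y_true, y_pred))
    error = wrong / len(y_true)
    return 1.0 - error, error


# Hypothetical toy labels (assumption, not real data): 100 records, 14 predicted wrongly.
y_true = ["delayed"] * 50 + ["on_time"] * 50
y_pred = ["delayed"] * 43 + ["on_time"] * 7 + ["on_time"] * 43 + ["delayed"] * 7

accuracy, error = accuracy_and_error(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, misclassified={error:.2f}")
```

Keeping the indices of the misclassified records, rather than only the rate, is one way to retain them for later rectification instead of discarding them during cleansing.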

Keywords

Misclassification, Big Data, Classification