Research Article

(ISSBM) Improved Synthetic Sampling based on Model for Imbalance Data

by Ragini Gour, Ramratan Ahirwal
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 183 - Number 6
Year of Publication: 2021
Authors: Ragini Gour, Ramratan Ahirwal
10.5120/ijca2021921342

Ragini Gour, Ramratan Ahirwal. (ISSBM) Improved Synthetic Sampling based on Model for Imbalance Data. International Journal of Computer Applications. 183, 6 (Jun 2021), 29-35. DOI=10.5120/ijca2021921342

@article{10.5120/ijca2021921342,
  author     = {Ragini Gour and Ramratan Ahirwal},
  title      = {(ISSBM) Improved Synthetic Sampling based on Model for Imbalance Data},
  journal    = {International Journal of Computer Applications},
  issue_date = {Jun 2021},
  volume     = {183},
  number     = {6},
  month      = {Jun},
  year       = {2021},
  issn       = {0975-8887},
  pages      = {29-35},
  numpages   = {7},
  url        = {https://ijcaonline.org/archives/volume183/number6/31932-2021921342/},
  doi        = {10.5120/ijca2021921342},
  publisher  = {Foundation of Computer Science (FCS), NY, USA},
  address    = {New York, USA}
}
%0 Journal Article
%A Ragini Gour
%A Ramratan Ahirwal
%T (ISSBM) Improved Synthetic Sampling based on Model for Imbalance Data
%J International Journal of Computer Applications
%@ 0975-8887
%V 183
%N 6
%P 29-35
%D 2021
%I Foundation of Computer Science (FCS), NY, USA
Abstract

In the data mining research domain, imbalanced data is characterized by a severe disparity in observation frequency between classes and has received considerable attention. Prediction performance usually degrades when classifiers learn from imbalanced data, because most classifiers assume that the class distribution is balanced or that the costs of different types of classification errors are equal. Although several methods have been proposed to deal with imbalance problems, it remains difficult to generalize them so that they achieve stable improvement in most cases. In this study, we propose a novel framework called Improved Synthetic Sampling Based on Model (ISSBM) to deal with imbalance problems, in which we integrate improved modeling and sampling techniques to generate synthetic data. The key idea behind the proposed method is to use regression models to capture the relationships between features and to account for data diversity in the process of data generation. We conduct experiments on many datasets and compare the proposed method with five baseline methods. The experimental results indicate that the proposed method is not only competitive or comparable but also very stable. We also provide a detailed analysis of the proposed method to empirically demonstrate why it generates good data samples.
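The abstract describes the generation recipe only at a high level, so the following minimal Python sketch illustrates one way a model-based synthetic oversampler of this kind can work: per-feature regression models capture the relationships between features, and feature-scaled noise keeps the generated points diverse. This is an assumption-laden illustration, not the authors' ISSBM implementation; the function name model_based_oversample, the 50/50 blending weight, and the noise_scale parameter are hypothetical choices made here for clarity.

import numpy as np
from sklearn.linear_model import LinearRegression

def model_based_oversample(X_min, n_new, noise_scale=0.05, random_state=0):
    # Generate n_new synthetic minority samples from X_min of shape (n_samples, n_features).
    rng = np.random.default_rng(random_state)
    n, d = X_min.shape

    # Fit one regression model per feature: predict feature j from all other features,
    # so generated values respect the relationships between features.
    models = [LinearRegression().fit(np.delete(X_min, j, axis=1), X_min[:, j])
              for j in range(d)]

    # Seed each synthetic point from a randomly chosen minority sample, then blend the
    # seed value with the model prediction and add feature-scaled Gaussian noise so the
    # synthetic points stay diverse instead of collapsing onto the regression surface.
    seeds = X_min[rng.integers(0, n, size=n_new)]
    stds = X_min.std(axis=0) + 1e-12
    X_new = seeds.copy()
    for j in range(d):
        pred = models[j].predict(np.delete(seeds, j, axis=1))
        X_new[:, j] = 0.5 * seeds[:, j] + 0.5 * pred + rng.normal(0.0, noise_scale * stds[j], n_new)
    return X_new

# Toy usage: oversample a 50-sample minority class up to a 500-sample majority class.
rng = np.random.default_rng(1)
X_maj = rng.normal(0.0, 1.0, (500, 4))
X_min = rng.normal(2.0, 1.0, (50, 4))
X_syn = model_based_oversample(X_min, n_new=len(X_maj) - len(X_min))
print(X_syn.shape)  # (450, 4)

The blend weight and noise term here are arbitrary illustration values; the paper's actual generation procedure and its treatment of data diversity should be taken from the full text.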

References
  1. Bauder RA, Khoshgoftaar TM. The effects of varying class distribution on learner behavior for medicare fraud detection with imbalanced Big Data. Health Inf Sci Syst. 2018.
  2. Triguero I, Rio S, Lopez V, Bacardit J, Benítez J, Herrera F. ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem. Knowl Based Syst. 2015.
  3. Van Hulse J. A study on the relationships of classifier performance metrics. In: 21st international conference on tools with artificial intelligence (ICTAI 2009). IEEE. 2009.
  4. Katal A, Wazid M, Goudar R. Big data: issues, challenges, tools, and good practices. In: Sixth international conference on contemporary computing. 2013.
  5. Herland M, Khoshgoftaar TM, Bauder RA. Big Data fraud detection using multiple medicare data sources. J Big Data. 2018;5:29.
  6. Bauder RA, Khoshgoftaar TM. Medicare fraud detection using random forest with class imbalanced Big Data. In: 2018 IEEE international conference on information reuse and integration (IRI), IEEE. 2018. pp. 80–7.
  7. Ali A, Shamsuddin SM, Ralescu AL. Classification with class imbalance problem: a review. Int J Adv Soft Comput Appl. 2015;7(3):176–204.
  8. Lopez V, Rio S, Benitez J, Herrera F. Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced Big Data. Fuzzy Sets Syst. 2015;258:5–38.
  9. Wang D, Wu P, Zhao P, Hoi S. A framework of sparse online learning and its applications. Comput Sci. 2015.
  10. Zhang T. Sparse online learning via truncated gradient. J Mach Learn Res. 2009;10:777–801.
  11. Maurya A. Bayesian optimization for predicting rare internal failures in manufacturing processes. In: IEEE international conference on Big Data. 2016.
  12. Galpert D, del Río S, Herrera F, Ancede-Gallardo E, Antunes A, Agüero-Chapin G. An effective Big Data supervised imbalanced classification approach for ortholog detection in related yeast species. BioMed Res Int. 2015;2015:748681.
  13. Tsai C, Lin W, Ke S. Big Data mining with parallel computing: a comparison of distributed and MapReduce methodologies. J Syst Softw. 2016;122:83–92.
  14. Triguero I, Galar M, Merino D, Maillo J, Bustince H, Herrera F. Evolutionary undersampling for extremely imbalanced Big Data classification under Apache Spark. In: IEEE congress on evolutionary computation (CEC). 2016.
  15. Khoshgoftaar TM, Seiffert C, Van Hulse J, Napolitano A, Folleco A. Learning with limited minority class data. In: Sixth international conference on machine learning and applications (ICMLA 2007), IEEE. 2007. pp. 348–53.
  16. Malhotra R. A systematic review of machine learning techniques for software fault prediction. Appl Soft Comput. 2015;27:504–18.
  17. Wang H, Khoshgoftaar TM, Napolitano A. An empirical investigation on Wrapper-Based feature selection for predicting software quality. Int J Softw Eng Knowl Eng. 2015;25(1):93–114.
  18. Yin L, Ge Y, Xiao K, Wang X, Quan X. Feature selection for high-dimensional imbalanced data. Neurocomputing. 2013;105:3–11.
  19. Zheng Z, Wu X, Srihari R. Feature selection for text categorization on imbalanced data. SIGKDD Explor Newsl. 2004;6(1):80–9.
  20. Seiffert C, Khoshgoftaar TM. RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern Part A. 2010;40(1):185–97.
  21. Graczyk M, Lasota T, Trawinski B, Trawinski K. Comparison of bagging, boosting and stacking ensembles applied to real estate appraisal. In: Asian conference on intelligent information and database systems. 2010. pp. 340–50.
  22. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
  23. Ho T. Random decision forests. In: Proceedings of the third international conference on document analysis and recognition. 1995.
  24. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
  25. Chawla N, Lazarevic A, Hall L, Bowyer K. SMOTEBoost: improving prediction of the minority class in boosting. In: 7th European conference on principles and practice of knowledge discovery in databases. 2003.
  26. Rodriguez D, Herraiz I, Harrison R, Dolado J, Riquelme J. Preliminary comparison of techniques for dealing with imbalance in software defect prediction. In: Proceedings of the 18th international conference on evaluation and assessment in software engineering. Article no. 43. 2014.
  27. Fernandez A, Rio S, Chawla N, Herrera F. An insight into imbalanced Big Data classification: outcomes and challenges. Complex Intell Syst. 2017;3:105–20.
  28. Cao P, Zhao D, Zaiane O. An optimized cost-sensitive SVM for imbalanced data learning. In: Pacific-Asia conference on knowledge discovery and data mining. 2013. pp. 280–92.
  29. Cao P, Zhao D, Zaiane O. A PSO-based cost-sensitive neural network for imbalanced data classification. In: Pacific-Asia conference on knowledge discovery and data mining. 2013. pp. 452–63.
  30. Li N, Tsang IW, Zhou Z-H. Efficient optimization of performance measures by classifier adaptation. IEEE Trans Pattern Anal Mach Intell. 2013;35(6):1370–82.
  31. López V, Fernandez A, Moreno-Torres J, Herrera F. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Syst Appl. 2012;39(7):6585–608.
  32. Kaminski B, Jakubczyk M, Szufel P. A framework for sensitivity analysis of decision trees. CEJOR. 2017;26(1):135–59.
  33. Akbani R, Kwek S, Japkowicz N. Applying support vector machines to imbalanced datasets. In: European conference on machine learning. 2004. pp. 39–50.
  34. Tang Y, Chawla N. SVMs modeling for highly imbalanced classification. IEEE Trans Syst Man Cybern. 2009;39(1):281–8.
  35. Bekkar M, Alitouche T. Imbalanced data learning approaches review. Int J Data Mining Knowl Manag Process. 2013;3(4):15–33.
  36. Galar M, Fernandez A, Barrenechea E, Bustince H, Herrera F. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern C Appl Rev. 2012;42(4):463–84.
  37. Guo H, Viktor HL. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explor Newsl. 2004;6(1):30–9.
Index Terms

Computer Science
Information Sciences

Keywords

Imbalance data, random over-sampling, random under-sampling, synthetic minority over-sampling technique