Research Article

iHHO-SMOTe: A Cleansed Approach for Handling Outliers and Reducing Noise to Improve Imbalanced Data Classification

by Khaled SH. Raslan, Almohammady S. Alsharkawy, K.R. Raslan
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 32
Year of Publication: 2024
Authors: Khaled SH. Raslan, Almohammady S. Alsharkawy, K.R. Raslan
10.5120/ijca2024923849

Khaled SH. Raslan, Almohammady S. Alsharkawy, K.R. Raslan. iHHO-SMOTe: A Cleansed Approach for Handling Outliers and Reducing Noise to Improve Imbalanced Data Classification. International Journal of Computer Applications. 186, 32 (Aug 2024), 1-10. DOI=10.5120/ijca2024923849

@article{ 10.5120/ijca2024923849,
author = { Khaled SH. Raslan, Almohammady S. Alsharkawy, K.R. Raslan },
title = { iHHO-SMOTe: A Cleansed Approach for Handling Outliers and Reducing Noise to Improve Imbalanced Data Classification },
journal = { International Journal of Computer Applications },
issue_date = { Aug 2024 },
volume = { 186 },
number = { 32 },
month = { Aug },
year = { 2024 },
issn = { 0975-8887 },
pages = { 1-10 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume186/number32/ihho-smote-a-cleansed-approach-for-handling-outliers-and-reducing-noise-to-improve-imbalanced-data-classification/ },
doi = { 10.5120/ijca2024923849 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-08-05T23:36:31+05:30
%A Khaled SH. Raslan
%A Almohammady S. Alsharkawy
%A K.R. Raslan
%T iHHO-SMOTe: A Cleansed Approach for Handling Outliers and Reducing Noise to Improve Imbalanced Data Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 32
%P 1-10
%D 2024
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Classifying imbalanced datasets remains a significant challenge in machine learning, particularly with big data, where instances are unevenly distributed among classes and the resulting class imbalance degrades classifier performance. The Synthetic Minority Over-sampling Technique (SMOTE) addresses this challenge by generating new instances for the under-represented minority class, but noise and outliers interfere with the creation of new samples. This paper proposes iHHO-SMOTe, an approach that addresses these limitations of SMOTE by first cleansing the data of noise points. The process applies feature selection with a random forest to identify the most valuable features, then uses the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to detect outliers based on the selected features. The outliers identified in the minority classes are removed, producing a refined dataset that is then oversampled with the hybrid iHHO-SMOTe approach. Comprehensive experiments across diverse datasets demonstrate the strong performance of the proposed model, with an AUC score exceeding 0.99, a G-means score of 0.99 that highlights its robustness, and an F1-score consistently exceeding 0.967. Together, these findings establish Cleansed iHHO-SMOTe as a strong contender for handling imbalanced datasets, with its focus on noise reduction and outlier handling leading to improved classification models.
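The cleanse-then-oversample idea described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the random-forest feature selection and the HHO component are omitted, plain SMOTE interpolation stands in for iHHO-SMOTe, and the noise detection is run on the minority class alone. The function names, the toy data, and the values of `eps`, `min_samples`, and `k` are all assumptions made for illustration.

```python
import math
import random


def dbscan_noise_mask(points, eps, min_samples):
    """Return True for each point that DBSCAN would label as noise.

    A core point has at least min_samples points (itself included) within
    eps. A noise point is neither a core point nor within eps of one.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    return [
        not is_core[i] and not any(is_core[j] for j in neighbors[i])
        for i in range(n)
    ]


def smote(minority, n_new, k, rng):
    """Plain SMOTE: interpolate between a random minority point and one of
    its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (q - b) for b, q in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class samples: a tight cluster plus one outlier.
minority = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4),
    (10.0, 10.0),
]

# Step 1: flag noise points in the minority class and drop them.
noise = dbscan_noise_mask(minority, eps=0.5, min_samples=3)
cleaned = [p for p, is_noise in zip(minority, noise) if not is_noise]

# Step 2: oversample only the cleaned minority points.
synthetic = smote(cleaned, n_new=4, k=3, rng=random.Random(0))
```

Because the outlier at (10, 10) is removed before oversampling, no synthetic point is interpolated toward it; every new sample stays inside the region spanned by the remaining minority cluster.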

Index Terms

Computer Science
Information Sciences

Keywords

Noised Data, Data Cleansing, Imbalanced Datasets, HHO, SMOTE, DBSCAN, Random Forest