Research Article

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset

by Mehdi Naseriparsa, Mohammad Mansour Riahi Kashani
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 77 - Number 3
Year of Publication: 2013
Authors: Mehdi Naseriparsa, Mohammad Mansour Riahi Kashani
DOI: 10.5120/13376-0987

Mehdi Naseriparsa, Mohammad Mansour Riahi Kashani. Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset. International Journal of Computer Applications. 77, 3 (September 2013), 33-38. DOI=10.5120/13376-0987

@article{ 10.5120/13376-0987,
author = { Mehdi Naseriparsa and Mohammad Mansour Riahi Kashani },
title = { Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset },
journal = { International Journal of Computer Applications },
issue_date = { September 2013 },
volume = { 77 },
number = { 3 },
month = { September },
year = { 2013 },
issn = { 0975-8887 },
pages = { 33-38 },
numpages = {6},
url = { https://ijcaonline.org/archives/volume77/number3/13376-0987/ },
doi = { 10.5120/13376-0987 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:49:18.578266+05:30
%A Mehdi Naseriparsa
%A Mohammad Mansour Riahi Kashani
%T Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset
%J International Journal of Computer Applications
%@ 0975-8887
%V 77
%N 3
%P 33-38
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Classification algorithms are unable to build reliable models on very large datasets. Such datasets contain many irrelevant and redundant features that mislead classifiers. Furthermore, many large datasets have an imbalanced class distribution, which biases the classification process toward the majority class. In this paper, a combination of unsupervised dimensionality reduction and resampling is proposed, and the results are tested on the Lung-Cancer dataset. In the first step, PCA is applied to the Lung-Cancer dataset to compact it and eliminate irrelevant features; in the second step, SMOTE resampling is carried out to balance the class distribution and increase the variety of the sample domain. Finally, a Naïve Bayes classifier is applied to the resulting dataset, and the results are compared using four evaluation metrics. The experiments show the effectiveness of the proposed method across these metrics: overall accuracy, false positive rate, precision, and recall.
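
The three steps described in the abstract (PCA, then SMOTE, then Naïve Bayes) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the data is synthetic and binary rather than the Lung-Cancer dataset, the number of components (5), the neighbour count (k = 5), and all function names are hypothetical choices, a Gaussian Naïve Bayes variant is assumed, and the metrics are computed on the resampled data itself purely for brevity.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the feature mean and the top principal axes of X (rows = samples)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, axes):
    return (X - mean) @ axes.T

def smote(X_min, n_new, k, rng):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    k = min(k, len(X_min) - 1)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])

def gnb_fit(X, y):
    """Fit per-class feature means, variances, and class priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, variances, priors

def gnb_predict(model, X):
    classes, means, variances, priors = model
    diff = X[:, None, :] - means[None, :, :]
    # Sum of per-feature Gaussian log-densities for every (sample, class) pair.
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None]).sum(axis=2)
    return classes[np.argmax(np.log(priors)[None] + log_lik, axis=1)]

def binary_metrics(y_true, y_pred, positive=1):
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Synthetic, imbalanced toy data (assumption): 90 majority vs 10 minority samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 20)),
               rng.normal(1.5, 1.0, size=(10, 20))])
y = np.array([0] * 90 + [1] * 10)

# Step 1: PCA compacts the 20 features into 5 components.
mean, axes = pca_fit(X, n_components=5)
Z = pca_transform(X, mean, axes)

# Step 2: SMOTE adds 80 synthetic minority samples to balance the classes.
Z_min = Z[y == 1]
Z_new = smote(Z_min, n_new=80, k=5, rng=rng)
Z_bal = np.vstack([Z, Z_new])
y_bal = np.concatenate([y, np.ones(80, dtype=int)])

# Step 3: Naive Bayes on the reduced, balanced data, then the four metrics.
model = gnb_fit(Z_bal, y_bal)
metrics = binary_metrics(y_bal, gnb_predict(model, Z_bal))
print(metrics)
```

In a real evaluation, the PCA projection and the SMOTE resampling would normally be fitted on the training split only, so that synthetic samples do not leak into the test data.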

Index Terms

Computer Science
Information Sciences

Keywords

PCA, Irrelevant Features, Unsupervised Dimensionality Reduction, SMOTE