Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset

Mehdi Naseriparsa; Mohammad Mansour Riahi Kashani

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 77 - Number 3

Year of Publication: 2013

Authors: Mehdi Naseriparsa, Mohammad Mansour Riahi Kashani

DOI: 10.5120/13376-0987

Mehdi Naseriparsa, Mohammad Mansour Riahi Kashani. Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset. International Journal of Computer Applications. 77, 3 (September 2013), 33-38. DOI=10.5120/13376-0987

@article{ 10.5120/13376-0987,
author = { Mehdi Naseriparsa and Mohammad Mansour Riahi Kashani },
title = { Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset },
journal = { International Journal of Computer Applications },
issue_date = { September 2013 },
volume = { 77 },
number = { 3 },
month = { September },
year = { 2013 },
issn = { 0975-8887 },
pages = { 33-38 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume77/number3/13376-0987/ },
doi = { 10.5120/13376-0987 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T21:49:18.578266+05:30
%A Mehdi Naseriparsa
%A Mohammad Mansour Riahi Kashani
%T Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset
%J International Journal of Computer Applications
%@ 0975-8887
%V 77
%N 3
%P 33-38
%D 2013
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Classification algorithms have difficulty building reliable models on very large datasets. Such datasets contain many irrelevant and redundant features that mislead classifiers. Furthermore, many large datasets have an imbalanced class distribution, which biases the classification process toward the majority class. This paper proposes combining unsupervised dimensionality reduction with resampling and tests the approach on the Lung-Cancer dataset. In the first step, PCA is applied to the Lung-Cancer dataset to compact it and eliminate irrelevant features. In the second step, SMOTE resampling is carried out to balance the class distribution and increase the variety of the sample domain. Finally, a Naïve Bayes classifier is applied to the resulting dataset, and evaluation metrics are calculated and compared. The experiments show the effectiveness of the proposed method across four evaluation metrics: overall accuracy, false positive rate, precision, and recall.
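The resampling step and the four evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `smote` and `metrics`, the parameters `k` and `seed`, and the binary confusion-matrix setting are assumptions made for illustration, and the PCA and Naïve Bayes steps are omitted. The interpolation follows the general idea of SMOTE (a synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours), simplified so that base samples are drawn at random.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Simplified sketch: each synthetic point lies on the line segment between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours (Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority sample, nearest first.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def metrics(tp, fp, tn, fn):
    """Compute the four metrics from binary confusion-matrix counts.

    Assumes every denominator is non-zero (at least one actual positive,
    one actual negative, and one predicted positive).
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "false_positive_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }
```

For example, `smote(minority_rows, n_new=20, k=3)` would return 20 synthetic rows to append to the minority class before training, where `minority_rows` is a hypothetical list of feature vectors taken after dimensionality reduction.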

Index Terms

Computer Science

Information Sciences

Keywords

PCA, Irrelevant Features, Unsupervised Dimensionality Reduction, SMOTE