Research Article

Improving the Classification accuracy of Noisy Dataset by Effective Data Preprocessing

by K. V. Uma
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 180 - Number 36
Year of Publication: 2018
Authors: K. V. Uma
10.5120/ijca2018916908

K. V. Uma. Improving the Classification accuracy of Noisy Dataset by Effective Data Preprocessing. International Journal of Computer Applications. 180, 36 (Apr 2018), 37-46. DOI=10.5120/ijca2018916908

@article{ 10.5120/ijca2018916908,
author = { K. V. Uma },
title = { Improving the Classification accuracy of Noisy Dataset by Effective Data Preprocessing },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2018 },
volume = { 180 },
number = { 36 },
month = { Apr },
year = { 2018 },
issn = { 0975-8887 },
pages = { 37-46 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume180/number36/29302-2018916908/ },
doi = { 10.5120/ijca2018916908 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:02:52.875987+05:30
%A K. V. Uma
%T Improving the Classification accuracy of Noisy Dataset by Effective Data Preprocessing
%J International Journal of Computer Applications
%@ 0975-8887
%V 180
%N 36
%P 37-46
%D 2018
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Decision trees are a commonly used technique in data mining. Open issues in decision tree algorithms include handling continuous attributes and missing values, avoiding overfitting, and dealing with super attributes. Handling noisy data is a challenging problem in data mining research. Noisy data is meaningless data: it unnecessarily increases the storage space required and can adversely affect the results of any data mining analysis, which makes prediction from such data difficult. Commonly used algorithms for classification problems include decision stumps, ensemble models, SVMs, and decision tree algorithms. These algorithms achieve lower accuracy on noisy data than on noise-free data. In this paper, data is collected, noise is added to it, and the data is then preprocessed to handle missing values. The preprocessed data is provided as input to a feature selection step, in which the most relevant features are selected using correlation-based feature subset selection. The selected features are provided as input to the Credal C4.5 algorithm, which constructs the decision tree. The results are analyzed on various datasets at noise levels of 5%, 10%, 20%, and 30%. The technique improves accuracy by 1-5% compared to the existing results.
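The preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, and it rests on several assumptions the abstract does not confirm: that the injected noise is class-label noise, that missing numeric values are filled with the column mean, and that feature subsets are scored with a correlation-based merit of the form k·r_cf / sqrt(k + k(k−1)·r_ff), using Pearson correlation as a stand-in for whatever measure the paper uses. All function names are hypothetical, and the Credal C4.5 classifier itself is not reproduced here.

```python
import math
import random


def add_label_noise(labels, rate, classes, seed=0):
    """Flip a fraction `rate` of labels to a different, randomly chosen class.

    Assumption: noise is applied to class labels only.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy


def impute_mean(column):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cfs_merit(features, target):
    """Score a feature subset: high feature-class correlation, low feature-feature correlation."""
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = (
        sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
        if pairs
        else 0.0
    )
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


if __name__ == "__main__":
    labels = [0] * 50 + [1] * 50
    for rate in (0.05, 0.10, 0.20, 0.30):  # noise levels from the abstract
        noisy = add_label_noise(labels, rate, classes=[0, 1])
        flipped = sum(a != b for a, b in zip(labels, noisy))
        print(f"noise rate {rate:.2f}: {flipped} labels flipped")
```

In this sketch, a subset is preferred when its merit is higher; a search over subsets (for example, greedy forward selection) would sit on top of `cfs_merit`, and the selected columns would then be passed to the decision tree learner.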

Index Terms

Computer Science
Information Sciences

Keywords

Classification, Noisy Data, Feature Selection, Data Preprocessing