Data Classification based on Decision Tree, Rule Generation, Bayes and Statistical Methods: An Empirical Comparison

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2015
Sanjib Saha, Debashis Nandi

Sanjib Saha and Debashis Nandi. Article: Data Classification based on Decision Tree, Rule Generation, Bayes and Statistical Methods: An Empirical Comparison. International Journal of Computer Applications 129(7):36-41, November 2015. Published by Foundation of Computer Science (FCS), NY, USA.

BibTeX

	author = {Sanjib Saha and Debashis Nandi},
	title = {Article: Data Classification based on Decision Tree, Rule Generation, Bayes and Statistical Methods: An Empirical Comparison},
	journal = {International Journal of Computer Applications},
	year = {2015},
	volume = {129},
	number = {7},
	pages = {36-41},
	month = {November},
	note = {Published by Foundation of Computer Science (FCS), NY, USA}


In this paper, twenty well-known data mining classification methods are applied to ten medical datasets from the UCI machine learning repository, and their performance is compared empirically while varying the number of categorical and numeric attributes, the types of attributes and the number of instances in the datasets. Classification Accuracy (CA), Root Mean Square Error (RMSE) and Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) are used as the performance metrics. The study yields the following findings: (i) the performance of a classification method depends on the type of dataset attributes, that is, categorical, numeric or both (mixed); (ii) classification methods perform better on categorical attributes than on numeric attributes; (iii) the classification accuracy, RMSE and AUC of a classification method depend on the number of instances in the dataset; (iv) classification performance decreases as the number of instances decreases, for both categorical and numeric datasets; (v) the top three classification methods are identified for categorical, numeric and mixed attribute datasets by comparing the performance of all twenty methods; (vi) of these twenty methods, Bayes Net, Naïve Bayes, Classification Via Regression, Logistic Regression and Random Forest perform best on these medical datasets.
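The three metrics named in the abstract can be illustrated with a minimal sketch. It assumes a binary problem in which a classifier outputs a probability for the positive class; the function names, the 0.5 threshold and the toy labels and scores are hypothetical and are not taken from the paper. RMSE is computed here against the 0/1 label, which is one common convention and not necessarily the exact definition used in the study, and AUC is computed with the pairwise (Mann-Whitney) formulation.

```python
import math

def classification_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of instances whose thresholded prediction equals the label."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)

def rmse(y_true, y_prob):
    """Root mean square error between predicted probability and 0/1 label."""
    squared = [(p - t) ** 2 for p, t in zip(y_prob, y_true)]
    return math.sqrt(sum(squared) / len(y_true))

def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win (Mann-Whitney formulation of ROC AUC).
    """
    pos = [p for p, t in zip(y_prob, y_true) if t == 1]
    neg = [p for p, t in zip(y_prob, y_true) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: true labels and predicted positive-class probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]

print("CA  :", classification_accuracy(y_true, y_prob))
print("RMSE:", rmse(y_true, y_prob))
print("AUC :", auc(y_true, y_prob))
```

Higher CA and AUC indicate better performance, while lower RMSE is better, so the three metrics can rank the same set of classifiers differently.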




Data Mining; Classification; Classification Accuracy; RMSE; ROC; Confusion Matrix