Research Article

An Optimistic Data Mining Approach for Handling Large Data Set using Data Partitioning Techniques

by Dipak V. Patil, R. S. Bichkar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 24 - Number 3
Year of Publication: 2011
DOI: 10.5120/2930-3878

Dipak V. Patil, R. S. Bichkar. An Optimistic Data Mining Approach for Handling Large Data Set using Data Partitioning Techniques. International Journal of Computer Applications. 24, 3 (June 2011), 29-33. DOI=10.5120/2930-3878

@article{ 10.5120/2930-3878,
author = { Dipak V. Patil, R. S. Bichkar },
title = { An Optimistic Data Mining Approach for Handling Large Data Set using Data Partitioning Techniques },
journal = { International Journal of Computer Applications },
issue_date = { June 2011 },
volume = { 24 },
number = { 3 },
month = { June },
year = { 2011 },
issn = { 0975-8887 },
pages = { 29-33 },
numpages = { 5 },
url = { https://ijcaonline.org/archives/volume24/number3/2930-3878/ },
doi = { 10.5120/2930-3878 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Dipak V. Patil
%A R. S. Bichkar
%T An Optimistic Data Mining Approach for Handling Large Data Set using Data Partitioning Techniques
%J International Journal of Computer Applications
%@ 0975-8887
%V 24
%N 3
%P 29-33
%D 2011
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The use of the Internet for various purposes leads to the collection of large volumes of data. The knowledge content of such data can be utilized to improve an organization's decision-making processes. However, knowledge discovery on high-volume data becomes very slow, as it must be performed serially on currently available terabyte-plus data sets, and in some cases mining a large data set may become impossible due to processor and memory limitations. The proposed algorithm is based on Tim Oates and David Jensen's [1] finding that increasing the size of the training data does not considerably increase the classification accuracy of a classifier. The algorithm also follows the survival-of-the-fittest principle used in genetic algorithms. The solution is a partitioning algorithm wherein decision trees are learned on partitions of the data, i.e., disjoint subsets of the complete data set. These learned decision trees have accuracies comparable to one another and to the tree learned on the complete data set. The algorithm selects the single tree with the highest accuracy among the learned decision trees, and this tree is used to classify unseen data. Results on 12 benchmark data sets from the UCI repository indicate that the final learned decision tree has accuracy equal to, and in many cases significantly better than, that of a decision tree learned on the entire data set. An experiment on the big data set Census-Income (KDD) also supports this claim. The most important aspect of this approach is that it is very simple compared to other methods, while offering enhanced classification performance.
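
To make the partition-train-select idea in the abstract concrete, here is a minimal sketch in Python, with scikit-learn's DecisionTreeClassifier standing in for the C4.5 learner used in the paper. The partition count, the 70/30 hold-out split used as the fitness measure, and the breast-cancer data set are illustrative assumptions, not the authors' experimental settings.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def best_partition_tree(X, y, n_partitions=4, seed=0):
    # Shuffle once, then split the index into disjoint partitions.
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), n_partitions)
    best_tree, best_acc = None, -1.0
    for part in parts:
        # Learn one tree per partition; hold out 30% of the partition
        # to score its fitness (validation accuracy).
        X_tr, X_val, y_tr, y_val = train_test_split(
            X[part], y[part], test_size=0.3, random_state=seed)
        tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        acc = tree.score(X_val, y_val)
        # Survival of the fittest: keep only the most accurate tree.
        if acc > best_acc:
            best_tree, best_acc = tree, acc
    return best_tree, best_acc

X, y = load_breast_cancer(return_X_y=True)  # stand-in for a UCI data set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree, val_acc = best_partition_tree(X_train, y_train)
print(f"selected tree, validation accuracy: {val_acc:.3f}")
print(f"accuracy on unseen data: {tree.score(X_test, y_test):.3f}")

Each partition sees only a fraction of the training data, so each candidate tree is cheap to learn; the selection step then trades the set of candidates for a single classifier, which is what distinguishes this scheme from voting-based combinations of parallel trees such as [12].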

References
  1. Tim Oates and David Jensen (1997). The Effects of Training Set Size on Decision Tree Complexity. Proc. 14th International Conference on Machine Learning, pp. 254-262.
  2. Tom Mitchell (1997). Machine Learning. The McGraw-Hill Companies, Inc.
  3. S. Rajasekaran and G. A. Vijayalakshmi Pai (2004). Neural Networks, Fuzzy Logic and Genetic Algorithms: Synthesis and Applications. Prentice-Hall of India.
  4. D. E. Goldberg (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.
  5. M. Mehta, R. Agrawal, and J. Rissanen (1996). SLIQ: A fast scalable classifier for data mining. Proc. Fifth International Conference on Extending Database Technology, Avignon, France, pp. 18-32.
  6. J. R. Quinlan (1993). C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann.
  7. S. Ruggieri (2002). Efficient C4.5. IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 2, pp. 438-444.
  8. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.
  9. S. K. Murthy (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, Vol. 2, No. 4, pp. 345-389.
  10. A. Papagelis and D. Kalles (2000). GATree: Genetically evolved decision trees. Proc. 12th International Conference on Tools with Artificial Intelligence, pp. 203-206.
  11. Tim Oates and David Jensen (1998). Large Datasets Lead to Overly Complex Models: An Explanation and a Solution. Proc. Fourth International Conference on Knowledge Discovery and Data Mining, August 1998.
  12. L. O. Hall, N. Chawla, and K. Bowyer (1998). Combining decision trees learned in parallel. Distributed Data Mining Workshop at the International Conference on Knowledge Discovery and Data Mining, pp. 77-83.
  13. Chhanda Ray (2009). Distributed Database Systems. Pearson Education India, pp. 26-29.
  14. A. Frank and A. Asuncion (2010). UCI Machine Learning Repository, http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
  15. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, Vol. 11, Issue 1.
Index Terms

Computer Science
Information Sciences

Keywords

Data partitioning, decision tree, survival of the fittest