Research Article

Comparative Analysis of Different Imputation Methods to Treat Missing Values in Data Mining Environment

by Rahul Singhai
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 82 - Number 6
Year of Publication: 2013
Authors: Rahul Singhai
10.5120/14122-2236

Rahul Singhai. Comparative Analysis of Different Imputation Methods to Treat Missing Values in Data Mining Environment. International Journal of Computer Applications. 82, 6 (November 2013), 34-42. DOI=10.5120/14122-2236

@article{ 10.5120/14122-2236,
author = { Rahul Singhai },
title = { Comparative Analysis of Different Imputation Methods to Treat Missing Values in Data Mining Environment },
journal = { International Journal of Computer Applications },
issue_date = { November 2013 },
volume = { 82 },
number = { 6 },
month = { November },
year = { 2013 },
issn = { 0975-8887 },
pages = { 34-42 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume82/number6/14122-2236/ },
doi = { 10.5120/14122-2236 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:57:05.885469+05:30
%A Rahul Singhai
%T Comparative Analysis of Different Imputation Methods to Treat Missing Values in Data Mining Environment
%J International Journal of Computer Applications
%@ 0975-8887
%V 82
%N 6
%P 34-42
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Data cleaning is an important step of the KDD (Knowledge Discovery in Databases) process. One critical problem in data cleaning is the presence of missing values. Various approaches have been proposed to find and replace such missing data, including use of the mean value, use of a global constant, and replacement by the most probable value. Imputation is an important statistical procedure used to replace the missing values in a data set. One advantage of this approach is that the missing data treatment is independent of the learning algorithm used, which allows the user to select the most suitable imputation method for each situation. This paper analyzes six different imputation methods proposed in the field of statistics and implements them in a data mining environment. An artificial data set of 1000 records is used to analyze the performance of these methods, and a Z-test approach is used to test their significance. Exhaustive experiments show the effectiveness of the proposed methods. All attributes of the input data are assumed to be of numeric data type.
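
As an illustration of two of the simple replacement strategies named in the abstract (mean value and global constant), the following minimal Python sketch may help. It is not the paper's implementation and does not reproduce its six methods; the function names, the use of `None` to mark a missing value, and the one-sample form of the z statistic are all assumptions made for this example.

```python
import math


def impute_constant(values, constant=0.0):
    """Replace each missing entry (None) with a fixed global constant."""
    return [constant if v is None else v for v in values]


def impute_mean(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def z_statistic(sample, population_mean, population_sd):
    """One-sample z statistic (assumed form; the abstract does not specify it).

    Compares the sample mean against a known population mean, given a
    known population standard deviation.
    """
    n = len(sample)
    sample_mean = sum(sample) / n
    return (sample_mean - population_mean) / (population_sd / math.sqrt(n))


# Hypothetical numeric attribute with two missing values.
attribute = [4.0, None, 6.0, 8.0, None]
filled_by_mean = impute_mean(attribute)          # gaps become 6.0
filled_by_constant = impute_constant(attribute)  # gaps become 0.0
```

Both functions operate on a single numeric attribute, in line with the abstract's assumption that all attributes are numeric. Because the treatment happens before any learning algorithm sees the data, either function can be swapped for another imputation method without changing the downstream mining step.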

Index Terms

Computer Science
Information Sciences

Keywords

KDD, Data mining, Imputation methods, Data pre-processing, Sampling, Attribute missing values