Prediction of Fault-Prone Software Modules using Statistical and Machine Learning Methods

International Journal of Computer Applications
© 2010 by IJCA Journal
Number 22 - Article 2
Year of Publication: 2010
Yogesh Singh
Arvinder Kaur
Ruchika Malhotra

Yogesh Singh, Arvinder Kaur and Ruchika Malhotra. Article: Prediction of Fault-Prone Software Modules using Statistical and Machine Learning Methods. International Journal of Computer Applications 1(22):6–13, February 2010. Published By Foundation of Computer Science.

BibTeX

	author = {Yogesh Singh and Arvinder Kaur and Ruchika Malhotra},
	title = {Article: Prediction of Fault-Prone Software Modules using Statistical and Machine Learning Methods},
	journal = {International Journal of Computer Applications},
	year = {2010},
	volume = {1},
	number = {22},
	pages = {6--13},
	month = {February},
	note = {Published By Foundation of Computer Science}


Demand for quality software has increased rapidly over the last few years. This has led to increased development of machine learning methods for exploring data sets, which can be used to construct models for predicting quality attributes such as fault proneness, maintenance effort, testing effort, productivity and reliability. This paper examines and compares logistic regression and six machine learning methods (artificial neural network, decision tree, support vector machine, cascade correlation network, group method of data handling polynomial method, and gene expression programming). These methods are explored empirically to determine the effect of static code metrics on the fault proneness of software modules. We use the publicly available AR1 data set to analyze and compare the regression and machine learning methods. The performance of the methods is compared by computing the area under the curve (AUC) from Receiver Operating Characteristic (ROC) analysis. The results show that the decision tree model achieves an AUC of 0.865 and outperforms the models built using logistic regression and the other machine learning methods. The study shows that machine learning methods are useful in constructing software quality models.
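
As an illustration of the comparison criterion, the area under the ROC curve can be computed directly from a model's predicted scores: it equals the probability that a randomly chosen fault-prone module receives a higher score than a randomly chosen non-fault-prone one. The following is a minimal Python sketch of that calculation, not the tooling used in the study; the function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from AR1.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: sequence of 0/1 values (1 = fault-prone module).
    scores: predicted scores, higher meaning more likely fault-prone.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one faulty and one non-faulty module")
    # Count (faulty, non-faulty) pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data for illustration only (not from the AR1 data set).
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so under this criterion a model with a higher AUC ranks fault-prone modules more reliably.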

