
Machine Learning based Approach for protein Function Prediction using Sequence Derived Properties

International Journal of Computer Applications
© 2014 by IJCA Journal
Volume 105 - Number 12
Year of Publication: 2014
Amit Bhola
Sanjeev Kumar Yadav
Arvind Kumar Tiwari

Amit Bhola, Sanjeev Kumar Yadav and Arvind Kumar Tiwari. Article: Machine Learning based Approach for protein Function Prediction using Sequence Derived Properties. International Journal of Computer Applications 105(12):17-21, November 2014. Full text available.

BibTeX

	author = {Amit Bhola and Sanjeev Kumar Yadav and Arvind Kumar Tiwari},
	title = {Article: Machine Learning based Approach for protein Function Prediction using Sequence Derived Properties},
	journal = {International Journal of Computer Applications},
	year = {2014},
	volume = {105},
	number = {12},
	pages = {17-21},
	month = {November},
	note = {Full text available}


Protein function prediction is an important and challenging problem in bioinformatics. Various machine learning based approaches have been proposed to predict protein functions using sequence-derived properties. In this paper, 857 sequence-derived features, such as amino acid composition, dipeptide composition, correlation, composition, transition and distribution, and pseudo amino acid composition, are used with machine learning classifiers such as Random Forest, Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), and fuzzy k-Nearest Neighbor (fuzzy k-NN) to predict protein functions. Feature selection techniques such as Correlation Feature Selection, Gain Ratio, Information Gain, OneR attribute evaluation, and ReliefF are used to select an optimal subset of features. The performance of each classifier is evaluated with the optimal features obtained by each feature selection technique. The comparative analysis shows that the Random Forest based method with ReliefF provides an overall accuracy of 89.20% and a Matthews correlation coefficient (MCC) of 0.87, which is better than the other methods.
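
To make two of the feature families named in the abstract concrete, the following is a minimal sketch of amino acid composition (20 values) and dipeptide composition (400 values) under their usual textbook definitions. It is not the authors' implementation: the function names and the example peptide are hypothetical, and the remaining feature groups that make up the 857 features (correlation, composition/transition/distribution, pseudo amino acid composition) are not shown.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard residue in seq (20 values)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the fraction of each ordered residue pair in seq (400 values)."""
    seq = seq.upper()
    valid = set(AMINO_ACIDS)
    # Overlapping adjacent pairs; pairs with non-standard symbols are skipped.
    pairs = [
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in valid and seq[i + 1] in valid
    ]
    if not pairs:
        raise ValueError("sequence contains no standard dipeptides")
    counts = Counter(pairs)
    return [counts[a + b] / len(pairs) for a, b in product(AMINO_ACIDS, repeat=2)]


def sequence_features(seq):
    """Concatenate both compositions into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


# Hypothetical peptide, used only to show the call.
features = sequence_features("MKVLAAGIVGL")
```

A vector like `features` would be one row of the matrix passed to feature selection and then to a classifier; fixing the residue order up front keeps each column tied to the same residue or residue pair across all proteins.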

