Research Article

Machine Learning based Approach for protein Function Prediction using Sequence Derived Properties

by Amit Bhola, Sanjeev Kumar Yadav, Arvind Kumar Tiwari
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 105 - Number 12
Year of Publication: 2014
10.5120/18429-9789

Amit Bhola, Sanjeev Kumar Yadav, Arvind Kumar Tiwari. Machine Learning based Approach for protein Function Prediction using Sequence Derived Properties. International Journal of Computer Applications. 105, 12 (November 2014), 17-21. DOI=10.5120/18429-9789

@article{ 10.5120/18429-9789,
author = { Amit Bhola and Sanjeev Kumar Yadav and Arvind Kumar Tiwari },
title = { Machine Learning based Approach for protein Function Prediction using Sequence Derived Properties },
journal = { International Journal of Computer Applications },
issue_date = { November 2014 },
volume = { 105 },
number = { 12 },
month = { November },
year = { 2014 },
issn = { 0975-8887 },
pages = { 17-21 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume105/number12/18429-9789/ },
doi = { 10.5120/18429-9789 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:37:32.298224+05:30
%A Amit Bhola
%A Sanjeev Kumar Yadav
%A Arvind Kumar Tiwari
%T Machine Learning based Approach for protein Function Prediction using Sequence Derived Properties
%J International Journal of Computer Applications
%@ 0975-8887
%V 105
%N 12
%P 17-21
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Protein function prediction is an important and challenging problem in bioinformatics. Various machine learning approaches have been proposed to predict protein functions from sequence-derived properties. In this paper, 857 sequence-derived features, including amino acid composition, dipeptide composition, correlation, composition, transition and distribution (CTD), and pseudo amino acid composition, are used with several machine learning classifiers, namely Random Forest, Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), and fuzzy k-NN, to predict protein functions. Feature selection techniques such as Correlation Feature Selection, Gain Ratio, Information Gain, OneR attribute evaluation, and ReliefF are used to select an optimal subset of features. The performance of each classifier is evaluated on the optimal feature subsets obtained by each feature selection technique. The comparative analysis shows that the Random Forest method with ReliefF achieves an overall accuracy of 89.20% and a Matthews correlation coefficient (MCC) of 0.87, which is better than the other methods.
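As an illustration of two of the feature families named in the abstract, the sketch below computes amino acid composition (20 values) and dipeptide composition (400 values) for a single sequence. It uses the standard textbook definitions of these descriptors; the abstract does not give the authors' exact formulas, so this is an assumption rather than their implementation. The function names are hypothetical, and the remaining descriptors (correlation, CTD, pseudo amino acid composition) that make up the rest of the 857 features are not covered.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(seq):
    """Fraction of each standard residue in the sequence (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among adjacent pairs (400 values)."""
    seq = seq.upper()
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        raise ValueError("need at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs containing non-standard symbols are skipped
            counts[pair] += 1
    return [counts[dp] / n_pairs for dp in DIPEPTIDES]


def sequence_features(seq):
    """Concatenate both descriptors into one 420-dimensional feature vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)
```

A vector built this way for each protein could then be passed to a feature selection step and a classifier such as a random forest, in the spirit of the pipeline the abstract describes.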

References
  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.
  2. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85:2444-2448.
  3. Lee, Bum Ju, et al. "Identification of protein functions using a machine-learning approach based on sequence-derived properties." Proteome science 7.1, 2009: 27.
  4. Statnikov, Alexander, and Constantin F. Aliferis. "Are random forests better than support vector machines for microarray-based cancer classification?" AMIA annual symposium proceedings. Vol. 2007. American Medical Informatics Association, 2007.
  5. Cai CZ, Han LY, Ji ZL, Chen X, and Chen YZ: SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res 2003, 31:3692-3697.
  6. Breiman L: Random forests. In Machine Learning Edited by: Schapire RE. Netherlands: Springer; 2001:5-32.
  7. Cai CZ, Wang WL, Sun LZ, Chen YZ: Protein function classification via support vector machine approach. Math Biosci 2003, 185:111-122.
  8. Suykens, Johan AK, and Joos Vandewalle. "Least squares support vector machine classifiers." Neural processing letters 9.3, 1999: 293-300.
  9. Yuan Z, Burrage K, Mattick JS: Prediction of protein solvent accessibility using support vector machines. Proteins 2002, 48:566-570.
  10. Cunningham, Padraig, and Sarah Jane Delany. "k-Nearest neighbour classifiers." Multiple Classifier Systems, 2007: 1-17.
  11. Keller, James M., Michael R. Gray, and James A. Givens. "A fuzzy k-nearest neighbor algorithm." Systems, Man and Cybernetics, IEEE Transactions on 4, 1985: 580-585.
  12. Krishnaveni, M., and V. Radha. "Performance evaluation of Statistical classifiers using Indian Sign language datasets." International Journal of Computer Science, Engineering and Applications (IJCSEA), 2011, 1.5:167-175.
Index Terms

Computer Science
Information Sciences

Keywords

Protein function, Classification, Random Forest, SVM, k-NN, fuzzy k-NN