Protein Sequence Similarity Search Suitable for Parallel Implementation

Himanshu S. Mazumdar; Maulika S. Patel

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 50 - Number 22

Year of Publication: 2012

Authors: Himanshu S. Mazumdar, Maulika S. Patel

10.5120/7935-1246

Himanshu S. Mazumdar, Maulika S. Patel. Protein Sequence Similarity Search Suitable for Parallel Implementation. International Journal of Computer Applications. 50, 22 (July 2012), 1-3. DOI=10.5120/7935-1246

@article{10.5120/7935-1246,
  author = {Himanshu S. Mazumdar and Maulika S. Patel},
  title = {Protein Sequence Similarity Search Suitable for Parallel Implementation},
  journal = {International Journal of Computer Applications},
  issue_date = {July 2012},
  volume = {50},
  number = {22},
  month = {July},
  year = {2012},
  issn = {0975-8887},
  pages = {1-3},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume50/number22/7935-1246/},
  doi = {10.5120/7935-1246},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T20:48:58.999065+05:30
%A Himanshu S. Mazumdar
%A Maulika S. Patel
%T Protein Sequence Similarity Search Suitable for Parallel Implementation
%J International Journal of Computer Applications
%@ 0975-8887
%V 50
%N 22
%P 1-3
%D 2012
%I Foundation of Computer Science (FCS), NY, USA

Abstract

In the post-genomic era, a wealth of genomic and proteomic information is available, providing ample resources for applying computational and machine learning strategies to problems of biological relevance. Searching biological databases for similar or homologous sequences is a fundamental step in many bioinformatics tasks. On discovering a new protein sequence or drug, a biologist would like to confirm the discovery by comparing it against the largest available protein database. Alignment-based methods become too complex and time-consuming as the number of sequences grows, so alignment-free sequence comparison is often used as a filtering step before alignment is applied. This paper proposes a novel method of searching for similar sequences in a very large protein database. The method has two notable aspects. The first is a divide-and-conquer approach combined with a hashing-like scheme for indexing the large database; the index consists of the addresses of the 15-residue words in the UniRef100.fasta database. The second is the possibility of data parallelism, since the database is divided into m segments for indexing, which can further increase the efficiency of the algorithm. Creating the index is time-consuming, but the search time is constant and affordable. The method is particularly useful with large databases such as UniRef100.fasta, which contained 9,757,328 protein sequences as of May 2010. The index-based search algorithm is implemented in C# .NET.
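
The indexing idea in the abstract can be pictured with a minimal Python sketch. This is not the authors' C# .NET implementation: the function names, the dictionary used as the hashing-like index, the (sequence id, offset) form of an "address", and the contiguous split into m segments are all assumptions made for illustration. Only the 15-residue word length and the division into m segments come from the abstract.

```python
from collections import defaultdict

WORD_LEN = 15  # word length stated in the abstract


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_segment_index(records, first_id):
    """Index one segment: 15-residue word -> list of (sequence id, offset)."""
    index = defaultdict(list)
    for seq_id, (_, seq) in enumerate(records, start=first_id):
        for pos in range(len(seq) - WORD_LEN + 1):
            index[seq[pos:pos + WORD_LEN]].append((seq_id, pos))
    return index


def build_indexes(records, m):
    """Split records into at most m contiguous segments, indexed independently.

    Each segment is self-contained, so in principle every call to
    build_segment_index could run on a separate worker (assumed design).
    """
    size = max(1, -(-len(records) // m))  # ceiling division, at least 1
    return [build_segment_index(records[i:i + size], first_id=i)
            for i in range(0, len(records), size)]


def search(indexes, query):
    """Return {sequence id: number of query-word occurrences found in it}."""
    hits = defaultdict(int)
    for pos in range(len(query) - WORD_LEN + 1):
        word = query[pos:pos + WORD_LEN]
        for index in indexes:
            # .get avoids creating empty entries in the defaultdict
            for seq_id, _ in index.get(word, ()):
                hits[seq_id] += 1
    return dict(hits)
```

Each query word costs one dictionary lookup per segment, independent of how many sequences the segment holds, which is one plausible reading of the constant search time mentioned above. Sequences with the most word hits would then be candidates for a subsequent alignment step.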

Index Terms

Computer Science

Information Sciences

Keywords

15-residue words, proteins, indexing, divide and conquer