Protein Sequence Similarity Search Suitable for Parallel Implementation

International Journal of Computer Applications
© 2012 by IJCA Journal
Volume 50 - Number 22
Year of Publication: 2012
Himanshu S. Mazumdar
Maulika S. Patel

Himanshu S Mazumdar and Maulika S Patel. Protein Sequence Similarity Search Suitable for Parallel Implementation. International Journal of Computer Applications 50(22):1-3, July 2012.


Having entered the post-genomic era, we now have a plethora of genomic and proteomic information. This wealth of data allows computational and machine learning strategies to be applied to biologically relevant problems. Searching biological databases for similar or homologous sequences is a fundamental step in many bioinformatics tasks. On discovering a new protein sequence or drug, a biologist would like to confirm the discovery by comparing it with the largest available protein database. Alignment-based methods become too complex and time-consuming as the number of sequences increases. Alignment-free sequence comparison is therefore often used as a filtering step before alignment is applied. A novel method of searching for similar sequences in a huge protein database is proposed. The method has two interesting aspects. The first is a divide-and-conquer approach that uses a hashing-like scheme to index the large database. The index consists of the addresses of the 15-residue words in the UniRef100.fasta database. The second is the possibility of data parallelism, since the database is divided into m segments for indexing; this can further increase the efficiency of the algorithm. Creating the index is time-consuming, but the search time is constant and affordable. The method is particularly useful with large databases such as UniRef100.fasta, which consisted of 9757328 protein sequences as of May 2010. The index-based searching algorithm is implemented in C#.NET.
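
The indexing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' C#.NET implementation. It is written in Python, and every name in it (`build_segment_index`, `split_into_segments`, `build_indexes`, `search`) is hypothetical. It assumes that the index maps each 15-residue word to a list of (sequence id, offset) addresses, that the database is dealt round-robin into m segments, and that candidate sequences are ranked by the number of words they share with the query. The abstract does not specify any of these details.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

WORD_LEN = 15  # word length stated in the abstract


def build_segment_index(segment, word_len=WORD_LEN):
    """Map every word in one segment to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in segment:
        for pos in range(len(seq) - word_len + 1):
            index[seq[pos:pos + word_len]].append((seq_id, pos))
    return index


def split_into_segments(records, m):
    """Deal the records round-robin into m segments (an assumed partitioning)."""
    return [records[i::m] for i in range(m)]


def build_indexes(records, m):
    """Index the m segments independently; each segment needs no data from the others."""
    segments = split_into_segments(records, m)
    # Threads are used only to show that the work is independent per segment.
    # In CPython, a process pool would be needed for a real speed-up.
    with ThreadPoolExecutor(max_workers=m) as pool:
        return list(pool.map(build_segment_index, segments))


def search(indexes, query, word_len=WORD_LEN):
    """Rank database sequences by how many words they share with the query."""
    hits = defaultdict(int)
    for pos in range(len(query) - word_len + 1):
        word = query[pos:pos + word_len]
        for index in indexes:
            # One hash lookup per word per segment index.
            for seq_id, _offset in index.get(word, ()):
                hits[seq_id] += 1
    # Most shared words first; ties broken by sequence id.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Toy records standing in for parsed FASTA entries (made-up sequences).
    records = [
        ("P1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
        ("P2", "MKTAYIAKQRQISFVKAAAA"),
        ("P3", "GGGGGGGGGGGGGGGGGGGG"),
    ]
    indexes = build_indexes(records, m=2)
    print(search(indexes, "MKTAYIAKQRQISFVKSHFS"))
```

In this sketch each query word costs one hash lookup per segment index, so the lookup cost depends on the query length and on m rather than on the number of database sequences. The word-count ranking acts only as an alignment-free filter, and the surviving candidates could then be passed to an alignment method.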

