CFP last date
20 May 2024
Reseach Article

Searching the Referentially-compressed Genomes by Incomplete Patterns

by Mohammad Nassef, Amr Badr, Ibrahim Farag
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 88 - Number 1
Year of Publication: 2014
Authors: Mohammad Nassef, Amr Badr, Ibrahim Farag
10.5120/15313-3624

Mohammad Nassef, Amr Badr, Ibrahim Farag . Searching the Referentially-compressed Genomes by Incomplete Patterns. International Journal of Computer Applications. 88, 1 ( February 2014), 1-8. DOI=10.5120/15313-3624

@article{ 10.5120/15313-3624,
author = { Mohammad Nassef, Amr Badr, Ibrahim Farag },
title = { Searching the Referentially-compressed Genomes by Incomplete Patterns },
journal = { International Journal of Computer Applications },
issue_date = { February 2014 },
volume = { 88 },
number = { 1 },
month = { February },
year = { 2014 },
issn = { 0975-8887 },
pages = { 1-8 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume88/number1/15313-3624/ },
doi = { 10.5120/15313-3624 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:06:27.967423+05:30
%A Mohammad Nassef
%A Amr Badr
%A Ibrahim Farag
%T Searching the Referentially-compressed Genomes by Incomplete Patterns
%J International Journal of Computer Applications
%@ 0975-8887
%V 88
%N 1
%P 1-8
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Genome banks contain precious biological information that is mostly not discovered yet. Biologists in turn are keen to precisely explore these banks in order to discover effective patterns (such as motifs and retro-transposons) that have a real impact on the function and evolution of living creatures. Because the modern genome sequencing technologies produce genomes in high throughputs, many techniques have emerged to store genomes in the lowest possible space. Reference-based Compression algorithms (RbCs) efficiently compress the sequenced genomes by mainly storing their differences with respect to a reference genome. Therefore, RbCs give very high compression ratios compared to the traditional compression algorithms. However, in order to search a compressed genome for specific patterns, it has to be totally decompressed, wasting both time and storage. This paper introduces searching for either exact or incomplete patterns inside the referentially compressed genomes without their complete decompression. The introduced search methodolgy is based on instantly searching subsequent sequences that are partially decompressed from the compressed genome. Moreover, the same search process is allowed to simultaneously search for multiple patterns, thus saving more resources. The experimental results showed noticeable performance gains compared to traditionally searching the same compressed genomes after their complete referential decompression.

References
  1. R. A. Gibbs, J. W. Belmont, P. Hardenbol, T. D. Willis, F. Yu, et al. The international hapmap project. Nature, 426(6968):789–796, 2003.
  2. N. Siva. 1000 genomes project. Nature biotechnology, 26(3):256–256, 2008.
  3. G. M. Church. The personal genome project. Molecular Systems Biology, 1(1), 2005.
  4. S. Wandelt and U. Leser. String searching in referentially compressed genomes. In KDIR, pages 95–102, 2012.
  5. S. Wandelt, U. Leser, et al. Adaptive efficient compression of genomes. Algorithms for Molecular Biology, 7:30, 2012.
  6. B. G. Chern, I. Ochoa, A. Manolakos, A. No, K. Venkat, and T. Weissman. Reference based genome compression. In Information Theory Workshop (ITW), 2012 IEEE, pages 427–431. IEEE, 2012.
  7. A. D. Wyner and J. Ziv. The sliding-window lempel-ziv algorithm is asymptotically optimal. Proceedings of the IEEE, 82(6):872–877, 1994.
  8. C. Wang and D. Zhang. A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Research, 39(7):e45–e45, 2011.
  9. A. J. Pinho, D. Pratas, and S. P. Garcia. Green: a tool for efficient compression of genome resequencing data. Nucleic Acids Research, 40(4):e27–e27, 2012.
  10. M. Nassef, A. Badr, and I. Farag. An algorithm for browsing the referentially-compressed genomes. International Journal of Computer Applications, 86(8):1–10, January 2014. Published by Foundation of Computer Science, New York, USA.
  11. A. Cornish-Bowden. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic acids research, 13(9):3021, 1985.
  12. Python's fast search implementation using a mix between boyer-moore and horspool algorithms. http://svn. python. org/view/python/trunk/ Objects/stringlib/fastsearch. h?revision= 68811&view=markup.
  13. R. S. Boyer and J. S. Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762–772, 1977.
  14. R. N. Horspool. Practical fast searching in strings. Software: Practice and Experience, 10(6):501–506, 1980.
Index Terms

Computer Science
Information Sciences

Keywords

Reference-based Genome Compression Partial Genome Decompression Searching Compressed Genome Incomplete Patterns inCompressi.