CFP last date
20 May 2024
Reseach Article

Efficient Algorithm for Extracting Complete Repeats from Biological Sequences

by Munina Yusufu, Gulina Yusufu
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 128 - Number 16
Year of Publication: 2015
Authors: Munina Yusufu, Gulina Yusufu
10.5120/ijca2015906752

Munina Yusufu, Gulina Yusufu . Efficient Algorithm for Extracting Complete Repeats from Biological Sequences. International Journal of Computer Applications. 128, 16 ( October 2015), 33-37. DOI=10.5120/ijca2015906752

@article{ 10.5120/ijca2015906752,
author = { Munina Yusufu, Gulina Yusufu },
title = { Efficient Algorithm for Extracting Complete Repeats from Biological Sequences },
journal = { International Journal of Computer Applications },
issue_date = { October 2015 },
volume = { 128 },
number = { 16 },
month = { October },
year = { 2015 },
issn = { 0975-8887 },
pages = { 33-37 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume128/number16/22960-2015906752/ },
doi = { 10.5120/ijca2015906752 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:21:54.760501+05:30
%A Munina Yusufu
%A Gulina Yusufu
%T Efficient Algorithm for Extracting Complete Repeats from Biological Sequences
%J International Journal of Computer Applications
%@ 0975-8887
%V 128
%N 16
%P 33-37
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

In this paper, an approach for efficiently extracting the repeating patterns in a biological sequence is proposed. A repeating pattern is a subsequence which appears more than once in a sequence, which is one of the most important features that can be used for revealing functional or evolutionary relationships in biological sequences. The algorithm does a rapid scan of the string to find repeating regions where the repeating substring has been marked using length, occurrence positions, and occurrence frequency. The algorithm execute in linear time and space independent of alphabet size. The algorithm also has the capability to restrict output complete repeats in which length (period) p ≥ pmin, where pmin ≥ 1 is a user-specified minimum. The algorithm outputs complete repeats, and can be extended or applied to other situations, for example computing maximal repeats, or finding common motifs in a set of biological sequences.

References
  1. Lander E.S, Linton L.M, Birren B, et al.. Initial Sequencing and Analysis of the Human Genome, Nature, 2001, 409(6822): 860-921.
  2. Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: Computational challenges and solutions, Nature Reviews Genetics, 2011, 13 (1): 36-46.
  3. Stefan Kurtz, Jomuna V. Choudhuri, Enno Ohlebusch, Chris Schleiermacher, Jens Stoye, Robert Giegerich. REPuter: The manifold applications of repeat analysis on a genomic scale, Nucleic Acids Research, 2001, 29(22):4633-4642.
  4. Makalowski W. Not junk after all, Science, 2003, 300(5623): 1246-1247.
  5. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 2004, 431: 931-945.
  6. Verkerk A., Pieretti M., Sutcliffe J., Fu Y., Kul D., Pizzuti A., Refiner O., et al.. Identification of gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome, Cell, 1991, 65: 905-914.
  7. Huntington's Disease Collaborative Research Group. A novel gene containing a trinucleotide repeat that is expanded an unstable on Huntington's disease chromosomes, Cell, 1993, 72: 971-983.
  8. Campuzano V., Montermini L., Molto M.D., Pianese L. Cossee M, et al.. Friedreich's ataxia: autosomal recessive disease caused by an intronic GAA triplet repeat expansion, Science, 1996, 271(5254):1423-1427.
  9. E. Eskin, P. A. Pevzner. Finding Composite Regulatory Patterns in DNA Sequences. Bioinformatics, 2002, 18 Suppl 1:S354-363.
  10. Sagot, MF. Spelling Approximate Repeated or Common Motifs Using a Suffix Tree. Lecture Notes in Computer Science, 1998, 1380:111-127.
  11. A.F.A. Smit and P. Green. REPEATMASKER. Available at http://www.repeatmasker.org/
  12. Dan Gusfield. Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.
  13. Stefan Kurtz, Chris Schleiermacher. REPuter: Fast computation of maximal repeats in complete genomes, Bioinformatics, 1999, 15(5):426-427.
  14. A. L. Delcher et al. Alignment of whole genomes, Nucleic Acids Research, 1999, 27:2369-2376.
  15. Gary Benson. Tandem repeats finder: A program to analyze DNA sequences, Nucleic Acids Research, 1999, 27(2):573-580.
  16. Frantisek Franek, William F. Smyth, Yudong Tang. Computing all repeats using suffix arrays, Journal of Automata, Languages and Combinatorics, 2003, 8(4): 579-591.
  17. Kaziyuki Narisawa, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda, Efficient computation of substring equivalence classes with suffix arrays, Proc. of 18th CPM, 2007, 340-351.
  18. Simon J. Puglisi, W. F. Smyth, Munina Yusufu. Fast, Practical Algorithms for Computing All the Repeats in a String, Mathematics in Computer Science, 2010, 3(4):371-496.
  19. Albert A. Conti, Tom Van Court, Martin C. Herbordt. Processing Repetitive Sequence Structures with Mismatches at Streaming Rate. Lecture Notes in Computer Science, 2004, 3203:1080-1083.
  20. Juha Karkkainen, Peter Sanders. Simple linear work suffix array construction, Proc. of 30th ICALP, LNCS 2719, 2003, 943-955.
  21. Kasai, G. Lee, H. Arimura, S. Arikawa, K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications, Proc. of 12th CPM, LNCS 2089, 2001, 181-192.
  22. Nizar R. Mabroukeh, C. I. Ezeife. A Taxonomy of Sequential Pattern Mining Algorithms. ACM Computing Surveys, 2010, 43(1):3:1-3:41.
  23. Anisa Al-Hafeedh, Maxime Crochemore, Lucian Ilie, Evguenia Kopylova, W.F. Smyth, German Tischler, Munina Yusufu. A comparison of index-based Lempel- Ziv LZ77 factorization algorithms. ACM Computing Surveys, 2012, 45(1):5:1-5:17.
Index Terms

Computer Science
Information Sciences

Keywords

Complete repeats Biological sequence Suffix array Motif finding