CFP last date
22 April 2024
Reseach Article

Parallelizing and Analyzing the Behavior of Sequence Alignment Algorithm on a Cluster of Workstations for Large Datasets

by Sathe S. R, Shrimankar D. D.
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 74 - Number 21
Year of Publication: 2013
Authors: Sathe S. R, Shrimankar D. D.
10.5120/13042-0091

Sathe S. R, Shrimankar D. D. . Parallelizing and Analyzing the Behavior of Sequence Alignment Algorithm on a Cluster of Workstations for Large Datasets. International Journal of Computer Applications. 74, 21 ( July 2013), 18-30. DOI=10.5120/13042-0091

@article{ 10.5120/13042-0091,
author = { Sathe S. R, Shrimankar D. D. },
title = { Parallelizing and Analyzing the Behavior of Sequence Alignment Algorithm on a Cluster of Workstations for Large Datasets },
journal = { International Journal of Computer Applications },
issue_date = { July 2013 },
volume = { 74 },
number = { 21 },
month = { July },
year = { 2013 },
issn = { 0975-8887 },
pages = { 18-30 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume74/number21/13042-0091/ },
doi = { 10.5120/13042-0091 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:42:54.789746+05:30
%A Sathe S. R
%A Shrimankar D. D.
%T Parallelizing and Analyzing the Behavior of Sequence Alignment Algorithm on a Cluster of Workstations for Large Datasets
%J International Journal of Computer Applications
%@ 0975-8887
%V 74
%N 21
%P 18-30
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

An MPI based parallelization technique for improving the scalability of the global sequence alignment algorithm on clusters of workstation is presented. We propose the parallel implementation of the Wavefront algorithm based on a chunk size transformation to handle large dataset with message passing model. Molecular biologists frequently align DNA sequences of entire genomes to detect important matched and mismatched regions. Even though efficient dynamic programming algorithms exist for this problem, the required computing time is still very high due to the size of these sequences. Because the number of sequenced organisms is increasing rapidly, fast and accurate solutions are of highest importance to research in this area. We show that an appropriate choice of the number of processes and chunk size has great impact on the overall system performance on cluster system. We have conducted the experiments on real-life DNA samples of house mouse mitochondrion and the DNA of rabbit mitochondrion obtained from the public database GenBank [GenBank, http://www. ncbi. nih. gov] in our experiment to measure the algorithm behavior appropriately. The results obtained from performed experiments, demonstrate that developed parallel Wavefront algorithm exposes high speedup and scales linearly with the increasing number of processes. Also the communication among processes and memory requirements are kept at minimum to achieve high efficiency. The experiments were performed on cluster which consists of two workstations of 12 core each with multithreading environment.

References
  1. A. Boukerche, Alba Melo, "Bioinformatics applications in grid computing environments," Grid Computing for Bioinformatics and Computational Biology, pp. 301-325, 2007.
  2. A. Boukerche, A. C. de Melo, "Computational molecular biology," Parallel Computing for Bioinformatics and Computational Biology: Models Enabling Technologies, and Case Studies, pp. 149-166, 2006.
  3. J. C. Setubal and J. Meidanis, Introduction to Computational Molecular Biology. Brooks/Cole Publishing Company, 1997.
  4. . Saul B. Needleman and Christian D. Wunsch. "A general method applicable to the search for similarities in the amino acid sequence of two sequences," Journal of Molecular Biology, pp. 443-453, 1970.
  5. A. Zomaya, Parallel Computing for Bioinformatics and Computational Biology: Models, Enabling Technologies, and Case Studies, 2006.
  6. Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, David J. Lipman, "A basic local alignment search tool," Journal of Molecular Biology, pp 403-410, 1990.
  7. T. F. Smith, M. S. Waterman, "Identification of common molecular subsequences," J. Mol. Biol. Pp. 195-197, 1981.
  8. Chaichoompu K, Kittitornkun S, Tongsima S. "MT-clustalW: multithreading multiple sequence alignment," IPDPS,
  9. Li KB, "ClustalW-MPI: ClustalW analysis using distributed and parallel computing," Bioinformatics, pp. 1585–1586, 2006.
  10. Schmollinger M, Nieselt K, Kaufmann M, Morgenstern B, "DIALIGN P: fast pair-wise and multiple sequence alignment using parallel processors," BMC Bioinform, 2004
  11. Tan G, Feng S, Sun N, "Parallel multiple sequences alignment in SMP clusters," Highperformance computing, 2005.
  12. Zola J, Yang X, Rospondek A, Aluru S, "T-Coffee: a parallel multiple sequence aligner," PDCS, pp. 248–253, 2007.
  13. Boukerche A, Correa JM, Melo ACMA, Jacobi RP, "A hardware accelerator for the fast retrieval of DIALIGN biological sequence alignments in linear space," IEEE Trans. Comput, pp. 808–821, 2010.
  14. Altschul, S. F. , Gish, W. , Miller, W. , Myers, E. W. , Lipman, D. J. , "Basic local alignment search tool," Journal of Molecular Biology, pp. 403-410, 1990.
  15. Chao, K. M. , Zhang, J. , Ostell, J. , Miller, W. , "A local alignment tool for long DNA sequences," Computer Applications in the Biosciences, pp. 147-153, 1994.
  16. Pearson, W. R. , "Comparison of methods for searching protein sequence databases," Protein Science, pp. 1145-1160, 1995.
  17. Martins, W. S. , del Cuvillo, J. B. , Cui, W. , Gao, G. R. , "Whole Genome Alignment using a Multithreaded Parallel Implementation," Proceedings 13th Symposium on Computer Architecture and High Performance Computing, September 2001.
  18. Osamu Gotoh. "An improved algorithm for matching biological sequences," Journal of Molecular Biology, pp. 705-708, 1982.
  19. E. W. Myers and W. Miller, "Optimal alignments in linear space," Computer Applications in the Biosciences, pp. 11-17, 1988.
  20. G. Navarro, "A guided tour to approximate string matching," ACM Computing Surveys, 2001.
  21. J. C. Setubal, J. Meidanis, "Introduction to Computational Molecular Biology," Brooks/Cole Publishing Company, 1997.
  22. D. S. Hirschberg, "A linear space algorithm for computing maximal common subsequences," Communications of the ACM, pp. 341-343, 1975.
  23. David W. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 2004.
  24. S. Aji, F. Blagojevic, W. Feng, D. S. Nikolopoulos. "Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine" Proceedings of the ACM nternational Conference on Computing Frontiers, pp. 13-22, 2008
  25. C. Quammen, Introduction to programming shared-memory and distributed memory parallel computers, Crossroads 12, 2005.
  26. W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, Cambridge, MA, USA, 1994.
  27. E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham,T. S. Woodall, "Open MPI: goals, concept, and design of a next generation MPI implementation," Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, pp. 97–104.
  28. Plaat, A. , Bal, H. E. , Hofman, R. H. F. , "Sensitivity of Parallel Applications to Large Differences in Bandwidth and Latency in Two-Layer Interconnects," Proceedings 5th IEEE HPCA'99, pp. 244-253, 1999.
  29. Foster, I. , Geisler, J. , Gropp, W. , Karonis, N. , Lusk, E. , Thiruvathukal, G. , Tuecke S. , "Wide-Area Implementation of the Message Passing Interface," Parallel Computing, pp. 1735-1749, 1998.
  30. http://www. niu. edu/mpi
  31. http://www-unix. mcs. anl. gov/mpi/mpich/
  32. Schmidt, B. , Schr¨oder, H. , Schimmler, M. , "Massively Parallel Solutions for Molecular Sequence Analysis," Proc. IPDPS'02, Ft. Lauderdale, Florida, 2002.
  33. Schmidt, B. , Schr¨oder, H. , Schimmler, M. , "A hybrid architecture for bioinformatics," Future Generations Computer Systems, pp. 855-862, 2002.
Index Terms

Computer Science
Information Sciences

Keywords

Sequence Alignment MPI Cluster Computing Parallel Wavefront algorithm