Research Article

Computational Approaches for Variant Identification

by Diksha Garg, Ankita Jiwan, Shailendra Singh
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 165 - Number 8
Year of Publication: 2017
Authors: Diksha Garg, Ankita Jiwan, Shailendra Singh
10.5120/ijca2017913970

Diksha Garg, Ankita Jiwan, Shailendra Singh. Computational Approaches for Variant Identification. International Journal of Computer Applications. 165, 8 (May 2017), 18-24. DOI=10.5120/ijca2017913970

@article{ 10.5120/ijca2017913970,
author = { Diksha Garg and Ankita Jiwan and Shailendra Singh },
title = { Computational Approaches for Variant Identification },
journal = { International Journal of Computer Applications },
issue_date = { May 2017 },
volume = { 165 },
number = { 8 },
month = { May },
year = { 2017 },
issn = { 0975-8887 },
pages = { 18-24 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume165/number8/27594-2017913970/ },
doi = { 10.5120/ijca2017913970 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T00:11:53.938977+05:30
%A Diksha Garg
%A Ankita Jiwan
%A Shailendra Singh
%T Computational Approaches for Variant Identification
%J International Journal of Computer Applications
%@ 0975-8887
%V 165
%N 8
%P 18-24
%D 2017
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Variant identification is a fundamental part of the analysis of genetic diseases. Variants are alterations in the arrangement of nucleotides in a DNA sequence. Genetic diseases are caused by variations in genes that may change the encoded protein, affecting the survival and adaptation of an individual. A number of computational techniques are applied to identify these variants. Precise diagnosis of genetic diseases is important for the proper treatment of patients and for determining explicit prevention strategies. The introduction of next-generation sequencing (NGS) techniques has made large numbers of DNA sequences easily available, which has made variant identification using NGS data an area of interest. This paper briefly discusses the steps followed in NGS data analysis. It then explains in detail a few approaches used for identifying variants: a support vector machine based approach, a machine learning based approach, MOSAIK (a hash-based approach), a Bayesian statistical approach, and a JointSLM based approach.
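
To make the abstract's definition of a variant concrete, the following minimal Python sketch reports single-nucleotide differences between a reference sequence and a sample sequence. It is an illustration only and is not one of the approaches surveyed in the paper: it assumes the two sequences are already aligned and of equal length, and it ignores insertions, deletions, sequencing error, and read depth. The function name `find_substitutions` and the example sequences are hypothetical.

```python
def find_substitutions(reference, sample):
    """Return (position, ref_base, alt_base) for every mismatch.

    Assumes both sequences are already aligned and of equal length.
    Positions are 1-based, as is conventional for genomic coordinates.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(
            zip(reference.upper(), sample.upper())
        )
        if ref_base != alt_base
    ]


# Hypothetical toy sequences, chosen only for illustration.
reference = "ATGGCGTACGTTAGC"
sample = "ATGGCATACGTTGGC"

for position, ref_base, alt_base in find_substitutions(reference, sample):
    print(f"position {position}: {ref_base} -> {alt_base}")
```

Real NGS pipelines work from many short, error-prone reads rather than a single clean sequence, which is why the statistical and machine learning approaches named in the abstract are needed in practice.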

Index Terms

Computer Science
Information Sciences

Keywords

Variants, Variations, Mutations, Genetic Disease, Variant Identification