CFP last date
20 May 2024
Reseach Article

A Vital Approach to compress the Size of DNA Sequence using LZW (Lempel-Ziv-Welch) with Fixed Length Binary Code and Tree Structure

by Nishad Pm, R. Manicka Chezian
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 43 - Number 1
Year of Publication: 2012
Authors: Nishad Pm, R. Manicka Chezian
10.5120/6065-8193

Nishad Pm, R. Manicka Chezian . A Vital Approach to compress the Size of DNA Sequence using LZW (Lempel-Ziv-Welch) with Fixed Length Binary Code and Tree Structure. International Journal of Computer Applications. 43, 1 ( April 2012), 7-9. DOI=10.5120/6065-8193

@article{ 10.5120/6065-8193,
author = { Nishad Pm, R. Manicka Chezian },
title = { A Vital Approach to compress the Size of DNA Sequence using LZW (Lempel-Ziv-Welch) with Fixed Length Binary Code and Tree Structure },
journal = { International Journal of Computer Applications },
issue_date = { April 2012 },
volume = { 43 },
number = { 1 },
month = { April },
year = { 2012 },
issn = { 0975-8887 },
pages = { 7-9 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume43/number1/6065-8193/ },
doi = { 10.5120/6065-8193 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:32:46.818230+05:30
%A Nishad Pm
%A R. Manicka Chezian
%T A Vital Approach to compress the Size of DNA Sequence using LZW (Lempel-Ziv-Welch) with Fixed Length Binary Code and Tree Structure
%J International Journal of Computer Applications
%@ 0975-8887
%V 43
%N 1
%P 7-9
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The genome of an organism contains all hereditary information encoded in Deoxyribonucleic Acid (DNA). Molecular sequence databases (e. g. ,EMBL, Genbank, DDJB, Entrez, SwissProt, etc) represent millions of DNA sequences filling many thousands of gigabytes and the databases are doubled in size every 6-8 months, which may go to beyond the limit of storage capacity. There are several text compression algorithm used for DNA compression. This paper proposes a new hybrid algorithm is used to compress DNA sequence, the algorithm is designed by combining the fixed length binary code with the LZW (Lempel-Ziv-Welch) compression algorithm. Initially the input sequence is divided in to fragments where each fragment consist of four nucleotides and fixed length binary code is assigned to each nucleotide then the pattern (STR and CHR) in LZW used the same for creating the dictionary. Assigning a new binary code for each pattern in the dictionary using a binary tree, and the sequence is replaced binary code for the longest match in the dictionary while compression. The proposed approach attains maximum compression in DNA sequences.

References
  1. Raja Rajeswari and Dr. Allam Apparao "Genbit compress – algorithm for repetitive and non-repetitive DNA sequences" Journal of Theoretical and Applied Information Technology 2005
  2. Ateet Mehta and Bankim Patel "DNA compression Using Hash Based Data Structure" International Journal of Information Technology and Knowledge Management July-December 2010, Volume 2, No. 2, pp. 383-386
  3. S. Grumbach and F. Tahi, "A new challenge for compression algorithms: Genetic sequences," J. Inform. Process. Manage. , vol. 30, no. 6, pp. 875-866, 1994.
  4. S. Grumbach and F. Tahi, "Compression ofDNA sequences," in Proc. IEEE Symp. Data Compression, Snowbird, UT, 1993, pp. 340-350.
  5. I. H. Witten, R. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Commun. ACM, vol. 30, pp. 52-541, Jun. 1987.
  6. É. Rivals, J. P. Delahaye, M. Dauchet, and O. Delgrange, "A Guaranteed Compression Scheme for Repetitive DNA Sequences," LIFL Lille I Univ. , Tech. Rep. IT-285, 1995.
  7. G. Korodi and I. Tabus, "An efficient normalized maximum likelihood algorithm for DNA sequence compression," ACM Trans. on Information Systems, vol. 23, no. 1, pp. 3–34, Jan. 2005.
  8. Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing, Reading, Massachusetts: Addison-Welsley Publishing Company, 1992.
  9. WELCH, T. A. 1984. A technique for high-performance data compression. IEEE Comput. 17, 6, 8–19. 9
  10. ZIV, J. AND LEMPEL, A. 1978. "Compression of individual sequences via variable-rate coding". IEEE Trans. Inform. Theory 24, 5, 530–536.
  11. ZIV, J. AND LEMPEL, A. 1977. "A universal algorithm for sequential data compression". IEEE Trans. Inform. Theory 23, 3, 337–343.
  12. D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes", Proceedings of the I. R. E. , September 1952, pp 1098–1102.
Index Terms

Computer Science
Information Sciences

Keywords

Embl Ddjb Genome Deoxyribonucleic Acid (dna) Lzw Nucleotide