Research Article

A Scoring Method for the Clustering of Nucleic Acid Sequences

by Barileé Barisi Baridam
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 44 - Number 2
Year of Publication: 2012
Authors: Barileé Barisi Baridam
10.5120/6235-8331

Barileé Barisi Baridam. A Scoring Method for the Clustering of Nucleic Acid Sequences. International Journal of Computer Applications. 44, 2 (April 2012), 14-22. DOI=10.5120/6235-8331

@article{ 10.5120/6235-8331,
author = { Barileé Barisi Baridam },
title = { A Scoring Method for the Clustering of Nucleic Acid Sequences },
journal = { International Journal of Computer Applications },
issue_date = { April 2012 },
volume = { 44 },
number = { 2 },
month = { April },
year = { 2012 },
issn = { 0975-8887 },
pages = { 14-22 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume44/number2/6235-8331/ },
doi = { 10.5120/6235-8331 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:34:30.597783+05:30
%A Barileé Barisi Baridam
%T A Scoring Method for the Clustering of Nucleic Acid Sequences
%J International Journal of Computer Applications
%@ 0975-8887
%V 44
%N 2
%P 14-22
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Clustering biological sequence data is an important task for biologists: it helps molecular biologists group sequences according to the ancestral traits or hereditary information hidden in them. Several clustering algorithms and similarity and distance measures have been proposed for similarity detection and clustering. As observed in the course of this study, most of these algorithms and measures are inefficient at detecting sequences on the basis of their structural similarity. This paper develops the codon-based scoring method (COBASM) to address this inefficiency. COBASM applies the codon principle, using nucleotide triplets, to the clustering of nucleic acid sequences. The results show that COBASM produces compact and well-separated clusters based on the structural similarity of sequences.
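The abstract does not give the COBASM scoring formula, so the following is only an illustrative sketch of what scoring on nucleotide triplets can look like. It assumes a simple position-wise codon match count normalised by the longer sequence; the function names `codons` and `codon_match_score` are hypothetical and are not taken from the paper.

```python
def codons(seq: str) -> list[str]:
    """Split a nucleotide sequence into non-overlapping triplets.

    Any trailing one or two bases that do not form a full triplet are dropped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_match_score(a: str, b: str) -> float:
    """Fraction of codon positions at which the two sequences agree.

    Assumption: this position-wise match ratio is a stand-in for a
    codon-based score, not the scoring rule defined by COBASM.
    """
    ca, cb = codons(a), codons(b)
    longest = max(len(ca), len(cb))
    if longest == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(ca, cb))
    return matches / longest
```

A pairwise score of this kind could feed any standard clustering algorithm as a similarity measure; how COBASM actually combines triplet scores is described in the full paper, not in this abstract.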

Index Terms

Computer Science
Information Sciences

Keywords

Codon, Scoring Method, Similarity Measure, Clustering