Distance-based K-Means Clustering Algorithm for Anomaly Detection in Categorical Datasets

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2021
Noor Basha, Ashok Kumar P.S.

Noor Basha and Ashok Kumar P.S. Distance-based K-Means Clustering Algorithm for Anomaly Detection in Categorical Datasets. International Journal of Computer Applications 183(11):9-14, June 2021. BibTeX

	author = {Noor Basha and Ashok Kumar P.S.},
	title = {Distance-based K-Means Clustering Algorithm for Anomaly Detection in Categorical Datasets},
	journal = {International Journal of Computer Applications},
	issue_date = {June 2021},
	volume = {183},
	number = {11},
	month = {Jun},
	year = {2021},
	issn = {0975-8887},
	pages = {9-14},
	numpages = {6},
	url = {},
	doi = {10.5120/ijca2021921415},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}


Real-world datasets yield knowledge in an unsupervised manner, with distinct and complementary aspects. Many cluster analysis algorithms have emerged recently, and it is difficult for a user to determine a priori which algorithm is most suitable for a given dataset. Graph-based algorithms give good results for this task. However, such algorithms are vulnerable to outliers and noise, since only minimal edge information in the tree is available to split a dataset. The need for better clustering algorithms is therefore growing in several fields, and the use of robust, dynamic algorithms to improve and simplify the data clustering process has become an important research area.

In this paper, a novel distance-based clustering algorithm, the entropic distance based K-means clustering algorithm (EDBK), is proposed to remove outliers effectively. The algorithm depends on the entropic distance between the attributes of data points and on some basic statistical operations. Experiments conducted on UCI datasets show that EDBK outperforms existing methods such as Artificial Bee Colony (ABC) and k-means. EDBK achieved 80.71% recall, 79.81% precision and 75.82% F-measure. The results show that EDBK not only improves clustering accuracy (nearly 92%) but also greatly reduces the interference of outliers with the clustering results.
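The abstract does not define the entropic distance or the statistical operations used, so the following is only an illustrative sketch under stated assumptions, not the EDBK method itself. It assumes that each categorical attribute is weighted by its Shannon entropy, that the distance between two records is the sum of the weights of the attributes on which they differ, and that a record is treated as an outlier when its distance to the nearest cluster centre exceeds the mean distance plus two standard deviations. All function names and the `z_cut` parameter are hypothetical.

```python
import math
from collections import Counter


def attribute_entropy(column):
    """Shannon entropy (in bits) of one categorical attribute."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_weights(rows):
    """One weight per attribute, proportional to its entropy.

    Assumption: entropy-proportional weighting; the paper may use another scheme.
    """
    entropies = [attribute_entropy(col) for col in zip(*rows)]
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant: fall back to uniform weights.
        return [1.0 / len(entropies)] * len(entropies)
    return [e / total for e in entropies]


def entropic_distance(a, b, weights):
    """Sum of the weights of the attributes on which records a and b differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def flag_outliers(rows, centers, weights, z_cut=2.0):
    """Mark records whose nearest-centre distance exceeds mean + z_cut * std.

    Assumption: a simple mean/standard-deviation cutoff stands in for the
    "basic mathematical statistics operations" mentioned in the abstract.
    """
    nearest = [
        min(entropic_distance(r, c, weights) for c in centers) for r in rows
    ]
    mean = sum(nearest) / len(nearest)
    std = math.sqrt(sum((d - mean) ** 2 for d in nearest) / len(nearest))
    return [d > mean + z_cut * std for d in nearest]


if __name__ == "__main__":
    # Toy data: nine identical records and one that differs on both attributes.
    data = [("a", "x")] * 9 + [("b", "y")]
    w = entropy_weights(data)
    print(flag_outliers(data, centers=[("a", "x")], weights=w))
```

In a full clustering loop, records flagged in this way could be set aside before the cluster centres are recomputed, which is one way to limit the influence of outliers on the final clusters.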




Keywords: Artificial Bee Colony, Clustering, Data points, Entropic Distance, K-means, Outliers