Research Article

Distance-based K-Means Clustering Algorithm for Anomaly Detection in Categorical Datasets

by Noor Basha, Ashok Kumar P.S.
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 183 - Number 11
Year of Publication: 2021
10.5120/ijca2021921415

Noor Basha, Ashok Kumar P.S. Distance-based K-Means Clustering Algorithm for Anomaly Detection in Categorical Datasets. International Journal of Computer Applications. 183, 11 (Jun 2021), 9-14. DOI=10.5120/ijca2021921415

@article{ 10.5120/ijca2021921415,
author = { Noor Basha and Ashok Kumar P.S. },
title = { Distance-based K-Means Clustering Algorithm for Anomaly Detection in Categorical Datasets },
journal = { International Journal of Computer Applications },
issue_date = { Jun 2021 },
volume = { 183 },
number = { 11 },
month = { Jun },
year = { 2021 },
issn = { 0975-8887 },
pages = { 9-14 },
numpages = {6},
url = { https://ijcaonline.org/archives/volume183/number11/31970-2021921415/ },
doi = { 10.5120/ijca2021921415 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:16:30.598410+05:30
%A Noor Basha
%A Ashok Kumar P.S.
%T Distance-based K-Means Clustering Algorithm for Anomaly Detection in Categorical Datasets
%J International Journal of Computer Applications
%@ 0975-8887
%V 183
%N 11
%P 9-14
%D 2021
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Real-world datasets carry knowledge that can be extracted in an unsupervised manner, with distinct and complementary aspects. Many cluster analysis algorithms have emerged recently, and it is difficult for a user to determine a priori which one will best suit a given dataset. Graph-based algorithms give good results for this task. However, such algorithms are vulnerable to outliers and noise, since only minimal edge information in the tree is available to split a dataset. The need for better clustering algorithms is therefore growing in several fields, and the use of robust, dynamic algorithms to improve and simplify the whole data clustering process has become an important research area. This paper proposes a novel distance-based clustering algorithm, the entropic distance-based K-means clustering algorithm (EDBK), to remove outliers effectively. The algorithm relies on the entropic distance between the attributes of data points and on basic statistical operations. Experiments on UCI datasets show that EDBK outperforms existing methods such as Artificial Bee Colony (ABC) and k-means. EDBK achieved 80.71% recall, 79.81% precision, and 75.82% F-measure. The results show that EDBK not only improves clustering accuracy (to nearly 92%) but also greatly reduces the interference of outliers in the clustering results.
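The abstract does not define the entropic distance or the statistical operations EDBK uses, so the following is only a rough sketch of the general idea and not the authors' algorithm. It assumes (hypothetically) that each categorical attribute is weighted by its Shannon entropy, that the distance between two records is the sum of the weights of mismatching attributes, that cluster centers are per-attribute modes, and that a record is flagged as an outlier when its distance to its center exceeds the mean distance plus `z` standard deviations. All function names and the threshold rule are illustrative assumptions.

```python
import math
import statistics
from collections import Counter


def attribute_entropy_weights(rows):
    """Assumed weighting: normalized Shannon entropy of each categorical attribute."""
    n = len(rows)
    entropies = []
    for column in zip(*rows):
        counts = Counter(column)
        entropies.append(-sum((c / n) * math.log2(c / n) for c in counts.values()))
    total = sum(entropies)
    if total == 0:  # every attribute is constant: fall back to uniform weights
        return [1.0 / len(entropies)] * len(entropies)
    return [h / total for h in entropies]


def weighted_mismatch(a, b, weights):
    """Sum of the weights of the attributes on which records a and b differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def mode_of(rows):
    """Per-attribute mode, used here as the center of a categorical cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*rows))


def cluster_and_flag(rows, k, iters=10, z=2.0):
    """K-means-style loop on categorical records, then distance-based outlier flagging."""
    weights = attribute_entropy_weights(rows)
    # Deterministic initialization (an assumption): the first k distinct records.
    centers = list(dict.fromkeys(rows))[:k]
    if len(centers) < k:
        raise ValueError("need at least k distinct records")
    labels = []
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: weighted_mismatch(row, centers[j], weights))
            for row in rows
        ]
        new_centers = []
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mode_of(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    dists = [weighted_mismatch(row, centers[lab], weights) for row, lab in zip(rows, labels)]
    threshold = statistics.mean(dists) + z * statistics.pstdev(dists)
    outliers = [i for i, d in enumerate(dists) if d > threshold]
    return labels, centers, outliers


# Toy data (made up for illustration): two clean groups plus one odd record.
rows = (
    [("red", "small", "round")] * 6
    + [("blue", "large", "square")] * 6
    + [("green", "medium", "flat")]
)
labels, centers, outliers = cluster_and_flag(rows, k=2)
```

On this toy input the last record differs from both cluster modes on every attribute, so it is the only one whose distance exceeds the threshold. A real evaluation would need the paper's actual distance definition and the UCI datasets it reports on.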

Index Terms

Computer Science
Information Sciences

Keywords

Artificial Bee Colony, Clustering, Data points, Entropic Distance, K-means, Outliers