CFP last date
22 April 2024
Reseach Article

Article:Outlier Removal Clustering through Minimum Spanning Tree

by T. Karthikeyan, S. John Peter
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 31 - Number 10
Year of Publication: 2011
Authors: T. Karthikeyan, S. John Peter
10.5120/3858-5382

T. Karthikeyan, S. John Peter . Article:Outlier Removal Clustering through Minimum Spanning Tree. International Journal of Computer Applications. 31, 10 ( October 2011), 1-7. DOI=10.5120/3858-5382

@article{ 10.5120/3858-5382,
author = { T. Karthikeyan, S. John Peter },
title = { Article:Outlier Removal Clustering through Minimum Spanning Tree },
journal = { International Journal of Computer Applications },
issue_date = { October 2011 },
volume = { 31 },
number = { 10 },
month = { October },
year = { 2011 },
issn = { 0975-8887 },
pages = { 1-7 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume31/number10/3858-5382/ },
doi = { 10.5120/3858-5382 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:17:46.950003+05:30
%A T. Karthikeyan
%A S. John Peter
%T Article:Outlier Removal Clustering through Minimum Spanning Tree
%J International Journal of Computer Applications
%@ 0975-8887
%V 31
%N 10
%P 1-7
%D 2011
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Minimum spanning tree-based clustering algorithm is capable of detecting clusters with irregular boundaries. Detecting outliers using clustering algorithm is a big desire. Outlier detection is an extremely important task in a wide variety of application. In this paper we propose a minimum spanning tree-based clustering algorithm for detecting outliers. The algorithm partition the dataset into optimal number of clusters. Outliers are detected in the clusters based on outlyingness factor of each point (objects) in the cluster. The algorithm uses a new cluster validation criterion based on the geometric property of data partition of the data set in order to find the proper number of clusters. The algorithm works in two phases. The first phase of the algorithm creates optimal number of clusters, where as the second phase of the algorithm detect outliers.

References
  1. C. Aggarwal and P. Yu, “Outlier Detection for High Dimensional Data”. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Volume 30, Issue 2, pp 37 – 46, May 2001.
  2. J.Almeida, L.Barbosa, A.Pais and S.Formosinho, “Improving Hierarchical Cluster Analysis: A new method with OutlierDetection and Automatic Clustering,”Chemometrics and Intelligent Laboratory Systems 87:208-217, 2007.
  3. Anil K. Jain, Richard C. Dubes “Algorithm for Clustering Data”, Michigan StateUniversity, Prentice Hall, Englewood Cliffs,New Jersey 07632.1988.
  4. F.Angiulli, and C.Pizzuti, Outlier Mining in Large High-Dimensional Data sets,.IEEE Transactions on Kaoeledge and Data Engineering, 17(2): 203-215, 2005.
  5. T. Asano, B. Bhattacharya, M.Keil and F.Yao. “Clustering Algorithms based on minimum and maximum spanning trees”. In Proceedings of the 4th Annual Symposium on ComputationalGeometry,Pages252-257,1988.
  6. D. Avis “Diameter partitioning.” Discrete and Computational Geometr, 1:265-276, 1986.
  7. V.Barnett and T.Lewis,”Outliers in Statistical Data.”, John Wiley, 1994.
  8. M. Breunig, H.Kriegel, R.Ng and J.Sander, “Lof: Identifying density-based local outliers”. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data. ACM Press, pp 93-104, 2000.
  9. B.Custem and I.Gath,,”Detection of Outliers and Robust Estimation using Fuzzy clustering”, Computational Statistics & data Analyses 15,pp.47-61, 1993.
  10. Feng Luo,Latifur Kahn, Farokh Bastani, I-Ling Yen, and Jizhong Zhou, “A dynamically growing self-organizing tree(DGOST) for hierarchical gene expression profile”,Bioinformatics,Vol 20,no 16, pp 2605-2617, 2004.
  11. M. Fredman and D. Willard. “Trans-dichotomous algorithms for minimum spanning trees and shortest paths”. In Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science,pages 719-725, 1990.
  12. Gath and A.Geva, “Fuzzy Clustering for the estimation of the Parameters of the components of Mixtures of Normal distribution”, Pattern Recognition letters, 9, pp.77-86, 1989.
  13. B. Ghosh-Dastidar and J.L. Schafer, “Outlier Detection and Editing Procedures for Continuous Multivariate Data”. ORP Working Papers, September 2003.
  14. A. Hardy, “On the number of clusters”, Computational Statistics and Data Analysis, 23, pp. 83–96, 1996.
  15. T. Hastie, R. Tibshirani, and J. Friedman, “The elements of statistical learning: Data mining, inference and prediction”, Springer-Verlag, 2001.
  16. 16 D.Hawkins,”Identifications of Outliers”, Chapman and Hall, London, ,1980.
  17. Z. He, X. Xu and S. Deng, “Discovering cluster-based Local Outliers”. Pattern Recognition Letters, Volume 24, Issue 9-10, pp 1641 – 1650, June 2003.
  18. H.Gabow, T.Spencer and R.Rarjan. “Efficient algorithms for finding minimum spanning trees in undirected and directed graphs”, Combinatorica, 6(2):pp 109-122, 1986.
  19. M. Jaing, S. Tseng and C. Su, “Two-phase Clustering Process for Outlier Detection”,Pattern Recognition Letters, Volume 22, Issue 6 – 7, pp 691 – 700, May 2001.
  20. S. John Peter, S.P. Victor, “A Novel Algorithm for Meta similarity clusters using Minimum spanning tree”. International Journal of computer science and Network Security. Vol.10 No.2 pp. 254 – 259, 2010
  21. D. Karger, P. Klein and R. Tarjan, “A randomized linear-time algorithm to find minimum spanning trees”. Journal of the ACM, 42(2):321-328, 1995.
  22. E. Knorr and R. Ng, “A Unified Notion of Outliers: Properties and Computation”. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 219 – 222, August 1997.
  23. E..Knorr and R.Ng, “Algorithms for Mining Distance-based Outliers in Large Data sets”, Proc.the 24th International Conference on Very Large Databases(VLDB),pp.392-403, 1998.
  24. E.Knorr, R.Ng and V.Tucakov, “Distance- Based Outliers: Algorithms and Applications”, VLDB Journal, 8(3-4):237-253, 2000.
  25. J. Kruskal, “On the shortest spanning subtree and the travelling salesman problem”, In Proceedings of the American Mathematical Society, pp 48-50, 1956.
  26. A.Loureiro, L.Torgo and C.Soares, “Outlier detection using Clustering methods: A data cleaning Application”, in Proceedings of KDNet Symposium on Knowledge-based systems for the Public Sector. Bonn, Germany, 2004.
  27. Moh’d Belal Al-Zoubi, “An Effective Clustering-Based Approach for Outlier Detection”, European Journal of Scientific Research, Vol.28 No.2 ,pp.310-316, 2009
  28. J. Nesetril, E.Milkova and H.Nesetrilova. Otakar boruvka on “Minimum spanning tree problem”: Translation of both the 1926 papers, comments, history. DMATH: Discrete Mathematics, 233, 2001.
  29. K.Niu, C.Huang, S.Zhang and J.Chen, “ODDC: Outlier Detection Using Distance Distribution Clustering”, T.Washio et al. (Eds.): PAKDD 2007 Workshops, Lecture Notes in Artificial Intelligence (LNAI) 4819,pp.332-343,Springer-Verlag, 2007.
  30. Oleksandr Grygorash, Yan Zhou, Zach Jorgensen. “Minimum spanning Tree Based Clustering Algorithms”. Proceedings of the 18th IEEE International conference on tools with Artificial Intelligence (ICTAI’06) 2006.
  31. S.Papadimitriou, H.Kitawaga, P.Gibbons and C. Faloutsos, “LOCI: Fast outlier detection using the local correlation integral”, Proc. Of the International Conference on Data Engineering ,pp.315-326, 2003.
  32. R. Prim. “Shortest connection networks and some generalization”. Bell systems Technical Journal,36:1389-1401, 1957.
  33. S. Ramaswamy, R. Rastogi and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets”. In Proceedings of the 2000 ACM SIGMOD InternationalConference on Management of Data, Volume 29, Issue 2, pages 427 – 438, May2000.
  34. R. Rezaee, B.P.F. Lelie and J.H.C. Reiber, “A new cluster validity index for the fuzzy C-mean”, Pattern Recog. Lett., 19,237-246, 1998.
  35. D. M. Rocke and J. J. Dai, “Sampling and subsampling for cluster analysis in data mining: With applications to sky survey data”, Data Mining and Knowledge Discovery, 7, pp. 215–232, 2003.
  36. S. Salvador and P. Chan, “Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms”, in Proceedings Sixteenth IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2004, Los Alamitos, CA, USA, IEEE Computer Society, pp. 576–584 , 2004.
  37. S. Still and W. Bialek, “How many clusters?” , An information-theoretic perspective, Neural Computation, 16, pp. 2483–2506, 2004.
  38. Stefan Wuchty and Peter F. Stadler. “Centers of Complex Networks”. 2006
  39. C. Sugar and G. James, “Finding the number of clusters in a data set”, An information theoretic approach, Journal of the American Statistical Association, 98 pp. 750–763, 2003.
  40. Y.Xu, V. Olman and D. Xu. “Minimum spanning trees for gene expression data clustering”. Genome Informatics,12:24-33, 2001.
  41. K.Yoon, O.Kwon and D.Bae, “An approach to outlier Detection of Software Measurement Data using the K-means Clustering Method”, First International Symposium on Empirical Software Engineering and Measurement(ESEM 2007), Madrid.pp.443-445, 2007.
  42. C. Zahn. “Graph-theoretical methods for detecting and describing gestalt clusters”. IEEE Transactions on Computers, C-20:68-86, 1971
  43. J.Zhang and N.Wang, “Detecting outlying subspaces for high-dimensional data: the new task”, Algorithms and Performance, Knowledge and Information Systems, 10(3):333-555, 2006.
Index Terms

Computer Science
Information Sciences

Keywords

Euclidean minimum spanning tree Eccentricity Cluster validity Cluster Separation Outlyingness factor Outliers