Research Article

A Novel Similarity Measure for Clustering Categorical Data Sets

by Rishi Sayal, V. Vijay Kumar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 17 - Number 1
Year of Publication: 2011
Authors: Rishi Sayal, V. Vijay Kumar
10.5120/2184-2757

Rishi Sayal, V. Vijay Kumar. A Novel Similarity Measure for Clustering Categorical Data Sets. International Journal of Computer Applications. 17, 1 (March 2011), 25-30. DOI=10.5120/2184-2757

@article{ 10.5120/2184-2757,
author = { Rishi Sayal and V. Vijay Kumar },
title = { A Novel Similarity Measure for Clustering Categorical Data Sets },
journal = { International Journal of Computer Applications },
issue_date = { March 2011 },
volume = { 17 },
number = { 1 },
month = { March },
year = { 2011 },
issn = { 0975-8887 },
pages = { 25-30 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume17/number1/2184-2757/ },
doi = { 10.5120/2184-2757 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:04:30.984354+05:30
%A Rishi Sayal
%A V. Vijay Kumar
%T A Novel Similarity Measure for Clustering Categorical Data Sets
%J International Journal of Computer Applications
%@ 0975-8887
%V 17
%N 1
%P 25-30
%D 2011
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Measuring similarity between two data objects is a challenging problem in data mining and knowledge discovery. Traditional clustering algorithms have focused mainly on numerical data, whose implicit geometric properties can be exploited to define a distance function between data points and, from it, a similarity measure. The problem becomes more complex when the data is categorical, that is, when the attribute values have no natural ordering; such attributes can be called non-geometrical. Clustering relational data sets in which the majority of attributes are categorical yields interesting observations. No earlier work has clustered the categorical attributes of relational data sets using functional dependency as a parameter for measuring similarity. This paper extends earlier work on clustering relational data sets, in which domains are unique and similarity is context based, and introduces a new notion of similarity based on the dependency of an attribute on the other attributes in the relational data set. It also gives a brief overview of popular similarity measures for categorical attributes. The proposed similarity measure can be applied to tuples and their respective values. An important property of categorical domains is that they have a small number of attribute values. The similarity measure for relational data sets can therefore be applied to these smaller data sets for efficient results.
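
The abstract does not state the proposed measure itself, so the following is only a rough illustration of two ingredients it names: a simple similarity between categorical tuples, and a functional dependency between attributes. It is a minimal Python sketch, not the authors' method. The overlap measure is a common baseline for categorical data, and the function names (`overlap_similarity`, `holds_fd`) and the sample relation are hypothetical.

```python
def overlap_similarity(t1, t2):
    """Fraction of attribute positions where two tuples hold the same value.

    A common baseline for categorical data; NOT the measure proposed in the paper.
    """
    if len(t1) != len(t2) or not t1:
        raise ValueError("tuples must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(t1, t2) if a == b)
    return matches / len(t1)


def holds_fd(rows, lhs, rhs):
    """Return True if the attribute at index `lhs` functionally determines
    the attribute at index `rhs` in the given rows."""
    seen = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        # setdefault stores the first value seen for this key; any later
        # row with the same key but a different value violates the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical relation with attributes (department, building, grade).
rows = [
    ("CSE", "B1", "A"),
    ("CSE", "B1", "B"),
    ("ECE", "B2", "A"),
    ("ECE", "B2", "C"),
]

print(overlap_similarity(rows[0], rows[1]))  # similarity of the first two tuples
print(holds_fd(rows, 0, 1))  # does department determine building?
print(holds_fd(rows, 0, 2))  # does department determine grade?
```

In this toy relation, department determines building but not grade. A dependency-aware measure could in principle use such relationships when comparing tuples; how the paper does so is not specified in the abstract.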

Index Terms

Computer Science
Information Sciences

Keywords

Data Clustering, Similarity measures, Context based similarity, Categorical attributes and functional dependency