A Novel Similarity Measure for Clustering Categorical Data Sets

Rishi Sayal; V. Vijay Kumar

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 17 - Number 1

Year of Publication: 2011

Authors: Rishi Sayal, V. Vijay Kumar

DOI: 10.5120/2184-2757

Rishi Sayal, V. Vijay Kumar. A Novel Similarity Measure for Clustering Categorical Data Sets. International Journal of Computer Applications. 17, 1 (March 2011), 25-30. DOI=10.5120/2184-2757

@article{ 10.5120/2184-2757,
  author = { Rishi Sayal and V. Vijay Kumar },
  title = { A Novel Similarity Measure for Clustering Categorical Data Sets },
  journal = { International Journal of Computer Applications },
  issue_date = { March 2011 },
  volume = { 17 },
  number = { 1 },
  month = { March },
  year = { 2011 },
  issn = { 0975-8887 },
  pages = { 25-30 },
  numpages = {9},
  url = { https://ijcaonline.org/archives/volume17/number1/2184-2757/ },
  doi = { 10.5120/2184-2757 },
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

Abstract

Measuring the similarity between two data objects is a challenging problem in data mining and knowledge discovery. Traditional clustering algorithms have focused mainly on numerical data, whose inherent geometric properties can be exploited to define a distance function between data points and, from it, a similarity measure. The problem becomes more complex when the data is categorical, that is, when attribute values have no natural ordering (such attributes can be called non-geometrical). Clustering relational data sets in which the majority of attributes are categorical raises interesting issues. No earlier work has clustered the categorical attributes of relational data sets using functional dependency as a parameter for measuring similarity. This paper extends earlier work on clustering relational data sets, in which domains are unique and similarity is context based, and introduces a new notion of similarity based on the dependency of an attribute on the other attributes in the relational data set. The paper also gives a brief overview of popular similarity measures for categorical attributes. The proposed similarity measure can be applied to tuples and to their respective attribute values. An important property of categorical domains is that they have a small number of attribute values, so the similarity measure for relational data sets can be applied to the resulting smaller data sets for efficient results.
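The abstract does not spell out the proposed measure, so the following Python sketch illustrates only the two ingredients it names, not the authors' method. It assumes a simple overlap (attribute-matching) similarity between tuples as a baseline categorical measure, and a hypothetical `dependency_strength` score that estimates how strongly one attribute determines another in a relation. The function names, the scoring rule, and the toy relation are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def overlap_similarity(t1, t2):
    """Fraction of attribute positions on which two tuples agree."""
    if len(t1) != len(t2):
        raise ValueError("tuples must have the same number of attributes")
    if not t1:
        return 0.0
    matches = sum(1 for a, b in zip(t1, t2) if a == b)
    return matches / len(t1)


def dependency_strength(rows, lhs, rhs):
    """Estimate how strongly column `lhs` determines column `rhs`.

    For every distinct lhs value, keep only the rows carrying its most
    frequent rhs value. The score is the fraction of rows kept; 1.0 means
    the functional dependency lhs -> rhs holds exactly on these rows.
    (Illustrative scoring rule, not taken from the paper.)
    """
    if not rows:
        return 0.0
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)


# Hypothetical toy relation with columns (city, state, colour).
rows = [
    ("Austin", "TX", "red"),
    ("Austin", "TX", "blue"),
    ("Dallas", "TX", "red"),
    ("Boston", "MA", "blue"),
]

print(overlap_similarity(rows[0], rows[1]))  # two of three attributes match
print(dependency_strength(rows, 0, 1))       # city -> state
print(dependency_strength(rows, 1, 2))       # state -> colour
```

In this toy relation, city determines state exactly, while state only partly determines colour. A dependency-aware similarity could use such scores to weight attribute matches, though how the paper combines them is not stated in the abstract.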

Index Terms

Computer Science

Information Sciences

Keywords

Data Clustering, Similarity measures, Context based similarity, Categorical attributes and functional dependency