A Novel Framework and Model for Data Warehouse Cleansing

International Journal of Computer Applications
© 2011 by IJCA Journal
Number 1 - Article 1
Year of Publication: 2011
Daya Gupta
Payal Pahwa
Rajiv Arora

Daya Gupta, Payal Pahwa and Rajiv Arora. Article: Novel Framework and Model for Data Warehouse Cleansing. International Journal of Computer Applications 32(8):6-13, October 2011. Full text available.

BibTeX

	author = {Daya Gupta and Payal Pahwa and Rajiv Arora},
	title = {Article:Novel Framework and Model for Data Warehouse Cleansing},
	journal = {International Journal of Computer Applications},
	year = {2011},
	volume = {32},
	number = {8},
	pages = {6-13},
	month = {October},
	note = {Full text available}


Data cleansing is the process of identifying corrupt and duplicate data in the data sets of a data warehouse in order to improve data quality. This paper aims to facilitate data cleansing by addressing the problem of detecting duplicate records in the ‘name’ attributes of the data sets. It presents a novel framework, expressed as a sequence of algorithms, for identifying duplicates in the ‘name’ attribute of the data sets of an already existing data warehouse. The key features of the research include its proposal of a novel framework through a well-defined sequence of algorithms, and its refinement of the application of alliance rules [1] by incorporating existing, well-defined similarity computation measures. The results presented show the feasibility and validity of the suggested method.
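As an illustration of how a similarity computation measure can flag likely duplicates in a ‘name’ attribute, the following is a minimal sketch. It is not the framework or the alliance rules proposed in the paper, whose algorithms are not reproduced on this page. The choice of normalized Levenshtein edit distance, the helper names (`normalize`, `similarity`, `find_duplicate_names`) and the 0.8 threshold are all assumptions made for the example.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace (assumed cleanup step)."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def find_duplicate_names(names, threshold=0.8):
    """Return index pairs (i, j), i < j, whose names meet the assumed threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        if similarity(a, b) >= threshold:
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    names = ["Rajiv Arora", "rajiv  arora", "Rajeev Arora", "Payal Pahwa"]
    print(find_duplicate_names(names))
```

The pairwise comparison is quadratic in the number of records, so a real data warehouse would normally restrict comparisons to candidate groups first; the threshold would also need tuning against labelled data.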


  • Rajiv Arora, Payal Pahwa, Shubha Bansal, “Alliance Rules for Data Warehouse Cleansing”, International Conference on Signal Processing Systems, IEEE Xplore, DOI 10.1109/ICSPS, 133, pages 743-747, 2009.
  • P. Ponniah, “Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals”, 1st ed., second reprint, ISBN 81-265-0919-8, Glorious Printers: New Delhi, India, 2007.
  • A. Marcus, J. I. Maletic, “Utilizing Association Rules for the Identification of Errors in Data”, TR-CS-00-04, University of Memphis, 2004.
  • A. Marcus, J. I. Maletic, “Data Cleansing: Beyond Integrity Analysis”, Proceedings of the Conference on Information Quality (IQ2000), Boston: Massachusetts Institute of Technology, pp. 200-209, 2000.
  • T. Redman, “The Impact of Poor Data Quality on the Typical Enterprise”, Communications of the ACM, Vol. 41. 8, February 1998.
  • A. Marcus, J. I. Maletic, “Automated Identification of Errors in Data Sets”, TR-CS-00-02, University of Memphis, 2002.
  • A. Marcus, J. I. Maletic and K.-I. Lin, “Association Rules for Error Identification in Data Sets”, Proceedings of the 10th ACM Conference on Information and Knowledge Management (ACM CIKM 2001), Atlanta, GA, pp. 589-591, 2001.
  • Peter Christen, “A Comparison of Personal Name Matching: Techniques and Practical Issues”, Joint Computer Science Technical Report Series, TR-CS-06-02, September 2006.
  • Gérard Bouchard and Christian Pouyez, “Name Variations and Computerised Record Linkage”, Historical Methods, Vol. 13, No. 2, Springer, 1980, pp. 119-125.
  • Timothy E. Ohanekwu, C. I. Ezeife, “A Token-Based Data Cleaning Technique for Data Warehouse Systems”, Ontario, Canada N9B 3P4.
  • Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning”, ACM SIGMOD, 2003.
  • Amit Rudra, Emilie Yeo, “Key Issues in Achieving Data Quality and Consistency in Data Warehousing among Large Organisations in Australia”, Proceedings of the 32nd Hawaii International Conference on System Sciences, 1999.
  • E. Rahm, H. H. Do, “Data Cleaning: Problems and Current Approaches”, IEEE Techn. Bull. Data Eng., Dec. 2000.
  • Heiko Müller, Johann-Christoph Freytag, “Problems, Methods, and Challenges in Comprehensive Data Cleansing”, 10099 Berlin, Germany.
  • A. D. Chapman, “Principles and Methods of Data Cleaning – Primary Species and Species-Occurrence Data”, version 1.0, Report for the Global Biodiversity Information Facility, Copenhagen, 2005.
  • Rohit Ananthakrishna (Cornell University), Surajit Chaudhuri, Venkatesh Ganti (Microsoft Research), “Eliminating Fuzzy Duplicates in Data Warehouses”.
  • Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios, “Duplicate Record Detection: A Survey”, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, January 2007.
  • Oktie Hassanzadeh, Mohammad Sadoghi, Renée J. Miller, “Accuracy of Approximate String Joins Using Grams”, University of Toronto, 10 King’s College Rd., Toronto, ON M5S 3G4, Canada.
  • Jakub Piskorski, Marcin Sydow, “Usability of String Distance Metrics for Name Matching Tasks in Polish”.