Research Article

PSO Algorithm to Select Subsets of Q-Gram Features for Record Duplicate Detection

by M. Padmanaban, R. Radha
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 82 - Number 12
Year of Publication: 2013
Authors: M. Padmanaban, R. Radha
DOI: 10.5120/14166-9829

M. Padmanaban, R. Radha. PSO Algorithm to Select Subsets of Q-Gram Features for Record Duplicate Detection. International Journal of Computer Applications. 82, 12 (November 2013), 7-14. DOI=10.5120/14166-9829

@article{ 10.5120/14166-9829,
author = { M. Padmanaban and R. Radha },
title = { PSO Algorithm to Select Subsets of Q-Gram Features for Record Duplicate Detection },
journal = { International Journal of Computer Applications },
issue_date = { November 2013 },
volume = { 82 },
number = { 12 },
month = { November },
year = { 2013 },
issn = { 0975-8887 },
pages = { 7-14 },
numpages = {8},
url = { https://ijcaonline.org/archives/volume82/number12/14166-9829/ },
doi = { 10.5120/14166-9829 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:57:32.530137+05:30
%A M. Padmanaban
%A R. Radha
%T PSO Algorithm to Select Subsets of Q-Gram Features for Record Duplicate Detection
%J International Journal of Computer Applications
%@ 0975-8887
%V 82
%N 12
%P 7-14
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Data quality problems grow with the volume of stored data, and both private and government organizations have invested significantly in methods for removing replicas from data repositories. This has generated considerable research interest in efficient and effective duplicate detection strategies based on modern and emerging techniques. In previous work, duplicate record detection was performed using the Q-gram concept and a fuzzy classifier. That approach derives many different sets of features from the data using Q-grams, which makes it computationally expensive. To reduce the computational cost, this paper selects a subset of the important Q-gram-based features. The proposed technique consists of three steps: 1) feature computation, 2) feature selection, and 3) detection. First, the features are computed using the Q-gram concept; then the optimal feature subset is identified using the particle swarm optimization (PSO) algorithm, a widely used optimization algorithm. Once the optimal feature subset is selected, a Naïve Bayes classifier is used to detect duplicate records. The proposed duplicate record detection technique operates in two phases, a training phase and a testing phase. Experimental results show that the proposed technique achieves higher accuracy than the existing method, reaching an accuracy of 89%.
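The first two steps named in the abstract (Q-gram feature computation and PSO-based feature selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the padding character, the Jaccard overlap used as the Q-gram similarity, the sigmoid-based binary PSO update, the parameter values, and every function name here are assumptions made for illustration. In particular, `separation_fitness` is a simple stand-in objective; the paper would more plausibly score a feature subset by classifier accuracy, which this sketch does not attempt.

```python
import math
import random


def qgrams(text, q=2):
    """Set of overlapping q-grams of a lower-cased string padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Jaccard overlap of the two q-gram sets, a value in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def pair_features(rec_a, rec_b, qs=(1, 2, 3)):
    """One similarity feature per (field, q) combination for a record pair."""
    return [qgram_similarity(fa, fb, q) for fa, fb in zip(rec_a, rec_b) for q in qs]


def separation_fitness(mask, dup_rows, non_rows, penalty=0.01):
    """Stand-in objective (assumption): how far apart the selected features
    place duplicate and non-duplicate pairs, minus a cost per feature."""
    chosen = [d for d, bit in enumerate(mask) if bit]
    if not chosen:
        return -1.0

    def mean_of(rows):
        return sum(r[d] for r in rows for d in chosen) / (len(rows) * len(chosen))

    return mean_of(dup_rows) - mean_of(non_rows) - penalty * len(chosen)


def binary_pso(fitness, n_features, n_particles=10, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO: each particle is a 0/1 mask over features; velocities are
    squashed by a sigmoid into the probability of setting each bit to 1."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, v))  # clamp to keep exp() stable
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:  # update personal best, then global best
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


if __name__ == "__main__":
    # Toy record pairs, invented purely for illustration.
    dup_rows = [pair_features(("john smith", "new york"), ("jon smith", "new york"))]
    non_rows = [pair_features(("john smith", "new york"), ("mary jones", "boston"))]
    mask, score = binary_pso(
        lambda m: separation_fitness(m, dup_rows, non_rows), len(dup_rows[0]))
    print("selected feature mask:", mask, "fitness:", round(score, 3))
```

In a fuller pipeline, the columns selected by the mask would be passed to a Naïve Bayes classifier trained on labelled record pairs, matching the detection step and the training and testing phases described in the abstract.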

Index Terms

Computer Science
Information Sciences

Keywords

Duplicate data, Non-duplicate data, Particle swarm algorithm (PSO), Naïve Bayes classifier, Training, Testing