Parallel and Distributed Code Clone Detection using Sequential Pattern Mining

Ali El-matarawy; Mohammad El-ramly; Reem Bahgat

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 62 - Number 10

Year of Publication: 2013

10.5120/10118-4792

Ali El-matarawy, Mohammad El-ramly, Reem Bahgat. Parallel and Distributed Code Clone Detection using Sequential Pattern Mining. International Journal of Computer Applications. 62, 10 (January 2013), 25-31. DOI=10.5120/10118-4792

@article{ 10.5120/10118-4792,
author = { Ali El-matarawy and Mohammad El-ramly and Reem Bahgat },
title = { Parallel and Distributed Code Clone Detection using Sequential Pattern Mining },
journal = { International Journal of Computer Applications },
issue_date = { January 2013 },
volume = { 62 },
number = { 10 },
month = { January },
year = { 2013 },
issn = { 0975-8887 },
pages = { 25-31 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume62/number10/10118-4792/ },
doi = { 10.5120/10118-4792 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T21:11:27.053408+05:30
%A Ali El-matarawy
%A Mohammad El-ramly
%A Reem Bahgat
%T Parallel and Distributed Code Clone Detection using Sequential Pattern Mining
%J International Journal of Computer Applications
%@ 0975-8887
%V 62
%N 10
%P 25-31
%D 2013
%I Foundation of Computer Science (FCS), NY, USA

Abstract

This research presents a parallel and distributed data mining approach to code clone detection. It aims to demonstrate the value of deploying parallel and distributed computing for real-time, large-scale code clone detection. The approach is implemented in a family of clone detectors called PD EgyCD (Parallel and Distributed Egypt Clone Detector). It builds on the authors' earlier work on code clone and plagiarism detection using sequential pattern mining by adding parallelism and distribution to the earlier tool, EgyCD. The approach detects code clones with a tailored Apriori-based data mining algorithm, and it uses parallelization and distribution to scale clone detection to very large systems. It is implemented as a database application that leverages the capabilities of modern database tools. Two versions of the distributed technique have been developed. The first uses a client-server technique in which all clients and the server work with a single database. The second uses agents: each client acts as a separate agent with its own database and, after working on a sub-problem, submits its partial solution to the server, which assembles the complete solution (the set of code clones). Experiments show that the agent technique is faster than the client-server one, that distribution improves performance substantially, and that the speed improvement is a function of the number of clients or agents used. The conclusion is that data mining, combined with parallel and distributed computing, can be deployed efficiently for code clone detection in very large systems.
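The agent-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the PD EgyCD implementation: the abstract does not say how the work is partitioned or how the Apriori algorithm is tailored, so everything below is an assumption made for illustration. The names `local_occurrences` and `mine_clones` are hypothetical, source files are assumed to be already normalized into lists of statement strings, in-memory dictionaries stand in for each agent's database, and threads stand in for separate agent machines. Each agent reports where candidate statement sequences occur in its own files, and the server merges these partial results and keeps the sequences that meet a global support threshold. The Apriori property prunes a length-k candidate unless both of its length-(k-1) sub-sequences were already frequent.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat


def local_occurrences(files, k, frequent_prev):
    """Agent step (hypothetical): locate length-k statement windows in this
    agent's own files. Apriori pruning: for k > 1 a window is kept only if
    its length-(k-1) prefix and suffix were both globally frequent."""
    found = defaultdict(list)
    for name, stmts in files.items():
        for i in range(len(stmts) - k + 1):
            window = tuple(stmts[i:i + k])
            if k == 1 or (window[:-1] in frequent_prev
                          and window[1:] in frequent_prev):
                found[window].append((name, i))
    return found


def mine_clones(partitions, min_support=2, min_length=2):
    """Server step (hypothetical): merge the agents' partial results level
    by level and return sequences of at least min_length statements that
    occur at min_support or more locations and cannot be extended further."""
    frequent_prev, clones, k = set(), {}, 1
    # Threads stand in for separate agents; assumed, not from the paper.
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        while True:
            partials = pool.map(local_occurrences, partitions,
                                repeat(k), repeat(frequent_prev))
            merged = defaultdict(list)
            for partial in partials:
                for seq, locations in partial.items():
                    merged[seq].extend(locations)
            frequent = {seq: locs for seq, locs in merged.items()
                        if len(locs) >= min_support}
            if not frequent:
                break
            if k >= min_length:
                # Drop shorter clones that a longer frequent sequence extends.
                for seq in frequent:
                    clones.pop(seq[:-1], None)
                    clones.pop(seq[1:], None)
                clones.update(frequent)
            frequent_prev = set(frequent)
            k += 1
    return clones


# Two agents, each holding one normalized file (made-up example data).
partitions = [
    {"a.c": ["x = y", "if c", "call f", "return"]},
    {"b.c": ["call g", "x = y", "if c", "call f"]},
]
for sequence, locations in mine_clones(partitions).items():
    print(sequence, locations)
```

In this sketch the server synchronizes with the agents once per sequence length, which is one common way to distribute Apriori-style counting. A real deployment would replace the thread pool with remote agents and the dictionaries with database tables, and the trade-off between a shared database and per-agent databases described in the abstract is not modeled here.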


Index Terms

Computer Science

Information Sciences

Keywords

Code clones, textual approach, lexical approach, syntactic approach, clone types, parallel code clone detector, distributed code clone detector, clone relation terminologies, data mining, Apriori property, sequential pattern mining