Research Article

Arabic Text Copy Detection using Full, Reduced and Unique Syntactical Structures

by Mohamed Taybe Elhadi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 154 - Number 4
Year of Publication: 2016
Authors: Mohamed Taybe Elhadi
10.5120/ijca2016912088

Mohamed Taybe Elhadi. Arabic Text Copy Detection using Full, Reduced and Unique Syntactical Structures. International Journal of Computer Applications. 154, 4 (Nov 2016), 13-17. DOI=10.5120/ijca2016912088

@article{ 10.5120/ijca2016912088,
author = { Mohamed Taybe Elhadi },
title = { Arabic Text Copy Detection using Full, Reduced and Unique Syntactical Structures },
journal = { International Journal of Computer Applications },
issue_date = { Nov 2016 },
volume = { 154 },
number = { 4 },
month = { Nov },
year = { 2016 },
issn = { 0975-8887 },
pages = { 13-17 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume154/number4/26478-2016912088/ },
doi = { 10.5120/ijca2016912088 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:59:19.303292+05:30
%A Mohamed Taybe Elhadi
%T Arabic Text Copy Detection using Full, Reduced and Unique Syntactical Structures
%J International Journal of Computer Applications
%@ 0975-8887
%V 154
%N 4
%P 13-17
%D 2016
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper reports on work investigating the combined use of Part of Speech (POS) tagging and a minimum edit operations algorithm to determine the level of similarity between pairs of Arabic text documents. The level of similarity can be used as an indication of full or partial duplication of a document's content. Text is first converted into POS tags, which are then fed to the string similarity algorithm to determine the similarity of pairs of documents. A normalized score is calculated and used to rank documents. Documents ranked above a selected threshold are considered similar and may be near or complete duplicates. The experiments compare results based on a set of selected common subsequences obtained by translating text into a sequence of syntactical units. The strings are first produced from the full text (FULL). These are then refined to produce a REDUCED string, in which repeated consecutive characters are reduced to a single character and a number, and refined further to produce a UNIQUE string, in which all repeating characters are replaced by a single character. Syntactical features of the text were used as a structural representation of the documents' content. Results from the experiments using the FULL, REDUCED and UNIQUE POS-strings showed a clear advantage over the plain text in terms of reduced string size while maintaining the same discrimination power. In particular, the UNIQUE (most-reduced) string showed results quite comparable to those of the REDUCED string, the FULL string and the actual text string.
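The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation, and it rests on several assumptions: the POS tagger has already run and each tag has been mapped to a hypothetical one-letter code; REDUCED appends a count only to runs longer than one; UNIQUE collapses each consecutive run to one character; the edit algorithm is plain Levenshtein distance; and the score is normalized by the length of the longer string. The function names and the 0.8 threshold are illustrative only.

```python
from itertools import groupby


def reduce_runs(tags: str) -> str:
    """REDUCED form: each run of a repeated tag becomes the tag plus its run length."""
    out = []
    for tag, run in groupby(tags):
        n = len(list(run))
        # Assumption: a tag that is not repeated is kept without a count.
        out.append(f"{tag}{n}" if n > 1 else tag)
    return "".join(out)


def unique_runs(tags: str) -> str:
    """UNIQUE form: each run of a repeated tag becomes a single tag."""
    return "".join(tag for tag, _ in groupby(tags))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag a pair as a near or complete duplicate (threshold is hypothetical)."""
    return similarity(a, b) >= threshold


full = "NNNVVAN"                   # hypothetical one-letter tag codes
print(reduce_runs(full))           # N3V2AN
print(unique_runs(full))           # NVAN
print(similarity("NVAN", "NVNN"))  # 0.75
```

Because the UNIQUE string is the shortest of the three, the quadratic-time edit distance runs on far fewer characters, which is the size advantage the abstract reports.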

Index Terms

Computer Science
Information Sciences

Keywords

Arabic text processing, syntactical structures, document similarity, reduction, edit-based string similarity, copy detection.