
A Novel Approach for Developing Paraphrase Detection System using Machine Learning

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2021
Rudradityo Saha, G. Bharadwaja Kumar

Rudradityo Saha and G. Bharadwaja Kumar. A Novel Approach for Developing Paraphrase Detection System using Machine Learning. International Journal of Computer Applications 183(9):29-36, June 2021.

BibTeX:

	author = {Rudradityo Saha and G. Bharadwaja Kumar},
	title = {A Novel Approach for Developing Paraphrase Detection System using Machine Learning},
	journal = {International Journal of Computer Applications},
	issue_date = {June 2021},
	volume = {183},
	number = {9},
	month = {Jun},
	year = {2021},
	issn = {0975-8887},
	pages = {29-36},
	numpages = {8},
	url = {},
	doi = {10.5120/ijca2021921389},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}


Plagiarism detection is difficult because a sentence can be altered at several levels (lexical, semantic, and syntactic) to produce a paraphrased or plagiarized sentence that poses as original. To identify cases of plagiarism, and thereby discourage it, this paper presents a novel supervised machine learning based paraphrase detection system, developed through experiments on the Microsoft Research Paraphrase (MSRP) Corpus and evaluated on the same corpus. The proposed system achieves performance comparable to that of existing paraphrase detection systems. The major contributions of this paper are the use of a unique combination of lexical, semantic, and syntactic features; the use of Shapley Additive Explanations (SHAP) feature importance plots in XGBoost; and the application of a soft voting classifier comprising the three standalone machine learning classifiers that performed best on the training dataset of the MSRP Corpus. A further contribution is the finding that applying data augmentation techniques degrades the performance of the machine learning classifiers.
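
The soft voting step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `soft_vote` and `predict` and the probability values are hypothetical, and the abstract does not say which three classifiers were combined. The sketch assumes each classifier outputs a probability for "not paraphrase" (class 0) and "paraphrase" (class 1) for a sentence pair, averages those probabilities across classifiers, and picks the class with the highest average.

```python
from typing import List, Sequence

# One classifier's output for one sentence pair:
# [P(not paraphrase), P(paraphrase)].
Proba = Sequence[float]


def soft_vote(probas: List[Proba]) -> List[float]:
    """Average the class-probability vectors of several classifiers."""
    if not probas:
        raise ValueError("need at least one classifier output")
    n_classes = len(probas[0])
    if any(len(p) != n_classes for p in probas):
        raise ValueError("all classifiers must score the same classes")
    return [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]


def predict(probas: List[Proba]) -> int:
    """Return the index of the class with the highest averaged probability."""
    averaged = soft_vote(probas)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Hypothetical outputs of three classifiers for a single sentence pair.
example = [
    [0.55, 0.45],  # classifier A leans slightly towards "not paraphrase"
    [0.55, 0.45],  # classifier B leans slightly towards "not paraphrase"
    [0.05, 0.95],  # classifier C is confident in "paraphrase"
]

if __name__ == "__main__":
    print(soft_vote(example))
    print(predict(example))
```

In this made-up example a simple majority vote would choose class 0, because two of the three classifiers lean that way, whereas soft voting chooses class 1 because the third classifier's confidence outweighs the two weak votes. Weighing confidence in this way is the usual reason for preferring soft voting over hard voting.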




Natural Language Processing, Paraphrase Detection, Machine Learning, Classification, Supervised Learning