Research Article

Implementation of Text Similarity using Word Frequency and Cosine Similarity in Python

by Ahmad Farhan AlShammari
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 185 - Number 36
Year of Publication: 2023
Authors: Ahmad Farhan AlShammari
DOI: 10.5120/ijca2023923160

Ahmad Farhan AlShammari. Implementation of Text Similarity using Word Frequency and Cosine Similarity in Python. International Journal of Computer Applications. 185, 36 (Oct 2023), 54-59. DOI=10.5120/ijca2023923160

@article{ 10.5120/ijca2023923160,
author = { Ahmad Farhan AlShammari },
title = { Implementation of Text Similarity using Word Frequency and Cosine Similarity in Python },
journal = { International Journal of Computer Applications },
issue_date = { Oct 2023 },
volume = { 185 },
number = { 36 },
month = { Oct },
year = { 2023 },
issn = { 0975-8887 },
pages = { 54-59 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume185/number36/32927-2023923160/ },
doi = { 10.5120/ijca2023923160 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:27:59.189559+05:30
%A Ahmad Farhan AlShammari
%T Implementation of Text Similarity using Word Frequency and Cosine Similarity in Python
%J International Journal of Computer Applications
%@ 0975-8887
%V 185
%N 36
%P 54-59
%D 2023
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The goal of this research is to develop a text similarity program in Python using word frequency and cosine similarity. Word frequency is used to measure the importance of each word in a text, and cosine similarity is used to measure how similar two texts are. The basic steps of text similarity are explained: preprocessing the text, creating the list of words, creating the bag of words, computing the word frequency, calculating the cosine similarity, and printing the similarity score. The developed program was tested on an experimental text from Wikipedia. It successfully performed each of these steps and produced the required results.
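
The steps listed in the abstract can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the program developed in the paper: the function names (tokenize, word_frequency, cosine_similarity, text_similarity), the regex-based preprocessing, the two sample sentences, and the choice to return 0.0 for an empty text are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Preprocess: lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequency(words, vocabulary):
    """Count how often each vocabulary word occurs in the list of words."""
    counts = Counter(words)
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty text is treated as dissimilar
    return dot / (norm_u * norm_v)


def text_similarity(text_a, text_b):
    """Run the pipeline: words -> bag of words -> frequencies -> cosine."""
    words_a = tokenize(text_a)
    words_b = tokenize(text_b)
    # Bag of words: the shared vocabulary of both texts, in a fixed order.
    vocabulary = sorted(set(words_a) | set(words_b))
    freq_a = word_frequency(words_a, vocabulary)
    freq_b = word_frequency(words_b, vocabulary)
    return cosine_similarity(freq_a, freq_b)


if __name__ == "__main__":
    a = "The cat sat on the mat."
    b = "The cat lay on the rug."
    print(f"Similarity score: {text_similarity(a, b):.3f}")  # 0.750
```

For the two sample sentences, the shared words "the" (twice in each), "cat", and "on" give a dot product of 6, and each vector has squared length 8, so the score is 6/8 = 0.75. Raw counts are used as the word frequency here; a weighted scheme would change the vectors but not the cosine step.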

Index Terms

Computer Science
Information Sciences

Keywords

Artificial Intelligence, Machine Learning, Natural Language Processing, Text Mining, Text Similarity, Word Frequency, Cosine Similarity, Python Programming