Research Article

A Comprehensive Comparative Study of Word and Sentence Similarity Measures

by Issa Atoum, Ahmed Otoom, Narayanan Kulathuramaiyer
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 135 - Number 1
Year of Publication: 2016
10.5120/ijca2016908259

Issa Atoum, Ahmed Otoom, Narayanan Kulathuramaiyer. A Comprehensive Comparative Study of Word and Sentence Similarity Measures. International Journal of Computer Applications. 135, 1 (February 2016), 10-17. DOI=10.5120/ijca2016908259

@article{ 10.5120/ijca2016908259,
author = { Issa Atoum and Ahmed Otoom and Narayanan Kulathuramaiyer },
title = { A Comprehensive Comparative Study of Word and Sentence Similarity Measures },
journal = { International Journal of Computer Applications },
issue_date = { February 2016 },
volume = { 135 },
number = { 1 },
month = { February },
year = { 2016 },
issn = { 0975-8887 },
pages = { 10-17 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume135/number1/24012-2016908259/ },
doi = { 10.5120/ijca2016908259 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:34:34.227976+05:30
%A Issa Atoum
%A Ahmed Otoom
%A Narayanan Kulathuramaiyer
%T A Comprehensive Comparative Study of Word and Sentence Similarity Measures
%J International Journal of Computer Applications
%@ 0975-8887
%V 135
%N 1
%P 10-17
%D 2016
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Sentence similarity underlies many natural language tasks, such as information retrieval, question answering, and text summarization. The semantic similarity between compared text fragments depends on the semantic features of their words and the relationships among those words. This article reviews a set of word and sentence similarity measures and compares them on benchmark datasets. On the datasets studied, hybrid semantic measures performed better than both knowledge-based and corpus-based measures.

Index Terms

Computer Science
Information Sciences

Keywords

Word Similarity
Sentence Similarity
Corpus Measures
Knowledge Measures
Hybrid Measures
Text Similarity