Research Article

A Simple and Efficient Framework for Sentence Similarity Measurement in Bengali Language

by Maruf Ahmed Mridul, Arnab Sen Sharma
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 183 - Number 21
Year of Publication: 2021
DOI: 10.5120/ijca2021921582

Maruf Ahmed Mridul, Arnab Sen Sharma. A Simple and Efficient Framework for Sentence Similarity Measurement in Bengali Language. International Journal of Computer Applications. 183, 21 (Aug 2021), 1-7. DOI=10.5120/ijca2021921582

@article{ 10.5120/ijca2021921582,
author = { Maruf Ahmed Mridul and Arnab Sen Sharma },
title = { A Simple and Efficient Framework for Sentence Similarity Measurement in Bengali Language },
journal = { International Journal of Computer Applications },
issue_date = { Aug 2021 },
volume = { 183 },
number = { 21 },
month = { Aug },
year = { 2021 },
issn = { 0975-8887 },
pages = { 1-7 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume183/number21/32046-2021921582/ },
doi = { 10.5120/ijca2021921582 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:17:23.751943+05:30
%A Maruf Ahmed Mridul
%A Arnab Sen Sharma
%T A Simple and Efficient Framework for Sentence Similarity Measurement in Bengali Language
%J International Journal of Computer Applications
%@ 0975-8887
%V 183
%N 21
%P 1-7
%D 2021
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Sentence similarity measurement is crucial to the performance of several Natural Language Processing applications, and it has received much attention, mainly for the English language. For low-resource languages like Bengali, however, very little work has been done in this field. This article proposes a simple approach to measuring sentence similarity scores for low-resource languages. Rather than relying on complex approaches that try to extract lexical information from text, it extracts semantic information using language-agnostic language models based on BERT. The variable-length sentences in each pair are embedded into fixed-length feature vectors using different language-agnostic BERT sentence encoders; the differences between the embeddings are then measured using some standard loss functions; and finally the concatenated loss vectors are used to train a simple feed-forward neural network that predicts the similarity score of the sentence pair. The experiments show that this relatively simple approach gives satisfactory results when trained with Bengali sentence pairs. Because the approach requires almost no intricate pre-processing steps, a similar architecture should work well for other low-resource languages for which well-performing stemmers, lemmatizers, etc. are scarce.
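
The pipeline described in the abstract (encode, compare, concatenate, score) can be illustrated with a minimal Python sketch. Everything specific here is an assumption rather than the authors' implementation: `stub_encoder` is a hypothetical stand-in for a pretrained language-agnostic BERT sentence encoder, the choice of elementwise absolute difference, elementwise squared difference, and cosine distance is only one plausible reading of "standard loss functions", and the network weights are random and untrained, so the printed score carries no meaning.

```python
import hashlib
import math
import random


def stub_encoder(name, dim=8):
    """Hypothetical stand-in for a language-agnostic BERT sentence encoder.

    Maps any sentence to a deterministic fixed-length vector. A real
    pipeline would call a pretrained multilingual model here instead.
    """
    def encode(sentence):
        digest = hashlib.sha256((name + "|" + sentence).encode("utf-8")).digest()
        rng = random.Random(int.from_bytes(digest, "big"))
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return encode


def pair_features(u, v):
    """Assumed difference measures: elementwise L1, elementwise L2, cosine distance."""
    l1 = [abs(a - b) for a, b in zip(u, v)]
    l2 = [(a - b) ** 2 for a, b in zip(u, v)]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos_dist = 1.0 - dot / norm if norm else 1.0
    return l1 + l2 + [cos_dist]


def build_features(s1, s2, encoders):
    """Concatenate the difference features produced by every encoder."""
    feats = []
    for enc in encoders:
        feats.extend(pair_features(enc(s1), enc(s2)))
    return feats


def feed_forward(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a sigmoid output in (0, 1)."""
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    z = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))


# Two stub encoders play the role of "different" sentence encoders.
encoders = [stub_encoder("encoder_a"), stub_encoder("encoder_b")]
features = build_features("আমি ভাত খাই", "আমি ভাত খাচ্ছি", encoders)

# Untrained random weights: this only demonstrates the forward pass.
rng = random.Random(0)
hidden_size = 4
w1 = [[rng.uniform(-0.1, 0.1) for _ in features] for _ in range(hidden_size)]
b1 = [0.0] * hidden_size
w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
score = feed_forward(features, w1, b1, w2, 0.0)
print(len(features), score)
```

In this sketch each encoder contributes 2 × 8 + 1 = 17 features, so two encoders yield a 34-dimensional input. Training would fit the weights against labeled sentence pairs; that step is omitted here because the abstract does not specify the training setup.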

Index Terms

Computer Science
Information Sciences

Keywords

Sentence Similarity, Feed Forward Neural Networks, Natural Language Processing, Sentence Transformers, Multilingual BERT