Research Article

Comparative Performance Analysis of BM25 and Vector Space Model for Document Retrieval in Gujarati News Corpora

by Shreya M. Kapadia, Payal D. Joshi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 110
Year of Publication: 2026
10.5120/ijca6a8da113498f

Shreya M. Kapadia, Payal D. Joshi. Comparative Performance Analysis of BM25 and Vector Space Model for Document Retrieval in Gujarati News Corpora. International Journal of Computer Applications. 187, 110 (May 2026), 38-44. DOI=10.5120/ijca6a8da113498f

@article{ 10.5120/ijca6a8da113498f,
author = { Shreya M. Kapadia and Payal D. Joshi },
title = { Comparative Performance Analysis of BM25 and Vector Space Model for Document Retrieval in Gujarati News Corpora },
journal = { International Journal of Computer Applications },
issue_date = { May 2026 },
volume = { 187 },
number = { 110 },
month = { May },
year = { 2026 },
issn = { 0975-8887 },
pages = { 38-44 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number110/comparative-performance-analysis-of-bm25-and-vector-space-model-for-document-retrieval-in-gujarati-news-corpora/ },
doi = { 10.5120/ijca6a8da113498f },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2026-05-30T22:32:56.027223+05:30
%A Shreya M. Kapadia
%A Payal D. Joshi
%T Comparative Performance Analysis of BM25 and Vector Space Model for Document Retrieval in Gujarati News Corpora
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 110
%P 38-44
%D 2026
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Retrieving relevant information from Gujarati news articles is challenging: digital news content is growing rapidly, but computational resources and language-processing tools for Gujarati remain limited. This study proposes an information retrieval framework for Gujarati news documents, built on the GSF-2009 corpus released under the FIRE evaluation initiative. Two classical retrieval models, BM25 and the Vector Space Model (VSM), retrieve and rank documents relevant to user-defined, event-oriented queries. Both models are evaluated on a short and a descriptive Gujarati query. For the short query “ગુજરાતમાં ભારે વરસાદ”, VSM performs better, with Recall = 0.7 and F1-score = 0.8, whereas BM25 records Recall = 0.3 and F1-score = 0.5. In contrast, for the descriptive query “ગુજરાતમાં ભારે વરસાદના કારણે અનેક જિલ્લાઓમાં પૂર જેવી સ્થિતિ”, BM25 outperforms VSM with Precision = 1.0, Recall = 0.7, and F1-score = 0.8, whereas VSM achieves Precision = 0.8, Recall = 0.5, and F1-score = 0.6. These results indicate that VSM is more effective for short keyword-based queries, while BM25 is more effective for long, context-rich queries. The study does not perform explicit event detection; it supports event-oriented retrieval by returning documents associated with real-world events. The proposed framework provides a baseline for Gujarati news retrieval and supports further research on event-oriented retrieval for low-resource Indic languages.
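The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the parameter values k1 = 1.5 and b = 0.75, the smoothed IDF formulas, the function names, and the three toy documents are all assumptions made for illustration, and the GSF-2009 corpus itself is not used. The sketch ranks documents with Okapi BM25 and with TF-IDF cosine similarity (one common form of VSM), and computes precision, recall, and F1 from a retrieved set and a relevant set.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization only; no Gujarati stemming or stop-word removal.
    return text.split()


def doc_freqs(token_lists):
    # Number of documents in which each term appears.
    return Counter(term for tokens in token_lists for term in set(tokens))


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Okapi BM25 with the "+1 inside the log" IDF variant, which keeps IDF non-negative.
    token_lists = [tokenize(d) for d in docs]
    n = len(token_lists)
    avgdl = sum(len(t) for t in token_lists) / n
    df = doc_freqs(token_lists)
    scores = []
    for tokens in token_lists:
        tf = Counter(tokens)
        length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def vsm_scores(query, docs):
    # TF-IDF vectors compared by cosine similarity.
    token_lists = [tokenize(d) for d in docs]
    n = len(token_lists)
    df = doc_freqs(token_lists)

    def weights(tokens):
        # Smoothed IDF so that query terms unseen in the corpus stay finite.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                for t, c in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = weights(tokenize(query))
    return [cosine(q, weights(t)) for t in token_lists]


def precision_recall_f1(retrieved, relevant):
    # Set-based precision, recall, and their harmonic mean (F1).
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


if __name__ == "__main__":
    # Hypothetical toy documents, not taken from the GSF-2009 corpus.
    toy_docs = ["ગુજરાતમાં ભારે વરસાદ", "ભારે વરસાદ પૂર", "ક્રિકેટ મેચ"]
    short_query = "ગુજરાતમાં ભારે વરસાદ"
    print("BM25:", bm25_scores(short_query, toy_docs))
    print("VSM: ", vsm_scores(short_query, toy_docs))
```

As a consistency check on the reported figures, F1 is the harmonic mean 2PR / (P + R). For the descriptive query, P = 1.0 and R = 0.7 give F1 ≈ 0.82, and P = 0.8 and R = 0.5 give F1 ≈ 0.62, which round to the stated 0.8 and 0.6.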

Index Terms

Computer Science
Information Sciences

Keywords

Gujarati News Retrieval
Information Retrieval
BM25
Vector Space Model (VSM)
GSF-2009 Corpus
Event-Oriented Retrieval