Research Article

Bangla News Document Categorization using Deep Learning Approaches and Fine-tuned BERT

by Muhammad Anwarul Azim, Md Gias Uddin, Mohammad Khairul Islam, Abu Nowshed Chy
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 9
Year of Publication: 2025
DOI: 10.5120/ijca2025924469

Muhammad Anwarul Azim, Md Gias Uddin, Mohammad Khairul Islam, Abu Nowshed Chy. Bangla News Document Categorization using Deep Learning Approaches and Fine-tuned BERT. International Journal of Computer Applications. 187, 9 (May 2025), 1-9. DOI=10.5120/ijca2025924469

@article{ 10.5120/ijca2025924469,
author = { Muhammad Anwarul Azim and Md Gias Uddin and Mohammad Khairul Islam and Abu Nowshed Chy },
title = { Bangla News Document Categorization using Deep Learning Approaches and Fine-tuned BERT },
journal = { International Journal of Computer Applications },
issue_date = { May 2025 },
volume = { 187 },
number = { 9 },
month = { May },
year = { 2025 },
issn = { 0975-8887 },
pages = { 1-9 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number9/bangla-news-document-categorization-using-deep-learning-approaches-and-fine-tuned-bert/ },
doi = { 10.5120/ijca2025924469 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2025-06-01T00:56:22.928488+05:30
%A Muhammad Anwarul Azim
%A Md Gias Uddin
%A Mohammad Khairul Islam
%A Abu Nowshed Chy
%T Bangla News Document Categorization using Deep Learning Approaches and Fine-tuned BERT
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 9
%P 1-9
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

With the explosive growth of text documents available in digital form, document categorization has become a critical challenge in managing digital data effectively and precisely. Researchers therefore apply supervised, semi-supervised, and unsupervised approaches to categorize text documents. Recently, Transformer-based models have shown outstanding results in downstream natural language processing tasks such as text classification, sentiment analysis, emotion classification, named entity recognition, and spam email detection. Because Bangla is a widely spoken language, we apply the deep neural network classifiers CBiLSTM, BiLSTM, and FastText, together with a Transformer-based BERT classifier, to categorize Bangla news documents into predefined categories. We use pre-trained fastText word embedding vectors with the CBiLSTM and BiLSTM classifiers. The dataset contains 28,800 news documents in 12 categories. To find the best outcome, we tune each model with different hyperparameter settings. The fine-tuned BERT classifier achieves the highest accuracy, 94.74%, among the classifiers. We also compare the accuracy of the different classifier models on Bangla news documents.
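
The model comparison summarized in the abstract rests on classification accuracy: the share of test documents whose predicted category matches the gold label. The following is a minimal Python sketch of such a comparison. The function names, category labels, model names, and toy predictions are illustrative assumptions; they are not the paper's data, code, or results.

```python
from typing import Dict, List, Sequence, Tuple


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of documents whose predicted category equals the gold category."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def rank_models(
    gold: Sequence[str], predictions: Dict[str, Sequence[str]]
) -> List[Tuple[str, float]]:
    """Return (model name, accuracy) pairs sorted from best to worst."""
    scores = [(name, accuracy(gold, preds)) for name, preds in predictions.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy gold labels and predictions (hypothetical, for illustration only).
gold_labels = ["sports", "politics", "economy", "sports"]
toy_predictions = {
    "model_a": ["sports", "politics", "economy", "sports"],
    "model_b": ["sports", "economy", "economy", "sports"],
}

for name, score in rank_models(gold_labels, toy_predictions):
    print(f"{name}: {score:.2%}")
```

In practice, the predictions for each classifier would come from running it on the same held-out test split, so that the accuracies are directly comparable.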

Index Terms

Computer Science
Information Sciences

Keywords

Bangla News Document Categorization; Text Classification; CBiLSTM; BiLSTM; Word embedding; fastText; Fine-tuned BERT