Research Article

Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments

by Amel Muminovic
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 25
Year of Publication: 2025
Authors: Amel Muminovic
10.5120/ijca2025925403

Amel Muminovic. Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments. International Journal of Computer Applications 187, 25 (Jul 2025), 1-9. DOI=10.5120/ijca2025925403

@article{10.5120/ijca2025925403,
  author     = {Amel Muminovic},
  title      = {Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments},
  journal    = {International Journal of Computer Applications},
  issue_date = {Jul 2025},
  volume     = {187},
  number     = {25},
  month      = {Jul},
  year       = {2025},
  issn       = {0975-8887},
  pages      = {1-9},
  numpages   = {9},
  url        = {https://ijcaonline.org/archives/volume187/number25/moderating-harm-benchmarking-large-language-models-for-cyberbullying-detection-in-youtube-comments/},
  doi        = {10.5120/ijca2025925403},
  publisher  = {Foundation of Computer Science (FCS), NY, USA},
  address    = {New York, USA}
}
Abstract

As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three state-of-the-art large language models (OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus) on a corpus of 5,080 YouTube comments drawn from high-abuse videos in gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with almost perfect agreement (Cohen’s κ = 0.83). Each model is evaluated in a strict zero-shot setting with an identical minimal prompt and deterministic decoding, giving a fair multi-language comparison without task-specific tuning. GPT-4.1 achieves the best balance, with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini catches the largest share of harmful posts (recall = 0.875), but its precision falls to 0.767 because of frequent false positives. Claude attains the highest precision (0.920) and the lowest false-positive rate (0.022), yet its recall drops to 0.720. Qualitative analysis shows that all three models struggle with sarcasm, coded insults, and mixed-language slang. The findings highlight the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset, along with the prompts and model outputs, has been made available to support reproducibility and further progress in automated content moderation.
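
To make the evaluation protocol concrete, the sketch below shows one way such a zero-shot run could be wired up: a single shared classification prompt, deterministic decoding (temperature 0), and standard precision, recall, F1, and false-positive-rate scoring against the human annotations. The prompt wording, label parsing, and helper names are illustrative assumptions, not the paper's released materials.

# Minimal sketch of a zero-shot harmfulness classifier and its scoring.
# Assumes the OpenAI Python SDK (>=1.0) and scikit-learn are installed;
# the prompt text and "gpt-4.1" model choice are illustrative only.
from openai import OpenAI
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are a content moderator. Classify the following YouTube comment as "
    "HARMFUL or NOT_HARMFUL. Reply with exactly one of those two labels.\n\n"
    "Comment: {comment}"
)

def classify(comment: str, model: str = "gpt-4.1") -> int:
    """Return 1 if the model labels the comment harmful, 0 otherwise."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(comment=comment)}],
        temperature=0,  # deterministic decoding, as in the zero-shot setup
    )
    label = response.choices[0].message.content.strip().upper()
    return 1 if label.startswith("HARMFUL") else 0

def evaluate(comments, gold_labels, model: str = "gpt-4.1") -> dict:
    """Score model predictions against human annotations (1 = harmful)."""
    preds = [classify(c, model) for c in comments]
    precision, recall, f1, _ = precision_recall_fscore_support(
        gold_labels, preds, average="binary", pos_label=1
    )
    tn, fp, fn, tp = confusion_matrix(gold_labels, preds, labels=[0, 1]).ravel()
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn),
    }

As a sanity check on the reported figures, F1 is the harmonic mean of precision and recall, so for GPT-4.1 the value 2 × 0.887 × 0.841 / (0.887 + 0.841) ≈ 0.863 is consistent with the abstract.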

Index Terms

Computer Science
Information Sciences

Keywords

Artificial Intelligence, Cyberbullying, Hate Speech, Large Language Models, Natural Language Processing, Social Media