Research Article

Quantifying Label-Induced Bias in Large Language Model Self and Cross Evaluations

by Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 121
Year of Publication: 2026
Authors: Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
10.5120/ijca2a3bedf57f08

Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush. Quantifying Label-Induced Bias in Large Language Model Self and Cross Evaluations. International Journal of Computer Applications. 187, 121 (Jun 2026), 1-7. DOI=10.5120/ijca2a3bedf57f08

@article{ 10.5120/ijca2a3bedf57f08,
author = { Muskan Saraf and Sajjad Rezvani Boroujeni and Justin Beaudry and Hossein Abedi and Tom Bush },
title = { Quantifying Label-Induced Bias in Large Language Model Self and Cross Evaluations },
journal = { International Journal of Computer Applications },
issue_date = { Jun 2026 },
volume = { 187 },
number = { 121 },
month = { Jun },
year = { 2026 },
issn = { 0975-8887 },
pages = { 1-7 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number121/quantifying-label-induced-bias-in-large-language-model-self-and-cross-evaluations/ },
doi = { 10.5120/ijca2a3bedf57f08 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2026-07-01T03:10:16+05:30
%A Muskan Saraf
%A Sajjad Rezvani Boroujeni
%A Justin Beaudry
%A Hossein Abedi
%A Tom Bush
%T Quantifying Label-Induced Bias in Large Language Model Self and Cross Evaluations
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 121
%P 1-7
%D 2026
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Large language models (LLMs) are increasingly relied upon to evaluate text quality in research, industry, and automated content workflows. However, their judgments may not be as objective as assumed. This study systematically examined whether LLMs exhibit bias when assessing text attributed to different model “authors.” Blog posts were generated by three leading LLMs, ChatGPT, Gemini, and Claude, and each model evaluated every post under three conditions: with no author label, with a correct author label, and with deliberately incorrect author labels. The results reveal substantial bias driven by perceived authorship rather than actual content quality. Posts labeled as “Claude,” regardless of who produced them, consistently received elevated scores, while posts labeled as “Gemini” were systematically downgraded. In many cases, false author labels not only shifted absolute scores but reversed preference rankings entirely, with swings as large as 50 percentage points. Additional behavioral patterns emerged: Gemini tended to be unusually harsh when evaluating its own work, whereas Claude tended to rate its own writing more favorably. These effects appeared not only in overall preferences but also across detailed quality dimensions such as coherence, informativeness, and conciseness. Taken together, the findings indicate that LLM evaluation is highly sensitive to author attribution cues and may be influenced by implicit reputational priors associated with model identities. The results suggest that evaluators do not consistently separate content quality from perceived authorship, leading to systematic score inflation for some labels and penalties for others. These observations call into question the reliability of LLM-based assessment methods commonly used for benchmarking, content moderation, and automated review pipelines.
To mitigate these risks, future evaluation frameworks should incorporate blind assessment protocols, multi-model consensus scoring, and statistical safeguards designed to detect label-induced bias.
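
The three-condition design summarized above can be enumerated as a small trial grid. The following is a minimal sketch, not the authors' implementation: the function names and the dictionary fields are hypothetical, and it assumes that each possible incorrect label is shown as a separate trial, which the abstract does not specify.

```python
from itertools import product

# The three models named in the abstract; each acts as both author and evaluator.
MODELS = ["ChatGPT", "Gemini", "Claude"]


def label_conditions(true_author):
    """Return (condition, shown_label) pairs for a post by `true_author`.

    Assumption: every incorrect label is tried as its own trial.
    """
    conditions = [("no_label", None), ("correct_label", true_author)]
    conditions += [
        ("incorrect_label", other) for other in MODELS if other != true_author
    ]
    return conditions


def build_trials():
    """Cross every evaluator with every author and every label condition."""
    trials = []
    for evaluator, author in product(MODELS, MODELS):
        for condition, shown_label in label_conditions(author):
            trials.append(
                {
                    "evaluator": evaluator,
                    "true_author": author,
                    "condition": condition,
                    "shown_label": shown_label,  # None means a blind evaluation
                }
            )
    return trials


trials = build_trials()
self_evaluations = [t for t in trials if t["evaluator"] == t["true_author"]]
```

Under these assumptions the grid separates self evaluations (evaluator equals true author) from cross evaluations, and the `no_label` trials provide the blind baseline against which any label-induced shift would be measured.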

Index Terms

Computer Science
Information Sciences

Keywords

Large Language Models AI Evaluation Bias Label Effects Cross-Model Evaluation Benchmarking Fairness