| International Journal of Computer Applications |
| Foundation of Computer Science (FCS), NY, USA |
| Volume 187 - Number 121 |
| Year of Publication: 2026 |
| Authors: Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush |
| DOI: 10.5120/ijca2a3bedf57f08 |
Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush. Quantifying Label-Induced Bias in Large Language Model Self and Cross Evaluations. International Journal of Computer Applications. 187, 121 (Jun 2026), 1-7. DOI=10.5120/ijca2a3bedf57f08
Large language models (LLMs) are increasingly relied upon to evaluate text quality in research, industry, and automated content workflows. However, their judgments may not be as objective as assumed. This study systematically examined whether LLMs exhibit bias when assessing text attributed to different model “authors.” Blog posts were generated by three leading LLMs, ChatGPT, Gemini, and Claude, and each model evaluated every post under three conditions: with no author label, with a correct author label, and with deliberately incorrect author labels. The results reveal substantial bias driven by perceived authorship rather than actual content quality. Posts labeled as “Claude,” regardless of who produced them, consistently received elevated scores, while posts labeled as “Gemini” were systematically downgraded. In many cases, false author labels not only shifted absolute scores but reversed preference rankings entirely, with swings as large as 50 percentage points. Additional behavioral patterns emerged: Gemini tended to be unusually harsh when evaluating its own work, whereas Claude tended to rate its own writing more favorably. These effects appeared not only in overall preferences but also across detailed quality dimensions such as coherence, informativeness, and conciseness. Taken together, the findings indicate that LLM evaluation is highly sensitive to author attribution cues and may be influenced by implicit reputational priors associated with model identities. The results suggest that evaluators do not consistently separate content quality from perceived authorship, leading to systematic score inflation for some labels and penalties for others. These observations call into question the reliability of LLM-based assessment methods commonly used for benchmarking, content moderation, and automated review pipelines.
To mitigate these risks, future evaluation frameworks should incorporate blind assessment protocols, multi-model consensus scoring, and statistical safeguards designed to detect label-induced bias.
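
The three labeling conditions and the percentage-point swing described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names (`label_conditions`, `preference_share`, `swing`) and the vote-list representation are assumptions introduced here, and any votes passed in would be hypothetical rather than data from the study.

```python
from collections import Counter

# The three models named in the abstract.
MODELS = ["ChatGPT", "Gemini", "Claude"]


def label_conditions(true_author, models=MODELS):
    """List the (condition, shown_label) pairs under which one post is judged.

    Hypothetical helper: one blind condition, one correct label, and one
    incorrect label for every other model.
    """
    conditions = [("blind", None), ("correct", true_author)]
    conditions += [("incorrect", m) for m in models if m != true_author]
    return conditions


def preference_share(votes):
    """Percentage of votes won by each true author.

    `votes` is a list of true-author names, one per pairwise or ranked
    judgment in which that author's post was preferred.
    """
    counts = Counter(votes)
    total = len(votes)
    return {m: 100.0 * counts[m] / total for m in MODELS}


def swing(blind_votes, labeled_votes):
    """Change in preference share, in percentage points (labeled minus blind)."""
    blind = preference_share(blind_votes)
    labeled = preference_share(labeled_votes)
    return {m: labeled[m] - blind[m] for m in MODELS}
```

A large positive or negative value from `swing` for the same underlying posts would indicate that the shown label, not the content, moved the evaluator's preference, which is the kind of signal a statistical safeguard against label-induced bias would need to flag.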