| International Journal of Computer Applications |
| Foundation of Computer Science (FCS), NY, USA |
| Volume 187 - Number 90 |
| Year of Publication: 2026 |
| Authors: Harun Hadzagic, Zerina Altoka |
| DOI: 10.5120/ijca2026926575 |
Harun Hadzagic, Zerina Altoka. Synthetic Data Generation for Automated JavaScript Vulnerability Detection using Fine-Tuned CodeBERT. International Journal of Computer Applications. 187, 90 (Mar 2026), 16-22. DOI=10.5120/ijca2026926575
The dynamic and flexible nature of JavaScript, the foundational language of modern web development, makes it highly susceptible to vulnerabilities such as Cross-Site Scripting (XSS), SQL Injection, and Hardcoded Secrets. Traditional security analysis tools and manual code review struggle to maintain accuracy and scalability in complex codebases, especially as AI is increasingly used to produce code. To address this, this paper presents a CodeBERT transformer model fine-tuned for automated binary sequence classification of code as secure or insecure. A balanced dataset was constructed of 71 vulnerabilities with 60 JavaScript code snippets (30 pairs of secure and insecure versions), generated through advanced LLMs. A Pair-ID splitting methodology ensured that the model was evaluated on unseen vulnerability patterns, preventing data leakage between the training and test sets and guarding against overfitting. The fine-tuned CodeBERT model achieved an F1-Score of 0.9413 on the held-out test set. Crucially, the model attained a Recall of 0.9468 for the 'Insecure' class, indicating that it misses few vulnerabilities, the most critical error in security screening. Furthermore, a generalization check using an alternative dataset supported the model's robustness, with the F1-Score remaining high. The findings demonstrate the viability of specialized Code LLMs for reliable vulnerability detection, paving the way for low-latency integration into continuous integration pipelines to enforce secure coding practices in real time.
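The abstract does not spell out how the Pair-ID split is implemented. As a minimal sketch, assuming each snippet carries an identifier shared by its secure and insecure versions, a group-aware split could look like the following. The function name `split_by_pair_id`, the field names `pair_id`, `label`, and `code`, the 20% test fraction, and the toy snippets are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def split_by_pair_id(samples, test_fraction=0.2, seed=0):
    """Split samples so that both members of a pair land in the same partition.

    Hypothetical helper: splitting on pair IDs rather than on individual
    snippets keeps a secure snippet out of the test set when its insecure
    twin is in the training set (and vice versa).
    """
    # Group snippets by their shared pair identifier.
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["pair_id"]].append(sample)

    # Shuffle the pair IDs (not the snippets) with a fixed seed.
    pair_ids = sorted(groups)
    random.Random(seed).shuffle(pair_ids)

    # Reserve a fraction of the pairs for testing, at least one pair.
    n_test = max(1, round(len(pair_ids) * test_fraction))
    test = [s for pid in pair_ids[:n_test] for s in groups[pid]]
    train = [s for pid in pair_ids[n_test:] for s in groups[pid]]
    return train, test


# Toy data (assumed shape): 30 pairs, label 1 = insecure, label 0 = secure.
samples = []
for pid in range(30):
    samples.append({"pair_id": pid, "label": 1, "code": f"el.innerHTML = input{pid};"})
    samples.append({"pair_id": pid, "label": 0, "code": f"el.textContent = input{pid};"})

train, test = split_by_pair_id(samples)
print(len(train), len(test))  # 48 12
```

Because whole pairs are assigned to one side, each partition stays balanced between the two classes, and no test snippet has a near-duplicate counterpart in the training data.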