| International Journal of Computer Applications |
| Foundation of Computer Science (FCS), NY, USA |
| Volume 187 - Number 90 |
| Year of Publication: 2026 |
| Authors: Handayani, Ety Sutanty, Esti Setiyaningsih |
| DOI: 10.5120/ijca2026926600 |
Handayani, Ety Sutanty, Esti Setiyaningsih. A Hybrid Structural and TF-IDF-based Machine Learning Framework for Large-Scale Phishing URL Detection. International Journal of Computer Applications. 187, 90 (Mar 2026), 52-59. DOI=10.5120/ijca2026926600
Phishing attacks continue to pose significant cybersecurity risks by using deceptive URLs to obtain sensitive user information, which makes accurate and scalable automated detection necessary. This study proposes a machine learning–based approach to phishing URL classification that integrates structural URL feature extraction with Natural Language Processing (NLP) techniques based on Term Frequency–Inverse Document Frequency (TF-IDF). The dataset comprises 822,010 labeled URLs, of which 52% are legitimate and 48% are phishing, and was checked beforehand to confirm that it contains no missing values. Feature engineering followed two complementary strategies: handcrafted structural features (URL length, domain length, number of digits, number of special characters, suspicious keywords, HTTPS usage, and number of subdomains), and a TF-IDF-based textual representation built from unigram, bigram, and trigram tokenization. The combined feature set was used to train a Random Forest classifier with optimized hyperparameters, and the model was evaluated using Stratified 5-Fold Cross-Validation so that the class distribution is preserved across the training and testing subsets. Performance was assessed using the confusion matrix, precision, recall, and F1-score to give a comprehensive picture of detection capability. The experimental findings indicate that integrating structural and textual features significantly improves classification effectiveness and enables robust, balanced detection of both phishing and legitimate URLs, demonstrating the practical applicability of the proposed method to large-scale real-world deployment.
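To make the two feature-engineering strategies more concrete, the following is a minimal sketch in standard-library Python, not the authors' implementation. The function names `structural_features` and `url_ngrams`, the keyword list `SUSPICIOUS_KEYWORDS`, the tokenization rule, and the way subdomains are counted are all assumptions made for illustration; the abstract does not specify them. The TF-IDF weighting and the Random Forest training step are not shown.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list: the abstract does not say which terms were used.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def structural_features(url: str) -> dict:
    """Compute the seven handcrafted features named in the abstract."""
    # urlparse only fills in the host when a scheme separator is present,
    # so a default scheme is prepended for parsing if it is missing.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Assumption: any non-alphanumeric character counts as "special".
        "num_special_chars": sum(not ch.isalnum() for ch in url),
        "num_suspicious_keywords": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
        "uses_https": int(lowered.startswith("https://")),
        # Assumption: every host label beyond the last two is a subdomain.
        "num_subdomains": max(host.count(".") - 1, 0),
    }


def url_ngrams(url: str, max_n: int = 3) -> list:
    """Split a URL into word tokens and return all 1- to max_n-grams."""
    # Assumption: tokens are runs of lowercase letters and digits.
    tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


example = "https://secure-login.example.com/verify?id=123"
features = structural_features(example)
terms = url_ngrams(example)
```

In a full pipeline, the n-gram terms would be weighted by TF-IDF across the whole corpus and concatenated with the structural feature vector before being passed to the classifier. Counting subdomains by dots is a simplification that miscounts multi-part public suffixes such as `co.uk`.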