International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 31
Year of Publication: 2025
Authors: Emmanuel O. Oshoiribhor, Adetokunbo M. John-Otumu
Emmanuel O. Oshoiribhor, Adetokunbo M. John-Otumu. An Explainable Random Forest Model for Early Diabetes Prediction using LIME Interpretability Technique. International Journal of Computer Applications. 187, 31 (Aug 2025), 26-35. DOI=10.5120/ijca2025925542
This study aims to improve diabetes prediction by integrating Random Forest classifiers with Explainable AI (XAI) methods such as LIME to enhance model interpretability and clinical trust. Using the “diabetes.csv” dataset from Kaggle (768 records with nine clinical features), the research addresses the challenges posed by its class imbalance: 500 non-diabetic and 268 diabetic cases. Baseline evaluations showed accuracies of 70% for SVM and 72.07% for Random Forest, with similar precision, recall, and F1-scores, and ROC AUC values around 0.81. Hyperparameter tuning with Random Search improved Random Forest performance to 75% accuracy, 64% precision, 69% recall, 67% F1-score, and 0.83 ROC AUC. To assess robustness and generalization, a Text-Guided Synthetic Dataset (synthetic_diabetes_data.csv, 35 KB) was generated using ChatGPT; it contains 1000 instances (450 non-diabetic, 550 diabetic) with real, integer, and categorical features defined by the prompt design. Testing on this more balanced, diverse dataset yielded higher performance: 93.5% accuracy, 92% precision, 94% recall, 93% F1-score, and 0.95 ROC AUC. LIME explanations provided clear, case-specific insights that aid clinician understanding and support trustworthy decision-making, and human-centered evaluations rated these explanations highly for plausibility, clarity, and clinical usefulness. Despite the challenges that data imbalance poses in real-world settings, the study demonstrates that combining machine learning with explainable AI offers an effective, transparent approach to early diabetes prediction, while highlighting the need for high-quality, diverse datasets to ensure reliable deployment in clinical practice.
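The reported figures can be put in context with a little arithmetic. The sketch below is not the authors' code; it only uses the class counts and the rounded precision and recall values quoted in the abstract. The helper names `majority_baseline` and `f1_from_precision_recall` are hypothetical, introduced here for illustration. It computes the accuracy a trivial "always predict the majority class" rule would reach on each dataset, and the F1-score implied by the stated precision and recall (F1 is their harmonic mean).

```python
def majority_baseline(n_majority: int, n_minority: int) -> float:
    """Accuracy of always predicting the larger class."""
    return n_majority / (n_majority + n_minority)


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class counts quoted in the abstract.
kaggle_baseline = majority_baseline(500, 268)      # 500 non-diabetic, 268 diabetic
synthetic_baseline = majority_baseline(550, 450)   # 550 diabetic, 450 non-diabetic

# Rounded precision/recall quoted in the abstract (inputs are approximate).
f1_tuned = f1_from_precision_recall(0.64, 0.69)      # tuned Random Forest, Kaggle data
f1_synthetic = f1_from_precision_recall(0.92, 0.94)  # synthetic dataset

print(f"Kaggle majority-class accuracy:    {kaggle_baseline:.3f}")
print(f"Synthetic majority-class accuracy: {synthetic_baseline:.3f}")
print(f"F1 implied on Kaggle data:         {f1_tuned:.3f}")
print(f"F1 implied on synthetic data:      {f1_synthetic:.3f}")
```

On the Kaggle data the majority-class rule already reaches roughly 65% accuracy, which is why precision, recall, F1, and ROC AUC say more about the model than accuracy alone. The F1 values implied by the rounded inputs land close to the reported 67% and 93%; any small gap is plausibly a rounding effect, though the abstract does not say how the figures were rounded.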