Optimizing Stroke Prediction Using Backward Elimination and SMOTE with C4.5 and K-Nearest Neighbors

Imam Bagus Pratama; Ahmad Zainul Fanani; M. Arief Soeleman; Via Indriani Kumalasari

doi:10.63158/journalisi.v8i2.1521

Authors

Imam Bagus Pratama Universitas Dian Nuswantoro, Indonesia
Ahmad Zainul Fanani Universitas Dian Nuswantoro, Indonesia
M. Arief Soeleman Universitas Dian Nuswantoro, Indonesia
Via Indriani Kumalasari State University of Semarang, Indonesia

DOI:

https://doi.org/10.63158/journalisi.v8i2.1521

Keywords:

stroke prediction, SMOTE, class imbalance, C4.5, K-nearest neighbors, backward elimination, feature selection

Abstract

Early prediction of stroke risk is crucial for reducing mortality and the burden on the healthcare system, but class imbalance and irrelevant features often compromise model reliability. This study analyzes the impact of Backward Elimination and SMOTE on the performance of the C4.5 and K-NN algorithms in stroke prediction. It uses a fixed working subset of 1,239 records and evaluates four modeling scenarios with Stratified 10-Fold Cross Validation. Model performance is measured using accuracy, precision, recall, F1-score, and AUC. The results show that Backward Elimination improved model performance on the analyzed subset. For C4.5, accuracy increased from 70.94% to 73.05%, stroke recall from 83.94% to 85.14%, and AUC from 0.776 to 0.806. For K-NN, accuracy increased from 72.31% to 74.82% and precision from 39.91% to 42.73%, while stroke recall remained relatively stable at 74.30%. Although these improvements are numerically small, they remain practically relevant because they improve the balance between sensitivity and the ability to discriminate between classes. In stroke screening, reducing false negatives takes priority because it minimizes the number of undetected high-risk cases, although false positives must still be considered because they lead to further testing. Overall, C4.5 with Backward Elimination shows the more balanced performance, although the results are limited to the analyzed subset.
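
The trade-off between false negatives and false positives described in the abstract can be made concrete with a small worked example. The sketch below derives accuracy, precision, recall, and F1-score from binary confusion-matrix counts, treating stroke as the positive class. The names `ConfusionCounts` and `screening_metrics` and the counts themselves are hypothetical and chosen only for illustration; they are not the study's data or code. AUC is left out because it requires ranked prediction scores rather than counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts with 'stroke' as the positive class."""
    tp: int  # stroke cases flagged as stroke
    fp: int  # non-stroke cases flagged as stroke (leads to further testing)
    fn: int  # stroke cases missed (undetected high-risk patients)
    tn: int  # non-stroke cases correctly cleared


def _safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def screening_metrics(c: ConfusionCounts) -> dict:
    """Compute the count-based metrics named in the abstract."""
    accuracy = _safe_div(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)
    precision = _safe_div(c.tp, c.tp + c.fp)  # lowered by false positives
    recall = _safe_div(c.tp, c.tp + c.fn)     # lowered by false negatives
    f1 = _safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only; NOT taken from the study.
example = ConfusionCounts(tp=80, fp=120, fn=20, tn=280)
metrics = screening_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With counts like these, recall is high while precision is low: few stroke cases are missed, but many flagged patients turn out not to have had a stroke. This is the same pattern the abstract describes when it weighs undetected high-risk cases against the cost of further testing.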
