Validation Study of Published Machine Learning Methods for the Prediction of Breast Cancer in Women

Program: Data Science Master's Degree
Location: Not Specified (remote)
Student: Nalini Uppu

Breast cancer is a serious health concern for women worldwide. Early-stage detection has been shown to improve patient outcomes by enabling appropriate treatment options. This project is a validation study of published machine learning (ML) methods for predicting breast cancer in women from age, body mass index (BMI), and blood composition data: glucose, insulin, homeostasis model assessment (HOMA), leptin, adiponectin, resistin, and monocyte chemotactic protein-1 (MCP-1). Two ML methods, XGBoost and Logistic Regression (LR), were evaluated, and the XGBoost model outperformed the LR model, classifying healthy and cancer patients with high accuracy. The validation indicated that the XGBoost model’s performance on the full data was better than that of the Support Vector Machine (SVM), and that the important variables selected by the XGBoost model (glucose, BMI, resistin, and age) were similar to the cancer biomarkers reported by Patrício et al. (2018). The classification error rates of the XGBoost model on the full data (0.052) and the test data (0.207) were comparable to the discriminant analysis error rate (0.164). Among the important predictors, glucose was identified as a key predictor, with higher glucose levels associated with cancer patients. Overall, the XGBoost model performed better than SVM and discriminant analysis in predicting cancer patients. As accurate diagnosis is a high priority in medical practice, ML methods with low error rates can support decision-making in cancer screening and help increase patient life expectancy.
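The classification error rate used to compare the models above is the fraction of patients whose predicted class differs from the true class, that is, one minus accuracy. The following is a minimal sketch of that calculation in plain Python. The function name `classification_error_rate` and the label lists are hypothetical and chosen only for illustration; they are not the study's code or data.

```python
def classification_error_rate(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    misclassified = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return misclassified / len(y_true)


# Hypothetical labels for illustration only (1 = cancer patient, 0 = healthy).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

error = classification_error_rate(y_true, y_pred)
print(f"error rate = {error:.3f}, accuracy = {1 - error:.3f}")
```

Computing this rate separately on the full data and on a held-out test set, as the study reports, shows how much of a model's apparent performance carries over to patients it was not trained on.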