Predicting Risk of Heart Disease for Early Detection: A Machine-Learning Approach

Program: Data Science Master's Degree
Location: Not Specified (remote)
Student: Larry Vue

Cardiovascular disease (heart disease) remains the leading cause of death worldwide and contributes significantly to healthcare costs. With a better understanding of data and recent advances in data science, early detection can save lives and reduce these costs [1][2]. This project developed and evaluated supervised learning models to predict heart disease from clinical variables. The data come primarily from the UCI Cleveland Heart Disease dataset and a larger Kaggle cardiovascular dataset. Data cleaning, exploratory analysis, and feature engineering were used to assess the predictive value of key risk factors. Baseline and advanced machine learning classifiers, including logistic regression, decision trees, random forests, and gradient boosting (XGBoost), were compared using stratified cross-validation. In line with the business metrics, the evaluation prioritized recall to minimize false negatives. Class imbalance was addressed using class weights and decision-threshold adjustment, supported by ROC/PR analysis and cost-sensitive decision-making. Interpretations of the tree-based models aligned with clinically relevant relationships. The final model is a calibrated logistic regression, which achieved strong ranking performance on the internal test set and offers an interpretable coefficient profile for clinicians. Error analysis revealed that false negatives often occur in cases that are hard to detect; lowering the threshold slightly reduced misses while maintaining acceptable precision. Overall, the project demonstrates that a simple, interpretable model can provide actionable risk estimates, and it highlights the importance of threshold choice for clinical workflows.
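The recall-first threshold choice described above can be illustrated with a minimal sketch. It is not the project's code: the function names (precision_recall_at, pick_threshold), the labels, and the scores are hypothetical toy values chosen only to show the mechanics, and the 0.75 recall target is an arbitrary assumption rather than a reported result.

```python
from typing import Sequence, Tuple


def precision_recall_at(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when scores >= threshold are flagged as positive."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1  # a missed case (false negative)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(
    y_true: Sequence[int], scores: Sequence[float], min_recall: float
) -> Tuple[float, float, float]:
    """Return (threshold, precision, recall) for the highest threshold
    whose recall is at least min_recall.

    Recall never decreases as the threshold is lowered, so scanning the
    candidate thresholds from high to low and stopping at the first hit
    gives the highest qualifying threshold.
    """
    for threshold in sorted(set(scores), reverse=True):
        precision, recall = precision_recall_at(y_true, scores, threshold)
        if recall >= min_recall:
            return threshold, precision, recall
    raise ValueError("no threshold reaches the requested recall")


# Hypothetical toy data: 1 = disease present, scores = predicted risk.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.45, 0.4, 0.3, 0.2, 0.1]

default_precision, default_recall = precision_recall_at(y_true, scores, 0.5)
chosen = pick_threshold(y_true, scores, min_recall=0.75)

print("default 0.5 threshold:", default_precision, default_recall)
print("recall-targeted threshold:", chosen)
```

On these toy values, the recall-targeted threshold falls below the default 0.5 and catches one additional positive case. In practice the threshold would be selected on validation data using calibrated probabilities, and the recall target would come from the clinical cost trade-off rather than a fixed number.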