Predicting Life Expectancy in U.S. Cities: How a Machine Learning Solution May Aid in Budgetary Decision-Making
Program: Data Science Master's Degree
Location: Not Specified (remote)
Student: Nikolas Tubert
The purpose of this project was to develop an accurate predictive model of life expectancy that addresses the limitations of traditional life expectancy calculation. The model results were then interpreted to provide recommendations for public health interventions to the Department of Health & Human Services of the City of Long Beach, CA, highlighting the utility of a machine learning algorithm in aiding public health budget allocation. Machine learning models were trained on data from the Big Cities Health Coalition (BCHC), which were cleaned via column/row removal, transposition, imputation, and scaling/transformation. Correlation analysis identified redundant features, which were addressed through feature reduction and feature engineering. Three machine learning algorithms (LASSO, XGBoost, and an artificial neural network, ANN) were trained and compared using single 10-fold cross-validation. In addition, 5-fold double cross-validation was performed to evaluate the entire model selection process. The ANN achieved the lowest mean absolute error (MAE) at 0.941, only slightly below that of XGBoost at 0.985, so the XGBoost model was selected for its greater interpretability. The MAE and R² from double cross-validation were 0.944 and 0.935, respectively. The key features of the model were cardiovascular disease deaths, injury deaths, cancer deaths, and motor vehicle deaths. A post hoc analysis using SHAP values found that deaths from cardiovascular disease, motor vehicles, and diabetes had the largest negative impact on life expectancy for the most vulnerable populations of Long Beach, CA in 2023. Policy interventions for each of these areas were proposed to the city of Long Beach based on the successes of other U.S. city initiatives.
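The distinction between single and double cross-validation described above can be illustrated with a minimal sketch. It is not the project's code: the data are synthetic (not BCHC data), the two candidate models (a mean predictor and a one-feature least-squares line) are hypothetical stand-ins for LASSO, XGBoost, and the ANN, and all function names are assumptions made for illustration. The inner 10-fold loop selects a model by MAE; the outer 5-fold loop scores that whole selection procedure on data it never saw.

```python
import random
from statistics import mean

def k_folds(n, k, seed):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Hypothetical baseline: always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Hypothetical candidate: one-feature ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

CANDIDATES = {"mean": fit_mean, "line": fit_line}

def mae(model, xs, ys):
    """Mean absolute error of a fitted model on (xs, ys)."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))

def split(xs, ys, fold):
    """Return (train_x, train_y, test_x, test_y) for one held-out fold."""
    held = set(fold)
    train = [i for i in range(len(xs)) if i not in held]
    return ([xs[i] for i in train], [ys[i] for i in train],
            [xs[i] for i in fold], [ys[i] for i in fold])

def cv_mae(fit, xs, ys, k, seed):
    """Single k-fold cross-validation: average held-out MAE for one model."""
    scores = []
    for fold in k_folds(len(xs), k, seed):
        tr_x, tr_y, te_x, te_y = split(xs, ys, fold)
        scores.append(mae(fit(tr_x, tr_y), te_x, te_y))
    return mean(scores)

def select_model(xs, ys, k=10, seed=0):
    """Model selection step: pick the candidate with the lowest CV MAE."""
    scores = {name: cv_mae(fit, xs, ys, k, seed)
              for name, fit in CANDIDATES.items()}
    return min(scores, key=scores.get)

def double_cv(xs, ys, outer_k=5, inner_k=10, seed=0):
    """Double (nested) CV: rerun selection inside every outer training set."""
    outer_scores, chosen = [], []
    for fold in k_folds(len(xs), outer_k, seed):
        tr_x, tr_y, te_x, te_y = split(xs, ys, fold)
        best = select_model(tr_x, tr_y, inner_k, seed + 1)
        model = CANDIDATES[best](tr_x, tr_y)
        outer_scores.append(mae(model, te_x, te_y))
        chosen.append(best)
    return mean(outer_scores), chosen

# Synthetic data for illustration only: one predictor with a linear effect.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [80 - 0.8 * x + rng.gauss(0, 1) for x in xs]

outer_mae, chosen = double_cv(xs, ys)
print("model chosen in each outer fold:", chosen)
print("double cross-validation MAE:", round(outer_mae, 3))
```

Because selection is repeated inside each outer training set, the outer MAE estimates how the full pipeline (including the choice of model) would perform on unseen cities, rather than how one already-selected model performed on data that also informed its selection.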