Handling Missing Data: Effects of Different Approaches on the Performance of Predictive Models Built on Complex Big Data Datasets

Program: Data Science Master's Degree
Location: Not Specified (remote)
Student: David Vlosak

The purpose of this project was to explore how different missing value imputation (MVI) approaches affect the performance of predictive binary-classification models built on imputed complex large or Big Data datasets. Based on an experimental strategy using simulation within a postpositivist framework, the findings indicated that the combination of MVI approach (bagging, KNN, mixed models), model (logistic regression, naïve Bayes, boosted trees), and dataset characteristics (number of cases and predictors, feature data types, missingness mechanisms, distribution of missing values among predictor data types, and missing-data rates) affected predictive performance for binary-classification problems. On the one hand, MVI approaches that blend different imputation methods (e.g., mixed models) resulted in the most accurate predictive performance regardless of model type or dataset characteristics (e.g., missing-data rate). Similarly, predictive performance for imputed complex datasets that initially had a 25% missing-data rate was relatively accurate regardless of model and imputation type. On the other hand, MVI approaches using a single imputation method (e.g., bagging and KNN) resulted in predictive performance that varied with the model used and the dataset characteristics (e.g., missing-data rate). Predictive performance was evaluated using overall classification accuracy (OCA), and the trustworthiness of the OCA values was confirmed using the accuracy variation percentage (AVP) metric. The project findings helped address a gap in the literature by including complex datasets in the study of how MVI approaches and models affect binary-classification predictive performance. The findings also contributed to data-practitioner praxis by identifying some combinations of MVI approaches, models, and dataset characteristics that are likely to result in relatively accurate predictive performance and other combinations that may be best avoided.
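
To make the experimental idea concrete, the following is a minimal illustrative sketch, not the project's actual code. It simulates a small dataset, removes values completely at random at a 25% rate, fills them with a simple KNN imputer, and compares the overall classification accuracy of a logistic regression fitted on the complete data and on the imputed data. The function names, the two-predictor simulated data, the MCAR mechanism, and all parameter values (k, learning rate, sample size) are assumptions chosen for illustration. AVP is not computed because its definition is not given here.

```python
import math
import random


def make_dataset(n, rng):
    """Simulate n cases with two numeric predictors and a binary label (assumed setup)."""
    rows, labels = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([x1, x2])
        labels.append(1 if x1 + x2 + rng.gauss(0, 0.3) > 0 else 0)
    return rows, labels


def inject_mcar(rows, rate, rng):
    """Return a copy in which each value is set to None with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def knn_impute(rows, k=5):
    """Fill each missing value with the mean of that column over the k nearest rows.
    Distance uses only the columns observed in both rows."""
    def dist(a, b):
        shared = [(u - v) ** 2 for u, v in zip(a, b) if u is not None and v is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else float("inf")

    # Column means are the fallback when no row shares an observed column.
    col_means = []
    for j in range(len(rows[0])):
        seen = [r[j] for r in rows if r[j] is not None]
        col_means.append(sum(seen) / len(seen))

    filled = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, v in enumerate(row):
            if v is None:
                donors = sorted(
                    (dist(row, other), other[j])
                    for m, other in enumerate(rows)
                    if m != i and other[j] is not None
                )[:k]
                vals = [val for d, val in donors if d != float("inf")]
                new_row[j] = sum(vals) / len(vals) if vals else col_means[j]
        filled.append(new_row)
    return filled


def fit_logistic(rows, labels, epochs=200, lr=0.1):
    """Fit a logistic regression with plain full-batch gradient descent."""
    w, b, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j, xj in enumerate(x):
                gw[j] += err * xj
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def overall_accuracy(w, b, rows, labels):
    """Overall classification accuracy: share of cases whose predicted class is correct."""
    hits = 0
    for x, y in zip(rows, labels):
        pred = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        hits += pred == y
    return hits / len(labels)


def run_experiment(rate=0.25, seed=0, n=400):
    """Compare accuracy on complete data and on KNN-imputed data."""
    rng = random.Random(seed)
    rows, labels = make_dataset(n, rng)
    imputed = knn_impute(inject_mcar(rows, rate, rng))
    cut = int(0.75 * n)  # first 75% of cases for training, the rest held out
    scores = {}
    for name, data in (("complete", rows), ("knn_imputed", imputed)):
        w, b = fit_logistic(data[:cut], labels[:cut])
        scores[name] = overall_accuracy(w, b, data[cut:], labels[cut:])
    return scores
```

Calling `run_experiment(rate=0.25)` returns a dictionary with one accuracy value per condition. For brevity the sketch imputes before splitting into training and held-out cases, which lets held-out rows act as donors. A more careful design would fit the imputer on the training cases only and would repeat the run across seeds, missing-data rates, and missingness mechanisms.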