Application of Imputation and Prediction Techniques on Blended Property Data

Program: Data Science Master's Degree
Location: California (onsite)
Student: Parag Ahire

Missing data is inherent to any data collection process. This project applied data imputation and data prediction techniques to key characteristics of properties in a blended property dataset. Its objective was to answer the following questions: 1) Which imputation and prediction techniques avoid introducing noise into the data, thereby preserving the integrity of the fields with missing data? 2) How do the results of statistical data imputation approaches compare to those of machine learning prediction approaches? 3) Which data imputation and prediction techniques perform better on their evaluation metrics when the time taken to process the data is also considered? 4) Do imputation or prediction techniques applied to the entire population of properties in the county produce better or worse results than the same techniques applied only to properties in the proximity of a property with missing values? The dataset was a blend of county recorder assessment data, multiple listing service data, and appraisal data. Various imputation and prediction techniques were applied at both the population level and the neighborhood level. A neighborhood was determined by the distance between properties, as derived from their geocoordinates.
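
The distinction between population-level and neighborhood-level imputation can be illustrated with a minimal sketch. It is not the project's implementation: the abstract does not name the specific techniques or the distance measure, so the median imputer, the haversine (great-circle) distance, the 1 km radius, and the field names `lat`, `lon`, and `sqft` are all assumptions chosen for illustration.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def impute_median(properties, field, radius_km=None):
    """Return copies of the records with missing `field` values filled by a median.

    radius_km=None  -> population level: the median of all observed values.
    radius_km=float -> neighborhood level: the median of observed values within
                       radius_km of the record being filled (assumed definition).
    A record with no eligible donors is left unfilled.
    """
    observed = [p for p in properties if p.get(field) is not None]
    result = []
    for p in properties:
        filled = dict(p)  # copy so the input records are not mutated
        if p.get(field) is None:
            if radius_km is None:
                donors = observed
            else:
                donors = [
                    q for q in observed
                    if haversine_km(p["lat"], p["lon"], q["lat"], q["lon"]) <= radius_km
                ]
            if donors:
                filled[field] = median(q[field] for q in donors)
        result.append(filled)
    return result


# Hypothetical records: two nearby properties, one distant property,
# and one property with a missing living-area value.
sample = [
    {"id": "A", "lat": 34.052, "lon": -118.250, "sqft": 1000},
    {"id": "B", "lat": 34.050, "lon": -118.252, "sqft": 1200},
    {"id": "C", "lat": 34.500, "lon": -118.250, "sqft": 3000},
    {"id": "D", "lat": 34.051, "lon": -118.251, "sqft": None},
]

population_fill = impute_median(sample, "sqft")
neighborhood_fill = impute_median(sample, "sqft", radius_km=1.0)
print(population_fill[3]["sqft"], neighborhood_fill[3]["sqft"])
```

In this toy data the distant property C pulls the population-level median upward, while the neighborhood-level median uses only the two nearby properties. Whether that locality improves results on real data is the fourth question posed above, and this sketch does not answer it.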