Consumer Preference Sorting

Program: Data Science Master's Degree
Location: Not Specified (onsite)
Student: Reed Erler

This paper walks through the construction of an automated valuation model (AVM) using real estate data from a 2017 Zillow competition hosted on Kaggle.com. AVMs are growing in popularity in the age of big data as a way to estimate the price of land and the value of the structures on a property. The paper explores the process of fitting linear models (multiple linear regression, lasso regression, and ridge regression) to produce these estimates, and details the steps and data concepts needed to fit a linear model. These include how to construct a pipeline that handles preprocessing of the dataset in a scalable, automated fashion, as well as model stacking and cardinality reduction methods for categorical data. The results are then analyzed to understand the relationships in the dataset. Next, the preprocessing steps and results of the linear models are compared with the performance of a more advanced model, XGBoost. In the XGBoost analysis, I optimize the hyperparameters using a grid search and discuss the benefits of this method. Lastly, I analyze the shortcomings of using an AVM to estimate prices.
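
As an illustration of cardinality reduction for categorical data, one common approach is to pool infrequent category levels into a single catch-all level before encoding. The sketch below is a minimal example of that idea only; the function name `reduce_cardinality`, the `min_count` threshold, the `OTHER` label, and the sample zoning codes are all hypothetical and are not taken from the Zillow data or from the method used later in this paper.

```python
from collections import Counter


def reduce_cardinality(values, min_count=2, other_label="OTHER"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical zoning codes, invented for illustration only.
zoning = ["R1", "R1", "R2", "C1", "R1", "R2", "M3"]
reduced = reduce_cardinality(zoning, min_count=2)
print(reduced)
```

Pooling rare levels keeps the number of dummy variables manageable when a linear model is fit on one-hot encoded features, at the cost of losing any signal specific to the pooled levels.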