Capstone Projects

Using Machine Learning to Predict Home Runs for a National TV Audience

Program: Data Science Master's
Location: Not Specified (remote)
Student: Jack Puncochar

The introduction of the Statcast tracking system increased the prevalence of analytics in baseball. Research emerged claiming that two Statcast metrics, exit velocity and launch angle, are strong predictors of home runs after contact. However, there is no research exploring the pre-contact probability of hitting a home run. Knowing, before the pitch is thrown, the probability that a hitter goes yard can provide national audiences with valuable content. A home run probability model exists at one media company, but the model suffers from overfitting. This study dealt with extreme class imbalance (fewer than 5% of batted balls are home runs). Both logistic regression and Naïve Bayes performed poorly when assessed with precision and recall. The poor performance was attributed to the uncertainty inherent in data that is known only pre-pitch, and it was found that home run classification may not be meaningful when the desired output is a probability. Instead, log loss was used for model selection, and logistic regression was selected to estimate probabilities on new data. An R Shiny application made it possible to display results of the HR probability model on live pitches. The application needed bug fixes and optimized code prior to being sent to the client. Another limitation of the R Shiny app was that it did not update automatically with Statcast’s real-time updates. The HR probability system was not sent to the client; however, the framework to efficiently process Statcast data and deploy an accurate HR probability model can help the client emerge as a leader in the industry.
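The point about classification metrics versus log loss can be illustrated with a small sketch. It uses synthetic data only, not the study's Statcast data or its fitted models; every name in it (true_p, p_informative, p_constant) is hypothetical, and the 0.5 classification threshold is an assumption. When no predicted probability comes near the threshold, precision and recall are zero for any model, while log loss can still separate a model that ranks hitters sensibly from one that always predicts the base rate.

```python
import math
import random


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def precision_recall(y_true, p_pred, threshold=0.5):
    """Precision and recall after thresholding probabilities into class labels."""
    tp = fp = fn = 0
    for y, p in zip(y_true, p_pred):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Synthetic, hypothetical data: each "batted ball" has a small true home run
# probability (about 4% on average, never above 14.5%).
rng = random.Random(0)
n = 50_000
true_p = [0.005 + 0.14 * rng.random() ** 3 for _ in range(n)]
y = [1 if rng.random() < p else 0 for p in true_p]

base_rate = sum(y) / n

# Two hypothetical models: one that knows each event's true probability,
# and one that always predicts the overall base rate.
p_informative = true_p
p_constant = [base_rate] * n

pr_informative = precision_recall(y, p_informative)
pr_constant = precision_recall(y, p_constant)
ll_informative = log_loss(y, p_informative)
ll_constant = log_loss(y, p_constant)

print("base rate:", round(base_rate, 4))
print("precision/recall (informative):", pr_informative)
print("precision/recall (constant):   ", pr_constant)
print("log loss (informative):", round(ll_informative, 4))
print("log loss (constant):   ", round(ll_constant, 4))
```

In this sketch both models classify every event as "no home run," so precision and recall cannot tell them apart, whereas the informative model has the lower log loss. This is one way to read the abstract's conclusion that classification may not be meaningful when the desired output is a probability.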