Program: Data Science Master's
Host Company: JMJ Phillip Executive Search
Location: Chicago, Illinois (onsite)
Student: Emma Foulkes
This project builds automated data cleaning processes for a recruitment company client. It focuses on developing an algorithm to detect duplicate candidate records and on using REST API POST requests to correct inaccurate candidate location data. The client tracks candidate data in the CRM program Bullhorn and has API access set up through the CRM. The API requests were run in Python, and the data cleaning and duplicate detection algorithm were built in R. The duplicate detection algorithm used Levenshtein string similarity scoring to estimate the likelihood that two candidate records match.
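To illustrate the string similarity scoring described above, here is a minimal sketch of Levenshtein-based matching. The project built its algorithm in R; this version is in Python for illustration only. The function names, the lowercasing and whitespace trimming, and the normalization by the longer string's length are assumptions, not details taken from the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means identical after trimming and lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical (assumption)
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # Hypothetical candidate names, not real records.
    print(similarity("Jon Smith", "John Smith"))
```

In practice a score like this would likely be computed across several fields (for example name, email, and phone) and compared against a threshold chosen by reviewing known duplicates.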
There were three primary objectives of this project. The first objective was to create a model to identify potential duplicate records. The second objective was to create and automate data cleaning processes for common areas of inaccuracy in candidate records. Two areas were addressed in this project: updating state abbreviations to full state names and filling in missing zip code data. The third objective was to reduce the time employees spend manually identifying and cleaning missing or inaccurate data.
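The state-name cleanup mentioned above could look roughly like the following sketch. The mapping is deliberately partial, and the names `STATE_NAMES` and `normalize_state` are hypothetical; the project's actual cleaning logic was written in R and is not shown in the source.

```python
from typing import Optional

# Partial lookup table for illustration; a real one would cover every state.
STATE_NAMES = {
    "IL": "Illinois",
    "MI": "Michigan",
    "WI": "Wisconsin",
}


def normalize_state(value: Optional[str]) -> Optional[str]:
    """Expand a two-letter abbreviation to a full state name when it is known.

    Values that are missing or not found in the table are returned unchanged
    (apart from trimmed whitespace) so they can be reviewed manually.
    """
    if value is None:
        return None
    cleaned = value.strip()
    return STATE_NAMES.get(cleaned.upper(), cleaned)


if __name__ == "__main__":
    for raw in ["IL", " mi ", "Illinois", "ZZ", None]:
        print(repr(raw), "->", repr(normalize_state(raw)))
```

Leaving unrecognized values untouched, rather than guessing, keeps the automated step from introducing new errors into the records.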