Capstone Projects

Topic Modeling and Feature Extraction of Medical School Admissions Essays

Program: Data Science Master's
Location: Minnesota (onsite)
Student: Jackie Dockendorf

Large amounts of structured and unstructured data are created throughout the stages of the medical education continuum, yet unstructured data is used in research far less often than structured data. Unstructured text cannot be used in its raw form in most traditional statistical and machine learning analyses; it must first be transformed. A review of the literature found previous work that used higher education admissions essays for thematic, text mining, and computational linguistic analyses, as well as work that used the output of similar analyses to predict different types of outcomes. This paper discusses how a set of features was extracted from 1,361 medical school admissions personal statement essays. The purpose was to create structured features from otherwise unstructured text data that could be linked to student, clinical, and workforce outcomes. Methods included natural language processing techniques and unsupervised machine learning topic modeling methods, including Latent Dirichlet Allocation and Non-Negative Matrix Factorization. The resulting topic model had interpretable topics, which gave insight into the contents of the personal statements. The model was applied to the dataset of essays to create a feature vector that was exported. The results of the analysis could be used as input to other studies, and the methods could be replicated for similar unstructured text data.