Natural Language Processing Applied to Clinical Notes for Detection of High Mortality Conditions (A Case Study)

Program: Data Science Master's Degree
Location: Not Specified (remote)
Student: Korey Bernhardt

This exploratory case study examined natural language processing (NLP) approaches for categorizing unstructured data in the medical field, specifically clinical notes. The project reviewed the history of applying NLP to clinical notes and built models that apply NLP to a dataset for further analysis. The primary research question was whether high-mortality conditions can be extrapolated from clinical notes. Additional research questions were whether unsupervised learning models could be used and whether the approach could be scaled to other diseases. Detecting high-mortality diseases from clinical notes can significantly benefit primary care providers, medical specialists, and ultimately patients, as people increasingly rely on multiple care providers for their medical needs. Four models were compared: three supervised and one unsupervised. Both multiclass and binary classification approaches were analyzed. A binary logistic regression performed best, with 91% balanced accuracy and 92% weighted recall, and an unsupervised neural network model also achieved good results, with 89% balanced accuracy and 90% weighted recall. These results indicate that high-mortality conditions can be extrapolated from clinical notes with a high degree of accuracy, including with unsupervised learning. With further research, the approach can be scaled to additional diseases.
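The two reported metrics, balanced accuracy and weighted recall, can be illustrated with a minimal sketch. This is not the project's code: the function names and the toy labels below are assumptions made purely for illustration, and the definitions used are the standard ones (balanced accuracy as the unweighted mean of per-class recall, weighted recall as the support-weighted mean of per-class recall).

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (each class counts equally)."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def weighted_recall(y_true, y_pred):
    """Mean of per-class recall, weighted by each class's support."""
    support = Counter(y_true)
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls[c] * support[c] for c in support) / len(y_true)


# Hypothetical toy labels: 1 = high-mortality condition, 0 = other.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

bal_acc = balanced_accuracy(y_true, y_pred)
w_recall = weighted_recall(y_true, y_pred)
print(f"balanced accuracy: {bal_acc:.3f}")
print(f"weighted recall:   {w_recall:.3f}")
```

Under these definitions, weighted recall is arithmetically equal to overall accuracy, while balanced accuracy gives every class the same weight. That is why the two figures can differ when classes are imbalanced, as is common for high-mortality conditions.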