Exploring Natural Language Processing Techniques for Information Retrieval from Clinical Notes
Program: Data Science Master's Degree
Location: Fort Collins, Colorado (remote)
Student: Bradley Johnson
With the implementation of Electronic Health Record (EHR) systems, the amount of data that can be stored and accessed in real time has increased immensely. Most of the data currently used in medical research is structured, but new studies are looking to use the unstructured data stored in EHR systems, most commonly clinical notes. Information is currently extracted from clinical notes through manual chart review, a time-consuming and costly process. This paper presents three groups of methods for performing information retrieval from clinical notes using Natural Language Processing (NLP) techniques. Exploratory data analysis methods are built on two frameworks: term frequency document and term-document matrix. Text preprocessing methods cover the removal of stopwords and other words, lemmatization, and tokenization. Information retrieval methods use regular expressions, named entity recognition, and keyword extraction. The methods are explored through a case study and found to be more cost-effective and time-efficient than manual chart review while maintaining a high level of accuracy.
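To make the three groups of methods concrete, the following is a minimal sketch in Python using only the standard library. It is not the implementation used in the case study. The notes are synthetic, the stopword list is a small illustrative one, and the names `tokenize`, `preprocess`, `term_document_matrix`, and `extract_blood_pressure` are hypothetical. Lemmatization and named entity recognition are left out because they would normally rely on an external NLP library.

```python
import re
from collections import Counter

# Hypothetical, synthetic notes for illustration only (not case-study data).
NOTES = [
    "Patient reports chest pain. BP 140/90 mmHg. Started aspirin 81 mg daily.",
    "Follow-up visit: chest pain resolved. BP 120/80 mmHg. Continue aspirin.",
]

# Small illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "on", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text):
    """Tokenize the text and drop stopwords."""
    return [token for token in tokenize(text) if token not in STOPWORDS]


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[i] in docs[j]."""
    counts = [Counter(preprocess(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[count[term] for count in counts] for term in vocab]
    return vocab, matrix


# Assumed note convention: blood pressure written as "BP <systolic>/<diastolic>".
BP_PATTERN = re.compile(r"\bBP\s*(\d{2,3})\s*/\s*(\d{2,3})\b")


def extract_blood_pressure(text):
    """Return every (systolic, diastolic) pair found in the text."""
    return [(int(sys), int(dia)) for sys, dia in BP_PATTERN.findall(text)]


if __name__ == "__main__":
    vocab, matrix = term_document_matrix(NOTES)
    for term, row in zip(vocab, matrix):
        print(f"{term:10s} {row}")
    for note in NOTES:
        print(extract_blood_pressure(note))
```

Each row of the matrix gives one term's count in every note, which supports the exploratory analysis step, while the regular expression pulls a structured value out of free text. The pattern only matches the one assumed notation and would need to be extended for the variations found in real clinical notes.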