Capstone Projects

The Development of an Augmented Data Management Package to Aid in the Development of Extract, Transform, and Load Processes

Program: Data Science Master's Degree
Location: Not Specified (onsite)
Student: David Kendall

In many extract, transform, and load (ETL) implementations, parts of the development effort are candidates for automation. When data is stored in a relational database, the primary and foreign keys must be discovered. When data is stored in a graph database, candidate node and edge relationships must be discovered. Once these steps are complete, data quality must be considered: the data must be analyzed for inconsistencies and cleaned before it is loaded into any destination system. This project attempts to create a package that uses machine learning and other computational methods to optimize and improve these steps, a practice that is becoming known as Augmented Data Management. Various software tools already perform the same functions. The goal of this project is to show how a package that performs these functions can be built from scratch and eventually integrated into regular ETL processes. The methods used to identify key relationships depend on the target storage implementation: primary and foreign keys for relational databases, and nodes and edges for graph databases. A measure called the Wharf Coefficient is used to find potential graph relationships, while the ratio of unique values to dataset rows is used to determine primary and foreign keys. Finally, machine learning methods are used to detect anomalies in the datasets. These methods include Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Isolation Forest, and Local Outlier Factor (LOF). The results of each method are discussed in detail and provide a glimpse into the capabilities of Python programming.
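
The ratio-based key discovery mentioned in the abstract can be illustrated with a minimal sketch. This is not the project's actual code: the function names, the threshold of 1.0, the sample tables, and the containment check used for foreign keys are all hypothetical choices made for illustration, on the assumption that a primary key candidate is a column whose ratio of unique values to rows is 1.

```python
from typing import Dict, List, Sequence

# Hypothetical in-memory table: column name -> column values.
Table = Dict[str, Sequence[object]]


def uniqueness_ratio(values: Sequence[object]) -> float:
    """Return distinct values divided by row count (0.0 for an empty column)."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def primary_key_candidates(table: Table, threshold: float = 1.0) -> List[str]:
    """Return columns whose uniqueness ratio meets the assumed threshold."""
    return [name for name, values in table.items()
            if uniqueness_ratio(values) >= threshold]


def foreign_key_candidates(child: Table, parent: Table, parent_key: str) -> List[str]:
    """Return child columns whose values all appear in the parent key column.

    The containment check is an illustrative assumption, not the project's method.
    """
    parent_values = set(parent[parent_key])
    return [name for name, values in child.items()
            if values and set(values) <= parent_values]


# Made-up sample data for demonstration only.
customers = {"customer_id": [1, 2, 3, 4], "city": ["Oslo", "Rome", "Oslo", "Lima"]}
orders = {"order_id": [10, 11, 12, 13, 14], "customer_id": [1, 1, 3, 4, 2]}

print(primary_key_candidates(customers))
print(foreign_key_candidates(orders, customers, "customer_id"))
```

On real data a threshold slightly below 1.0 might be preferable so that a few duplicate or dirty rows do not hide an otherwise valid key, and any candidate would still need confirmation against the schema or a domain expert.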