Capstone Projects

Contextual Data Anomaly Detection Using Adaptive Machine Learning

Program: Data Science Master's
Location: Wisconsin (onsite)
Student: Brian Wells

Data anomalies, also known as outliers or deviations, are relatively rare observations that are inconsistent with the patterns established by the rest of a dataset. Historically, anomaly detection has relied on statistical methods and, more recently, on unsupervised machine learning methods such as distance-based or density-based clustering. This project proposes using a supervised machine learning algorithm to perform anomaly detection at scale for commercial data curation. Using a gradient boosted trees (GBT) algorithm, I demonstrate that it is possible to use the target dataset to create an artificial representation of itself, the contrast data, through a process similar to encoding. Because one-off outliers are challenging for such a model to capture, random errors are not represented well in the contrast data. Comparing the two therefore yields significant differences between actual and estimated values wherever the actual values are anomalous. The technique was successfully applied to detect data errors in a sizable collection of engineering data used in machine performance meta-modeling.
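
The comparison of actual and estimated values described above can be illustrated with a minimal sketch. It is not the project's implementation: the use of scikit-learn's GradientBoostingRegressor, the out-of-fold estimates, the robust (median absolute deviation) scoring, the threshold of 6, and the synthetic data are all assumptions made for illustration, and the function name flag_anomalies is hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict


def flag_anomalies(X, y, threshold=6.0, seed=0):
    """Flag values of y that differ strongly from GBT estimates.

    Assumption: estimates are produced out-of-fold, so a one-off error
    cannot shape its own estimate. The abstract does not say how the
    project builds its artificial representation.
    """
    model = GradientBoostingRegressor(
        n_estimators=100, max_depth=3, random_state=seed
    )
    # Estimated ("contrast") values for every row, each predicted by a
    # model that never saw that row during training.
    estimated = cross_val_predict(model, X, y, cv=5)
    residual = y - estimated

    # Robust score: distance from the median residual in units of the
    # median absolute deviation (scaled to match a standard deviation
    # under normality). The scoring rule itself is an assumption.
    centre = np.median(residual)
    mad = np.median(np.abs(residual - centre))
    scores = np.abs(residual - centre) / (1.4826 * mad + 1e-12)
    return scores > threshold, scores


# Synthetic stand-in for a curated dataset (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=500)

# Inject three one-off errors at arbitrary rows.
corrupted = [10, 200, 400]
y[corrupted] += 25.0

mask, scores = flag_anomalies(X, y)
print("rows flagged:", np.flatnonzero(mask))
```

In this sketch the injected errors stand out because the model estimates each value from the pattern in the remaining rows, so the corrupted rows receive estimates close to their uncorrupted values. Rows near an injected error may also be flagged, since the error can distort the estimates of its neighbours; the threshold would need tuning on real data.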