The Development of Feed Type Classification Algorithms for a Commercial Testing Laboratory

Program: Data Science Master's Degree
Location: Not Specified (onsite)
Student: Kyle Taysom

Commercial feed testing laboratories receive samples from customers and classify them into feed types using a combination of information provided by the customer, visual appraisal of the sample, and the knowledge of laboratory staff. These classifications inform decisions about how the sample is handled in the laboratory, including the methods of analysis used and the quality control and quality assurance procedures applied. They also affect the interpretive information provided alongside the nutrient analysis in reports returned to the customer. Because customers do not reliably provide accurate sample descriptions and some feed types are difficult to distinguish visually, these classifications are prone to error. Each feed type produces nutrient measurements that fall into predictable ranges and distinguish it from other feed types. Given enough diversity across feed types and enough nutrients measured on each sample, classification models can be built to predict the feed type from its nutrient contents. This project uses a large dataset of feed analyses from a commercial laboratory to build such models. A series of statistical filters, including principal component analysis and cluster analysis, was built to remove outlier measurements from the dataset, detect samples with abnormal nutrient relationships for their feed type, and detect sub-populations within the laboratory's pre-defined classifications. After the dataset was cleaned, random forest and support vector machine models were created for each feed type. Once the laboratory has assigned an initial feed type based on customer information and visual appraisal, it can use these models to confirm or reject that classification. Both random forest and support vector machine models reliably identified samples that did not belong to the target feed type, with mean specificities of 98.46% and 97.43%, respectively.
However, support vector machines were more reliable at identifying samples that did belong to the target feed type, with a mean sensitivity of 98.91% compared to 85.28% for random forests. The procedures presented in this project can be used to create reliable classification models that overcome many of the challenges present in commercial feed testing laboratory data, including two sources of sparsity: customers do not always request the same nutrient analyses on each sample, and each feed type has a different set of routinely measured nutrients.
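
The sensitivity and specificity figures reported above can be illustrated with a minimal sketch. It assumes each per-feed-type model is scored as a binary classifier, where True means "belongs to the target feed type". The function name and the example labels are hypothetical and are not taken from the project's data or code.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    True means the sample belongs to the target feed type.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # target, predicted target
    fn = sum(1 for t, p in pairs if t and not p)      # target, predicted other
    tn = sum(1 for t, p in pairs if not t and not p)  # other, predicted other
    fp = sum(1 for t, p in pairs if not t and p)      # other, predicted target
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one target and one non-target sample")
    sensitivity = tp / (tp + fn)  # share of target samples correctly accepted
    specificity = tn / (tn + fp)  # share of non-target samples correctly rejected
    return sensitivity, specificity


# Hypothetical labels for one feed-type model (illustrative only).
y_true = [True, True, True, True, False, False, False, False, False, False]
y_pred = [True, True, True, False, False, False, False, False, False, True]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.4f} specificity={spec:.4f}")
```

Under this reading, a mean sensitivity or specificity is the average of these values across the per-feed-type models. High specificity means a model rarely accepts a sample of another feed type, while high sensitivity means it rarely rejects a genuine sample of its own feed type.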