Validatable Data Pipeline and Reporting for Regulated Industry Custom Development Projects

Program: Data Science Master's Degree
Location: Not Specified (remote)
Student: Patrick Cassidy

This project created a controlled software environment for potential use in FDA-regulated research, development, and manufacturing. Within that environment, Biopython was used to analyze protein sequence data from NCBI and identify potential target sequences for recombinant manufacturing in E. coli.

The environment was built as a Docker image that included Git, the AWS CLI, and Micromamba to manage the Python dependencies (Biopython, boto3, Jupyter, and pyMSAviz). Git/GitHub and AWS provided the compliance features, namely user access control, version history, and controlled storage.
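
As an illustration of how the contents of such an environment could be recorded for traceability, the following is a minimal sketch and not the project's actual tooling. It uses only the Python standard library to write out the interpreter version and the installed versions of the packages named above. The function name `environment_manifest` and the exact distribution names in `PACKAGES` are assumptions made for this example.

```python
import json
import platform
from importlib import metadata

# Package names are taken from the description above; whether each one is
# installed under exactly this distribution name is an assumption.
PACKAGES = ["biopython", "boto3", "jupyter", "pymsaviz"]


def environment_manifest(packages=PACKAGES):
    """Return the Python version and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {"python": platform.python_version(), "packages": versions}


# Serialize with sorted keys so the same environment always yields the same text.
manifest_json = json.dumps(environment_manifest(), indent=2, sort_keys=True)
print(manifest_json)
```

A manifest like this could be committed to the Git repository alongside the analysis so that each set of results is tied to a known set of package versions.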

The container was deployed on a local machine to analyze insulin protein sequences. Sequences were retrieved from NCBI with its BLAST tool and extracted as FASTA files, then cleaned and filtered for alignment in MEGA. The aligned sequences were annotated, analyzed, and compared for desired characteristics to find the 10 most promising targets for future development. To verify that the results were reproducible, the analysis was replicated on a second machine.
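
The cleaning step and the reproducibility check described above can be sketched as follows. This is a hedged illustration, not the project's code: it uses the standard library in place of Biopython so that it runs without extra dependencies, and the filtering criteria (length bounds and removal of exact duplicates), the function names, and the example sequences are all assumptions. The idea is that two machines running the same analysis should produce the same digest.

```python
import hashlib
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_records(records, min_len=1, max_len=10_000):
    """Drop exact duplicate sequences and sequences outside the length bounds."""
    seen, kept = set(), []
    for header, seq in records:
        seq = seq.upper()
        if not (min_len <= len(seq) <= max_len) or seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept


def results_digest(records):
    """Return a SHA-256 hex digest of the records, independent of their order."""
    h = hashlib.sha256()
    for header, seq in sorted(records):
        h.update(f">{header}\n{seq}\n".encode("utf-8"))
    return h.hexdigest()


# Hypothetical input: three made-up records, one duplicate and one too short.
example = StringIO(
    ">seq1 hypothetical\nMALWMRLLPL\n"
    ">seq2 hypothetical\nmalwmrllpl\n"
    ">seq3 hypothetical\nMAL\n"
)
cleaned = clean_records(read_fasta(example), min_len=5)
print(len(cleaned))
print(results_digest(cleaned))
```

Comparing the digest printed on the first machine with the one printed on the second gives a simple pass/fail check that both runs produced identical cleaned sequences.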

The demonstration of compliance features and reproducibility of results shows this project could be a foundation for a validatable data pipeline in FDA-regulated biopharmaceutical production.