Data mining PII via optical character recognition on public image hosting sites
Program: Data Science Master's Degree
Location: Not Specified (onsite)
Student: Shawn Peters
Previous research has shown that credential compromise is the catalyst for many data breaches and cyberattacks. Most web applications offer safeguards that give users control over their security and privacy, but cybercriminals’ tactics continue to evolve. This study aims to demonstrate that username and password data can be mined from nontraditional sources, specifically text in images. An optical character recognizer is used to extract text from images on a public image hosting service; the extracted text is then categorized by keyword and mined for credentials. Text extracted from 1.18 million images is analyzed for personally identifiable information (PII) and mined for user-, service-, and system-level credentials. Analysis of a focused subset of the data uncovered over 1,000 usernames and passwords, and an additional branch of mining uncovered several Social Security numbers. This investigation demonstrates that publicly hosted images contain compromising textual data and that this data could be collected for criminal use.
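The keyword categorization and credential mining steps described in the abstract could look roughly like the sketch below. It is an illustration only, not the study's implementation: the keyword lists, the regular expressions, and the function names (`categorize`, `find_credentials`, `find_ssn_like`) are all assumptions, and the input is taken to be text that an OCR engine has already produced.

```python
import re

# Hypothetical keyword lists; the study's actual categories are not stated.
CATEGORY_KEYWORDS = {
    "credentials": ("username", "password", "passwd", "login"),
    "identity": ("ssn", "social security"),
}

# Assumed pattern: a field label followed by ':' or '=' and a value,
# e.g. "Username: alice" or "password = hunter2".
CREDENTIAL_RE = re.compile(
    r"\b(username|user|login|password|passwd|pwd)\b\s*[:=]\s*(\S+)",
    re.IGNORECASE,
)

# Assumed pattern: the common NNN-NN-NNNN layout of a US Social Security number.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def categorize(text):
    """Return the sorted names of all categories whose keywords appear in text."""
    lowered = text.lower()
    return sorted(
        name
        for name, words in CATEGORY_KEYWORDS.items()
        if any(word in lowered for word in words)
    )


def find_credentials(text):
    """Return (field, value) pairs for every labelled credential found."""
    return [(m.group(1).lower(), m.group(2)) for m in CREDENTIAL_RE.finditer(text)]


def find_ssn_like(text):
    """Return every substring laid out like a Social Security number."""
    return SSN_RE.findall(text)


# Made-up text standing in for the OCR output of a single image.
sample = "Username: alice\nPassword = hunter2\nSSN 123-45-6789"
categories = categorize(sample)
credentials = find_credentials(sample)
ssn_like = find_ssn_like(sample)
```

Patterns this simple will produce false positives and miss credentials that OCR has garbled, so matches are best treated as candidates for review rather than confirmed findings.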