Exploratory Case Study on Identifying Phishing Email with Text Mining
Program: Data Science Master's Degree
Location: Not Specified (onsite)
Student: Esmeralda D. Robledo
The growing problem of phishing emails has created a need for intelligent phishing email identification. Only when the characteristics of phishing emails are well understood can effective countermeasures be taken to minimize the risk these fraudulent emails pose. This paper describes three text mining techniques for intelligent phishing email identification: topic modeling, network analysis, and clustering, with a focus on unsupervised methods. A fraud email dataset from Kaggle is used to demonstrate how these techniques can be applied to analyze emails. First, topic modeling is used to find a set of topics. Next, network analysis is applied to the topic results so that the relationships among them can be viewed as network graphs. Finally, after the email topics have been explored, a strategy is formed for applying a clustering model that identifies phishing/fraud emails in the dataset. The results on the fraud email dataset showed that the text mining techniques used in this paper can easily and accurately cluster emails into legitimate and phishing groups. This case study ends by recommending these techniques as a source of information for improving decisions about information security risks. Text mining techniques are part of the study of data science, and because phishing emails continue to make it past email filtering systems, the demand for security data science will only increase to aid in the defense against fraud on computer information systems.
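To make the clustering step of the pipeline concrete, the following is a minimal sketch and not the implementation used in this study. It assumes TF-IDF term weighting and a two-cluster k-means with cosine similarity, neither of which is specified above, and it runs on six toy messages invented for illustration rather than on the Kaggle dataset. All function names (tokenize, tfidf_vectors, kmeans_two, top_terms) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy messages, invented for illustration (NOT from the Kaggle dataset).
EMAILS = [
    "urgent transfer of funds to your bank account",
    "confidential bank transfer of funds requires your account details",
    "funds transfer urgent bank account confidential",
    "meeting agenda for the project review on monday",
    "project review meeting moved to monday afternoon",
    "agenda for monday project meeting attached",
]

# Assumption: a tiny hand-picked stopword list; a real study would use a fuller one.
STOPWORDS = {"of", "to", "your", "for", "the", "on", "a", "and", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed inverse document frequency, so every weight is positive.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vecs):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vecs:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vecs) for t, w in total.items()}


def kmeans_two(vectors, max_iter=20):
    """Two-cluster k-means using dot-product similarity on normalized vectors."""
    # Deterministic seeding: the first message and the message least similar to it.
    far = min(range(len(vectors)), key=lambda i: dot(vectors[0], vectors[i]))
    centroids = [dict(vectors[0]), dict(vectors[far])]
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(2), key=lambda c: dot(v, centroids[c])) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(2):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids


def top_terms(centroid, k=4):
    """Highest-weighted terms of a centroid, as a rough description of the cluster."""
    return [t for t, _ in sorted(centroid.items(), key=lambda kv: -kv[1])[:k]]


if __name__ == "__main__":
    labels, centroids = kmeans_two(tfidf_vectors(EMAILS))
    for c, centroid in enumerate(centroids):
        print("cluster", c, "top terms:", top_terms(centroid))
    for label, text in zip(labels, EMAILS):
        print(label, text)
```

On this toy input the three money-transfer messages fall into one cluster and the three meeting messages into the other, because the two groups share no terms once stopwords are removed. Real email text overlaps far more, so the separation reported for the Kaggle dataset depends on the preprocessing and models chosen in the study, not on this sketch.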