Twitter Data Analysis to Effectively Promote a Leadership Development Class Online
Program: Data Science Master's Degree
Location: Not Specified (onsite)
Student: Allen Engel
This paper examines how mining Twitter data can help promote a leadership development class online. For content to be promoted effectively on Twitter, it needs to be retweeted. We therefore frame our research questions around identifying which features of tweets in the leadership development domain are associated with being retweeted or not being retweeted. We also want to identify which non-text versus text-based data help explain whether a tweet is retweeted. We applied various preprocessing and feature extraction techniques to about 100,000 tweets. We examined counts of various non-text variables and ultimately chose n-grams for analyzing tweet text. We focused on binary classification: retweeted versus not retweeted. For classification, we evaluated Logistic Regression, Support Vector Machines, Random Forests, and several Penalized Regression algorithms. Random Forests ultimately had the highest prediction accuracy, though we also used output from our top-performing Penalized Regression, which was Ridge Regression. Our research questions were ultimately supported: we found various non-text and text-based variables that helped explain whether a tweet was retweeted. Achieving a highly accurate model was challenging, as predicting human behavior (including the act of retweeting) is inherently difficult. At the same time, there is helpful information that, if methodically considered, can increase the probability of being retweeted and enable effective promotion of leadership development content online.
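As a rough illustration of the n-gram approach described above, the following minimal sketch counts word n-grams separately for retweeted and non-retweeted tweets. It is not the code used in this study: the field names `text` and `retweet_count`, the whitespace tokenizer, the rule that a tweet counts as retweeted when its retweet count is above zero, and the sample tweets are all assumptions made for the example.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of a tweet, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts_by_label(tweets, n=2):
    """Count n-grams separately for retweeted (True) and non-retweeted (False) tweets.

    Assumption: a tweet is labeled as retweeted when retweet_count > 0.
    """
    counts = {True: Counter(), False: Counter()}
    for tweet in tweets:
        label = tweet["retweet_count"] > 0
        counts[label].update(ngrams(tweet["text"], n))
    return counts


# Hypothetical sample data, invented for illustration only.
sample = [
    {"text": "Great leaders listen first", "retweet_count": 3},
    {"text": "Leaders listen first then act", "retweet_count": 0},
    {"text": "Great leaders listen and learn", "retweet_count": 1},
]

counts = ngram_counts_by_label(sample, n=2)
print(counts[True].most_common(3))
```

Per-class counts like these could serve as a starting point for the text features passed to a binary classifier; in practice, a library vectorizer and a larger vocabulary would likely replace this hand-rolled version.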