Watching #ebola tweets over time
What are all the people on twitter talking about regarding #ebola?
I was interested in finding out what people were using twitter for when discussing the recent ebola outbreak. Where people trying to disseminate information? Or are people just interested in cracking jokes?
For this project I worked with the twitter api, which comes in two flavors: a RESTful implementation and a streaming api. Since the ebola outbreak is a developing event, I built a stream listener in python and collected every tweet starting from Oct 18 tagged with #ebola and stored it into a MongoDB database.
This eventually resulted in a collection of >1 million tweets. I aggregated tweets per hour each day and used nltk and twokenize (link) to tokenize the words in a twitter-aware manner. For example, all punctuation is stripped away except # and @ signs were left attached to words since they represented twitter-specific hashtags and users. The word tokenn were then ranked by their weighted tf-idf values to obtain the top 10 words of the hour. I visualized the result in a D3 area chart:
I probably spent more time than I was hoping to on working with the Mongo database and the D3 visualization, but I’m pretty satisfied with the result. It combines aspects of d3.brush, and the example X-value mouseover at bl.ock.
In it’s current form, the chart simply shows the most talked about words for each hour, giving a historical picture of twitter word bites. For example, a huge spike can be seen on Oct 24 8:00pm, which was when news broke of Dr. Craig Spencer testing positive for ebola.. It’s also interesting to see that on the topic of #ebola, breaking news only causes spikes in activity for ~2 hours before interest seems to fade.
You can also see cyclical twitter activity each day. The fact that peak and low activity times correlate well with timezones in the USA indicates that most people tweeting #ebola reside in the United States. Unsurprisingly, the lowest activity times are during sleeping hours. Activity appears to peak in the morning hours just before lunch, so it seems that people like to check the news (and tweet about it) in the morning hours as they arrive at work.
The original plan was to cluster the tweets, but I was unable to get any satisfactory results possibly because tweets are very short bits of text that aren’t constructed like typical proper sentences. If I were to revisit this, I’d like to try using Latent Dirichlet Allocation to extract topics within the tweets.