Twitter is probably one of the most obvious resources available for gauging public sentiment. It offers a rich, large-scale data source that can give insight into what people are thinking without having to interview or survey them. However, the use of Twitter data for research is relatively unexplored terrain. So before conceiving of any “serious” research studies, my colleague Alex Ghanouni and I decided to explore Twitter as a data resource. In this piece, I would like to share our thoughts about our first informal attempt at venturing into the ‘Twittersphere’.
The starting point of our adventure was a curiosity about what is being said about cancer treatment and cancer prevention in social media. We adapted publicly available Python code to track keywords in real time. The first iteration was for one hour only (12th March 2015); the subsequent two iterations were for 24 hours each (24th March; 5th May).
One seemingly easy question to address was the volume of tweets about “cancer treatment” and “cancer prevention” in relation to each other and to “cancer” in general. We assumed that a simple count of tweets would answer our question. However, after the second iteration, it became apparent how naive our initial searches had been: many of the tweets found using the keyword “cancer” turned out to be referring to the zodiac sign. As we could not think of a select group of second keywords that would be (almost) guaranteed to be used in conjunction with “cancer” the disease, we gave up on tracking “cancer” alone.
We had more luck with “cancer treatment”, “cancer prevention” and their relatively unambiguous synonyms and permutations. The volume of tweets for “cancer treatment” (24th March: 8355; 5th May: 5558) was consistently larger than that for “cancer prevention” (24th March: 5156; 5th May: 1487). This was even true around the 24th of March, the day when the news broke about Angelina Jolie’s preventative surgical removal of her ovaries and fallopian tubes. Although these findings do not reveal what is said about these topics, they should nevertheless give an indication of how much interest the topics generate. When it comes to cancer, it appears the public discourse mainly revolves around treatment rather than prevention. This is also in line with what we expected based on our professional and personal experience. Although our present investigation could not have been more rudimentary, more serious attempts at tracking specific keywords over longer periods of time might lead to genuinely novel insights.
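For readers who want to try something similar, the counting step described above can be sketched in a few lines of Python. This is only an illustration and not the code we adapted: the phrase lists in `TOPICS` are hypothetical stand-ins (the exact synonyms and permutations are not listed in this post), and `count_by_topic` is a name invented for the example. It assumes the tweet texts have already been collected into a list of strings.

```python
from collections import Counter

# Hypothetical phrase groups; the actual synonyms and permutations
# tracked in the study are not given in the post.
TOPICS = {
    "treatment": ["cancer treatment", "cancer therapy", "treating cancer"],
    "prevention": ["cancer prevention", "preventing cancer", "prevent cancer"],
}


def count_by_topic(tweets, topics=TOPICS):
    """Count how many tweets mention at least one phrase from each topic."""
    counts = Counter()
    for text in tweets:
        lowered = text.lower()  # match phrases regardless of capitalisation
        for topic, phrases in topics.items():
            if any(phrase in lowered for phrase in phrases):
                counts[topic] += 1  # each tweet counts once per topic
    return counts


sample_tweets = [
    "New cancer treatment trial announced",
    "Cancer horoscope: a great week ahead",
    "Tips for preventing cancer",
    "Cancer Treatment costs are rising",
]
topic_counts = count_by_topic(sample_tweets)
```

Matching on two-word phrases rather than on “cancer” alone is what keeps the horoscope tweet out of both counts, which mirrors why the phrase-based searches were less ambiguous for us than the single keyword.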
Of course, we were at least as interested in the content of the tweets about “cancer treatment” versus “cancer prevention”. To avoid a time-consuming traditional content analysis, we used the free web-based tool ‘Wordle’ to create word clouds, which reflect the frequency of words in a text. Before creating the word clouds, we removed all search terms from the tweet texts. When we examined the word clouds, it became clear that there were two reasons why words were frequently used. Firstly, the words could be related to “real” news, which was the case for cancer prevention on 24th March from 12pm GMT.
However, in two of the four word clouds we inspected, the most prominent words related to an obscure news source tweeting about dubious cancer cures (most likely for commercial reasons) or out-of-date research findings (a cervical screening paper from 1979). Finally, it was hard to interpret the results of the fourth word cloud, as there were few words that really stood out. A few of the largest words originated from a poem line (“That smile could end wars, and cure cancer”).
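The word counts that underlie a word cloud can also be computed directly, which makes it easier to see exactly which words dominate and why. The sketch below is an illustration under stated assumptions, not a description of what Wordle does internally: `SEARCH_TERMS` and `STOPWORDS` are example values, `word_frequencies` is a hypothetical helper, and stripping links and common words is our own assumption about sensible preprocessing.

```python
import re
from collections import Counter

# Assumed search terms to strip out before counting (as we did before
# building the word clouds); extend with any synonyms that were tracked.
SEARCH_TERMS = ["cancer treatment", "cancer prevention"]

# Small illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "rt"}


def word_frequencies(tweets, search_terms=SEARCH_TERMS, stopwords=STOPWORDS):
    """Return a Counter of word frequencies across all tweet texts."""
    counts = Counter()
    for text in tweets:
        cleaned = text.lower()
        cleaned = re.sub(r"https?://\S+", " ", cleaned)  # drop links
        for term in search_terms:
            cleaned = cleaned.replace(term, " ")  # drop the search terms
        words = re.findall(r"[a-z']+", cleaned)  # keep alphabetic tokens
        counts.update(word for word in words if word not in stopwords)
    return counts


sample_tweets = [
    "Cancer prevention starts with diet",
    "Diet and exercise: cancer prevention tips http://example.com/x",
]
frequencies = word_frequencies(sample_tweets)
top_words = frequencies.most_common(3)
```

Counting over the set of distinct tweet texts instead of all tweets would be one possible way to reduce the weight of a single account repeating the same message, although we did not test this.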
As an academically-trained researcher, I felt compelled to do a quick – albeit not too rigorous – literature search for peer-reviewed publications as well. Both the PubMed and PsycINFO databases yielded around 750 hits containing the keyword “Twitter”. Compared with other one-word search terms, this is a modest number. One review of published health studies using Twitter data concluded that most researchers lacked the knowledge and skills to process the large volumes of data and limited their samples in accordance with their ability to process and analyse the data (Finfgeld-Connett, 2014). A second limitation they noted was the population-representativeness of Twitter users, or rather, the lack thereof. Broadly speaking, we concurred with this review’s conclusions, although we would like to add a few nuances and additional observations.
Let’s start by looking at us, the researchers. Our first research experience with Twitter was in line with the challenges of using Big Data for health research that I discussed in a previous blog post. Most of us who are interested in the content of social media tend to have a social science background. Programming and data mining are therefore not part of the skillset we acquired through formal education. This obviously constrains what we can do with large volumes of data without help from those who are conventionally employed to work with Big Data. Having said that, we felt that the lack of a reliable alternative to human judgement limited us more than our technical skills. We repeatedly needed to resort to simpler forms of analysis (i.e. reading the tweets…) to determine what the data were actually telling us, and there seemed to be no obvious way we could have outsourced this task to an algorithm.
Similarly, although there are probably sophisticated programmes to weed out bot-generated tweets, the authenticity of the tweets might be a more general problem which cannot be easily addressed without human intervention. The most obvious challenge is that tweets originate from a variety of users who have diverse professional, commercial and personal motives. This is compounded by Finfgeld-Connett’s observation regarding the representativeness of Twitter users.
These challenges may not be insurmountable, but they do highlight that Twitter data is far from “clean” and straightforward to interpret for health research purposes. I for one will be keeping a keen eye on future research endeavours tackling these issues.
Finfgeld-Connett, D. (2014), ‘Twitter and Health Science Research’, Western Journal of Nursing Research, 1-15.