By Rudolf Ammann, on 17 December 2010
Claire Warwick, director of the Centre for Digital Humanities at University College London, said that humanities researchers had been using the word-frequency techniques described by Michel and Aiden for several decades. But the sheer size of their dataset marked it out from the usual tools. “What’s different is that this allows people to not just look at several hundred thousand words or several million words but several million books. So the overview is much bigger. That may bring out some hitherto unexpected ideas.”
The database of 500bn words is thousands of times bigger than any existing research tool, with a sequence of letters 1,000 times longer than the human genome. The vast majority of the text, around 72%, is in English, with smaller amounts in French, Spanish, German, Chinese, Russian, and Hebrew.
“In science, huge datasets which people have used super-computing on have led to some fascinating new discoveries that otherwise wouldn’t be possible,” said Warwick. “Whether that’s going to be the same in the arts and humanities, I don’t know yet.”
The scanned books can now be mined for cultural trends with very little effort using Google’s Ngram Viewer.
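The viewer plots how often a phrase appears each year as a share of all the words published that year. A rough sketch of that word-frequency idea, in a few lines of Python, might look like the following (the toy corpus and function name are invented for illustration and are not Google’s code or data):

```python
from collections import Counter


def relative_frequency(texts_by_year, word):
    """For each year, return how often `word` appears as a
    fraction of all words published that year."""
    result = {}
    for year, texts in texts_by_year.items():
        words = []
        for text in texts:
            words.extend(text.lower().split())
        counts = Counter(words)
        total = len(words)
        # Share of the year's words accounted for by `word`.
        result[year] = counts[word.lower()] / total if total else 0.0
    return result


# Invented two-"year" corpus, purely to show the shape of the output.
corpus = {
    1900: ["the telegraph changed everything", "the telegraph age"],
    2000: ["the internet changed everything", "everyone uses the internet"],
}

freq = relative_frequency(corpus, "telegraph")
```

The real system does the same kind of counting, only over 500bn words and with smoothing and phrase (n-gram) matching rather than single words.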
“One of the ways to use this is to suggest ideas,” said Warwick. “You can look at something like this and say, how fascinating that a certain term seems to occur so commonly and I wonder why that should be.”
On 17 March, Warwick will deliver a public Lunch Hour Lecture at UCL on Twitter and Digital Identity.