Bluclobber, or: Enabling complex analysis of large scale digital collections
By Melissa M Terras, on 7 May 2015
Recently, Jisc announced their Research Data Spring programme, which is providing funding to a variety of pilot projects in order to find “new technical tools, software and service solutions, which will improve researchers’ workflows and the use and management of their data”. We’re delighted that our pitch, ‘Enabling Complex Analysis of Large Scale Digital Collections’, in collaboration with the British Library Digital Research team and UCL Research Computing, is one of the funded projects. The idea that we pitched is that
The British Library (BL) has numerous digital datasets, but not the processing power for users to run advanced queries against them or analyse them. We will use UCL’s world-leading Research Computing to open up this digital data, investigating the needs and requirements of a service that will allow researchers to undertake complex searching of the BL’s digital content.
Over a three-month period, we’ll be exploring how to get the BL’s dataset of 65,000 digitised, out-of-copyright books onto UCL’s High Performance Computing facilities. We’ll then work with a range of researchers from across the Arts and Humanities, running “easy”, then more complex, then really quite tricky searches across this corpus (which represents 4% of the British National Bibliography), both to aid those researchers and to work out how we can facilitate access to this type of computing for a much wider research audience. James Baker, a Curator of Digital Research at the British Library, has already blogged about the overview of the project. So, a month into a three-month project, how are we getting on?
The first thing, of course, was to get the data across to UCL from the BL. We are just across the road, really, and still, as is the nature of big datasets, it was easier to carry it over on physical media than to send it over the network. The first part of the project is really about ingesting the data, mounting it, and understanding its structure, and we’re very lucky to have James Hetherington from UCL Research Computing on the case. The data itself comprises 224GB of compressed ALTO XML representing 60,000+ 17th-, 18th-, and 19th-century books, and one of the interesting features of the data is that each word is expressed as a single line of XML recording where it sits on the page (offsets from the top and from the left) in pixels. We are therefore not just dealing with vanilla OCR output: we can reconstitute the layout of the page, which also means we have to reconstitute the running text before doing any text mining. For example, the word “plaza” in one of the books is encoded thus, with its place on the page:
<String ID="P56_ST00016" HPOS="367" VPOS="152" WIDTH="76" HEIGHT="35" CONTENT="plaza" WC="0.92" CC="03000"/>
This is exciting, as we can do some interesting things with the placement of text and images further down the line; but first comes the reconstituting, and getting things ready to process. There are other quirks: the data for each individual book is provided in one zip file, which means we’ve got 65,000 of the blighters, and we need to restructure the data so that we can parallelise the reads, and so on.
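To give a flavour of what that reconstitution involves, here is a minimal sketch in Python (the language we’ve settled on, of which more below). It pulls the <String> elements out of a page of ALTO XML and rebuilds the lines of text in reading order. The file name is made up, and the exact ALTO layout (namespaces, the TextLine grouping) is an assumption on my part rather than our production code:

import zipfile
import xml.etree.ElementTree as ET

def page_text(alto_xml):
    # TextLines appear in document order; within a line, sort words
    # left-to-right by HPOS (pixels from the left edge of the page).
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if element.tag.endswith("TextLine"):
            words = sorted((int(s.get("HPOS", "0")), s.get("CONTENT", ""))
                           for s in element if s.tag.endswith("String"))
            lines.append(" ".join(word for _, word in words))
    return "\n".join(lines)

# Each book arrives as a single zip of per-page ALTO files
# ("book_0001.zip" is a hypothetical name, not the BL's own):
with zipfile.ZipFile("book_0001.zip") as z:
    for name in z.namelist():
        if name.endswith(".xml"):
            print(page_text(z.read(name)))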
Now, UCL has fantastic infrastructure for research computing, but we’re the first project to use it for the Arts and Humanities (woohoo, that has been an ambition of mine for a few years). Legion, our centrally funded resource for running complex and large computational queries across a large number of cores, is normally used across the sciences, and from a technical point of view it hasn’t been set up for this type of data, so time has had to be taken to install XML libraries on Legion, the scientists having had no need for them previously. Choices follow: what language should we use to query? We’ve chosen Python, so we can express queries in a language comprehensible to domain experts. We also need to develop efficient mathematical models for querying, to work out how many cores are needed for processing: in the first tests we’ve analysed 1/150th of the corpus on one node of 16 cores, and those early tests suggest that 100 machines should be able to analyse the whole corpus in a few seconds, but we need to test that further.
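The nice property of the corpus is that it is embarrassingly parallel: one zip per book, with no book depending on any other. So the simplest queries fan out along the lines of the sketch below, which counts occurrences of a single word across every book using the 16 cores of one node. The directory layout and the search word are illustrative only, and on Legion the fan-out would really happen across many nodes via the job scheduler rather than a single multiprocessing pool:

import glob
import zipfile
import xml.etree.ElementTree as ET
from multiprocessing import Pool

def count_word_in_book(args):
    zip_path, word = args
    total = 0
    with zipfile.ZipFile(zip_path) as z:
        for name in z.namelist():
            if name.endswith(".xml"):
                root = ET.fromstring(z.read(name))
                # Count <String> elements whose CONTENT matches the word.
                total += sum(1 for el in root.iter()
                             if el.tag.endswith("String")
                             and el.get("CONTENT", "").lower() == word)
    return total

if __name__ == "__main__":
    books = glob.glob("corpus/*.zip")  # hypothetical layout on disk
    with Pool(16) as pool:             # one node, 16 cores
        counts = pool.map(count_word_in_book,
                          [(path, "plaza") for path in books])
    print(sum(counts), "occurrences across", len(books), "books")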
And here we have a screenshot of the text of the book, reconstituted from the XML. What you are seeing here is a paragraph of text and, below it, a repainting of the exact placement of words on the page (it looks a bit like a ransom note cut from magazines at the moment, but hey, it works!). We’re now placed to run some one-word searches across the corpus, testing efficiency and proving that we can do global queries across the corpus successfully. We have plenty of choices to make about what information we can and should report back to those who requested the search, and what format is most useful. We’ve got the data up and running; now comes refinement, and at our next team meeting in a couple of weeks we should have results from a couple of real queries, keeping note of the technical issues we face along the way. We have two other members of the core team: UCLDH’s David Beavan is keeping us on track with the computational linguistics element of the project, given his previous research background in this area, and UCL CASA‘s Martin Austwick is helping with data visualisation of the search results, given his background in that area.
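For the curious, the repainting itself needs nothing more exotic than the HPOS and VPOS attributes. A hedged sketch, assuming matplotlib, a made-up page file name, and guessed page dimensions (our actual plotting code may differ):

import xml.etree.ElementTree as ET
import matplotlib.pyplot as plt

root = ET.parse("page_056.xml").getroot()  # hypothetical page of ALTO
fig, ax = plt.subplots(figsize=(8, 11))
for el in root.iter():
    if el.tag.endswith("String"):
        # Draw each word at its pixel coordinates from the OCR.
        ax.text(int(el.get("HPOS")), int(el.get("VPOS")),
                el.get("CONTENT", ""), fontsize=6, va="top")
ax.set_xlim(0, 2500)   # page width in pixels (assumed)
ax.set_ylim(3500, 0)   # inverted: VPOS is measured from the top
ax.set_axis_off()
plt.show()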
Alongside the technical aspects of getting the data and infrastructure to a stage where we can do what we promised (run searches across all of the books!), we’ve also identified four early career researchers who have detailed queries to run, and we’ll be undertaking two days of workshops with them in June, learning more about their needs and what we need to do to make this useful for Arts and Humanities researchers. But discussing those searches is for another blogpost! I’ll leave it here with a note on the title of this blogpost: the long title is the formal name of the project, but James H, in the screenshot above, decided to call the project Bluclobber: British Library UCL Open Books something something something (he forgets) … and, well, the twitter handle was free, so it may well stick.
Our next meeting is at the British Library in a few weeks’ time, and we’ll have more to report back then. A final word: we need to pitch for the next round of funding, so if you want to vote our idea up the ranks a bit for second round funding, feel free to upvote us here. Thanks! We have big hopes for rolling this out to a much, much wider audience in the future…