Archive for the 'READ project' Category

Project Update – Bentham vs the computer

By Louise Seaward, on 23 February 2018

Throughout its long history, the Bentham Project has always been interested in the way in which technological advances could be integrated into its work on the scholarly edition of Bentham’s Collected Works.  Transcribe Bentham is currently a proud partner in an international collaboration focused on using innovative computer science techniques to process historical manuscripts.  The mission of the READ (Recognition and Enrichment of Archival Documents) project is to make archival collections more accessible through the development and dissemination of Handwritten Text Recognition (HTR) technology.

This technology is freely available through the Transkribus platform.  Using algorithms of machine learning, it is possible to teach a computer to read a particular kind of handwriting – even Bentham’s!  The technology is trained by being shown images of documents and their accurate transcriptions.  Thanks to the hard work of the Transcribe Bentham volunteers, we are lucky to have a sizeable collection of transcripts that can be used as training data for automated text recognition.

We last summarised our experiments with HTR in a blog post from June 2017.   At that point, we had used technology from the Computational Intelligence Technology Lab (CITlab) at the University of Rostock to produce a model capable of processing the easier papers from the Bentham collection, largely those written by Bentham’s secretaries.  This model can automatically produce transcripts with a Character Error Rate of between 5 and 10%, meaning that 90-95% of characters in the transcript are correct.  The Bentham model is now publicly available in Transkribus under the title ‘English Writing M1’ and has been applied to other collections of eighteenth- and nineteenth-century English handwriting with some success.
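For readers curious about the arithmetic behind a Character Error Rate, here is a minimal sketch in Python.  It assumes the common definition, the edit distance between the computer's transcript and a correct reference transcript divided by the length of the reference.  We have not described exactly how Transkribus calculates the figure, so the function names below are purely illustrative.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Smallest number of single-character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one dropped letter in a 22-character line.
reference = "the greatest happiness"
hypothesis = "the greatest hapiness"
print(f"CER: {character_error_rate(reference, hypothesis):.1%}")
```

Because inserted characters also count as errors, "percentage of characters correct" is a convenient approximation of 100% minus the error rate rather than an exact equivalent.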

Screenshot from Transkribus with automatically generated transcript. Box Add 3350, fo. 158, The British Library

Although this model copes well with documents where the handwriting and layout are relatively clear, it struggles to recognise the more difficult examples of writing from Bentham’s own hand.  So we decided to take on the challenge of teaching a computer to read some of the very worst examples of Bentham’s handwriting!

We used the Transkribus platform to create training data based on Boxes 30 and 31 of the Bentham Papers held in UCL Special Collections.  These manuscripts were part of Bentham’s lifelong obsession with critiquing the work of William Blackstone, the English jurist who was most famous for his Commentaries on the Laws of England (1765-9).  To create the training data, we uploaded around 200 digital images to Transkribus, segmented each image into lines and then copied over existing transcripts to match each image.

The resulting model generates transcripts with an average Character Error Rate of 26%.  This error rate is unfortunately too high to automatically produce transcripts suitable for scholarly editing.  Nevertheless, it does have the potential to facilitate the full-text search of the Bentham Papers.  Transkribus now includes sophisticated Keyword Spotting technology, which is capable of finding words and phrases in documents, even if they have been mistranscribed by the computer.
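To give a feel for why an imperfect transcript can still be searched, here is a small Python sketch.  It is not how Transkribus performs Keyword Spotting, which we have not described in technical detail here.  It swaps in a much simpler stand-in technique, approximate string matching with the standard library's `difflib`, and the function name, threshold and sample text are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_find(keyword: str, transcript: str, threshold: float = 0.75):
    """Return (word, similarity) pairs for transcript words that resemble
    `keyword`, even when a few characters were misread."""
    hits = []
    for word in transcript.split():
        # Ignore surrounding punctuation and letter case when comparing.
        cleaned = word.strip(".,;:!?'\"()").lower()
        if not cleaned:
            continue
        score = SequenceMatcher(None, keyword.lower(), cleaned).ratio()
        if score >= threshold:
            hits.append((cleaned, score))
    return hits


# Hypothetical machine transcript in which "Blackstone" was misread.
transcript = "Blackstcne was an apologist for the law"
for word, score in fuzzy_find("Blackstone", transcript):
    print(word, round(score, 2))
```

An exact search for "Blackstone" would miss this line entirely, whereas a tolerant comparison still surfaces it for a human to check.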

Screenshot from Transkribus with automatically generated transcript.  Box 31, fol. 78, UCL Bentham Papers, Special Collections, University College London

We are working with the Pattern Recognition and Human Language Technology (PRHLT) research centre at the Universitat Politècnica de València and the Digitisation and Digital Preservation group at the University of Innsbruck to present an open-access search functionality for the Bentham Papers.  We are also hoping that volunteers could get involved in this endeavour by checking and correcting the results of significant search queries to ensure their accuracy.

Our other aim is to improve the recognition of Bentham’s handwriting and, to this end, we will be producing more pages of training data in Transkribus.  The technology moves so fast that this process has already become more efficient thanks to technology from CITlab.  Transkribus can now easily find lines in images (even in documents with complex layouts) and it is also possible to use existing transcripts to automatically train a model, rather than copying them into Transkribus line by line.

If you are interested in following in our footsteps, you are welcome to give Transkribus a try!  You can find more information on the READ website and in the Transkribus How to Guides.

I would like to thank Chris Riley, PhD student and transcription assistant at the Bentham Project, for helping to produce the training data for the latest Bentham model.

Project Update – Report from the British Academy soirée

By Louise Seaward, on 23 June 2017

A guest post by Dr Tim Causer who represented Transcribe Bentham and the Bentham Project at the latest British Academy soirée

Professor Philip Schofield and Dr Tim Causer represented the Bentham Project at the British Academy soirée on 20 June. Over 500 people attended the event and heard talks from a number of British Academy Fellows, and visited stands featuring the work of British Academy Research Projects, of which the Bentham Project is one.

Professor Schofield and Dr Causer, stationed in the Council Room beside Henry Pickersgill’s 1829 portrait of Bentham, discussed with visitors the work of the Project, the production of The Collected Works of Jeremy Bentham, recent open-access publications from UCL Press, and the ongoing and exciting work of the European Commission-funded READ project. Of particular interest to visitors was the Transkribus platform and its Handwritten Text Recognition tools, and the prototype ‘ScanTent’ which, when used in conjunction with the free and forthcoming DocScan app, allows users to efficiently capture images of archival and printed material.

Professor Philip Schofield and Dr Tim Causer at the British Academy

A good time was had by all, particularly under the beneficent eye of Mr Bentham himself!

Project Update – teaching a computer to READ Bentham

By Louise Seaward, on 9 June 2017

The difficulty of Bentham’s handwriting is notorious.  At the Bentham Project, we have years of experience of transcribing Bentham but you will still regularly find us hunched over a manuscript with a magnifying glass or blankly staring at a digital image on a computer screen, zooming in and out on a particular word.

One of the Bentham Project’s favourite tools

Across the last few years, we have been working closely with various teams of computer scientists in the hope of making progress on the automated recognition of Bentham’s writing.  This collaboration started under the tranScriptorium project in 2013 and now continues in its successor project READ (Recognition and Enrichment of Archival Documents).

READ’s mission is to make archival collections more accessible through the development and dissemination of Handwritten Text Recognition (HTR) technology.  This technology is freely available through the Transkribus platform.  Using algorithms of machine learning, it is possible to teach a computer to read a particular style of writing.  The technology is trained by being shown images of documents and their accurate transcriptions.  Anyone can start a test project with around 20,000 words or around 100 pages.

Under the tranScriptorium project, we initially had some success in training a model to process manuscripts from the Bentham collection.  Using around 900 pages of Bentham images and transcripts, researchers from the Pattern Recognition and Human Language Technology (PRHLT) research centre at the Universitat Politècnica de València created an HTR model for us using statistical algorithms called Hidden Markov Models.  This model was able to produce relatively accurate transcriptions of the Bentham papers, with a Character Error Rate of around 18% (meaning that around 82% of the characters in a transcript would be correct).

In the first stage of the READ project, we have already been able to enhance the accuracy of the HTR technology.  The team at the Computational Intelligence Technology Lab (CITlab) at the University of Rostock created a new model using this same dataset.  This model was based on Neural Networks, computational models for machine learning which are loosely inspired by the workings of the human brain.  This model can produce automatic transcripts of the Bentham papers with a Character Error Rate of only 5-10%.

Now it’s time to take things up a notch!  In our first experiments with HTR, we put forward ‘easier’ documents for the computer to process.  These tended to be pages written by Bentham’s secretaries where the layout is clear and the handwriting relatively neat.  Now we want to test how the computer copes with some of the worst examples of Bentham’s writing.  We are producing a new set of training data based on a selection of manuscripts which were written by Bentham himself when the philosopher was in his eighties.  Box xxx of the Bentham Papers in UCL Special Collections contains the Blackstone Familiarized papers.  These were part of Bentham’s lifelong obsession with critiquing the work of William Blackstone, the English jurist who was most famous for his Commentaries on the Laws of England (1765-9).  Bentham first turned against Blackstone as a teenage student when he attended his lectures at the University of Oxford.  In several published works and unpublished papers, Bentham argued that Blackstone was an apologist for the obvious inadequacies in the English legal system and blind to the necessity of reform.

 

Screenshot of page from Blackstone Familiarized in Transkribus. UCL Special Collections, Bentham Papers, Box xxx, fo. 156 [Image: UCL Special Collections]

The Blackstone Familiarized papers have been digitised by UCL Creative Media Services and transcribed by Professor Philip Schofield, the Director of the Bentham Project and General Editor of the Collected Works of Jeremy Bentham.  The images were uploaded to Transkribus and Chris Riley, a PhD student from the Faculty of Laws, has been marking the lines of text on each image and then copying the transcripts into the platform.

We are aiming to produce 200 pages of ‘difficult’ Bentham training data which can be fed into a new version of our latest HTR model.  We are also interested in comparing the accuracy of different models.  How far does this new material enhance the accuracy of the models we already have and would it be worthwhile to have separate models for Bentham himself and his secretaries?

The automated recognition of Bentham’s handwriting would considerably speed up the full transcription of Bentham’s writings and the publication of his Collected Works.  We also want to experiment with HTR technology in a new version of Transcribe Bentham where volunteer transcribers could ask the computer to provide suggested readings of words that they find difficult to decipher.  Until then, we have some more transcribing to do!