Project Update – teaching a computer to READ Bentham
By uczwlse, on 9 June 2017
The difficulty of Bentham’s handwriting is notorious. At the Bentham Project, we have years of experience in transcribing Bentham, but you will still regularly find us hunched over a manuscript with a magnifying glass or staring blankly at a digital image on a computer screen, zooming in and out on a particular word.
Over the last few years, we have been working closely with various teams of computer scientists in the hope of making progress on the automated recognition of Bentham’s writing. This collaboration started under the tranScriptorium project in 2013 and now continues in its successor project, READ (Recognition and Enrichment of Archival Documents).
READ’s mission is to make archival collections more accessible through the development and dissemination of Handwritten Text Recognition (HTR) technology. This technology is freely available through the Transkribus platform. Using machine learning algorithms, it is possible to teach a computer to read a particular style of writing. The technology is trained by being shown images of documents alongside their accurate transcriptions. Anyone can start a test project with around 20,000 words, or around 100 pages.
Under the tranScriptorium project, we initially had some success in training a model to process manuscripts from the Bentham collection. Using around 900 pages of Bentham images and transcripts, researchers from the Pattern Recognition and Human Language Technology (PRHLT) research centre at the Universitat Politècnica de València created an HTR model for us using statistical algorithms called Hidden Markov Models. This model was able to produce relatively accurate transcriptions of the Bentham papers, with a Character Error Rate of around 18% (meaning that around 82% of the characters in a transcript would be correct).
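For readers curious about how a Character Error Rate might be calculated, here is a minimal Python sketch. It assumes the commonly used definition: the number of single-character insertions, deletions and substitutions needed to turn the computer's transcript into the correct one, divided by the length of the correct transcript. This post does not describe exactly how the figure is computed in Transkribus, so the function names and the sample strings below are purely illustrative.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Count the fewest single-character edits (insert, delete,
    substitute) needed to turn `reference` into `hypothesis`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete from reference
                curr[j - 1] + 1,                  # insert into reference
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the correct transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example strings, not taken from the Bentham papers.
    correct = "the greatest happiness"
    automatic = "the greatcst hapiness"
    print(f"CER: {character_error_rate(correct, automatic):.1%}")
```

Because insertions also count as errors under this definition, "100% minus the Character Error Rate" is a convenient approximation of the share of correct characters rather than an exact figure.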
In the first stage of the READ project, we have already been able to enhance the accuracy of the HTR technology. The team at the Computational Intelligence Technology Lab (CITlab) at the University of Rostock created a new model using this same dataset. This model was based on Neural Networks, computational models for machine learning which are loosely inspired by the workings of the human brain. This model can produce automatic transcripts of the Bentham papers with a Character Error Rate of only 5-10%.
Now it’s time to take things up a notch! In our first experiments with HTR, we put forward ‘easier’ documents for the computer to process. These tended to be pages written by Bentham’s secretaries where the layout is clear and the handwriting relatively neat. Now we want to test how the computer copes with some of the worst examples of Bentham’s writing.
We are producing a new set of training data based on a selection of manuscripts which were written by Bentham himself when the philosopher was in his eighties. Box xxx of the Bentham Papers in UCL Special Collections contains the Blackstone Familiarized papers. These were part of Bentham’s lifelong obsession with critiquing the work of William Blackstone, the English jurist who was most famous for his Commentaries on the Laws of England (1765-9). Bentham first turned against Blackstone as a teenage student when he attended his lectures at the University of Oxford. In several published works and unpublished papers, Bentham argued that Blackstone was an apologist for the obvious inadequacies in the English legal system and blind to the necessity of reform.
The Blackstone Familiarized papers have been digitised by UCL Creative Media Services and transcribed by Professor Philip Schofield, the Director of the Bentham Project and General Editor of the Collected Works of Jeremy Bentham. The images were uploaded to Transkribus and Chris Riley, a PhD student from the Faculty of Laws, has been marking the lines of text on each image and then copying the transcripts into the platform.
We are aiming to produce 200 pages of ‘difficult’ Bentham training data, which can be fed into a new version of our latest HTR model. We are also interested in comparing the accuracy of different models. How far does this new material enhance the accuracy of the models we already have, and would it be worthwhile to have separate models for Bentham himself and his secretaries?
The automated recognition of Bentham’s handwriting would considerably speed up the full transcription of Bentham’s writings and the publication of his Collected Works. We also want to experiment with HTR technology in a new version of Transcribe Bentham where volunteer transcribers could ask the computer to provide suggested readings of words that they find difficult to decipher. Until then, we have some more transcribing to do!