
Archive for the 'READ project' Category

Project Update – Improving the Automated Recognition of Bentham’s handwriting

By Louise Seaward, on 28 November 2018

As our volunteer transcribers know, getting to grips with Bentham’s handwriting can be a steep learning curve.  Bentham never wrote particularly neatly and his scrawl became increasingly difficult to comprehend as he grew older.  Since 2013, the Bentham Project has been experimenting with advanced machine learning technology via the Transkribus platform in an attempt to train algorithms to automatically decipher Bentham’s handwriting.  And we have lately seen vastly improved results!

Read about our progress with Handwritten Text Recognition (HTR) technology and the Transkribus platform in blog posts from June 2017, February 2018 and October 2018.

HTR technology is open to anyone around the world thanks to Transkribus and the READ project. Once users have installed the platform, they can set about processing images and transcripts as training data for automated text recognition.  The software uses computational models for machine learning called neural networks.  These networks are trained to recognise a style of writing by being shown images and transcripts of that writing.  Anyone can start a test project in Transkribus by uploading around 75 pages of digitised images to the platform and transcribing each page as fully as possible.  The software learns from everything it is shown and so the more pages of training data, the better!  Find out more about getting started with Transkribus in the Transkribus How to Guides.

When we started working with HTR, it is fair to say that we were somewhat uncertain about the capabilities of the technology.  So we decided to focus on training a model to recognise some of the easier papers in the Bentham collection – those written by Bentham’s secretaries, who tend to have neat handwriting.  Using around 900 pages of images and transcripts, we trained a model that is now publicly available to all Transkribus users under the name ‘English Writing M1’.  This model can produce transcripts of pages from the Bentham collection with a Character Error Rate (CER) of between 5% and 20%.  It produces good transcripts of pages written by Bentham’s secretaries but struggles to decipher Bentham’s own hand.
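For readers curious how a Character Error Rate is actually calculated: it is the character-level edit distance between the automated transcript and a correct reference transcript, divided by the length of the reference.  Here is a minimal sketch in Python – this is the standard definition of the metric, not Transkribus’s internal implementation:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein edit distance / length of the reference text."""
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(reference)

# One wrong character in a ten-character reference gives a 10% CER:
print(character_error_rate("punishment", "panishment"))  # 0.1
```

So a 5% CER means that, on average, one character in twenty needs correcting – which is why such transcripts are quick to clean up by hand.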

So our next challenge was to improve the recognition of Bentham’s most difficult handwriting.  For the past 18 months we have been continually creating training data in Transkribus based on very complex pages from the Bentham collection, periodically retraining HTR models and then assessing the results.  Until recently, our best result was a model trained on 81,000 transcribed words (around 340 pages) which used the ‘English Writing M1’ model as a base model.  By using a base model, Transkribus users can give the system a boost and ensure that it builds directly on what it has already learnt from the creation of an earlier model.  In this case, our resulting model could produce transcripts with an average CER of 17.75%.
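The base-model idea can be illustrated with a deliberately tiny toy: a model initialised from previously learnt weights (the "base model") reaches the target accuracy in far fewer training steps than one started from scratch.  Everything below is hypothetical – a one-parameter stand-in for a neural network, not the Transkribus training code:

```python
def train(w_init: float, lr: float = 0.01, tol: float = 0.01) -> int:
    """Fit w in y = w * x to data generated with true w = 2.0, by
    gradient descent.  Returns the number of steps until |w - 2| < tol."""
    data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
    w, steps = w_init, 0
    while abs(w - 2.0) >= tol:
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
        steps += 1
    return steps

cold = train(w_init=0.0)   # starting from scratch
warm = train(w_init=1.8)   # starting from a "base model" close to the target
assert warm < cold          # the warm start converges in fewer steps
```

The same intuition applies at scale: weights that already encode eighteenth-century English letter shapes are a much better starting point than random ones.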

The great thing about working with Transkribus is that the technology is improving all the time, thanks to the efforts of the computer scientists who work on the READ project.  The latest innovation is HTR+, a new form of Handwritten Text Recognition technology formulated by the CITlab team at the University of Rostock.  HTR+ is based on TensorFlow, a software library developed by Google.  It works similarly to the existing HTR but processes data much faster, meaning that the algorithms can learn more quickly and so produce better results.  We used HTR+ to train a model on 140,000 transcribed words (or 535 pages) of Bentham’s most difficult handwriting.  This model can generate transcripts with a CER of around 9%.

An automated transcript of very difficult handwriting, using our latest HTR+ model. Image courtesy of UCL Special Collections.

HTR+ is not yet available to all Transkribus users – but users can request access by sending an email to the Transkribus team (email@transkribus.eu).

We are getting closer to the reliable recognition of Bentham’s handwriting and this is very exciting!  As a scholarly editing project dedicated to producing Bentham’s Collected Works, we require highly accurate transcripts as a basis for our work.  The experience of other Transkribus users suggests that transcripts which have a CER of around 5% can be corrected rapidly and easily.  So our next priority is to conduct some tests to see how easy Bentham Project researchers find it to correct and edit transcripts generated by this model where the CER is 9%.

We will also continue creating new pages of training data in Transkribus using images and transcripts of Bentham’s most difficult handwriting.  As well as retraining our current model with additional pages of data, we want to create smaller models focused on specific hands and languages in the Bentham collection.  This new training data could also be used to improve our Keyword Spotting tool, which was set up by the PRHLT research center at the Universitat Politècnica de València.

We are also preparing a large-scale experiment with Text2Img matching technology devised by the CITlab team.  This technology allows users to use existing transcripts as training data for HTR, rather than creating transcripts afresh in Transkribus.  We hope that this technology will allow us to create a new model based on several thousand pages from the Bentham collection – watch this space!

And of course, we can’t forget Transcribe Bentham.  We still plan to integrate HTR technology directly into our crowdsourcing platform over the next few years.  The idea is that users will be able to check and correct automated transcripts, or simply transcribe as normal and receive computer-generated suggestions for words that are difficult to decipher.  We believe that new users, who tend to be daunted by the complexity of Bentham’s handwriting, are likely to find these transcription options more attractive.  Experienced users may also appreciate word suggestions to assist their transcription work.

The Bentham Project is at the cutting-edge of this transformational technology and we hope that these advances will ultimately bring us closer to the complete transcription and publication of Bentham’s Collected Works.

My thanks go to Chris Riley, Transcription Assistant at the Bentham Project, for his assistance with the preparation of training data in Transkribus.

Project Update – Searching Bentham’s manuscripts with Keyword Spotting!

By Louise Seaward, on 15 October 2018

The Bentham Project has been experimenting with the Handwritten Text Recognition (HTR) of Bentham’s manuscripts for the past five years, first as a partner in the tranScriptorium project and now as part of READ.

Read about our progress with HTR and the Transkribus platform in blog posts from June 2017 and February 2018.

Keyword Spotting

Our results have thus far been impressive, especially considering the immense difficulty of Bentham’s own handwriting.  But automated transcription is not yet at a point where it is sufficiently accurate to be used by Bentham Project researchers as a basis for scholarly editing.  It would be too time-consuming (and probably too irritating!) for us to correct the errors in the computer-generated transcripts of papers written in Bentham’s hand.

However, the current state of the technology is strong enough for keyword searching!  And thanks to a collaboration with the PRHLT research center at the Universitat Politècnica de València (another partner in the READ project), we have some exciting new results to report.  It is now possible to search over 90,000 digital images of the central collections of Bentham’s manuscripts, which are held at UCL Special Collections and the British Library.

A Keyword Spotting search for the word ‘pleasure’

 

Appeal for volunteers!

I have prepared a Google Sheet with some suggested search terms in five spreadsheet tabs (Bentham’s neologisms, concepts, people, places and other).

It would be fantastic if people filled in the spreadsheet to record some of their searches, using my suggested search terms and some of their own.  Transcribers could search for subjects they are interested in and then cross-reference to material on the Transcription Desk that they might like to transcribe.

Who knows what we might find??  I hope to share some of these results in my upcoming presentation at the Transkribus User Conference in November 2018.  Thanks in advance for your participation.

Background

The PRHLT team have processed the Bentham papers with cutting-edge HTR and probabilistic word indexing technologies.  This sophisticated form of searching is often called Keyword Spotting.  It is more powerful than a conventional full-text search because it uses statistical models trained for text recognition to search through probability values assigned to character sequences (words), considering multiple possible readings of each word on a page.
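As a rough illustration of how such an index differs from a full-text search, here is a toy sketch in Python.  The data structure, page references and scores are entirely made up – the PRHLT system is far more sophisticated:

```python
# Toy probabilistic word index: one entry per detected word region on a
# page, each holding several candidate readings with confidence scores.
index = [
    {"page": "Box 30, fo. 12", "candidates": {"pleasure": 0.91, "pressure": 0.06}},
    {"page": "Box 30, fo. 47", "candidates": {"pressure": 0.55, "pleasure": 0.40}},
    {"page": "Box 31, fo. 78", "candidates": {"pleasing": 0.80, "pleasure": 0.05}},
]

def spot(query: str, threshold: float = 0.3):
    """Return (page, confidence) pairs for every word region where the
    query's probability clears the threshold, highest confidence first."""
    hits = [(e["page"], e["candidates"].get(query, 0.0)) for e in index]
    hits = [h for h in hits if h[1] >= threshold]
    return sorted(hits, key=lambda h: -h[1])

print(spot("pleasure"))
# The fo. 47 region is returned even though its single best reading is
# "pressure" - a plain full-text search of the top transcript would miss it.
```

This is why Keyword Spotting can surface words that the computer mistranscribed: the correct reading is still in the index, just with a lower probability.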

We delivered thousands of images and transcripts to the team in Valencia and gave them access to the data we had already used to train HTR models in Transkribus.  After cleaning our data and using Transkribus technology to divide the images into lines, the team in Valencia trained neural network algorithms to recognise and index the collection.

The result is that this vast collection of Bentham’s papers can be efficiently searched, including those papers that have not yet been transcribed!  The accuracy rates are impressive.  Spot checks suggest around 84-94% accuracy (a 6-16% Character Error Rate) when compared with manual transcriptions of Bentham’s manuscripts.  More precisely, laboratory tests show that average word search precision ranges from 79% to 94%.  This means that, out of 100 search results, between 6 and 21 may not actually be the word searched for.  The accuracy of spotted words depends on the difficulty of Bentham’s handwriting – although it is possible to find useful results even in Bentham’s scrawl!  There could be as many as 25 million words waiting to be found.
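The relationship between search precision and false hits in the figures above is simple arithmetic, sketched here for concreteness (the function name is ours, not PRHLT’s):

```python
def false_hits(precision: float, n_results: int = 100) -> int:
    """Expected number of incorrect matches among n_results,
    given an average search precision."""
    return round((1.0 - precision) * n_results)

# The reported precision range of 79-94% implies, per 100 results:
print(false_hits(0.94))  # 6 false hits at best
print(false_hits(0.79))  # 21 false hits at worst
```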

Use cases

This fantastic site will be invaluable to anyone interested in Bentham’s philosophy.  It will help Bentham Project researchers to find previously unknown references in pages that have not yet been transcribed.  It will allow researchers to quickly investigate Bentham’s concepts and correspondents.  I hope that it will also help volunteer transcribers to find interesting material.

This interface is a prototype beta version.  In the future we want to increase the power of this research tool by connecting it to other digital resources, allowing users to quickly search the manuscripts at the UCL library repository, the Bentham papers database and the Transcription Desk and linking these images to our rich existing metadata.

Similar Keyword Spotting technology (based on research by the CITlab team at the University of Rostock, another one of the READ project partners) is currently available to all users of the Transkribus platform.  Find out more at the READ project website.

I welcome any feedback on our new search functionality at: transcribe.bentham@ucl.ac.uk

My thanks go to the PRHLT research center, the University of Innsbruck and Chris Riley, Transcription Assistant at the Bentham Project for their support and assistance.

Project Update – Bentham vs the computer

By Louise Seaward, on 23 February 2018

Throughout its long history, the Bentham Project has always been interested in the way in which technological advances could be integrated into its work on the scholarly edition of Bentham’s Collected Works.  Transcribe Bentham is currently a proud partner in an international collaboration focused on using innovative computer science techniques to process historical manuscripts.  The mission of the READ (Recognition and Enrichment of Archival Documents) project is to make archival collections more accessible through the development and dissemination of Handwritten Text Recognition (HTR) technology.

This technology is freely available through the Transkribus platform.  Using algorithms of machine learning, it is possible to teach a computer to read a particular kind of handwriting – even Bentham’s!  The technology is trained by being shown images of documents and their accurate transcriptions.  Thanks to the hard work of the Transcribe Bentham volunteers, we are lucky to have a sizeable collection of transcripts that can be used as training data for automated text recognition.

We last summarised our experiments with HTR in a blog post from June 2017.   At that point, we had used technology from the Computational Intelligence Technology Lab (CITlab) at the University of Rostock to produce a model capable of processing the easier papers from the Bentham collection, largely those written by Bentham’s secretaries.  This model can automatically produce transcripts with a Character Error Rate of between 5 and 10%, meaning that 90-95% of characters in the transcript are correct.  The Bentham model is now publicly available in Transkribus under the title ‘English Writing M1’ and has been applied to other collections of eighteenth- and nineteenth-century English handwriting with some success.

Screenshot from Transkribus with automatically generated transcript. Box Add 3350, fo. 158, The British Library (Click to enlarge image)

Although this model copes well with documents where the handwriting and layout are relatively clear, it struggles to recognise the more difficult examples of writing from Bentham’s own hand.  So we decided to take on the challenge of teaching a computer to read some of the very worst examples of Bentham’s handwriting!

We used the Transkribus platform to create training data based on Boxes 30 and 31 of the Bentham Papers held in UCL Special Collections.  These manuscripts were part of Bentham’s lifelong obsession with critiquing the work of William Blackstone, the English jurist who was most famous for his Commentaries on the Laws of England (1765-9).  To create the training data, we uploaded around 200 digital images to Transkribus, segmented each image into lines and then copied over existing transcripts to match each image.

The resulting model generates transcripts with an average Character Error Rate of 26%.  This error rate is unfortunately too high to automatically produce transcripts suitable for scholarly editing.  Nevertheless, it does have the potential to facilitate the full-text search of the Bentham Papers.  Transkribus now includes sophisticated Keyword Spotting technology, which is capable of finding words and phrases in documents, even if they have been mistranscribed by the computer.

Screenshot from Transkribus with automatically generated transcript.  Box 31, fol. 78, UCL Bentham Papers, Special Collections, University College London (Click to enlarge image)

We are working with the Pattern Recognition and Human Language Technology (PRHLT) research centre at the Universitat Politècnica de València and the Digitisation and Digital Preservation group at the University of Innsbruck to present an open-access search functionality for the Bentham Papers.  We are also hoping that volunteers could get involved in this endeavour by checking and correcting the results of significant search queries to ensure their accuracy.

Improving the recognition of Bentham’s handwriting is our other aim and, to this end, we will be producing more pages of training data in Transkribus.  The technology moves fast, and this process has already been streamlined thanks to technology from CITlab.  Transkribus can now easily find lines in images (even in documents with complex layouts) and it is also possible to use existing transcripts to automatically train a model, rather than copying them into Transkribus line by line.

If you are interested in following in our footsteps, you are welcome to give Transkribus a try!  You can find more information on the READ website and in the Transkribus How to Guides.

I would like to thank Chris Riley, PhD student and transcription assistant at the Bentham Project, for helping to produce the training data for the latest Bentham model.

Project Update – Report from the British Academy soirée

By Louise Seaward, on 23 June 2017

A guest post by Dr Tim Causer who represented Transcribe Bentham and the Bentham Project at the latest British Academy soirée

Professor Philip Schofield and Dr Tim Causer represented the Bentham Project at the British Academy soirée on 20 June. Over 500 people attended the event and heard talks from a number of British Academy Fellows, and visited stands featuring the work of British Academy Research Projects, of which the Bentham Project is one.

Professor Schofield and Dr Causer, stationed in the Council Room beside Henry Pickersgill’s 1829 portrait of Bentham, discussed with visitors the work of the Project, the production of The Collected Works of Jeremy Bentham, recent open-access publications from UCL Press, and the ongoing and exciting work of the European Commission-funded READ project. Of particular interest to visitors was the Transkribus platform and its Handwritten Text Recognition tools, and the prototype ‘ScanTent’ which, when used in conjunction with the free and forthcoming DocScan app, allows users to efficiently capture images of archival and printed material.

Professor Philip Schofield and Dr Tim Causer at the British Academy


A good time was had by all, particularly under the beneficent eye of Mr Bentham himself!

Project Update – teaching a computer to READ Bentham

By Louise Seaward, on 9 June 2017

The difficulty of Bentham’s handwriting is notorious.  At the Bentham Project, we have years of experience of transcribing Bentham but you will still regularly find us hunched over a manuscript with a magnifying glass or blankly staring at a digital image on a computer screen, zooming in and out on a particular word.

One of the Bentham Project's favourite tools


 

Across the last few years, we have been working closely with various teams of computer scientists in the hope of making progress on the automated recognition of Bentham’s writing.  This collaboration started under the tranScriptorium project in 2013 and now continues in its successor project READ (Recognition and Enrichment of Archival Documents).

READ’s mission is to make archival collections more accessible through the development and dissemination of Handwritten Text Recognition (HTR) technology.  This technology is freely available through the Transkribus platform.  Using algorithms of machine learning, it is possible to teach a computer to read a particular style of writing.  The technology is trained by being shown images of documents and their accurate transcriptions.  Anyone can start a test project with around 20,000 words or around 100 pages.

Under the tranScriptorium project, we initially had some success in training a model to process manuscripts from the Bentham collection.  Using around 900 pages of Bentham images and transcripts, researchers from the Pattern Recognition and Human Language Technology (PRHLT) research centre at the Universitat Politècnica de València created an HTR model for us using statistical algorithms called Hidden Markov Models.  This model was able to produce relatively accurate transcriptions of the Bentham papers, with a Character Error Rate of around 18% (meaning that around 82% of the characters in a transcript would be correct).

In the first stage of the READ project, we have already been able to enhance the accuracy of the HTR technology.  The team at the Computational Intelligence Technology Lab (CITlab) at the University of Rostock created a new model using this same dataset.  This model was based on Neural Networks, computational models for machine learning which work similarly to the human brain.  This model can produce automatic transcripts of the Bentham papers with a Character Error Rate of only 5-10%.

Now it’s time to take things up a notch!  In our first experiments with HTR, we put forward ‘easier’ documents for the computer to process.  These tended to be pages written by Bentham’s secretaries where the layout is clear and the handwriting relatively neat.  Now we want to test how the computer copes with some of the worst examples of Bentham’s writing.  We are producing a new set of training data based on a selection of manuscripts which were written by Bentham himself when the philosopher was in his eighties.  Box xxx of the Bentham Papers in UCL Special Collections contains the Blackstone Familiarized papers.  These were part of Bentham’s lifelong obsession with critiquing the work of William Blackstone, the English jurist who was most famous for his Commentaries on the Laws of England (1765-9).  Bentham first turned against Blackstone as a teenage student when he attended his lectures at the University of Oxford.  In several published works and unpublished papers, Bentham argued that Blackstone was an apologist for the obvious inadequacies in the English legal system and blind to the necessity of reform.

 

Screenshot of page from Blackstone Familiarized in Transkribus. UCL Special Collections, Bentham Papers, Box xxx, fo. 156 [Image: UCL Special Collections]


The Blackstone Familiarized papers have been digitised by UCL Creative Media Services and transcribed by Professor Philip Schofield, the Director of the Bentham Project and General Editor of the Collected Works of Jeremy Bentham.  The images were uploaded to Transkribus and Chris Riley, a PhD student from the Faculty of Laws, has been marking the lines of text on each image and then copying the transcripts into the platform.

We are aiming to produce 200 pages of ‘difficult’ Bentham training data which can be fed into a new version of our latest HTR model.  We are also interested in comparing the accuracy of different models.  How far does this new material enhance the accuracy of the models we already have and would it be worthwhile to have separate models for Bentham himself and his secretaries?

Reliable automated recognition of Bentham’s handwriting would considerably speed up the full transcription of Bentham’s writings and the publication of his Collected Works.  We also want to experiment with HTR technology in a new version of Transcribe Bentham, where volunteer transcribers could ask the computer to provide suggested readings of words that are difficult to decipher.  Until then, we have some more transcribing to do!