Transcribe Bentham

A Participatory Initiative

Archive for the 'Transcription' Category

Transcription Update – 8 December 2018 to 4 January 2019

By uczwlse, on 4 January 2019

HAPPY NEW YEAR! The first statistics update of 2019 is looking good. Volunteers have been participating in our Transcription Challenge. They’re working together to transcribe a list of targeted pages and the productivity is impressive. Thank you to everyone who has transcribed something lately.

Here are the full statistics for the initiative – as of 4 January 2019.

21,307 manuscript pages have now been transcribed or partially-transcribed. Of these transcripts, 20,483 (96%) have been checked and approved by TB staff.

Over the past four weeks, volunteers have worked on a total of 102 manuscript pages. This means that an average of 26 pages have been transcribed each week during the past month.
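For readers who want to check the arithmetic behind these figures, here is a minimal sketch. The function names are hypothetical and chosen for illustration; the post does not say how the project computes or rounds its statistics, so this only reproduces the numbers quoted above.

```python
def approved_share(approved: int, transcribed: int) -> float:
    """Fraction of transcribed pages that have been checked and approved."""
    return approved / transcribed


def weekly_average(pages: int, weeks: int = 4) -> float:
    """Average number of pages worked on per week over the reporting period."""
    return pages / weeks


# Figures quoted in the update for 4 January 2019.
share = approved_share(20_483, 21_307)
average = weekly_average(102)

print(f"{share:.0%}")  # 96%
print(average)         # 25.5, which the update reports rounded up to 26
```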

Check out the Benthamometer for more information on how much has been transcribed from each box of Bentham’s papers!

Project Update – next stage of our transcription challenge

By uczwlse, on 14 December 2018

Since the summer, our volunteers have been working on something new.  Back in July I launched Transcribe Bentham’s first Transcription Challenge, where I asked volunteers to collaborate on the transcription of certain boxes of Bentham’s writings.  The idea was that with the combined effort of our transcribers, we could completely transcribe several boxes (Boxes 14, 50, 70, 72, 95, 537 and 538) that were close to completion.

Our volunteers responded enthusiastically and have been steadily transcribing the targeted material ever since.  They have now transcribed over 80% of the targeted pages. Their efforts deserve to be commended as much of this material is particularly nasty – difficult handwriting, complex layouts and foreign languages.  So a big thank you goes to everyone who has taken part so far!

After spending a bit of time checking through the content of these boxes, I have found a few extra pages lurking under the radar that still need to be transcribed in full.  Some eagle-eyed transcribers have already spotted and transcribed some of these outstanding pages – thank you!  The pages that remain are listed below, and I would be grateful to anyone who can take on the challenge of transcribing them.

But I also want to invite volunteers to start work on a new challenge! This challenge is focused on fresh material, much of which is a little easier than that in Boxes 14, 50, 70, 72, 95, 537 and 538. After consultation with my colleagues at the Bentham Project, I have selected a few boxes of material that are likely to be of use to us as we work on the edition of Bentham’s Collected Works over the next few years.

These boxes are:

Volunteers can find lots of pages to transcribe from these boxes by consulting the Untranscribed Manuscripts list. Most of these boxes contain material written by both Bentham and his secretaries.  There is also some French language writing in Box 57 (pages 20-35).

The productivity of our volunteers always impresses me and it is fantastic to see what can be achieved when energies are channelled in a particular direction.  Thank you all!

If you have any questions or comments about the challenge, please let me know by email (transcribe.bentham@ucl.ac.uk).

Material still to transcribe from Boxes 14, 50, 72, 95 and 537:

Box 14

Page Number     Content     Difficulty of handwriting  Foreign language?
JB/014/309/017  Deontology  Difficult
JB/014/309/019  Deontology  Difficult
JB/014/309/020  Deontology  Difficult
JB/014/447/001  Deontology  Difficult

Box 50

Page Number     Content                       Difficulty of handwriting  Foreign language?
JB/050/090/001  Legal procedure               Difficult
JB/050/173/001  Legal procedure               Difficult                  French
JB/050/174/001  Legal procedure (table form)  Difficult                  French

Box 72

Page Number     Content                  Difficulty of handwriting  Foreign language?
JB/072/183/002  Penal code               Difficult                  French
JB/072/183/003  Penal code               Difficult                  French
JB/072/183/004  Penal code               Difficult                  French
JB/072/184/001  Penal code               Difficult                  French
JB/072/186/001  Penal code               Difficult                  French
JB/072/215/001  Penal code (table form)  Difficult                  French
JB/072/216/001  Penal code               Difficult                  French
JB/072/216/002  Penal code               Difficult                  French
JB/072/216/003  Penal code               Difficult                  French
JB/072/216/004  Penal code               Difficult                  French
JB/072/217/001  Penal code               Difficult                  French
JB/072/219/001  Penal code               Moderate                   French
JB/072/219/002  Penal code               Moderate                   French
JB/072/220/001  Penal code               Moderate                   French
JB/072/220/002  Penal code               Moderate                   French
JB/072/220/003  Penal code               Moderate                   French
JB/072/220/004  Penal code               Moderate                   French
JB/072/221/001  Penal code               Moderate                   French
JB/072/221/002  Penal code               Moderate                   French
JB/072/221/003  Penal code               Moderate                   French
JB/072/221/004  Penal code               Moderate                   French
JB/072/222/001  Penal code               Moderate                   French

Box 95

Page Number     Content          Difficulty of handwriting  Foreign language?
JB/095/001/001  Turnpike Act     Difficult
JB/095/003/001  Turnpike Act     Difficult
JB/095/063/001  Turnpike Act     Difficult
JB/095/076/001  Sanctions        Difficult
JB/095/107/001  Legal procedure  Difficult
JB/095/109/002  Turnpike Act     Difficult
JB/095/111/002  Turnpike Act     Difficult

Box 537

Page Number     Content                   Difficulty of handwriting  Foreign language?
JB/537/363/001  Jeremy to Samuel Bentham  Difficult                  French
JB/537/364/001  Jeremy to Samuel Bentham  Difficult                  French
JB/537/365/001  Jeremy to Samuel Bentham  Difficult                  French
JB/537/366/002  Jeremy to Samuel Bentham  Difficult                  French

Transcription Update – 10 November to 7 December 2018

By uczwlse, on 10 December 2018

Howdy.  We’re here with the last statistics update of 2018.  We need to say a huge THANK YOU to our volunteers for all the hard work they have put in over the past 12 months.  We would be nothing without you!

Here are the full statistics for the initiative – as of 7 December 2018.

21,205 manuscript pages have now been transcribed or partially-transcribed. Of these transcripts, 20,378 (96%) have been checked and approved by TB staff.

Over the past four weeks, volunteers have worked on a total of 133 manuscript pages. This means that an average of 33 pages have been transcribed each week during the past month.

Check out the Benthamometer for more information on how much has been transcribed from each box of Bentham’s papers!

Project Update – Improving the Automated Recognition of Bentham’s handwriting

By uczwlse, on 28 November 2018

As our volunteer transcribers know, getting to grips with Bentham’s handwriting can be a steep learning curve.  Bentham never wrote particularly neatly and his scrawl became increasingly difficult to comprehend as he grew older.  Since 2013, the Bentham Project has been experimenting with advanced machine learning technology via the Transkribus platform in an attempt to train algorithms to automatically decipher Bentham’s handwriting.  And we have lately seen vastly improved results!

Read about our progress with Handwritten Text Recognition (HTR) technology and the Transkribus platform in blog posts from June 2017, February 2018 and October 2018.

HTR technology is open to anyone around the world thanks to Transkribus and the READ project. Once users have installed the platform, they can set about processing images and transcripts as training data for automated text recognition.  The software uses computational models for machine learning called neural networks.  These networks are trained to recognise a style of writing by being shown images and transcripts of that writing.  Anyone can start a test project in Transkribus by uploading around 75 pages of digitised images to the platform and transcribing each page as fully as possible.  The software learns from everything it is shown and so the more pages of training data, the better!  Find out more about getting started with Transkribus in the Transkribus How to Guides.

When we started working with HTR, it is fair to say that we were somewhat uncertain about the capabilities of the technology.  So we decided to focus on training a model to recognise some of the easier papers in the Bentham collection – those written by Bentham’s secretaries, who tended to have neat handwriting.  Using around 900 pages of images and transcripts, we trained a model that is now publicly available to all Transkribus users under the name ‘English Writing M1’.  This model can produce transcripts of pages from the Bentham collection with a Character Error Rate (CER) of between 5% and 20%.  It produces good transcripts of pages written by Bentham’s secretaries but struggles to decipher Bentham’s own hand.
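To make the Character Error Rate figure more concrete, here is a minimal sketch of one common way to compute it: the edit (Levenshtein) distance between a reference transcript and the automated transcript, divided by the length of the reference. This is an assumption about the definition rather than a description of how Transkribus calculates CER internally, and the function names and sample strings are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example: a human transcript and an imperfect automated one.
human = "the greatest happiness"
automated = "the greatesl hapiness"
print(f"{character_error_rate(human, automated):.1%}")  # 9.1%
```

In this invented example, two character errors in a 22-character line give a CER of roughly 9%, so a page-level CER of 5% corresponds to about one wrong character in every twenty.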

So our next challenge was to improve the recognition of Bentham’s most difficult handwriting.  For the past 18 months we have been continually creating training data in Transkribus based on very complex pages from the Bentham collection, periodically retraining HTR models and then assessing the results.  Until recently, our best result was a model trained on 81,000 transcribed words (around 340 pages) which used the ‘English Writing M1’ model as a base model.  By using a base model, Transkribus users can give the system a boost and ensure that it builds directly on what it has already learnt from the creation of an earlier model.  In this case, our resulting model could produce transcripts with an average CER of 17.75%.

The great thing about working with Transkribus is that the technology is improving all the time, thanks to the efforts of the computer scientists who work on the READ project.  The latest innovation is HTR+, a new form of Handwritten Text Recognition technology formulated by the CITlab team at the University of Rostock.  HTR+ is based on TensorFlow, a software library developed by Google.  It works similarly to the existing HTR but processes data much faster, meaning that the algorithms can learn more quickly and so produce better results.  We used HTR+ to train a model on 140,000 transcribed words (or 535 pages) of Bentham’s most difficult handwriting.  This model can generate transcripts with a CER of around 9%.

An automated transcript of very difficult handwriting, using our latest HTR+ model. Image courtesy of UCL Special Collections.

HTR+ is not yet available to all Transkribus users, but users can request access by sending an email to the Transkribus team (email@transkribus.eu).

We are getting closer to the reliable recognition of Bentham’s handwriting and this is very exciting!  As a scholarly editing project dedicated to producing Bentham’s Collected Works, we require highly accurate transcripts as a basis for our work.  The experience of other Transkribus users suggests that transcripts which have a CER of around 5% can be corrected rapidly and easily.  So our next priority is to conduct some tests to see how easy Bentham Project researchers find it to correct and edit transcripts generated by this model where the CER is 9%.

We will also continue creating new pages of training data in Transkribus using images and transcripts of Bentham’s most difficult handwriting.  As well as retraining our current model with additional pages of data, we want to create smaller models focused on specific hands and languages in the Bentham collection.  This new training data could also be used to improve our Keyword Spotting tool, which was set up by the PRHLT research center at the Universitat Politècnica de València.

We are also preparing a large-scale experiment with Text2Img matching technology devised by the CITlab team.  This technology allows users to use existing transcripts as training data for HTR, rather than creating transcripts afresh in Transkribus.  We hope that this technology will allow us to create a new model based on several thousand pages from the Bentham collection – watch this space!

And of course, we can’t forget Transcribe Bentham.  We still plan to be able to integrate HTR technology directly into our crowdsourcing platform over the next few years.  The idea is that users will be able to check and correct automated transcripts or simply transcribe as normal and receive computer-generated suggestions of words that are difficult to decipher.  We believe that new users, who tend to be daunted by the complexity of Bentham’s handwriting, are likely to find these transcription options more attractive.  Experienced users may also appreciate word suggestions to assist their transcription work.

The Bentham Project is at the cutting edge of this transformational technology and we hope that these advances will ultimately bring us closer to the complete transcription and publication of Bentham’s Collected Works.

My thanks go to Chris Riley, Transcription Assistant at the Bentham Project, for his assistance with the preparation of training data in Transkribus.

Transcription Update – 13 October to 9 November 2018

By uczwlse, on 12 November 2018

It’s time for another update on the number of pages transcribed as part of our crowdsourcing initiative.  Thanks to the amazing efforts of our volunteer transcribers we can now report the following stats…

Here are the full statistics for the initiative – as of 9 November 2018.

21,072 manuscript pages have now been transcribed or partially-transcribed. Of these transcripts, 20,190 (95%) have been checked and approved by TB staff.

Over the past four weeks, volunteers have worked on a total of 138 manuscript pages. This means that an average of 35 pages have been transcribed each week during the past month.

Check out the Benthamometer for more information on how much has been transcribed from each box of Bentham’s papers!