
Transcribe Bentham


A Participatory Initiative


Archive for the 'READ project' Category

UCL–University of Toronto: Transkribus, HTR, and medieval Latin abbreviations

Chris Riley · 20 April 2021

Following the successful receipt of funding under the University College London–University of Toronto Call for Joint Research Projects and Exchange Activities in 2019, a team composed of Professor Philip Schofield (UCL), Professor Michael Gervers (Toronto), Dr Chris Riley (UCL), Hannah Lloyd (Toronto), and Dr Ariella Elema (Toronto) sought to address two questions surrounding Transkribus, Handwritten Text Recognition (HTR), and abbreviated words within medieval Latin manuscripts:

1.) Could Transkribus be trained consistently to process abbreviated Latin words, which can represent up to half the vocabulary of medieval legal texts, and hence feature in a substantial proportion of the Documents of Early England Data Set (DEEDS) corpus at the University of Toronto?

2.) Could Transkribus be made consistently to recognise hyphenated words which span multiple lines of text (insofar as they are both in Latin and abbreviated)?

Prior to the commencement of the project, attempts to generate HTR models within Transkribus capable of processing material from the DEEDS corpus were unsuccessful, yielding very high Character Error Rates (CERs) and Word Error Rates (WERs): percentages which reflect the proportion of individual characters and individual words, respectively, that the software misreads on untranscribed pages. As shown in Figure 1 below, the WERs for these models, which were generated prior to the award of funding, reached as high as 94.2% (54% on average), and the CERs reached as high as 49.6% (25.91% on average), making them far from ideal for research purposes and certainly warranting further efforts to bring them down significantly.
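For readers unfamiliar with these metrics, both are conventionally computed as a Levenshtein (edit) distance between the HTR output and the groundtruth, normalised by the length of the groundtruth, at character or word level respectively. A minimal sketch of the standard definition (not the code Transkribus itself uses):

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn hyp into ref (dynamic programming, two rows)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def error_rate(groundtruth, htr_output, unit="char"):
    """CER (unit='char') or WER (unit='word') as a percentage."""
    ref = list(groundtruth) if unit == "char" else groundtruth.split()
    hyp = list(htr_output) if unit == "char" else htr_output.split()
    return 100 * levenshtein(ref, hyp) / len(ref)
```

On this definition, a WER of 94.2% means that nearly every word on the page would need at least one correction.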

Our goal was therefore to bring down these WERs and CERs, preferably to below 10%, in order to make the use of HTR a viable alternative to the often costly use of standard human transcription by an individual or group of individuals with the relevant Latin and palaeographical skills. The initial dataset which was used to this end was composed of 272 pages with their corresponding transcripts from the DEEDS database, representing 10,000 lines of text in total.

The first step was to go through the existing subcollections from the DEEDS corpus on Transkribus in order to improve all of the text regions and baselines which the software had added to the manuscripts. This was to ensure that all text was included in the text regions and that every baseline covered each word on its line, as errors here may have contributed significantly to the high WERs and CERs shown in Figure 1. Once this was completed, the transcripts were checked against the images in order to ensure that they were as faithful as possible to the manuscripts before proceeding with the creation of improved HTR models.

On 6–7 February 2020, prior to the creation of several new HTR models based upon this improved groundtruth (that is, an accurate transcription and representation of a set of manuscripts), Riley and Lloyd attended the Transkribus User Conference in Innsbruck, Austria, where they delivered a presentation to around 100–150 attendees on the project and the methodology that would be used. This experience, and the advice they received from other attendees, was invaluable in informing their research during the ensuing phases of the project, not least with regard to how they would approach the issue of processing such a large number of Latin abbreviations.

After the Transkribus User Conference, Lloyd visited the Bentham Project at UCL for a period of two weeks. This time was spent refining their existing groundtruth, creating and testing three new HTR models (see below, including Figure 2), building a large dictionary of abbreviations, and testing the first instalment of the ‘abbrevSolver-master’ Python script, developed by Ismail Prada, an independent programmer from Zurich, Switzerland.

After receiving valuable advice regarding how best to address our principal aim of processing Latin abbreviations and after making contact with Prada and experimenting with his script, we created the abbreviation dictionary, consisting of over a hundred abbreviated Latin words, both in their expanded and contracted forms, in which the latter were represented by compatible special characters that best reflected how they appeared in type. These abbreviations were also categorised as prefixes, suffixes, or standalone abbreviations, which would alter how they would be processed by the algorithm.
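The dictionary itself was a simple tab-separated file. The column names and entries below are hypothetical reconstructions for illustration (the real dictionary used special characters matched to the manuscript forms), but they show the three-category structure described above:

```python
import csv
import io

# Hypothetical sample of the tab-separated dictionary format;
# the real file contained over a hundred entries.
SAMPLE_TSV = """expanded\tabbreviated\tcategory
quod\tqd\tstandalone
per\tp\tprefix
rum\tr\tsuffix
"""

def load_abbrev_dict(tsv_text):
    """Parse the dictionary into {expanded: (abbreviated, category)}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["expanded"]: (row["abbreviated"], row["category"])
            for row in reader}
```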

As the script was new and untested, this process, and the creation of an abbreviation dictionary and special character set compatible with it, proved especially problematic: multiple versions of the tab-separated Excel file containing the abbreviated words, and several varieties of special characters, had to be created in the attempt to get the script to function as intended.

After a frustrating amount of trial and error, it was decided to proceed with the finding and replacing of the abbreviated words without the use of the script. Around a third to a half of the shortened word forms in the abbreviation dictionary were replaced manually, each word being replaced wherever it appeared in each folio and then tagged accordingly. This process was extremely time-consuming, mainly owing to the search function within Transkribus being somewhat awkward when navigating large pages of results.
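The effect of the prefix/suffix/standalone categories on matching can be sketched as follows. This is an illustrative reconstruction of the find-and-replace logic, not Prada's actual script, and the dictionary entries used in the example are hypothetical:

```python
import re

def apply_abbreviations(text, abbrev_dict):
    """Replace expanded Latin forms with their abbreviated equivalents.
    The category controls anchoring: 'standalone' matches whole words,
    'prefix' only at the start of a longer word, 'suffix' only at the end."""
    for expanded, (abbrev, category) in abbrev_dict.items():
        word = re.escape(expanded)
        if category == "standalone":
            pattern = rf"\b{word}\b"
        elif category == "prefix":
            pattern = rf"\b{word}(?=\w)"
        else:  # suffix
            pattern = rf"(?<=\w){word}\b"
        text = re.sub(pattern, abbrev, text)
    return text
```

With the sample entries above, `apply_abbreviations("quod pertinet", ...)` abbreviates the standalone "quod" and the "per" at the head of "pertinet", while a standalone "per" is left untouched by the prefix rule.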

The original script, with which we had a substantial number of problems (attributable to our use of Windows 10 rather than Linux, and Excel rather than Microsoft Visual Studio Code, our choice of certain special characters which turned out to be incompatible, and a period when we were unable to elicit any support from the developer), was also far from perfect in how it would have operated had it worked correctly. One had to run it for each and every .xml file saved through Transkribus and copy and paste the data back and forth for each manuscript image. Given that manually finding and replacing abbreviated words through Transkribus and tagging them accordingly was prohibitively time intensive, it was enormously helpful that, after resuming contact with Prada, he not only fixed the first version of the script and provided feedback on the form of our abbreviation dictionary, but developed a superior API equivalent that permitted processing the transcripts in bulk.

With the newer API script, everything is done by connecting directly to Transkribus, after giving it the collection editor’s username and password and the collection ID (with the format of the abbreviation dictionary remaining the same as in the earlier iteration). As well as being quicker, the new script is simpler to use. After running a basic command, the script communicates with Transkribus and uses its find-and-replace algorithm on each subcollection, replacing each term it finds from the abbreviation dictionary with its shorter equivalent and tagging the result as abbreviated. Once the API version of the abbrevSolver script had been run, over one hundred abbreviations from our dictionary were incorporated into the transcripts for the existing 272 images almost immediately, before an impressive new model was created, our fifth in total.
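In outline, the API version reduces the whole job to a single loop over the collection. The sketch below abstracts the Transkribus REST calls behind stand-in callables, since we are not reproducing the script's actual endpoints; only the control flow is shown:

```python
def process_collection(list_pages, fetch_text, save_text, replace):
    """Bulk find-and-replace across a collection.

    list_pages() -> iterable of page IDs
    fetch_text(page_id) -> current transcript text
    save_text(page_id, text) -> write the updated transcript back
    replace(text) -> text with dictionary terms abbreviated and tagged

    In the real script these stand-ins would be authenticated calls to
    the Transkribus server for the given collection ID.
    """
    changed = 0
    for page_id in list_pages():
        original = fetch_text(page_id)
        updated = replace(original)
        if updated != original:
            save_text(page_id, updated)
            changed += 1
    return changed
```

The pattern explains why the API variant is so much faster than the per-file workflow: one authenticated session handles every page, with no copying and pasting per manuscript image.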

Five new HTR models were generated at this stage of the project, using our dataset as it grew, with the fourth and fifth generated after applying the new version of abbrevSolver-master, and a sixth and seventh created at a later juncture. These initial five models and their individual statistics are shown in Figure 2 below.

With the exception of the second model, which provided a very good CER but anomalously high WER, we have seen an extremely promising decline in both CERs and WERs over the course of the project. In the fifth in particular, we developed a model that, at least on paper, outperformed many of those that were then publicly available on Transkribus for similar types of material, in terms of their CERs on their train and validation sets.

When compared to the models that were generated prior to the commencement of the project, the two models that were generated after using the new script are extremely good. Specifically, UCL–University of Toronto #5 yielded CERs 13.7% lower and WERs 37.15% lower than those achieved before the project began, and far below the ideal threshold of 10%.

Two batches of material were uploaded onto Transkribus after the creation of our fifth model for testing purposes—namely BL Add. MS 46,487 and Egerton MS 3,031—consisting of 300 images in total. The biggest problem with these two collections was the quality of the images, whose resolution was too low to work effectively with Transkribus. Soon afterwards, however, we were able to obtain two much more promising batches of material—namely Anglo-American Legal Tradition (AALT) material and material from two cartularies held by Christ Church College, Oxford.

The first round of AALT testing was performed on images and transcripts from CP/25/1/88, held by the National Archives in the UK. For these, we achieved an average CER of 24% and average WER of 50%, which were disappointing compared to the model statistics given above—even though, with three or four of the worst-performing folios excluded, the average CER dropped to 21.58% and the average WER dropped to 48.15%. Our best-performing folio from CP/25/1/88 achieved a CER of 8.82% and a WER of 28.57%, and another achieved a CER of 9.13% and a WER of 27.78%.

A further problem with this material, alongside that of image quality, was the brevity of each transcript. While they did include several abbreviations, the transcripts themselves consisted of only between two and five lines of text and did not cover whole pages. This made the process of testing much less time-consuming, because the groundtruth could be created very quickly, but the standardised form that each folio took, and the large number of words common to every folio, meant that our error rates may have been misleadingly low in some instances where the models appeared to perform relatively well.

We were thereafter able to obtain ten further, complete and thus lengthier AALT transcripts. Once these had been uploaded, the latest model was run again and achieved a much better CER of 15.62%, though a largely unchanged WER of 46.78%. There were only a few folios where the results of the shorter transcripts were better than those of the longer ones, which is very positive given that the greater number of words left a far greater margin for error.

Two third-party models, Charter Scripts XIII–XV and HIMANIS Chancery M1+, were then run so that their results could be compared with those from our own models. For 50% of the folios, Charter Scripts XIII–XV out-performed our model, suggesting that the material on which it is based is more akin to the content of AALT than to the material used to generate UCL–UoT #4–5. On average, though, the difference was negligible: 0.35% in the CER and 2.26% in the WER. (The standard HIMANIS model was worse than ours in both respects.)

Overall, these results were not as promising as those from the material which was used to generate our own models. Compared to the AALT results, our fourth model gave a 14.44% better CER and a 41.93% better WER: enormous differences when it comes to the practicability of using Transkribus as an alternative to human transcription. This indicated a significant dissimilarity between the AALT material and our own, and suggested that a much-expanded groundtruth would be required to see an improvement, something that we went on to address with the Christ Church material discussed below, namely the Cartulary of Eynsham Abbey and the Cartulary of St Frideswide's Priory.

Approximately 300 images were downloaded for Eynsham and around 180 for St Frideswide, with the available transcripts separated by line and matched to the folios on Transkribus. As with the BL Add MS 46,487 and Egerton MS 3,031 material, however, the image quality was very poor and this was fully reflected in the testing results. This round of testing provided poor WERs of around 82.02% and CERs of around 32.15%, double those from the second round of AALT testing.

Christ Church College was then contacted directly. After we had explained the nature of our project and inquired about obtaining better-quality versions, they very kindly waived their £10 per .tiff file fee and sent a small number of images on which we could perform some testing. These higher-resolution images made a huge difference: our CER improved by 14.42% and our WER by 36.3%.

These images were also tried with the two third-party models mentioned above, Charter Scripts XIII–XV and HIMANIS Chancery M1+, which achieved lower WERs of 39.83% and CERs of 15.21%. For one folio, Generic Charters out-performed our model by 3.19% CER and 5.38% WER. As with the AALT testing attempt, this suggests a dissimilarity between our material, the testing material, and the material used by that model’s creators, thus indicating that the cartulary data should be integrated into our groundtruth and that a new model ought to be created.

During the cartulary testing, several different dictionary settings within Transkribus were experimented with when running each model, with the ‘Language model from training data’ option providing by far the best results. This suggests that we are in fact getting significant added value from our abbreviation dictionary, even above what is possible through using a very extensive third-party Latin dictionary repository composed of around 100,000 words.

The most substantial problem with these cartularies, however, was that only a handful of the images sent from Christ Church had their corresponding transcripts included in the DEEDS corpus. Because of this, only two of the batch of high-resolution images could be tested, both of them from Eynsham, so the testing results, while relatively positive, were not ideal.

Christ Church then kindly agreed to provide us with all of the high-resolution files for both the Eynsham and St Frideswide cartularies for which we possessed transcripts (approximately 125 in total). The issue we faced, however, was with our available transcripts of the Eynsham and St Frideswide material, which had to be checked closely against the manuscripts for accuracy before Text2Image matching could take place. In the process of doing so, we uncovered some serious textual issues; for instance, some lines in St Frideswide had been transcribed with a single letter, separated by spaces, standing for each word, making them almost unusable for the purpose of expanding our groundtruth, at least without a more expert eye to correct the Latin.

Elema then checked the text thoroughly. The working transcripts were based on editions of the Eynsham and St Frideswide cartularies published in 1907 and 1895 respectively. Elema corrected the text to reflect more closely what was written in the manuscripts. Where the editors had standardised the Latin orthography and capitalisation, Elema returned the words to their idiosyncratic medieval forms, and where the editions had replaced redundant phrases with ellipses, she filled in the complete text. Since the model was already capable of resolving common symbols and abbreviations, she transcribed short abbreviations, like the symbols for est, et, con, per, and quod, in their expanded form. Less common abbreviations, and ones that left out more than two or three letters in a row, were transcribed as they appeared in the manuscript and noted in a spreadsheet so that they could be tagged later and added to the model's dictionary of abbreviated words.
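This rule of thumb can be restated as a two-way decision. The function below is our own schematic sketch of it; the threshold, and the use of the total number of omitted letters rather than consecutive ones, are simplifications:

```python
def transcription_policy(abbrev, expansion, is_common, max_omitted=3):
    """Decide how an abbreviation enters the groundtruth: common forms
    omitting only a few letters are written out in full; rarer or more
    heavily contracted ones are kept as written and logged for tagging."""
    omitted = len(expansion) - len(abbrev)
    if is_common and omitted <= max_omitted:
        return "expand"
    return "keep_and_log"
```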

The combined sixth model, entitled UCL–University of Toronto #6—created using 150,488 words and 10,519 lines of text in total—achieved an impressive CER on the train set of 1.72%, but owing to an issue which arose when allocating the validation set, the CER on the validation set was very high, namely 7.56%. Because of this anomalously high CER on the validation set, a seventh and final model was created, entitled UCL–University of Toronto #7. This was based upon 140,158 words and 9,780 lines of text, and returned an impressive CER on the train set of 1.39% and a CER on the validation set of 0.80%, which we were very happy with overall.

After these two models had been created, we conducted some testing, both on the two cartularies from Oxford with which we had expanded our original groundtruth, and on the original collections on Transkribus. On the former, the Oxford material, the results were good: on St Frideswide we achieved a CER on the train set of 6.28% and a CER on the validation set of 7.45%; and on Eynsham we achieved a CER of 5.89% on the train set and a CER of 13.68% on the validation set—not perfect, particularly the validation CER for the Eynsham cartulary, but far better than what we had achieved before obtaining high-quality images and correcting our transcripts.

These figures from the Oxford cartulary testing, however, were not quite as promising when compared to some of those achieved when creating discrete, smaller models, unique to each cartulary, without borrowing elements from the collections we had used to generate our first five models. For the discrete St Frideswide model, developed from a modest 14,781 words and 1,113 lines of text, the CERs were 1.93% better than those achieved when applying UCL–University of Toronto #7 to the same material, but the WERs were a negligible 0.03% worse. For the discrete Eynsham model, developed from 6,074 words and 825 lines of text, the CERs were 2.97% better than those achieved when applying UCL–University of Toronto #7 to the same material, and the WERs were a good 7.56% better. Figure 3 below shows the results of models six and seven, as well as the results of the two models based on the Oxford cartularies which were used for comparative purposes.

Most importantly, though, the expansion of our groundtruth with material from Oxford and the creation of our seventh model did improve our testing results on material from the DEEDS corpus compared with anything we had been able to achieve using the first five models, which were created exclusively with DEEDS manuscripts. The final testing figures, averaged across all our collections, were a CER of 0.53% and a WER of 1.09%, with some individual subcollections on Transkribus achieving CERs, and indeed WERs, as low as 0.13%. Figure 4 below shows the average testing results of our seventh model when applied to material from the DEEDS corpus and to St Frideswide and Eynsham, as well as the testing results of the smaller St Frideswide- and Eynsham-specific comparative models.

In order to showcase the final model and to exhibit to others the nature and scope of the project, on 3 March we hosted a successful virtual launch event, chaired by Schofield and featuring short talks by Gervers, Causer, Lloyd, Riley, and Elema. During the session, Schofield introduced the project and its funders, and provided biographies of each of its speakers, after which Gervers discussed the history of the DEEDS project from its inception in the 1970s, the approximately 60,000 charters of which it is comprised, topic modelling, and the dating of the material. Causer then discussed the Bentham Project, Transcribe Bentham, and the Project’s involvement in the READ programme and the development of Transkribus, before Lloyd discussed the early endeavours to apply HTR technology to material from the DEEDS corpus. Lloyd then handed over to Riley, who proceeded to discuss the preliminary model creation attempts and the results of the first four UCL–University of Toronto models, the use of abbrevSolver-master and its API variant, followed by an analysis of the sixth and seventh models and their testing results on material from the DEEDS corpus and the Oxford cartularies respectively. Between Riley’s discussion of models one to four and models six to seven, Elema discussed her methodology and the challenges she faced and overcame when correcting the groundtruth of the Oxford cartularies. Finally, the speakers answered questions from those in the audience.

The following week, on 10 March, Riley delivered a training workshop, teaching new and intermediate users how to use the Transkribus software platform. After a recap of the previous week’s event, the session covered account creation, installation and setup, Transkribus credits and how many are required to process certain quantities of pages, creating collections and importing manuscript images, applying baselines and text regions, adding transcripts, applying tags, creating and applying HTR+ models (including UCL–UoT #7), analysing results and comparing one’s groundtruth against one’s HTR-generated transcripts, searching and keyword-spotting, and exporting the resulting data.

The total number of attendees across both sessions was approximately 267, and the videos of both events are now available to view on the Bentham Project’s ‘Bentham Seminar’ channel on YouTube.

Our final model and its training data are now publicly available through Transkribus, further information on which may be found here. If you do use it, we would love to hear about your results via email.

We would like to express our sincere gratitude to everyone who has been involved in the project, as well as to our funders and to those who supplied manuscript images, and to everyone who attended the model launch event on 3 March or the workshop on 10 March.

Chris Riley, Bentham Project, Faculty of Laws, UCL

c.riley@ucl.ac.uk

6 May 2020

Project Update – So long to Transcribe Bentham

uczwlse · 23 April 2019

Hello! I’m checking in with a final blog post before I move on to my new job at The National Archives. My last day in the office will be 26th April.  If you need any assistance after I’ve left, please contact transcribe.bentham@ucl.ac.uk and Bentham Project staff will pick up your message.

Working at the Bentham Project has been an incredibly rewarding experience and one which has certainly caused me more pleasure than pain over the past four years. I hope that in my own small way I have contributed to an improved understanding of Bentham’s life and philosophy.

I knew that transcription would be a large part of my job when I was asked to transcribe a Bentham manuscript during my job interview! Hopefully my eye for Bentham’s handwriting has improved somewhat since then but I still feel a sense of trepidation when I encounter a page from the 1830s (when Bentham’s handwriting was at its worst!).

After initially supporting my Bentham Project colleagues with the editing and researching of Bentham’s papers, I became the coordinator of the Transcribe Bentham crowdsourcing initiative in January 2016. It has been a real privilege to work alongside such a dedicated group of volunteers, many of whom have been with the project for several years. They have blossomed into extraordinarily skilled transcribers who bravely tackle Bentham’s writings, which can at times be chaotic and convoluted. Our volunteers play an invaluable role in helping the Bentham Project to complete the mammoth task of transcribing thousands upon thousands of pages left behind by Bentham. We endeavour to ensure that our volunteers feel part of our community, sending them a monthly newsletter and encouraging them to take part in events like the Bentham Hackathon. I was delighted to be able to support volunteers to transcribe 20,000 pages of Bentham’s writings, an important milestone which was reached in April 2018. Our transcribers add to this total daily and I will enjoy watching their progress continue from afar. I would like to give huge thanks to all of our volunteers for their patience, hard work and company over the past few years. They are living proof that amazing things can be achieved through collaboration between researchers and members of the public.

Transcribe Bentham volunteers Annette Brindle, Simon Croft and Gill Hague celebrating the complete digitisation of Bentham’s papers in June 2018.

I’ve also had the chance to experiment with new ways of transcribing documents thanks to the Bentham Project’s role in the EU-funded READ project. Handwritten Text Recognition (HTR) technology is progressing rapidly and computers are getting better at tackling difficult manuscripts like those written by Bentham. As reported here on the blog, we’re now able to use the Transkribus platform to recognise Bentham’s hand with an average Character Error Rate of just 9% (meaning that 91% of characters are transcribed correctly by the machine). We also have a Keyword Spotting interface where users can work with HTR to search the entirety of the Bentham papers. I’ve been lucky enough to present our results internationally, from the Nordic Countries to the Balkans, and have taught hundreds of academics and archivists in regular Transkribus workshops. I’ve enjoyed spreading the word about Transkribus far and wide through writing the READ project’s blog, newsletter and social media and organising the annual Transkribus User Conference. I have high hopes for the future of HTR and look forward to seeing volunteer skills and machine recognition integrated in a future version of Transcribe Bentham.

On the judging panel at the 2017 Bentham Hackathon.

Of course I need to thank all of my colleagues, in the Bentham Project, in UCL Laws and in the READ project, for the inspiration, support and fun. I have learnt a lot about Bentham, historical research, archives, scholarly editing, digital humanities, public engagement, crowdsourcing and much more, which will all stand me in good stead for my new role at The National Archives.

Project update – master Bentham’s handwriting with Transkribus Learn

uczwlse · 18 January 2019

Over the years, our volunteers have developed an enviable expertise in deciphering Bentham’s decidedly difficult handwriting.  They can even transcribe pages like this! By contrast, many newcomers are understandably daunted by Bentham’s scrawl – they may start transcribing a page one day but then never return.

A new e-learning website, produced by the University of Innsbruck as part of the READ project, promises to help anyone and everyone get to grips with all kinds of historical handwriting. Transkribus Learn does not replace systematic palaeography training but it allows users to practice reading and transcribing individual words, learning as they go.


Transkribus Learn has two transcription modes – ‘Study’ and ‘Test’.

In the former, users can guess and then reveal the transcription of individual words in a manuscript.  In the latter, users will be prompted to transcribe the missing word in a series of examples. At the end, you receive your score and a list of correct and incorrect answers. You can keep studying and testing yourself, as often as you like.

There are two Bentham collections on the site – categorised as ‘easier Bentham’ (containing writing by Bentham and his secretaries) and ‘difficult Bentham’. Both collections are an ideal training ground for new volunteers, offering the opportunity to practice transcribing different words in rapid succession.

We recommend that new volunteers start with the ‘easier Bentham’ and move on to the more difficult pages once they feel ready. I hope some of our long-standing volunteers might also have a play and challenge themselves to read some of Bentham’s nastiest handwriting!

What does that say?? Practice transcribing Bentham and more with Transkribus Learn


As one of the partners in the READ project, the Bentham Project helped to develop Transkribus Learn. But there’s so much more than Bentham to discover on the site. The site currently contains scripts from the 12th to the 19th centuries in a range of languages. Users can also upload their own documents to the platform as a training exercise for students or volunteers.

The Transkribus team look forward to helping a broad range of people learn valuable new transcription skills!  They welcome any feedback or questions (learn@transkribus.eu).

Project Update – Improving the Automated Recognition of Bentham’s handwriting

uczwlse · 28 November 2018

As our volunteer transcribers know, getting to grips with Bentham’s handwriting can be a steep learning curve.  Bentham never wrote particularly neatly and his scrawl became increasingly difficult to comprehend as he grew older.  Since 2013, the Bentham Project has been experimenting with advanced machine learning technology via the Transkribus platform in an attempt to train algorithms to automatically decipher Bentham’s handwriting.  And we have lately seen vastly improved results!

Read about our progress with Handwritten Text Recognition (HTR) technology and the Transkribus platform in blog posts from June 2017, February 2018 and October 2018.

HTR technology is open to anyone around the world thanks to Transkribus and the READ project. Once users have installed the platform, they can set about processing images and transcripts as training data for automated text recognition.  The software uses computational models for machine learning called neural networks.  These networks are trained to recognise a style of writing by being shown images and transcripts of that writing.  Anyone can start a test project in Transkribus by uploading around 75 pages of digitised images to the platform and transcribing each page as fully as possible.  The software learns from everything it is shown and so the more pages of training data, the better!  Find out more about getting started with Transkribus in the Transkribus How to Guides.

When we started working with HTR, it is fair to say that we were somewhat uncertain about the capabilities of the technology.  So we decided to focus on training a model to recognise some of the easier papers in the Bentham collection – those written by Bentham’s secretaries, who tend to have neat handwriting.  Using around 900 pages of images and transcripts, we trained a model that is now publicly available to all Transkribus users under the name ‘English Writing M1’.  This model can produce transcripts of pages from the Bentham collection with a Character Error Rate (CER) of between 5% and 20%.  It produces good transcripts of pages written by Bentham’s secretaries but struggles to decipher Bentham’s own hand.

So our next challenge was to improve the recognition of Bentham’s most difficult handwriting.  For the past 18 months we have been continually creating training data in Transkribus based on very complex pages from the Bentham collection, periodically retraining HTR models and then assessing the results.  Until recently, our best result was a model trained on 81,000 transcribed words (around 340 pages) which used the ‘English Writing M1’ model as a base model.  By using a base model, Transkribus users can give the system a boost and ensure that it builds directly on what it has already learnt from the creation of an earlier model.  In this case, our resulting model could produce transcripts with an average CER of 17.75%.

The great thing about working with Transkribus is that the technology is improving all the time, thanks to the efforts of the computer scientists who work on the READ project.  The latest innovation is HTR+, a new form of Handwritten Text Recognition technology formulated by the CITlab team at the University of Rostock.  HTR+ is based on TensorFlow, a software library developed by Google.  It works similarly to the existing HTR but processes data much faster, meaning that the algorithms can learn more quickly and so produce better results.  We used HTR+ to train a model on 140,000 transcribed words (or 535 pages) of Bentham’s most difficult handwriting.  This model can generate transcripts with a CER of around 9%.

An automated transcript of very difficult handwriting, using our latest HTR+ model. Image courtesy of UCL Special Collections.

HTR+ is not yet available to all Transkribus users – but users can request access by sending an email to the Transkribus team (email@transkribus.eu).

We are getting closer to the reliable recognition of Bentham’s handwriting and this is very exciting!  As a scholarly editing project dedicated to producing Bentham’s Collected Works, we require highly accurate transcripts as a basis for our work.  The experience of other Transkribus users suggests that transcripts which have a CER of around 5% can be corrected rapidly and easily.  So our next priority is to conduct some tests to see how easy Bentham Project researchers find it to correct and edit transcripts generated by this model where the CER is 9%.

We will also continue creating new pages of training data in Transkribus using images and transcripts of Bentham’s most difficult handwriting.  As well as retraining our current model with additional pages of data, we want to create smaller models focused on specific hands and languages in the Bentham collection.  This new training data could also be used to improve our Keyword Spotting tool, which was set up by the PRHLT research center at the Universitat Politècnica de València.

We are also preparing a large-scale experiment with Text2Img matching technology devised by the CITlab team.  This technology allows users to use existing transcripts as training data for HTR, rather than creating transcripts afresh in Transkribus.  We hope that this technology will allow us to create a new model based on several thousand pages from the Bentham collection – watch this space!

And of course, we can’t forget Transcribe Bentham.  We still plan to integrate HTR technology directly into our crowdsourcing platform over the next few years.  The idea is that users will be able to check and correct automated transcripts, or simply transcribe as normal and receive computer-generated suggestions for words that are difficult to decipher.  We believe that new users, who tend to be daunted by the complexity of Bentham’s handwriting, are likely to find these transcription options more attractive.  Experienced users may also appreciate word suggestions to assist their transcription work.

The Bentham Project is at the cutting edge of this transformational technology and we hope that these advances will ultimately bring us closer to the complete transcription and publication of Bentham’s Collected Works.

My thanks go to Chris Riley, Transcription Assistant at the Bentham Project, for his assistance with the preparation of training data in Transkribus.

Project Update – Searching Bentham’s manuscripts with Keyword Spotting!

uczwlse, 15 October 2018

The Bentham Project has been experimenting with the Handwritten Text Recognition (HTR) of Bentham’s manuscripts for the past five years, first as a partner in the tranScriptorium project and now as part of READ.

Read about our progress with HTR and the Transkribus platform in blog posts from June 2017 and February 2018.

Keyword Spotting

Our results have thus far been impressive, especially considering the immense difficulty of Bentham’s own handwriting.  But automated transcription is not yet at a point where it is sufficiently accurate to be used by Bentham Project researchers as a basis for scholarly editing.  It would be too time-consuming (and probably too irritating!) for us to correct the errors in the computer-generated transcripts of papers written in Bentham’s hand.

However, the current state of the technology is strong enough for keyword searching!  And thanks to a collaboration with the PRHLT research center at the Universitat Politècnica de València (another partner in the READ project) we have some exciting new results to report.  It is now possible to search over 90,000 digital images of the central collections of Bentham’s manuscripts, which are held at UCL Special Collections and the British Library.

A Keyword Spotting search for the word ‘pleasure’

 

Appeal for volunteers!

I have prepared a Google sheet with some suggested search terms in five spreadsheet tabs (Bentham’s neologisms, concepts, people, places and other).

It would be fantastic if people filled in the spreadsheet to record some of their searches, using my suggested search terms and some of their own.  Transcribers could search for subjects they are interested in and then cross-reference to material on the Transcription Desk that they might like to transcribe.

Who knows what we might find??  I hope to share some of these results in my upcoming presentation at the Transkribus User Conference in November 2018.  Thanks in advance for your participation.

Background

The PRHLT team have processed the Bentham papers with cutting-edge HTR and probabilistic word indexing technologies.  This sophisticated form of searching is often called Keyword Spotting.  It is more powerful than a conventional full-text search because it uses statistical models trained for text recognition to search through probability values assigned to character sequences (words), considering multiple possible readings of each word on a page rather than a single ‘best’ transcript.
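As a minimal sketch of the principle – using an invented toy index and hypothetical shelfmarks, not the PRHLT system or its data format – each line location carries the recogniser’s candidate readings with probabilities, and a query matches wherever its probability clears a threshold:

```python
# Hypothetical word-confidence index: each (document, line) location maps
# to the recogniser's candidate readings and their probabilities.
index = {
    ("JB/071/031/001", 3): {"pleasure": 0.92, "pressure": 0.05},
    ("JB/071/031/001", 7): {"pain": 0.88, "plain": 0.07},
    ("JB/116/004/002", 1): {"pleasure": 0.41, "pleasance": 0.30},
}

def keyword_spot(query: str, threshold: float = 0.3):
    """Return locations where the query is a plausible reading,
    ranked by the recogniser's confidence."""
    hits = [(loc, probs[query]) for loc, probs in index.items()
            if probs.get(query, 0.0) >= threshold]
    return sorted(hits, key=lambda h: h[1], reverse=True)

print(keyword_spot("pleasure"))
```

Because low-confidence alternative readings are retained, a word the recogniser was only 40% sure about can still be found – something a conventional full-text search of a single transcript would miss.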

We delivered thousands of images and transcripts to the team in Valencia and gave them access to the data we had already used to train HTR models in Transkribus.  After cleaning our data and using Transkribus technology to divide the images into lines, the team in Valencia trained neural network algorithms to recognise and index the collection.

The result is that this vast collection of Bentham’s papers can be efficiently searched, including those papers that have not yet been transcribed!  The accuracy rates are impressive.  The spots suggest around 84-94% accuracy (a 6-16% Character Error Rate) when compared with manual transcriptions of Bentham’s manuscripts.  More precisely, laboratory tests show that the average word search precision ranges from 79% to 94%.  This means that, out of 100 search results, between 6 and 21 may not actually be the word searched for.  The accuracy of spotted words depends on the difficulty of Bentham’s handwriting – although it is possible to find useful results even in Bentham’s scrawl!  There could be as many as 25 million words waiting to be found.
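To spell out the arithmetic behind those precision figures (a toy calculation, not project code):

```python
def precision(correct_hits: int, total_hits: int) -> float:
    """Search precision: the percentage of returned results that are
    genuinely the word searched for."""
    return 100.0 * correct_hits / total_hits

# At the bottom of the quoted range, 79 of 100 results are correct
# (21 false hits); at the top, 94 of 100 are correct (6 false hits).
print(precision(79, 100))  # 79.0
print(precision(94, 100))  # 94.0
```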

Use cases

This fantastic site will be invaluable to anyone interested in Bentham’s philosophy.  It will help Bentham Project researchers to find previously unknown references in pages that have not yet been transcribed.  It will allow researchers to quickly investigate Bentham’s concepts and correspondents.  I hope that it will also help volunteer transcribers to find interesting material.

This interface is a prototype beta version.  In the future we want to increase the power of this research tool by connecting it to other digital resources – allowing users to quickly search the manuscripts in the UCL library repository, the Bentham papers database and the Transcription Desk – and by linking these images to our rich existing metadata.

Similar Keyword Spotting technology (based on research by the CITlab team at the University of Rostock, another of the READ project partners) is currently available to all users of the Transkribus platform.  Find out more at the READ project website.

I welcome any feedback on our new search functionality at: transcribe.bentham@ucl.ac.uk

My thanks go to the PRHLT research center, the University of Innsbruck and Chris Riley, Transcription Assistant at the Bentham Project for their support and assistance.