Project Update – Searching Bentham’s manuscripts with Keyword Spotting!
By uczwlse, on 15 October 2018
The Bentham Project has been experimenting with the Handwritten Text Recognition (HTR) of Bentham’s manuscripts for the past five years, first as a partner in the tranScriptorium project and now as part of READ.
Our results have thus far been impressive, especially considering the immense difficulty of Bentham’s own handwriting. But automated transcription is not yet at a point where it is sufficiently accurate to be used by Bentham Project researchers as a basis for scholarly editing. It would be too time-consuming (and probably too irritating!) for us to correct the errors in the computer-generated transcripts of papers written in Bentham’s hand.
However, the current state of the technology is strong enough for keyword searching! And thanks to a collaboration with the PRHLT research center at the Universitat Politècnica de València (another partner in the READ project) we have some exciting new results to report. It is now possible to search over 90,000 digital images of the central collections of Bentham’s manuscripts, which are held at Special Collections University College London and The British Library.
Appeal for volunteers!
I have prepared a Google sheet with some suggested search terms in 5 different spreadsheet tabs (Bentham’s neologisms, concepts, people, places and other).
It would be fantastic if people filled in the spreadsheet to record some of their searches, using my suggested search terms and some of their own. Transcribers could search for subjects they are interested in and then cross-reference to material on the Transcription Desk that they might like to transcribe.
Who knows what we might find?? I hope to share some of these results in my upcoming presentation at the Transkribus User Conference in November 2018. Thanks in advance for your participation.
The PRHLT team have processed the Bentham papers with cutting-edge HTR and probabilistic word indexing technologies. This sophisticated form of searching is often called Keyword Spotting. It is more powerful than a conventional full-text search because it uses statistical models trained for text recognition to search through probability values assigned to character sequences (words), considering most possible readings of each word on a page.
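To illustrate the idea, here is a minimal sketch of how a probabilistic word index differs from a plain full-text search. It is not the PRHLT implementation: the `Spot` class, the `search` function, the page names, the probabilities and the 0.5 threshold are all hypothetical, invented for this example. The index keeps several candidate readings for a word on a page, each with a probability, and a search returns every reading that matches the query with enough confidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spot:
    """One candidate reading of a word on a page (hypothetical structure)."""
    page: str
    word: str
    probability: float


# Toy index with made-up values. Note that page_002 carries two competing
# readings of the same scrawled word; a single "best" transcript would keep
# only "pandemonium" and a full-text search for "panopticon" would miss it.
INDEX = [
    Spot("page_001", "panopticon", 0.91),
    Spot("page_001", "pauperism", 0.06),
    Spot("page_002", "pandemonium", 0.40),
    Spot("page_002", "panopticon", 0.35),
    Spot("page_003", "punishment", 0.88),
]


def search(index, query, threshold=0.5):
    """Return spots matching the query at or above the confidence threshold,
    most confident first."""
    wanted = query.lower()
    hits = [s for s in index if s.word == wanted and s.probability >= threshold]
    return sorted(hits, key=lambda s: s.probability, reverse=True)
```

With the default threshold, `search(INDEX, "panopticon")` returns only the confident spot on `page_001`. Lowering the threshold to 0.3 also returns the uncertain reading on `page_002`, trading some precision for the chance to find words in harder handwriting.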
We delivered thousands of images and transcripts to the team in Valencia and gave them access to the data we had already used to train HTR models in Transkribus. After cleaning our data and using Transkribus technology to divide the images into lines, the team in Valencia trained neural network algorithms to recognise and index the collection.
The result is that this vast collection of Bentham’s papers can be efficiently searched, including those papers that have not yet been transcribed! The accuracy rates are impressive. The spots suggest around 84-94% accuracy (6-16% Character Error Rate) when compared with manual transcriptions of Bentham’s manuscripts. More precisely, laboratory tests show that the average word search precision ranges from 79% to 94%. This means that, out of 100 typical search results, as few as 6 (and at most around 21) may fail to actually be the word searched for. The accuracy of spotted words depends on the difficulty of Bentham’s handwriting, although it is possible to find useful results even in Bentham’s scrawl! There could be as many as 25 million words waiting to be found.
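For readers who want to see how these percentages relate, the short sketch below works through the arithmetic. The function names are my own and purely illustrative; the only figures used are the ones quoted above (79% to 94% precision, 6% to 16% Character Error Rate).

```python
def expected_false_hits(precision_pct, results=100):
    """Number of results expected NOT to be the searched word,
    given a search precision in percent."""
    return results * (100 - precision_pct) / 100


def accuracy_from_cer(cer_pct):
    """Character-level accuracy in percent, taken as 100 minus the
    Character Error Rate."""
    return 100 - cer_pct


# Best and worst cases from the figures quoted in the post.
for precision in (94, 79):
    print(precision, "% precision ->", expected_false_hits(precision), "false hits per 100")

for cer in (6, 16):
    print(cer, "% CER ->", accuracy_from_cer(cer), "% accuracy")
```

At 94% precision about 6 results in 100 are not the word searched for, and at 79% precision about 21 are not.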
This fantastic site will be invaluable to anyone interested in Bentham’s philosophy. It will help Bentham Project researchers to find previously unknown references in pages that have not yet been transcribed. It will allow researchers to quickly investigate Bentham’s concepts and correspondents. I hope that it will also help volunteer transcribers to find interesting material.
This interface is a prototype beta version. In the future we want to increase the power of this research tool by connecting it to other digital resources, allowing users to quickly search the manuscripts at the UCL library repository, the Bentham papers database and the Transcription Desk and linking these images to our rich existing metadata.
Similar Keyword Spotting technology (based on research by the CITlab team at the University of Rostock, another one of the READ project partners) is currently available to all users of the Transkribus platform. Find out more at the READ project website.
I welcome any feedback on our new search functionality at: email@example.com
My thanks go to the PRHLT research center, the University of Innsbruck and Chris Riley, Transcription Assistant at the Bentham Project, for their support and assistance.