[posted by Julianne Nyhan on behalf of Marco Humbel]
This blog post reports on our recent paper: Humbel, M., Nyhan, J., Vlachidis, A., Sloan, K. and Ortolja-Baird, A. (2021), “Named-entity recognition for early modern textual documents: a review of capabilities and challenges with strategies for the future”, Journal of Documentation, https://doi.org/10.1108/JD-02-2021-0032
Link to full text of publisher-accepted version: https://discovery.ucl.ac.uk/id/eprint/10127463/ (note that this is the version of the paper accepted by the publisher before final proofs and so there will be some minor differences between this and the final published version).
Summary of paper:
Named Entity Recognition (NER) is an information extraction technique for identifying, segmenting and labelling phenomena of interest such as people, organizations and places (Piskorski and Yangarber, 2013). In the article reported on here, we synthesise current research on the application of NER to digitised documents of the early modern period. We also examine NER and authority files from a more critical perspective, and suggest directions to enrich the application of NER going forward. Our findings are based upon an extensive literature review and a case study undertaken by the Leverhulme Trust-funded ‘Enlightenment Architectures: Sir Hans Sloane’s Catalogues of his Collections (2016–2021)’. Our findings suggest that “Currently, it is not possible to benchmark the capabilities of NER as applied to documents of the early modern period”. We also “draw attention to the situated nature of authority files, and current conceptualisations of NER, leading … to the conclusion that more robust reporting and critical analysis of NER approaches and findings is required” (https://www.emerald.com/insight/content/doi/10.1108/JD-02-2021-0032/full/html). We hope our article will be useful for researchers and heritage professionals who seek to use NER on the abundance of digitised sources available for the early modern period.
Discussion of paper:
What is the state of the art of NER as applied to early modern documents? Our frank response is that we currently know only how a particular NER system performs on a specific corpus. We surveyed nine projects dating from 2002 to 2019 and found that a number of factors limit the possibilities for a simple comparison. Historical documents of the early modern period are a heterogeneous set of resources. They comprise different types of material, including manuscripts, collection catalogues, encyclopaedias and pamphlets. Moreover, within one corpus, or even one document, we might find various languages (e.g. Latin, English and French), unstandardised spelling, and transcription errors made by scribes or introduced by text recognition software such as OCR (Optical Character Recognition). Our case study on Sloane’s catalogues also showed that extensive XML (Extensible Markup Language) annotation preceding the NER process can hamper performance, particularly when presentational and semantic tags co-occur. NER should thus ideally be applied before a corpus is annotated with standards like TEI (Text Encoding Initiative).
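Where a corpus has already been marked up, one possible workaround is to hand the NER tool a plain-text view of each entry. The following is a minimal sketch under stated assumptions, not the procedure used in our case study: the function name `plain_text` and the sample entry are invented for illustration, and the element names are only TEI-like.

```python
import xml.etree.ElementTree as ET


def plain_text(xml_fragment: str) -> str:
    """Return the text content of an XML fragment with all tags removed."""
    root = ET.fromstring(xml_fragment)
    # itertext() walks the element tree and yields text and tail strings
    # in document order, so presentational and semantic tags both vanish.
    text = "".join(root.itertext())
    # Collapse runs of whitespace left behind by the markup layout.
    return " ".join(text.split())


# Invented sample entry mixing presentational (<hi>) and semantic
# (<placeName>) tags; it is NOT taken from Sloane's catalogues.
sample = (
    '<entry><hi rend="italic">A</hi> shell from '
    '<placeName>Jamaica</placeName>, given by a '
    '<hi rend="sup">friend</hi>.</entry>'
)

print(plain_text(sample))
```

A sketch like this discards the offsets of the original markup, so any entities found would still need to be mapped back onto the encoded document.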
All of these factors can impact the accuracy of NER. But the generalization of NER approaches to early modern documents and the transferability of projects’ outcomes are impeded further by under-reporting of aspects of the selected approaches, such as the human labour required for data processing.
Different methods for measuring NER systems’ effectiveness are also used (for an example see Goldfield, 1993). The inter-annotator agreement score of human annotators sets the benchmark of what should be expected from automated systems (Sperberg‐McQueen, 2016). Inter-annotator agreement scores of 95% and more can be reached for historical corpora (McDonough and van de Camp, 2017; Erdmann et al., 2016). If we compare the project reports by McDonough et al. (2019) and Won et al. (2018), which are the most comparable ones in our survey, we see that on early modern documents NER systems reach accuracies of about 70% in the best cases. These results make significant human post-processing efforts inevitable, and hold back the benefits that would come with automating the repetitive parts of annotation tasks.
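To make concrete why scores from different projects are hard to compare, here is a minimal sketch of one common way to score NER output: strict-match precision, recall and F1 over entity spans. It is an assumption that this is a suitable measure for any given project; the surveyed reports do not all use it, and the spans below are invented for illustration.

```python
from typing import Set, Tuple

# A span is (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Score predicted spans against gold spans using strict matching.

    A prediction counts as correct only if its offsets AND its label
    are identical to a gold annotation.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: three gold entities, four predictions, two exact matches.
gold = {(0, 5, "PER"), (10, 17, "LOC"), (20, 26, "LOC")}
predicted = {(0, 5, "PER"), (10, 17, "LOC"), (30, 35, "LOC"), (40, 44, "PER")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Relaxing the match (for example, accepting overlapping offsets) would give different figures for the same output, which is one reason a reported score means little without a description of how it was computed.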
Human domain expertise will also be required in the future because what constitutes an entity can’t always be reduced to a binary yes/no. We discussed these challenges in regard to Sloane’s catalogues in Ortolja-Baird et al. (2019). Our survey also shows that so-called rule-based NER systems are only gradually being superseded by machine-learning techniques. This is because machine-learning techniques require huge amounts of training data, which are typically not available to digital humanities projects. Yet promising results were recently demonstrated on highly structured early modern marriage records (Toledo et al., 2019).
Rule-based NER systems are dependent on authority files and gazetteers (look-up lists for identifying entities). The prevalence of rule-based systems in our survey motivated us to map out the landscape of authority files for scholarship on the early modern period. These resources could also form the basis of training data for future machine-learning NER techniques. The authority files we surveyed were created by a number of different actors (heritage institutions and researchers) and, because there is no central registry, are difficult to find. As others have argued, specialized authority files for the early modern period are rare (Nelson, 2014; McDonough et al., 2019). Authority files seem commonly to be viewed as a mere tool for working with source material. But it is known that authority files are often incomplete and, as McDonough et al. (2019) observed, general-purpose authority files can be inaccurate and at worst insensitive to past and present local languages, reinforcing hegemonic world-views. It is thus necessary to develop critical frameworks for interrogating authority files. The creators of authority files could support this development by providing more documentation about their compilation rationale.
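The dependence of rule-based systems on look-up lists can be illustrated with a toy gazetteer tagger. This is a minimal sketch under stated assumptions, not any surveyed project's system: the gazetteer entries, the spelling variant "Jamaico", the labels and the function name `tag_entities` are all invented for illustration.

```python
import re
from typing import Dict, List, Tuple

# Invented toy gazetteer: surface form -> entity label.
# "Jamaico" stands in for an unstandardised historical spelling.
GAZETTEER: Dict[str, str] = {
    "Jamaica": "PLACE",
    "Jamaico": "PLACE",
    "London": "PLACE",
    "Hans Sloane": "PERSON",
}


def tag_entities(text: str, gazetteer: Dict[str, str]) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, surface form, label) for each gazetteer match."""
    # Try longer names first so multi-word entries are not split up.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    labels = {name.lower(): label for name, label in gazetteer.items()}
    return [
        (m.start(), m.end(), m.group(0), labels[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Sent from Jamaico to Hans Sloane in London, later to Jamayca."
matches = tag_entities(sentence, GAZETTEER)
for start, end, surface, label in matches:
    print(start, end, surface, label)
```

The final spelling in the sentence, "Jamayca", is missed because it is not in the list. A tagger of this kind can only be as complete, and as sensitive to local and historical naming, as the authority file behind it.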
What is the way forward? In order to support more robust reporting on the capabilities of NER, we propose a forum where tools are evaluated according to standards formulated by the early modern research community. Possible models for such a forum could be corpora and conference series like CoNLL (Conference on Computational Natural Language Learning), as they are established within the wider NER community. We also acknowledge that NER is not a neutral intervention, and neither are authority files. A digital tool criticism, as proposed by Koolen et al. (2019), could foster a more critical understanding of NER, its biases and its ethical implications.
The full article is available from the Journal of Documentation. We are grateful to the Leverhulme Trust, which provided the research project grant (RPG-2016-239) for Enlightenment Architectures. Thank you to the Centre for Critical Heritage Studies, UCL, for funding part of this work.
We hope that the following list of resources is useful for any colleagues who are interested in applying NER to early modern documents. All links were last accessed on 05.07.2021.
Dyer-Witheford, N., Kjøsen, A. M. and Steinhoff, J. (2019). Inhuman Power: Artificial Intelligence and the Future of Capitalism. London: Pluto Press.
Erdmann, A., Brown, C., Joseph, B., Janse, M., Ajaka, P., Elsner, M. and Marneffe, M.-C. de (2016). Challenges and Solutions for Latin Named Entity Recognition. Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH). Osaka, Japan: The COLING 2016 Organizing Committee, pp. 85–93 https://www.aclweb.org/anthology/W16-4012 (accessed 5 July 2021).
Goldfield, J. D. (1993). An argument for single-author and similar studies using quantitative methods: Is there safety in numbers? Computers and the Humanities, 27(5–6): 365–74 doi:10.1007/BF01829387.
Koolen, M., Gorp, J. van and Ossenbruggen, J. van (2019). Toward a model for digital tool criticism: Reflection as integrative practice. Digital Scholarship in the Humanities, 34(2): 368–85 doi:10.1093/llc/fqy048.
Marrero, M., Urbano, J., Sánchez-Cuadrado, S., Morato, J. and Gómez-Berbís, J. M. (2013). Named Entity Recognition: Fallacies, challenges and opportunities. Computer Standards & Interfaces, 35(5): 482–89 doi:10.1016/j.csi.2012.09.004.
McDonough, K. and Camp, M. van de (2017). Mapping the Encyclopédie: Working Towards an Early Modern Digital Gazetteer. Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities – GeoHumanities’17. Redondo Beach, CA, USA: ACM Press, pp. 16–22 doi:10.1145/3149858.3149861. http://dl.acm.org/citation.cfm?doid=3149858.3149861 (accessed 24 January 2019).
McDonough, K., Moncla, L. and Camp, M. van de (2019). Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33(12): 2498–522 doi:10.1080/13658816.2019.1620235.
Nelson, B. (2014). From Index to Interoperability: The Desideratum of Authority Files in Large-Scale Digital Projects. Scholarly and Research Communication, 5(4) doi:10.22230/src.2014v5n4a192. http://src-online.ca/index.php/src/article/view/192 (accessed 21 February 2019).
Ortolja-Baird, A., Pickering, V., Nyhan, J., Sloan, K. and Fleming, M. (2019). Digital Humanities in the Memory Institution: The Challenges of Encoding Sir Hans Sloane’s Early Modern Catalogues of His Collections. Open Library of Humanities, 5(1): 44 doi:10.16995/olh.409.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts. Morgan&Claypool. Vol. 17. (Synthesis Lectures On Human Language Technologies) http://www.morganclaypool.com/doi/abs/10.2200/S00436ED1V01Y201207HLT017 (accessed 4 June 2018).
Piskorski, J. and Yangarber, R. (2013). Information Extraction: Past, Present and Future. In Poibeau, T., Saggion, H., Piskorski, J. and Yangarber, R. (eds), Multi-Source, Multilingual Information Extraction and Summarization. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 23–49 doi:10.1007/978-3-642-28569-1_2. http://link.springer.com/10.1007/978-3-642-28569-1_2 (accessed 9 May 2019).
Ravenek, W., Heuvel, C. van den and Gerritsen, G. (2017). The ePistolarium: Origins and Techniques. In Utrecht University, NL and Odijk, J. (eds), CLARIN in the Low Countries. Ubiquity Press, pp. 317–23 doi:10.5334/bbi.26. https://www.ubiquitypress.com/site/chapters/10.5334/bbi.26/ (accessed 8 June 2018).
Smith, D. A. and Cordell, R. (2019). A Research Agenda for Historical and Multilingual Optical Character Recognition. Northeastern University https://ocr.northeastern.edu/report/ (accessed 10 March 2019).
Smithies, J., Westling, C., Sichani, A.-M., Mellen, P. and Ciula, A. (2019). Managing 100 Digital Humanities Projects: Digital Scholarship & Archiving in King’s Digital Lab. Digital Humanities Quarterly, 13(1) http://www.digitalhumanities.org/dhq/vol/13/1/000411/000411.html#d3876770e516 (accessed 16 February 2020).
Sperberg‐McQueen, C. M. (2016). Classification and its Structures. In Schreibman, S., Siemens, R. G. and Unsworth, J. (eds), A New Companion to Digital Humanities. Chichester, West Sussex, UK: Wiley/Blackwell, pp. 377–93.
Toledo, J. I., Carbonell, M., Fornés, A. and Lladós, J. (2019). Information extraction from historical handwritten document images with a context-aware neural model. Pattern Recognition, 86: 27–36 doi:10.1016/j.patcog.2018.08.020.
Won, M., Murrieta-Flores, P. and Martins, B. (2018). Ensemble Named Entity Recognition (NER): Evaluating NER Tools in the Identification of Place Names in Historical Corpora. Frontiers in Digital Humanities, 5 doi:10.3389/fdigh.2018.00002. http://journal.frontiersin.org/article/10.3389/fdigh.2018.00002/full (accessed 14 May 2018).