An Introduction to Text and Data Mining (TDM)
By Ruth Wainman, on 14 January 2019
What is TDM?
There are various definitions of Text and Data Mining (TDM), covering both the technicalities and the uses of the practice. The UK Intellectual Property Office (IPO) usefully defines TDM as: ‘The use of automated analytical techniques to analyse text and data for patterns, trends and other useful information’. Within TDM, text mining and data mining are also defined separately. Text mining is most commonly seen as the computational process of discovering and extracting knowledge from unstructured data. Data mining, on the other hand, is the computational process of discovering and extracting knowledge from structured data. There has been a surge of interest in the use of TDM in academia across all disciplines, from the sciences to the humanities. Yet undertaking TDM has also raised a host of legal and political issues that have threatened to hinder the practice. These issues have largely centred on copyright, intellectual property rights, licences and download limits.
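To make the distinction between text mining and data mining more concrete, here is a minimal sketch in Python. It is only an illustration: the sample sentences, the download figures and the function names are all invented for this example, and real TDM projects would typically use far larger corpora and dedicated tools.

```python
import re
from collections import Counter

# Hypothetical sample data, invented purely for illustration.
documents = [
    "Copyright law shapes how researchers mine text.",
    "Researchers mine text and data to find patterns and trends.",
]

records = [
    {"year": 2017, "downloads": 120},
    {"year": 2018, "downloads": 150},
    {"year": 2018, "downloads": 30},
]


def term_frequencies(texts):
    """Text mining sketch: count words across unstructured text."""
    counts = Counter()
    for text in texts:
        # Lower-case the text and keep runs of letters as words.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def downloads_per_year(rows):
    """Data mining sketch: aggregate a field across structured records."""
    totals = Counter()
    for row in rows:
        totals[row["year"]] += row["downloads"]
    return dict(totals)


if __name__ == "__main__":
    print(term_frequencies(documents).most_common(3))
    print(downloads_per_year(records))
```

In the first function the structure (words) has to be imposed on free text before anything can be counted, whereas in the second the fields already exist and can be aggregated directly. That difference is roughly what separates the two definitions above.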
Why do TDM?
TDM can make research easier for those seeking to examine a large corpus of documents in order to discover underlying trends across multiple datasets. It is often cited as a way of accelerating scientific discovery. But TDM is also useful for researchers in the humanities who want to mine sources such as journals and newspapers. The Research IT Services team at UCL works closely with a number of departments across the university, collaborating on a range of software projects including Oceanic Exchanges, ForecastCC and the UCL-wide CloudLabs.
Brief Overview of TDM Services in British Universities and Beyond
There appear to be few libraries, at least in the UK, that provide a comprehensive overview of TDM services. Cambridge University Library currently offers a TDM test kitchen to explore the application of TDM methods to Cambridge University Press (CUP) and University Library (UL) collections. The test kitchen provides a ‘live’ environment in which researchers, CUP and library staff can learn more about TDM methods, share good practice and exchange knowledge about how to overcome challenges. At present, other universities appear to provide only more general advice on what TDM involves, alongside the legal and practical issues of undertaking this type of analysis. There are, however, many alternative providers of TDM services, which can be searched for through DIRT – Text Mining Tools Directory. Many database providers now offer TDM services at no extra cost, although some require researchers to register with the publisher first and then set up access through an API. Researchers are often not expected to obtain explicit permission for undertaking text and data mining, provided that their research has a non-commercial purpose.
Barriers to TDM
In recent years, some changes have been made to the UK’s intellectual property framework in order to support innovation and growth. Following the recommendations of the Hargreaves Report (2011), a copyright exception was introduced into UK law in 2014 to allow text and data analysis for non-commercial research. Yet many barriers to TDM remain. Some of these have been studied in more detail by Michelle Brook, Peter Murray-Rust and Charles Oppenheim, who argue that a number of non-technological barriers still need to be overcome in order to realise the full potential of TDM. They raise concerns about the legal issues surrounding copyright law and database rights, but also offer guidelines on how publishers can help to overcome these barriers to research. These include giving researchers lawful access to original materials and drawing a clear distinction between research regarded as ‘commercial’ and research regarded as ‘non-commercial’.
How can UCL Library help?
There are a number of ways in which the library can assist researchers with TDM. The library can provide advice on the tools available for undertaking TDM and on the types of sources you may wish to analyse. It can also refer researchers to other specialists who can assist further with the technicalities and legalities of TDM. Lastly, the library plays an important role in continuing to promote and build TDM networks across the university.
At a Quick Glance: Publishers and Service Providers which allow TDM
- LSE: http://blogs.lse.ac.uk/impactofsocialsciences/2016/07/12/how-libraries-and-librarians-can-help-with-text-and-data-mining/
- Future TDM: https://www.futuretdm.eu/practitioner-guidelines/
- JISC: https://www.jisc.ac.uk/guides/text-and-data-mining-copyright-exception
- Content Mine: http://contentmine.org/
- CILIP: https://archive.cilip.org.uk/blog/boldly-go-librarians-role-text-data-mining
- Databases that support text and data mining: https://libguides.usc.edu/contentmining/databases