Open@UCL Blog

Copyright and Text & Data mining – what do I need to know?

Kirsty, 6 July 2021

Text and Data Mining (TDM) is a broad term used to cover any advanced techniques for computer-based analysis of large quantities of data of all kinds (numbers, text, images etc). It is a crucial tool in many areas of research, including notably Artificial Intelligence (AI). TDM can be used to reveal significant new facts, relationships and insights from the detailed analysis of vast amounts of data in ways which were not previously possible. An example would be mining medical research literature to investigate the underlying causes of health issues and the efficacy of treatments.

The importance of having copyright exceptions in place to facilitate TDM arises from the fact that the swathes of material which need to be mined are often protected by copyright. That would be true for example of “literary works” of all kinds and of images in many cases. It is frequently the case that researchers will have lawful access to the material but will be prevented from applying TDM techniques because copying the material onto the required computer platform risks legal action for infringement on the part of the copyright owners. “Copying” is of course one of the acts restricted by copyright law and in general the greater the amount and variety of material, the greater the copyright risk.

It is worth remembering that when the Government created an exception for Text and Data Mining in 2014, the UK was ahead of the game. Other countries did not generally have an exception in their legislation at that time. Since then, other jurisdictions have caught up and, in some cases, overtaken the UK. Cutting-edge research is a highly competitive area and researchers working in a country which benefits from a generous TDM exception will have a distinct advantage.

The existing exception is still significant from the Open Science perspective in enabling research projects where computer analysis of large quantities of copyright-protected material is required, particularly in the context of AI.

Let’s take a closer look at the UK TDM exception and what it allows us to do, before comparing it briefly with the more recent EU exceptions. The UK exception is to be found in Section 29A of the Copyright, Designs and Patents Act 1988.

What does the exception allow us to do?

Copying copyright-protected works in order to carry out “text and data analysis” (“computational analysis” in the wording of the exception). The need to copy arises because researchers must have the material to be analysed on a specific platform in order to carry out the analysis. The need for the exception then arises because, without it, the researcher would require permission from the owner of copyright in each item. Without permission (or an exception), the researchers would be infringing copyright by copying a vast swathe of protected material. That in turn would often make the research impractical to carry out.

Who may do this?

Absolutely anyone: the exception says “a person.” This is wonderfully broad and one of the more favourable aspects of the UK exception. For example, you don’t need to be working for or studying at a particular type of institution to benefit from the exception.

Are there conditions?

You must have lawful access to the material. A prime example would be the text of academic journals. We have lawful access to large numbers of e-journals because UCL Library subscribes to them. The exception would allow a UCL researcher to download large amounts of content from e-journals to carry out detailed analysis using specialised tools. It is important to note that the exception cannot be overridden by contract terms. It follows that a term in an e-journal contract seeking to prevent TDM would have no force, in circumstances where the exception applies. This makes the exception a much more useful tool than it would otherwise be.

As you might expect, the copies made for TDM purposes may not be shared or used for any other purpose under the exception.

Significantly, the analysis must be “…for the sole purpose of research for a non-commercial purpose.” This is a major restriction, which would rule out many situations where TDM might be used, for example research by a pharmaceutical company developing new drugs which will be marketed commercially. A major issue with the exception is that it can be unclear at what point “non-commercial” shades into “commercial.” A project which starts out as academic research may take on commercial significance down the line and a piece of research with no commercial aspects may be funded by commercial sponsors. It is an important constraint in the legislation, and one which can be difficult to be sure about in real-life situations. It can stand in the way of joint projects by HEIs and commercial organisations.

Still, in situations where we can claim there is no commercial aspect to the research, the exception is potentially very useful. In addition to material which is already digital, it can cover projects where copyright-protected print material must be digitised in order to be analysed. It can be very useful in situations where the copyright status of the source material is unclear, since, provided the exception applies, there is no need to investigate further the complexities of copyright in the material.

The new EU TDM exception or rather exceptions

The EU Directive on Copyright in the Digital Single Market (DSM Directive) offers two new exceptions, which EU member states are obliged to transpose. They can be found in Articles 3 and 4 of the Directive.

There are important differences from the UK approach in the answer to the question: who may carry out the TDM? Article 3 provides an exception which benefits two defined categories of organisations: “Research organisations” and “Cultural heritage organisations.” Included within those groups are, for example, universities, museums and publicly funded libraries. Commercial organisations are excluded. It seems that independent researchers not associated with an organisation would also be excluded, even though their research might be “non-commercial.” In common with the UK legislation, this exception cannot be overridden by contract terms and is therefore a powerful tool. The Directive addresses the question of public-private research collaborations in its recitals, e.g. recital 11. They are not excluded from benefitting from the Article 3 exception.

Article 4 offers a separate TDM exception which is available to anyone (including commercial organisations) but which is limited in a specific way: if the rights owners explicitly reserve the right to carry out TDM on their works, those works cannot be mined under the exception. In other words, the EU DSM Directive goes one step further than the UK by offering an exception which can be used to mine lawfully accessible works by commercial organisations (or by anyone else), but it does not apply if the rights owner has explicitly ruled out TDM. By contrast, commercial organisations would not be able to use the UK exception, unless they can claim the specific research is for a non-commercial purpose.

Guest post by Chris Holland, UCL Copyright Support Officer. For more information or advice contact: copyright@ucl.ac.uk

ORCID Updates for 2021

Kirsty, 14 April 2021

Over the past year, we have written a number of blog posts about ORCID, giving you lots of options for making the best use of your ORCID, including using it to add your research outputs to RPS, and a series of ways that you can automatically populate your ORCID and save time! While all of these posts are still relevant, and we recommend having a look, there are a few updates that we wanted to share with you.

ORCID have recently added Data Management Plan as a new work-type you can include in your ORCID, which is great news. In addition to this, ORCID have now made it possible to record funding peer review contributions in your ORCID record by linking your ORCID to Je-S, increasing the number of work types you can add to ORCID to 44!

ORCID have also relaunched the help and support part of their website info.orcid.org to make it easier to access updates, FAQs and blog posts. I really enjoyed this recent post in which they interviewed Dr. Romero-Olivares, assistant professor at New Mexico State University, about her experiences using ORCID throughout her career and the ways that having an ORCID has made maintaining her CV easier over the years.

After this blog was published, ORCID also announced that they have started supporting CRediT – the Contributor Roles Taxonomy. This is a great step, so if you have published in a journal that uses CRediT, keep an eye out for the option to add this to your ORCID record soon!

Finally, ORCID have released a new video tour of the ORCID record that you can see below. Our prior posts included their earlier video explaining what ORCID is and its advantages; this new video aims to remind you of the key features of the interface and answer a few questions you may have about how to maintain your personal ORCID record.

A Quick Tour of the ORCID Record from ORCID.

Persistent Identifiers 101

Kirsty, 27 July 2020

You might have heard the phrase ‘Persistent Identifier’ or even ‘PID’ in passing, but what does it actually mean?

“A persistent identifier (PID) is a long-lasting reference to a resource. That resource might be a publication, dataset or person. Equally it could be a scientific sample, funding body, set of geographical coordinates, unpublished report or piece of software. Whatever it is, the primary purpose of the PID is to provide the information required to reliably identify, verify and locate it.” – OpenAIRE

These identifiers either connect to a set of metadata describing an item, or link to the item itself.  

In 2018, the Tickell report was released. It presented independent advice about Open Access, which had implications for the world of PIDs. Adam Tickell recommended that Jisc lead a project to select and promote a range of unique identifiers for different purposes, to try and limit the amount of confusion and duplication in this area.  

The Jisc project has been in progress for the last year. They are working on what they describe as ‘priority PIDs’ which cover the following categories:

  • People 
  • Works 
  • Organisations 
  • Grants 
  • Projects 

So what are the PIDs we need to be aware of? 

People 

The primary PID for people is one that you will already be familiar with if you are a regular reader of the blog. Even if you aren’t, you have probably heard of it – it’s ORCID.  

ORCID is an open identifier for individuals that allows you to secure accurate attribution for all of your outputs. It also functions quite nicely as an online bibliography, and can be used to automatically collect and record your papers in RPS. All in all, it’s pretty useful.

If you want to know more about what you can do with ORCID, have a look at our recent blog post ‘Getting the best out of your ORCID’. All of the details about linking ORCID to RPS, and vice versa, are available on the blog and the Open Access website.

Works 

The next identifier is for works. It’s another that you have probably seen, even if you don’t know a lot about it: the DOI. DOI stands for Digital Object Identifier. It’s a unique registration number for a Digital Object. This could be an article or a dataset, but it could equally be an image, a book, or even a chapter in a book. DOIs are unique and persistent, which means that if your chosen journal changes publisher, you will still be able to find your article because the DOI is independent and will keep up to date.

DOIs are most often acquired through a Registration Agency called Crossref, but you will also come across DataCite. Both of these services do the same job, providing and tracking DOIs, but the underlying tools are slightly different.

Did you know: if you have the DOI of a paper, an easy way to find that paper is to add https://doi.org/ to the front. The URL this creates will take you to the paper, no matter who published it. For example: 10.1080/08870446.2019.1679373 is a DOI, and https://doi.org/10.1080/08870446.2019.1679373 will take you straight to the paper.
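If you have a whole list of DOIs to turn into links, the same trick can be scripted. The snippet below is only a minimal sketch in Python: the function name doi_to_url is made up for this illustration, and the handling of a pasted "doi:" prefix or full link is an assumption about how DOIs tend to be copied around, not part of any official tool.

```python
from urllib.parse import quote

# The resolver address mentioned above.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a https://doi.org/ link (hypothetical helper)."""
    doi = doi.strip()
    # Assumption: DOIs are sometimes pasted with a "doi:" prefix or as a full link,
    # so strip those off before rebuilding the URL.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Percent-encode characters that are unsafe in a URL, but keep the "/" separator.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.1080/08870446.2019.1679373"))
```

Run on the example DOI above, this prints the same https://doi.org/ link that you would get by adding the prefix by hand.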

Organisations 

The Research Organisation Registry (ROR) is a new PID registry that is being created by key stakeholders, including Crossref and Jisc, to bring more detail and consistency to organisational identifiers. The definition of organisations goes beyond institutions like UCL to include any organisation that is involved in research production or management, so this can include funders, publishers, research institutes and scholarly societies.   

Grants 

Crossref is key in the identification of individual funders and in creating identifiers for research grants. Grant IDs are DOIs, but connected to grant-specific metadata such as award type, value and investigators. The intent is for funders to register each grant and provide a Grant ID, which has the potential to make tracking papers and data linked to individual projects much simpler in the long run. Several hundred grants have been registered already, mostly via Wellcome. (With thanks to Rachael Lammey for the clarification 03/08/2020)

Projects 

The Jisc project is supporting Research Activity ID (RAiD), a project based in Australia which creates a unique identifier for a research project. The intent is for this to be the final part of a network of identifiers that will allow people, works, and institutions to be linked to their projects and funders. This will complete the chain and allow accurate attribution and accountability at every stage of the research process.   

How can I get involved? 

The work being undertaken to select and support individual PIDs at each stage of the research process is a good idea, and if it works then it will be a step towards a fully interconnected, open and transparent research process. The next stage of the Jisc project is currently underway, and they are surveying all sectors of the UK research community about awareness, use, and experience of PIDs. If you want to contribute, their survey is open and has just been extended until 21 August!  

Diagram: PIDs environment

Spotlight on: Kudos – helping people find, read, understand and cite your research

Kirsty, 3 June 2020

Kudos (growkudos.com) is not a social networking site, or yet another profile – it’s a toolkit. Kudos is a free service which exists to help you manage your profiles and social media posts more effectively to maximize visibility of your work.

Kudos allows you to claim and describe your work for a variety of audiences, from your colleagues, to potential multi-disciplinary collaborators, to the general public. It also allows each contributor to put a personal statement onto a paper, describing your part in the work and putting your own personal spin on it. For example this publication, chosen at random, has been annotated with a short summary, had an image added, and each of the contributors has added a short personal comment.

Then all you have to do is use the inbuilt tools to share to multiple sources at once. You can even generate trackable links in Kudos for items without DOIs, so that however you do share your work – via email, social media, posters, discussion groups, scholarly networks etc – you can track which of those is really helping you maximize readership.

The metrics generated by these links include the number of people you have reached, the number of views, a global breakdown (which countries is your work attracting attention in), the Altmetric score (how is your work being discussed online), citation counts for publications, and a granular breakdown of the different ways you have communicated and which of these have been most effective. A recent study has shown that explaining and sharing via Kudos takes on average 10 minutes and leads to over 20% more downloads.

Kudos Pro

Kudos have recently launched Kudos Pro, a pro version of their free-to-use platform, which extends their service beyond publications into the rest of your research. This new service allows you to create profile pages for your work – whether for a specific project, or a general overview of your body of work. These pages are quick and easy to set up using a template. For example, this project, chosen at random, includes links to the profiles of the contributors and institutions, some publications as well as images and an extensive background to the project.

You can link from these pages to relevant materials and outputs, from links to surveys, code, data, images, to links to pre-prints/publications in your institutional repository, publisher website, pre-print server or even Kudos itself – this helps you provide a single ‘entry point’ to which you can direct people looking for more info about your work – while also enabling you to post outputs on other appropriate sites as you normally would.

Kudos Pro also includes a planning tool which can guide you through creating a communication, engagement and impact plan, helping you to identify target audiences, impact goals, and different activities that will help you achieve those goals with your project. You can also gather evidence of engagement and impact within this tool and download the plan and results for reporting, or to submit as part of a grant application to demonstrate the rigour with which you will plan and manage impact of your project.

Free access to Kudos Pro

Given that many of the usual ways researchers communicate their work are currently off limits due to the current situation (e.g. conferences, workshops, meetings with stakeholders etc) Kudos have opened up the pro platform so that researchers can use it for free – people can claim their free access by signing up at https://growkudos.com/hub/projects

Kudos are also maintaining a project of their own collating Covid-19 research that has been annotated.