Open@UCL Blog

Archive for the 'Guest post' Category

Open Source Software Design for Academia

By Kirsty, on 27 August 2024

Guest post by Julie Fabre, PhD candidate in Systems Neuroscience at UCL. 

As a neuroscientist who has designed several open source software projects, I’ve experienced firsthand both the power and pitfalls of the process. Many researchers, myself included, have learned to code on the job, and there’s often a significant gap between writing functional code and designing robust software systems. This gap becomes especially apparent when developing tools for the scientific community, where reliability, usability, and maintainability are crucial.

My journey in open source software development has led to the creation of several tools that have gained traction in the neuroscience community. One such project is bombcell: a software package designed to assess the quality of recorded neural units. This tool replaces what was once a laborious manual process and is now used in over 30 labs worldwide. Additionally, I’ve developed several other, smaller toolboxes for neuroscience.

These efforts were recognized last year when I received an honourable mention in the UCL Open Science and Scholarship Awards.

In this post, I’ll share insights gained from these experiences. I’ll cover, with some simplified examples from my toolboxes:

  1. Core design principles
  2. Open source best practices for academia

Disclaimer: I am not claiming to be an expert. Don’t view this as a definitive guide, but rather as a conversation starter.


Follow Julie’s lead: Whether you’re directly involved in open source software development or any other aspect of open science and scholarship, or simply know someone who has made important contributions, consider applying yourself or nominating a colleague for this year’s UCL Open Science and Scholarship Awards to gain recognition for outstanding work!


Part 1: Core Design Principles

As researchers, we often focus on getting our code to work, but good software design goes beyond just functionality. If you want to maintain and build upon your software, following a few principles from the get-go will elevate it from “it works” to “it’s a joy to use, maintain and contribute to”.

1. Complexity is the enemy

A primary goal of good software design is to reduce complexity. One effective way to simplify complex functions with many parameters is to use configuration objects. This approach not only reduces parameter clutter but also makes functions more flexible and maintainable. Additionally, breaking down large functions into smaller, more manageable pieces can significantly reduce overall complexity.

Example: Simplifying a data analysis function

For instance, in bombcell we run many different quality metrics, and each quality metric is associated with several other parameters. In the main function, instead of inputting all the different parameters independently:

[qMetric, unitType] = runAllQualityMetrics(plotDetails, plotGlobal, verbose, reExtractRaw, saveAsTSV, removeDuplicateSpikes, duplicateSpikeWindow_s, detrendWaveform, nRawSpikesToExtract, spikeWidth, computeSpatialDecay, probeType, waveformBaselineNoiseWindow, tauR_values, tauC, computeTimeChunks, deltaTimeChunks, presenceRatioBinSize, driftBinSize, ephys_sample_rate, nChannelsIsoDist, normalizeSpDecay, (... many many more parameters ...), rawData, savePath);

they are all stored in a ‘param’ object that is passed to the function:

[qMetric, unitType] = runAllQualityMetrics(param, rawData, savePath);

This approach reduces parameter clutter and makes the function more flexible and maintainable.
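
For illustration, here is a minimal sketch of how such a ‘param’ object can be assembled before the call. The field names below are taken from the parameter list above, but the values are made up for the example and are not bombcell defaults:

% Collect all quality metric settings in a single struct (illustrative values only)
param = struct();
param.verbose = true;                  % print progress messages
param.plotGlobal = true;               % plot summary figures across all units
param.removeDuplicateSpikes = true;
param.duplicateSpikeWindow_s = 0.0001; % window used to flag duplicate spikes, in seconds
param.computeSpatialDecay = true;
param.tauR_values = 0.002;             % refractory period duration, in seconds
% (... further fields are added in the same way ...)

[qMetric, unitType] = runAllQualityMetrics(param, rawData, savePath);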

2. Design for change

Research software often needs to adapt to new hypotheses or methodologies. When writing a function, ask yourself “what additional functionalities might I need in the future?” and design your code accordingly. Implementing modular designs allows for easy modification and extension as research requirements evolve. Consider using dependency injection to make components more flexible and testable. This approach separates the creation of objects from their usage, making it easier to swap out implementations or add new features without affecting existing code.
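
To make the dependency injection idea concrete, here is a small made-up MATLAB sketch (not bombcell code): a helper takes the metric it applies as a function handle, so a different implementation, or a stub used in a test, can be swapped in without touching the loop itself.

function metricValues = computeMetricForAllUnits(metricFcn, param, rawData)
    % Apply whichever metric function was injected to every unit
    nUnits = length(rawData);
    metricValues = zeros(nUnits, 1);
    for iUnit = 1:nUnits
        metricValues(iUnit) = metricFcn(param, rawData);
    end
end

% Usage: inject the real metric, or a trivial stub when testing
% presenceRatios = computeMetricForAllUnits(@(p, d) bc.qm.presenceRatio(p, d), param, rawData);
% stubValues = computeMetricForAllUnits(@(p, d) 1, param, rawData);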

Example: Modular design for a data processing pipeline

Instead of a monolithic script:

function runAllQualityMetrics(param, rawData, savePath)
% Hundreds of lines of code doing many different things
(...)
end

Create a modular pipeline that separates each quality metric into a different function:

function qMetric = runAllQualityMetrics(param, rawData, savePath)
    nUnits = length(rawData);
    for iUnit = 1:nUnits
        % step 1: calculate percentage spikes missing
        qMetric.percSpikesMissing(iUnit) = bc.qm.percSpikesMissing(param, rawData);
        % step 2: calculate fraction refractory period violations
        qMetric.fractionRPviolations(iUnit) = bc.qm.fractionRPviolations(param, rawData);
        % step 3: calculate presence ratio
        qMetric.presenceRatio(iUnit) = bc.qm.presenceRatio(param, rawData);
        (...)
        % step n: calculate distance metrics
        qMetric.distanceMetric(iUnit) = bc.qm.getDistanceMetric(param, rawData);
    end
    bc.qm.saveQMetrics(qMetric, savePath)
end

This structure allows for easy modification of individual steps or addition of new steps without affecting the entire pipeline.

In addition, this structure makes it easy to define new parameters that modify the behavior of the subfunctions. For instance, we can add different methods (such as the ‘gaussian’ option below) without changing how any of the functions are called!

param.percSpikesMissingMethod = 'gaussian';
qMetric.percSpikesMissing(iUnit) = bc.qm.percSpikesMissing(param, rawData);

and then, inside the function:

function percSpikesMissing = percSpikesMissing(param, rawData)
    % Choose the estimation method based on the parameter set by the user
    if strcmp(param.percSpikesMissingMethod, 'gaussian')
        (...)
    else
        (...)
    end
end

3. Hide complexity

Expose only what’s necessary to use a module or function, hiding the complex implementation details. Use abstraction layers to separate interface from implementation, providing clear and concise public APIs while keeping complex logic private. This approach not only makes your software easier to use but also allows you to refactor and optimize internal implementations without affecting users of your code.

Example: Complex algorithm with a simple interface

For instance, in bombcell there are many parameters. When we run the main script that calls all quality metrics, we also want to ensure that all parameters are present and in the correct format.

function qMetric = runAllQualityMetrics(param, rawData, savePath)
    % Complex input validation that is hidden from the user
    param_complete = bc.qm.checkParameterFields(param);

    % Core function that calculates all quality metrics
    nUnits = length(rawData);

    for iUnit = 1:nUnits
        % steps 1 to n
        (...)
    end

end

Users of this function don’t need to know about the input validation or other complex calculations. They just need to provide input and options.
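
As a rough sketch of what such a hidden validation step can look like (illustrative only, not the actual bombcell implementation; the default values are invented), the helper can simply fill in sensible defaults for any fields the caller did not set:

function param = checkParameterFields(param)
    % Fill in default values for any parameter fields the user has not set
    defaults = struct('verbose', true, ...
        'removeDuplicateSpikes', true, ...
        'tauR_values', 0.002);
    defaultNames = fieldnames(defaults);
    for iField = 1:numel(defaultNames)
        if ~isfield(param, defaultNames{iField})
            param.(defaultNames{iField}) = defaults.(defaultNames{iField});
        end
    end
end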

4. Write clear code

Clear code reduces the need for extensive documentation and makes your software more accessible to collaborators. Use descriptive and consistent variable names throughout your codebase. When dealing with specific quantities, consider adding units to variable names (e.g., ‘time_ms’ for milliseconds) to improve clarity. You can add comments to explain non-obvious logic and to add general outlines of the steps in your code. Following consistent coding style and formatting guidelines across your project also contributes to overall clarity.

Example: Improving clarity in a data processing function

Instead of an entirely mysterious function:

function [ns, sr] = ns(st, t)
ns = numel(st);
sr = ns/t;

Add more descriptive variable and function names and add function headers:

function [nSpikes, spikeRate] = numberSpikes(theseSpikeTimes, totalTime_s)
    % Count the number of spikes for the current unit
    % ------
    % Inputs
    % ------
    % theseSpikeTimes: [nSpikesForThisUnit × 1 double vector] time, in seconds, of each of the unit's spikes.
    % totalTime_s: [double] total recording time, in seconds.
    % ------
    % Outputs
    % ------
    % nSpikes: [double] number of spikes for the current unit.
    % spikeRate: [double] spike rate for the current unit, in spikes per second (Hz).
    % ------
    nSpikes = numel(theseSpikeTimes);
    spikeRate = nSpikes / totalTime_s;
end

5. Design for testing

Incorporate testing into your design process from the beginning. This not only catches bugs early but also encourages modular, well-defined components.

Example: Testable design for a data analysis function

For the simple ‘numberSpikes’ function we defined above, we can write a few tests covering various scenarios and edge cases to ensure the function works correctly. For instance, we can test a normal case with a few spikes and an empty spike-times input.

function testNormalCase(testCase)
    theseSpikeTimes = [0.1, 0.2, 0.3, 0.4, 0.5];
    totalTime_s = 1;
    [nSpikes, spikeRate] = numberSpikes(theseSpikeTimes, totalTime_s);
    verifyEqual(testCase, nSpikes, 5, 'Number of spikes should be 5');
    verifyEqual(testCase, spikeRate, 5, 'Spike rate should be 5 Hz');
end

function testEmptySpikeTimes(testCase)
    theseSpikeTimes = [];
    totalTime_s = 1;
    [nSpikes, spikeRate] = numberSpikes(theseSpikeTimes, totalTime_s);
    verifyEqual(testCase, nSpikes, 0, 'Number of spikes should be 0 for empty input');
    verifyEqual(testCase, spikeRate, 0, 'Spike rate should be 0 for empty input');
end
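
In MATLAB’s function-based testing framework, test functions like these live in a test file whose first function collects the local tests into a suite, which can then be run with runtests. A minimal sketch, assuming the file is named testNumberSpikes.m and numberSpikes is on the path:

function tests = testNumberSpikes
    % Gather the local test functions in this file into a test suite
    tests = functiontests(localfunctions);
end

% Run from the command window with:
% results = runtests('testNumberSpikes');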

This design allows for easy unit testing of individual components of the analysis pipeline.

Part 2: Open Source Best Practices for Academia

While using version control and having a README, documentation, a license, and contribution guidelines are all essential, I have found that the following practices have the most impact:

Example Scripts and Toy Data

I have found that the most useful things you can provide with your software are example scripts and, even better, toy data that loads in your example script. Users can then quickly test your software and see how to use it on their own data, and are then more likely to adopt it. If possible, package the example scripts as Jupyter notebooks/MATLAB live scripts (or equivalent) demonstrating key use cases. In bombcell, we provide a small dataset (Bombcell Toy Data on GitHub) and a MATLAB live script that runs bombcell on this small toy dataset (Getting Started with Bombcell on GitHub).

Issue-Driven Improvement

To manage user feedback effectively, enforce the use of an issue tracker (like GitHub Issues) for all communications. This approach ensures that other users can benefit from conversations and reduces repetitive work. When addressing questions or bugs, consider if there are ways to improve documentation or add safeguards to prevent similar issues in the future. This iterative process leads to more robust and intuitive software.

Citing

Make your software citable quickly. Before (or instead of) publishing, you can generate a citable DOI using a service like Zenodo. Consider also publishing in the Journal of Open Source Software (JOSS) for light peer review. Clearly outline how users should cite your software in their publications to ensure proper recognition of your work.

Conclusion

These practices can help create popular, user-friendly, and robust academic software. Remember that good software design is an iterative process, and continuously seeking feedback and improving your codebase (and sometimes entirely rewriting/refactoring parts) will lead to more robust code.

To go deeper into principles of software design, I highly recommend reading “A Philosophy of Software Design” by John Ousterhout or “The Good Research Code Handbook” by Patrick J. Mineault.

Get involved! 

The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Join our mailing list, and follow us on X (formerly Twitter) and LinkedIn to stay connected for updates, events, and opportunities.


UCL Open Science & Scholarship Awards – Update from Mike and Gesche!

By Kirsty, on 21 August 2024

As part of our work at the Office this year, we’ve made it a priority to stay connected with all of our award winners. Some of them shared their experiences during our conference, and we’re already well on our way to planning another exciting Awards ceremony for this year’s winners!

You can apply now for the UCL Open Science & Scholarship Awards 2024 to celebrate UCL students and staff who are advancing and promoting open science and scholarship. The awards are open to all UCL students, PhD candidates, professional services, and academic staff across all disciplines. There’s still time to submit your applications and nominations in all categories— the deadline is 1 September!

To give you some inspiration for what’s possible in open science, Mike Fell has given us an update on the work that he and Gesche have done since receiving their award last year:


In autumn last year, we were surprised and really happy to hear we’d received the first UCL Open Scholarship Awards. Even more so when we heard at the ceremony about the great projects that others at UCL are doing in this space.

The award was for work we’d done (together with PhD colleague Nicole Watson) to improve the transparency, reproducibility, and quality (TReQ) of research in applied multidisciplinary areas like energy. This included producing videos, writing papers, and delivering teaching and related resources.

Of course, it’s nice for initiatives you’ve been involved in to be recognized. But even better have been some of the doors this recognition has helped to open. Shortly after getting the award, we were invited to write an opinion piece for PLOS Climate on the role of open science in addressing the climate crisis. We also engaged with leadership at the Center for Open Science.

More broadly – although it’s always hard to draw direct connections – we feel the award has had career benefits. Gesche was recently appointed Professor of Environment & Human Health at the University of Exeter, and Director of the European Centre for Environment and Human Health. As well as highlighting her work on open science, and the award, in her application, this now provides an opportunity to spread the work further beyond the bounds of UCL and our existing research projects.

There’s still a lot to do, however. While teaching about open science is now a standard part of the curriculum for graduate students in our UCL department (and Gesche is planning this for the ECEHH too), we don’t have a sense that this is common in energy research, other applied research fields, and education more broadly. It’s still quite rare to see tools like pre-analysis plans, reporting guidelines, and even preprints employed in energy research.

A new research centre we are both involved in, the UKRI Energy Demand Research Centre, has been up and running for a year, and with lots of the setup stage now complete and staff in place, we hope to pick up a strand of work in this area. Gesche is the data champion for the Equity theme of that centre. The new focus must be on how to better socialize open research practices and make them more a part of the culture of doing energy research. We look forward to continuing to work with UCL Open Science in achieving that goal.

Get involved!

The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Join our mailing list, and follow us on X (formerly Twitter) and LinkedIn to stay connected for updates, events, and opportunities.


Copyright and AI, Part 2: Perceived Challenges, Suggested Approaches and the Role of Copyright literacy

By Rafael, on 15 August 2024

Guest post by Christine Daoutis (UCL), Alex Fenlon (University of Birmingham) and Erica Levi (Coventry University).

This blog post is part of a collaborative series between the UCL Office for Open Science and Scholarship and the UCL Copyright team exploring important aspects of copyright and its implications for open research and scholarship. 

An artist’s illustration of AI by Tim West. Photo by Google DeepMind from Pexels.

A previous post outlined copyright-related questions when creating GenAI materials—questions related to ownership, protection/originality, and infringement when using GenAI. The post discussed how answers to these questions are not straightforward, largely depend on what is at stake and for whom, and are constantly shaped by court cases as they develop.

What does this uncertainty mean for students, academics, and researchers who use GenAI and, crucially, for those in roles that support them? To what extent does GenAI create new challenges, and to what extent are these uncertainties inherent in working with copyright? How can we draw on existing expertise to support and educate on using GenAI, and what new skills do we need to develop?

In this post, we summarise a discussion we led as part of our workshop for library and research support professionals at the Research Libraries UK (RLUK) annual conference in March 2024. This year’s conference title was New Frontiers: The Expanding Scope of the Modern Research Library. Unsurprisingly, when considering the expanding scope of libraries in supporting research, GenAI is one of the first things that comes to mind.

Our 16 workshop participants came from various roles, research institutions, and backgrounds. What they had in common was an appetite to understand and support copyright in the new context of AI, and a collective body of expertise that, as we will see, is very useful when tackling copyright questions in a novel context. The workshop consisted of presentations and small group discussions built around the key themes outlined below.

Perceived Challenges and Opportunities
Does the research library community overall welcome GenAI? It is undoubtedly viewed as a way to make scholarship easier and faster, offering practical solutions—for example, supporting literature reviews or facilitating draft writing by non-English speakers. Beyond that, several participants see an opportunity to experiment, perhaps becoming less risk-averse, and welcome new tools that can make research more efficient in new and unpredictable ways.

However, concerns outweigh the perceived benefits. It was repeatedly mentioned that there is a need for more transparent, reliable, sustainable, and equitable tools before adopting them in research. Crucially, users need to ask themselves what exactly they are doing when using GenAI, their intention, what sources are being used, and how reliable the outputs are.

Concerns about copyright raised by GenAI were seen as an opportunity to place copyright literacy at the forefront. The need for new guidance is evident, particularly around the use of different tools with varying terms and conditions, and this is also perceived as an opportunity to revive and communicate existing copyright principles in a new light.

Suggested Solutions
One of the main aims of the workshop was to address challenges posed by GenAI. Participants were very active in putting forward ideas but also expressed concerns and frustration. For example, they questioned the feasibility of shaping policy and processes when the tools themselves constantly evolve, when there is very little transparency around the sources used, and when it is challenging to reach agreement even on essential concepts. Debates on whether ‘copying’ is taking place, whether an output is a derivative of a copyrighted work, and even whether an output is protected are bound to limit the guidance we develop.

Drawing from Existing Skills and Expertise
At the same time, it was acknowledged that copyright practitioners already have expertise, guidance, and educational resources relevant to questions about GenAI and copyright. While new guidance and training are necessary, the community can draw from a wealth of resources to tackle questions that arise while using GenAI. Information literacy principles should still apply to GenAI. Perhaps the copyright knowledge and support are already available; what is missing is a thorough understanding of the new technologies, their strengths, and limitations to apply existing knowledge to new scenarios. This is where the need for collaboration arises.

Working Together
To ensure that GenAI is used ethically and creatively, the community needs to work collaboratively with providers, creators, and users of those tools. By sharing everyday practices, the community can inform and shape decisions, guidance, and processes. It is also important to acknowledge that the onus is not just on copyright practitioners to understand the tools but also on the developers to make them transparent and reliable. Once the models become more transparent, it should be possible to support researchers better. This is even more crucial in supporting text and data mining (TDM) practices, which are critical in many research areas, to limit further restrictions following the implementation of AI models.

Magic Changes
With so much excitement around AI, we felt we should ask the group to identify the one magic change that would help remove most of the concerns. Interestingly, the consensus was that clarity around the sources and processes used by GenAI models is essential. How do the models come up with their answers and outputs? Is it possible to have clearer information about the sources’ provenance and the way the models are trained, and can this inform how authorship is established? And what criteria should be put in place to ensure the models are controlled and reliable?

This brings the matter back to the need for GenAI models to be regulated—a challenging but necessary magic change that would help us develop our processes and guidance with much more confidence.

Concluding Remarks
While the community of practitioners waits for decisions and regulations that will frame their approach, it is within their power to continue to support copyright literacy, referring to new and exciting GenAI cases. Not only do those add interest, but they also highlight an old truth about copyright, namely, that copyright-related decisions always come with a degree of uncertainty, risk, and awareness of conflicting interests.

About the authors 

Christine Daoutis is the UCL Copyright Support Officer at UCL. Christine provides support, advice and training on copyright as it applies to learning, teaching and research activities, with a focus on open science practices. Resources created by Christine include the UCL Copyright Essentials tutorial and the UCL Copyright and Your Teaching online tutorial.

Alex Fenlon is the Head of Copyright and Licensing within Libraries and Learning Resources at the University of Birmingham. Alex and his team provide advice and guidance on copyright matters, including text and data mining and AI, to ensure that the law and practice are understood by all.

Erica Levi is the Digital repository and Copyright Lead at Coventry University. Erica has created various resources to increase awareness of copyright law and open access through gamification. Her resources are available on her website.

Get involved!

The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Join our mailing list, and follow us on X (formerly Twitter) and LinkedIn to be part of the conversation and stay connected for updates, events, and opportunities.


Text and Data Mining (TDM) and Your Research: Copyright Implications and New Website Guidance

By Rafael, on 13 May 2024

This is the second blog post of our collaborative series between the UCL Office for Open Science and Scholarship and the UCL Copyright team. Here, we continue our exploration of important aspects of copyright and its implications for open research and scholarship. In this instalment, we examine Text and Data Mining (TDM) and its impact on research, along with the associated copyright considerations.

Image by storyset on Freepik.

The development of advanced computational tools and techniques for analysing large amounts of data has opened up new possibilities for researchers. Text and Data Mining (TDM) is a broad term referring to a range of ‘automated analytical techniques to analyse text and data for patterns, trends, and useful information’ (Intellectual Property Office definition). TDM has many applications in academic research across disciplines.

In an academic context, the most common sources of data for TDM include journal articles, books, datasets, images, and websites. TDM involves accessing, analysing, and often reusing (parts of) these materials. As these materials are, by default, protected by copyright, there are limitations around what you can do as part of TDM. In the UK, you may rely on section 29A of the Copyright, Designs and Patents Act, a copyright exception for making copies for text and data analysis for non-commercial research. You must have lawful access to the materials (for example via a UCL subscription or via an open license). However, there are often technological barriers imposed by publishers preventing you from copying large amounts of materials for TDM purposes – measures that you must not try to circumvent. Understanding what you can do with copyright materials, what may be more problematic and where to get support if in doubt, should help you manage these barriers when you use TDM in your research.

The copyright support team works with e-resources, the Library Skills librarians, and the Office for Open Science and Scholarship to support the TDM activities of UCL staff and students. New guidance is available on the copyright website’s TDM libguide and addresses questions that often arise during TDM, including:

  • Can you copy journal articles, books, images, and other materials? What conditions apply?
  • What do you need to consider when sharing the outcomes of a TDM analysis?
  • What do publishers and other suppliers of the TDM sources expect you to do?

To learn more about copyright (including how it applies to TDM), visit the UCL copyright website or contact the copyright support team.

Get involved!

The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Stay connected for updates, events, and opportunities. Follow us on X (formerly Twitter) and LinkedIn, and join our mailing list to be part of the conversation!


Launching today: Open Science Case Studies

By Kirsty, on 29 April 2024

Announcement from Paul Ayris, Pro-Vice Provost, UCL Library, Culture, Collections and Open Science

How can Open Science/Open Research support career progression and development? How does the adoption of Open Science/Open Research approaches benefit individuals in the course of their career?

The UCL Open Science Office, in conjunction with colleagues across UCL, has produced a series of Case Studies showing how UCL academics can use Open Science/Open Research approaches in their plans for career development, in applications for promotion and in appraisal documents.

In this way, Open Science/Open Research practice can become part of the Research Culture that UCL is developing.

The series of Case Studies covers each of the 8 pillars of Open Science/Open Research. They can be found on a new webpage: Open Science Case Studies 4 UCL.

It is only fair that academics should be rewarded for developing their skills and adopting best practice in research and in its equitable dissemination. The Case Studies show how this can be done, and each Case Study identifies a Key Message which UCL academics can use to shape their activities.

Examples of good practice are:

  • Publishing outputs as Open Access outputs
  • Sharing research data which is used as the building block of academic books and papers
  • Creating open source software which is then available for others to re-use and develop
  • Adopting practices allied to Reproducibility and Research Integrity
  • The responsible use of Bibliometrics
  • Public Engagement: Citizen Science and Co-Production as mechanisms to deliver results

Contact the UCL Open Science Office for further information at openscience@ucl.ac.uk.

UCL open access output: 2023 state-of-play

By Kirsty, on 15 April 2024

Post by Andrew Gray (Bibliometrics Support Officer) and Dominic Allington Smith (Open Access Publications Manager)

Summary

UCL is a longstanding and steadfast supporter of open access publishing, organising funding and payment for gold open access, maintaining the UCL Discovery repository for green open access, and monitoring compliance with REF and research funder open access requirements. Research data can be made open access in the Research Data Repository, and UCL Press also publishes open access books and journals.

The UCL Bibliometrics Team have recently conducted research to analyse UCL’s overall open access output, covering both the total number of papers in different OA categories and their citation impact. This blog post presents the key findings:

  1. UCL’s overall open access output has risen sharply since 2011, flattened around 80% in the last few years, and is showing signs of slowly growing again – perhaps connected with the growth of transformative agreements.
  2. The relative citation impact of UCL papers has had a corresponding increase, though with some year-to-year variation.
  3. UCL’s open access papers are cited around twice as much, on average, as non-open-access papers.
  4. UCL is consistently the second-largest producer of open access papers in the world, behind Harvard University.
  5. UCL has the highest level of open access papers among a reference group of approximately 80 large universities, at around 83% over the last five years.

Overview and definitions

Publications data is taken from the InCites database.  As such, the data is primarily drawn from InCites papers attributed to UCL, filtered down to only articles, reviews, conference proceedings, and letters. It is based on published affiliations to avoid retroactive overcounting in past years: existing papers authored by new starters at UCL are excluded.

The definition of “open access” provided by InCites is all open access material – gold, green, and “bronze”, a catch-all category for material that is free-to-read but does not meet the formal definition of green or gold. This will thus tend to be a few percentage points higher than the numbers used for, for example, UCL’s REF open access compliance statistics.

Data is shown up to 2021; this avoids any complications with green open access papers which are still under an embargo period – a common restriction imposed by publishers when pursuing this route – in the most recent year.

1. UCL’s change in percentage of open access publications over time

(InCites all-OA count)

The first metric is the share of total papers recorded as open access. This has grown steadily over the last decade, from under 50% in 2011 to almost 90% in 2021, with only a slight plateau around 2017-19 interrupting progress.

2. Citation impact of UCL papers over time

(InCites all-OA count, Category Normalised Citation Impact)

The second metric is the citation impact of UCL papers. This is significantly higher than average: the most recent figure is above 2 (which means that UCL papers receive over twice as many citations as the world average; the UK university average is ~1.45) and continues a general trend of growth over time, with some occasional variation. Higher variation in recent years is to some degree expected, as it takes time for citations to accrue and stabilise.

3. Relative citation impact of UCL’s closed and Open Access papers over time

(InCites all-OA count, Category Normalised Citation Impact)

The third metric is the relative citation rates compared between open access and non-open access (“closed”) papers. Open access papers have a higher overall citation rate than closed papers: the average open access paper from 2017-21 has received around twice as many citations as the average closed paper.

4. World leading universities by number of Open Access publications

(InCites all-OA metric)

Compared to other universities, UCL produces the second-highest absolute number of open access papers in the world, climbing above 15,000 in 2021, and has consistently been the second largest publisher of open access papers since circa 2015.

The only university to publish more OA papers is Harvard. Harvard typically publishes about twice as many papers as UCL annually, but for OA papers this gap narrows to about 1.5 times as many papers as UCL.

5. World leading universities by percentage of Open Access publications

(5-year rolling average; minimum 8000 publications in 2021; InCites %all-OA metric)

UCL’s percentage of open access papers is consistently among the world’s highest.  The most recent data from InCites shows UCL as having the world’s highest level of OA papers (82.9%) among institutions with more than 8,000 papers published in 2021, having steadily risen through the global ranks in previous years.

Conclusion

The key findings of this research are very good news for UCL, indicating a strong commitment by authors and by the university to making work available openly. Furthermore, whilst high levels of open access necessarily lead to benefits relating to REF and funder compliance, the analysis also indicates that making research outputs open access leads, on average, to a greater number of citations, providing further justification for this support, which is crucial to communicating and sharing research outcomes as part of the UCL 2034 strategy.

Get involved!

The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Stay connected for updates, events, and opportunities. Follow us on X (formerly Twitter) and LinkedIn, and join our mailing list to be part of the conversation!


The Predatory Paradox – book review

By Kirsty, on 29 February 2024

Guest post from Huw Morris, Honorary Professor of Tertiary Education, UCL Institute of Education. If anyone would like to contribute to future blogs, please get in touch.

Review of The Predatory Paradox: Ethics, Politics and the Practices in Contemporary Scholarly Publishing (2023). Amy Koerber, Jesse Starkey, Karin Ardon-Dryer, Glenn Cummins, Lyombe Eko and Kerk Kee. Open Book Publishers. DOI 10.11647/obp.0364.

We are living in a publishing revolution, in which the full consequences of changes to the ways research and other scholarly work are prepared, reviewed and disseminated have yet to be fully felt or understood. It is just over thirty years since the first open access journals began to appear on email groups in html and pdf formats.

It is difficult to obtain up-to-date and verifiable estimates of the number of journals published globally. There are no recent journal articles which assess the scale of this activity. However, recent online blog sources suggest that there are at least 47,000 journals available worldwide of which 20,318 are provided in open access format (DOAJ, 2023; WordsRated, 2023). The number of journals is increasing at approximately 5% per annum and the UK provides an editorial home for the largest proportion of these titles.

With this rapid expansion, questions have been raised about whether there are too many journals, whether they will continue to exist in their current form, and, if so, how readers and researchers can assess the quality of the editorial processes they have adopted (Brembs et al., 2023; Oosterhaven, 2015).

This new book, ‘The Predatory Paradox,’ steps into these currents of change and seeks not only to comment on developments, but also to consider what the trends mean for academics, particularly early career researchers, for journal editors and for the wider academic community. The result is an impressive collection of chapters which summarise recent debates and report the authors’ own research examining the impact of these changes on the views of researchers and crucially their reading and publishing habits.

The book is divided into seven chapters, which consider the ethical and legal issues associated with open access publishing, as well as the consequences for assessing the quality of articles and journals. A key theme in the book, as the title indicates, is tracking the development of concern about predatory publishing. Here the book mixes a commentary on the history of this phenomenon with information gained from interviews and the authors’ reflections on the impact of editorial practices on their own publication plans. In these accounts the authors demonstrate that it is difficult to tightly define what constitutes a predatory journal because peer review and editorial processes are not infallible, even at the most prestigious journals. These challenges are illustrated by the retelling of stories about recent scientific hoaxes played on so-called predatory journals and other more respected titles. These hoaxes include the submission of articles with a mix of poor research designs, bogus data, weak analyses and spurious conclusions. Building on insights derived from this analysis, the book’s authors provide practical guidance about how to avoid being lured into publishing in predatory journals and how to avoid editorial practices that lack integrity. They also survey the teaching materials used to deal with these issues in the training of researchers at the most research-intensive US universities.

One of the many excellent features of the book is that its authors practise much of what they preach. The book is available for free via open access in a variety of formats. The chapters which draw on original research provide links to the underpinning data and analysis. At the end of each chapter there is also a very helpful summary of the key takeaway messages, as well as a variety of questions and activities that can be used to prompt reflection on the text or as the basis for seminar and tutorial activities.

Having praised the book for its many fine features, it is important to note that the questions it raises about defining quality research could have been more fully answered. The authors summarise their views about what constitutes quality research under a series of headings. Drawing on evidence from interviews with researchers in a range of subject areas, they conclude that quality research defies explicit definition. They suggest, following Harvey and Green, that it is multi-factorial and changes over time with the identity of the reviewer and reader. This uncertainty, while partially accurate, has not prevented people from rating the quality of other people’s research or limited the number of people putting themselves forward for these types of review.

As the book explains, peer review by colleagues with an expertise in the same specialism, discipline or field is an integral part of the academic endeavour. Frequently there are explicit criteria against which judgements are made, whether for grant awards, journal reviewing or research assessment. The criteria may be unclear, open to interpretation, overly narrow or overly wide, but they do exist and have been arrived at through collective review and confirmed by processes involving many reviewers.

Overall, I would strongly recommend this book and suggest that it should be required or background reading on research methods courses for doctoral and research masters programmes. For readers who are not using this book as part of a course of study, I would also recommend reading the research assessment guidelines on research council and funding body websites, and the advice to authors provided by long-established journals in their field. In addition, it is worth looking at the definitions and reports on research activity provided by Research Excellence Framework panels in the UK and their counterparts in other nations.

Get involved!

The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Stay connected for updates, events, and opportunities. Follow us on X (formerly Twitter) and LinkedIn, and join our mailing list to be part of the conversation!

Getting a Handle on Third-Party Datasets: Researcher Needs and Challenges

By Rafael, on 16 February 2024

Guest post by Michelle Harricharan, Senior Research Data Steward, in celebration of International Love Data Week 2024.

ARC Data Stewards have completed the first phase of work on the third-party datasets project, aiming to help researchers better access and manage data provided to UCL by external organisations.


The problem:

Modern research often requires access to large volumes of data generated outside of universities. These datasets, provided to UCL by third parties, are typically generated during routine service delivery or other activities and are used in research to identify patterns and make predictions. UCL research and teaching increasingly rely on access to these datasets to achieve their objectives, ranging from NHS data to large-scale commercial datasets such as those provided by ‘X’ (formerly known as Twitter).

Currently, there is no centrally supported process for research groups seeking to access third-party datasets. Researchers sometimes use departmental procedures to acquire personal or university-wide licenses for third-party datasets. They then transfer, store, document, extract, and undertake actions to minimize information risk before using the data for various analyses. The process to obtain third-party data involves significant overhead, including contracts, information governance (IG) compliance, and finance. Delays in acquiring access to data can be a significant barrier to research. Some UCL research teams also provide additional support services such as sharing, managing access to, licensing, and redistributing specialist third-party datasets for other research teams. These teams increasingly take on governance and training responsibilities for these specialist datasets. Concurrently, the e-resources team in the library negotiates access to third-party datasets for UCL staff and students following established library procedures.

It has long been recognized that UCL’s processes for acquiring and managing third-party data are uncoordinated and inefficient, leading to inadvertent duplication, unnecessary expense, and underutilisation of datasets that could support transformative research across multiple projects or research groups. This was recognised in the “Data First, 2019 UCL Research Data Strategy”.

What we did:

Last year, the ARC Data Stewards team reached out to UCL professional services staff and researchers to understand the processes and challenges they faced regarding accessing and using third-party research datasets. We hoped that insights from these conversations could be used to develop more streamlined support and services for researchers and make it easier for them to find and use data already provided to UCL by third parties (where this is within licensing conditions).

During this phase of work, we spoke with 14 members of staff:

  • 7 research teams that manage third-party datasets
  • 7 members of professional services that support or may support the process, including contracts, data protection, legal, Information Services Division (databases), information security, research ethics and integrity, and the library.

What we’ve learned:

An important aspect of this work involved capturing the existing processes researchers use when accessing, managing, storing, sharing, and deleting third-party research data at UCL. This enabled us to understand the range of processes involved in handling this type of data and identify the various stakeholders involved—or who potentially need to be involved. In practice, we found that researchers follow similar processes to access and manage third-party research data, depending on the security of the dataset. However, as there is no central, agreed procedure to support the management of third-party datasets in the organization, different parts of the process may be implemented differently by different teams using the methods and resources available to them. We turned the challenges researchers identified in accessing and managing this type of data into requirements for a suite of services to support the delivery and management of third-party datasets at UCL.

Next steps:

 We have been working on addressing some of the common challenges researchers identified. Researchers noted that getting contracts agreed and signed off takes too long, so we reached out to the RIS Contract Services Team, who are actively working to build additional capacity into the service as part of a wider transformation programme.

Also, information about accessing and managing third-party datasets is fragmented, and researchers often don’t know where to go for help, particularly for governance and technical advice. To counter this, we are bringing relevant professional services together to agree on a process for supporting access to third-party datasets.

Finally, respondents noted that there is too much duplication of data. The costs for data are high, and it’s not easy to know what’s already available internally to reuse. In response, we are building a searchable catalogue of third-party datasets already licensed to UCL researchers and available for others to request access to reuse.

Our progress will be reported to the Research Data Working Group, which acts as a central point of contact and a forum for discussion on aspects of research data support at UCL. The group advocates for continual improvement of research data governance.

If you would like to know more about any of these strands of work, please do not hesitate to reach out (email: researchdata-support@ucl.ac.uk). We are keen to work with researchers and other professional services to solve these shared challenges and accelerate research and collaboration using third-party datasets.

Get involved!

The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Stay connected for updates, events, and opportunities. Follow us on X (formerly Twitter), and join our mailing list to be part of the conversation!

FAIR Data in Practice

By Rafael, on 15 February 2024

Guest post by Victor Olago, Senior Research Data Steward and Shipra Suman, Research Data Steward, in celebration of International Love Data Week 2024.

The FAIR guiding principles: Findable, Accessible, Interoperable, and Reusable. Credit: Sangya Pundir, CC BY-SA 4.0 via Wikimedia Commons.

The problem:

We all know sharing is caring, and so data needs to be shared to explore its full potential and usefulness. This makes it possible for researchers to answer questions that were not the primary research objective of the initial study. The shared data also allows other researchers to replicate the findings underpinning the manuscript, which is important in knowledge sharing. It also allows other researchers to integrate these datasets with other existing datasets, either already collected or which will be collected in the future.

There are several factors that can hamper research data sharing. These might include a lack of technical skill, inadequate funding, an absence of data sharing agreements, or ethical barriers. As Data Stewards, we support appropriate ways of collecting, standardizing, using, sharing, and archiving research data. We are also responsible for advocating best practices and policies on data. One such best practice is the promotion and implementation of the FAIR data principles.

FAIR is an acronym for Findable, Accessible, Interoperable and Reusable [1]. FAIR is about making data discoverable to other researchers, but it does not translate exactly to Open Data. Some data can only be shared with others once security considerations have been addressed. For researchers to use the data, a concept note or protocol must be in place to help gatekeepers of that data understand what each data request is meant for, how the data will be processed, and the expected outcomes of the study or sub-study. Findability and Accessibility are ensured through metadata and enforcing the use of persistent identifiers for a given dataset. Interoperability relates to applying standards and encodings such as ICD-10 and ICDO-3 [2], and, lastly, Reusability means making it possible for the data to be used by other researchers.

What we are doing:

We are currently supporting a data reuse project at the Medical Research Council Clinical Trials Unit (MRC CTU). This project enables the secondary analysis of clinical trial data. We use pseudonymisation techniques and prepare metadata that goes along with each data set.

Pseudonymisation helps process personal data in such a way that the data cannot be attributed to specific data subjects without the use of additional information [3]. This reduces the risk of reidentification of personal data. When data is pseudonymized, direct identifiers are dropped while potentially identifiable information is coded. Data may also be aggregated; for example, age is transformed into age groups. There are instances where data is sampled from the original distribution, allowing only the sample data to be shared. Pseudonymised data is still personal data, which must be protected in accordance with the GDPR [4].
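
As a toy illustration of these steps in MATLAB (a made-up example, not the MRC CTU pipeline; the variable names and age bands are invented), pseudonymising a small tabular dataset might look like this:

% Toy example: replace direct identifiers with study codes and aggregate age into bands
T = table(["Alice"; "Bob"], [34; 67], 'VariableNames', {'name', 'age_years'});
T.participantID = ["P001"; "P002"];   % coded identifier replaces the name
T.name = [];                          % drop the direct identifier
ageBandEdges = [0 18 40 65 Inf];
T.ageBand = discretize(T.age_years, ageBandEdges, ...
    'categorical', {'<18', '18-39', '40-64', '65+'});
T.age_years = [];                     % keep only the aggregated age band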

The metadata makes it possible for other researchers to locate and request access to reuse clinical trials data at MRC CTU. With the extensive documentation that is attached, reanalysis and/or integration with other datasets becomes possible once access is approved. Pseudonymisation and metadata preparation help in promoting FAIR data.

We have so far prepared one data pack for the RT01 study, ‘A randomized controlled trial of high dose versus standard dose conformal radiotherapy for localized prostate cancer’, which is currently in the review phase and almost ready to share with requestors. Over the next few years, we hope to repeat and standardise the process for past, current and future cancer, HIV, and other trials.

References:    

  1. 8 Pillars of Open Science.
  2. NHS Digital: National Clinical Coding Standards ICD-10, 5th Edition; 2022.
  3. Anonymisation and Pseudonymisation.
  4. Complete guide to GDPR compliance.

Get involved!

The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Stay connected for updates, events, and opportunities. Follow us on X (formerly Twitter), and join our mailing list to be part of the conversation!

Finding Data Management Tools for Your Research Discipline

By Rafael, on 14 February 2024

Guest post by Iona Preston, Research Data Support Officer, in celebration of International Love Data Week 2024.

Various gardening tools arranged on a dark wooden background

Photo by Todd Quackenbush on Unsplash.

While there are a lot of general resources to support good research data management practices – for example UCL’s Research Data Management webpages – you might sometimes be looking for something a bit more specific. It’s good practice to store your data in a research data repository that is subject specific, where other people in your research discipline are most likely to search for data. However, you might not know where to begin your search. You could be looking for discipline-specific metadata standards, so your data is more easily reusable by academic colleagues in your subject area. This is where subject-specific research data management resources become valuable. Here are some resources for specific subject areas and disciplines that you might find useful: 

  • The Research Data Management Toolkit for Life Sciences
    This resource guides you through the entire process of managing research data, explaining which tools to use at each stage of the research data lifecycle. It includes sections on specific life science research areas, from plant sciences to rare disease data. These sections also cover research community-specific repositories and examples of metadata standards. 
  • Visual arts data skills for researchers: Toolkits
    This consists of two different tutorials covering an introduction to research data management in the visual arts and how to create an appropriate data management plan. 
  • Consortium of European Social Science Data Archives
    CESSDA brings together data archives from across Europe in a searchable catalogue. Their website includes various resources for social scientists to learn more about data management and sharing, along with an extensive training section and a Data Management Expert Guide to lead you through the data management process. 
  • Research Data Alliance for Disciplines (various subject areas)
    The Research Data Alliance is an international initiative to promote data sharing. They have a webpage with special interest groups in various academic research areas, including agriculture, biomedical sciences, chemistry, digital humanities, social science, and librarianship, with useful resource lists for each discipline. 
  • RDA Metadata Standards Catalogue (all subject areas)
    This directory helps you find a suitable metadata scheme to describe your data, organized by subject area, featuring specific schemes across a wide range of academic disciplines. 
  • Re3Data (all subject areas)
    When it comes to sharing data, we always recommend you check if there’s a subject specific repository first, as that’s the best place to share. If you don’t know where to start finding one, this is a great place to look with a convenient browse feature to explore available options within your discipline.

These are only some of the discipline-specific tools that are available. You can find more for your discipline on the Research Data Management webpages. If you need any help or advice on finding data management resources, please get in touch with the Research Data Management team at lib-researchsupport@ucl.ac.uk.

Get involved!

alt=""The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Stay connected for updates, events, and opportunities. Follow us on X, formerly Twitter, and join our mailing list to be part of the conversation!