Open@UCL Blog

Archive for August, 2024

Open Source Software Design for Academia

By Kirsty, on 27 August 2024

Guest post by Julie Fabre, PhD candidate in Systems Neuroscience at UCL. 

As a neuroscientist who has designed several open source software projects, I’ve experienced firsthand both the power and pitfalls of the process. Many researchers, myself included, have learned to code on the job, and there’s often a significant gap between writing functional code and designing robust software systems. This gap becomes especially apparent when developing tools for the scientific community, where reliability, usability, and maintainability are crucial.

My journey in open source software development has led to the creation of several tools that have gained traction in the neuroscience community. One such project is bombcell, a tool designed to assess the quality of recorded neural units. It replaces what was once a laborious manual process and is now used in over 30 labs worldwide. Additionally, I’ve developed other, smaller toolboxes for neuroscience.

These efforts were recognized last year when I received an honourable mention in the UCL Open Science and Scholarship Awards.

In this post, I’ll share insights gained from these experiences. I’ll cover, with some simplified examples from my toolboxes:

  1. Core design principles
  2. Open source best practices for academia

Disclaimer: I am not claiming to be an expert. Don’t view this as a definitive guide, but rather as a conversation starter.


Follow Julie’s lead: Whether you’re directly involved in open source software development or any other aspect of open science and scholarship, or if you simply know someone who has made important contributions, consider applying yourself or nominating a colleague for this year’s UCL Open Science and Scholarship Awards to gain recognition for outstanding work!


Part 1: Core Design Principles

As researchers, we often focus on getting our code to work, but good software design goes beyond just functionality. Following a few principles from the get-go makes your software easier to maintain and build upon, and elevates it from “it works” to “it’s a joy to use, maintain and contribute to”.

1. Complexity is the enemy

A primary goal of good software design is to reduce complexity. One effective way to simplify complex functions with many parameters is to use configuration objects. This approach not only reduces parameter clutter but also makes functions more flexible and maintainable. Additionally, breaking down large functions into smaller, more manageable pieces can significantly reduce overall complexity.

Example: Simplifying a data analysis function

For instance, in bombcell we run many different quality metrics, and each quality metric is associated with several other parameters. In the main function, instead of inputting all the different parameters independently:

[qMetric, unitType] = runAllQualityMetrics(plotDetails, plotGlobal, verbose, reExtractRaw, saveAsTSV, removeDuplicateSpikes, duplicateSpikeWindow_s, detrendWaveform, nRawSpikesToExtract, spikeWidth, computeSpatialDecay, probeType, waveformBaselineNoiseWindow, tauR_values, tauC, computeTimeChunks, deltaTimeChunks, presenceRatioBinSize, driftBinSize, ephys_sample_rate, nChannelsIsoDist, normalizeSpDecay, (... many many more parameters ...), rawData, savePath);

they are all stored in a ‘param’ object that is passed onto the function:

[qMetric, unitType] = runAllQualityMetrics(param, rawData, savePath);

The call stays short and readable, and new parameters can be added as fields of ‘param’ without changing the function signature.
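One way to build such a ‘param’ object is a struct filled with defaults that the user then overrides. The sketch below is illustrative only: makeDefaultParam is a hypothetical helper, the field names are borrowed from the parameter list above, and the values are placeholders rather than bombcell’s actual defaults.

```matlab
% Build the parameter struct once, then override only what differs.
param = makeDefaultParam();
param.verbose = false;            % silence console output for this run
param.nRawSpikesToExtract = 200;  % placeholder value, for illustration only

% Every setting now travels in a single argument, e.g.:
% [qMetric, unitType] = runAllQualityMetrics(param, rawData, savePath);
disp(param)

function param = makeDefaultParam()
% Hypothetical helper: returns a struct of default settings.
% The values are placeholders, not bombcell's real defaults.
    param = struct();
    param.plotGlobal = true;
    param.verbose = true;
    param.removeDuplicateSpikes = true;
    param.nRawSpikesToExtract = 100;
    param.ephys_sample_rate = 30000;  % assumed to be in samples per second
end
```

Local functions inside a script need MATLAB R2016b or later; in older versions, makeDefaultParam would go in its own file.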

2. Design for change

Research software often needs to adapt to new hypotheses or methodologies. When writing a function, ask yourself “what additional functionalities might I need in the future?” and design your code accordingly. Implementing modular designs allows for easy modification and extension as research requirements evolve. Consider using dependency injection to make components more flexible and testable. This approach separates the creation of objects from their usage, making it easier to swap out implementations or add new features without affecting existing code.

Example: Modular design for a data processing pipeline

Instead of a monolithic script:

function runAllQualityMetrics(param, rawData, savePath)
% Hundreds of lines of code doing many different things
(...)
end

Create a modular pipeline that separates each quality metric into a different function:

function qMetric = runAllQualityMetrics(param, rawData, savePath)
nUnits = length(rawData);
for iUnit = 1:nUnits
    % step 1: calculate percentage spikes missing
    qMetric.percSpikesMissing(iUnit) = bc.qm.percSpikesMissing(param, rawData);
    % step 2: calculate fraction refractory period violations
    qMetric.fractionRPviolations(iUnit) = bc.qm.fractionRPviolations(param, rawData);
    % step 3: calculate presence ratio
    qMetric.presenceRatio(iUnit) = bc.qm.presenceRatio(param, rawData);
    (...)
    % step n: calculate distance metrics
    qMetric.distanceMetric(iUnit) = bc.qm.getDistanceMetric(param, rawData);
end
bc.qm.saveQMetrics(qMetric, savePath)
end

This structure allows for easy modification of individual steps or addition of new steps without affecting the entire pipeline.

In addition, this structure allows us to define new parameters easily that can then modify the behavior of the subfunctions. For instance we can add different methods (such as adding the ‘gaussian’ option below) without changing how any of the functions are called!

param.percSpikesMissingMethod = 'gaussian';
qMetric.percSpikesMissing(iUnit) = bc.qm.percSpikesMissing(param, rawData);

and then, inside the function:

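function percSpikesMissing = percSpikesMissing(param, rawData)
% strcmp compares whole strings; '==' compares character by character
% and errors when the two strings differ in length
if strcmp(param.percSpikesMissingMethod, 'gaussian')
    (...)
else
    (...)
end
end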
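The dependency injection mentioned above can be sketched with function handles: the caller chooses the implementation, and the pipeline only relies on its call signature. All names below (estimateRate, meanRate, rateEstimator) are hypothetical and not part of bombcell.

```matlab
spikeTimes_s = [0.10 0.25 0.40 0.90];  % toy spike times, in seconds
totalTime_s = 1;                       % toy recording duration, in seconds

% Inject the default implementation: spike count over recording duration.
param = struct();
param.rateEstimator = @meanRate;
rateMean = estimateRate(param, spikeTimes_s, totalTime_s);

% Swap in another implementation (number of inter-spike intervals divided by
% the time between first and last spike) without touching estimateRate.
param.rateEstimator = @(st, T) (numel(st) - 1) / (st(end) - st(1));
rateInterval = estimateRate(param, spikeTimes_s, totalTime_s);

fprintf('mean rate: %.2f Hz, interval-based rate: %.2f Hz\n', rateMean, rateInterval);

function rate = estimateRate(param, spikeTimes_s, totalTime_s)
% Calls whichever estimator was injected through param.
    rate = param.rateEstimator(spikeTimes_s, totalTime_s);
end

function rate = meanRate(spikeTimes_s, totalTime_s)
% Default estimator: number of spikes divided by recording duration.
    rate = numel(spikeTimes_s) / totalTime_s;
end
```

Because estimateRate never names a concrete estimator, a test can also inject a trivial handle (for example one that always returns 0) to check the surrounding logic in isolation.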

3. Hide complexity

Expose only what’s necessary to use a module or function, hiding the complex implementation details. Use abstraction layers to separate interface from implementation, providing clear and concise public APIs while keeping complex logic private. This approach not only makes your software easier to use but also allows you to refactor and optimize internal implementations without affecting users of your code.

Example: Complex algorithm with a simple interface

For instance, in bombcell there are many parameters. When we run the main script that calls all quality metrics, we also want to ensure all parameters are present and are in a correct format.

function qMetric = runAllQualityMetrics(param, rawData, savePath)
% Complex input validation that is hidden from the user
param_complete = bc.qm.checkParameterFields(param);

% Core function that calculates all quality metrics
nUnits = length(rawData);

for iUnit = 1:nUnits
    % steps 1 to n
    (...)
end

end

Users of this function don’t need to know about the input validation or other complex calculations. They just need to provide input and options.
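A validation step of this kind could look like the sketch below. It is not bombcell’s implementation of bc.qm.checkParameterFields: checkParamSketch is a hypothetical stand-in, the field names tauC, presenceRatioBinSize and verbose are borrowed from the parameter list in section 1, and the default values are placeholders.

```matlab
param = struct('tauC', 0.0001);           % the user supplied a single field
paramComplete = checkParamSketch(param);  % missing fields are filled in
disp(paramComplete)

function param = checkParamSketch(param)
% Hypothetical stand-in: fill in missing fields, then check their format.
    defaults = struct('tauC', 0.0001, 'presenceRatioBinSize', 60, 'verbose', true);
    names = fieldnames(defaults);
    for iField = 1:numel(names)
        if ~isfield(param, names{iField})
            param.(names{iField}) = defaults.(names{iField});
        end
    end
    % validateattributes throws an informative error on a badly formatted value
    validateattributes(param.tauC, {'numeric'}, {'scalar', 'positive'});
    validateattributes(param.presenceRatioBinSize, {'numeric'}, {'scalar', 'positive'});
    validateattributes(param.verbose, {'logical'}, {'scalar'});
end
```

The caller sees one function call; the defaults and format checks can change later without affecting any code that uses it.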

4. Write clear code

Clear code reduces the need for extensive documentation and makes your software more accessible to collaborators. Use descriptive and consistent variable names throughout your codebase. When dealing with specific quantities, consider adding units to variable names (e.g., ‘time_ms’ for milliseconds) to improve clarity. You can add comments to explain non-obvious logic and to add general outlines of the steps in your code. Following consistent coding style and formatting guidelines across your project also contributes to overall clarity.

Example: Improving clarity in a data processing function

Instead of an entirely mysterious function:

function [ns, sr] = ns(st, t)
ns = numel(st);
sr = ns/t;

Add more descriptive variable and function names and add function headers:

function [nSpikes, spikeRate] = numberSpikes(theseSpikeTimes, totalTime_s)
% Count the number of spikes for the current unit and compute its spike rate
% ------
% Inputs
% ------
% theseSpikeTimes: [nSpikesforThisUnit × 1 double vector] of time in seconds of each of the unit's spikes.
% totalTime_s: [double] of the total recording time, in seconds.
% ------
% Outputs
% ------
% nSpikes: [double] number of spikes for current unit.
% spikeRate: [double] spiking rate for current unit, in spikes per second (Hz).
% ------
nSpikes = numel(theseSpikeTimes);
spikeRate = nSpikes/totalTime_s;
end

5. Design for testing

Incorporate testing into your design process from the beginning. This not only catches bugs early but also encourages modular, well-defined components.

Example: Testable design for a data analysis function

For the simple ‘numberSpikes’ function defined above, a few tests can cover typical scenarios and edge cases to ensure the function works correctly. For instance, we can test a normal case with a few spikes and an empty spike times input.

function testNormalCase(testCase)
theseSpikeTimes = [0.1, 0.2, 0.3, 0.4, 0.5];
totalTime_s = 1;
[nSpikes, spikeRate] = numberSpikes(theseSpikeTimes, totalTime_s);
verifyEqual(testCase, nSpikes, 5, 'Number of spikes should be 5');
verifyEqual(testCase, spikeRate, 5, 'Spike rate should be 5 Hz');
end

function testEmptySpikeTimes(testCase)
theseSpikeTimes = [];
totalTime_s = 1;
[nSpikes, spikeRate] = numberSpikes(theseSpikeTimes, totalTime_s);
verifyEqual(testCase, nSpikes, 0, 'Number of spikes should be 0 for empty input');
verifyEqual(testCase, spikeRate, 0, 'Spike rate should be 0 for empty input');
end

This design allows for easy unit testing of individual components of the analysis pipeline.
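To run tests like these with MATLAB’s function-based unit testing framework, they need to live in a test file whose main function collects them. A minimal sketch, assuming the file is saved as testNumberSpikes.m (a hypothetical name) and that a working numberSpikes.m is on the MATLAB path; the extra test is an illustration, not part of bombcell:

```matlab
function tests = testNumberSpikes
% Main function: collects the local functions whose names start or end
% with 'test' into a test array.
tests = functiontests(localfunctions);
end

function testLongerRecording(testCase)
% 4 spikes over 2 seconds should give a rate of 2 Hz
theseSpikeTimes = [0.2, 0.7, 1.1, 1.9];
totalTime_s = 2;
[nSpikes, spikeRate] = numberSpikes(theseSpikeTimes, totalTime_s);
verifyEqual(testCase, nSpikes, 4, 'Number of spikes should be 4');
% compare floating-point results with a tolerance rather than exactly
verifyEqual(testCase, spikeRate, 2, 'AbsTol', 1e-12);
end
```

The two tests above would sit in the same file as further local functions, and the whole file can then be run from the command window with results = runtests('testNumberSpikes').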

Part 2: Open Source Best Practices for Academia

While using version control and having a README, documentation, license, and contribution guidelines are essential, I have found that these practices have the most impact:

Example Scripts and Toy Data

I have found that the most useful things you can provide with your software are example scripts and, even better, toy data that loads in your example script. Users can then quickly test your software and see how to use it on their own data, and are then more likely to adopt it. If possible, package the example scripts in Jupyter notebooks/MATLAB live scripts (or equivalent) demonstrating key use cases. In bombcell, we provide a small dataset (Bombcell Toy Data on GitHub) and a MATLAB live script that runs bombcell on this small toy dataset (Getting Started with Bombcell on GitHub).

Issue-Driven Improvement

To manage user feedback effectively, enforce the use of an issue tracker (like GitHub Issues) for all communications. This approach ensures that other users can benefit from conversations and reduces repetitive work. When addressing questions or bugs, consider if there are ways to improve documentation or add safeguards to prevent similar issues in the future. This iterative process leads to more robust and intuitive software.

Citing

Make your software citable quickly. Before (or instead of) publishing, you can generate a citable DOI using a service like Zenodo. Consider also publishing in the Journal of Open Source Software (JOSS) for light peer review. Clearly outline how users should cite your software in their publications to ensure proper recognition of your work.

Conclusion

These practices can help create popular, user-friendly, and robust academic software. Remember that good software design is an iterative process, and continuously seeking feedback and improving your codebase (and sometimes entirely rewriting/refactoring parts) will lead to more robust code.

To go deeper into principles of software design, I highly recommend reading “A Philosophy of Software Design” by John Ousterhout or “The Good Research Code Handbook” by Patrick J. Mineault.

Get involved! 

The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Join our mailing list, and follow us on X (formerly Twitter) and LinkedIn, to stay connected for updates, events, and opportunities.

UCL Open Science & Scholarship Awards – Update from Mike and Gesche!

By Kirsty, on 21 August 2024

As part of our work at the Office this year, we’ve made it a priority to stay connected with all of our award winners. Some of them shared their experiences during our conference, and we’re already well on our way to planning another exciting Awards ceremony for this year’s winners!

You can apply now for the UCL Open Science & Scholarship Awards 2024 to celebrate UCL students and staff who are advancing and promoting open science and scholarship. The awards are open to all UCL students, PhD candidates, professional services, and academic staff across all disciplines. There’s still time to submit your applications and nominations in all categories. The deadline is 1 September!

To give you some inspiration for what’s possible in open science, Mike Fell has given us an update on the work that he and Gesche have done since receiving their award last year:


In autumn last year, we were surprised and really happy to hear we’d received the first UCL Open Scholarship Awards. Even more so when we heard at the ceremony about the great projects that others at UCL are doing in this space.

The award was for work we’d done (together with PhD colleague Nicole Watson) to improve transparency, reproducibility, and quality (TReQ) of research in applied multidisciplinary areas like energy. This included producing videos, writing papers, and delivering teaching and related resources.

Of course, it’s nice for initiatives you’ve been involved in to be recognized. But even better have been some of the doors this recognition has helped to open. Shortly after getting the award, we were invited to write an opinion piece for PLOS Climate on the role of open science in addressing the climate crisis. We also engaged with leadership at the Center for Open Science.

More broadly – although it’s always hard to draw direct connections – we feel the award has had career benefits. Gesche was recently appointed Professor of Environment & Human Health at University of Exeter, and Director of the European Centre for Environment and Human Health. She highlighted her work on open science, and the award, in her application, and the new role now provides an opportunity to spread the work further beyond the bounds of UCL and our existing research projects.

There’s still a lot to do, however. While teaching about open science is now a standard part of the curriculum for graduate students in our UCL department (and Gesche is planning this for the ECEHH too), we don’t have a sense that this is common in energy research, other applied research fields, and education more broadly. It’s still quite rare to see tools like pre-analysis plans, reporting guidelines, and even preprints employed in energy research.

A new research centre we are both involved in, the UKRI Energy Demand Research Centre, has been up and running for a year, and with lots of the setup stage now complete and staff in place, we hope to pick up a strand of work in this area. Gesche is the data champion for the Equity theme of that centre. The new focus must be on how to better socialize open research practices and make them more a part of the culture of doing energy research. We look forward to continuing to work with UCL Open Science in achieving that goal.

Copyright and AI, Part 2: Perceived Challenges, Suggested Approaches and the Role of Copyright literacy

By Rafael, on 15 August 2024

Guest post by Christine Daoutis (UCL), Alex Fenlon (University of Birmingham) and Erica Levi (Coventry University).

This blog post is part of a collaborative series between the UCL Office for Open Science and Scholarship and the UCL Copyright team exploring important aspects of copyright and its implications for open research and scholarship. 

A grey square from which many colourful, wavy ribbons with segments in shades of white, blue, light green, orange and black radiate outward against a grey background.

An artist’s illustration of AI by Tim West. Photo by Google DeepMind from Pexels.

A previous post outlined copyright-related questions when creating GenAI materials—questions related to ownership, protection/originality, and infringement when using GenAI. The post discussed how answers to these questions are not straightforward, largely depend on what is at stake and for whom, and are constantly shaped by court cases as they develop.

What does this uncertainty mean for students, academics, and researchers who use GenAI and, crucially, for those in roles that support them? To what extent does GenAI create new challenges, and to what extent are these uncertainties inherent in working with copyright? How can we draw on existing expertise to support and educate on using GenAI, and what new skills do we need to develop?

In this post, we summarise a discussion we led as part of our workshop for library and research support professionals at the Research Libraries UK (RLUK) annual conference in March 2024. This year’s conference title was New Frontiers: The Expanding Scope of the Modern Research Library. Unsurprisingly, when considering the expanding scope of libraries in supporting research, GenAI is one of the first things that comes to mind.

Our 16 workshop participants came from various roles, research institutions, and backgrounds. What they had in common was an appetite to understand and support copyright in the new context of AI, and a collective body of expertise that, as we will see, is very useful when tackling copyright questions in a novel context. The workshop consisted of presentations and small group discussions built around the key themes outlined below.

Perceived Challenges and Opportunities
Does the research library community overall welcome GenAI? It is undoubtedly viewed as a way to make scholarship easier and faster, offering practical solutions—for example, supporting literature reviews or facilitating draft writing by non-English speakers. Beyond that, several participants see an opportunity to experiment, perhaps becoming less risk-averse, and welcome new tools that can make research more efficient in new and unpredictable ways.

However, concerns outweigh the perceived benefits. It was repeatedly mentioned that there is a need for more transparent, reliable, sustainable, and equitable tools before adopting them in research. Crucially, users need to ask themselves what exactly they are doing when using GenAI, their intention, what sources are being used, and how reliable the outputs are.

Concerns over copyright in GenAI were seen as an opportunity to place copyright literacy at the forefront. The need for new guidance is evident, particularly around the use of different tools with varying terms and conditions, and it is also perceived as an opportunity to revive and communicate existing copyright principles in a new light.

Suggested Solutions
One of the main aims of the workshop was to address challenges posed by GenAI. Participants were very active in putting forward ideas but also expressed concerns and frustration. For example, they questioned the feasibility of shaping policy and processes when the tools themselves constantly evolve, when there is very little transparency around the sources used, and when it is challenging to reach agreement even on essential concepts. Debates on whether ‘copying’ is taking place, whether an output is a derivative of a copyrighted work, and even whether an output is protected are bound to limit the guidance we develop.

Drawing from Existing Skills and Expertise
At the same time, it was acknowledged that copyright practitioners already have expertise, guidance, and educational resources relevant to questions about GenAI and copyright. While new guidance and training are necessary, the community can draw from a wealth of resources to tackle questions that arise while using GenAI. Information literacy principles should still apply to GenAI. Perhaps the copyright knowledge and support are already available; what is missing is a thorough understanding of the new technologies, their strengths, and limitations to apply existing knowledge to new scenarios. This is where the need for collaboration arises.

Working Together
To ensure that GenAI is used ethically and creatively, the community needs to work collaboratively with providers, creators, and users of those tools. By sharing everyday practices, the community can inform and shape decisions, guidance, and processes. It is also important to acknowledge that the onus is not just on the copyright practitioners to understand the tools but also on the developers to make them transparent and reliable. Once the models become more transparent, it should be possible to support researchers better. This is even more crucial in supporting text and data mining (TDM) practices, which are critical in many research areas, to limit further restrictions following the implementation of AI models.

Magic Changes
With so much excitement around AI, we felt we should ask the group to identify the one magic change that would help remove most of the concerns. Interestingly, the consensus was that clarity around the sources and processes used by GenAI models is essential. How do the models come up with their answers and outputs? Is it possible to have clearer information about the sources’ provenance and the way the models are trained, and can this inform how authorship is established? And what criteria should be put in place to ensure the models are controlled and reliable?

This brings the matter back to the need for GenAI models to be regulated—a challenging but necessary magic change that would help us develop our processes and guidance with much more confidence.

Concluding Remarks
While the community of practitioners waits for decisions and regulations that will frame their approach, it is within their power to continue to support copyright literacy, referring to new and exciting GenAI cases. Not only do those add interest, but they also highlight an old truth about copyright, namely, that copyright-related decisions always come with a degree of uncertainty, risk, and awareness of conflicting interests.

About the authors 

Christine Daoutis is the Copyright Support Officer at UCL. Christine provides support, advice and training on copyright as it applies to learning, teaching and research activities, with a focus on open science practices. Resources created by Christine include the UCL Copyright Essentials tutorial and the UCL Copyright and Your Teaching online tutorial.

Alex Fenlon is the Head of Copyright and Licensing within Libraries and Learning Resources at the University of Birmingham. Alex and his team provide advice and guidance on copyright matters, including text and data mining and AI, to ensure that all law and practice are understood by all.

Erica Levi is the Digital Repository and Copyright Lead at Coventry University. Erica has created various resources to increase awareness of copyright law and open access through gamification. Her resources are available on her website.
