Open@UCL Blog

Open Source Software Design for Academia

By Kirsty, on 27 August 2024

Guest post by Julie Fabre, PhD candidate in Systems Neuroscience at UCL. 

As a neuroscientist who has designed several open source software projects, I’ve experienced firsthand both the power and pitfalls of the process. Many researchers, myself included, have learned to code on the job, and there’s often a significant gap between writing functional code and designing robust software systems. This gap becomes especially apparent when developing tools for the scientific community, where reliability, usability, and maintainability are crucial.

My journey in open source software development has led to the creation of several tools that have gained traction in the neuroscience community. One such project is bombcell, software designed to assess the quality of recorded neural units. This tool replaces what was once a laborious manual process and is now used in over 30 labs worldwide. I have also developed several other, smaller toolboxes for neuroscience.

These efforts were recognized last year when I received an honourable mention in the UCL Open Science and Scholarship Awards.

In this post, I’ll share insights gained from these experiences. I’ll cover, with some simplified examples from my toolboxes:

  1. Core design principles
  2. Open source best practices for academia

Disclaimer: I am not claiming to be an expert. Don’t view this as a definitive guide, but rather as a conversation starter.


Follow Julie’s lead: Whether you’re directly involved in open source software development or any other aspect of open science and scholarship, or if you simply know someone who has made important contributions, consider applying yourself or nominating a colleague for this year’s UCL Open Science and Scholarship Awards to gain recognition for outstanding work!


Part 1: Core Design Principles

As researchers, we often focus on getting our code to work, but good software design goes beyond just functionality. If you want to maintain and build upon your software, following a few principles from the get-go will elevate it from “it works” to “it’s a joy to use, maintain and contribute to”.

1. Complexity is the enemy

A primary goal of good software design is to reduce complexity. One effective way to simplify complex functions with many parameters is to use configuration objects. This approach not only reduces parameter clutter but also makes functions more flexible and maintainable. Additionally, breaking down large functions into smaller, more manageable pieces can significantly reduce overall complexity.

Example: Simplifying a data analysis function

For instance, in bombcell we compute many different quality metrics, and each quality metric is associated with several parameters. In the main function, instead of passing all the different parameters independently:

[qMetric, unitType] = runAllQualityMetrics(plotDetails, plotGlobal, verbose, reExtractRaw, saveAsTSV, removeDuplicateSpikes, duplicateSpikeWindow_s, detrendWaveform, nRawSpikesToExtract, spikeWidth, computeSpatialDecay, probeType, waveformBaselineNoiseWindow, tauR_values, tauC, computeTimeChunks, deltaTimeChunks, presenceRatioBinSize, driftBinSize, ephys_sample_rate, nChannelsIsoDist, normalizeSpDecay, (... many many more parameters ...), rawData, savePath);

they are all stored in a ‘param’ object that is passed to the function:

[qMetric, unitType] = runAllQualityMetrics(param, rawData, savePath);

This approach reduces parameter clutter and makes the function more flexible and maintainable.
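
For illustration, a few of these parameters could be collected into such a struct as follows (the field names come from the call above, but the values shown are simplified placeholders, not bombcell’s actual defaults):

% build the configuration object once, then pass it around as a single argument
param = struct;
param.verbose = true;                    % print progress to the command window
param.removeDuplicateSpikes = true;      % remove double-counted spikes
param.duplicateSpikeWindow_s = 0.00001;  % window, in seconds, used to detect duplicates
param.tauR_values = 0.002;               % refractory period duration, in seconds
param.computeSpatialDecay = true;        % estimate the waveform's spatial decay

[qMetric, unitType] = runAllQualityMetrics(param, rawData, savePath);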

2. Design for change

Research software often needs to adapt to new hypotheses or methodologies. When writing a function, ask yourself “what additional functionalities might I need in the future?” and design your code accordingly. Implementing modular designs allows for easy modification and extension as research requirements evolve. Consider using dependency injection to make components more flexible and testable. This approach separates the creation of objects from their usage, making it easier to swap out implementations or add new features without affecting existing code.

Example: Modular design for a data processing pipeline

Instead of a monolithic function:

function runAllQualityMetrics(param, rawData, savePath)
% Hundreds of lines of code doing many different things
(...)
end

Create a modular pipeline that separates each quality metric into a different function:

function qMetric = runAllQualityMetrics(param, rawData, savePath)
    nUnits = length(rawData);
    for iUnit = 1:nUnits
        % step 1: calculate the percentage of spikes missing
        qMetric.percSpikesMissing(iUnit) = bc.qm.percSpikesMissing(param, rawData);
        % step 2: calculate the fraction of refractory period violations
        qMetric.fractionRPviolations(iUnit) = bc.qm.fractionRPviolations(param, rawData);
        % step 3: calculate the presence ratio
        qMetric.presenceRatio(iUnit) = bc.qm.presenceRatio(param, rawData);
        (...)
        % step n: calculate distance metrics
        qMetric.distanceMetric(iUnit) = bc.qm.getDistanceMetric(param, rawData);
    end
    bc.qm.saveQMetrics(qMetric, savePath)
end

This structure allows for easy modification of individual steps or addition of new steps without affecting the entire pipeline.

In addition, this structure makes it easy to define new parameters that modify the behavior of the subfunctions. For instance, we can add different methods (such as the ‘gaussian’ option below) without changing how any of the functions are called!

param.percSpikesMissingMethod = 'gaussian';
qMetric.percSpikesMissing(iUnit) = bc.qm.percSpikesMissing(param, rawData);

and then, inside the function:

function percSpikesMissing = percSpikesMissing(param, rawData)
    % use strcmp (rather than ==) to compare character arrays safely
    if strcmp(param.percSpikesMissingMethod, 'gaussian')
        (...)
    else
        (...)
    end
end

3. Hide complexity

Expose only what’s necessary to use a module or function, hiding the complex implementation details. Use abstraction layers to separate interface from implementation, providing clear and concise public APIs while keeping complex logic private. This approach not only makes your software easier to use but also allows you to refactor and optimize internal implementations without affecting users of your code.

Example: Complex algorithm with a simple interface

For instance, bombcell has many parameters. When we run the main function that computes all the quality metrics, we also want to ensure that all parameters are present and in the correct format.

function qMetric = runAllQualityMetrics(param, rawData, savePath)
    % Complex input validation that is hidden from the user
    param_complete = bc.qm.checkParameterFields(param);

    % Core function that calculates all quality metrics
    nUnits = length(rawData);

    for iUnit = 1:nUnits
        % steps 1 to n
        (...)
    end

end

Users of this function don’t need to know about the input validation or other complex calculations. They just need to provide input and options.
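
For illustration, here is a simplified sketch of what such a hidden parameter-checking helper could look like (the real bc.qm.checkParameterFields is more thorough, and the default values below are assumptions):

function param = checkParameterFields(param)
    % fill in any missing fields with default values so that downstream
    % functions can rely on every parameter being present
    defaults = struct('verbose', true, 'computeSpatialDecay', true, ...
        'tauR_values', 0.002, 'presenceRatioBinSize', 60);
    defaultFields = fieldnames(defaults);
    for iField = 1:numel(defaultFields)
        thisField = defaultFields{iField};
        if ~isfield(param, thisField)
            param.(thisField) = defaults.(thisField);
        end
    end
end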

4. Write clear code

Clear code reduces the need for extensive documentation and makes your software more accessible to collaborators. Use descriptive and consistent variable names throughout your codebase. When dealing with specific quantities, consider adding units to variable names (e.g., ‘time_ms’ for milliseconds) to improve clarity. Add comments to explain non-obvious logic and to outline the general steps in your code. Following consistent coding style and formatting guidelines across your project also contributes to overall clarity.

Example: Improving clarity in a data processing function

Instead of an entirely mysterious function:

function [ns, sr] = ns(st, t)
    ns = numel(st);
    sr = ns/t;
end

Add more descriptive variable and function names and add function headers:

function [nSpikes, spikeRate_s] = numberSpikes(theseSpikeTimes, totalTime_s)
% Count the number of spikes for the current unit
% ------
% Inputs
% ------
% theseSpikeTimes: [nSpikesForThisUnit × 1 double vector] time, in seconds, of each of the unit's spikes.
% totalTime_s: [double] total recording time, in seconds.
% ------
% Outputs
% ------
% nSpikes: [double] number of spikes for the current unit.
% spikeRate_s: [double] spiking rate of the current unit, in spikes per second.
% ------
    nSpikes = numel(theseSpikeTimes);
    spikeRate_s = nSpikes/totalTime_s;
end

5. Design for testing

Incorporate testing into your design process from the beginning. This not only catches bugs early but also encourages modular, well-defined components.

Example: Testable design for a data analysis function

For the simple ‘numberSpikes’ function defined above, we can write a few tests covering various scenarios and edge cases to ensure the function works correctly. For instance, we can test a normal case with a few spikes, and the edge case of an empty spike-times input.

function testNormalCase(testCase)
    theseSpikeTimes = [0.1, 0.2, 0.3, 0.4, 0.5];
    totalTime_s = 1;
    [nSpikes, spikeRate] = numberSpikes(theseSpikeTimes, totalTime_s);
    verifyEqual(testCase, nSpikes, 5, 'Number of spikes should be 5');
    verifyEqual(testCase, spikeRate, 5, 'Spike rate should be 5 Hz');
end

function testEmptySpikeTimes(testCase)
    theseSpikeTimes = [];
    totalTime_s = 1;
    [nSpikes, spikeRate] = numberSpikes(theseSpikeTimes, totalTime_s);
    verifyEqual(testCase, nSpikes, 0, 'Number of spikes should be 0 for empty input');
    verifyEqual(testCase, spikeRate, 0, 'Spike rate should be 0 for empty input');
end

This design allows for easy unit testing of individual components of the analysis pipeline.
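
In MATLAB, one way to run such tests is to place them as local functions in a function-based test file. A minimal sketch, assuming the two test functions above live in a file called testNumberSpikes.m, would be:

function tests = testNumberSpikes
% main function of the test file: collects the local test functions
% (testNormalCase, testEmptySpikeTimes, ...) into a runnable test array
tests = functiontests(localfunctions);
end

The whole suite can then be run with results = runtests('testNumberSpikes').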

Part 2: Open Source Best Practices for Academia

While using version control and having a README, documentation, a license, and contribution guidelines are all essential, I have found that the following practices have the most impact:

Example Scripts and Toy Data

I have found that the most useful thing you can provide with your software is a set of example scripts and, even better, toy data that your example scripts load. Users can then quickly test your software and see how to use it on their own data, and are then more likely to adopt it. If possible, package the example scripts as Jupyter notebooks/MATLAB live scripts (or equivalent) demonstrating key use cases. In bombcell, we provide a small dataset (Bombcell Toy Data on GitHub) and a MATLAB live script that runs bombcell on this toy dataset (Getting Started with Bombcell on GitHub).
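
As a rough sketch, such an example script might look like the following (the paths and the loadToyDataset helper are hypothetical placeholders, not bombcell’s actual API):

% illustrative example script: load the toy dataset and run all quality metrics
toyDataPath = '/path/to/bombcell_toy_data';   % where the toy dataset was downloaded
savePath = '/path/to/quality_metrics_output'; % where the quality metrics will be saved

rawData = loadToyDataset(toyDataPath);        % hypothetical helper that loads the toy data
param = struct('verbose', true);              % minimal configuration struct (see Part 1)

[qMetric, unitType] = runAllQualityMetrics(param, rawData, savePath);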

Issue-Driven Improvement

To manage user feedback effectively, enforce the use of an issue tracker (like GitHub Issues) for all communications. This approach ensures that other users can benefit from conversations and reduces repetitive work. When addressing questions or bugs, consider if there are ways to improve documentation or add safeguards to prevent similar issues in the future. This iterative process leads to more robust and intuitive software.

Citing

Make your software citable early. Before (or instead of) publishing a paper, you can generate a citable DOI using a service like Zenodo. Consider also publishing in the Journal of Open Source Software (JOSS) for lightweight peer review. Clearly state how users should cite your software in their publications to ensure proper recognition of your work.

Conclusion

These practices can help create popular, user-friendly, and robust academic software. Remember that good software design is an iterative process, and continuously seeking feedback and improving your codebase (and sometimes entirely rewriting/refactoring parts) will lead to more robust code.

To go deeper into principles of software design, I highly recommend reading “A Philosophy of Software Design” by John Ousterhout or “The Good Research Code Handbook” by Patrick J. Mineault.

Get involved! 

alt=""The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Join our mailing list, and follow us on X, formerly Twitter and LinkedIn, to stay connected for updates, events, and opportunities.


The benefits and barriers to code sharing as an Early Career Researcher

By Kirsty, on 14 September 2021

Guest post by Louise Mc Grath-Lone, Research Fellow (UCL Institute of Health Informatics), Rachel Pearson, Research Assistant (UCL Institute of Child Health) and Ania Zylbersztejn, Research Fellow (UCL Institute of Child Health)

In July 2021, we held a session on code sharing as part of the UCL Festival of Code and were thrilled to have almost 90 attendees from 9 out of UCL’s 11 faculties – highlighting that researchers from across a wide range of disciplines are interested in sharing their code.

The aims of the session were to highlight the benefits of code sharing, to explore some of the barriers to code sharing that Early Career Researchers may experience, and to offer some practical advice about establishing, maintaining and contributing to a code repository.

In this blog, we summarise the benefits and barriers to code sharing we discussed in the session taking into account the views that participants shared.

What is code sharing and what are the benefits?

Code sharing covers a range of activities, including sharing code privately (e.g., with your colleagues as part of internal code review) or publicly (e.g., as part of a journal article submission).

For Early Career Researchers in academia, there are many benefits to sharing code including:

Reducing duplication of effort: For activities such as data cleaning and preparation, code sharing is an important method of reducing duplication of effort among the research community.

Capturing the work you put into data management: The processes of managing large datasets are time-consuming, but this effort is often not apparent in traditional research outputs (such as journal articles). Sharing code is one way of demonstrating the work that goes into data management activities.

Improving the transparency and reproducibility of your work: Code sharing allows others to understand, validate and extend what you did in your research.

Enabling the continuity of your work: Many researchers spend the early years of their career on fixed-term contracts. Code sharing is a way to enable the continuity of your work after you’ve moved on by allowing others to build on it. This increases the chances of it reaching the publication stage and your efforts and inputs being recognised in the form of a journal article.

Building your reputation and networks: Code sharing is a way to build your reputation and grow your networks which can lead to opportunities for collaboration.

Providing opportunities for teaching and learning: By sharing code and by looking at code that others have shared, Early Career Researchers have opportunities to both teach and learn.

Demonstrating a commitment to Open Science principles: Code sharing is increasingly valued by research funders (e.g. the Wellcome Trust) and is a tangible way to show your commitment to Open Science principles which are part of UCL’s Academic Framework and important for career progression.

Despite the clear benefits to code sharing, at the start of our session just 1 in 4 participants (26%) said that they often or always share code. However, by the end of the session, almost all participants (90%) said that they definitely or probably will share their code in the future.

What are the barriers to code sharing as an Early Career Researcher and how we can overcome them?

We asked participants what has put them off sharing their code in the past. The most common responses were:

The time and effort required: Ideally, you would write perfectly formatted and commented code on the first go; in reality, however, it often does not work out like this. As you update code and encounter bugs, code can become messy, and considerable time and effort are needed to get it to a point where it can be understood by someone outside the research project. We discussed the importance of shifting your perception of ‘shareable’ code: sharing any code, even if messy, is far more helpful than sharing nothing at all.

Lack of confidence and concerns about criticism: Many researchers who write code as part of their work have very little (or no!) formal training. This means that sharing code can be daunting. For example, researchers may be worried about others finding errors in their code; however, sharing code can help to catch bugs in code early on and can bolster your confidence and reassure you that your code is correct. In the session, we also discussed how getting involved with online coding communities that emphasize inclusivity and support (e.g., R Ladies, Tidy Tuesday or one of the UCL Coding Clubs) can help grow confidence and provide a kinder environment in which to share code publicly.

Not knowing how to share or who to share with: A lack of formal training means that many researchers are unsure about where or how to share code, including not knowing which license to use to enable appropriate reuse of code. We discussed the need for more training opportunities and encouraged setting up your own code review group (like a journal club, but for sharing and discussing code).

Worry that code will be reused without permission: Some participants were worried about plagiarism and their hard work being re-used without their knowledge or permission. However, hosting your code in a repository like GitHub allows you to choose a suitable licence for re-use of your code, preventing undesired use while still supporting open science! You can also see how many people have accessed your code.

How can Early Career Researchers get started with code sharing?

Preparing code to share can take time and, as they work to secure their future within academia, many Early Career Researchers may already feel overloaded and pulled in different directions (e.g., teaching, institutional citizenship, engagement work, producing publications, attending conferences, research management, etc.). However, code sharing is hugely beneficial for a career in academia and so we would encourage all Early Career Researchers to try to find the time to share code by viewing it as an opportunity to invest in your future self. For example, you could:

  • Adopt a coding style guide to help produce clear and uniform code with good comments from the outset. This will reduce the effort needed when you come to share code (and help your future self when you look at your code many years later and have inevitably forgotten what it all does!).
  • Join a UCL Coding Club or an online community to learn tips from others about coding and sharing code.
  • Learn to use a code repository like GitHub. As part of our session, we delivered an introductory tutorial on how to use GitHub with links to other useful resources (available here).

How can UCL support Early Career Researchers to share code?

We ended the session by asking the participants how UCL could better support them to share their code. Some of the ideas suggested by Early Career Researchers were:

More training on writing and sharing code: For example, one suggestion was that UCL could create a Moodle training course for code sharing. Training in coding best practice (across several languages) to help Early Career Researchers write code well the first time would also be helpful.

Simple, accessible guidance about code sharing: This might include checklists or 1-to-1 advice sessions, in particular, to help Early Career Researchers to select the right licenses.

Embed code sharing as best practice at all levels: Encouraging and supporting senior researchers to share code, so that it becomes embedded as good practice at all levels, would set a good example for and encourage more junior members of staff. It would also help to ensure that the time and training required to prepare code for sharing is built into grant applications.

Knowledge-sharing opportunities: More events and opportunities to discuss how research groups share code, to spread best practice across faculties throughout UCL.

 

We would like to thank everyone who attended our session – “Code sharing for Early Career Researchers: the good the bad and the ugly!” – at the UCL Festival of Code for their time and contributions to the lively discussions. All the materials from the session are available here, including an introductory tutorial to getting started with code sharing using GitHub. We would also like to thank the organisers of the UCL Festival of Code for their help and support.