
Centre for Advanced Research Computing


ARC is UCL's research, innovation and service centre for the tools, practices and systems that enable computational science and digital scholarship


Archive for the 'Technologies' Category

The importance of collaboration: The latest engagement between DiRAC and ARC

By Connor Aird, on 26 April 2024

When time is scarce on a research project, it is important to continuously plan and effectively collaborate with the whole team. A good example of this is the DiRAC project Spontaneous Symmetry Breaking in 3d Models of Fermions, with Prof Simon Hands (PI), which, due to a funding deadline, had to be delivered in 5 weeks. This project aims to explore the phase diagram of a relativistic field theory of fermions using a code base developed by Prof Hands et al, known as thirring-rhmc. However, the collaboration between ARC and Prof Hands covered a much smaller scope.

Aims 

Our aim was to migrate the work of a PhD student (Dr Jude Worthy) into the default branch of the thirring-rhmc code base. Once this was completed, the intention was that some performance improvements could be investigated as part of the project. Jude’s work implemented a higher accuracy but consequently lower performance formulation (Wilson kernel) of something already implemented in the code base (Shamir kernel). This reduction in performance is the reason for the desire to gain some performance improvements. However, from the PI’s initial comments, it was clear that the key aim remained the code migration – “…I’m increasingly convinced it only makes sense to pursue this research program further if an improved formulation is employed, so the Shamir -> Wilson transition as essential”. 

Obstacles 

Several obstacles threatened the success of this project. Development on the original version of thirring-rhmc had continued throughout Jude’s PhD but unfortunately Git had not been used to develop the Wilson kernel. Therefore, the two codes had diverged significantly, with no clear indication of how far. Due to this divergence, it was vital to develop a continuous testing suite to have any chance of success. However, the outputs of thirring-rhmc are statistical in nature and can, whilst remaining correct, vary significantly with only slight changes to the code. Therefore, a lot of domain-specific knowledge would be required to design these tests.
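One common way to test outputs that are statistical in nature, and not necessarily the approach taken in this project, is to compare a measured mean against a trusted reference value within a tolerance set by the standard errors. The sketch below illustrates the idea in Python; the function name, the tolerance of three standard errors and the Gaussian stand-in data are all assumptions made for illustration and have nothing to do with the thirring-rhmc code itself.

```python
import math
import random
import statistics


def consistent_with_reference(samples, ref_mean, ref_stderr, n_sigma=3.0):
    """Return True if the sample mean agrees with the reference value
    to within n_sigma combined standard errors."""
    mean = statistics.fmean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    # Combine the two independent uncertainties in quadrature
    combined = math.hypot(stderr, ref_stderr)
    return abs(mean - ref_mean) <= n_sigma * combined


# Hypothetical stand-in for a stochastic observable: noisy values around 1.0
rng = random.Random(42)
samples = [rng.gauss(1.0, 0.1) for _ in range(1000)]

agrees = consistent_with_reference(samples, ref_mean=1.0, ref_stderr=0.005)
disagrees = consistent_with_reference(samples, ref_mean=2.0, ref_stderr=0.005)
print(agrees, disagrees)
```

Choosing the reference values and a sensible tolerance is exactly where the domain knowledge comes in: too tight and correct changes fail the test, too loose and real bugs slip through.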

What we did 

This project’s strict time constraints required us to take a methodical approach to planning our work. For each task, we agreed a clear definition of done and ensured we understood how that individual piece of work helped progress towards our key aim. Continuously planning our tasks in this way was essential to our success.

The lack of clarity around what changes in the Wilson kernel were significant meant our first task was to set up reliable unit tests. With these tests in place, we could confidently alter the code and catch any breaking changes we might introduce. Helpfully, some stale tests were already present in the repository. With Simon’s domain knowledge, we were able to update these existing tests to create a working test suite. When these tests failed and highlighted issues we couldn’t solve independently, we were able to quickly reach a solution through regular communication with Simon. Simon’s domain knowledge was an invaluable asset throughout the project. As a bonus, we were able to demonstrate the confidence regular testing gave us when carrying out large refactors and migrations. This will hopefully increase the chances of Simon’s research team continuing to maintain and build upon these tests, therefore preventing the tests going stale again. This is a great example of how close collaboration between RSEs and Researchers can benefit both parties. 

This close collaboration and communication with Simon helped to quickly increase our knowledge of the code base and research domain. Due to this better understanding, we identified the likely causes of two known issues with Jude’s code. Most notably, we identified that the inflated value of an input parameter was a key reason for the Wilson kernel’s reduced performance.

Conclusion 

To conclude, RSEs and Researchers work best together when they effectively communicate. Siloing the domain knowledge of these two parties only reduces the chances of success. Our projects are collaborations and can only succeed if we work in this way from the very beginning. 

Randomising Blender scene properties for semi-automated data generation

By Ruaridh Gollifer, on 12 December 2023

Blender is free and open-source software for 3D geometry rendering. Uses include modelling, simulation, animation, virtual reality applications, and more recently synthetic dataset generation. This last application is of particular interest in the field of medical imaging, where often there is limited real data that can be used to train machine learning models. By creating large amounts of synthetic but realistic data, we can improve the performance of models in tasks such as polyp detection in image-guided surgery. Synthetic data generation has other advantages: tools like Blender give us more control, and we can generate a variety of ground truth data, from segmentation masks to optical flow fields, which in real data would be very challenging to generate or would involve extensive, time-consuming manual labelling. Another advantage of this approach is that we can often easily scale up our synthetic datasets by randomising parameters of the modelled 3D geometry. It can, however, be challenging to make the synthetic data realistic and representative of the real data.

The Problem 

The aim was to develop an add-on that would help researchers and medical imaging experts determine which ranges of parameter values produce realistic synthetic images. Prior to the project, dataset generation involved a laborious process of creating scenes in Blender by hand, with parameters changed manually to introduce variation in the datasets. A more efficient process was needed when prototyping synthetic dataset generation, both to decide which ranges of parameters make sense visually and to make it easier to extend to other use cases in the future.

What we did 

In collaboration with the UCL Wellcome / EPSRC Centre for Interventional and Surgical Sciences (WEISS), research software engineers from ARC have developed a Blender add-on to randomise relevant parameters for the generation of datasets for polyp detection within the colon. The add-on was originally developed to render a highly diverse and (near) photo-realistic synthetic dataset of laparoscopic surgery camera views. To replicate the different camera positions used in surgery as well as the shape and appearance of the tissues, we focused on randomising three main components of the scene: camera transforms (camera orientation and location), geometry and materials. However, we allowed for more flexibility beyond these 3 main groups of parameters, implementing utilities to randomise other user-defined properties. The software also allows the following features: 1) setting the minimum and maximum bounds through an input file, 2) setting a randomisation seed for reproducibility, 3) exporting output parameters for a chosen number of frames to an output file. The add-on includes testing through Pytest, documentation for users and developers, example input and output files and a sample Blender scene.
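To give a feel for the three features listed above (bounds from an input, a seed for reproducibility, and exported parameters per frame), here is a minimal sketch in plain Python. It is not the add-on's code or API: the parameter names, the bounds and the `sample_frames` function are hypothetical, and the real add-on reads its bounds from its own input file format and applies the values to a Blender scene.

```python
import json
import random

# Hypothetical (minimum, maximum) bounds for a few scene properties
bounds = {
    "camera_location_x": (-1.0, 1.0),
    "camera_rotation_z": (0.0, 3.14159),
    "material_roughness": (0.2, 0.8),
}


def sample_frames(bounds, n_frames, seed):
    """Draw one uniform value per parameter per frame.

    A local generator is used so that the seed alone determines the output.
    """
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        for _ in range(n_frames)
    ]


frames = sample_frames(bounds, n_frames=3, seed=1234)

# Export the sampled parameters, one entry per frame
print(json.dumps(frames, indent=2))
```

Calling `sample_frames` again with the same seed returns the same values, which is what makes a generated dataset reproducible.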

The outcomes 

Version 1.0.0 of the Blender Randomiser is available under a BSD 3-Clause License. The GitHub repo is public, and the software can be downloaded and installed from it, with instructions provided on how to use the add-on. Examples of what can be produced in Blender can be found at the UCL Research Data Repository (N.B. these examples were produced manually prior to completion of this project).

Developer notes are also available to allow contributions. 

 

Sofia Minano and Ruaridh Gollifer

k-Plan now available to researchers!

By Sam Cunliffe, on 11 December 2023

One of ARC’s longest-running collaborations is with the Biomedical Ultrasound Group. Over the past three years, we’ve been developing a graphical user interface to simulate ultrasound treatment plans!

The k-Plan Logo

This software is called k-Plan, and licences are now available for sale through UCL’s commercial partner, BrainBox (who also sell ultrasound transducers).

Screenshot of the k-Plan GUI

If you’re interested in medical ultrasound, and think this software might help you: you can read the full UCL press release, or you can see some more snapshots of k-Plan in action.

The people behind the work…

Our collaboration is managed and led by Bradley Treeby. As well as me, there’s a full roster of research software engineers who’ve worked hard at various times over the last three years to make this happen:

  • Panayiotis Georgiou, ex-UCL now ARM.
  • Timothy Spain, ex-UCL now NERSC, 🇳🇴.
  • Ilektra Christidi, ARC, UCL.
  • Alessandro Felder, ARC, UCL.
  • Orod Razeghi, ex-UCL now University of Cambridge.
  • Idil Ozdemir, ARC, UCL.
  • Connor Aird, ARC, UCL.

We also have collaborators from the Brno University of Technology who work behind the scenes on the middleware and back-end of k-Plan and run the planning simulations in the cloud.

Using continuous integration efficiently

By Matt Graham, on 3 November 2023

Use of continuous integration (CI) is an important part of creating robust and reproducible research software, however running automated jobs on services such as GitHub Actions and GitLab CI/CD comes with an associated energy, and so environmental, cost.

As an example, in the 30 days from 2nd October to 1st November 2023, the UCL/TLOmodel repository ran GitHub Actions workflow jobs which used 1937 hours of runner time. As a (very rough) back-of-the-envelope estimate, assuming an average runner power consumption of 12W[1], this equates to a total monthly CI energy usage for this project of 23kWh, which is about 10% of a typical UK household’s monthly electricity usage or the monthly energy usage of around 8 UCL employees’ laptops[2].
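The arithmetic behind this estimate can be reproduced in a few lines of Python, using only the figures quoted here and in footnotes 1 and 2. The variable names are ours; the inputs are the same rough assumptions as in the text, so the result should be read as an order-of-magnitude estimate.

```python
runner_hours = 1937      # runner time used over the 30 days
vm_power_w = 40          # mean power attributed to a 4-vCPU VM (footnote 1)
runners_per_vm = 4       # assumption: one job runner per virtual CPU
pue = 1.2                # assumed data centre power usage effectiveness

# Power per runner, including data centre overhead
runner_power_w = vm_power_w / runners_per_vm * pue
# Total CI energy over the period, in kWh
ci_energy_kwh = runner_hours * runner_power_w / 1000

laptop_power_w = 20      # assumed average laptop power draw (footnote 2)
hours_per_week = 36.5
weeks_per_month = 4
laptop_kwh = laptop_power_w * hours_per_week * weeks_per_month / 1000

laptops_equivalent = ci_energy_kwh / laptop_kwh

print(f"{runner_power_w:.0f} W per runner")
print(f"{ci_energy_kwh:.1f} kWh of CI energy in the period")
print(f"{laptop_kwh:.2f} kWh per laptop per month")
print(f"equivalent to about {laptops_equivalent:.0f} laptops")
```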

Below are some simple ways to reduce GitHub Actions and GitLab CI/CD runner usage without compromising the gains that automated testing and deployment brings. These approaches also have the side benefit for private GitHub repositories of reducing usage of the free Actions minutes quota.

  1. Automatically cancel redundant jobs
    For jobs triggered by pushing commits to a pull or merge request branch, we may push new commits and trigger new job runs while previous runs are still in progress. Typically we will only care about the job results on the latest commit, and so the previous job runs will be redundant. The concurrency.group and concurrency.cancel-in-progress properties in GitHub Actions workflows can be used to automatically cancel already in progress job runs at different grouping levels – for example for workflows triggered for the same pull request, branch or tag. In GitLab, jobs which have the interruptible property set to true will be automatically cancelled when the Auto-cancel redundant pipelines project setting is enabled. Both GitHub and GitLab also allow manually skipping job runs when pushing new commits by including a token such as [skip ci] or [ci skip] in the commit message.
  2. Use filters and rules to only run jobs when needed
    Commonly some checks are only relevant to a subset of the files in a repository. For example, if a branch only involves changes to a Markdown README file, we probably do not need to run a test suite for the source code in a repository. In GitHub Actions we can use the optional paths and paths-ignore properties of the push and pull_request triggers to only run workflow jobs when the files changed do or do not match one or more patterns. Similarly in GitLab we can use rules to set conditions on when a job runs, with the rules:changes property specifically allowing runs to be limited to when files matching specified patterns are changed. It may also make sense to have jobs only run when pull requests are no longer drafts and are marked as ready to review. This can be achieved in GitHub Actions using a combination of the on.pull_request.types property and an if condition on the job that checks github.event.pull_request.draft is false.
  3. Set the scheduled job frequency appropriately
    Both GitHub Actions and GitLab CI/CD pipelines allow running jobs on a schedule using crontab syntax. While we sometimes refer to nightly jobs, it is worth considering carefully what the appropriate frequency is for such scheduled jobs based on the use case and wider context of the project, and whether running jobs at, for example, a weekly frequency might be sufficient. For example, running daily jobs to check whether tests for a package pass against the latest compatible versions of upstream dependencies may make sense for a package with a large userbase and development team, where breaking changes in upstream packages are likely to be encountered in practice quickly, and where there is likely to be sufficient developer capacity to resolve the issues in a similar timeframe. On the other hand, for a package with fewer users and contributors, having a test job identify such breaking changes at a weekly frequency may be sufficient.
  4. Cache job dependencies
    A common step in CI jobs is setting up dependencies on the runner, often using a language specific package manager like npm or pip. Rather than redownloading and building dependencies afresh on each run, we can use caching features to reuse the dependencies built in previous jobs (providing the new run uses the same versions of dependencies). The GitHub cache action provides a generic approach for performing caching in GitHub Actions jobs, while language specific setup actions such as setup-python have built in support for setting up dependency caching. GitLab similarly provides support for caching dependencies.
  5. Set timeouts to avoid misbehaving jobs running for long times
    The default timeout for GitHub Actions jobs is 6 hours. If you expect a job to run in much less time than that, setting a shorter timeout (using the jobs.<job_id>.timeout-minutes property) can prevent misbehaving jobs, which hang or run very slowly, from inadvertently using a large amount of runner time. The default timeout on GitLab is shorter at 60 minutes but can similarly be adjusted.
  6. Chain jobs intelligently and fail fast
    Commonly we will run an assortment of tests and checks as part of CI workflows, with a correspondingly wide range of run times. Code formatting and linting checks will typically be the quickest to run, while test suites may contain both quicker unit tests as well as integration and system tests which can take longer to run. Ensuring faster running checks and tests run first, and halting further jobs from running if these fail, will avoid wasting compute time running slow tests unnecessarily before changes to fix the faster running checks are made. Taking this one step further, for very quick checks such as linting and formatting, frameworks like pre-commit can be used to ensure checks are run automatically when committing changes, reducing the chance of CI jobs failing and needing to be rerun in the first place. In GitHub Actions the jobs.<job_id>.needs property can be used to specify that other jobs must successfully complete before another job runs. When running a matrix strategy job which creates one job run for each of a set of configurations (corresponding to, for example, different versions, operating systems or groups of tests), the jobs.<job_id>.strategy.fail-fast property, which defaults to true, cancels any other still in progress job runs in the matrix if a single run fails. GitLab pipeline jobs also have a needs property for specifying dependencies.

  1. Estimating the power usage of a CI job runner is complicated, as jobs will typically be run on virtual machines (VMs) on cloud-hosted servers, generally with multiple VMs running in parallel on each server and potentially multiple job runners per VM. The analysis in Aldossary and Djemame (2019) suggests a VM with 4 virtual central processing units (CPUs), running on a host with an eight-core Intel Xeon E3-1230 V2 CPU and 16GB of DDR3 RAM under an active load with 80% CPU utilization, can be attributed a mean power consumption of around 40W (see Figure 7). If we assume four job runners per VM (one per virtual CPU) this corresponds to around 10W per runner. To account for the additional power overhead for data centre infrastructure such as cooling, we assume a power usage effectiveness of 1.2 (based on figures provided by Microsoft for Azure), giving an average overall power draw of 12W per runner.

  2. This is based on an assumption of an average power draw of 20W (under a mix of idling and working at load) for a Dell Latitude 5410 laptop, with a 36.5 hour working week and four weeks in a month, giving an estimate of 2.92kWh used per laptop per month.

Simulating light propagation through matter.

By Sam Cunliffe, on 31 October 2023

Observing how light interacts with materials allows us to develop non-invasive medical imaging techniques that rely on these interactions to assemble an image or infer an appropriate diagnosis.

Light interacts with materials in many different ways. One of the most commonly observed interactions is dispersion, which causes white light to split into individual colours, creating phenomena like rainbows (light from the sun dispersing through raindrops). Another commonly observed interaction is refraction, which causes light to change direction as it passes between two materials and is responsible for straight objects like straws appearing to be disjointed when placed into water. To completely describe what is going on in these interactions, we have to use a system of equations known as Maxwell’s equations. We also have to consider some additional parameters that describe the particular material(s) that the light is interacting with. In their most general form, Maxwell’s equations are very complex but have the advantage that almost all materials and interactions can be modelled by them. Solving these equations is, in general, impossible to do with pen and paper, so we need software to do this for us.

Software like this has a wide variety of applications in biomedical optics, notably optical coherence tomography (non-invasive medical imaging of the eye), multiphoton microscopy, and wavefront shaping. For example, we can use this software to model light propagating in the retina, simulating a retina scan. Then we can perform a retina scan for a patient in real life, and use our simulation to better understand the scan. In the early stages of disease, retinal scans often hint at a particular change to the retina without being definitive. We can use our simulation to test what types of changes to a retina can lead to observed signatures in an image and therefore help in achieving a diagnosis.

The Problem

In collaboration with the UCL Medical Physics and Biomedical Engineering department, developers from ARC have worked to open up a legacy C and MATLAB library which simulates light propagating through matter. This software was initially developed as part of a PhD thesis approximately 20 years ago and has been continuously developed since then. However, the need to rapidly answer research questions led to the code becoming less sustainable and harder for others to use. Whilst the core functionality was already there, the library needed updating to a more modern language and aligning with the FAIR4SW principles.

What we did

The aim of the project was to provide users with a program to which they can give custom input describing the material they want to simulate, and from which they receive output they can use in further analysis. We wanted users not to have to worry about the internal workings of the software: they should only have to download the library code, build and install it once, and be ready for future analyses. We used modern build tools to standardise the build and install of the software, and we aimed to make our instructions as straightforward and operating-system-independent as possible. We also set up automated testing of the software and wrote example scripts that users can modify to easily create input files in the correct format.

The outcomes

Version 1.0.1 of the Time Domain Maxwell Solver (TDMS) is now available under a GPL-3.0 license. You can download it from GitHub, and install and run it on all operating systems. The project has a public-facing website and a growing collection of examples. We also have developer documentation so anyone can contribute in the future.

TDMS 1.0.1 now has a number of new features, including the option to switch between different solver methods (how the simulation is performed), select custom regions over which to compute (to avoid wasting computation time), and the ability to select different techniques for extracting output information through interpolation.

The ARC software engineers were a joy to work with. They brought knowledge of modern software engineering practice and quickly understood the code, and the underlying physics, as required to very effectively re-engineer the code. This collaboration with ARC will hopefully allow for a new range of users to access TDMS and significantly increase its impact.

Will Graham and Sam Cunliffe

A world map for website health checks with Grafana

By t.band, on 30 June 2021

UCL’s Geochronology software “IsoplotR” is available as an R package (just the calculations or with a web GUI) or online in a number of different locations. We already have Prometheus installed on one of our machines collecting metrics from various pieces of software we have running, and we have Grafana to display nice dashboards of these metrics. But could we add a little world map with little green or red dots for whether the various IsoplotR installations are up or down?

IsoplotR does not export Prometheus metrics, but polling the /version.txt HTTP endpoint on IsoplotR’s web interface should be enough to tell if it is running. So it looks like we need some sort of exporter to translate from “can I hit /version.txt” to something Prometheus is happy with. There is of course a standard, official way to do this, but it is not obvious from the documentation. The answer is the Blackbox Exporter.

How do we use the Blackbox Exporter? Once again, the documentation is not clear. Do we configure Blackbox? Or do we configure Prometheus? Or both? My guess that configuring Prometheus would work nicely was borne out; just cargo-culting the example configuration and changing the list of endpoints to poll was enough.

I decided to run Blackbox from a Docker image, but the documentation here says to run:

docker run --rm -d -p 9115:9115 --name blackbox_exporter \
    -v `pwd`:/config prom/blackbox-exporter:master \
    --config.file=/config/blackbox.yml

But this just results in errors in the log, because we haven’t made a blackbox.yml file (and don’t want to). So instead we run:

docker run --rm -d -p 9115:9115 --name blackbox_exporter \
    prom/blackbox-exporter:master

That’s better!

curl "http://127.0.0.1:9115/probe?module=http_2xx"\
"&target=http://chinageology.org:8080/version.txt"

Produces some nice Prometheus output.
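If you want to check that output from a script rather than by eye, the Prometheus text format is simple enough to parse by hand. The sketch below is a minimal illustration in Python: `parse_metrics` is our own helper rather than part of any Prometheus library, it only handles samples without labels or timestamps, and the sample text is an illustrative stand-in for the sort of thing the exporter returns, not captured output.

```python
def parse_metrics(text):
    """Parse un-labelled samples from Prometheus text output into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if "{" in name:
            continue  # labelled series are ignored in this sketch
        metrics[name] = float(value)
    return metrics


# Illustrative sample only; real output contains many more metrics
sample = """\
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# TYPE probe_success gauge
probe_success 1
"""

metrics = parse_metrics(sample)
is_up = metrics.get("probe_success") == 1.0
print(metrics, is_up)
```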

So having added the configuration to Prometheus and restarted it, we can check that it works from the Prometheus web front end. So I need a PromQL query. Eventually I figured out that the Prometheus output we got in the last stage was going to be a huge clue, so I tried this:

probe_success

and got a list of websites with their up/down status!

Now on to Grafana. We will need to add a world map plugin. I installed Grafana as a snap in Ubuntu Server, so the commands needed to install this plugin were different from the documentation:

sudo grafana.grafana-cli plugins install grafana-worldmap-panel

snap restart grafana

Now, looking at my Grafana browser tab, I was expecting this Worldmap panel to be available. But it wasn’t. I could see it in the plugins list, but the new panel type wasn’t available to choose. It took me way too long to hit F5 and see if a refreshed page would show it to me. It did.

Now the tricky bit. We know I want “probe_success” as the PromQL query, but the Worldmap panel has confusing options. Should I be using Table or Time Series output? Table didn’t work for me, producing a Data error “TypeError: this.datapoints is undefined”. So, Time Series it is. Worldmap has options for how to get the locations: “country”, “state”, “geohash”… none of these are right; I don’t have anything in the query or values returned that mentions the location of the website polled. We probably have to go with “json endpoint” which allows us to provide a mapping from … something … to a location. But map from what?

Eventually this is what I figured out: the Worldmap panel looks at the legend of the data and tries to turn that into a location. If you choose “json endpoint” you can supply your own mapping. So, I set the query legend to:

{{instance}}

Then I added a json file in /var/www/html where an nginx installation on the same machine would see it and serve it up. The file looks a bit like this:

[
 {
  "key": "http://isoplotr.es.ucl.ac.uk/version.txt",
  "latitude": 51.525304,
  "longitude": -0.133806,
  "name": "London"
 },
 {
  "key": "http://isoplotr.geo.utexas.edu/version.txt",
  "latitude": 30.285623,
  "longitude": -97.736181,
  "name": "Austin"
 },
 ...
]

And I checked I could load this file in a browser from another machine (this might be overkill). Then on the Worldmap panel setting I could set “Location Data” to “json endpoint” and “Endpoint url” to the URL that serves up the JSON. And it worked! A few tweaks to make it look nice (like setting a single threshold to 1 and the colors to red and green) and we have our map!

So, not the most elegant or robust solution (I’ll have to maintain that isopotr-locations.json file by hand instead of being able to have the locations in with the Prometheus configuration) but it looks nice.
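One way to make the hand-maintenance a little less painful would be to generate the JSON from a short list kept alongside the Prometheus configuration. The following is only a sketch of that idea in Python: the `sites` list and the `worldmap_locations` function are our own invention, using the two example entries shown above, and each key has to match the instance label that Prometheus attaches to the probed target.

```python
import json

# (probe target as it appears in the instance label, latitude, longitude, name)
sites = [
    ("http://isoplotr.es.ucl.ac.uk/version.txt", 51.525304, -0.133806, "London"),
    ("http://isoplotr.geo.utexas.edu/version.txt", 30.285623, -97.736181, "Austin"),
]


def worldmap_locations(sites):
    """Build the list of location entries in the shape shown above."""
    return [
        {"key": key, "latitude": lat, "longitude": lon, "name": name}
        for key, lat, lon, name in sites
    ]


locations_json = json.dumps(worldmap_locations(sites), indent=1)
print(locations_json)
```

The printed text could then be written to wherever the web server picks the file up.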