By Bruno C N Silva, on 21 May 2013
There is a fundamental dissonance between how we do HPC today and how Cloud services are presented to their users.
In HPC, users are directly exposed to the finiteness of resources by having to work with a scheduler, which more often than not wraps around their application: the computation happens implicitly and in a deferred manner, on a specific software stack. Some work has to be done to port an application to a platform which in theory should be the same everywhere (Linux and GNU tools running a scheduler and numerical libraries). Invariably, the user has to wait a near-non-deterministic amount of time to obtain their results.
With Cloud services, users are given resources almost immediately and have a very high degree of flexibility to create their own infrastructure to suit the needs of a particular workflow (or at least that is the promise): infrastructure becomes software. Performance bottlenecks mainly affect strong-scaling applications; for other workloads networking is not much of an issue, with the possible exception of intense I/O on each machine. Researchers can prototype infrastructure at fairly low cost. However, the issues of financial and technological lock-in discussed earlier in the thread may kick in.
What we are finding is that the constraints on running workloads in the cloud stem from the following issues (apologies beforehand for the poor wording):
1) cost (data transfer out of the public cloud still seems to be the main problem)
2) institutional and research council accounting
3) software licensing (think Gaussian or Matlab)
4) data protection and privacy
5) put your favourite regulatory constraint here…
At UCL we have had a number of conversations with researchers who use the cloud regularly, and there is an interesting use case for institutional Research Computing resources: scheduling a “platform/application/database as a service” session at a given time and date, via an advance reservation, to allow for unconstrained prototyping. This gives researchers some level of institutional subsidy (or not) together with data protection. We are therefore starting an R&D activity to explore this and other uses for “cloud”, such as an actual private compute cloud service.
We are also looking at expanding our service, first to cycle scavenging using VMs on UCL desktops, and possibly bursting to the cloud when that resource becomes contended. This is of course for pleasantly parallel workloads only; strong-scaling workloads which depend on the interconnect (with a few exceptions) will remain at the institution (or consortium) until there is a comparably good commercial proposition.
By Owain A Kenway, on 16 May 2013
In our efforts to make Legion more user-friendly, we have implemented a set of wrapper scripts we call “gerun” (Grid Engine Run) which you can invoke instead of “mpirun” to launch MPI jobs.
The advantage of gerun is that you don’t need to pass machine files, numbers of processors or RSH replacement programs to it – this is all worked out for you from your job.
This means, for example, that whereas before you had to pass a bunch of different options in your job scripts depending on which MPI implementation you were using (QLogic, OpenMPI, Intel MPI etc.), so that if your program was called $HOME/src/madscience/mad your job script would contain one of:
mpirun -m $TMPDIR/machines -np $NSLOTS $HOME/src/madscience/mad
mpirun -machinefile $TMPDIR/machines -np $NSLOTS $HOME/src/madscience/mad
mpirun --rsh=ssh -machinefile $TMPDIR/machines -np $NSLOTS $HOME/src/madscience/mad
you can now simply use the following, regardless of which MPI implementation you are using (as long as you have the right modules loaded):
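gerun $HOME/src/madscience/mad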
If we add new MPIs to the system, we will also add support for them in gerun so you won’t need to learn new options.
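To give a rough idea of what gerun does behind the scenes, the logic is along the following lines. This is an illustrative sketch only, not the actual gerun script, and the mapping of MPI flavours to options simply mirrors the examples above:

#!/bin/bash
# Illustrative sketch only, not the real gerun script.
# Grid Engine exports TMPDIR (containing a "machines" file) and NSLOTS for each
# parallel job, so a wrapper can fill in the mpirun options automatically.
case "$(which mpirun)" in
  *openmpi*)
    exec mpirun -machinefile "$TMPDIR/machines" -np "$NSLOTS" "$@" ;;
  *intel*)
    exec mpirun --rsh=ssh -machinefile "$TMPDIR/machines" -np "$NSLOTS" "$@" ;;
  *)
    # e.g. a QLogic-style mpirun which takes -m for the machine file
    exec mpirun -m "$TMPDIR/machines" -np "$NSLOTS" "$@" ;;
esac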
We are updating our documentation on the web-site to use this method for launching jobs.
You can still use mpirun to launch applications if you prefer.
By Owain A Kenway, on 16 May 2013
Over the past year or so we have been working hard to make the Condor High Throughput service more attractive to users, with the aim that many of the “high throughput” workloads on Legion (i.e. those composed of many serial jobs) could be moved to this service, which is relatively uncontended.
With this aim in mind, we now have a Research Computing managed software area, which is visible to all Condor jobs (and Condor jobs only) as drive K:\. The software currently deployed in this area includes:
1) Enthought Python EPD 7.2-2 – This is a Python distribution which comes with many useful modules pre-installed (e.g. NumPy/SciPy). We have deployed the same version as is available on Legion, so if your Python scripts run with the version of EPD available from modules on Legion, then they should run on Condor (see the quick version check after this list)!
2) MinGW + MSYS – This is a distribution of GNU compilers and some shell tools intended for building applications from source as native Windows binaries.
3) Cygwin – A fairly large install of Cygwin tools, including GNU compilers, various shell and archiving utilities, useful for porting/running ported UNIX applications on Windows, running shell scripts and so on.
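As a quick, purely illustrative sanity check, you can print the interpreter and module versions from the EPD Python on Legion and compare them with what a Condor job reports (the exact versions you see will of course depend on what is deployed at the time):

# Example only: prints the Python, NumPy and SciPy versions so the two environments can be compared
python -c "import sys, numpy, scipy; print sys.version; print numpy.__version__, scipy.__version__"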
We are currently working on an R install, with the same packages as are available on Legion where possible, which should be available later this year. The availability of MinGW and Cygwin means that it should no longer be necessary to have access to a 32-bit Windows machine to compile your code for Condor, as the build process can be scripted and submitted as a job (a rough sketch is given below); if you have difficulty doing this, we are keen to help.
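For example (and this is only a hypothetical sketch: the drive-letter paths and the source file name are placeholders, not the actual layout of the K:\ software area), a scripted MinGW build might look something like this when run as a Condor job:

#!/bin/sh
# Hypothetical example only: compile a small C program with the MinGW tools on an execute node.
# The paths below are placeholders; check the actual layout of the K:\ software area.
export PATH="/k/MinGW/bin:/k/MSYS/bin:$PATH"
gcc -O2 -o mad.exe mad.c
# Condor will typically transfer mad.exe back with the job output once the build finishes.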
The Condor service consists of all the Windows Myriad desktops dotted about UCL and at last count comprised around 2,200 cores, which sit idle when they are not being used for desktop tasks.
Service Overview here:
User Guide here:
Applying for a Condor account, if you have an existing UCL account, is lightweight: simply drop an e-mail to rc-support[at]ucl.ac.uk saying you are interested and we will add you to the relevant UNIX group.