Beyond the data revolution
By regfbec, on 31 December 2017
The data explosion, which affects almost all aspects of contemporary life, has become a media cliché. In the biomedical sphere, the “omics” revolutions are continuing to generate terabytes of biomolecular data, increasingly supplemented by other enormous data sets, often linked to each other, which include images and a growing set of behavioural and clinical parameters. It is a truism that computational tools are required to process and interpret this data, and as I explored in a previous blog, training individuals who can work at the interface between computer science and biomedical science is one of the major challenges in medical and scientific training. But two important problems still limit the impact of the data revolution on the progress of biomedical research and its eventual translation into improved health.
The first recurring problem, common to research centres around the world, is how best to integrate individuals with computational skills into the biomedical research enterprise. Computer scientists and life scientists are trained to speak different languages, and possess quite different skill sets. It is a long and arduous process to train a mathematician or computer scientist to truly understand the nature of the problem being worked on by their biologist colleagues. And it is at least as difficult, and arguably even harder, for a life scientist to take the reverse path. In many institutions, individuals with computational skills act as data analysts, providing a service to one or more research groups, often organised as a core facility. Indeed, this has led to the emergence of new specialised bioinformatics training courses. Such core facilities have proved invaluable, especially in providing low-level data processing and in acting as the human interface between the data-generating machine and the biologist wanting the data. Nevertheless, this model does not harness the full power of computational science to the biomedical enterprise. The interests and excitement which drive the scientists who are creating the extraordinary current explosion in computer science are usually not captured by providing a largely technical service to their life science colleagues. And individuals in such centres are often stretched, working on many projects simultaneously, with little time or energy to integrate cutting-edge computer science research into their everyday analysis.
The second problem exists at the level of the data itself. From one point of view, the high-dimensional complex data sets being generated today can be regarded simply as extensions of the simpler oligo-molecular studies of classical biology. RNAseq, for example, certainly allows one to capture the levels of thousands of transcripts, rather than one or two at a time. But after the raw data has been converted into gene lists, analysis often involves homing in on selected genes or sets of genes of ‘interest’ and basically ignoring the rest of the data. Similar considerations apply to many other examples of ‘big data’. Without a clear conceptual framework for handling the data as a cohesive data set, and not as a series of measurements on many individual and unrelated parameters, it is hardly surprising that the laboratory scientist is overwhelmed by the flow of new data. All too often, the computational scientist working on a project is challenged with the question ‘what does this data mean?’. The biologist almost expects some magical interpretation which will reveal a fundamental but unknown answer to an unasked question lying buried within the data set. But computer scientists are not magicians!
I would like to propose that mathematical modelling can offer some of the answers to both the challenges outlined in the previous two paragraphs. In applied mathematics, statistics, computer science and indeed the physical sciences in general, model building is at the very heart of the scientific enterprise. Such models are of many types: statistical models, mechanistic models of physical processes, deterministic or stochastic models, agent-based models, etc. The complexity of model output often outstrips the intrinsic complexity of the model itself. Even a simple ordinary differential equation, once a time delay term is added (making it a delay differential equation), can produce output of ‘chaotic’ complexity. Nevertheless, models provide ways of describing, exploring and capturing the behaviour of real-life processes which no other approach can offer. And they provide a focus for linking theory and application, a nexus for the combined efforts of biologist, mathematician and computer scientist.
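As an illustration of how a time delay can generate complex output from a very simple model, here is a minimal sketch in Python. The choice of example is mine: it uses the Mackey-Glass equation, dx/dt = beta * x(t - tau) / (1 + x(t - tau)^n) - gamma * x(t), a well-known model of blood cell production. The parameter values (beta = 0.2, gamma = 0.1, n = 10, tau = 17) are the ones commonly quoted for the chaotic regime. The function names, the fixed-step Euler integration and the constant initial history are assumptions made purely to keep the sketch short.

```python
def simulate_mackey_glass(tau, beta=0.2, gamma=0.1, n=10,
                          x0=1.2, dt=0.1, t_end=1000.0):
    """Integrate the Mackey-Glass equation with a fixed-step Euler scheme.

    Assumption: the history before t = 0 is held constant at x0.
    Returns the list of x values at times 0, dt, 2*dt, ..., t_end.
    """
    delay_steps = int(round(tau / dt))   # the delay expressed in time steps
    n_steps = int(round(t_end / dt))
    xs = [x0]
    for k in range(n_steps):
        x_now = xs[k]
        # Look back tau time units; before t = 0 use the constant history.
        x_delayed = xs[k - delay_steps] if k >= delay_steps else x0
        dxdt = beta * x_delayed / (1.0 + x_delayed ** n) - gamma * x_now
        xs.append(x_now + dt * dxdt)
    return xs


def late_time_range(xs, fraction=0.2):
    """Spread (max - min) of the final part of a trajectory."""
    tail = xs[int(len(xs) * (1.0 - fraction)):]
    return max(tail) - min(tail)


if __name__ == "__main__":
    no_delay = simulate_mackey_glass(tau=0.0)
    with_delay = simulate_mackey_glass(tau=17.0)
    print("late-time spread without delay:", late_time_range(no_delay))
    print("late-time spread with delay:   ", late_time_range(with_delay))
```

Without the delay, the trajectory settles onto a steady state and the late-time spread is essentially zero. With the delay, the same equation keeps fluctuating irregularly and never settles. Plotting `with_delay` against time makes the irregularity obvious.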
The introduction of modelling into the life sciences is, of course, not new. Enzymology, pharmacology, genetics, structural biology: there is a long list of subjects in which models have played fundamental roles. But today we need to work on new classes of models which capture and deal with big data. Systems biology has begun to provide such models. Of course, at one level a model provides testable hypotheses and suggests new experiments which can be used to verify or falsify the model. This cycle between model building and experimentation is well accepted. But at a much deeper and more general level, models can provide the essential link between data and biological understanding: they provide a way to understand large data sets at the scale of the data itself. They provide an antidote to reductionist molecular biology, which has delivered the spectacular advances of the past century, but which is struggling in the face of the current data flood. At the same time, models can provide the focal point for drawing together physical and life scientists. Mathematicians and computer scientists alike will be excited and drawn in by models which speak in a language they can understand, and which provide challenges they are fully equipped to explore. Indeed, exploring models which capture and encompass ‘big data’ is one of the major challenges in contemporary computer science. But a good model will also inspire and motivate the experimental scientist. Models can provide the common talking point which will drive true integration of mathematics, computer science and biology.
PS These thoughts arose as I was preparing a new teaching module called “Mathematical modelling in biomedicine” which will start in January 2018.