Tracking the spread of science with machine learning
By Basil Mahfouz, on 18 November 2021
On 3-5 November 2021, I joined research professionals from across the Network for Advancing and Evaluating the Societal Impact of Science (AESIS) to discuss state-of-the-art methods for evaluating the impact of research. Participants showcased institutional best practices, stakeholder engagement strategies, and ways to leverage emerging data sources. In this blog, I reflect on the conversations initiated at the conference, drawing upon insights gained throughout my research at STEaPP.
To solve global challenges, such as the climate crisis, scientists are racing to develop and deploy technologies at an unprecedented scale. As part of the Sustainable Development Goals, governments are now seeking to “substantially increase” spending on research and development (R&D), and need tools for identifying and enabling high impact scientific breakthroughs. Emerging data-driven systems offer research organisations an unprecedented opportunity to go beyond justifying the value of their research and move towards scaling the diffusion of scientific knowledge.
Despite their widespread use, citations do not measure the impact of science across society; rather, bibliometric tools are designed to track the spread of research publications within academia. To overcome this limitation, scientists write case studies, manually explaining the impact of their research. While narratives might provide a glimpse into the benefits of single research projects, the approach is laborious and difficult to scale to an institutional level or beyond. The United Kingdom’s Research Excellence Framework exercise of 2014, for instance, cost over £250 million, not including the expenses universities incurred developing almost 7,000 case studies.
Emerging digital technologies are unlocking alternative, data-driven tools for tracking mentions of research across the web. Many pioneering research organisations are already leveraging alternative metrics to complement conventional citation scores. The real opportunity, however, lies in applying machine learning to uncover the dynamics of how the public interacts with science, which forms part of my doctoral research at UCL STEaPP and our partnership with Elsevier’s International Centre for the Study of Scientific Research.
For instance, by augmenting alternative metrics with named-entity recognition, a computational process for identifying and categorising mentions of entities such as individuals and institutions in text, researchers can identify stakeholder groups that they may not have originally considered. Further semantic analysis can also determine the profiles of stakeholder groups, shedding light on how different types of entities or individuals engage with various elements of research. Equipped with this knowledge, research organisations can better understand the specific needs of their end-users, leading to tailored research strategies.
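To make the idea concrete, here is a minimal sketch of how tagged entities could be rolled up into stakeholder categories. It is not a trained named-entity recognition model: it swaps in a simple lookup table (a gazetteer) so that the example runs with the Python standard library alone. The names in `GAZETTEER`, the category labels and the sample mentions are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical gazetteer mapping known names to an entity category.
# Assumption: a real pipeline would use a trained NER model, not a fixed lookup.
GAZETTEER = {
    "world bank": "ORG",
    "oecd": "ORG",
    "ucl": "ORG",
    "geoff mulgan": "PERSON",
}


def tag_entities(text):
    """Return (name, category) pairs for gazetteer names found in the text."""
    lowered = text.lower()
    found = []
    for name, category in GAZETTEER.items():
        # Word boundaries avoid matching a name inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found.append((name, category))
    return found


def stakeholder_counts(mentions):
    """Count how many tagged entities of each category appear across mentions."""
    counts = Counter()
    for text in mentions:
        for _name, category in tag_entities(text):
            counts[category] += 1
    return counts


# Hypothetical online mentions of a piece of research.
mentions = [
    "The World Bank cited the study in its annual report.",
    "Geoff Mulgan discussed the findings at UCL.",
    "A new preprint on battery storage went viral.",
]

print(stakeholder_counts(mentions))
```

The third mention contains no known entity, so it contributes nothing to the counts. That is the kind of gap a trained model would be expected to narrow.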
Temporal analysis is another, equally critical opportunity. Not all entities engage with research at the same time, so data-driven analysis of timelines can map the pathways of scientific impact. By uncovering who the early adopters of research are, and how they influence the spread of knowledge, researchers can design more effective outreach and communication campaigns. Temporal analysis could also shed light on the role and value of knowledge brokers across the knowledge chain, enabling research organisations to engage intermediaries for maximum impact.
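As a rough illustration of this kind of timeline analysis, the sketch below finds when each stakeholder group first engaged with a piece of research and orders the groups from earliest to latest adopter. The group labels, dates and publication date are invented for the example, and the function names are hypothetical.

```python
from datetime import date

# Hypothetical records: (stakeholder group, date the research was mentioned).
mentions = [
    ("policy", date(2021, 3, 10)),
    ("news media", date(2021, 1, 20)),
    ("industry", date(2021, 6, 2)),
    ("news media", date(2021, 2, 14)),
    ("policy", date(2021, 5, 1)),
]


def first_engagement(records):
    """Map each group to the date of its earliest recorded mention."""
    first = {}
    for group, when in records:
        if group not in first or when < first[group]:
            first[group] = when
    return first


def adoption_order(records):
    """List groups from earliest to latest first engagement."""
    first = first_engagement(records)
    return sorted(first, key=first.get)


def lag_days(records, published):
    """Days between publication and each group's first engagement."""
    return {g: (d - published).days for g, d in first_engagement(records).items()}


published = date(2021, 1, 5)  # assumed publication date
print(adoption_order(mentions))
print(lag_days(mentions, published))
```

Ordering groups by first engagement is only a starting point. Establishing that early adopters actually influenced later ones would need further evidence, such as links or references between the mentions.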
Finally, different sectors have varying network structures and dynamics of interactions. Comparing the spread of scientific concepts across disciplines can shed light on the nuances of how knowledge spreads across industries and geographies. Mapping the underlying structures can help determine to what extent the impact of science is influenced by systemic and structural elements. With the right interventions, perhaps institutional dynamics can be reshaped to enable more effective diffusion of science across sectors.
Although data-driven methods offer a promising opportunity for tracking the impact of science, significant challenges remain. Unlike formal citations, which follow strict referencing processes, content shared on social media and across the web does not always have clear attribution. Current tools track online mentions of research via unique identifiers, an approach that captures only a small portion of all possible research mentions. At an aggregate level, however, this gap can be mitigated by applying natural language processing methods to track the collective spread of emerging scientific principles.
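The identifier-based approach and its blind spot can be sketched in a few lines. The example below looks for DOIs (one common type of unique identifier) in a handful of posts using a regular expression. The DOIs and posts are made up, and the pattern is a commonly used approximation rather than a complete DOI grammar.

```python
import re

# Approximate DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
# Assumption: this covers most modern DOIs but not every valid one.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(text):
    """Return DOIs found in the text, trimming trailing sentence punctuation."""
    # Trimming is a heuristic: a DOI that genuinely ends in ")" would be cut short.
    return [match.rstrip(".,;:)") for match in DOI_PATTERN.findall(text)]


# Hypothetical posts; the DOIs below are invented for illustration.
posts = [
    "Great paper on heat pumps: https://doi.org/10.1234/example.2021.001.",
    "New study shows heat pumps cut household emissions.",
    "See doi:10.5678/demo-42 for details",
]

tracked = [post for post in posts if extract_dois(post)]
coverage = len(tracked) / len(posts)

for post in posts:
    print(extract_dois(post), "<-", post)
print(f"Share of posts with a trackable identifier: {coverage:.2f}")
```

The second post discusses research but carries no identifier, so an identifier-based tracker never sees it. Mentions like this are where text-based natural language processing methods would have to take over.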
A 2018 joint report by the World Bank, United Nations, and OECD emphasised that the rate of innovation will “to a large extent” determine the likelihood of achieving the goals of the Paris Climate Agreement. It is not enough to simply increase research funding. Rather, by understanding the pathways of knowledge diffusion, players across the R&D sector can develop better systems that facilitate the spread and utilisation of new science and technology throughout society.
Bio: Basil Mahfouz is a Doctoral Candidate at UCL’s Department of Science, Technology, Engineering, and Public Policy, supervised by Professor Sir Geoff Mulgan. In partnership with Elsevier’s International Centre for the Study of Scientific Research, his PhD seeks to apply data-driven methods to track the dynamics and impact of science.