Open@UCL Blog

Archive for the 'Case Study' Category

Getting a Handle on Third-Party Datasets: Researcher Needs and Challenges

By Rafael, on 16 February 2024

Guest post by Michelle Harricharan, Senior Research Data Steward, in celebration of International Love Data Week 2024.

ARC Data Stewards have completed the first phase of work on the third-party datasets project, aiming to help researchers better access and manage data provided to UCL by external organisations.

alt=""

The problem:

Modern research often requires access to large volumes of data generated outside of universities. These datasets, provided to UCL by third parties, are typically generated during routine service delivery or other activities and are used in research to identify patterns and make predictions. UCL research and teaching increasingly rely on access to these datasets to achieve their objectives, ranging from NHS data to large-scale commercial datasets such as those provided by ‘X’ (formerly known as Twitter).

Currently, there is no centrally supported process for research groups seeking to access third-party datasets. Researchers sometimes use departmental procedures to acquire personal or university-wide licences for third-party datasets. They then transfer, store, document, and extract the data, and take steps to minimise information risk, before using it for various analyses. Obtaining third-party data involves significant overhead, including contracts, information governance (IG) compliance, and finance, and delays in gaining access can be a significant barrier to research. Some UCL research teams also provide additional support services, such as sharing, managing access to, licensing, and redistributing specialist third-party datasets for other research teams, and they increasingly take on governance and training responsibilities for these specialist datasets. Concurrently, the e-resources team in the library negotiates access to third-party datasets for UCL staff and students following established library procedures.

UCL’s processes for acquiring and managing third-party data have long been recognised as uncoordinated and inefficient, leading to inadvertent duplication, unnecessary expense, and under-utilisation of datasets that could support transformative research across multiple projects or research groups. This was acknowledged in “Data First”, the 2019 UCL Research Data Strategy.

What we did:

Last year, the ARC Data Stewards team reached out to UCL professional services staff and researchers to understand the processes they follow and the challenges they face when accessing and using third-party research datasets. We hoped that insights from these conversations could be used to develop more streamlined support and services for researchers, and to make it easier for them to find and use data already provided to UCL by third parties (where licensing conditions allow).

During this phase of work, we spoke with 14 members of staff:

  • 7 research teams that manage third-party datasets
  • 7 members of professional services that support or may support the process, including contracts, data protection, legal, Information Services Division (databases), information security, research ethics and integrity, and the library.

What we’ve learned:

An important aspect of this work involved capturing the existing processes researchers use when accessing, managing, storing, sharing, and deleting third-party research data at UCL. This enabled us to understand the range of processes involved in handling this type of data and to identify the various stakeholders who are, or potentially need to be, involved. In practice, we found that researchers follow broadly similar processes to access and manage third-party research data, with variations depending on the sensitivity of the dataset. However, as there is no central, agreed procedure for managing third-party datasets across the organisation, different teams may implement parts of the process differently, using the methods and resources available to them. We turned the challenges researchers identified in accessing and managing this type of data into requirements for a suite of services to support the delivery and management of third-party datasets at UCL.

Next steps:

 We have been working on addressing some of the common challenges researchers identified. Researchers noted that getting contracts agreed and signed off takes too long, so we reached out to the RIS Contract Services Team, who are actively working to build additional capacity into the service as part of a wider transformation programme.

Also, information about accessing and managing third-party datasets is fragmented, and researchers often don’t know where to go for help, particularly for governance and technical advice. To counter this, we are bringing relevant professional services together to agree on a process for supporting access to third-party datasets.

Finally, respondents noted that there is too much duplication of data. The costs for data are high, and it’s not easy to know what’s already available internally to reuse. In response, we are building a searchable catalogue of third-party datasets already licensed to UCL researchers and available for others to request access to reuse.

Our progress will be reported to the Research Data Working Group, which acts as a central point of contact and a forum for discussion on aspects of research data support at UCL. The group advocates for continual improvement of research data governance.

If you would like to know more about any of these strands of work, please do not hesitate to reach out (email: researchdata-support@ucl.ac.uk). We are keen to work with researchers and other professional services to solve these shared challenges and accelerate research and collaboration using third-party datasets.

Get involved!

alt=""The UCL Office for Open Science and Scholarship invites you to contribute to the open science and scholarship movement. Stay connected for updates, events, and opportunities. Follow us on X, formerly Twitter, and join our mailing list to be part of the conversation!

The benefits and barriers to code sharing as an Early Career Researcher

By Kirsty, on 14 September 2021

Guest post by Louise Mc Grath-Lone, Research Fellow (UCL Institute of Health Informatics), Rachel Pearson, Research Assistant (UCL Institute of Child Health) and Ania Zylbersztejn, Research Fellow (UCL Institute of Child Health)

In July 2021, we held a session on code sharing as part of the UCL Festival of Code and were thrilled to have almost 90 attendees from 9 out of UCL’s 11 faculties – highlighting that researchers from across a wide range of disciplines are interested in sharing their code.

The aims of the session were to highlight the benefits of code sharing, to explore some of the barriers to code sharing that Early Career Researchers may experience, and to offer some practical advice about establishing, maintaining and contributing to a code repository.

In this blog, we summarise the benefits of and barriers to code sharing that we discussed in the session, taking into account the views that participants shared.

What is code sharing and what are the benefits?

Code sharing covers a range of activities, including sharing code privately (e.g., with your colleagues as part of internal code review) or publicly (e.g., as part of a journal article submission).

For Early Career Researchers in academia, there are many benefits to sharing code including:

Reducing duplication of effort: For activities such as data cleaning and preparation, code sharing is an important method of reducing duplication of effort among the research community.

Capturing the work you put into data management: The processes of managing large datasets are time-consuming, but this effort is often not apparent in traditional research outputs (such as journal articles). Sharing code is one way of demonstrating the work that goes into data management activities.

Improving the transparency and reproducibility of your work: Code sharing allows others to understand, validate and extend what you did in your research.

Enabling the continuity of your work: Many researchers spend the early years of their career on fixed-term contracts. Code sharing is a way to enable the continuity of your work after you’ve moved on by allowing others to build on it. This increases the chances of it reaching the publication stage and your efforts and inputs being recognised in the form of a journal article.

Building your reputation and networks: Code sharing is a way to build your reputation and grow your networks which can lead to opportunities for collaboration.

Providing opportunities for teaching and learning: By sharing code and by looking at code that others have shared, Early Career Researchers have opportunities to both teach and learn.

Demonstrating a commitment to Open Science principles: Code sharing is increasingly valued by research funders (e.g. the Wellcome Trust) and is a tangible way to show your commitment to Open Science principles which are part of UCL’s Academic Framework and important for career progression.

Despite the clear benefits of code sharing, at the start of our session just 1 in 4 participants (26%) said that they often or always share code. However, by the end of the session, almost all participants (90%) said that they definitely or probably will share their code in the future.

What are the barriers to code sharing as an Early Career Researcher, and how can we overcome them?

We asked participants what has put them off sharing their code in the past. The most common responses were:

The time and effort required: Ideally, you would write perfectly formatted and commented code on the first go – however, in reality, it often does not work out like this. As you update code and encounter bugs, code can become messy, and considerable time and effort are needed to get it to a point where it can be understood by someone outside the research project. We discussed the importance of shifting your perception of ‘shareable’ code. Sharing any code, even if messy, is far more helpful than sharing nothing at all.

Lack of confidence and concerns about criticism: Many researchers who write code as part of their work have very little (or no!) formal training. This means that sharing code can be daunting. For example, researchers may be worried about others finding errors in their code; however, sharing code can help to catch bugs in code early on and can bolster your confidence and reassure you that your code is correct. In the session, we also discussed how getting involved with online coding communities that emphasize inclusivity and support (e.g., R Ladies, Tidy Tuesday or one of the UCL Coding Clubs) can help grow confidence and provide a kinder environment in which to share code publicly.

Not knowing how to share or who to share with: A lack of formal training means that many researchers are unsure about where or how to share code, including not knowing which licence to use to enable appropriate reuse. We discussed the need for more training opportunities and encouraged setting up your own code review group (like a journal club, but for sharing and discussing code).

Worry that code will be reused without permission: Some participants were worried about plagiarism and their hard work being reused without their knowledge or permission. However, hosting your code in a repository like GitHub allows you to choose a suitable licence for reuse of your code to prevent undesired use while still supporting open science! You can also see how many people have accessed your code.

How can Early Career Researchers get started with code sharing?

Preparing code to share can take time and, as they work to secure their future within academia, many Early Career Researchers may already feel overloaded and pulled in different directions (e.g., teaching, institutional citizenship, engagement work, producing publications, attending conferences, research management, etc.). However, code sharing is hugely beneficial for a career in academia and so we would encourage all Early Career Researchers to try to find the time to share code by viewing it as an opportunity to invest in your future self. For example, you could:

  • Adopt a coding style guide to help produce clear and uniform code with good comments from the outset. This will reduce the effort needed when you come to share code (and help your future self when you look at your code many years later and have inevitably forgotten what it all does!). See the short sketch after this list for the kind of code this produces.
  • Join a UCL Coding Club or an online community to learn tips from others about coding and sharing code.
  • Learn to use a code repository like GitHub. As part of our session, we delivered an introductory tutorial on how to use GitHub with links to other useful resources (available here).
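
To make the first point concrete, here is a minimal, hypothetical Python sketch of what ‘shareable’ code can look like: a short function with a docstring and comments that explain intent. The function name and the toy data are invented for illustration and are not taken from the session materials.

    # Illustrative example only: a tiny, documented data-cleaning helper
    # written in the spirit of a style guide such as PEP 8.

    def clean_ages(raw_ages):
        """Return ages as integers, reporting and skipping values that cannot be parsed.

        raw_ages is a list of strings as recorded in the raw data, e.g. ["34", "41", "n/a"].
        """
        cleaned = []
        for value in raw_ages:
            try:
                cleaned.append(int(value))
            except ValueError:
                # Report anything we skip so the cleaning decision stays transparent.
                print(f"Skipping unparseable age: {value!r}")
        return cleaned

    if __name__ == "__main__":
        print(clean_ages(["34", "41", "n/a"]))  # prints a skip message, then [34, 41]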

How can UCL support Early Career Researchers to share code?

We ended the session by asking the participants how UCL could better support them to share their code. Some of the ideas suggested by Early Career Researchers were:

More training on writing and sharing code: For example, one suggestion was that UCL could create a Moodle training course for code sharing. Training about best practice in coding (across several languages) to help Early Career Researchers to write code right the first time would also be helpful.

Simple, accessible guidance about code sharing: This might include checklists or 1-to-1 advice sessions, in particular to help Early Career Researchers to select the right licences.

Embed code sharing as best practice at all levels: Encouraging and supporting senior researchers to share code, so that it becomes embedded as good practice at all levels, would set a good example for more junior members of staff and encourage them to do the same. It would also help to ensure that the time and training required to prepare code for sharing is built into grant applications.

Knowledge sharing opportunities: More events and opportunities to discuss how research groups share code, so that best practice spreads across faculties throughout UCL.

 

We would like to thank everyone who attended our session – “Code sharing for Early Career Researchers: the good, the bad and the ugly!” – at the UCL Festival of Code for their time and contributions to the lively discussions. All the materials from the session are available here, including an introductory tutorial to getting started with code sharing using GitHub. We would also like to thank the organisers of the UCL Festival of Code for their help and support.

Open Access Week: the first ReproHack ♻ @ UCL

By Kirsty, on 17 November 2020

The Research Software Development Group hosted the first ReproHack at UCL as part of the Open Access Week events run this year by the Office for Open Science and Scholarship. This was not only the first event of this type at UCL, but also the first time a ReproHack ran for a full week.

What’s a Reprohack?

A ReproHack is a hands-on reproducibility hackathon where participants attempt to reproduce the results of a research paper from its published code and data, and share their experiences with the group and the papers’ authors.

As with most hackathons, this is also a learning experience! During a ReproHack, participants not only help to assess the reproducibility of published papers; they also learn how to build better reproducibility practices into their own research and come to appreciate the value of sharing code for Open Science.

An important aspect of ReproHacks is that the authors themselves put forward their papers to be tested. If you’ve published a paper and provided code and data with it, you can submit it for future editions of ReproHack! Your paper may be chosen by future reprohackers, who will give you feedback on its reproducibility. The feedback form is well designed, so you get a complete overview of what went well and what could be improved.

ReproHacks are open to all domains! Any programming language used in the papers is accepted; however, papers whose code uses open-source programming languages are more likely to be chosen by participants, as the software is easier for them to install on their computers.

In this particular edition, the UCL Research Software Development Group was available throughout the week to provide support to the reprohackers.

What did I miss?

This was the first Reprohack at UCL! You missed all the excitement that first-time events bring with them! But do not worry, there will be more Reprohacks!

This event was particularly challenging, with the same difficulties we have been fighting for the last nine months of running events online, but we had already gained some experience from other workshops and training sessions we had run, so everything went smoothly!

The event started with a brief introduction to what the event was going to be like, an ice-breaker to get the participants talking, and a wonderful keynote by Daniela Ballari entitled “Why computational reproducibility is important?”. Daniela provided a great introduction to the event (did you know that only around 20% of the published literature even has the “potential” to be computationally reproduced, and that most of it cannot be because the software is not free, the data provided are incomplete, or the software version used is not stated? [Culina, 2020]), linking to resources like The Turing Way and providing five selfish reasons to work on reproducibility. She put these reasons in the context of our circles of influence, showing how these practices benefit the author, their team, the reviewers and the wider community. The questions and answers that followed the talk were also very insightful! Daniela is a researcher in geoinformation and geostatistics who never trained as a software developer, so she had to find her own way to make her research reproducible, and those efforts were reflected in the selfish reasons she proposed in her talk.

The rest of the event consisted of ReproHacking-hacking-hacking! We separated into groups and started to choose papers. We then disconnected from the call, and each participant or team worked as they preferred over the following days to try to reproduce the paper(s) they had chosen. At the end of the week we reconvened to share how far we’d got and what we’d learned along the way.

In total we reviewed four papers. Only one participant managed to reproduce their paper in full; the rest of us (me included) got stuck somewhere in the process. We found that full reproducibility is not easy! If the version of a piece of software is not mentioned, it becomes very difficult to work out why something is not behaving as it should. But we also had a lot of fun, and the participants were happy that there is a community at UCL that fights for reproducibility!
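
One habit that would have saved us a lot of time is recording the exact software versions an analysis was run with. As a rough, hypothetical sketch (the package names below are examples and are not taken from the papers we reviewed), a few lines of Python can capture this information alongside the shared code:

    # Hypothetical sketch: record the Python and package versions an analysis used,
    # so that anyone trying to reproduce it can recreate the same environment.
    import sys
    from importlib.metadata import PackageNotFoundError, version

    # Example dependencies; list whatever your own analysis actually imports.
    PACKAGES = ["numpy", "pandas", "matplotlib"]

    def report_environment():
        # The interpreter version matters as much as the package versions.
        print(f"python=={sys.version.split()[0]}")
        for name in PACKAGES:
            try:
                print(f"{name}=={version(name)}")
            except PackageNotFoundError:
                print(f"# {name} is not installed")

    if __name__ == "__main__":
        report_environment()

Saving that output next to the code (or using tools such as pip freeze, conda env export, or renv in R) removes much of the guesswork we ran into during the week.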

Peter Schmidt also interviewed various participants during this ReproHack for Code for thought, a podcast that will be published soon! Right now he’s the person running RSE Stories, a podcast hosted by Vanessa Sochat, on this side of the Atlantic.

What’s next?

We will run this again! When? Not sure. We would like to run it twice a year, perhaps again during Open Access Week and another session sometime in March or April. Are you interested in helping to organise it? Give me a shout! We can make a ReproHack that better fits our needs (and our researchers!).

Thanks

A million thanks to Daniela Ballari; her talk was illuminating and helped to set the goals of the event!

A million thanks too to Anna Krystalli, a fellow Research Software Engineer at the University of Sheffield, who created this event format and provided a lot of help to get us ready! She’s a Software Sustainability Institute Fellow, and the SSI gave the initial push for ReproHacks to exist. We also want to thank the RSE group at Sheffield, as we used some of their resources to run the event!

I also want to thank the organisers of the ReproHack at LatinR (thanks Florencia!), as their event was just weeks before ours and seeing how they organised it was super helpful!

Case study: Disseminating early research findings to influence decision-makers

By Nazlin Bhimani, on 6 November 2020

A classroom in Uganda

Photograph by Dr Simone Datzberger

Recently a researcher asked for our advice on the best way to disseminate her preliminary findings from a cross-disciplinary research project on COVID-19. She wanted to ensure policy makers in East Africa had immediate access to the findings so that they could make informed decisions. The researcher was aware that traditional models of publishing were not appropriate, not simply because of the length of time it generally takes for an article to be peer-reviewed and published, but because the findings would, most likely, be inaccessible to her intended audience in a subscription-based journal.

The Research Support and Open Access team advised the researcher to take a two-pronged approach: (1) upload the working paper with the preliminary findings to a subject-specific open-access preprint service; and (2) publicise the research findings on an online platform that is both credible and open access. We suggested she use SocArXiv and publish a summary of her findings in The Conversation Africa, which has a special section on COVID-19. The Conversation has several country-specific editions for Australia, Canada English, Canada French, France, Global Perspectives, Indonesia, New Zealand, Spain, the United Kingdom and the United States, and is a useful vehicle for getting academic research read by decision makers and members of the public. We also suggested that the researcher publicise the research on the IOE London Blog.

What are ‘working papers’ & ‘preprint services’?

UCL’s Institute of Education has a long-standing tradition of publishing working papers to signal work in progress, share initial findings, and elicit feedback from other researchers working in the same area. The preprint service used thus far at the IOE is RePEc (Research Papers in Economics), which includes papers in education and the related social sciences. RePEc is indexed by the database publisher EBSCO (in EconLit) and by Google Scholar and Microsoft Academic Search. Commercial platforms such as ResearchGate also trawl through RePEc and index its content. Until it was purchased by Elsevier in May 2016, the Social Science Research Network (SSRN) was the other popular preprint repository used by IOE researchers, although its content is indexed mainly for its conference proceedings. The sale of SSRN to Elsevier resulted in a fallout between authors and the publisher, and this led to SocArXiv entering the scene. SocArXiv is an open-access, open-source preprint server for the social sciences which accepts text files, data and code. It is the brainchild of the non-profit Center for Open Science (COS), whose mission is to increase the openness, integrity and reproducibility of research – values that are shared by UCL and are promoted on this blog and by the newly formed Office for Open Science and Scholarship (for more information see also the Pillars of Open Science). In the spirit of openness, most papers on SocArXiv use the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International licence (CC BY-NC-ND), which safeguards the rights of the author. As papers on SocArXiv are automatically assigned Digital Object Identifiers (DOIs), they are discoverable on the web, particularly as Google Scholar indexes SocArXiv content.

What are the benefits of using preprint servers?

Whilst research repositories such as UCL Discovery are curtailed by publisher policies on what research can be made open access, this is not always the case for papers submitted to subject-specific preprint repositories. Without wanting to repeat what my colleague Patrycja Barczymska has already written in her post on preprints, I can confirm that, in addition to signalling the research findings and eliciting feedback, the benefits of depositing in preprint servers include enhanced discoverability (most will automatically generate a DOI at the time a paper is uploaded), the possibility of obtaining early citations, and the alternative metrics that services such as SocArXiv provide to indicate interest (e.g. the number of downloads, mentions, etc.). Researchers can also list open-access working papers in funding applications.

Does uploading a working paper to a preprint server hinder the publication of the final paper?

Researchers are concerned, and rightly so, that publishers may not publish their final research output if preliminary findings are deposited in preprint servers as working papers. However, more often than not, working papers are exactly that – work in progress. They are not the final article that gets submitted for publication. It is also likely that the preliminary findings and conclusions in the working paper will differ somewhat from the final version of the paper. It is worth knowing that some of the key social sciences publishers, such as Sage, Springer, Taylor and Francis / Routledge, and Wiley, explicitly state that they will accept content that has been deposited on a preprint server, as long as it is a non-commercial preprint service. In other words, researchers must not upload their working papers to platforms such as academia.edu and ResearchGate.

These ‘preprint-friendly’ publishers simply ask that the author informs them of the existence of a preprint and provides the DOI of the working paper at the time of submitting their article. Some ask that authors update the preprint to include the bibliographic details, including the new DOI, when their article is published, and that authors add a statement requesting readers to cite the published article rather than the preprint publication. Although a definitive list of individual journal policies does not exist, submission guidelines generally clarify issues related to preprints. Researchers may want to use the Sherpa Romeo service (and Sherpa Juliet for key funder policies) to obtain additional information.

More than a success story

The above case demonstrates how preliminary research findings can be shared expeditiously and in an open environment to aid the decision-making process. It also shows that open-access, subject-specific preprint services can help to promote both the research and the researcher, and that there is now wider acceptance among publishers that the traditional models of publishing are not always viable. This is especially true where cutting-edge research is needed, as in the case of research on COVID-19.