Chapter 5: Major Multi-Stakeholder Initiatives — Defining the Future of COVID-19 Observational Research

OpenSAFELY, UK

Brief description of the OpenSAFELY initiative and data

OpenSAFELY is a secure, transparent, open-source software platform for analysis of UK EHR data. It allows users either to create a trusted research environment around their own database or to add a privacy layer to an existing database, which can then be opened for wider use.

Created with public and charitable funding for the benefit of population health, OpenSAFELY is the product of a collaboration between the DataLab⁴¹ at the University of Oxford, the London School of Hygiene and Tropical Medicine EHR Research Group,⁴² the EHR system vendors The Phoenix Partnership (TPP) and EMIS Health, National Health Service (NHS) England, and NHSX (a joint unit of NHS England and the Department of Health and Social Care). The initiative was planned before COVID-19, but when the pandemic emerged, development accelerated and the platform was quickly deployed within the secure data centers of TPP and EMIS, the two largest electronic health record providers in the UK NHS. This allowed researchers to analyze NHS records while minimizing the sharing of confidential patient information.

How does OpenSAFELY work?

Researchers never see the raw data. Instead, the platform lets them build a “dummy” data set (of any size, but typically 100,000 records) that is modelled on the real data. The dummy data set has the same structure, variable names, and variable types that would be produced if the same code were run on the real data. For each variable, the researcher can set expectations for its distribution in the dummy data (e.g., 15% of records should have an asthma diagnosis). Researchers then develop their analysis code for statistical testing, graphs, tables, and dashboards against this dummy data, using open tools and services including GitHub. Once both the data management code and the data analysis code run to completion on the dummy data, the user can securely send the code into the live data environment to be executed against the real patient data. Outputs generated by the code are held on the secure server until at least 2 trained output checkers have reviewed them for privacy concerns, which includes redacting tables or figures that contain counts of less than 5. The redacted outputs are then released back to GitHub for the researcher to review and use in publications.⁴³
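The idea of setting an expectation for a dummy variable can be illustrated with a minimal sketch in plain Python. This is not OpenSAFELY code and does not use the platform's own tooling; the function name, field names, age range, and seed are all assumptions made for illustration. It generates 100,000 synthetic records in which roughly 15% carry an asthma flag, mirroring the example above.

```python
import random


def make_dummy_records(n, asthma_incidence=0.15, seed=0):
    """Build n synthetic patient records (hypothetical schema, not OpenSAFELY's).

    Each record has the structure an analysis script would expect, but every
    value is randomly generated and unrelated to any real patient.
    """
    rng = random.Random(seed)  # fixed seed so the dummy data is reproducible
    return [
        {
            "patient_id": i,
            "age": rng.randint(18, 100),  # assumed adult age range
            "has_asthma": rng.random() < asthma_incidence,
        }
        for i in range(n)
    ]


records = make_dummy_records(100_000)
share = sum(r["has_asthma"] for r in records) / len(records)
print(f"Share of dummy records with asthma: {share:.3f}")
```

Analysis code written against `records` only needs the column names and types to be right. The values themselves carry no information about real patients, which is what allows the development work to happen outside the secure environment.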

Every time a researcher subsequently changes their code, it is automatically checked by built-in algorithms to ensure it can be run without errors. This design enforces the principle that no analysis should happen without being prespecified in code. This clean separation of study design and execution makes undisclosed “p-hacking” impossible. This also means the code can be shared so that interested stakeholders can see what was done, and patients, professionals, and policymakers can confirm that this vast store of data has only been used for intended purposes.

Although some code may remain private while an analysis is in development, all code executed via OpenSAFELY is eventually published, and the results of the analysis (with disclosure controls applied) are made publicly available at the time of journal submission or 12 months after the first code was executed. In addition, code lists⁴⁴ are openly available at https://www.opencodelists.org for inspection and reuse from the moment they are created. Once an analysis is complete, only minimally disclosive summary data (tables or figures) are released outside the secure environment, and only after strict disclosure checks and redactions.

Common research tasks such as data aggregation, case matching, time-based numerator/denominator pairs, low number suppression, and statistical summaries are provided as libraries (called “actions”) that are reusable in any supported language (currently Python, R, and Stata). Actions are tested and improved over time. Anyone can contribute new actions to the action library.
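Low number suppression is the simplest of these tasks to illustrate. The sketch below is not an OpenSAFELY action; the function name, marker text, and example table are hypothetical. It applies the rule described earlier in this chapter, under which counts of less than 5 are redacted before release, and it assumes that a zero count is treated like any other small count.

```python
def suppress_small_counts(counts, threshold=5, marker="[REDACTED]"):
    """Return a copy of `counts` with every value below `threshold` replaced by `marker`.

    `counts` maps a group label to a patient count. The threshold of 5 follows
    the rule described in the text; treating 0 as a small count is an assumption.
    """
    return {group: (marker if n < threshold else n) for group, n in counts.items()}


# Made-up counts by age band, for illustration only.
table = {"18-39": 1204, "40-59": 87, "60-79": 4, "80+": 0}
released = suppress_small_counts(table)
print(released)
```

A production implementation would likely need more than this, for example preventing a redacted cell from being recovered by subtracting the visible cells from a published total. The sketch shows only the primary threshold check.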

OpenSAFELY and COVID-19 observational research: UK big data, shared code

The OpenSAFELY platform can be used on NHS data (currently under an emergency authorization) to support research that will deliver urgent results related to the global COVID-19 emergency. That means approved researchers can execute code against the pseudonymized primary care records of over 58 million people via OpenSAFELY-TPP and OpenSAFELY-EMIS. The records are also linkable to pseudonymized person-level data sets from other data providers using a salted hash (a securely pseudonymized identification key) generated from NHS numbers.
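The linkage principle can be sketched as follows. This is an illustration of salted hashing in general, not the scheme actually used by OpenSAFELY or NHS England; the salt value, the choice of SHA-256, the identifiers, and the record contents are all assumptions. Two data providers that hold the same secret salt derive the same key from the same identifier, so their records can be joined without either side exposing the identifier itself.

```python
import hashlib


def pseudonymize(nhs_number, salt):
    """Derive a pseudonymous key from an identifier and a secret salt (illustrative)."""
    normalized = nhs_number.replace(" ", "")  # ignore formatting differences
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()


SALT = b"example-secret-salt"  # hypothetical; a real salt must be kept secret

# Made-up identifiers and records; these are not valid NHS numbers.
primary_care = {pseudonymize("000 000 0001", SALT): {"asthma": True}}
death_register = {pseudonymize("0000000001", SALT): {"died": False}}

# Join the two data sets on the shared pseudonymous key.
linked = {
    key: {**primary_care[key], **death_register[key]}
    for key in primary_care.keys() & death_register.keys()
}
print(linked)
```

Without the salt, anyone could hash every possible identifier and reverse the mapping, which is why the salt has to stay secret. A keyed construction such as HMAC (available in Python's standard `hmac` module) is a common alternative to simple concatenation.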

Under certain conditions, and after review and approval of their project, individual researchers have long been able to gain access to specific NHS data sets. But the COVID-19 pandemic meant that a wide range of questions needed to be asked across a large body of data as quickly as possible. The OpenSAFELY approach allowed for analysis of patient data without moving it out of the secure environments where it already resides—or even granting access to the raw data at all.

Beyond EHR data, OpenSAFELY incorporates many external national data sets. Data are available from the Secondary Uses Service (SUS) (the main repository for health care data in the UK), Hospital Episode Statistics (HES), the NHS Emergency Care Data Set, and death certificate information from the Office for National Statistics. The platform also has access to data collected by the Intensive Care National Audit & Research Centre⁴⁵ and the International Severe Acute Respiratory and emerging Infection Consortium.⁴⁶ COVID-19-specific information comes from the COVID-19 Patient Notification System (standardized data to underpin national death analysis), plus the Second Generation Surveillance System (SGSS) and COVID-19 Hospitalizations in England Surveillance System (CHESS).⁴⁷

Gaining appropriate research ethics approval is a prerequisite for using the OpenSAFELY platform. (Information governance for OpenSAFELY-TPP and OpenSAFELY-EMIS is handled by NHS England.) Research proposals are assessed by NHS England and the OpenSAFELY collaboration to ensure they support relevant research and planning activities in response to the COVID-19 emergency. As of September 2021, the collaborative is creating a public dashboard⁴⁸ that lists all approved projects, including each project's purpose, the researchers' contact information and affiliated organization, the date when the first code was executed, and links to published material. Participant consent is not required under regulation 3(4) of the Health Service (Control of Patient Information) Regulations 2002,⁴⁹ and the legal gateways relied on under the General Data Protection Regulation (GDPR) do not require consent.

OpenSAFELY case studies

Within 6 weeks of the launch of the initiative, a group of 30 researchers had completed their first analysis of NHS data and released it as a preprint: “OpenSAFELY: Factors associated with COVID-19 death in 17 million patients”⁵⁰ (appearing in final form in Nature in July 2020). On behalf of NHS England, the OpenSAFELY team had used the platform to quantify a range of potential risk factors for COVID-19-related death. It was the largest cohort study conducted by any country to date, covering 40% of adults in the UK. The underlying population included all adult patients registered with a general practice using TPP software who had at least 1 year of data history prior to February 1, 2020. Of the 17,278,392 individuals meeting this definition, 10,926 had a death from COVID-19 reported in Office for National Statistics (ONS) data by May 6, 2020. The analysis found associations between COVID-19 death and older age, male sex, non-White ethnicity, lower socio-economic status (SES), and many factors related to comorbid conditions and medical history: respiratory, cardiovascular, cerebrovascular, neurological, hepatic, renal, and immunosuppressive conditions, recent cancer history, and diabetes. The final paper included an age-stratified analysis and a model with age as an interaction term to inform more accurate risk prediction.

The collaborative followed up on the topic with a more sophisticated model for predicting COVID-19 related death (relative and absolute estimates of risk in the context of changing levels of circulating virus) and factors associated with deaths due to COVID-19 versus deaths due to other causes in the same time period.⁵¹

Later in 2020, the OpenSAFELY Collaborative published “Risk of COVID-19-related death among patients with chronic obstructive pulmonary disease or asthma prescribed inhaled corticosteroids” in Lancet Respiratory Medicine.⁵² This observational cohort study investigated the association between inhaled corticosteroids (ICS) and COVID-19-related death among people with chronic obstructive pulmonary disease (COPD) or asthma. Early reports of hospital admissions during the COVID-19 pandemic showed a lower prevalence of asthma and COPD than might be expected for an acute respiratory disease such as COVID-19, leading to speculation that ICS might be protective. The paper found no association, and the National Institute for Health and Care Excellence (NICE) responded by noting the study in their rapid guideline for severe asthma during COVID.⁵³

Other OpenSAFELY studies published in 2020 addressed issues that were being actively discussed and debated at the time. One investigated the effectiveness of routine hydroxychloroquine use for prevention of COVID-19 mortality (not for treatment of COVID-19).⁵⁴ No evidence of benefit or harm was found after adjustment for patient characteristics including existing health conditions. Another assessed the association between routinely prescribed non-steroidal anti-inflammatory drugs (NSAIDs) and deaths from COVID-19, finding no harmful effect.⁵⁵

OpenSAFELY researchers also investigated ethnic disparities in COVID-19,⁵⁶ risks for people with learning disabilities,⁵⁷ and risks for people with or without children in their household.⁵⁸

When the SARS-CoV-2 variant B.1.1.7 emerged in England, a rapid analysis was done to compare the case fatality risk in variant versus non-variant cases, adjusting for demographic factors and comorbidities. Early vaccination trends were also examined. The collaborative also published observations on how to identify care home residents in the data and on trends in the clinical coding of long COVID in primary care.⁵⁹

OpenSAFELY impact and perspective

Alex Walker, a lead DataLab researcher, says the rapid expansion of the initiative was born out of urgency: “We needed to know what groups of people were most vulnerable, how the virus affected them, which drugs might help or hinder when treating patients, and what happened to patients who recovered from infection. To answer these questions quickly, we needed an unprecedented amount of clinical patient data.” Walker says that now that those data have been gathered, “OpenSAFELY gives us the power to quickly respond to emerging clinical population health and policy challenges with precise data and open methods.” With this tool, “researchers can continue to provide immediate answers to extremely urgent questions in any future health emergency.”⁶⁰

So what is the impact on the future of RWE? Ben Goldacre, the Director of DataLab and joint principal investigator for OpenSAFELY, says, “COVID-19 could—and should—be the turning point; however, it all hangs on modern, open, reproducible techniques.” He refers to OpenSAFELY as an example of a “collaborative data science ecosystem.”⁶¹

Sebastian Bacon cites DataLab’s roots in evidence-based medicine, saying that the goal was to “do evidence-based medicine better by getting epidemiologists, clinicians, researchers, and software engineers all working together.” In personal correspondence (September 2021), Bacon wrote that OpenSAFELY “is best practices encoded in software . . . you are forced to do things safely and efficiently.”

In most analyses of EHR data, even on the same database, data management tasks are achieved by a wide range of individualized methods, using different tools, platforms, and programming languages. (Ben Goldacre equates this with “building a fridge every time you need a cold beer.”)⁶¹ By restricting the user to OpenSAFELY tools, variable definitions and code are created so they can be read, understood, and adapted by other users. Differing variable definitions can be seen and tested against one another. Activity on the platform is publicly logged and any code used for data management and analysis is shared to enable scientific review and efficient reuse. In this way, reusable code and “reusable knowledge objects” are created.

When asked if it was inconvenient to be restricted to a set of tools and kept at a distance from the data, Bacon says, “Yes. But that’s exactly the reason there have been poor security practices in the past, or poor privacy practices: because it’s not convenient . . . people use the word ‘password’ for their password, because it’s not convenient to have strong passwords. That’s the central problem we’re trying to solve. Yes, some aspects of OpenSAFELY are slow compared to doing it the easy but unsafe way, so we aim to solve that by creating additional tools and services that make it easier for people to choose to do the right thing.” To the OpenSAFELY team, the tradeoff is access to powerful data sets, to open-source code, and to features that can be added as requested.

Regarding complex study designs and machine learning, Goldacre says “It’s all theoretically possible, but not all fully implemented yet—we are implementing functionality for our users as the requirements come up.”⁶¹

Goldacre claims we are at the point of a culture shift: “In the past it was quite natural—because data management and data collection was such a laborious and uncommon business—for people to assert monopolies . . . that the person who collected and managed the data would be the only person who would analyze it. It is only since computational tools have become more widely accessible that it has become more common for lots of people to expect to . . . reproduce things and examine the methods that were used.”⁶¹ He compares this period to 10–15 years ago, when clinical trial reporting practices were changed, requiring public registration for all clinical trials (and all interventions within those trials) and strongly encouraging result reporting. That was a shock at the time—but is now an accepted norm.