Chapter 3: Methods in Real-World Evidence Generation - Sources of Error

5. Missing Data

Author: Lin KJ


The most commonly used RWD sources, insurance claims data and EHRs, could have missing data when used for COVID-19 pharmacoepidemiology research. For patients with mild COVID-19 disease, supportive care has been the preferred management strategy;62 pharmacological treatments with possible antiviral effects have been primarily used for patients with moderate to severe disease requiring hospitalization.63,64 Insurance claims data generally do not capture detailed information on inpatient medication use, vaccines or treatments not reimbursed by the insurance provider or investigational drugs, or certain clinical data elements, limiting their utility for addressing some questions.65 In studies relying on claims, missing information generally translates into misclassification of study variables, which can lead to bias. In contrast, EHR data often enable ascertainment of certain clinical covariates that are not usually measured in claims data, such as vital signs, laboratory test results, smoking status, body mass index (BMI), and code status. However, these factors can have a substantial amount of missing data in routine-care EHR, which can also lead to bias.66 In addition, EHR-discontinuity, defined as receiving care outside of the reach of a given EHR, can result in incomplete capture of medical information in the study EHR, which is another form of data incompleteness that can lead to misclassification of key study variables.67 It is critical to examine and address missing data in study outcomes, exposures, and covariates.

What does it look like in practice?

For routinely evaluated health metrics or biomarkers, such as laboratory results, vital signs, or body mass index, missing data present as no values recorded in the timeframe of interest (e.g., one year prior to a drug dispensing). If longitudinal recording of multiple values of the same biomarker is used to monitor disease course, such as a change in glycosylated hemoglobin (HbA1c) or inflammatory markers, having a recording of the biomarker at some time points (e.g., at cohort entry) but not others (e.g., at certain visits or time intervals during follow-up) also creates missing data in the derived measure (e.g., change in the value since baseline). This is in contrast to randomized trials, in which measurements are typically made on a specific schedule according to a study protocol. For health outcomes, such as those defined using diagnostic codes, missing data in EHR or claims data often result in misclassification of study variables (i.e., false negatives). For example, if a patient has COVID-19 but does not have a medical encounter during the study period in which a COVID-19-related code is recorded, it will appear in the data as though this individual does not have COVID-19. It will not be known that the data point is missing, but the patient’s COVID-19 status will be misclassified. A true negative therefore cannot be distinguished from an unknown, as the absence of a value may be the result of the patient not having the condition, the value not being collected, or the patient being seen outside the system.
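The lookback-window definition of a missing baseline value can be illustrated with a minimal Python sketch. The patient identifiers, dates, HbA1c values, the 365-day window, and the helper name baseline_value are all hypothetical and chosen only for illustration; a real study would define the window and the choice of value (e.g., most recent) in its protocol.

```python
from datetime import date, timedelta

def baseline_value(measurements, index_date, lookback_days=365):
    """Return the most recent value in the lookback window, or None if missing."""
    window_start = index_date - timedelta(days=lookback_days)
    in_window = [(d, v) for d, v in measurements if window_start <= d <= index_date]
    if not in_window:
        return None  # no recording in the window: the covariate is missing
    # Keep the value recorded closest to the index date.
    return max(in_window, key=lambda pair: pair[0])[1]

# Hypothetical HbA1c recordings as (date, value) pairs per patient.
hba1c_records = {
    "patient_a": [(date(2020, 1, 10), 7.2), (date(2020, 9, 1), 6.8)],
    "patient_b": [(date(2018, 5, 5), 8.1)],  # measured, but outside the window
    "patient_c": [],                         # never measured in this EHR
}
index_date = date(2020, 10, 1)  # e.g., date of the first drug dispensing

baseline_hba1c = {
    pid: baseline_value(rows, index_date) for pid, rows in hba1c_records.items()
}
print(baseline_hba1c)
```

Note that patient_b and patient_c both end up with a missing baseline value, although for different reasons that the analytic data set alone cannot distinguish.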

Why is it a problem?

The mechanism or reasons causing missing data will determine its propensity for biasing effect estimates.68 “Missing completely at random” (MCAR) means the probability of missingness is independent of exposure and/or outcome (e.g., missing a batch of laboratory results due to a fire). Under MCAR, using only people without missing data (i.e., complete case analysis) will not lead to bias. “Missing at random” (MAR) means the probability of missingness depends only on observed variables (e.g., missing prostate specific antigen values in females). Under MAR, complete case analysis may be biased, so proper implementation of statistical methods is needed to yield unbiased results. “Missing not at random” (MNAR) means the probability of missingness depends not only on observed but also on unobserved information (e.g., people with better renal function tend to miss serum creatinine values that measure renal function). Under MNAR, bias caused by missing data is expected, and an effort to quantify the impact on the study validity should be made.
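The practical difference between the three mechanisms can be seen in a small simulation. The following Python sketch is illustrative only: the covariate x, the laboratory value y, and the missingness probabilities are invented assumptions, and the target quantity is simply the mean of y estimated from complete cases.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=0):
    """Return (full-data mean, complete-case mean) of a simulated lab value."""
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)      # fully observed covariate
        y = x + rng.gauss(0, 1)  # lab value that may be missing
        if mechanism == "MCAR":
            p_missing = 0.4                   # unrelated to x or y
        elif mechanism == "MAR":
            p_missing = 0.7 if x > 0 else 0.1  # depends on observed x only
        else:  # MNAR
            p_missing = 0.7 if y > 0 else 0.1  # depends on the unobserved value itself
        full.append(y)
        if rng.random() >= p_missing:
            observed.append(y)
    return statistics.fmean(full), statistics.fmean(observed)

results = {m: simulate(m) for m in ("MCAR", "MAR", "MNAR")}
for mechanism, (full_mean, complete_case_mean) in results.items():
    print(mechanism, round(full_mean, 3), round(complete_case_mean, 3))
```

In this setup the complete-case mean stays close to the full-data mean only under MCAR. Under the simulated MAR mechanism the bias is recoverable in principle because x is observed, whereas under MNAR it is not recoverable from the observed data alone.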

How to handle it

Assessment of missingness mechanism: The first step is to examine the frequency and patterns of missingness in the database.69 Then the missingness mechanism should be assessed based on domain knowledge or empirical data whenever available.70 Under the assumption of MCAR, complete case analysis is unbiased, but statistical power may be reduced. We will now briefly survey strategies to reduce potential bias due to missing data under MAR and MNAR.
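A first look at the frequency and patterns of missingness can be sketched as follows. The records, variable names, and values are hypothetical, and None is assumed to encode a missing value. The final comparison, mean age by whether HbA1c is missing, is one simple descriptive check of whether missingness is related to an observed variable; it cannot by itself distinguish MAR from MNAR.

```python
from collections import Counter

# Hypothetical baseline records; None marks a missing value.
records = [
    {"age": 71, "bmi": 27.1, "smoking": "never", "hba1c": 6.9},
    {"age": 64, "bmi": None, "smoking": None, "hba1c": 7.4},
    {"age": 58, "bmi": 31.0, "smoking": None, "hba1c": None},
    {"age": 80, "bmi": None, "smoking": None, "hba1c": None},
    {"age": 49, "bmi": 24.3, "smoking": "current", "hba1c": 5.8},
    {"age": 77, "bmi": 29.5, "smoking": None, "hba1c": None},
]
variables = ["age", "bmi", "smoking", "hba1c"]

# Frequency: proportion of records missing each variable.
missing_fraction = {
    v: sum(r[v] is None for r in records) / len(records) for v in variables
}

# Patterns: which combinations of variables are missing together.
pattern_counts = Counter(
    tuple(v for v in variables if r[v] is None) for r in records
)

# Is missingness in HbA1c related to an observed variable such as age?
ages_missing = [r["age"] for r in records if r["hba1c"] is None]
ages_observed = [r["age"] for r in records if r["hba1c"] is not None]
mean_age_hba1c_missing = sum(ages_missing) / len(ages_missing)
mean_age_hba1c_observed = sum(ages_observed) / len(ages_observed)

print(missing_fraction)
print(pattern_counts)
print(mean_age_hba1c_missing, mean_age_hba1c_observed)
```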

Three commonly used statistical methods to handle missing data under MAR are multiple imputation, maximum likelihood methods, and inverse probability weighting.68,71

Multiple imputation (MI) relies on a correctly specified imputation model that uses observable information to replace the missing value with a predicted value and repeats such imputation process multiple times to account for the uncertainty of the imputation process. While there is no definitive cut-off for the proportion of missing data that allows reliable imputation, imputation should generally be avoided when more than 40% of the data are missing.72,73 The implementation of MI can be computationally intensive for large data sets.
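The impute, analyze, and pool cycle of MI can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a recommended implementation: the data are simulated, the imputation model is a single-covariate linear regression, parameter uncertainty is approximated by refitting the model on a bootstrap sample of the complete cases for each imputation, and the estimates are pooled with Rubin's rules. The function names fit_line and multiply_impute_mean are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD of y on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sigma

def multiply_impute_mean(xs, ys, m=20, seed=1):
    """Pooled estimate and standard error of mean(y) from m imputed data sets."""
    rng = random.Random(seed)
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    estimates, variances = [], []
    for _ in range(m):
        # Bootstrap the complete cases so each imputation uses different parameters.
        boot = [rng.choice(complete) for _ in complete]
        a, b, s = fit_line([p[0] for p in boot], [p[1] for p in boot])
        # Replace each missing y with a prediction plus random residual noise.
        filled = [y if y is not None else a + b * x + rng.gauss(0, s)
                  for x, y in zip(xs, ys)]
        estimates.append(statistics.fmean(filled))
        variances.append(statistics.variance(filled) / len(filled))
    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    return pooled, (within + (1 + 1 / m) * between) ** 0.5

# Simulated data with y missing at random given the observed x (assumed setup).
rng = random.Random(42)
xs, ys_full, ys = [], [], []
for _ in range(5000):
    x = rng.gauss(0, 1)
    y = 2 + x + rng.gauss(0, 1)
    xs.append(x)
    ys_full.append(y)
    ys.append(None if rng.random() < (0.7 if x > 0 else 0.1) else y)

true_mean = statistics.fmean(ys_full)
cc_mean = statistics.fmean([y for y in ys if y is not None])
mi_mean, mi_se = multiply_impute_mean(xs, ys)
print(round(true_mean, 3), round(cc_mean, 3), round(mi_mean, 3), round(mi_se, 3))
```

In applied work, established software for MI would normally be preferred over hand-written code; the sketch only makes the three steps explicit.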

Maximum likelihood methods use an iterative algorithm that tries different sets of parameter values for an assumed probability distribution until it identifies the set that maximizes the log-likelihood value (i.e., the set that fits the observed data the best). They leverage all the observed information and often yield estimates with optimal efficiency. However, sometimes incomplete-data likelihood functions have a complicated form or require special computational techniques (e.g., expectation-maximization algorithm).74 Therefore, the implementation of maximum-likelihood-based methods may be problem-specific and require special software.
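As one concrete and deliberately simple instance, the expectation-maximization (EM) algorithm for a bivariate normal model with a fully observed x and a partially missing y can be written as below. The bivariate normal assumption, the simulated data, and the function name em_bivariate_normal are assumptions made for illustration; real applications typically involve more variables and rely on dedicated software.

```python
import random
import statistics

def em_bivariate_normal(xs, ys, max_iter=500, tol=1e-10):
    """EM estimates of mean(y), var(y), cov(x, y) when some y are missing (None)."""
    n = len(xs)
    mu_x = statistics.fmean(xs)
    sxx = sum((x - mu_x) ** 2 for x in xs) / n  # x is complete, so this is fixed
    observed_y = [y for y in ys if y is not None]
    mu_y = statistics.fmean(observed_y)          # starting values from complete cases
    syy = statistics.pvariance(observed_y)
    sxy = 0.0
    for _ in range(max_iter):
        beta = sxy / sxx
        resid_var = syy - sxy ** 2 / sxx
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                # E-step: expected y and y^2 given x under the current parameters.
                ey = mu_y + beta * (x - mu_x)
                eyy = ey ** 2 + resid_var
            else:
                ey, eyy = y, y * y
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey
        # M-step: update parameters from the expected sufficient statistics.
        new_mu_y = sum_y / n
        new_syy = sum_yy / n - new_mu_y ** 2
        new_sxy = sum_xy / n - mu_x * new_mu_y
        change = abs(new_mu_y - mu_y) + abs(new_syy - syy) + abs(new_sxy - sxy)
        mu_y, syy, sxy = new_mu_y, new_syy, new_sxy
        if change < tol:
            break
    return mu_y, syy, sxy

# Simulated data with y missing at random given the observed x (assumed setup).
rng = random.Random(7)
xs, ys_full, ys = [], [], []
for _ in range(5000):
    x = rng.gauss(0, 1)
    y = 2 + x + rng.gauss(0, 1)
    xs.append(x)
    ys_full.append(y)
    ys.append(None if rng.random() < (0.7 if x > 0 else 0.1) else y)

em_mu_y, em_syy, em_sxy = em_bivariate_normal(xs, ys)
true_mean = statistics.fmean(ys_full)
print(round(true_mean, 3), round(em_mu_y, 3), round(em_syy, 3), round(em_sxy, 3))
```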

Inverse probability weighting methods weight patients without missing data by the reciprocal of the probability of having complete records to adjust for factors underlying missingness.75 Weighting does not use those with missing data in the final outcome model but rather emphasizes (with larger weights) those patients with complete data who have characteristics similar to the patients with missing data. Weighting discards the partial information from incomplete records and is sometimes affected by extreme weights, both of which may compromise statistical efficiency.
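The weighting idea can be illustrated with a minimal Python sketch. Here the probability of having a complete record is estimated nonparametrically as the observed proportion within strata of a single categorical covariate; in practice this probability is often estimated with a regression model on several covariates. The strata labels, simulated values, and the function name ipw_mean are hypothetical.

```python
import random
import statistics
from collections import defaultdict

def ipw_mean(strata, ys):
    """Weighted mean of observed y, weighting by 1 / P(complete | stratum)."""
    totals, complete = defaultdict(int), defaultdict(int)
    for s, y in zip(strata, ys):
        totals[s] += 1
        if y is not None:
            complete[s] += 1
    # Estimated probability of a complete record in each stratum.
    p_complete = {s: complete[s] / totals[s] for s in totals}
    weighted_sum = weight_total = 0.0
    for s, y in zip(strata, ys):
        if y is not None:  # only complete records enter the analysis
            w = 1 / p_complete[s]
            weighted_sum += w * y
            weight_total += w
    return weighted_sum / weight_total, p_complete

# Simulated data: older patients have higher y and are more often missing it.
rng = random.Random(3)
strata, ys_full, ys = [], [], []
for _ in range(5000):
    older = rng.random() < 0.5
    y = (3 if older else 1) + rng.gauss(0, 1)
    strata.append("older" if older else "younger")
    ys_full.append(y)
    ys.append(None if rng.random() < (0.7 if older else 0.1) else y)

true_mean = statistics.fmean(ys_full)
cc_mean = statistics.fmean([y for y in ys if y is not None])
weighted_mean, p_complete = ipw_mean(strata, ys)
print(round(true_mean, 3), round(cc_mean, 3), round(weighted_mean, 3), p_complete)
```

Strata with a very small estimated probability of completeness would receive very large weights, which is the source of the instability noted above.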

Strategies to reduce bias due to missing data under MNAR: Sometimes researchers may be able to obtain additional information through linkage with other data sources.71 For example, in a study of patients with atrial fibrillation (AF), the investigators originally had access only to international normalized ratio (INR, a blood test that quantifies anticoagulation effect) values for tests performed in hospitals, and it is possible that INR values measured in hospitals are higher than those measured in ambulatory settings due to the higher medical complexity of hospitalized patients. Under the assumption of MNAR, the investigators linked the data set with anticoagulation management clinic data, where the INR results were available even if the tests were performed outside of hospitals.76 Another strategy is to supplement the structured data with variables derived from natural language processing (NLP) of the free-text notes to reduce missingness. For example, in the same cohort of AF patients, using NLP to extract additional information from the free-text notes reduced missing smoking information from 54.4% to 7.8%.76 For data with residual missing data under MNAR, the investigator should conduct a sensitivity analysis varying plausible values of the missing data to test the robustness of the study findings.77
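One simple form of such a sensitivity analysis is a delta adjustment, in which the values assumed for the missing data are shifted away from what the observed data would suggest and the estimate is recomputed over a grid of shifts. The INR values, the number of missing records, the grid of deltas, and the function name delta_adjusted_mean in this Python sketch are hypothetical and serve only to show the mechanics.

```python
import statistics

# Hypothetical observed INR values and number of patients with no INR recorded.
observed_inr = [2.0, 2.5, 3.0, 3.5, 2.5, 1.5]
n_missing = 4

def delta_adjusted_mean(observed, n_missing, delta):
    """Overall mean if each missing value equals the observed mean plus delta."""
    observed_mean = statistics.fmean(observed)
    imputed_value = observed_mean + delta  # delta = 0 mimics a MAR-type assumption
    total = sum(observed) + n_missing * imputed_value
    return total / (len(observed) + n_missing)

# Negative deltas: unrecorded values assumed lower than recorded ones.
deltas = [-1.0, -0.5, 0.0, 0.5, 1.0]
sensitivity = {d: delta_adjusted_mean(observed_inr, n_missing, d) for d in deltas}
for d, estimate in sensitivity.items():
    print(d, round(estimate, 2))
```

If the study conclusion is unchanged across the range of deltas judged clinically plausible, the finding can be considered more robust to departures from MAR.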

Methods to handle missing data due to EHR-discontinuity: Except for some systems where the EHR system is explicitly integrated with the payor system (claims), many EHR systems do not have well-defined enrollment dates and are subject to data leakage outside of the study EHR (i.e., EHR-discontinuity).78 The causes and patterns of EHR-discontinuity may vary by type of EHR system (e.g., general hospital vs. specialty EHR, or metropolitan vs. suburban systems). For example, EHR data from a general hospital might lack information from encounters at unaffiliated specialty facilities. The information bias due to EHR-discontinuity can be reduced by using an externally validated algorithm to identify a sub-cohort with high data completeness.67,79,80 In select empirical examples, it has been found that patients with high EHR-continuity had a comorbidity profile similar to that of patients with low EHR-continuity, so restricting analyses to those with high EHR-continuity seemed to confer a desirable trade-off between validity (reducing misclassification of the study variables) and generalizability (drawing inference in the sub-population).80 However, these findings may not generalize to all settings and should be carefully considered on a case-by-case basis.

Empirical example: In an observational study aiming to determine the effect of hydroxychloroquine with or without azithromycin on COVID-19 mortality using EHR data from the United States Department of Veterans Affairs,81 there was a notable amount of missing data in lifestyle factors (e.g., ranging from 2% in alcohol consumption to 38% in smoking status), vitals (e.g., ranging from 3% in blood pressure to 6% in oxygen saturation), and labs (e.g., ranging from 7% in serum sodium and 19% in HbA1c to 86% in D-dimer). Missingness at random was assumed, and multiple imputation was used for most variables before incorporating these variables in a propensity score model for confounding adjustment. Exposure misclassification was unlikely to be a major concern as the study focused on the short-term use of inpatient medications that were well-captured in the study EHR. For outcome ascertainment, to capture out-of-network deaths, the investigators linked the EHR data with the Beneficiary Identification Records Locator Subsystem and Social Security Death Index data.82