Handling missing data in longitudinal clinical databases
By Nathan Davies, on 1 July 2014
Cathy Welch is a former Research Associate of Primary Care and Population Health and current part-time PhD student. She worked on an MRC-funded project:
“Missing data imputation in clinical databases: development of a longitudinal model for cardiovascular risk factors”. For more information about our work on electronic health records please see: http://www.ucl.ac.uk/pcph/research-groups-themes/thin-pub
She is currently a Research Associate working on the Whitehall II study in the Department of Epidemiology and Public Health. Here she discusses multiple imputation as a method for handling missing data.
Missing data is a common problem in all areas of research. Many studies choose to analyse the observed data only, but this can bias findings if the observed data is not representative of all the data, resulting in misleading conclusions.
An alternative approach to handling missing data is multiple imputation, which draws multiple random values from the predictive distribution of the missing data given the observed data. Analysis of imputed data can give unbiased results if certain assumptions are plausible, for example that the data is missing at random (MAR): the reasons for missing data depend on the observed values but not on the missing values. We can investigate the relationships in the data to understand the structure and extent of the missing data and the reasons for it, and so justify a plausible MAR mechanism. We can also make plausible assumptions about the 'missingness mechanism' based on prior knowledge of the data, not just from the observed data.
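To make the idea concrete, here is a minimal Python sketch of multiple imputation on simulated data. Everything in it is an assumption made for illustration: the variable names, the linear model, the missingness probabilities and the choice of 10 imputations. It is also simplified in that it draws imputed values around a fitted regression line without also drawing the regression parameters, which a proper multiple imputation procedure would do. The estimates from the imputed data sets are combined using Rubin's rules.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y regressed on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd

def impute_once(xs, ys, rng):
    """Replace each missing y (None) with a random draw given the observed x."""
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in obs], [y for _, y in obs])
    return [y if y is not None else rng.gauss(a + b * x, sd)
            for x, y in zip(xs, ys)]

def rubin_pool(estimates, variances):
    """Rubin's rules: pooled estimate and total variance across imputations."""
    m = len(estimates)
    within = statistics.fmean(variances)      # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    return statistics.fmean(estimates), within + (1 + 1 / m) * between

rng = random.Random(2014)
xs = [rng.gauss(0, 1) for _ in range(500)]
ys_full = [2.0 + 1.5 * x + rng.gauss(0, 1) for x in xs]
# MAR: the chance that y is missing depends only on the fully observed x
ys = [None if rng.random() < (0.7 if x > 0 else 0.2) else y
      for x, y in zip(xs, ys_full)]

means, variances = [], []
for _ in range(10):  # 10 imputed data sets
    completed = impute_once(xs, ys, rng)
    means.append(statistics.fmean(completed))
    variances.append(statistics.variance(completed) / len(completed))

pooled_mean, total_var = rubin_pool(means, variances)
```

Because each imputed data set contains different random draws, the between-imputation variance carries the extra uncertainty due to the missing values into the final standard error.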
Multiple imputation (MI) has been implemented in many research studies. However, we were interested in the best multiple imputation approach to impute missing values in longitudinal data (repeated measurements over a time period). For this investigation, we used The Health Improvement Network (THIN) primary care database of electronic health records. This database currently consists of over 550 practices with 11 million patients and many variables. We wanted MI to take account of the longitudinal and dynamic structure of the data, as reasons for recording information change over time. However, computational problems arise due to the substantial size of the database.
Nevalainen et al.(1) proposed a new approach to MI, the two-fold fully conditional specification (FCS) algorithm, to impute missing values in longitudinal data. It imputes each time point conditional on observations at the same and adjacent time points and repeatedly cycles through all time points, so it takes into account the longitudinal structure. This approach reduces computational problems because each iteration includes only a few time points.
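The cycling structure described above can be sketched in a few lines of Python. This is only an illustration of the order in which time points are visited and which time points each imputation conditions on. It does not fit any imputation model, and the function names, parameter names and default iteration counts are hypothetical rather than taken from the published algorithm or the Stata command.

```python
def twofold_schedule(n_times, among_iters=2, within_iters=3, width=1):
    """Yield (among-time iteration, time point, window, within-time iteration).

    The window holds the time points whose measurements are used when
    imputing at time t: t itself plus `width` adjacent points on each side.
    """
    for among in range(1, among_iters + 1):      # repeated passes over all time points
        for t in range(n_times):
            lo, hi = max(0, t - width), min(n_times - 1, t + width)
            window = list(range(lo, hi + 1))
            for within in range(1, within_iters + 1):  # cycles over variables at time t
                yield among, t, window, within

def predictors_per_model(n_vars, n_times_used):
    """Other variable-by-time measurements one imputation model conditions on."""
    return n_vars * n_times_used - 1

steps = list(twofold_schedule(n_times=10))
windows = [w for _, _, w, _ in twofold_schedule(4, among_iters=1, within_iters=1)]
all_times = predictors_per_model(n_vars=3, n_times_used=10)
windowed = predictors_per_model(n_vars=3, n_times_used=3)
```

Under this simple counting, with 3 variables measured at 10 time points, a model that conditions on every time point has 29 predictors, whereas a model restricted to the same and adjacent time points has 8, which is where the computational saving comes from.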
This method had not been validated for use in a real longitudinal database, where, for example, entry to and exit from the study vary between individuals. The first step was to develop a Stata command to implement the two-fold FCS algorithm in real longitudinal data. This command is now available to download from the Statistical Software Components (SSC) archive using the Stata command ssc install twofold. A paper describing the command is now available from The Stata Journal.
Before implementing the two-fold FCS algorithm in THIN, we used a simulation study to validate it. One advantage of a simulation study is that we know the original data, so we can make data missing and compare the results from different approaches for handling missing data with the results from analysing the original data. Another advantage is that we can challenge the two-fold FCS algorithm in different settings to find out when it achieves unbiased results.
We simulated data from the associations observed in THIN with 10 years of follow-up. We used an exponential model of interest to investigate the association between measurements recorded at a baseline time point and a future CVD event.
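As a rough illustration of what an exponential model implies, the Python sketch below assumes that "exponential model" refers to an exponential survival model, in which the event rate (hazard) is constant over follow-up. Under that assumption the maximum likelihood estimate of the rate is the number of events divided by the total follow-up time. The counts are hypothetical and invented purely for illustration; they are not results from THIN or from the simulation study.

```python
import math

def exponential_rate(events, person_years):
    """Maximum likelihood estimate of a constant hazard: events per person-year."""
    return events / person_years

def survival(rate, t):
    """Probability of remaining event-free up to time t under a constant hazard."""
    return math.exp(-rate * t)

# Hypothetical counts for two groups defined by a baseline measurement
rate_exposed = exponential_rate(events=30, person_years=1000.0)
rate_unexposed = exponential_rate(events=20, person_years=1000.0)

# With constant hazards, the association is summarised by the ratio of rates
hazard_ratio = rate_exposed / rate_unexposed
ten_year_event_free = survival(rate_unexposed, 10)
```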
We made 70% of weight, systolic blood pressure and smoking status measurements missing at each time point and compared the following approaches:
- a complete records analysis
- imputing missing values at baseline conditional only on other measurements recorded at baseline
- using the two-fold FCS algorithm to impute, with:
  - smoking status as time-independent
  - smoking status as time-dependent. From GPs and from interrogating the data, we found that adult non-smokers are unlikely to begin smoking. So, if a patient only ever had a non-smoker record, we assumed they were a non-smoker at every time point. This approach reduced missing data and simplified the imputation process (semi-deterministic method).
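The semi-deterministic rule for smoking status can be sketched as a small Python function. The status labels and the use of None for a missing record are assumptions made for this illustration, not the coding used in THIN.

```python
def fill_never_smokers(records):
    """Apply the semi-deterministic rule to one patient's smoking records.

    `records` holds one status per time point, with None for a missing value.
    If every observed record is "non-smoker", the patient is assumed to be a
    non-smoker at all time points; otherwise the records are left unchanged
    so that the remaining gaps can be imputed.
    """
    observed = {status for status in records if status is not None}
    if observed == {"non-smoker"}:
        return ["non-smoker"] * len(records)
    return list(records)

only_non_smoker = fill_never_smokers([None, "non-smoker", None, "non-smoker"])
mixed_history = fill_never_smokers([None, "smoker", "non-smoker", None])
never_recorded = fill_never_smokers([None, None, None])
```

Patients with no smoking record at all, or with any record other than non-smoker, still have missing values after this step, so the rule reduces the amount of imputation needed rather than replacing it.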
The simulation study found that analysing data imputed using the two-fold FCS algorithm gave essentially unbiased estimates for time-dependent variables compared with a complete records analysis or an analysis of data imputed at baseline only, because the two-fold FCS algorithm uses the repeated measurements in the imputation. The weight variable was the least biased, probably because of high correlations between repeatedly measured weight values. Also, using the semi-deterministic method, derived from our knowledge of the data, to impute smoking status was preferable because it improved the precision of the estimates (standard errors).
This study is now published in Statistics in Medicine(2). The next step is to use the knowledge gained from this validation process to apply the algorithm in THIN. We also hope that other researchers using longitudinal databases with missing data will recognise the potential bias that can arise from analysing only the observed data and consider using the two-fold FCS algorithm to impute missing values.
(1) Nevalainen J, Kenward MG, Virtanen SM. Missing values in longitudinal dietary data: a multiple imputation approach based on a fully conditional specification. Stat Med 2009 Dec 20;28(29):3657-69.
(2) Welch CA, Petersen I, Bartlett JW, White IR, Marston L, Morris RW, et al. Evaluation of two-fold fully conditional specification multiple imputation for longitudinal electronic health record data. Stat Med 2014 Apr 30.