Design and statistical considerations in the evaluation of digital behaviour change interventions
By Emma J Norris, on 18 June 2019
By Dr Emma Beard – University College London
Devices and programs using digital technology to foster or support behaviour change have become increasingly popular. Evaluating their effectiveness is often more complex than for face-to-face interventions, where the ‘gold standard’ randomised controlled trial can be used. With digital interventions we often have repeated measures over long periods of time, which results in data with a complex internal structure: seasonal effects, underlying trends and clustering (or autocorrelation). Drop-out (leading to loss of power) and confounding are also problems with natural experiments. This affects how we interpret findings, both in terms of causal effects and in the presence of null results.
Several novel statistical techniques and study designs are available to help gain insight into the effects of specific digital intervention components on the causal mechanisms influencing outcomes and to assess the association between outcomes and measures of usage. At the 2019 CBC Conference “Behaviour Change for Health: Digital and other Innovative Methods” we presented a symposium which aimed to cover some of the main statistical issues of analysing digital interventions and also presented on some of the designs most commonly used to evaluate digital therapeutic apps.
Time series analysis
To account for underlying trends, seasonality and autocorrelation we can look towards time series models, which are commonly used in financial forecasting and to assess the effect of population policies and interventions. Analyses include Autoregressive Integrated Moving Average (ARIMA), Autoregressive Integrated Moving Average with Explanatory Variable (ARIMAX) and Generalised Additive Mixed Models (GAMM), and can be easily applied in most statistical packages, including R (e.g. the TSA, forecast and mgcv packages). GAMM is simply an extension of the Generalised Linear Mixed Model (GLMM) which has the added benefit of adjusting for seasonality using data-driven smoothing splines composed of a series of knots. ARIMA/ARIMAX can be viewed as regression models which have one or more autocorrelation terms (i.e. they allow for values closer in time tending to be more similar).
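To make the idea of autocorrelation concrete, here is a minimal Python sketch (standard library only, not one of the R packages named above) that computes the sample autocorrelation of a series at a given lag. The function name and the two toy series are illustrative assumptions, not data from the symposium.

```python
from statistics import mean


def autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    m = mean(series)
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant, autocorrelation is undefined")
    # Numerator: cross-products of deviations that are `lag` steps apart.
    num = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, n))
    return num / denom


# Hypothetical weekly counts with a slow upward drift:
# neighbouring values are similar, so lag-1 autocorrelation is positive.
trending = [10, 11, 12, 12, 14, 15, 15, 17, 18, 19]

# Hypothetical alternating series:
# neighbouring values are dissimilar, so lag-1 autocorrelation is negative.
alternating = [10, 20, 10, 20, 10, 20, 10, 20, 10, 20]

print(round(autocorrelation(trending), 3))
print(round(autocorrelation(alternating), 3))
```

A clearly non-zero value at short lags is the kind of structure that an ordinary regression ignores and that ARIMA-type models are designed to absorb.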
Dr Olga Perski presented a series of N-of-1 trials using GAMM which assessed within-person predictors of engagement with the Drink Less app. It was found that different app-related and psychological variables were significant predictors of the frequency and amount of engagement within and between individuals (e.g. the receipt of a daily reminder and perceived usefulness of the app were predictors of frequency of engagement). These results suggest that different strategies to promote engagement may be required for different individuals.
The second issue covered was drop-out/missing data. There are three mechanisms of missing data, each of which has implications for the analysis. The first is data missing completely at random. This occurs when the propensity for a data point to be missing is completely random (e.g. a participant flips a coin to decide whether or not to answer a question). The second is data missing at random. This occurs when the propensity for a data point to be missing is not related to the missing data, but is related to some of the observed data (e.g. older people are less likely to answer questions about their income, but this does not depend on participants’ income level). The third is data missing not at random. This happens when the propensity for a data point to be missing is related to the missing data itself (e.g. participants with severe depression are more likely not to answer questions on depression).
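The three mechanisms can be illustrated with a small simulation. The sketch below is a hypothetical example in Python: the income figures, response probabilities and function name are all assumptions chosen for illustration. It compares the mean of the observed incomes with the mean of the full data under each mechanism, overall and within the older group.

```python
import random
from statistics import mean


def simulate_missingness(n=20000, seed=1):
    """Compare observed means under MCAR, MAR and MNAR (all values hypothetical)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        older = rng.random() < 0.5
        # Assumption: older participants earn more on average.
        income = rng.gauss(40 if older else 30, 5)
        rows.append((older, income))

    def observed(keep_probability):
        # Keep each row with a probability given by the missingness mechanism.
        return [(older, income) for older, income in rows
                if rng.random() < keep_probability(older, income)]

    samples = {
        "full": rows,
        # MCAR: everyone has the same chance of responding.
        "MCAR": observed(lambda older, income: 0.7),
        # MAR: response depends on age (observed), not on income itself.
        "MAR": observed(lambda older, income: 0.4 if older else 0.9),
        # MNAR: response depends on the income value that goes missing.
        "MNAR": observed(lambda older, income: 0.4 if income > 35 else 0.9),
    }
    return {
        name: {
            "overall": mean(inc for _, inc in data),
            "older": mean(inc for old, inc in data if old),
        }
        for name, data in samples.items()
    }


for name, means in simulate_missingness().items():
    print(name, round(means["overall"], 2), round(means["older"], 2))
```

In this toy setting the overall observed mean is distorted under both MAR and MNAR, but under MAR the mean within the older group is still recovered, because missingness is explained by an observed variable. Under MNAR the distortion remains even after conditioning on age.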
Most commonly people handle missing data using listwise deletion (analysis of complete cases) or pairwise deletion (missing value deletion is considered separately for each pair of variables). Although pairwise is preferred over listwise (increased power), both assume data are missing completely at random. An alternative is the use of multiple imputation. This is applicable when data are missing completely at random or at random (note: there is some bias for missing at random but this is negligible). Multiple imputation follows several stages: 1) select a group of variables to predict the missing values, 2) substitute predicted values (‘imputes’) for the missing values, 3) repeat this to create multiple imputed data sets, 4) run the analysis on each data set, 5) combine the results using ‘Rubin’s Rules’. If data are missing not at random, the alternative approach is to model the missingness, but this leads to complex models and so data are generally assumed to be missing at random or completely at random.
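The final pooling step can be sketched in a few lines. The Python example below applies Rubin's Rules to a set of per-data-set estimates and their squared standard errors. The function name and the input numbers are hypothetical, and a real analysis would normally rely on an established imputation package rather than hand-rolled code.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their sampling variances using Rubin's Rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, one variance per estimate")
    pooled_estimate = mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = mean(variances)
    # Between-imputation variance: sample variance of the estimates (m - 1 denominator).
    between = variance(estimates)
    # Total variance inflates the between component by (1 + 1/m).
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total, sqrt(total)


# Hypothetical regression coefficients from three imputed data sets,
# each with a squared standard error of 0.5.
estimate, total_variance, pooled_se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, round(total_variance, 4), round(pooled_se, 4))
```

The pooled standard error is larger than any single data set would suggest, which is how the uncertainty introduced by the imputation itself is carried through to the final result.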
The third issue covered was confounding. The solutions discussed were 1) stratification, 2) multivariable analysis, and 3) propensity score matching. The objective of stratification is to fix the level of the confounders and produce groups within which the confounder does not vary. You then evaluate the exposure-outcome association within each stratum of the confounder and use the Mantel-Haenszel (M-H) estimator to provide a result adjusted across strata. If there is a difference between the crude result and the adjusted result (produced from the strata), confounding is likely. If the crude result does not differ from the adjusted result, confounding is unlikely. Propensity score matching works by combining information on a number of variables (potential confounders) into a single score and then matching individuals on this score. The following caveats should be noted. First, it can be difficult to balance the treatment groups in small samples or if the comparison groups are very different. Second, unknown, unmeasured and residual confounding may still exist after matching. Third, matching variables should be unrelated to the exposure but related to the outcome. Finally, propensity score matching cannot handle treatment defined as a continuous variable (e.g. drug dose), unless dosage is categorised. For an example of propensity score matching see https://www.ncbi.nlm.nih.gov/pubmed/22748518.
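As a rough illustration of the stratified comparison, the Python sketch below computes a crude odds ratio and a Mantel-Haenszel odds ratio from 2x2 tables. The counts are invented so that the exposure has no effect within either stratum, and the function names are assumptions for this example.

```python
def crude_odds_ratio(strata):
    """Odds ratio from the 2x2 table obtained by collapsing all strata."""
    a = sum(s[0] for s in strata)  # exposed, outcome present
    b = sum(s[1] for s in strata)  # exposed, outcome absent
    c = sum(s[2] for s in strata)  # unexposed, outcome present
    d = sum(s[3] for s in strata)  # unexposed, outcome absent
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata):
    """M-H odds ratio: sum(a*d/n) / sum(b*c/n) over strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical strata of a confounder, each given as (a, b, c, d).
# Within each stratum the outcome risk is identical for exposed and unexposed.
strata = [
    (8, 72, 2, 18),    # low-risk stratum: 10% outcome risk in both groups
    (10, 10, 40, 40),  # high-risk stratum: 50% outcome risk in both groups
]

print(round(crude_odds_ratio(strata), 3))
print(round(mantel_haenszel_odds_ratio(strata), 3))
```

Here the crude odds ratio is well below 1 while the adjusted odds ratio is 1, which is the pattern described above: a gap between the crude and adjusted results points to confounding by the stratifying variable.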
The fourth issue covered was null findings. No scientific conclusion can follow automatically from p>0.05. A non-significant p-value could reflect either evidence of no effect or data insensitivity (i.e. low power/high standard error). One solution to this problem is the use of Bayes Factors. A Bayes Factor (B) can range from 0 to infinity and conventional cut-offs are available: B<0.3 is evidence for the null hypothesis, B between 0.3 and 3 indicates data insensitivity, and B>3 is evidence for the alternative hypothesis. What this means is that if you have p>0.05 and B>0.3 you should avoid terms such as ‘no difference’ or ‘lack of association’. If p>0.05 and B<0.3 you can use these terms. If you do not calculate a Bayes Factor you should state ‘the findings are inconclusive as to whether or not a difference/association was present’. Bayes Factors can be easily calculated using online calculators and generally require the specification of a plausible predicted value. This should be pre-registered, e.g. on the Open Science Framework.
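The sketch below shows one way such a calculation can work in Python. It assumes a normal likelihood for the observed effect and models the alternative hypothesis as a half-normal distribution whose standard deviation equals the predicted value. That modelling choice is an assumption made for this example, not necessarily what any particular online calculator uses, and the function names are hypothetical. The interpretation function uses the cut-offs given above.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, sd):
    """Density at x of a normal distribution with mean 0 and the given SD."""
    return exp(-0.5 * (x / sd) ** 2) / (sd * sqrt(2 * pi))


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def bayes_factor_half_normal(observed, standard_error, predicted):
    """B for H1 (effect ~ half-normal with SD = predicted) against H0 (effect = 0).

    Assumes the observed effect is normally distributed around the true effect.
    """
    marginal_sd = sqrt(standard_error ** 2 + predicted ** 2)
    # Closed form of the likelihood averaged over the half-normal prior.
    likelihood_h1 = (2 * normal_pdf(observed, marginal_sd)
                     * normal_cdf(observed * predicted / (standard_error * marginal_sd)))
    likelihood_h0 = normal_pdf(observed, standard_error)
    return likelihood_h1 / likelihood_h0


def interpret(b):
    """Apply the conventional cut-offs of 0.3 and 3."""
    if b < 0.3:
        return "evidence for the null hypothesis"
    if b > 3:
        return "evidence for the alternative hypothesis"
    return "data insensitive"


# Hypothetical result: observed difference of 0 units, standard error 3,
# predicted difference of 4 units.
b = bayes_factor_half_normal(observed=0.0, standard_error=3.0, predicted=4.0)
print(round(b, 2), interpret(b))
```

In this made-up case the observed difference is exactly zero, yet B falls between 0.3 and 3 because the standard error is large relative to the predicted effect, so the appropriate wording would be that the data are insensitive rather than that there is ‘no difference’.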
Dr Claire Garnett gave an example of using Bayes Factors to re-examine a dataset from the Drink Less app supplemented with extended recruitment. Bayes Factors calculated for the extended trial (total n=2586; 13.2% responded to follow-up) supported there being no large main effects on past week alcohol consumption (0.22<BF<0.83).
Novel trial designs
Randomised controlled trials are a poor fit for digital interventions because (i) they do not allow the app to continually improve from the data gathered during the trial and (ii) their results only allow us to understand the effectiveness of the whole app, not of individual components. Dr Henry Potts discussed a scoping review that explored how novel trial designs are implemented for digital therapeutic apps. These included: Sequential Multiple Assignment Randomised Trials (SMARTs) for dynamic treatment regimens; micro-randomised trials (MRTs) for ‘just-in-time’ push notifications; N-of-1 and series of N-of-1 trials for personalisation of apps; randomised response-adaptive trials for allocating more patients to the most effective app; and the Multiphase Optimisation Strategy (MOST) framework and Multi-Armed Bandit Models for building and optimising apps as complex interventions. He concluded that more micro-randomised trials and implementations of the MOST framework are emerging in the literature as trial designs for both the development and evaluation of apps. He considered how multi-arm trials, with options of interim analysis and response-adaptive randomisation, may have potential here as well.
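To give a flavour of how response-adaptive allocation can work, the Python sketch below uses Thompson sampling, one common multi-armed bandit algorithm. It is a generic illustration rather than a design taken from the scoping review, and the success rates, sample size and function name are all hypothetical.

```python
import random


def thompson_allocation(true_success_rates, n_participants=2000, seed=0):
    """Allocate participants to app versions using Thompson sampling.

    `true_success_rates` are hypothetical and unknown to the algorithm;
    they are only used to simulate each participant's outcome.
    """
    rng = random.Random(seed)
    k = len(true_success_rates)
    successes = [0] * k
    failures = [0] * k
    allocations = [0] * k
    for _ in range(n_participants):
        # Draw a plausible success rate for each arm from its Beta posterior
        # (starting from a uniform Beta(1, 1) prior).
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        # Assign the participant to the arm with the highest draw.
        arm = draws.index(max(draws))
        allocations[arm] += 1
        # Simulate the participant's binary outcome and update the counts.
        if rng.random() < true_success_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return allocations


# Two hypothetical app versions with 20% and 60% success rates.
print(thompson_allocation([0.2, 0.6]))
```

Early participants are spread across both versions, but as outcomes accumulate the allocation shifts towards the version that appears to work better, which is the property that makes such designs attractive for apps that are updated during a trial.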
Thanks to all panelists at the symposium: Dr Emma Beard, Dr Olga Perski, Dr Claire Garnett & Dr Henry Potts, UCL
- How can we improve the development and analysis of digital interventions?
- How can we make the approach to digital interventions more scientifically rigorous?
- How do we deal with big data?
Emma Beard (@DrEVBeard) is a Senior Research Associate in the Department of Behavioural Science and Health at UCL. Her research focuses on the application of novel statistical methodology to population surveys on tobacco and alcohol use.