data:image/s3,"s3://crabby-images/7317c/7317c6385ffccc39e19fab68afe1f90759b32fe6" alt=""
Estimating the reproduction number from modelled incidence
Source:vignettes/rt-from-incidence.Rmd
rt-from-incidence.Rmd
Robert Challen 1,2; Leon Danon 1,2;
- Engineering Mathematics, University of Bristol, Bristol, UK
- AI4CI
Introduction
If we have estimated the incidence of a disease using a poisson rate using maximum likelihood estimators, the rate is typically a log-normally distributed with parameters and . Such a fitted model is shown below on a log1p scale, for the COVID-19 epidemic in England:
It is appealing to use this modelled incidence estimate to calculate an estimate of the reproduction number, . Incidence models can be derived in a number of ways, they are easily inspected for error and can be made tolerant of missing values and outliers.
Methods
To use a modelled estimate of incidence to predict we need to propagate uncertainty in incidence into our estimates. To calculate we can use the backwards-looking renewal equations which incorporate the infectivity profile of the disease () at a number of days after infection ():
giving us:
The sum of such log normal distributions can be approximated by another log normal (Lo 2013) with parameters and .
The sum term in the denominator of the renewal equations consists of a set of correlated scaled log normal distributions with both scale and correlation defined by the infectivity profile (). In our case can be equated to the infectivity profile () when and 1 when . is .
is the central estimate of case counts on the log scale, and its standard deviation can also be large. There are numerical stability issues dealing with terms involving , however keeping everything in log space and using optimised log-sum-exp functions this can be made computationally tractable.
N.B. if we assume the individual estimates of the incidence are uncorrelated this simplifies to:
Empirically there is not a huge amount of difference in estimates between these two forms. If the infectivity profile is spread out over a large period then the correlation matrix will be which may predicate this simpler order 1 formulation.
With and we are left with the final derivation of , giving us a distributional form of incorporating uncertainty from modelled incidence estimates:
This is conditioned on a single known infectivity profile (). In reality there is also uncertainty in the infectivity profile, however we cannot assume any particular distributional form for this. We can use a set of empirical estimates of the infectivity profile () to calculate multiple distributional estimates for the reproduction number () and then combine these as a mixture distribution.
There is not much we can say about this mixture distribution, as it will depend on the nature of the various empirical infectivity profile distributions. However, we can use general properties of mixture distributions to generate estimates for the mean and variance of the reproduction number including uncertainty arising from multiple infection profile estimates ():
The cumulative distribution function of the mixture is simply the arithmetic mean of the component cumulative distribution functions (conditioned on each infectivity profile). If is the cumulative distribution function of the standard normal distribution:
As the cumulative density function of this mixture distribution is a strictly increasing function specific solutions for median and 95% confidence intervals can be calculated numerically by solving the following equations:
Implementation
This method is implemented using the following R function, which is designed for numerical stability and speed. Generating estimates given modelled incidence typically occurring in a matter of seconds:
#> function (mu, sigma, omega, mu_t, sigma_t, cor = FALSE, approx = TRUE)
#> {
#> omega_m = as.matrix(omega)
#> omega_m = apply(omega_m, MARGIN = 2, rev)
#> tmp = apply(omega_m, MARGIN = 2, function(omega) {
#> log_S_t = .logsumexp(mu_t + sigma_t^2/2 + log(omega))
#> log_T_t_tau = mu_t + sigma_t^2/2 + log(omega) + log(sigma_t)
#> if (cor) {
#> n = length(omega)
#> idx = 0:(n^2 - 1)
#> i = idx%/%n
#> j = idx%%n
#> log_cor_ij = c(0, log(omega))[abs(i - j) + 1]
#> log_var_Zt_ij = log_cor_ij + log_T_t_tau[i + 1] +
#> log_T_t_tau[j + 1]
#> }
#> else {
#> log_var_Zt_ij = 2 * log_T_t_tau
#> }
#> log_var_Zt = .logsumexp(log_var_Zt_ij) - 2 * log_S_t
#> var_Zt = exp(log_var_Zt)
#> mu_Zt = log_S_t - var_Zt/2
#> mu_Rt = mu - mu_Zt
#> var_Rt = sigma^2 + var_Zt
#> return(c(mu_Rt, var_Rt))
#> })
#> mu_Rt = tmp[1, ]
#> var_Rt = tmp[2, ]
#> sigma_Rt = sqrt(var_Rt)
#> means = exp(mu_Rt + var_Rt/2)
#> vars = (exp(var_Rt) - 1) * exp(2 * mu_Rt + var_Rt)
#> mean_star = mean(means)
#> var_star = mean(vars + means^2) - mean_star^2
#> out = tibble::tibble(rt.fit = log(mean_star^2/sqrt(mean_star^2 +
#> var_star)), rt.se.fit = sqrt(log(1 + var_star/mean_star^2)))
#> if (approx) {
#> out = out %>% .result_from_fit(type = "rt", qfn = function(p) stats::qlnorm(p,
#> .$rt.fit, .$rt.se.fit)) %>% .keep_cdf(type = "rt",
#> meanlog = .$rt.fit, sdlog = .$rt.se.fit)
#> }
#> else {
#> out = out %>% .result_from_fit(type = "rt", qfn = function(p) .qmixlnorm(p,
#> mu_Rt, sigma_Rt)) %>% .keep_cdf(type = "rt", meanlog = list(mu_Rt),
#> sdlog = list(sigma_Rt))
#> }
#> return(out)
#> }
#> <bytecode: 0x62b552ae74e8>
#> <environment: namespace:ggoutbreak>
Results
Testing this against the incidence model shown above, and comparing the results to the SPI-M-O consensus estimates gives us the following time-series for England. This is formally evaluated elsewhere but qualitatively is a good fit. It took a few seconds to calculate the reproduction number from this single time series with 1410 time points , which opens up the possibility of performing estimates in fine grained geographical or demographic subgroups.
#> user system elapsed
#> 6.344 0.003 6.347
Conclusion
We present a methodology for deriving from modelled estimates of incidence while propagating uncertainty. We demonstrate it produces satisfactory qualitative results against COVID-19 data. This method is relatively quick, fully deterministic, and can be used on top of any statistical models for estimating incidence which use logarithmic link functions.