
Estimating the reproduction number from modelled incidence
Source:vignettes/rt-from-incidence.Rmd
rt-from-incidence.Rmd
Robert Challen 1,2; Leon Danon 1,2;
- Engineering Mathematics, University of Bristol, Bristol, UK
- AI4CI
Introduction
If we have estimated the incidence of a disease using a poisson rate using maximum likelihood estimators, the rate is typically a log-normally distributed with parameters and . Such a fitted model is shown below on a log1p scale, for the COVID-19 epidemic in England:
It is appealing to use this modelled incidence estimate to calculate an estimate of the reproduction number, . Incidence models can be derived in a number of ways, they are easily inspected for error and can be made tolerant of missing values and outliers.
Methods
To use a modelled estimate of incidence to predict we need to propagate uncertainty in incidence into our estimates. To calculate we can use the backwards-looking renewal equations which incorporate the infectivity profile of the disease () at a number of days after infection ():
In expectation, this gives:
It has been shown that the sum of correlated log-normal distributed random variables can be approximated by another log-normal with parameters and , where the correlation between them is .
The sum term in the denominator of the renewal equation (1) consists of a set of correlated scaled log normal distributions with scale defined by the infectivity profile (). For our case for a given time point we equate , and substitute = and into (2) to account for the infectivity profile. We define to be the support of the infectivity profile (). is the weighted contribution from incidence estimates on day , and is the covariance between log-incidence estimates from days and .
With and defined, is approximated as the ratio of two log-normals where is the covariance between the numerator and the log-denominator. Since , and using a first-order approximation, this covariance is a weighted average of the covariances between and each , weighted by the relative expected contributions .:
This formulation of from (3) assumes knowledge of the posterior or prediction covariance of the incidence estimates (). This is typical in modern frameworks , but in other situations may not be available. In this case we could assume the individual estimates of the incidence are independent, however this increases the uncertainty of our estimate and in certain circumstances introduces a potential underestimation bias, influenced by the true off-diagonal correlation mass and the certainty of the incidence estimates. An alternative approach is to assume weak stationarity and estimate a parametric correlation model from the data used to build the incidence model, using Pearson residuals to parameterise an exponential decay function based on time difference (see supporting software package for implementation). The degree of the bias is investigated further in the supplementary materials, and heuristics for assessing the significance of this bias proposed.
The method for estimating from modelled incidence has been described assuming a non-negative component to the infectivity profile, as it is implicit that infector and infectee are necessarily sequential in time. In the situation where symptomatic case counts are used as a proxy for incidence and the serial interval as a proxy for the infectivity profile, negative times between serial cases may be observed due to variation in delay in observation of the transmission chain. There is nothing in this framework to stop the use of a negative time for the infectivity profile, and we can directly support estimates in these cases.
Numerical stability
In (3) is the log-scale mean of the incidence estimate at time , and its standard deviation. These can be large, leading to numerical instability in terms involving . However, assuming non-negative correlations and using log-space computation with optimized log-sum-exp functions , the expressions remain computationally tractable:
Other relations may be implemented directly as in (3).
Infectivity profile uncertainty
This estimate of is conditioned on a single known infectivity profile. In reality there is also uncertainty in the infectivity profile () which plays a role in the definition of and . We cannot assume any particular distributional form for the infectivity profile, but we can use a range of empirical estimates of the infectivity profile to calculate multiple distributional estimates for and then combine these as a mixture distribution.
The nature of this mixture distribution will depend on the various empirical infectivity profile distributions. However, we can use general properties of mixture distributions to create estimates for the mean and variance of the reproduction number estimate () combining the uncertainty arising from multiple infection profile estimates () and from the incidence estimate model itself:
The cumulative distribution function of the mixture is simply the arithmetic mean of the component cumulative distribution functions (conditioned on each infectivity profile). If is the cumulative distribution function of the standard normal distribution:
As the cumulative density function of this mixture distribution is a strictly increasing function, specific solutions for median () and 95% confidence intervals ( and ) can be calculated numerically by solving the following equations:
Numerical solutions to this are moderately expensive to perform. A reasonable approximation can be expected by matching moments of a log normal distribution to the mean and variance of the mixture. This gives us the final closed form estimator for the reproduction number given a set of infectivity profiles, , as:
In summary we present a method for retrieving the distributional form of the reproduction number from log normally distributed probabilistic estimates of incidence arising from simple statistical count models. This includes uncertainty arising from both count models and from infectivity profile distributions. It is fully deterministic and computationally inexpensive. It does not place any particular constraints on the nature of the infectivity profile distribution and can handle distributions that have a negative component, as is sometimes seen in serial interval estimates.
Implementation
This method is implemented using the following R function, which is designed for numerical stability and speed. Generating estimates given modelled incidence typically occurring in a matter of seconds:
#> function (mu_t, vcov_ij = diag(sigma_t^2), omega, sigma_t = NULL,
#> tau_offset = 0)
#> {
#> omega_m = as.matrix(omega)
#> k = nrow(omega_m)
#> if (k != length(mu_t)) {
#> stop("omega must have the same number of rows as length of mu_t.")
#> }
#> if (!all(dim(vcov_ij) == k)) {
#> stop("vcov_ij must be a square matrix with the same dimensions as mu_t, and nrow(omega)")
#> }
#> omega_m = apply(omega_m, MARGIN = 2, rev)
#> sigma_t = sqrt(diag(vcov_ij))
#> time_Rt = length(mu_t) - tau_offset
#> mu = mu_t[time_Rt]
#> sigma = sigma_t[time_Rt]
#> tmp = apply(omega_m, MARGIN = 2, function(omega) {
#> log_m_tau = mu_t + sigma_t^2/2 + log(omega)
#> log_S_plus = .logsumexp(log_m_tau)
#> log_Sigma_ij = log(pmax(vcov_ij, 0))
#> log_Sigma_0tau = log_Sigma_ij[time_Rt, ]
#> log_sigma_Z2 = .logsumexp(outer(log_m_tau, log_m_tau,
#> "+") + log_Sigma_ij) - 2 * log_S_plus
#> mu_Z = log_S_plus - 1/2 * exp(log_sigma_Z2)
#> log_Sigma_0Z = .logsumexp(log_m_tau + log_Sigma_0tau) -
#> log_S_plus
#> mu_Rt = mu - mu_Z
#> sigma_Rt2 = sigma^2 + exp(log_sigma_Z2) - 2 * exp(log_Sigma_0Z)
#> return(c(mu_Rt, sigma_Rt2))
#> })
#> mu_Rt = tmp[1, ]
#> sigma_Rt2 = tmp[2, ]
#> means = exp(mu_Rt + sigma_Rt2/2)
#> vars = (exp(sigma_Rt2) - 1) * exp(2 * mu_Rt + sigma_Rt2)
#> mean_star = mean(means)
#> var_star = mean(vars + means^2) - mean_star^2
#> mu_star = log(mean_star) - log(var_star/mean_star^2 + 1)/2
#> sigma_star = sqrt(log(1 + var_star/mean_star^2))
#> return(list(time_Rt = time_Rt, mean_Rt_star = mean_star,
#> var_Rt_star = var_star, meanlog_Rt_star = mu_star, sdlog_Rt_star = sigma_star,
#> mu_Rt_mix = mu_Rt, sigma_Rt_mix = sqrt(sigma_Rt2), quantile_Rt_fn = function(p) {
#> .qmixlnorm(p, mu_Rt, sqrt(sigma_Rt2))
#> }))
#> }
#> <bytecode: 0x58b456392eb0>
#> <environment: namespace:ggoutbreak>
Results
Testing this against the incidence model shown above, and comparing the results to the SPI-M-O consensus estimates gives us the following time-series for England. This is formally evaluated elsewhere but qualitatively is a good fit. It took a few seconds to calculate the reproduction number from this single time series with 1410 time points , which opens up the possibility of performing estimates in fine grained geographical or demographic subgroups.
#> user system elapsed
#> 10.691 1.273 12.015
Conclusion
We present a methodology for deriving from modelled estimates of incidence while propagating uncertainty. We demonstrate it produces satisfactory qualitative results against COVID-19 data. This method is relatively quick, fully deterministic, and can be used on top of any statistical models for estimating incidence which use logarithmic link functions.