Skip to contents

Robert Challen 1,2; Leon Danon 1,2;

  1. Engineering Mathematics, University of Bristol, Bristol, UK
  2. AI4CI

Introduction

If we have estimated the incidence of a disease ItI_t using a poisson rate using maximum likelihood estimators, the rate is typically a log-normally distributed with parameters μ\mu and σ\sigma. Such a fitted model is shown below on a log1p scale, for the COVID-19 epidemic in England:

It is appealing to use this modelled incidence estimate to calculate an estimate of the reproduction number, RtR_t. Incidence models can be derived in a number of ways, they are easily inspected for error and can be made tolerant of missing values and outliers.

Methods

To use a modelled estimate of incidence to predict RtR_t we need to propagate uncertainty in incidence into our RtR_t estimates. To calculate RtR_t we can use the backwards-looking renewal equations which incorporate the infectivity profile of the disease (ω\omega) at a number of days after infection (τ\tau):

ItPoisson(λt)λtLognormal(μt,σt)Rt=ItτωτItτ \begin{eqnarray} \begin{aligned} I_t &\sim Poisson(\lambda_t) \\ \lambda_t &\sim Lognormal(\mu_t,\sigma_t) \\ R_t &= \frac{I_t}{\sum_{\tau}{\omega_{\tau}I_{t-\tau}}} \end{aligned} \end{eqnarray}

In expectation, this gives:

Rtλtτ=1kωτλtτLognormal(μt,σt)τ=1kLognormal(μtτ+logωτ,σtτ) \begin{eqnarray} \begin{aligned} R_t &\approx \frac{\lambda_t}{\sum_{\tau=1}^{k} \omega_\tau \lambda_{t-\tau}} \\ &\sim \frac{\text{Lognormal}(\mu_t, \sigma_t)}{\sum_{\tau=1}^{k} \text{Lognormal}\left( \mu_{t-\tau} + \log \omega_\tau, \sigma_{t-\tau} \right)} \end{aligned} \tag{1} \end{eqnarray}

It has been shown that the sum of correlated log-normal distributed random variables can be approximated by another log-normal with parameters μZ\mu_Z and σZ\sigma_Z, where the correlation between them is ρij=Corr(logXi,logXj)\rho_{ij} = \text{Corr}(\log X_i, \log X_j).

S+=𝔼[iXi]=i𝔼[Xi]=ieμi+12σi2σZ2=1S+2i,jρijσiσj𝔼[Xi]𝔼[Xj]=1S+2i,jρijσiσjeμi+12σi2eμj+12σj2μZ=logS+12σZ2 \begin{eqnarray} \begin{aligned} S_+ &= \mathbb{E}\left[\sum_i X_i \right] = \sum_i \mathbb{E}[X_i] \\ &= \sum_i e^{\mu_i + \frac{1}{2}\sigma_i^2} \\ \sigma^2_{Z} &= \frac{1}{S_+^2} \, \sum_{i,j} \rho_{ij} \sigma_i \sigma_j \mathbb{E}[X_i] \mathbb{E}[X_j] \\ &= \frac{1}{S_+^2} \, \sum_{i,j} \rho_{ij} \sigma_i \sigma_j e^{\mu_i+\frac{1}{2}\sigma_i^2} e^{\mu_j+\frac{1}{2}\sigma_j^2} \\ \mu_Z &= \log S_+ - \frac{1}{2}\sigma_{Z}^2 \end{aligned} \tag{2} \end{eqnarray}

The sum term in the denominator of the renewal equation (1) consists of a set of correlated scaled log normal distributions with scale defined by the infectivity profile (ω\omega). For our case for a given time point tt we equate Xi=ωτλtτX_i = \omega_{\tau} \lambda_{t-\tau}, and substitute μi\mu_i = μtτ+log(ωτ)\mu_{t-\tau} + log(\omega_{\tau}) and σi=σtτ\sigma_i = \sigma_{t-\tau} into (2) to account for the infectivity profile. We define kk to be the support of the infectivity profile (k=|ω|k = |\omega|). msm_{s} is the weighted contribution from incidence estimates on day tτt-\tau, and Σij\Sigma_{ij} is the covariance between log-incidence estimates from days tit-i and tjt-j.

mτ=eμtτ+logωτ+12σtτ2Σij=Cov(logλti,logλtj)=ρ(ti)(tj)σtiσtjS+=τ=1kmτσZ2=i,j=1k(mimjΣij)S+2μZ=logS+12σZ2 \begin{eqnarray} \begin{aligned} m_{\tau} &= e^{\mu_{t-\tau} + \log \omega_{\tau} + \frac{1}{2}\sigma_{t-\tau}^2 } \\ \Sigma_{ij} &= \text{Cov}(\log \lambda_{t-i}, \log \lambda_{t-j}) = \rho_{(t-i)(t-j)} \sigma_{t-i} \sigma_{t-j} \\ S_{+} &= \sum_{\tau=1}^k { m_{\tau} } \\ \sigma_{Z}^2 &= \frac{ \sum_{i,j=1}^k { (m_{i} m_{j} \Sigma_{ij}) } }{S_{+}^2} \\ \mu_{Z} &= \log S_+ - \frac{1}{2}\sigma_{Z}^2 \end{aligned} \end{eqnarray}

With μZ\mu_{Z} and σZ\sigma_{Z} defined, RtR_t is approximated as the ratio of two log-normals where Σ0Z=Cov(logλt,logS)\Sigma_{0Z} = \text{Cov}(\log \lambda_t, \log S) is the covariance between the numerator and the log-denominator. Since S=τXτS = \sum_\tau X_\tau, and using a first-order approximation, this covariance is a weighted average of the covariances between logλt\log \lambda_t and each logλtτ\log \lambda_{t-\tau}, weighted by the relative expected contributions mτm_\tau.:

RtLognormal(μt,σt)Lognormal(μZ,σZ)μRt=μtμZ=μtlogS++12σZ2σRt=σt2+σZ22Σ0ZΣ0Z=τ=1kmτΣ0ττ=1kmτ=1S+τ=1kmτΣ0τRtLognormal(μRt,σRt) \begin{eqnarray} \tag{3} \begin{aligned} R_t &\sim \frac{Lognormal(\mu_t,\sigma_t)} {Lognormal( \mu_{Z}, \sigma_{Z})} \\ \mu_{R_t} &= \mu_t - \mu_{Z} = \mu_t - \log S_+ + \frac{1}{2}\sigma_{Z}^2\\ \sigma_{R_t} &= \sqrt{\sigma_t^2+\sigma_{Z}^2 - 2 \Sigma_{0Z}} \\ \Sigma_{0Z} &= \frac{\sum_{\tau=1}^k m_\tau \Sigma_{0\tau}}{\sum_{\tau=1}^k m_\tau} = \frac{1}{S_+}\sum_{\tau=1}^k m_\tau \Sigma_{0\tau}\\ R_t &\sim Lognormal(\mu_{R_t}, \sigma_{R_t}) \end{aligned} \end{eqnarray}

This formulation of RtR_t from (3) assumes knowledge of the posterior or prediction covariance of the incidence estimates (Σij\Sigma_ij). This is typical in modern frameworks , but in other situations may not be available. In this case we could assume the individual estimates of the incidence are independent, however this increases the uncertainty of our RtR_t estimate and in certain circumstances introduces a potential underestimation bias, influenced by the true off-diagonal correlation mass and the certainty of the incidence estimates. An alternative approach is to assume weak stationarity and estimate a parametric correlation model from the data used to build the incidence model, using Pearson residuals to parameterise an exponential decay function based on time difference (see supporting software package for implementation). The degree of the bias is investigated further in the supplementary materials, and heuristics for assessing the significance of this bias proposed.

The method for estimating RtR_t from modelled incidence has been described assuming a non-negative component to the infectivity profile, as it is implicit that infector and infectee are necessarily sequential in time. In the situation where symptomatic case counts are used as a proxy for incidence and the serial interval as a proxy for the infectivity profile, negative times between serial cases may be observed due to variation in delay in observation of the transmission chain. There is nothing in this framework to stop the use of a negative time for the infectivity profile, and we can directly support RtR_t estimates in these cases.

Numerical stability

In (3) μt\mu_t is the log-scale mean of the incidence estimate at time tt, and σt\sigma_t its standard deviation. These can be large, leading to numerical instability in terms involving exp(μ+σ2)\exp(\mu + \sigma^2). However, assuming non-negative correlations and using log-space computation with optimized log-sum-exp functions , the expressions remain computationally tractable:

logmτ=μtτ+logωτ+12σtτ2logS+=logsumexpτ(logmτ)logσZ2=logsumexpi,j(logmi+logmj+logΣij)2logS+logΣ0Z=logsumexpτ(logmτ+logΣ0τ)logS+ \begin{eqnarray} \begin{aligned} \log m_\tau &= \mu_{t-\tau} + \log \omega_\tau + \frac{1}{2} \sigma_{t-\tau}^2 \\ \log S_+ &= \text{logsumexp}_\tau(\log m_\tau) \\ \log \sigma_Z^2 &= \text{logsumexp}_{i,j} \left( \log m_i + \log m_j + \log \Sigma_{ij} \right) - 2 \log S_+ \\ \log \Sigma_{0Z} &= \text{logsumexp}_\tau \left( \log m_\tau + \log \Sigma_{0\tau} \right) - \log S_+ \end{aligned} \tag{4} \end{eqnarray}

Other relations may be implemented directly as in (3).

Infectivity profile uncertainty

This estimate of RtR_t is conditioned on a single known infectivity profile. In reality there is also uncertainty in the infectivity profile (ω\omega) which plays a role in the definition of μZ,t\mu_{Z,t} and σZ,t\sigma_{Z,t}. We cannot assume any particular distributional form for the infectivity profile, but we can use a range of empirical estimates of the infectivity profile to calculate multiple distributional estimates for RtR_t and then combine these as a mixture distribution.

The nature of this mixture distribution will depend on the various empirical infectivity profile distributions. However, we can use general properties of mixture distributions to create estimates for the mean and variance of the reproduction number estimate (Rt*R_t^*) combining the uncertainty arising from multiple infection profile estimates (Ω\Omega) and from the incidence estimate model itself:

E[Rt|ω]=e(μRt,ω12σRt,ω2)V[Rt|ω]=[e(σRt,ω2)1][e2μRt,ω+σRt,ω2]E[Rt*]=1|Ω|ωΩE[Rt|ω]V[Rt*]=1|Ω|[ωΩV[Rt|ω]+E[Rt|ω]2]E[Rt*]2 \begin{eqnarray} \begin{aligned} E[R_t|\omega] &= e^{(\mu_{R_t,\omega} - \frac{1}{2}\sigma_{R_t,\omega}^2)} \\ V[R_t|\omega] &= \big[e^{(\sigma_{R_t,\omega}^2)} - 1\big] \big[e^{2 \mu_{R_t,\omega} + \sigma_{R_t,\omega}^2}\big] \\ E[R_t^*] &= \frac{1}{|\Omega|}\sum_{\omega \in \Omega} E[{R_t|\omega}] \\ V[R_t^*] &= \frac{1}{|\Omega|} \bigg[\sum_{\omega \in \Omega}{V[R_t|\omega]+E[R_t|\omega]^2}\bigg] - E[R_t^*]^2 \\ \end{aligned} \end{eqnarray}

The cumulative distribution function of the mixture is simply the arithmetic mean of the component cumulative distribution functions (conditioned on each infectivity profile). If Φ\Phi is the cumulative distribution function of the standard normal distribution:

FRt*(x)=1|Ω|ωΩFRt(x|ω)P(Rt*x)=1|Ω|ωΩP(Rt,ωx)P(Rt*x)=1|Ω|ωΩΦ(ln(x)μRt,ωσRt,ω) \begin{eqnarray} \begin{aligned} F_{R_t^*}(x) &= \frac{1}{|\Omega|}\sum_{\omega \in \Omega}F_{R_t}(x|\omega) \\ P(R_t^* \le x) &= \frac{1}{|\Omega|}\sum_{\omega \in \Omega} P(R_{t,\omega} \le x) \\ P(R_t^* \le x) &= \frac{1}{|\Omega|}\sum_{\omega \in \Omega} \Phi\bigg(\frac{ln(x) - \mu_{R_t,\omega}}{\sigma_{R_t,\omega}}\bigg) \end{aligned} \end{eqnarray}

As the cumulative density function of this mixture distribution is a strictly increasing function, specific solutions for median (q0.5q_{0.5}) and 95% confidence intervals (q0.025q_{0.025} and q0.975q_{0.975}) can be calculated numerically by solving the following equations:

1|Ω|ωΩΦ(ln(q0.025)μRt,ωσRt,ω)0.025=01|Ω|ωΩΦ(ln(q0.5)μRt,ωσRt,ω)0.5=01|Ω|ωΩΦ(ln(q0.975)μRt,ωσRt,ω)0.975=0 \begin{eqnarray} \begin{aligned} \frac{1}{|\Omega|}\sum_{\omega \in \Omega} \Phi\bigg(\frac{ln(q_{0.025}) - \mu_{R_t,\omega}}{\sigma_{R_t,\omega}}\bigg) - 0.025 &= 0 \\ \frac{1}{|\Omega|}\sum_{\omega \in \Omega} \Phi\bigg(\frac{ln(q_{0.5}) - \mu_{R_t,\omega}}{\sigma_{R_t,\omega}}\bigg) - 0.5 &= 0 \\ \frac{1}{|\Omega|}\sum_{\omega \in \Omega} \Phi\bigg(\frac{ln(q_{0.975}) - \mu_{R_t,\omega}}{\sigma_{R_t,\omega}}\bigg) - 0.975 &= 0 \end{aligned} \end{eqnarray}

Numerical solutions to this are moderately expensive to perform. A reasonable approximation can be expected by matching moments of a log normal distribution to the mean E[Rt*]E[R_t^*] and variance V[Rt*]V[R_t^*] of the mixture. This gives us the final closed form estimator for the reproduction number given a set of infectivity profiles, Rt,Ω¯\overline{R_{t,\Omega}}, as:

μt|Ω=log(E[Rt*]2E[Rt*]2+V[Rt*])σt|Ω=log(1+V[Rt*]E[Rt*]2)Rt|Ω¯Lognormal(μt|Ω,σt|Ω) \begin{eqnarray} \begin{aligned} \mu_{t|\Omega} &= log\bigg(\frac{E[R_t^*]^2}{\sqrt{E[R_t^*]^2 + V[R_t^*]}}\bigg) \\ \sigma_{t|\Omega} &= \sqrt{log\bigg(1 + \frac{V[R_t^*]}{E[R_t^*]^2}\bigg)}\\ \overline{R_{t|\Omega}} &\sim Lognormal(\mu_{t|\Omega},\sigma_{t|\Omega}) \end{aligned} \end{eqnarray}

In summary we present a method for retrieving the distributional form of the reproduction number from log normally distributed probabilistic estimates of incidence arising from simple statistical count models. This includes uncertainty arising from both count models and from infectivity profile distributions. It is fully deterministic and computationally inexpensive. It does not place any particular constraints on the nature of the infectivity profile distribution and can handle distributions that have a negative component, as is sometimes seen in serial interval estimates.

Implementation

This method is implemented using the following R function, which is designed for numerical stability and speed. Generating RtR_t estimates given modelled incidence typically occurring in a matter of seconds:

#> function (mu_t, vcov_ij = diag(sigma_t^2), omega, sigma_t = NULL, 
#>     tau_offset = 0) 
#> {
#>     omega_m = as.matrix(omega)
#>     k = nrow(omega_m)
#>     if (k != length(mu_t)) {
#>         stop("omega must have the same number of rows as length of mu_t.")
#>     }
#>     if (!all(dim(vcov_ij) == k)) {
#>         stop("vcov_ij must be a square matrix with the same dimensions as mu_t, and nrow(omega)")
#>     }
#>     omega_m = apply(omega_m, MARGIN = 2, rev)
#>     sigma_t = sqrt(diag(vcov_ij))
#>     time_Rt = length(mu_t) - tau_offset
#>     mu = mu_t[time_Rt]
#>     sigma = sigma_t[time_Rt]
#>     tmp = apply(omega_m, MARGIN = 2, function(omega) {
#>         log_m_tau = mu_t + sigma_t^2/2 + log(omega)
#>         log_S_plus = .logsumexp(log_m_tau)
#>         log_Sigma_ij = log(pmax(vcov_ij, 0))
#>         log_Sigma_0tau = log_Sigma_ij[time_Rt, ]
#>         log_sigma_Z2 = .logsumexp(outer(log_m_tau, log_m_tau, 
#>             "+") + log_Sigma_ij) - 2 * log_S_plus
#>         mu_Z = log_S_plus - 1/2 * exp(log_sigma_Z2)
#>         log_Sigma_0Z = .logsumexp(log_m_tau + log_Sigma_0tau) - 
#>             log_S_plus
#>         mu_Rt = mu - mu_Z
#>         sigma_Rt2 = sigma^2 + exp(log_sigma_Z2) - 2 * exp(log_Sigma_0Z)
#>         return(c(mu_Rt, sigma_Rt2))
#>     })
#>     mu_Rt = tmp[1, ]
#>     sigma_Rt2 = tmp[2, ]
#>     means = exp(mu_Rt + sigma_Rt2/2)
#>     vars = (exp(sigma_Rt2) - 1) * exp(2 * mu_Rt + sigma_Rt2)
#>     mean_star = mean(means)
#>     var_star = mean(vars + means^2) - mean_star^2
#>     mu_star = log(mean_star) - log(var_star/mean_star^2 + 1)/2
#>     sigma_star = sqrt(log(1 + var_star/mean_star^2))
#>     return(list(time_Rt = time_Rt, mean_Rt_star = mean_star, 
#>         var_Rt_star = var_star, meanlog_Rt_star = mu_star, sdlog_Rt_star = sigma_star, 
#>         mu_Rt_mix = mu_Rt, sigma_Rt_mix = sqrt(sigma_Rt2), quantile_Rt_fn = function(p) {
#>             .qmixlnorm(p, mu_Rt, sqrt(sigma_Rt2))
#>         }))
#> }
#> <bytecode: 0x58b456392eb0>
#> <environment: namespace:ggoutbreak>

Results

Testing this against the incidence model shown above, and comparing the results to the SPI-M-O consensus RtR_t estimates gives us the following time-series for England. This is formally evaluated elsewhere but qualitatively is a good fit. It took a few seconds to calculate the reproduction number from this single time series with 1410 time points , which opens up the possibility of performing RtR_t estimates in fine grained geographical or demographic subgroups.

#>    user  system elapsed 
#>  10.691   1.273  12.015

Conclusion

We present a methodology for deriving RtR_t from modelled estimates of incidence while propagating uncertainty. We demonstrate it produces satisfactory qualitative results against COVID-19 data. This method is relatively quick, fully deterministic, and can be used on top of any statistical models for estimating incidence which use logarithmic link functions.