ggoutbreak assumes a consistent naming scheme for significant columns, notably time and count, and additionally the class, denom and population columns. Data must be supplied using these column names, and it undergoes fairly rigorous checks to make sure it is formatted correctly. One area where this can cause problems is grouping: each group must be a single time series with unique time values and, at a minimum, a count column.
Line lists vs. time series
Infectious disease data usually either comes as a set of observations of an individual infection with a time stamp (i.e. a line list) or as a count of events (e.g. positive tests, hospitalisations, deaths) happening within a specific period (day, week, month etc.) as a time series.
For count data there may also be a denominator known. For testing this could be the number of tests performed, or the number of patients at risk of hospitalisation.
For both these data types there may also be a class associated with each observation, defining a subgroup of infections of interest. This could be the variant of a virus, or the age group, for example. It may make sense to compare these different subgroups against each other. In this case the denominator may be the total of counts among all groups per unit time. Additionally there may be information about the size of the population for each subgroup.
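As a rough illustration of the two data shapes described above, the following base R sketch turns a tiny line list into a count time series with a class and a denominator. The data and the variable names (linelist, counts) are invented for illustration, and this is not how ggoutbreak does the conversion internally.

```r
# Hypothetical line list: one row per infection, labelled with a variant.
linelist <- data.frame(
  date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02",
                   "2024-01-02", "2024-01-02")),
  class = c("alpha", "beta", "alpha", "alpha", "beta")
)

# Count events per day and class, giving a time series in long format.
counts <- as.data.frame(
  table(date = linelist$date, class = linelist$class),
  responseName = "count",
  stringsAsFactors = FALSE
)

# Denominator: the total count over all classes on each day.
counts$denom <- ave(counts$count, counts$date, FUN = sum)

# table() stored the dates as text, so convert them back to Date.
counts$date <- as.Date(counts$date)
counts
```

Each row of counts is now one class on one day, and count / denom gives the share of that class among all infections recorded that day.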
ggoutbreak assumes for the most part that the input data is in the form of a set of time series of counts, each of which has a unique set of times, which are usually complete. To create datasets like this from line lists, ggoutbreak provides some infrastructure for dealing with time series.
Time periods
A weekly case rate represents a time slice of seven days with a start and finish date. Dates are a continuous quantity, and cut_date() can be used to classify continuous dates into periods of equal duration, with a start date:
random_dates = Sys.Date()+sample.int(21,50,replace = TRUE)
cut_date( random_dates, unit = "1 week", anchor = "start", dfmt = "%d %b")
#> 14 Feb — 20 Feb 14 Feb — 20 Feb 07 Feb — 13 Feb 07 Feb — 13 Feb 21 Feb — 27 Feb
#> "2025-02-14" "2025-02-14" "2025-02-07" "2025-02-07" "2025-02-21"
#> 21 Feb — 27 Feb 07 Feb — 13 Feb 07 Feb — 13 Feb 14 Feb — 20 Feb 07 Feb — 13 Feb
#> "2025-02-21" "2025-02-07" "2025-02-07" "2025-02-14" "2025-02-07"
#> 07 Feb — 13 Feb 21 Feb — 27 Feb 14 Feb — 20 Feb 07 Feb — 13 Feb 14 Feb — 20 Feb
#> "2025-02-07" "2025-02-21" "2025-02-14" "2025-02-07" "2025-02-14"
#> 14 Feb — 20 Feb 07 Feb — 13 Feb 21 Feb — 27 Feb 14 Feb — 20 Feb 07 Feb — 13 Feb
#> "2025-02-14" "2025-02-07" "2025-02-21" "2025-02-14" "2025-02-07"
#> 07 Feb — 13 Feb 07 Feb — 13 Feb 21 Feb — 27 Feb 07 Feb — 13 Feb 07 Feb — 13 Feb
#> "2025-02-07" "2025-02-07" "2025-02-21" "2025-02-07" "2025-02-07"
#> 14 Feb — 20 Feb 07 Feb — 13 Feb 21 Feb — 27 Feb 07 Feb — 13 Feb 14 Feb — 20 Feb
#> "2025-02-14" "2025-02-07" "2025-02-21" "2025-02-07" "2025-02-14"
#> 14 Feb — 20 Feb 21 Feb — 27 Feb 07 Feb — 13 Feb 21 Feb — 27 Feb 21 Feb — 27 Feb
#> "2025-02-14" "2025-02-21" "2025-02-07" "2025-02-21" "2025-02-21"
#> 07 Feb — 13 Feb 07 Feb — 13 Feb 07 Feb — 13 Feb 21 Feb — 27 Feb 14 Feb — 20 Feb
#> "2025-02-07" "2025-02-07" "2025-02-07" "2025-02-21" "2025-02-14"
#> 07 Feb — 13 Feb 21 Feb — 27 Feb 21 Feb — 27 Feb 07 Feb — 13 Feb 14 Feb — 20 Feb
#> "2025-02-07" "2025-02-21" "2025-02-21" "2025-02-07" "2025-02-14"
#> 07 Feb — 13 Feb 14 Feb — 20 Feb 21 Feb — 27 Feb 21 Feb — 27 Feb 14 Feb — 20 Feb
#> "2025-02-07" "2025-02-14" "2025-02-21" "2025-02-21" "2025-02-14"
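The idea behind this classification can be sketched in base R with integer division. This is only an illustration of the arithmetic under the assumption that the earliest date anchors the first period; the names anchor, week_index, period_start and period_end are made up here and are not part of ggoutbreak.

```r
dates <- as.Date(c("2025-02-07", "2025-02-13", "2025-02-14", "2025-02-27"))

# Assumption: the earliest date marks the start of the first period.
anchor <- min(dates)

# Index of the 7-day period each date falls into (0 is the first week).
week_index <- as.numeric(dates - anchor) %/% 7

# First and last day of the period that contains each date.
period_start <- anchor + 7 * week_index
period_end <- period_start + 6

data.frame(dates, week_index, period_start, period_end)
```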
Performing calculations using interval censored dates is awkward. It is useful to have a numeric version of dates that keeps track of both the start date of a time series and its intrinsic duration. This is the purpose of the time_period class:
dates = seq(as.Date("2020-01-01"),by=7,length.out = 5)
tmp = as.time_period(dates)
#> No `start_date` (or `anchor`) specified. Using default (N.b. set `options('day_zero'=XXX)` to change): 2019-12-29
#> No unit given. Guessing a sensible value from the dates gives: 7d 0H 0M 0S
tmp
#> time unit: week, origin: 2019-12-29 (a Sunday)
#> [1] 0.4285714 1.4285714 2.4285714 3.4285714 4.4285714
The time_period defaults to using a date at the beginning of the COVID-19 pandemic as its origin, and to calculating a duration unit based on the data (in this case weekly).
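The numeric values printed for tmp can be reproduced by hand, which may help in reading them: each time is the number of units elapsed since the origin. The sketch below is plain base R arithmetic using the origin and unit reported in the messages above; the variable names are illustrative, and it makes no claim about how the time_period class stores its data.

```r
dates <- seq(as.Date("2020-01-01"), by = 7, length.out = 5)
origin <- as.Date("2019-12-29")  # the default origin reported above
unit_days <- 7                   # the guessed unit: one week

# Time expressed as the number of units elapsed since the origin.
times <- as.numeric(dates - origin) / unit_days
times

# Inverting the calculation recovers the dates. round() guards against
# floating point error and assumes the inputs were whole days.
origin + round(times * unit_days)
```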
The usual set of S3 methods is available, such as formatting, printing, labelling, and casting time_period objects to and from Date and POSIXct classes:
suppressWarnings(labels(tmp))
#> labelling applied to non-integer times.
#> 01/Jan — 08/Jan
#> 08/Jan — 15/Jan
#> 15/Jan — 22/Jan
#> 22/Jan — 29/Jan
#> 29/Jan — 05/Feb
A weekly time series can be recast to a different frequency, or start date:
tmp2 = as.time_period(tmp, unit = "2 days", start_date = "2020-01-01")
tmp2
#> time unit: 2 days, origin: 2020-01-01 (a Wednesday)
#> [1] 0.0 3.5 7.0 10.5 14.0
and the original dates should be recoverable:
as.Date(tmp2)
#> [1] "2020-01-01" "2020-01-08" "2020-01-15" "2020-01-22" "2020-01-29"
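Recasting amounts to a change of origin and scale. As a hedged sketch of that arithmetic (not the package's implementation), the values printed for tmp2 can be derived from those of tmp in base R; all the variable names below are illustrative.

```r
# Weekly times relative to 2019-12-29, as printed for tmp.
old_times <- c(3, 10, 17, 24, 31) / 7
old_origin <- as.Date("2019-12-29")
old_unit <- 7  # days per unit
new_origin <- as.Date("2020-01-01")
new_unit <- 2  # days per unit

# Convert to days since the old origin, shift to the new origin, rescale.
offset_days <- as.numeric(new_origin - old_origin)
new_times <- (old_times * old_unit - offset_days) / new_unit
new_times
```

Up to floating point error this should agree with the values shown for tmp2, and because the transformation is linear it can be inverted to get back to the dates.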
date_seq() can be used to make sure a set of periodic times is complete:
tmp3 = as.time_period(Sys.Date()+c(0:2,4:5)*7,anchor = "start")
as.Date(date_seq(tmp3))
#> [1] "2025-02-06" "2025-02-13" "2025-02-20" "2025-02-27" "2025-03-06"
#> [6] "2025-03-13"
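For comparison, completing a regular weekly sequence on plain Date values can be sketched in base R as below. The fixed start date is chosen only to make the example reproducible, and the names weekly, complete and missing_weeks are illustrative.

```r
# Weekly dates with the fourth week missing.
weekly <- as.Date("2025-02-06") + c(0:2, 4:5) * 7

# The complete sequence spans the observed range in steps of one week.
complete <- seq(min(weekly), max(weekly), by = 7)

# Periods present in the complete sequence but absent from the data.
missing_weeks <- complete[!complete %in% weekly]
missing_weeks
```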
time_period objects can also be used with monthly or yearly data, but such data are not regular. This is handled approximately, and irregular date periods are generally OK to use with ggoutbreak. Some functions, like date_seq(), may not work as anticipated with irregular dates, and some conversions, between weeks and months for example, are potentially risky.
Two time series can be aligned to make them comparable:
orig_dates = Sys.Date()+1:10*7
# a 2 daily time series based on weekly dates
t1 = as.time_period(orig_dates, unit = "2 days", start_date = "2021-01-01")
t1
#> time unit: 2 days, origin: 2021-01-01 (a Friday)
#> [1] 752.0 755.5 759.0 762.5 766.0 769.5 773.0 776.5 780.0 783.5
# a weekly time series with a different start date
t2 = as.time_period(orig_dates, unit = "1 week", start_date = "2022-01-01")
t2
#> time unit: week, origin: 2022-01-01 (a Saturday)
#> [1] 162.7143 163.7143 164.7143 165.7143 166.7143 167.7143 168.7143 169.7143
#> [9] 170.7143 171.7143
# rebase t1 into the same format as t2
# as t1 and t2 are based on the same original dates, converting t1 onto the
# same periodicity as t2 results in an identical set of times
t3 = as.time_period(t1,t2)
t3
#> time unit: week, origin: 2022-01-01 (a Saturday)
#> [1] 162.7143 163.7143 164.7143 165.7143 166.7143 167.7143 168.7143 169.7143
#> [9] 170.7143 171.7143
Times in ggoutbreak and conversion of line-lists
ggoutbreak uses the time_period class extensively internally. Casting dates to and from time_period objects is generally all that needs to be done before using ggoutbreak. Most of the functions in ggoutbreak operate on time series data, and expect a unique (and usually complete) set of observations at periodic times.
To help prepare line-list data into time series there is the time_summarise() function. A minimal line-list will have a date column and nothing else.
random_dates = Sys.Date()+sample.int(21,50,replace = TRUE)
linelist = tibble::tibble(date = random_dates)
linelist %>% time_summarise(unit="1 week") %>% dplyr::glimpse()
#> Rows: 3
#> Columns: 2
#> $ time <time_prd> 0, 1, 2
#> $ count <int> 27, 17, 6
If the line-list contains a class column it is interpreted as a complete record of all possible options, from which a denominator can be calculated. In this case the classes are the positive and negative results of a test:
random_dates = Sys.Date()+sample.int(21,200,replace = TRUE)
linelist2 = tibble::tibble(
  date = random_dates,
  class = stats::rbinom(200, 1, 0.04) %>% ifelse("positive","negative")
)
linelist2 %>% time_summarise(unit="1 week") %>% dplyr::glimpse()
#> Rows: 6
#> Columns: 4
#> Groups: class [2]
#> $ class <chr> "negative", "negative", "negative", "positive", "positive", "pos…
#> $ time <time_prd> 0, 1, 2, 0, 1, 2
#> $ count <int> 52, 66, 73, 4, 2, 3
#> $ denom <int> 56, 68, 76, 56, 68, 76
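To make the relationship between count and denom concrete, the printed values can be checked by hand. This small base R sketch copies the numbers from the output above; the positivity calculation is an illustrative next step and not something time_summarise() returns.

```r
# Values copied from the printed summary above.
positive <- c(4, 2, 3)
negative <- c(52, 66, 73)

# The denominator is the total over all classes in each time period.
denom <- positive + negative
denom

# Proportion of tests that were positive in each week.
positivity <- positive / denom
round(positivity, 3)
```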
In this specific example subsequent analysis with ggoutbreak may focus on the positive subgroup only, as the comparison between positive and negative test results is trivial. In another example class may not be test results; it could be any other major subdivision, e.g. the variant of a disease. In this case the comparison between different groups may be much more relevant. The use of class as the major sub-group is for convenience. Additional grouping by columns other than class is also possible for multi-faceted comparisons. Such grouping is preserved, but it is not automatically included in the denominator, which may need to be calculated manually:
random_dates = Sys.Date()+sample.int(21,200,replace = TRUE)
variant = apply(stats::rmultinom(200, 1, c(0.1,0.3,0.6)), MARGIN = 2, function(x) which(x==1))
linelist3 = tibble::tibble(
  date = random_dates,
  class = c("variant1","variant2","variant3")[variant],
  gender = ifelse(stats::rbinom(200,1,0.5),"male","female")
)
count_by_gender = linelist3 %>%
  dplyr::group_by(gender) %>%
  time_summarise(unit="1 week") %>%
  dplyr::arrange(time, gender, class) %>%
  dplyr::glimpse()
#> Rows: 18
#> Columns: 5
#> Groups: gender, class [6]
#> $ gender <chr> "female", "female", "female", "male", "male", "male", "female",…
#> $ class <chr> "variant1", "variant2", "variant3", "variant1", "variant2", "va…
#> $ time <time_prd> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2
#> $ count <int> 5, 12, 17, 2, 5, 26, 4, 8, 24, 1, 8, 25, 3, 10, 25, 7, 6, 12
#> $ denom <int> 34, 34, 34, 33, 33, 33, 36, 36, 36, 34, 34, 34, 38, 38, 38…
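In the output above denom is the total within each gender and week (34 for females and 33 for males in week 0). One possible way to calculate a denominator across both genders by hand is sketched below in base R, using the week 0 rows copied from the output. The data frame week0 and the column denom_all are illustrative names only.

```r
# Week 0 rows copied from the printed output above.
week0 <- data.frame(
  gender = rep(c("female", "male"), each = 3),
  class = rep(c("variant1", "variant2", "variant3"), times = 2),
  time = 0,
  count = c(5, 12, 17, 2, 5, 26)
)

# Denominator within each gender and time, matching the printed denom column.
week0$denom <- ave(week0$count, week0$gender, week0$time, FUN = sum)

# Denominator over everyone in the same time period, ignoring gender.
week0$denom_all <- ave(week0$count, week0$time, FUN = sum)
week0
```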
Aggregating time series datasets
In the case of a time series with additional grouping present, removing a level of grouping whilst retaining time is made easier with time_aggregate(). In this case we wish to sum count and denom over gender, retaining the class grouping.
count_by_gender %>%
  dplyr::group_by(class,gender) %>%
  time_aggregate() %>%
  dplyr::glimpse()
#> Rows: 9
#> Columns: 4
#> Groups: class [3]
#> $ class <chr> "variant1", "variant1", "variant1", "variant2", "variant2", "var…
#> $ time <time_prd> 0, 1, 2, 0, 1, 2, 0, 1, 2
#> $ count <int> 7, 5, 10, 17, 16, 16, 43, 49, 37
#> $ denom <int> 67, 70, 63, 67, 70, 63, 67, 70, 63
By default time_aggregate() will sum any of the count, denom and population columns, but any other behaviour can be specified by passing dplyr::summarise style directives to the function.
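The default summation can be checked against the printed output with a base R equivalent. The sketch below uses aggregate() on the week 0 rows shown earlier; it is a stand-in for the behaviour described here, not the internals of time_aggregate(), and the names week0 and agg are illustrative.

```r
# Week 0 rows from the gender-stratified summary shown earlier.
week0 <- data.frame(
  gender = rep(c("female", "male"), each = 3),
  class = rep(c("variant1", "variant2", "variant3"), times = 2),
  time = 0,
  count = c(5, 12, 17, 2, 5, 26),
  denom = c(34, 34, 34, 33, 33, 33)
)

# Sum count and denom over gender, keeping class and time.
agg <- aggregate(cbind(count, denom) ~ class + time, data = week0, FUN = sum)
agg
```

The result has one row per class for week 0, with counts of 7, 17 and 43 and a denominator of 67, in line with the first time point of each class in the aggregated output above.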