Simulation, scoring and convergence functions
Source:vignettes/scorer-functions.Rmd
scorer-functions.Rmd
library(tidyabc)
#>
#> Attaching package: 'tidyabc'
#> The following objects are masked from 'package:base':
#>
#> transform, truncate
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)Introduction
This vignette provides a detailed guide on writing the core
user-defined functions required for Approximate Bayesian Computation
(ABC) using tidyabc: the simulation function
(sim_fn), the scoring function (scorer_fn),
and the convergence function (converged_fn). Understanding
the precise input and output structure of these functions is crucial for
correctly implementing an ABC workflow.
1. Writing the Simulation Function (sim_fn)
The simulation function (sim_fn) defines your generative
model. It takes parameter values as input and returns simulated
data.
Inputs
-
Named Arguments: The
sim_fnmust accept named arguments corresponding to the parameters defined in yourpriors_list. For example, if you havepriors(norm_mean ~ unif(0, 10),norm_sd ~ unif(0, 10)), yoursim_fnmust acceptnorm_meanandnorm_sdas arguments. -
...(Optional): While not strictly required by theabc_*functions themselves, it’s good practice to allow for...in the function definition if you anticipate passing extra arguments during debugging or testing.
Output
- List: The function must return a list containing the simulated data. The structure of this list is entirely up to your model. It could be a list of vectors, a list of data frames, or any other R object that represents the output of your simulation model.
Example
# Example sim_fn matching the parameters from priors defined later
example_sim_fn = function(norm_mean, norm_sd, gamma_mean, gamma_sd) {
# Simulate data based on parameters
A = rnorm(1000, norm_mean, norm_sd)
B = rgamma2(1000, gamma_mean, gamma_sd) # Using a custom dist from tidyabc
C = rgamma2(1000, gamma_mean, gamma_sd)
# Return a list containing the simulated outputs
return(
list(
data1 = A + B - C, # Example composite output
data2 = B * C # Example composite output
)
)
}2. Writing the Scoring Function (scorer_fn)
The scoring function (scorer_fn) compares simulated data
to observed data, producing a set of summary statistics (distances or
other metrics) that quantify the “closeness” of the simulation to the
observation.
Inputs
-
simdata: This is the output list from yoursim_fn. -
obsdata: This is the observed data provided to theabc_*function, which should be in the same format as thesimdatalist.
Output
-
List: The function must return a
list. Each element of this list corresponds to a
component of the overall summary statistic used for comparison.
Commonly, these components are distances (e.g., Wasserstein, RMSE)
between corresponding parts of
simdataandobsdata, but they could also be other metrics like summary moments (mean, variance). -
Naming: The names of the list elements returned by
scorer_fnare crucial. They must be consistent and identifiable, as they are used for weighting (scoreweights) and internal processing.
Combining Scorer Outputs and Comparison to Observed Scores
-
Execution: The
scorer_fnis called repeatedly within the ABC workflow, once for each simulation (simdata) against the fixedobsdata. -
Component Scores: The output of
scorer_fn(simdata, obsdata)is a list of component scores (e.g.,list(data1 = dist1, data2 = dist2)). -
Combining Components: The ABC algorithm combines
these component scores into a single summary distance for each
simulation. The method is specified by the
distance_methodargument inabc_*functions:-
"euclidean"(default): Calculates the weighted Euclidean distance:
\[ d_{summary} = \sqrt{\sum_i (w_i \cdot s_i)^2} \] where \(s_i\) is the \(i\)-th component score (e.g.,dist1,dist2) and \(w_i\) is the corresponding weight from thescoreweightsvector. -
"manhattan": Calculates the weighted Manhattan distance:
\[ d_{summary} = \sum_i w_i \cdot |s_i| \] -
"mahalanobis": Calculates the Mahalanobis distance using the covariance matrix of the component scores from the first wave, incorporating weights.
-
-
obsscores: This argument allows you to provide pre-calculated component scores for theobsdataitself (e.g.,list(data1 = 0, data2 = 0)if comparing againstobsdatausingscorer_fn(obsdata, obsdata)yields zeros). This is optional and often inferred internally ifscorer_fnis run onobsdataitself initially. -
Tolerance and Kernels: The final \(d_{summary}\) is compared against a
tolerance threshold (
epsilon) using a kernel function (defined by thekernelargument) to calculate the ABC weights for each simulation.
Example
# Example scorer_fn matching the simdata structure from example_sim_fn
example_scorer_fn = function(simdata, obsdata) {
# Compare corresponding elements of simdata and obsdata
# Using calculate_wasserstein from tidyabc as an example distance
return(list(
data1 = calculate_wasserstein(simdata$data1, obsdata$data1),
data2 = calculate_wasserstein(simdata$data2, obsdata$data2)
))
# Output: list(data1 = <dist_val1>, data2 = <dist_val2>)
}Score Weights (scoreweights)
The scoreweights argument allows you to assign relative
importance to each component score returned by
scorer_fn.
-
Type: A named numeric vector, where names
correspond to the names of the list elements returned by
scorer_fn. -
Purpose: If
data1anddata2have different scales or importance, you can use weights to balance their contribution to the summary distance. For instance,c(data1 = 1, data2 = 2)meansdata2’s distance contributes twice as much asdata1’s distance to the final \(d_{summary}\) (before applying the square root for Euclidean). -
Automatic Calculation: The
posterior_distance_metrics()function can analyze results from a trial run (e.g.,abc_rejection) and suggest suitablescoreweightsbased on the scale of the summary statistics.
3. Debugging and Parallelisation
It is a good idea to run your simulation with some test parameters
first. Similarly testing the scoring function is correctly working is a
good idea. An end to end test harness is supplied in
test_simulation() that will run your simulation and scorer
for a fixed set of parameters. It can be configured to drop to an
interactive debugging browser if it encounters an error.
Although testing against one set of parameters will catch major
problems edge cases often arise during the main ABC. Each of the ABC
functions can also drop the user to a debugging browser if an issue is
detected. This is not compatible with parallelisation though so best to
run a small set of simulations using
e.g. abc_rejection().
For parallel execution the simulation and scorer functions must be completely self contained, and make no reference to global environment variables, or unqualified references to package functions. Global references will in general be automatically detected, and you will be alerted and your functions will be rewritten with common package names qualified so that they are self contained.
If you need to use globals carrier::crateing your
simulation function will achieve the same thing.
4. The Convergence Function (converged_fn)
The convergence function (converged_fn) determines when
the iterative ABC-SMC or ABC-Adaptive algorithm should stop. It
evaluates the change in the posterior approximation between waves.
Inputs to converged_fn
The converged_fn is called internally by
abc_smc and abc_adaptive at the end of each
wave. It receives three arguments:
-
wave: The wave number as an integer -
summary: A single-row data frame containing summary metrics aggregated across all parameters for the current wave. Common columns include:-
abs_distance: The absolute tolerance threshold (epsilon) for the current wave. -
abs_distance_redn: The absolute reduction inepsiloncompared to the previous wave (epsilon_{prev} - epsilon_{current}). -
rel_distance_redn: The relative reduction inepsiloncompared to the previous wave ((epsilon_{prev} - epsilon_{current}) / epsilon_{prev}). -
ESS: The Effective Sample Size of the current wave’s particle set. - Other aggregated metrics might be present depending on the algorithm’s implementation.
-
-
per_param: A single-row data frame containing metrics calculated for each individual parameter in the current wave, compared to the previous wave. Common columns include:-
param: The name of the parameter (e.g., “norm_mean”). -
IQR_95_redn: The reduction in the 95% credible interval (IQR) range (q0.975 - q0.025) for this parameter compared to the previous wave (IQR_{prev} - IQR_{current}). -
abs_variance_redn: The absolute reduction in the posterior variance for this parameter compared to the previous wave (var_{prev} - var_{current}). -
rel_mean_change: The relative change in the posterior mean for this parameter compared to the previous wave (|mean_{current} - mean_{prev}| / |mean_{prev}|). - This data frame has one row for each parameter being estimated.
-
Output of converged_fn
-
Logical: The function must return a single
TRUEorFALSE.-
TRUE: Signals that the algorithm has converged, and the iterative process stops. -
FALSE: Signals that the algorithm should continue to the next wave.
-
The default_termination_fn
tidyabc provides a default_termination_fn()
which implements a common convergence strategy:
- It checks if the
rel_mean_changefor all parameters is below a specifiedstabilitythreshold. -
AND it checks if the
IQR_95_rednfor all parameters is below a specifiedconfidencethreshold. - If both conditions are met, it returns
TRUE, indicating convergence.
# Example of using default_termination_fn with custom thresholds
default_converged_fn = default_termination_fn(stability = 0.01, confidence = 0.1)
# This means: converge if mean changes are < 1% and 95% CI shrinks by < 0.1 units for all params.
# If you have 2 parameters `norm_mean` and `norm_sd` you can specify different
# convergence criteria by nameing the stability and confidence inputs:
default_converged_fn2 = default_termination_fn(
stability = c(norm_mean = 0.01, norm_sd = 0.1),
confidence = c(norm_mean = 0.01, norm_sd = 0.1)
)
# this would accept more variation in the `norm_sd` estimate compared to that
# of the `norm_mean`5. Writing a Custom Convergence Function
Writing a custom convergence function allows you to tailor the stopping criteria to your specific needs.
Structure
A custom convergence function must adhere to the input signature:
function(wave, summary, per_param). It processes the data
frames provided and returns a single TRUE or
FALSE.
Example: Convergence Based on ESS and Distance Reduction
This example defines convergence as achieving a high Effective Sample Size (ESS) and a minimal reduction in the absolute distance threshold.
# Define a custom convergence function
custom_converged_fn = function(wave, summary, per_param) {
# Extract values from the summary data frame (it's a single row)
current_ess = summary$ESS
abs_dist_redn = summary$abs_distance_redn # Absolute reduction in epsilon
# Define criteria
min_ess_threshold = 500
max_dist_reduction_threshold = 0.001 # Stop if epsilon isn't reducing much
# Check conditions
ess_condition = current_ess >= min_ess_threshold
dist_condition = abs_dist_redn <= max_dist_reduction_threshold
# Return TRUE only if BOTH conditions are met
return(ess_condition && dist_condition)
}
# Example usage (conceptual - requires obsdata, priors, etc. defined)
# smc_result = abc_smc(
# obsdata = test_obsdata,
# priors_list = test_priors,
# sim_fn = example_sim_fn,
# scorer_fn = example_scorer_fn,
# n_sims = 1000,
# acceptance_rate = 0.5,
# converged_fn = custom_converged_fn # Use the custom function
# )Example: Convergence Based on Parameter Variance
This example focuses on the stability of parameter uncertainty, converging when the absolute variance reduction for all parameters is small.
# Define a custom convergence function based on variance reduction
custom_converged_fn_variance = function(wave, summary, per_param) {
# Extract the 'abs_variance_redn' column from the per_param data frame
abs_variance_reductions = per_param$abs_variance_redn
# Define a threshold for variance reduction
variance_reduction_threshold = 0.0001
# Check if ALL absolute variance reductions are below the threshold
all_stable <- all(abs_variance_reductions < variance_reduction_threshold)
return(all_stable)
}
# Example usage (conceptual)
# smc_result = abc_smc(
# obsdata = test_obsdata,
# priors_list = test_priors,
# sim_fn = example_sim_fn,
# scorer_fn = example_scorer_fn,
# n_sims = 1000,
# acceptance_rate = 0.5,
# converged_fn = custom_converged_fn_variance
# )Key Considerations for Custom Functions
-
Data Frame Access: Remember that
summaryandper_paramare data frames. Access columns using$(e.g.,summary$ESS,per_param$rel_mean_change). -
Aggregation: The
per_paramdata frame has one row per parameter. Use functions likeall(),any(), ormax()on the relevant column to combine information across parameters for a single logical output. - Robustness: Consider edge cases, like what happens in the very first wave if previous values are not available (though the algorithm typically handles initialization).
- Monitoring: Convergence criteria are crucial for computational efficiency and result quality. Choose metrics that reflect the stability you require for your specific inference problem.
-
browser(): You can get a better feel for the statistics available by running an interactive browser from within a convergence function while testing. —
Conclusion
Understanding the precise structure and purpose of
sim_fn, scorer_fn, and
converged_fn is important for effectively using
tidyabc. The simulation function generates data, the
scoring function quantifies similarity, and the convergence function
determines when the iterative process halts. By carefully specifying
these functions, you can tailor the ABC workflow to your model and
inference goals.