
Reorganize the stan example repository  #153

Open
@yao-yl

Description


Summary:

Reorganize the stan example repository and include predictive metrics for the purpose of a collective model set which will be used as the Bayesian computation benchmark.

Description:

Just as fake-data simulation can reveal the drawbacks of a given model, fake-model simulation is the ideal way to evaluate an inference algorithm. Unfortunately, it is not uncommon for research papers to use only a few linear regressions as the benchmark when evaluating a new sampling or approximate inference algorithm, partly because of the temptation of random-seed-hacking or model-hacking, and partly because no single large-scale collective model set for Bayesian computation exists in the first place.

Currently, the Stan example repo contains more than 270 unique models with data. Many of them come from real-world tasks and therefore give a more representative sample of the Bayesian computation problems that users encounter in practice.

Depending on the purpose, there are various metrics that the user can specify. To name a few:

  • Treating the Stan sampling result as the ground truth, how far does an approximation method diverge from HMC in the parameter space? If we store the HMC output (mean and sd of all parameters in each model), a developer of a new approximate method can easily compute such a divergence in the parameter space.
  • Predictive capacity. When the approximate inference itself is considered part of the model that implicitly regularizes the posterior, we could also compare the predictive ability, or more specifically the elpd, of each model + inference combination. For some models an independent test set can be simulated; for the others we can rely on cross-validation.
  • Replication of exact inference. As introduced by @seantalts et al. in the simulation-based calibration (SBC) paper, we can use the simulated posterior p-value as a metric for how exact a sampling algorithm is, or at least how unbiased an approximate inference is.
  • These 270 models represent a reasonable share of the model space--the space of high-dimensional distribution functions--so we should be able to quantify and characterize their posterior distributions with geometric metrics such as global/tail curvature. That would enable researchers to better understand when and why their computation methods work or fail. We could also encode other meta-characterizations such as the number of parameters, the sample size, whether the model is hierarchical, and whether it uses a centered parametrization.
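The first metric above can be sketched in a few lines. This is a minimal illustration, not an implementation: the parameter names, the stored HMC summaries, and the approximate draws below are all made up for the example; in practice the reference mean/sd would come from the recorded Stan sampling output.

```python
import statistics

# Hypothetical stored HMC reference summary: parameter -> (mean, sd).
reference = {"mu": (0.12, 0.48), "tau": (3.5, 1.9)}

# Hypothetical draws from an approximate method for the same model.
approx_draws = {"mu": [0.3, 0.1, 0.2, 0.25], "tau": [2.9, 3.1, 3.4, 3.0]}

def z_divergence(reference, draws):
    """Absolute z-score of the approximate posterior mean against the
    stored HMC mean/sd, one value per parameter."""
    scores = {}
    for name, (ref_mean, ref_sd) in reference.items():
        approx_mean = statistics.fmean(draws[name])
        scores[name] = abs(approx_mean - ref_mean) / ref_sd
    return scores

print(z_divergence(reference, approx_draws))
```

A scale-free score like this lets one number summarize "distance from HMC" across models with very different parameter scales; other choices (e.g., Wasserstein distance on the draws) would slot into the same harness.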

To this end, there are a few modifications required:

  1. Currently, some of the input data are randomly generated. All input data should be fixed for the purpose of cross-method comparison.

  2. Not all models are runnable with Stan sampling. Some models are designed to illustrate parametrization tricks in Stan (e.g., centered parametrization in hierarchical regression), so we know the Stan sampling result is not exact. It is fine to keep them, but we should manually label those models, since we will otherwise use Stan sampling as the ground truth.

  3. Output the pointwise log-likelihood in the generated quantities block. We can then run cross-validation by calling loo and compare the expected log predictive density (elpd) across methods.

  4. Run Stan sampling on all of these models and record the mean and sd of the posterior distribution for all parameters (excluding transformed parameters) together with the loo elpd.
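Step 4 amounts to summarizing the posterior draws into a small ground-truth file per model. A minimal sketch, with hypothetical draws and a placeholder elpd value standing in for the actual Stan and loo output:

```python
import json
import statistics

# Hypothetical posterior draws for one model (in practice, the Stan
# sampling output with transformed parameters excluded).
draws = {"mu": [0.3, 0.1, 0.2, 0.25], "tau": [2.9, 3.1, 3.4, 3.0]}
loo_elpd = -412.7  # placeholder for the model's recorded loo elpd

# Record mean and sd per parameter, plus the elpd, as the ground-truth file.
ground_truth = {
    "parameters": {
        name: {"mean": statistics.fmean(x), "sd": statistics.stdev(x)}
        for name, x in draws.items()
    },
    "elpd_loo": loo_elpd,
}

with open("ground_truth.json", "w") as f:
    json.dump(ground_truth, f, indent=2)
```

Keeping the summary in a plain machine-readable format (JSON here, but CSV would work equally well) makes it easy for developers of new methods to load the reference values without rerunning Stan.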

SBC requires fake-data simulation, which might be non-trivial to automate. To start with, I can make a pull request that implements these improvements.
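To make the SBC idea concrete, here is a toy sketch on a conjugate model where the exact posterior is known in closed form, so the "sampler" being checked is exact by construction; everything here (the model, the rank counts) is illustrative, not part of the repo:

```python
import random

random.seed(1)

# Toy model: theta ~ Normal(0, 1), y | theta ~ Normal(theta, 1).
# The exact posterior is Normal(y / 2, sqrt(1/2)); its draws stand in
# for the output of whatever algorithm we want to calibrate.
def sbc_ranks(n_sims=1000, n_draws=99):
    ranks = []
    for _ in range(n_sims):
        theta = random.gauss(0.0, 1.0)            # draw from the prior
        y = random.gauss(theta, 1.0)              # simulate fake data
        post = [random.gauss(y / 2.0, 0.5 ** 0.5) for _ in range(n_draws)]
        ranks.append(sum(d < theta for d in post))  # rank of theta among draws
    return ranks

# For an exact sampler the ranks should be uniform on {0, ..., n_draws}.
ranks = sbc_ranks()
```

A histogram of these ranks should look flat for an exact sampler; systematic bias in an approximate method shows up as a sloped or U-shaped rank histogram. The hard part for the repo is automating the fake-data step for 270 heterogeneous models, which is why it is deferred here.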

Expected output:

  • For each model, there is a fixed (i.e., non-random) input file.
  • The Stan file outputs the pointwise log-likelihood.
  • A separate file records the ground truth from Stan sampling, the elpd, and other meta-characterizations.

Current Version:

v2.19.1
