Reorganize the stan example repository #153
I'd like to see efforts to make it as easy as possible for researchers to use the model/data repo and report results. To that end:
Link to SBC paper: http://www.stat.columbia.edu/~gelman/research/unpublished/sbc.pdf
Re: input files: rather than hardcoding synthetic inputs, I think it would be better to deliver a reproducible generator program (i.e., one that gives the same results when run with the same random seed). Among other advantages of a program, we can also ship a canonical small output from the generator, in case someone wants to verify that they aren't experiencing version skew.
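A rough sketch of what such a generator could look like in R (the function, file names, and toy model here are hypothetical, not from the repo):

```r
# Hedged sketch of a seeded data generator for one example model.
# gen_data(), the file names, and the toy model are made up for illustration.
gen_data <- function(seed = 1234, N = 100) {
  set.seed(seed)
  x <- rnorm(N)                            # unmodeled predictor
  y <- rnorm(N, mean = 1 + 2 * x, sd = 1)  # outcome simulated from known parameters
  list(N = N, x = x, y = y)
}

data <- gen_data(seed = 1234)
saveRDS(data, "example_input.rds")         # full input shipped with the model

# Small canonical output so users can check for RNG/version skew:
writeLines(format(round(head(data$y, 5), 6), nsmall = 6),
           "example_canonical_check.txt")
```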
Re: reproducibility: how feasible would it be (maybe some ways down the line) to provide a Kubernetes configuration (with associated Docker files, etc.) that stands up some cloud machines to rerun Stan on all the models, and possibly to run the researcher's method as well? It would be the simplest possible interface: put $ here, and get everything recomputed. That would include, for instance, letting researchers re-assess Stan on different versions of the models or data sets just by locally modifying the benchmark repository. Just an idea to think about; I don't mean to create lots of extra work.
This is not a stan-dev/stan issue. Could you please move this to either (a) example-models or (b) stat_comp_benchmarks, or, if it's preliminary, to (c) design-docs? Or to Discourse for discussion until there's a plan for implementation.
Thanks for writing this up, Yuling. Until the issue gets moved, let me reply:
By what measure? I think we want to look at code coverage of our special functions and of model types. Are there any ODE models in there? Anything with more than a tiny amount of data?
I think we need to carefully distinguish programs and models. The model is a mathematical concept defining a joint density. The program is a Stan program. For a given model, some programs might work and others might not. For example, a centered parameterization might not work, but a non-centered parameterization might be able to sample with no problem. Now we could argue they're different models because they sample different parameters, so it's not exactly clear to me where to draw the line. For instance, the centered parameterization draws from p(alpha, sigma) and the non-centered from p(alpha_std, sigma). These are different joint densities, related through the transform alpha_std = alpha / sigma. Are these different models?
Using an approximation seems risky as a gold-standard evaluation. How reliable is LOO?
Don't we also want standard error (i.e., from ESS) on both mean and posterior sd? That's how @betanalpha is testing in stat_comp_benchmark. I like the idea of testing with actual square error from simulated parameters, as it avoids the approximation of estimating ESS.
> A script testing estimation within standard error in …
Of course. The existing scripts are simple---just input CSV for your draws and get output. But interpreting that output assumes some statistical sophistication in understanding MCMC standard error beyond what we'll be able to impart in doc for this tool.
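For illustration, a minimal sketch of that kind of check (not one of the existing scripts), assuming a vector of posterior draws for a single parameter and a known true mean:

```r
# Minimal sketch (not an existing repo script): test that the estimated
# posterior mean is within a few MCMC standard errors of the known truth.
library(coda)  # for effectiveSize()

check_mean <- function(draws, true_mean, z = 4) {
  ess  <- effectiveSize(draws)      # effective sample size of the draws
  mcse <- sd(draws) / sqrt(ess)     # MCMC standard error of the mean
  abs(mean(draws) - true_mean) < z * mcse
}

# Hypothetical usage with stand-in draws; the same pattern covers the
# posterior sd by applying it to draws of theta^2.
draws <- rnorm(4000)
check_mean(draws, true_mean = 0)
```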
I agree. That was part of the original plan so that we could test scalability. The difficulty is always generating predictors for regressions (or any other unmodeled data). We also need a generator to do SBC, though that would most efficiently live in the model file itself, which complicates the issue of what the models should look like.

> [@axch] How feasible would it be (maybe some ways down the line) to provide a Kubernetes configuration

I think this is critical, as we're going to want to run this during algorithm development. I don't know anything about Kubernetes per se, but we're moving a lot of our testing to AWS with the help of some dev ops consultants. So @seantalts is the one to ask about that, as he's managing the consultants.
Re: Kubernetes - we don't have any plans for such a cluster right now, and we'll run out of EC2 credits this November, but we might still be able to pursue such a thing. My first question is around isolation: are Kubernetes jobs isolated enough to be a reliable benchmarking platform?
> Don't we also want standard error (i.e., from ESS) on both mean and posterior sd? That's how @betanalpha is testing in stat_comp_benchmark. I like the idea of testing with actual square error from simulated parameters as it avoids the approximation of estimating ESS.

Actually, stat_comp_benchmark does not use ESS at all and hence is applicable to any Bayesian computation algorithm. All the benchmark does is compare the estimated component parameter means to the true parameter means, scaled by the true standard deviations. This is an incredibly low bar to pass, but it was all that was needed to demonstrate the failures of ADVI, so I haven't moved forward with more realistic benchmarks yet.
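A hedged sketch of that kind of check in R (not the actual stat_comp_benchmark code; the names and threshold are made up):

```r
# Sketch in the spirit of the check described above: flag parameters whose
# estimated posterior mean is far from the known true mean, in units of the
# true posterior standard deviation.
check_means <- function(est_mean, true_mean, true_sd, threshold = 3) {
  z <- (est_mean - true_mean) / true_sd
  data.frame(parameter = names(true_mean), z = z, ok = abs(z) < threshold)
}

# Hypothetical values for two parameters:
check_means(est_mean  = c(mu = 0.10, tau = 1.90),
            true_mean = c(mu = 0.00, tau = 2.00),
            true_sd   = c(mu = 1.00, tau = 0.50))
```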
In any case, we should be very careful to separate out what the objective of any benchmark might be. stat_comp_benchmark is meant to compare estimated posterior expectation values to the true posterior expectation values. Anything related to predictive accuracy is a completely different use case and should be separated out into a separate project. Similarly, example_models was never designed to be a comprehensive model repository but rather useful examples for users, and there's no reason why that purpose should change.
On Jul 1, 2019, at 10:02 PM, Michael Betancourt wrote:
>> Don't we also want standard error (i.e., from ESS) on both mean and posterior sd? That's how @betanalpha is testing in stat_comp_benchmark. I like the idea of testing with actual square error from simulated parameters as it avoids the approximation of estimating ESS.
> Actually, stat_comp_benchmark does not use ESS at all and hence is applicable to any Bayesian computation algorithm. All the benchmark does is compare the estimated component parameter means to the true parameter means, scaled by the true standard deviations. This is an incredibly low bar to pass, but it was all that was needed to demonstrate the failures of ADVI, so I haven't moved forward with more realistic benchmarks yet.
Right---if we know the truth, we can measure error directly and don't need to go the indirect route of using ESS to estimate it.
Is the test also measuring theta^2 values (as a proxy for variance) or is that on the user?
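(An aside, not from the thread: since Var(theta) = E[theta^2] - (E[theta])^2, checking E[theta] and E[theta^2] against their true values also pins down the posterior sd.)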
> In any case, we should be very careful to separate out what the objective of any benchmark might be. stat_comp_benchmark is meant to compare estimated posterior expectation values to the true posterior expectation values. Anything related to predictive accuracy is a completely different use case and should be separated out into a separate project.
But aren't predictions just different random variables with their own posteriors? I don't see how it's such a different use case. Unless you mean using something like 0/1 loss instead of our usual log loss.
> Similarly, example_models was never designed to be a comprehensive model repository but rather useful examples for users, and there's no reason why that purpose should change.
I agree. I do think we should bring the example models up to best practices. Now whether that means full SBC for all of them, I don't know.
I'd like to build up a test suite the way it's being done in the benchmarks repo, one trusted example at a time.
Say we fix the random seed and generate the input in R (as it is now). Should we worry that R might change its random number generator in the future?
It would be better to transfer this to example-models, which requires permission there. Could you give me write permission there, or help me transfer it?
No, there are no ODE models and no large datasets. Nevertheless, it should still be better than running a few linear regressions? Ideally, we would want better coverage of models, and users could also contribute to the model list.
I agree. It is indeed a combination of model, parametrization, and sampling scheme. Fortunately, we have enough diagnostics to tell whether the Stan sampling result is adequate for a given example model in a given parametrization, and that is all I intended to say on this point.
I think it is the difference between estimation of parameters (given the model) and prediction of future outcomes. As an obvious caveat, a wrong inference in the wrong model might do better than a correct inference.
On Jul 2, 2019, at 12:10 PM, Yuling Yao wrote:
> [@axch] Re: input files: Rather than hardcoding synthetic inputs, I think it would be better to deliver a reproducible generator program (i.e., one that gives the same results when run with the same random seed).
> Say we fix the random seed and generate the input in R (as it is now). Should we worry that R might change its random number generator in the future?
That's why @axch was also suggesting archiving a version of the data.
I think we should generate the input in Stan to the extent possible.
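As an aside (not something decided in the thread), R can also pin its RNG behavior to a given release, which addresses the version-skew worry directly; a sketch:

```r
# Sketch: pin R's RNG behavior so a future R release changing defaults
# (as sample() did in R 3.6.0) does not silently change the generated data.
RNGversion("3.5.0")   # freeze RNG defaults as of R 3.5.0
set.seed(1234)
x <- rnorm(100)

# Archiving the generated data (or at least a fingerprint of it) is still
# worthwhile, as suggested above; the file name here is hypothetical.
writeLines(as.character(round(sum(x), 8)), "input_fingerprint.txt")
```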
> [Bob] This is not a stan-dev/stan issue. Could you please move this to either (a) example-models or (b) stat_comp_benchmarks, or, if it's preliminary, to (c) design-docs? Or to Discourse for discussion until there's a plan for implementation.
> It would be better to transfer this to example-models, which requires permission there. Could you give me write permission there, or help me transfer it?
I just sent you an invite.
> [Bob] By what measure? I think we want to look at code coverage of our special functions and of model types. Are there any ODE models in there? Anything with more than a tiny amount of data?
> No, there are no ODE models and no large datasets. Nevertheless, it should still be better than running a few linear regressions?
Unless it gives us a false sense of security. In some sense, I think it is largely a bunch of well-conditioned, small-scale GLMs that can be fit or almost fit in BUGS.
> Ideally, we would want better coverage of models, and users could also contribute to the model list.
> [Bob] For instance, the centered parameterization draws from p(alpha, sigma) and the non-centered from p(alpha_std, sigma). These are different joint densities, related through the transform alpha_std = alpha / sigma. Are these different models?
> I agree. It is indeed a combination of model, parametrization, and sampling scheme. Fortunately, we have enough diagnostics to tell whether the Stan sampling result is adequate for a given example model in a given parametrization, and that is all I intended to say on this point.
I was just suggesting we try to document all these distinctions clearly.
> [Bob] But aren't predictions just different random variables with their own posteriors? I don't see how it's such a different use case. Unless you mean using something like 0/1 loss instead of our usual log loss.
> I think it is the difference between estimation of parameters (given the model) and prediction of future outcomes.
I don't understand the difference. A prediction is coded as a parameter in Stan (or a generated quantity if it's simple enough and can be factored). Either way it's a random variable we can sample in the posterior conditioned on observed data.
> As an obvious caveat, a wrong inference in the wrong model might do better than a correct inference.
Sure. Just throwing darts might give a good result in some cases, too. But what's the relevance to our example models?
Right, predictions can be viewed as generated quantities of parameters, and parameters can also be interpreted as predictive quantities. But at the practical level, we can always treat Stan sampling as the true posterior distribution of the parameters, and therefore as the benchmark for all approximation methods, whenever Stan sampling passes all diagnostics: there can only be one truth. For prediction, it is unnecessary to match the exact posterior predictive distribution from Stan sampling, as it is not necessarily optimal in the first place. We have seen examples in which an approximation method gives a problematic posterior distribution for the parameters but still gives OK or even better predictions.
> Right, predictions can be viewed as generated quantities of parameters, and parameters can also be interpreted as predictive quantities. But at the practical level, we can always treat Stan sampling as the true posterior distribution of the parameters, and therefore as the benchmark for all approximation methods, whenever Stan sampling passes all diagnostics: there can only be one truth.
Exactly. And this is what we should be evaluating for sampling.
> For prediction, it is unnecessary to match the exact posterior predictive distribution from Stan sampling, as it is not necessarily optimal in the first place. We have seen examples in which an approximation method gives a problematic posterior distribution for the parameters but still gives OK or even better predictions.
I get it. But I think this is something for another repository where we do something like posterior predictive checks for predictive quantities of interest. I'm mainly interested in calibrated probabilistic prediction, but others want to evaluate point predictions or 0/1 loss. For most standard ML evaluations, scikit-learn has a really clean and well-organized set of interfaces. Their cross-validation is particularly thoughtful in accommodating structured data where a literal observation-by-observation application of LOO doesn't make sense.
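For reference, a minimal sketch of the elpd comparison mentioned earlier, using the loo package on placeholder log_lik matrices (the real ones would come from the generated quantities block of each fit):

```r
# Sketch: compare elpd across two methods fit to the same model and data,
# assuming each method produced a log_lik matrix (posterior draws x
# observations). The matrices below are random stand-ins, not real fits.
library(loo)

S <- 4000; N <- 50
log_lik_nuts   <- matrix(dnorm(rnorm(S * N), log = TRUE), S, N)  # placeholder
log_lik_approx <- matrix(dnorm(rnorm(S * N), log = TRUE), S, N)  # placeholder

loo_nuts   <- loo(log_lik_nuts)
loo_approx <- loo(log_lik_approx)
loo_compare(loo_nuts, loo_approx)   # elpd difference with standard error
```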
Summary:
Reorganize the Stan example repository and include predictive metrics, so that it can serve as a collective model set for benchmarking Bayesian computation.
Description:
Just as fake-data simulation can reveal the drawbacks of a given model, fake-model simulation is the ideal way to evaluate an inference algorithm. Unfortunately, it is not uncommon to see research papers use merely a few linear regressions as the benchmark for a new sampling or approximate-inference algorithm, partly because of the temptation of random-seed-hacking or model-hacking, and partly because there is no single large-scale collective model set for Bayesian computation in the first place.
Currently, the Stan example repo contains more than 270 unique models with data. Many of them come from real-world tasks and therefore cover a model space that is more representative of the Bayesian computation problems users actually encounter.
Depending on the purpose, there are various metrics that the user can specify. To name a few:
- elpd of each model + inference combination. For part of the dataset, an independent test set can be simulated; for the rest, we can rely on cross-validation.

To this end, there are a few modifications required:
1. Currently, some of the input data are randomly generated (e.g., …). All inputs should be fixed for the purpose of cross-method comparison.
2. Not all models are runnable with Stan sampling. Some models are designed to illustrate parametrization tricks in Stan (e.g., centered parametrization in hierarchical regression), so we know the Stan sampling result is not exact. It is fine to keep them, but we should manually label those models, since otherwise we will use Stan sampling as the ground truth.
3. Output the log likelihood in the generated quantities block. We can then run cross-validation by calling LOO and compare the expected log predictive density (elpd) across methods.
4. Run Stan sampling on all these models; record the mean and sd of the posterior distribution of all parameters (excluding transformed parameters) and the loo-elpd.
5. SBC requires fake-data simulation, which might be non-trivial to automate (see the sketch after this list).

To start with, I can make a pull request that implements these improvements.
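A minimal sketch of the SBC rank check on a toy conjugate model (the fitting step here is a placeholder for Stan or whatever algorithm is being evaluated):

```r
# Minimal SBC sketch: for each replication, draw theta from the prior,
# simulate data given theta, "fit", and record the rank of the true theta
# among the posterior draws. Under a calibrated sampler the ranks are uniform.
run_sbc <- function(n_reps = 200, n_draws = 100) {
  ranks <- integer(n_reps)
  for (r in seq_len(n_reps)) {
    theta_true <- rnorm(1, 0, 1)               # draw parameter from prior N(0, 1)
    y <- rnorm(10, theta_true, 1)              # simulate data given theta_true
    # Exact conjugate posterior draws stand in for the fitted sampler here:
    post_sd   <- sqrt(1 / (1 + length(y)))
    post_mean <- post_sd^2 * sum(y)
    theta_draws <- rnorm(n_draws, post_mean, post_sd)
    ranks[r] <- sum(theta_draws < theta_true)  # rank of truth among draws
  }
  ranks
}

hist(run_sbc(), breaks = 20, main = "SBC ranks (should be roughly uniform)")
```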
Expected output:
Current Version:
v2.19.1