
add_to_inference_data argument for posterior predictive sampling functions #4021

Closed
wants to merge 2 commits into master from predictive-add-to-idata

Conversation

@rpgoldman (Contributor) commented Jul 20, 2020

When completed, this will allow posterior predictive sampling to add the posterior predictive samples directly to the InferenceData object from which it gets its posterior.

  • what are the (breaking) changes that this PR makes?
    This would return an InferenceData object, by default, when it is invoked with an InferenceData object as argument.
  • important background, or details about the implementation
    The idea is to save users from having to figure out how to insert the samples into the right InferenceData groups themselves (see the usage sketch after this checklist).
  • are the changes—especially new features—covered by tests and docstrings?
  • consider adding/updating relevant example notebooks
  • right before it's ready to merge, mention the PR in the RELEASE-NOTES.md
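
A rough sketch of the intended usage; the add_to_inference_data keyword is the one proposed by this PR and its exact semantics are still under discussion, while everything else is ordinary PyMC3:

import numpy as np
import pymc3 as pm

observed = np.random.normal(1.0, 0.5, size=100)

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.0)
    pm.Normal("y", mu=mu, sigma=0.5, observed=observed)
    idata = pm.sample(1000, return_inferencedata=True)

    # Proposed behaviour: when the trace argument is an InferenceData object and
    # add_to_inference_data=True, the posterior predictive samples are inserted
    # into the appropriate group and the updated InferenceData is returned,
    # instead of a plain dict of arrays.
    idata = pm.sample_posterior_predictive(idata, add_to_inference_data=True)

print(idata.posterior_predictive["y"].shape)  # roughly (chains, draws, 100)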

@kyleabeauchamp (Contributor)

Have you thought through the following particular use case?

  1. In model sampling, the observed training data x and y are set via a pm.Data object
  2. You want to use PPC sampling for test data, which will later (e.g., after model training) be set via pm.set_data. This new data will contain a different number of examples as compared to the training data in (1).
  3. For various analyses, done in arviz, you may want to run the same or similar analyses on the training data (1) or the test data (2).

The only reason I bring this up is that for my application, I ended up using two separate arviz objects because the dimensions were different due to the different number of examples in the training and test+PPC cases.

@kyleabeauchamp (Contributor)

I can possibly try this branch with my existing code to see if I have any more "concrete" feedback.

@rpgoldman (Contributor, Author)

I'm wondering how add_to_inference_data lines up with return_inference_data. I think one could populate an InferenceData object and then return it merged with the original argument, or just return a standalone InferenceData object. These are two different options (the first more for model criticism, the latter for prediction proper; see the sketch below), so that's an argument for having both. On the other hand, I hesitate to add two such wordy keywords! I suppose that return_inference_data might wither away, though, if we move towards uniformly returning InferenceData across the whole library.
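
Continuing from the sketch in the PR description, the two behaviours might look roughly like this (both keyword names are proposals at this point, not existing API):

# Option 1: merge the new samples into the InferenceData that supplied the posterior
# (useful for model criticism / posterior predictive checks)
idata = pm.sample_posterior_predictive(idata, add_to_inference_data=True)

# Option 2: return a fresh InferenceData holding only the new samples
# (useful for prediction proper, leaving the original object untouched)
ppc_idata = pm.sample_posterior_predictive(idata, return_inference_data=True)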

@rpgoldman (Contributor, Author)

The test failure looks like a false positive coming from Matplotlib, since master is also failing.

@rpgoldman (Contributor, Author)

Have you thought through the following particular use case?

1. In model sampling, the observed training data x and y are set via a `pm.Data` object

2. You want to use PPC sampling for test data, which will later (e.g., after model training) be set via `pm.set_data`.  This new data will contain a different number of examples as compared to the training data in (1).

3. For various analyses, done in arviz, you may want to run the same or similar analyses on the training data (1) or the test data (2).

The only reason I bring this up is that for my application, I ended up using two separate arviz objects because the dimensions were different due to the different number of examples in the training and test+PPC cases.

As far as I know, it's not possible to follow this workflow: you can't change the shape of variables created with pm.Data(), because the shapes are compiled into the model and changing them breaks it.

To do that kind of out-of-sample prediction, you need to create a new model, rather than changing the old one, and you need to use from_pymc3_predictions instead of from_pymc3 (sketched below).
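
A minimal sketch of that pattern; the toy model and variable names are just for illustration, and az.from_pymc3_predictions (with its idata_orig argument) is the ArviZ converter I mean, though its exact signature may differ across versions:

import numpy as np
import pymc3 as pm
import arviz as az

x_train = np.random.normal(size=50)
y_train = 2.0 * x_train + np.random.normal(size=50)
x_test = np.random.normal(size=20)

with pm.Model() as train_model:
    a = pm.Normal("a", 0.0, 1.0)
    b = pm.Normal("b", 0.0, 1.0)
    pm.Normal("y", mu=a + b * x_train, sigma=1.0, observed=y_train)
    trace = pm.sample(1000, tune=500, return_inferencedata=False)
    idata = az.from_pymc3(trace)

# Build a *new* model with the test-set shapes instead of mutating the old one
with pm.Model() as pred_model:
    a = pm.Normal("a", 0.0, 1.0)
    b = pm.Normal("b", 0.0, 1.0)
    pm.Normal("y", mu=a + b * x_test, sigma=1.0, shape=len(x_test))
    predictions = pm.sample_posterior_predictive(trace, var_names=["y"])

# Attach the out-of-sample predictions to the original InferenceData
idata = az.from_pymc3_predictions(predictions, idata_orig=idata, model=pred_model, inplace=False)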

@kyleabeauchamp (Contributor)

Are you 100% sure about the shape limitation? I was able to implement a working version of this workflow already, including changing the number of observations.

@rpgoldman (Contributor, Author)

Are you 100% sure about the shape limitation? I was able to implement a working version of this workflow already, including changing the number of observations.

Pretty sure. There have been multiple issues posted about it; see #3640 and #3631.

Do you have a working model you can share?

@kyleabeauchamp (Contributor)

Sure, give me a couple of days and I can post an example.

@rpgoldman rpgoldman force-pushed the predictive-add-to-idata branch from eff0fac to 862ff2e on July 20, 2020 20:42
codecov bot commented Jul 20, 2020

Codecov Report

Merging #4021 (f901c9f) into master (bf734fc) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4021   +/-   ##
=======================================
  Coverage   88.95%   88.95%           
=======================================
  Files          92       92           
  Lines       14806    14817   +11     
=======================================
+ Hits        13170    13181   +11     
  Misses       1636     1636           
Impacted Files Coverage Δ
pymc3/distributions/posterior_predictive.py 89.61% <100.00%> (+0.32%) ⬆️

@kyleabeauchamp (Contributor)

Here's the example you requested, which I think is partially relevant to this PR:

#!/usr/bin/env python
import pymc3 as pm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
import arviz as az


x, y = load_boston(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

coords = {"features": range(x_train.shape[1]), "examples": range(len(x_train))}

n_samples = 2000
chains = 2
tune = 500
sd_prior = 0.5
sd_observation = 0.5

with pm.Model(coords=coords) as model:
    x = pm.Data('x', x_train, dims=("examples", "features"))
    y = pm.Data('y', y_train, dims="examples")
    intercept = pm.Uniform('intercept', -100, 100)
    w = pm.Normal('w', mu=0.0, sd=sd_prior, dims="features")

    mu = pm.Deterministic("mu", intercept + pm.math.dot(x, w), dims="examples")
    y_obs = pm.Normal('y_obs', mu, sd_observation, observed=y, dims="examples")
    tr = pm.sample(n_samples, chains=chains, return_inferencedata=True, tune=tune)


# Hold-out coordinates/dims for packaging the test-set PPC into its own InferenceData
example_holdout_dim = [f"H_{i}" for i in range(len(y_test))]
ppc_coords = {"chain": tr.posterior.coords["chain"], "draw": tr.posterior.coords["draw"],
              "example_holdout": example_holdout_dim, "features": tr.posterior.coords["features"]}
ppc_dims = {"mu": ["chain", "draw", "example_holdout"], "x": ["example_holdout", "features"], "y": ["example_holdout"]}

observed_data = {'x': x_test, "y": y_test}

with model:
    pm.set_data(observed_data)  # swap in the hold-out data
    ppc = pm.sample_posterior_predictive(tr.posterior, var_names=["mu"], keep_size=True)

ppc = az.from_dict(posterior_predictive=ppc, observed_data=observed_data, coords=ppc_coords, dims=ppc_dims)


print(x_train.shape)                        # (379, 13)
print(x_test.shape)                         # (127, 13)
print(tr.posterior.mu.shape)                # (2, 2000, 379)
print(ppc.posterior_predictive.mu.shape)    # (2, 2000, 127)

@twiecki twiecki marked this pull request as draft July 27, 2020 09:44
@twiecki twiecki changed the title WIP: add_to_inference_data argument for posterior predictive sampling functions add_to_inference_data argument for posterior predictive sampling functions Jul 27, 2020
@michaelosthege michaelosthege added this to the 3.10 milestone Nov 14, 2020
@michaelosthege (Member)

If I understand the conversation above correctly, @kyleabeauchamp and @rpgoldman were talking about a corner case where users could change the shape of pm.Data variables before running sample_posterior_predictive and how that could lead to shape problems?

If that does happen, users could still fall back to not using the new feature this PR is about, right?
Therefore, I would like to suggest merging this PR. It has tests, it's a very useful feature and it would be great to get this into 3.10.

@rpgoldman (Contributor, Author)

If I understand the conversation above correctly, @kyleabeauchamp and @rpgoldman were talking about a corner case where users could change the shape of pm.Data variables before running sample_posterior_predictive and how that could lead to shape problems?

If that does happen, users could still fall back to not using the new feature this PR is about, right?
Therefore, I would like to suggest merging this PR. It has tests, it's a very useful feature and it would be great to get this into 3.10.

I'd be a lot happier with that alternative if it didn't potentially lead to corrupting the contents of the inference data object in ways that the user might find very confusing.

If it were possible to detect and raise an error in cases where bad things would happen, then I would agree with you.

Is there some way we could check and see that the dimensions of the posterior predictive trace are consistent with the other information in the inference data, and raise an error if not?
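
Something like the following is what I have in mind; this is purely a sketch, the helper name is made up, and the exact groups and dimensions to compare would need more thought:

def check_ppc_shapes(idata, ppc_dict):
    """Raise if posterior predictive arrays disagree with idata's observed_data shapes."""
    for name, samples in ppc_dict.items():
        if name not in idata.observed_data:
            continue  # nothing to compare against
        observed_shape = tuple(idata.observed_data[name].shape)
        # with keep_size=True the samples are (chains, draws, *observed_shape)
        if tuple(samples.shape[2:]) != observed_shape:
            raise ValueError(
                f"Posterior predictive for '{name}' has trailing shape "
                f"{samples.shape[2:]}, but observed_data['{name}'] has shape "
                f"{observed_shape}; the pm.Data shapes probably changed between "
                "sampling and posterior predictive sampling."
            )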

To be honest, this is yet another knock-on problem from the pm.Data feature in PyMC3, which is (in my opinion) fundamentally broken.

I don't agree that users could fall back to not using the new feature, because there is no way for them to understand what has gone wrong, or when the feature is/is not appropriate to use. The large number of pm.Data related errors in PyMC3 issues, many with exceedingly obscure error messages, should be a warning to us.

@Spaak (Member) commented Nov 23, 2020

@rpgoldman @michaelosthege @kyleabeauchamp I'm lining up PRs for release 3.10 (for which this one is marked right now). I understand there's still some discussion on whether it should go in or not? It would be great if you could give an update on the status here. (I must admit I don't immediately understand all the subtleties.)

@rpgoldman (Contributor, Author)

I gave my feedback a few days ago. Waiting for response. Key question: can we detect when add_to_inference_data will fail and provide a reasonable error message? It will fail when the shapes of variables change between posterior sampling and posterior predictive sampling. Since this is an attribute of the model, rather than the trace proper (although it might be detectable without the model?), it's a bit tricky.

One other thing -- is there a reason why add_to_inference_data can only be used when the trace is read out of an inference data object? Seems like this argument could be Union[bool, InferenceData], and we could let the user choose which inference data to add to (rough signature below). But if we abandon the use of MultiTrace, that might not be so important.
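
Roughly the signature I'm imagining; just a sketch, none of this exists yet:

from typing import Union
from arviz import InferenceData

def sample_posterior_predictive(
    trace,
    *,
    add_to_inference_data: Union[bool, InferenceData] = False,
    **kwargs,
):
    # True          -> merge the samples into `trace` (which must already be InferenceData)
    # InferenceData -> merge the samples into that object instead
    # False         -> current behaviour: return a dict of arrays
    ...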

@Spaak (Member) commented Nov 27, 2020

Thanks for the explanation. @michaelosthege any thoughts on the above? It would be good to make a decision on whether this should go in 3.10 or not.

@michaelosthege (Member)

Hm..

  1. I would keep the behaviour such that add_to_inference_data can only be used if the input is InferenceData already. The alternative would be doing that conversion internally, but if that works without problems, the user should have passed pm.sample(return_inferencedata=True) to begin with. This way we continue to nudge people away from MultiTrace and that's good.
  2. If the corner case of shape-changes is a concern, I would suggest the following:
  • write a test case that changes the RV shapes on purpose, triggering the problem (roughly as sketched below)
  • mark that test as xfail and open an issue to fix later
  • link that issue from the docstring

This way we can push this PR over the 3.10 finish line.
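
Something along these lines, as a rough sketch; the test name, the toy model, and the add_to_inference_data keyword semantics are all placeholders:

import numpy as np
import pytest
import pymc3 as pm

@pytest.mark.xfail(reason="shape changes via pm.set_data are not handled yet; see follow-up issue")
def test_add_to_inference_data_after_shape_change():
    with pm.Model() as model:
        x = pm.Data("x", np.zeros(10))
        mu = pm.Normal("mu", 0.0, 1.0)
        pm.Normal("y", mu=mu + x, sigma=1.0, observed=np.zeros(10))
        idata = pm.sample(100, tune=100, chains=2, return_inferencedata=True)

        # Change the number of observations on purpose
        pm.set_data({"x": np.zeros(7)})
        result = pm.sample_posterior_predictive(idata, add_to_inference_data=True)

    # The merged InferenceData should either be consistent or the call should raise
    assert result.posterior_predictive["y"].shape[-1] == 7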

@rpgoldman (Contributor, Author)

@michaelosthege

I agree with your first point, but with respect to your second, I'm still reluctant to release a feature that could lead to further confusion for people doing out-of-sample prediction with PyMC3, or using pm.Data in ways that seem reasonable but are unsupported.

The real answer is probably to have a document giving a recipe for doing out-of-sample prediction, but I won't have an opportunity to draft such a thing until at least the new year.

@michaelosthege (Member)

@rpgoldman @Spaak then shall we defer this PR to after 3.10? After all, there's some chance that we can fix some of these shape issues with aesara, or at least make it easier to check the conditions.

@rpgoldman (Contributor, Author)

@michaelosthege @Spaak I could go either way. My big question is the following:

Will this make things more difficult for people in the corner cases?

and I don't know the answer. If this is only going to fail in cases where PyMC3 is already failing, then this is no worse, and we might as well let it go. If you have a feeling about this, I defer to your decision on whether this should be merged now or later. Since I don't fully understand the implications, and don't have time to experiment myself, I am inclined to be cautious. But I may be over-cautious.

@Spaak (Member) commented Nov 30, 2020

@michaelosthege @rpgoldman My take would be: only include in the release stuff we're reasonably sure about; so, let's leave it until after 3.10 indeed. Especially since some of the shape stuff will be touched later on anyway.

@Spaak Spaak removed this from the 3.10 milestone Nov 30, 2020
@michaelosthege (Member)

Just like for a few other open PRs I'm adding the "won't fix" label, because we'll have to re-evaluate this problem after the big v4 change.

@rpgoldman (Contributor, Author)

@michaelosthege You asked about rebasing this for use in v4. I'm afraid that I have been away from PyMC3 for some months, so... what would be the branch I should rebase onto?

@ricardoV94 (Member) commented Jun 13, 2021

V4 has been merged into main now. You should rebase onto it.

@rpgoldman (Contributor, Author) commented Jun 13, 2021

Looks like the directory structure has been changed radically and the file distributions/posterior_predictive.py, where the greatest part of the changes lie, has been removed, which makes rebasing quite difficult (indeed, I have no idea how to do this).

@ricardoV94 (Member)

I think the fast ppc is gone; only the main ppc and the ppc_w in sampling.py remain (the latter still needs to be refactored).

@rpgoldman (Contributor, Author)

I'm going to close this as unrecoverable: the underlying code has moved too far for this branch to be applied to main (and I think that's probably good: the original posterior predictive samplers were over-complicated and kludgy). We should just try a different way to return InferenceData (somehow taking into account the distinction between posterior predictive sampling for model checking and posterior predictive sampling for prediction).

@rpgoldman rpgoldman closed this Jun 13, 2021