Populate InferenceData with out-of-sample prediction results from PyMC3 predictive samples #983

Merged · 31 commits · Jan 19, 2020

Conversation

rpgoldman (Contributor)

In this approach, unlike the one outlined in the schema, the constant_data group contains the constant data used to generate the predictions, not the constant data used to generate the posterior trace.
This could be modified, but I made this decision because the posterior trace used to generate the predictions in general CANNOT be the same as the posterior trace created by pymc3.sample() -- variables whose shape depends on the shape of the constant data or the observations must be removed.
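
For context, a rough sketch of the out-of-sample workflow this targets (toy data; the function name and signature are the ones under discussion in this PR, later renamed from_pymc3_predictions, so treat this as illustrative):

```python
import numpy as np
import pymc3 as pm

# toy data
x_train = np.random.randn(100)
y_train = 2.0 * x_train + np.random.randn(100)
x_new = np.random.randn(25)

with pm.Model() as model:
    x = pm.Data("x", x_train)
    y = pm.Data("y", y_train)
    beta = pm.Normal("beta", 0.0, 10.0)
    pm.Normal("obs", beta * x, 1.0, observed=y)
    trace = pm.sample(1000, chains=2)

with model:
    # swap in the new predictors; the data containers change shape, which is
    # why shape-dependent variables cannot simply be reused from the
    # original posterior trace
    pm.set_data({"x": x_new, "y": np.zeros_like(x_new)})
    predictions = pm.sample_posterior_predictive(trace)

# the converter added in this PR
idata = predictions_from_pymc3(predictions, posterior_trace=trace, model=model)
```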

@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch 2 times, most recently from 7604ace to 6dd4c80 Compare December 31, 2019 02:25
@rpgoldman (Contributor Author)

A problem I am having with this is the division of variables into predictions_constant_data versus predictions in PyMC3.

Currently we try to guess what is constant data by checking whether each variable appears in the predictive trace (not constant), in the posterior trace, or in the observations. If it appears in none of these, we assume it is constant data.

This really does not work, because the user, in PyMC3, can specify a set of variables of interest to be in the predictive trace (this defaults to the set of observed random variables). So membership in (or absence from) the trace is not a good way to determine anything about a random variable.

This means that if the user is not interested in a random variable and omits it from the predictive trace, constant_data_to_xarray will misinterpret the omitted variable as constant data.
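
Roughly, the current guess amounts to something like this (a hypothetical sketch, not the literal implementation):

```python
def _looks_like_constant_data(name, predictive_trace, posterior_trace, observations):
    """Sketch of the current heuristic: anything not in the predictive trace,
    the posterior trace, or the observations is assumed to be constant data."""
    if name in predictive_trace:
        return False  # sampled predictively, so not constant
    if posterior_trace is not None and name in posterior_trace.varnames:
        return False  # part of the posterior, so not constant
    if name in observations:
        return False  # observed data, so not constant
    return True  # fallback: misclassifies variables the user simply omitted
```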

I really think it would be better to simply tell the translator what is constant data, but I don't have a great plan for how to do this. So if anyone does, please LMK.

@rpgoldman rpgoldman changed the title WIP: Populate InferenceData with out-of-sample prediction results WIP: Populate InferenceData with out-of-sample prediction results from PyMC3 predictive samples Jan 3, 2020
@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch from dc99580 to bcc44e5 Compare January 7, 2020 01:29
@rpgoldman (Contributor Author)

Huge help from Brandon Willard of PyMC3 and Symbolic PyMC3 enabled me to more accurately determine which variables belong in constant_data or predictions_constant_data. I believe this is close to being ready to merge, if reviewed.

@rpgoldman (Contributor Author)

This won't pass tests until PyMC3 is fixed. See pymc-devs/pymc#3763

@OriolAbril (Member) left a comment

I really like all the improvements introduced here, from predictions to better constant data support. This is awesome work.

In addition to the comments below, I have one extra comment on the API. In my personal case (which I think is quite common) I generate only one set of predictions per model, and I generate it without thinning; in that case it makes sense to store all quantities in a single InferenceData object. Therefore I think it would be great to add an idata or idata_orig optional argument so that predictions_from_pymc3 adds the new inference data (containing only predictions and predictions constant data) to the object passed as idata (this is done in my PR, so it should not be much work); see the sketch after the list below.

The logic would be:

  • idata_orig=None: return a new idata object (which may have a thinned posterior)
  • idata_orig!=None: return the original idata object concatenated with the predictions idata
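
Roughly, the proposed call would look like this (a sketch only; the argument name is still open):

```python
# fit and first conversion as usual
idata = az.from_pymc3(trace=trace)

# later, attach the out-of-sample results to the same object
idata = predictions_from_pymc3(
    predictions, posterior_trace=trace, model=model, idata_orig=idata
)
# idata now also carries the predictions and predictions_constant_data groups
```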

Comment on lines +81 to +94
if predictions is not None:
    get_from = predictions
elif prior is not None:
    get_from = prior
elif posterior_predictive is not None:
    get_from = posterior_predictive
OriolAbril (Member):

I would keep the priority order with posterior predictive before prior: predictions, posterior_predictive, prior.

This is because the default in pm.sample_posterior_predictive is samples=None, which ends up with ndraws samples; for pm.sample_prior_predictive, however, the default is samples=500, which will generally not be equal to ndraws.
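
That is, the branch order would become:

```python
if predictions is not None:
    get_from = predictions
elif posterior_predictive is not None:
    get_from = posterior_predictive
elif prior is not None:
    get_from = prior
```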

rpgoldman (Contributor Author):

I have a tentative version of the above and will push it momentarily. Note that I have augmented the orig_idata by hand instead of using concat. LMK if you think concat would be better.

    *,
    prior=None,
    posterior_predictive=None,
    predictions=None,
OriolAbril (Member):

I think having a predictions argument here will be quite confusing.

I would only use predictions argument in PyMC3Converter class. Therefore it will not be visible to users in the from_pymc3 docs but it will still be available for internal usage in predictions_from_pymc3 (which would then use the class instead of from_pymc3, the code is basically the same in both cases).
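
Something along these lines (a hypothetical sketch; the exact converter arguments may differ):

```python
def predictions_from_pymc3(predictions, posterior_trace=None, model=None, coords=None, dims=None):
    # build the converter directly, so `predictions` never appears in the
    # public from_pymc3 signature
    converter = PyMC3Converter(
        trace=posterior_trace,
        predictions=predictions,
        model=model,
        coords=coords,
        dims=dims,
    )
    return converter.to_inference_data()
```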



def predictions_from_pymc3(
    predictions, posterior_trace, model, coords=None, dims=None
OriolAbril (Member):

I think that posterior_trace and model should default to None.

I think that the most general use case will be calling this from inside the model context and without any thinning, and in this case, the model can be obtained from the context and the trace is not really needed.
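
So the common case would look something like this (sketch):

```python
with model:
    predictions = pm.sample_posterior_predictive(trace)
    # model is taken from the context and, with no thinning,
    # the posterior trace does not need to be passed either
    idata = predictions_from_pymc3(predictions)
```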

rpgoldman (Contributor Author):

Done.

@rpgoldman (Contributor Author)

@OriolAbril Could you LMK how you see this lining up with your work on #794?

Is #794 going to be merged soon, so that I can rebase on top of it? Should this MR branch off of yours (I don't like that idea, because I think it will be error-prone)? Or do you think we could come up with some subset of #794 that could be merged to master that would be enough to harmonize my work with yours, but still allow you to keep working on yours separately?

@OriolAbril (Member)

I think both need to be thoroughly reviewed, and #794 has been waiting for quite a while, so it can wait until this has been merged.

I don't mind rebasing on top of this once merged, and I actually think that they will merge quite well, here there are nearly no modifications to sample stats/likelihood handling. To be extra sure though, are you planning on adding some tests here (once the functionality is decided) or in a future PR? We could also wait for the tests PR if it is not too long.

Does this sound good @rpgoldman ?

@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch from bd4980f to 2806557 Compare January 15, 2020 19:39
@rpgoldman (Contributor Author)

I'm not sure exactly how to improve the tests. In particular:

  1. We test on python 3.5, but I believe PyMC3 no longer supports anything before 3.6. Should we drop the io_pymc3 tests on 3.5?
  2. Here's a test failure that requires a bug-fix on PyMC3:
=================================== FAILURES ===================================
_________ TestDataPyMC3.test_multiple_observed_rv_without_observations _________

self = <arviz.tests.test_data_pymc.TestDataPyMC3 object at 0x7f034d8d2198>

    def test_multiple_observed_rv_without_observations(self):
        with pm.Model():
            mu = pm.Normal("mu")
            x = pm.DensityDist(  # pylint: disable=unused-variable
                "x", pm.Normal.dist(mu, 1.0).logp, observed={"value": 0.1}
            )
>           trace = pm.sample(100, chains=2)

arviz/tests/test_data_pymc.py:128: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/local/envs/testenv_3.5_PYSTAN_latest_PYRO_latest_EMCEE_latest_TF_latest/lib/python3.5/site-packages/pymc3/sampling.py:498: in sample
    trace.report._run_convergence_checks(trace, model)
/usr/local/envs/testenv_3.5_PYSTAN_latest_PYRO_latest_EMCEE_latest_TF_latest/lib/python3.5/site-packages/pymc3/backends/report.py:84: in _run_convergence_checks
    self._ess = ess = ess(trace, var_names=varnames)
/usr/local/envs/testenv_3.5_PYSTAN_latest_PYRO_latest_EMCEE_latest_TF_latest/lib/python3.5/site-packages/pymc3/stats/__init__.py:24: in wrapped
    return func(*args, **kwargs)
arviz/stats/diagnostics.py:187: in ess
    dataset = convert_to_dataset(data, group="posterior")
arviz/data/converters.py:168: in convert_to_dataset
    inference_data = convert_to_inference_data(obj, group=group, coords=coords, dims=dims)
arviz/data/converters.py:89: in convert_to_inference_data
    return from_pymc3(trace=kwargs.pop(group), **kwargs)
arviz/data/io_pymc3.py:346: in from_pymc3
    model=model,
arviz/data/io_pymc3.py:324: in to_inference_data
    id_dict["constant_data"] = self.constant_data_to_xarray()
arviz/data/base.py:36: in wrapped
    return func(cls, *args, **kwargs)
arviz/data/base.py:36: in wrapped
    return func(cls, *args, **kwargs)
arviz/data/io_pymc3.py:277: in constant_data_to_xarray
    if is_data(name, var):
arviz/data/io_pymc3.py:270: in is_data
    return var not in self.model.deterministics and var not in self.model.observed_RVs \
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pymc3.model.MultiObservedRV object at 0x7f038c06cac8>, other = mu

    def __eq__(self, other):
>       return self.id == other.id
E       AttributeError: 'MultiObservedRV' object has no attribute 'id'

This was only fixed on 7 Jan 2020, and is not in the released version of PyMC3.

@rpgoldman (Contributor Author)

@OriolAbril OK, this should have all the changes you requested.

I'll check in with PyMC3 about maybe cutting a bug-fix release.

@OriolAbril (Member) left a comment

This looks really good, thanks!

Minor note: I think the .pyi should be excluded from the repository

Comment on lines 402 to 403
idata_orig.predictions = new_idata.predictions
idata_orig.predictions_constant_data = new_idata.predictions_constant_data
OriolAbril (Member):

I would use concat for two reasons: first, it would allow passing **kwargs on to concat (which already has both inplace and copy arguments); second, assigning the attributes directly does not update the _groups attribute (used to iterate over groups in several cases, such as saving or printing).
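
Roughly, a concat-based version could look like this (a sketch; the helper name is made up):

```python
from arviz import concat

def _merge_predictions(new_idata, idata_orig=None):
    """Attach the predictions groups to an existing InferenceData via concat
    instead of assigning the attributes by hand."""
    if idata_orig is None:
        return new_idata
    concat(idata_orig, new_idata, inplace=True)  # also updates _groups
    return idata_orig
```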

rpgoldman (Contributor Author):

OK, I will change that and push.

@OriolAbril (Member)

> I'll check in with PyMC3 about maybe cutting a bug-fix release.

@canyon289 is the expert, but maybe we can mark the failing test with some fixture on python 3.5 so the CI build passes.
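
e.g. something along these lines (hypothetical; the exact condition and reason would need checking):

```python
import sys
import pytest

@pytest.mark.skipif(
    sys.version_info < (3, 6),
    reason="requires the MultiObservedRV fix that is not in the PyMC3 release used on 3.5",
)
def test_multiple_observed_rv_without_observations(self):
    ...
```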

@rpgoldman (Contributor Author)

@OriolAbril Why do you think the .pyi file should be excluded from the repository? According to the various PEPs about type hinting, this is the right way to provide type hints for a library, and in the case of ArviZ, this stub file "explains" to users of InferenceData how the groups work, since otherwise users of the API cannot be confident that, for example, the posterior attribute is legitimate to use.
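
For illustration, the stub declares the groups roughly like this (an illustrative excerpt, not the literal file contents):

```python
# InferenceData.pyi (illustrative excerpt)
from typing import Optional
from xarray import Dataset

class InferenceData:
    posterior: Optional[Dataset]
    posterior_predictive: Optional[Dataset]
    prior: Optional[Dataset]
    predictions: Optional[Dataset]
    constant_data: Optional[Dataset]
    predictions_constant_data: Optional[Dataset]
```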

@OriolAbril (Member)

> @OriolAbril Why do you think the .pyi file should be excluded from the repository? According to the various PEPs about type hinting, this is the right way to provide type hints for a library, and in the case of ArviZ, this stub file "explains" to users of InferenceData how the groups work, since otherwise users of the API cannot be confident that, for example, the posterior attribute is legitimate to use.

Basically ignorance about type hinting. I think I had never seen a .pyi file before, so my first thought was "this has somehow avoided the gitignore"

@rpgoldman (Contributor Author)

> > @OriolAbril Why do you think the .pyi file should be excluded from the repository? According to the various PEPs about type hinting, this is the right way to provide type hints for a library, and in the case of ArviZ, this stub file "explains" to users of InferenceData how the groups work, since otherwise users of the API cannot be confident that, for example, the posterior attribute is legitimate to use.
>
> Basically ignorance about type hinting. I think I had never seen a .pyi file before, so my first thought was "this has somehow avoided the gitignore"

In that case, I will leave it. If anyone else uses mypy, it might help them, and if they don't it will do no harm.

@rpgoldman (Contributor Author)

By the way, I was not able to run black on io_pymc3.py for some reason. It failed to parse the PyMC3Converter __init__() method. No idea why.

@OriolAbril (Member)

I fetched the code from this PR and managed to run black on it. Maybe you have an old version?

@rpgoldman (Contributor Author)

> I fetched the code from this PR and managed to run black on it. Maybe you have an old version?

Pretty sure not: I updated it after the first failure.

I think you might be able to commit the reformatted file to this MR. If you can figure out how, please do. If you can't, please either email it to me, or message it to me on Slack, and I will.

@OriolAbril (Member)

Managed to push the black formatting to your branch.

@rpgoldman (Contributor Author)

@OriolAbril

  1. I discovered an issue in log_likelihood handling in predictions_from_pymc3(). If we are building InferenceData from PyMC3 out-of-sample predictions, then we may have had to modify the trace to the point where it no longer supports extraction of the log likelihood.

I'm going to assume that it's OK not to have log likelihoods in this case. The reasoning would be that if you have done out of sample predictions that require modification of the trace, then you should have another InferenceData from which the log likelihood can be extracted and investigated. I will modify the tests so that they regard not having log likelihood in this case as being OK, unless you have some idea about how I should recompute them (presumably from the modified model). I think that should be left for a further MR if it's required.

I just wanted to explain why this is happening.

  2. I think for uniformity in naming, I should rename predictions_from_pymc3() to from_pymc3_predictions(), and if you have any ideas for a better name than that, please suggest.

  3. I expect a little more activity to add tests for predictions_constant_data, and when those pass, I hope we can merge (I think I'll compress the history a bit first).

@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch 2 times, most recently from b7438f4 to 925cdbd Compare January 17, 2020 18:25
@OriolAbril (Member)

> 1. I discovered an issue in log_likelihood handling in predictions_from_pymc3(). If we are building InferenceData from PyMC3 out-of-sample predictions, then we may have had to modify the trace to the point where it no longer supports extraction of the log likelihood.

I thought the only constraint is that if there are predictions (hence the model has been modified for out-of-sample posterior predictive sampling), the log likelihood cannot be extracted because it would raise an error when calling theano. Is this right? Are there other exceptions?

> 2. I think for uniformity in naming, I should rename predictions_from_pymc3() to from_pymc3_predictions(), and if you have any ideas for a better name than that, please suggest.

I like both names; I don't have a strong opinion on this. It looks fine as it is.

OriolAbril and others added 9 commits January 17, 2020 13:19
Made a bunch of overloads to capture the behavior of concat() better.
Format improvements, etc.
The order of priority in extracting the model was wrong -- explicit argument or model context must override extraction from the trace.
Return value handling in predictions_from_pymc3() was wrong.
Incorrectly checked for sampling statistics, which are not supported (yet?) in out of sample predictions.
@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch from aa84cd8 to 495e417 Compare January 17, 2020 19:19
@rpgoldman (Contributor Author)

OK, this all works for me, and if you check out PyMC3 at master, it should work for you, too. Unfortunately, right now this is broken against PyMC3 as released, because of a bug in the released version.
I will look into monkey-patching PyMC3 to make it work with the current version, too....

@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch from 6767702 to d81220b Compare January 17, 2020 20:53
@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch from d81220b to c3f5c1f Compare January 17, 2020 22:00
arviz/data/io_pymc3.py
# random variable object ...
Var = Any # pylint: disable=invalid-name

# pylint: disable=line-too-long
OriolAbril (Member):

What is the scope of this disable? (mainly curiosity)

rpgoldman (Contributor Author):

> What is the scope of this disable? (mainly curiosity)

I'm afraid I don't know as well as I should. I'm going to try to leave it out and see what happens. If it's not necessary (after black reformatting), or if it can be more tightly scoped, I will cut or move it.

rpgoldman (Contributor Author):

I cut it out -- I think it was there because my editor didn't look at the right .pylintrc, and was holding me to 80 characters. Passes pylint here without it. Assuming it passes in CI, I think we are good.

@rpgoldman (Contributor Author)

@OriolAbril Assuming that this passes all the tests again, how should we merge it? The git history is cluttered, and I was thinking of, for example, using a rebase to squash together a bunch of the lint and black compliance commits.

How about I wait for this bout of tests to pass, then clean up the history, and then if it's still passing tomorrow with the cleaned up history, you or I can merge it?

@OriolAbril (Member)

All merges in ArviZ are squash merges, so the PR history does not really matter. I can merge afterwards.

@rpgoldman (Contributor Author)

> All merges in ArviZ are squash merges, so the PR history does not really matter. I can merge afterwards.

Oh, great. Then I'll skip the history editing. I will push my changelog entry then, and we're done!

@rpgoldman rpgoldman changed the title WIP: Populate InferenceData with out-of-sample prediction results from PyMC3 predictive samples Populate InferenceData with out-of-sample prediction results from PyMC3 predictive samples Jan 17, 2020
@OriolAbril OriolAbril merged commit 9987acb into arviz-devs:master Jan 19, 2020
@rpgoldman (Contributor Author)

@OriolAbril Thanks!

@rpgoldman rpgoldman deleted the feature/predictive-constant branch January 20, 2020 02:51
percygautam pushed a commit to percygautam/arviz that referenced this pull request Jan 21, 2020
Populate InferenceData with out-of-sample prediction results from PyMC3 predictive samples (arviz-devs#983)

Adds from_pymc3_predictions to add the predictions and predictions_constant_data groups to InferenceData objects.

Co-authored-by: Oriol Abril <oriol.abril.pla@gmail.com>