Populate InferenceData with out-of-sample prediction results from PyMC3 predictive samples #983

Merged · 31 commits · Jan 19, 2020

Conversation

rpgoldman (Contributor)

In this approach, unlike the one outlined in the schema, the constant_data group contains the constant data used to generate the predictions, not the constant data used to generate the posterior trace.
This could be modified, but I made this decision because the posterior trace used to generate the predictions in general CANNOT be the same as the posterior trace created by pymc3.sample() -- variables whose shape depends on the shape of the constant data or the observations must be removed.
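
For context, a rough sketch of the out-of-sample workflow this targets (toy data; the function name and signature are the ones under discussion in this PR, later renamed from_pymc3_predictions, so treat this as illustrative):

```python
import numpy as np
import pymc3 as pm

# toy data
x_train = np.random.randn(100)
y_train = 2.0 * x_train + np.random.randn(100)
x_new = np.random.randn(25)

with pm.Model() as model:
    x = pm.Data("x", x_train)
    y = pm.Data("y", y_train)
    beta = pm.Normal("beta", 0.0, 10.0)
    pm.Normal("obs", beta * x, 1.0, observed=y)
    trace = pm.sample(1000, chains=2)

with model:
    # swap in the new predictors; the data containers change shape, which is
    # why shape-dependent variables cannot simply be reused from the
    # original posterior trace
    pm.set_data({"x": x_new, "y": np.zeros_like(x_new)})
    predictions = pm.sample_posterior_predictive(trace)

# the converter added in this PR
idata = predictions_from_pymc3(predictions, posterior_trace=trace, model=model)
```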

@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch 2 times, most recently from 7604ace to 6dd4c80 Compare December 31, 2019 02:25
@rpgoldman (Contributor Author)

A problem I am having with this is the division of variables into predictions_constant_data versus predictions in PyMC3.

Currently we try to guess what is constant data by checking whether each variable appears in the predictive trace (not constant), in the posterior trace, or in the observations. If it appears in none of these, we assume it is constant data.

This really does not work, because the user, in PyMC3, can specify a set of variables of interest to be in the predictive trace (this defaults to the set of observed random variables). So membership in (or absence from) the trace is not a good way to determine anything about a random variable.

This means that if the user is not interested in a random variable and omits it from the predictive trace, constant_data_to_xarray will misinterpret the omitted variable as constant data.
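
Roughly, the current guess amounts to something like this (a hypothetical sketch, not the literal implementation):

```python
def _looks_like_constant_data(name, predictive_trace, posterior_trace, observations):
    """Sketch of the current heuristic: anything not in the predictive trace,
    the posterior trace, or the observations is assumed to be constant data."""
    if name in predictive_trace:
        return False  # sampled predictively, so not constant
    if posterior_trace is not None and name in posterior_trace.varnames:
        return False  # part of the posterior, so not constant
    if name in observations:
        return False  # observed data, so not constant
    return True  # fallback: misclassifies variables the user simply omitted
```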

I really think it would be better to simply tell the translator what is constant data, but I don't have a great plan for how to do this. So if anyone does, please LMK.

@rpgoldman rpgoldman changed the title WIP: Populate InferenceData with out-of-sample prediction results WIP: Populate InferenceData with out-of-sample prediction results from PyMC3 predictive samples Jan 3, 2020
@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch from dc99580 to bcc44e5 Compare January 7, 2020 01:29
@rpgoldman (Contributor Author)

Huge help from Brandon Willard of PyMC3 and Symbolic PyMC3 enabled me to more accurately determine which variables belong in constant_data or predictions_constant_data. I believe this is close to being ready to merge, if reviewed.

@rpgoldman (Contributor Author)

This won't pass tests until PyMC3 is fixed. See pymc-devs/pymc#3763

@OriolAbril (Member) left a comment

I really like all the improvements introduced here, from predictions to better constant data support. This is awesome work.

In addition to the comments below, I have one extra comment on the API. In my personal case (which I think is quite common) I generate only one set of predictions per model, and I generate it without thinning; in that case it makes sense to store all quantities in a single InferenceData object. Therefore I think it would be great to add an idata or idata_orig optional argument so that predictions_from_pymc3 adds the new inference data (containing only predictions and predictions constant data) to the object passed as idata (this is done in my PR, so it should not be much work); see the sketch after the list below.

The logic would be:

  • idata_orig=None: return a new idata object (which may have a thinned posterior)
  • idata_orig!=None: return the original idata object concatenated with the predictions idata
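
Roughly, the proposed call would look like this (a sketch only; the argument name is still open):

```python
# fit and first conversion as usual
idata = az.from_pymc3(trace=trace)

# later, attach the out-of-sample results to the same object
idata = predictions_from_pymc3(
    predictions, posterior_trace=trace, model=model, idata_orig=idata
)
# idata now also carries the predictions and predictions_constant_data groups
```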

Comment on lines +81 to +94
if predictions is not None:
    get_from = predictions
elif prior is not None:
    get_from = prior
elif posterior_predictive is not None:
    get_from = posterior_predictive
OriolAbril (Member):

I would keep the priority order with posterior predictive before prior: predictions, posterior_predictive, prior.

This is because the default in pm.sample_posterior_predictive is samples=None, which ends up with ndraws samples; for pm.sample_prior_predictive, however, the default is samples=500, which will generally not be equal to ndraws.
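
That is, the branch order would become:

```python
if predictions is not None:
    get_from = predictions
elif posterior_predictive is not None:
    get_from = posterior_predictive
elif prior is not None:
    get_from = prior
```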

rpgoldman (Contributor Author):

I have a tentative version of the above and will push it momentarily. Note that I have augmented the orig_idata by hand instead of using concat. LMK if you think concat would be better.

    *,
    prior=None,
    posterior_predictive=None,
    predictions=None,
OriolAbril (Member):

I think having a predictions argument here will be quite confusing.

I would only use predictions argument in PyMC3Converter class. Therefore it will not be visible to users in the from_pymc3 docs but it will still be available for internal usage in predictions_from_pymc3 (which would then use the class instead of from_pymc3, the code is basically the same in both cases).
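
Something along these lines (a hypothetical sketch; the exact converter arguments may differ):

```python
def predictions_from_pymc3(predictions, posterior_trace=None, model=None, coords=None, dims=None):
    # build the converter directly, so `predictions` never appears in the
    # public from_pymc3 signature
    converter = PyMC3Converter(
        trace=posterior_trace,
        predictions=predictions,
        model=model,
        coords=coords,
        dims=dims,
    )
    return converter.to_inference_data()
```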



def predictions_from_pymc3(
    predictions, posterior_trace, model, coords=None, dims=None
OriolAbril (Member):

I think that posterior_trace and model should default to None.

I think that the most general use case will be calling this from inside the model context and without any thinning, and in this case, the model can be obtained from the context and the trace is not really needed.
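
So the common case would look something like this (sketch):

```python
with model:
    predictions = pm.sample_posterior_predictive(trace)
    # model is taken from the context and, with no thinning,
    # the posterior trace does not need to be passed either
    idata = predictions_from_pymc3(predictions)
```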

rpgoldman (Contributor Author):

Done.

@rpgoldman (Contributor Author)

@OriolAbril Could you LMK how you see this lining up with your work on #794?

Is #794 going to be merged soon, so that I can rebase on top of it? Should this MR branch off of yours (I don't like that idea, because I think it will be error-prone)? Or do you think we could come up with some subset of #794 that could be merged to master that would be enough to harmonize my work with yours, but still allow you to keep working on yours separately?

@OriolAbril (Member)

I think both need to be thoroughly reviewed, and #794 has been waiting for quite a while, so it can wait until this has been merged.

I don't mind rebasing on top of this once merged, and I actually think that they will merge quite well, here there are nearly no modifications to sample stats/likelihood handling. To be extra sure though, are you planning on adding some tests here (once the functionality is decided) or in a future PR? We could also wait for the tests PR if it is not too long.

Does this sound good @rpgoldman ?

@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch from bd4980f to 2806557 Compare January 15, 2020 19:39
@rpgoldman (Contributor Author)

I'm not sure exactly how to improve the tests. In particular:

  1. We test on python 3.5, but I believe PyMC3 no longer supports anything before 3.6. Should we drop the io_pymc3 tests on 3.5?
  2. Here's a test failure that requires a bug-fix on PyMC3:
=================================== FAILURES ===================================
_________ TestDataPyMC3.test_multiple_observed_rv_without_observations _________

self = <arviz.tests.test_data_pymc.TestDataPyMC3 object at 0x7f034d8d2198>

    def test_multiple_observed_rv_without_observations(self):
        with pm.Model():
            mu = pm.Normal("mu")
            x = pm.DensityDist(  # pylint: disable=unused-variable
                "x", pm.Normal.dist(mu, 1.0).logp, observed={"value": 0.1}
            )
>           trace = pm.sample(100, chains=2)

arviz/tests/test_data_pymc.py:128: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/local/envs/testenv_3.5_PYSTAN_latest_PYRO_latest_EMCEE_latest_TF_latest/lib/python3.5/site-packages/pymc3/sampling.py:498: in sample
    trace.report._run_convergence_checks(trace, model)
/usr/local/envs/testenv_3.5_PYSTAN_latest_PYRO_latest_EMCEE_latest_TF_latest/lib/python3.5/site-packages/pymc3/backends/report.py:84: in _run_convergence_checks
    self._ess = ess = ess(trace, var_names=varnames)
/usr/local/envs/testenv_3.5_PYSTAN_latest_PYRO_latest_EMCEE_latest_TF_latest/lib/python3.5/site-packages/pymc3/stats/__init__.py:24: in wrapped
    return func(*args, **kwargs)
arviz/stats/diagnostics.py:187: in ess
    dataset = convert_to_dataset(data, group="posterior")
arviz/data/converters.py:168: in convert_to_dataset
    inference_data = convert_to_inference_data(obj, group=group, coords=coords, dims=dims)
arviz/data/converters.py:89: in convert_to_inference_data
    return from_pymc3(trace=kwargs.pop(group), **kwargs)
arviz/data/io_pymc3.py:346: in from_pymc3
    model=model,
arviz/data/io_pymc3.py:324: in to_inference_data
    id_dict["constant_data"] = self.constant_data_to_xarray()
arviz/data/base.py:36: in wrapped
    return func(cls, *args, **kwargs)
arviz/data/base.py:36: in wrapped
    return func(cls, *args, **kwargs)
arviz/data/io_pymc3.py:277: in constant_data_to_xarray
    if is_data(name, var):
arviz/data/io_pymc3.py:270: in is_data
    return var not in self.model.deterministics and var not in self.model.observed_RVs \
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pymc3.model.MultiObservedRV object at 0x7f038c06cac8>, other = mu

    def __eq__(self, other):
>       return self.id == other.id
E       AttributeError: 'MultiObservedRV' object has no attribute 'id'

This was only fixed on 7 Jan 2020, and is not in the released version of PyMC3.

@rpgoldman (Contributor Author)

@OriolAbril OK, this should have all the changes you requested.

I'll check in with PyMC3 about maybe cutting a bug-fix release.

@OriolAbril (Member) left a comment

This looks really good, thanks!

Minor note: I think the .pyi should be excluded from the repository

Comment on lines 402 to 403
idata_orig.predictions = new_idata.predictions
idata_orig.predictions_constant_data = new_idata.predictions_constant_data
OriolAbril (Member):

I would use concat for two reasons: first, it would allow passing **kwargs on to concat (which already has both inplace and copy arguments); second, assigning the attributes directly does not update the _groups attribute (used to iterate over groups in several cases, such as saving or printing).
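
Roughly, a concat-based version could look like this (a sketch; the helper name is made up):

```python
from arviz import concat

def _merge_predictions(new_idata, idata_orig=None):
    """Attach the predictions groups to an existing InferenceData via concat
    instead of assigning the attributes by hand."""
    if idata_orig is None:
        return new_idata
    concat(idata_orig, new_idata, inplace=True)  # also updates _groups
    return idata_orig
```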

rpgoldman (Contributor Author):

OK, I will change that and push.

@OriolAbril (Member)

> I'll check in with PyMC3 about maybe cutting a bug-fix release.

@canyon289 is the expert, but maybe we can mark the failing test with some fixture on python 3.5 so the CI build passes.
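
e.g. something along these lines (hypothetical; the exact condition and reason would need checking):

```python
import sys
import pytest

@pytest.mark.skipif(
    sys.version_info < (3, 6),
    reason="requires the MultiObservedRV fix that is not in the PyMC3 release used on 3.5",
)
def test_multiple_observed_rv_without_observations(self):
    ...
```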

@rpgoldman (Contributor Author)

@OriolAbril Why do you think the .pyi file should be excluded from the repository? According to the various PEPs about type hinting, this is the right way to provide type hints for a library, and in the case of ArviZ, this stub file "explains" to users of InferenceData how the groups work, since otherwise users of the API cannot be confident that, for example, the posterior attribute is legitimate to use.
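
For illustration, the stub declares the groups roughly like this (an illustrative excerpt, not the literal file contents):

```python
# InferenceData.pyi (illustrative excerpt)
from typing import Optional
from xarray import Dataset

class InferenceData:
    posterior: Optional[Dataset]
    posterior_predictive: Optional[Dataset]
    prior: Optional[Dataset]
    predictions: Optional[Dataset]
    constant_data: Optional[Dataset]
    predictions_constant_data: Optional[Dataset]
```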

@OriolAbril (Member)

> @OriolAbril Why do you think the .pyi file should be excluded from the repository? According to the various PEPs about type hinting, this is the right way to provide type hints for a library, and in the case of ArviZ, this stub file "explains" to users of InferenceData how the groups work, since otherwise users of the API cannot be confident that, for example, the posterior attribute is legitimate to use.

Basically ignorance about type hinting. I think I had never seen a .pyi file before, so my first thought was "this has somehow avoided the gitignore"

@rpgoldman (Contributor Author)

> > @OriolAbril Why do you think the .pyi file should be excluded from the repository? According to the various PEPs about type hinting, this is the right way to provide type hints for a library, and in the case of ArviZ, this stub file "explains" to users of InferenceData how the groups work, since otherwise users of the API cannot be confident that, for example, the posterior attribute is legitimate to use.
>
> Basically ignorance about type hinting. I think I had never seen a .pyi file before, so my first thought was "this has somehow avoided the gitignore"

In that case, I will leave it. If anyone else uses mypy, it might help them, and if they don't it will do no harm.

@rpgoldman (Contributor Author)

By the way, I was not able to run black on io_pymc3.py for some reason. It failed to parse the PyMC3Converter __init__() method. No idea why.

@OriolAbril (Member)

I fetched the code from this PR and managed to run black on it. Maybe you have an old version?

@rpgoldman (Contributor Author)

> I fetched the code from this PR and managed to run black on it. Maybe you have an old version?

Pretty sure not: I updated it after the first failure.

I think you might be able to commit the reformatted file to this MR. If you can figure out how, please do. If you can't, please either email it to me, or message it to me on Slack, and I will.

@OriolAbril (Member)

Managed to push the black formatting to your branch.

@rpgoldman (Contributor Author)

@OriolAbril

  1. I discovered an issue in log_likelihood handling in predictions_from_pymc3(). If we are building InferenceData from PyMC3 out-of-sample predictions, then we may have had to modify the trace to the point where it no longer supports extraction of the log likelihood.

I'm going to assume that it's OK not to have log likelihoods in this case. The reasoning would be that if you have done out of sample predictions that require modification of the trace, then you should have another InferenceData from which the log likelihood can be extracted and investigated. I will modify the tests so that they regard not having log likelihood in this case as being OK, unless you have some idea about how I should recompute them (presumably from the modified model). I think that should be left for a further MR if it's required.

I just wanted to explain why this is happening.

  2. I think for uniformity in naming, I should rename predictions_from_pymc3() to from_pymc3_predictions(), and if you have any ideas for a better name than that, please suggest.

  3. I expect a little more activity to add tests for predictions_constant_data, and when those pass, I hope we can merge (I think I'll compress the history a bit first).

@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch 2 times, most recently from b7438f4 to 925cdbd Compare January 17, 2020 18:25
@OriolAbril (Member)

> 1. I discovered an issue in log_likelihood handling in predictions_from_pymc3(). If we are building InferenceData from PyMC3 out-of-sample predictions, then we may have had to modify the trace to the point where it no longer supports extraction of the log likelihood.

I thought the only constraint is that if there are predictions (hence the model has been modified for out-of-sample posterior predictive sampling), the log likelihood cannot be extracted because it would raise an error when calling theano. Is this right? Are there other exceptions?

> 2. I think for uniformity in naming, I should rename predictions_from_pymc3() to from_pymc3_predictions(), and if you have any ideas for a better name than that, please suggest.

I like both names; I don't have a strong opinion on this. It looks fine as it is.

OriolAbril and others added 9 commits January 17, 2020 13:19
Made a bunch of overloads to capture the behavior of concat() better.
Format improvements, etc.
The order of priority in extracting the model was wrong -- explicit argument or model context must override extraction from the trace.
Return value handling in predictions_from_pymc3() was wrong.
Incorrectly checked for sampling statistics, which are not supported (yet?) in out of sample predictions.
@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch from aa84cd8 to 495e417 Compare January 17, 2020 19:19
@rpgoldman (Contributor Author)

OK, this all works for me, and if you check out PyMC3 at master, it should work for you, too. Unfortunately, right now this is broken against PyMC3 as released, because of a bug in the released version.
I will look into monkey-patching PyMC3 to make it work with the current version, too....

@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch from 6767702 to d81220b Compare January 17, 2020 20:53
@rpgoldman rpgoldman force-pushed the feature/predictive-constant branch from d81220b to c3f5c1f Compare January 17, 2020 22:00
arviz/data/io_pymc3.py
# random variable object ...
Var = Any # pylint: disable=invalid-name

# pylint: disable=line-too-long
OriolAbril (Member):

What is the scope of this disable? (mainly curiosity)

rpgoldman (Contributor Author):

> What is the scope of this disable? (mainly curiosity)

I'm afraid I don't know as well as I should. I'm going to try to leave it out and see what happens. If it's not necessary (after black reformatting), or if it can be more tightly scoped, I will cut or move it.

rpgoldman (Contributor Author):

I cut it out -- I think it was there because my editor didn't look at the right .pylintrc, and was holding me to 80 characters. Passes pylint here without it. Assuming it passes in CI, I think we are good.

@rpgoldman (Contributor Author)

@OriolAbril Assuming that this passes all the tests again, how should we merge it? The git history is cluttered, and I was thinking of, for example, using a rebase to squash together a bunch of the lint and black compliance commits.

How about I wait for this bout of tests to pass, then clean up the history, and then if it's still passing tomorrow with the cleaned up history, you or I can merge it?

@OriolAbril (Member)

All merges in ArviZ are squash merges, so the PR history does not really matter. I can merge afterwards.

@rpgoldman (Contributor Author)

> All merges in ArviZ are squash merges, so the PR history does not really matter. I can merge afterwards.

Oh, great. Then I'll skip the history editing. I will push my changelog entry then, and we're done!

@rpgoldman rpgoldman changed the title WIP: Populate InferenceData with out-of-sample prediction results from PyMC3 predictive samples Populate InferenceData with out-of-sample prediction results from PyMC3 predictive samples Jan 17, 2020
@OriolAbril OriolAbril merged commit 9987acb into arviz-devs:master Jan 19, 2020
@rpgoldman (Contributor Author)

@OriolAbril Thanks!

@rpgoldman rpgoldman deleted the feature/predictive-constant branch January 20, 2020 02:51
percygautam pushed a commit to percygautam/arviz that referenced this pull request Jan 21, 2020
Populate InferenceData with out-of-sample prediction results from PyMC3 predictive samples (arviz-devs#983)

Adds from_pymc3_predictions to add the predictions and predictions_constant_data groups to InferenceData objects.

Co-authored-by: Oriol Abril <oriol.abril.pla@gmail.com>