
use inference data in end of sampling report #3883

Merged: 2 commits into pymc-devs:master, Apr 20, 2020

Conversation

OriolAbril
Member

The current code calls az.ess and az.rhat with trace objects instead of InferenceData objects, so ArviZ converts the trace to an InferenceData object twice. This is generally not an issue and is barely noticeable unless the number of observations is large, in which case the conversion (and the consequent retrieval of the pointwise log likelihood data) can be quite memory expensive. Below is the memory usage in an example (courtesy of @nitishp25) where the effect is quite noticeable:

[Memory usage plot for the example described above]

This PR makes the conversion to InferenceData explicit so that it only happens once and, if possible, skips retrieving the pointwise log likelihood data (which is needed for neither the ess nor the rhat calculation).
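
For reference, the pattern the PR moves towards looks roughly like the sketch below. This is an illustration rather than the exact report.py code; it assumes an ArviZ version where from_pymc3 accepts log_likelihood and model keywords, and trace and model stand in for the objects already available in the report code.

```python
import arviz as az

# Convert the trace a single time, skipping the pointwise log likelihood,
# which neither ess nor rhat needs (sketch, not the literal PR code).
idata = az.from_pymc3(trace, model=model, log_likelihood=False)

# Reuse the same InferenceData for both diagnostics instead of letting
# ArviZ re-convert the raw trace inside each call.
ess_values = az.ess(idata)
rhat_values = az.rhat(idata)
```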

@@ -107,7 +107,7 @@ def _run_convergence_checks(self, trace, model):
             self._add_warnings([warn])
             return

-        from pymc3 import rhat, ess
+        from pymc3 import rhat, ess, _to_arviz
Member Author

I have used this _to_arviz approach for readability, but I can change it to from arviz import from_pymc3 if you prefer it.

Member

Could _to_arviz be a place where to put storing of sampling metadata? (See arviz-devs/arviz#1146)

Member Author

I definitely think it has to be added to from_pymc3, and therefore also to _to_arviz. Is that the question?

Member

I didn't know _to_arviz existed, so to me it sounds like code duplication?

Member Author

I aliased from_pymc3 to _to_arviz because I was worried that calling a function named from_pymc3 inside PyMC3 code would be confusing. I guess it depends on the general degree of familiarity with ArviZ.
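
For readers less familiar with ArviZ, the aliasing described here amounts to something like the following one-liner in the pymc3 namespace (a sketch; the exact location in the PR may differ):

```python
# Re-export ArviZ's converter under an internal PyMC3 name so that code
# inside pymc3 does not call a function literally named "from_pymc3".
from arviz import from_pymc3 as _to_arviz
```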

@codecov

codecov bot commented Apr 16, 2020

Codecov Report

Merging #3883 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #3883   +/-   ##
=======================================
  Coverage   90.71%   90.71%           
=======================================
  Files         135      135           
  Lines       21314    21316    +2     
=======================================
+ Hits        19335    19337    +2     
  Misses       1979     1979           
Impacted Files              Coverage Δ
pymc3/backends/report.py    93.07% <100.00%> (+0.10%) ⬆️

@michaelosthege
Member

@OriolAbril do you think we could transition to returning InferenceData in the long term?

@OriolAbril
Member Author

> @OriolAbril do you think we could transition to returning InferenceData in the long term?

Now that ArviZ is a dependency, it is feasible; however, it would be quite a breaking change. It should probably be discussed in an issue or lab meeting. Another alternative would be to add an option to pm.sampling to return an inference data object 🤔
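
As a purely illustrative sketch of the option being floated (the return_inferencedata flag shown here is hypothetical at this point and is not added by this PR):

```python
import pymc3 as pm

with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0)
    # hypothetical opt-in flag; pm.sample currently returns a MultiTrace
    idata = pm.sample(1000, return_inferencedata=True)
```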

@AlexAndorra
Contributor

I like the idea of having an option to return an inference data object! Would make the modeling workflow easier and more elegant.

@rpgoldman
Contributor

Returning an InferenceData is a nice idea, but it would radically change the posterior predictive sampling code, which is pretty rickety already. To be honest, I'm so scarred by my experience working on fast_sample_posterior_predictive that I wouldn't like to go back in there again.

@michaelosthege
Member

> Returning an InferenceData is a nice idea, but it would radically change the posterior predictive sampling code, which is pretty rickety already. To be honest, I'm so scarred by my experience working on fast_sample_posterior_predictive that I wouldn't like to go back in there again.

But we already have a converter to swallow InferenceData.posterior objects in sample_posterior_predictive - we did that a few weeks ago, remember?
https://github.com/pymc-devs/pymc3/blob/e7bc8326541e2c7aeee1f3741ff1877f716077a1/pymc3/sampling.py#L1561
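
The kind of dispatch being referred to is roughly the following (a simplified sketch with illustrative helper names, not the literal sampling.py code):

```python
import arviz as az

def _posterior_to_points(idata):
    """Sketch: flatten an InferenceData posterior into a list of point dicts
    (one dict of parameter values per draw), a trace-like format that
    posterior predictive sampling can iterate over."""
    posterior = idata.posterior.stack(sample=("chain", "draw"))
    n_samples = posterior.sizes["sample"]
    return [
        {name: posterior[name].isel(sample=i).values for name in posterior.data_vars}
        for i in range(n_samples)
    ]

def _coerce_trace_argument(trace):
    """Accept either a MultiTrace-like object or an InferenceData (sketch only)."""
    if isinstance(trace, az.InferenceData):
        return _posterior_to_points(trace)
    return trace
```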

@rpgoldman
Contributor

Ah, right. I had forgotten. And it's already ported to fast_sample_posterior_predictive. OK, never mind!

@lucianopaz
Contributor

> @OriolAbril do you think we could transition to returning InferenceData in the long term?
>
> Now that ArviZ is a dependency, it is feasible; however, it would be quite a breaking change. It should probably be discussed in an issue or lab meeting. Another alternative would be to add an option to pm.sampling to return an inference data object 🤔

I think that the steps to transition to an InferenceData interface should be:

  1. Write a new trace backend that uses xarray datasets or whatever data container arviz uses under the hood. This way, arviz would only need to get a reference to the true trace instead of copying the contents into a new data structure.
  2. Proceed as the guys in scikit-learn do: issue a FutureWarning in pm.sample saying that the default backend will change from NDArray to XArray (or whatever it is called in the end).
  3. Leave the warning for the duration of at least one minor release before switching the default. By this I mean that, if the new backend is added in pymc3 v3.9, the default should change in v3.10.

The way such a future warning is usually dealt with is by setting the default argument value to something that indicates that the user did not explicitly provide anything as its value (it could be None or a string like "unset"). Then, if the argument is unset, the future warning is signaled and the argument's value is set to NDArray to preserve backwards compatibility.
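
Concretely, the sentinel-default pattern described above could look like the generic sketch below; the argument and backend names are illustrative, not the actual pm.sample signature.

```python
import warnings

_UNSET = object()  # sentinel meaning "the user did not pass anything"
                   # (None or a string like "unset" would also work)

def sample(draws=1000, trace_backend=_UNSET, **kwargs):
    """Sketch of the FutureWarning dance for changing a default backend."""
    if trace_backend is _UNSET:
        warnings.warn(
            "The default trace backend will change from NDArray to the "
            "xarray-based backend in a future release; pass trace_backend "
            "explicitly to keep the current behaviour.",
            FutureWarning,
        )
        trace_backend = "NDArray"  # preserve backwards compatibility for now
    # ... proceed with sampling using the selected backend ...
```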

@nitishp25

nitishp25 commented Apr 18, 2020

Btw, this is how the memory usage looks after the changes done in this PR:

[Memory usage plot after the changes in this PR]

(Review thread on pymc3/backends/report.py: outdated, resolved)
@OriolAbril
Member Author

Addressed the comments about using inference data to avoid memory usage peaks due to pointwise log likelihood. I would leave returning an inference data object as a result of pm.sample for another PR.

@twiecki
Member

twiecki commented Apr 19, 2020

@OriolAbril Looks great -- thanks! Can you add a note to the release-notes under maintenance?

@twiecki twiecki merged commit 80a82dd into pymc-devs:master Apr 20, 2020
@twiecki
Member

twiecki commented Apr 20, 2020

Thanks @OriolAbril !

@OriolAbril OriolAbril deleted the arviz_integration branch April 20, 2020 09:09
7 participants