Port InferenceData conversion code to pymc3 codebase #4489
Conversation
I have added some comments about places where I expect more changes to be needed. I have pushed to a branch on pymc3 instead of on my fork to allow anyone with permission to work directly on this.

The gist of each of the converter methods is to generate a dict of str: np.ndarray with the arrays following the (chain, draw, *shape) convention, then call dict_to_dataset.

I don't know if it would make sense to try and use __array_function__ on Aesara (and I don't know enough about either Aesara or __array_function__ to venture a guess). I am only mentioning it because, if it did, the InferenceDatas could have Aesara TensorVariables as the underlying data structure, and when working with xarray objects one could use .values to get a numpy array or .data to get the actual data array used. Right now ArviZ forces conversion to numpy, but we could change that.
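For concreteness, here is a minimal sketch of the converter convention described above; the variable names and shapes are made up for illustration, and it assumes ArviZ's internal dict_to_dataset helper:

```python
import numpy as np
from arviz.data.base import dict_to_dataset  # ArviZ's internal helper

n_chains, n_draws = 4, 1000
samples = {
    # every array follows the (chain, draw, *shape) convention
    "mu": np.random.randn(n_chains, n_draws),        # scalar variable
    "theta": np.random.randn(n_chains, n_draws, 8),  # vector variable of shape (8,)
}
# dict_to_dataset attaches the chain/draw coordinates automatically
posterior = dict_to_dataset(samples)
```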
pymc3/backends/arviz.py
Outdated
```python
import xarray as xr

from aesara.gof.graph import ancestors
from aesara.tensor.var import TensorVariable
```
It used to check for PyMC3Variable; I had some doubts between using Variable or TensorVariable.
Usually, the difference comes down to whether or not the code requires shape information (e.g. the dtype and broadcastable fields that are exposed by TensorVariable via TensorVariable.type).
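As a small illustration of that shape information (a sketch; it only assumes Aesara is importable):

```python
import aesara.tensor as at

x = at.matrix("x")           # a TensorVariable
print(x.type.dtype)          # e.g. 'float64'
print(x.type.broadcastable)  # (False, False) for a matrix
```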
I don't think we explicitly use this info, but it should be taken into account when converting to numpy to make sure we preserve the dtypes of the original parameters and data. Thanks for the explanation, I will test to make sure dtypes are preserved.
pymc3/backends/arviz.py
Outdated
```python
def log_likelihood_vals_point(self, point, var, log_like_fun):
    """Compute log likelihood for each observed point."""
    log_like_val = utils.one_de(log_like_fun(point))
    if var.missing_values:
        mask = var.observations.mask
        if np.ndim(mask) > np.ndim(log_like_val):
            mask = np.any(mask, axis=-1)
        log_like_val = np.where(mask, np.nan, log_like_val)
    return log_like_val

def _extract_log_likelihood(self, trace):
    """Compute log likelihood of each observation."""
    if self.trace is None:
        return None
    if self.model is None:
        return None

    if self.log_likelihood is True:
        cached = [(var, var.logp_elemwise) for var in self.model.observed_RVs]
    else:
        cached = [
            (var, var.logp_elemwise)
            for var in self.model.observed_RVs
            if var.name in self.log_likelihood
        ]
    log_likelihood_dict = _DefaultTrace(len(trace.chains))
    for var, log_like_fun in cached:
        for chain in trace.chains:
            log_like_chain = [
                self.log_likelihood_vals_point(point, var, log_like_fun)
                for point in trace.points([chain])
            ]
            log_likelihood_dict.insert(var.name, np.stack(log_like_chain), chain)
    return log_likelihood_dict.trace_dict
```
I am hoping that all this will go away and pointwise log likelihood can now be evaluated in a vectorized way; looping over variables, then over chains, then over draws should not be the way to go.
Yes, it should be possible to evaluate a likelihood in a vectorized fashion. You'll still have to compile an Aesara function with a fixed set of dimensions for the input, but, assuming that the number/order of the input dimensions are always the same (e.g. chains + samples + var dims), that need only be done once for a given model.
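To make the idea concrete, here is a minimal, hypothetical sketch of that vectorized evaluation; the Normal log-density and the (4, 1000) trace shape are made up for illustration:

```python
import aesara
import aesara.tensor as at
import numpy as np

# inputs carry explicit (chain, draw) dimensions, so the function is
# compiled once and applied to the whole trace in a single call
mu = at.matrix("mu")    # shape (chains, draws)
obs = at.scalar("obs")  # a single observed value

# elementwise Normal(mu, 1) log-density, broadcast over chains and draws
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (obs - mu) ** 2
loglik_fn = aesara.function([mu, obs], log_lik)

trace_mu = np.random.randn(4, 1000)   # 4 chains, 1000 draws
pointwise = loglik_fn(trace_mu, 0.0)  # shape (4, 1000); no Python loops
```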
```python
@requires(["trace", "predictions"])
@requires("model")
def constant_data_to_xarray(self):
```
This may also need a significant refactor, but honestly I have no idea.
Most of those attributes in self.model are still available, but they might correspond to different variables. For example, model.vars returns the log-likelihood-space TensorVariables, and the model.*_RVs methods return the sample-space TensorVariables.

Regardless, you can always get the corresponding log-likelihood-space TensorVariable from the sample-space variable using var.tag.value_var (I should probably rename it to var.tag.measure_space_var or var.tag.loglik_var, though).
All these checks and selections of variables have the sole goal of gathering all named variables that are neither sampled parameters nor observed/observations. Up until now that basically meant pm.Data objects that were not passed as observed: things like the x in a linear regression, or the county and floor indexes in the radon example. Maybe we can think of a better way to identify these variables.
Absolutely; we can very easily obtain variables like that using something as simple as aesara.graph.basic.vars_between, where the ins are the sampled/unobserved variables and the outs are the observed variables.
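A hedged sketch of that suggestion (the model attributes used here are assumptions drawn from the discussion, not tested code):

```python
from aesara.graph.basic import vars_between

free_rvs = model.free_RVs          # the ins: sampled/unobserved variables
observed_rvs = model.observed_RVs  # the outs: observed variables

# keep only named intermediate variables, e.g. pm.Data not passed as observed
constant_data = [
    v
    for v in vars_between(free_rvs, observed_rvs)
    if v.name is not None and v not in free_rvs and v not in observed_rvs
]
```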
Thanks @Spaak! On a more general note, does ArviZ preserve the chain indexes when they don't start at 0? I fear the current code may work when different chain idxs are used, but the resulting xarray coordinates will still be 0, 1, 2... Which also poses a question about how we want to handle this going forward. Do we want to keep …
pymc3/backends/arviz.py
Outdated
```python
(var, var.logp_elemwise)
for var in self.model.observed_RVs
if var.name in self.log_likelihood
```
This will need to be updated, because the var.log* methods are gone now. Instead, you can use something like logpt(var).
I want all of log_likelihood_vals_point and _extract_log_likelihood to go and instead evaluate the likelihood in a vectorized fashion, looping over observed variables at most. I am reading the developer guide and trying to see how that would be done.
I have given this a try now, and I don't understand what is happening, nor where to find the relevant docs between pymc3 docstrings and aesara. Say I have this model:
```python
with pm.Model() as model:
    m = pm.Normal("m", 0.0, 1.0)
    a = pm.Normal("a", mu=m, sigma=1, observed=0.0)
```
I then understood I have to use logpt on a to get its pointwise log likelihood, log_lik = pm.distributions.logpt(a), and that the observed values should be pulled from a. I then try to compile a function to calculate log_lik (for now with a manual scalar input for m, but ideally with a numpy array containing the whole trace). However, I can't seem to do that:

```python
aesara.function([], log_lik)
```

errors out with:
```
...
raise MissingInputError(error_msg, variable=var)
aesara.graph.fg.MissingInputError: Input 0 (m) of the graph (indices start from 0),
used to compute InplaceDimShuffle{x}(m), was not provided and not given a value. Use
the Aesara flag exception_verbosity='high', for more information on this error.
Backtrace when that variable is created:
  File "<stdin>", line 2, in <module>
  File "/home/oriol/miniconda3/envs/arviz/lib/python3.8/site-packages/pymc3/distributions/distribution.py", line 96, in __new__
    rv_out = cls.dist(*args, rng=rng, **kwargs)
  File "/home/oriol/miniconda3/envs/arviz/lib/python3.8/site-packages/pymc3/distributions/continuous.py", line 477, in dist
    return super().dist([mu, sigma], **kwargs)
  File "/home/oriol/miniconda3/envs/arviz/lib/python3.8/site-packages/pymc3/distributions/distribution.py", line 105, in dist
    rv_var = cls.rv_op(*dist_params, **kwargs)
```
and

```python
aesara.function([m], log_lik)
```

errors with:
```
...
raise UnusedInputError(msg % (inputs.index(i), i.variable, err_msg))
aesara.compile.function.types.UnusedInputError: aesara.function was asked to create
a function computing outputs given certain inputs, but the provided input variable at
index 0 is not part of the computational graph needed to compute the outputs: m.
To make this error into a warning, you can pass the parameter on_unused_input='warn'
to aesara.function. To disable it completely, use on_unused_input='ignore'.
```
Looks like you need to use m.tag.value_var.
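Applied to the example above, the fix would look something like this (a sketch continuing the snippet that failed, under the same assumptions):

```python
# compile against the value variable, not the random variable itself
m_value = m.tag.value_var
loglik_fn = aesara.function([m_value], log_lik)
loglik_fn(0.5)  # pointwise log likelihood evaluated at m = 0.5
```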
It's rather confusing, because the display names for these two variables (i.e. m and m.tag.value_var) are exactly the same. I thought about changing the name of the latter to something like "{rv_name}_value" or "{rv_name}_lik", but I couldn't decide.

Anyway, it's very important to understand that there are now two distinct, but complementary "variables" for each random variable in a model: TensorVariables that represent sample-able random variables, and TensorVariables that represent concrete values of those random variables. This idea is directly related to the observed parameter, because the values passed as observed are essentially a concrete instance of the latter type of variable.

More specifically, when one (sloppily) writes P(Y = y) in "math"-like notation, Y is the random variable and y is an "instance" of Y, but they are two distinct things with different properties/constraints. The latter term (i.e. the "value" variable), y, is the input argument used in density functions, e.g. f(y) = exp(-y**2 * ...) * ...; this is why the "value" variable is the only one used in the log-likelihood graphs.

Really, what I'm pointing out here is that random variables are actually (measurable) functions, both mathematically and, now, in v4. Technically, in v4, random variables are Aesara graphs, but, since Aesara graphs can represent (and be converted into) functions, it's the same idea.

N.B.: These "value" variables are conveniently stored within their corresponding random variable's tag (i.e. a Variable's arbitrary "storage" space) under the name value_var.
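A tiny sketch of that distinction (assuming a v4-style pymc3 development branch, so the exact API may differ):

```python
import pymc3 as pm

with pm.Model():
    m = pm.Normal("m", 0.0, 1.0)

m_value = m.tag.value_var
print(m is m_value)          # False: two distinct TensorVariables
print(m.name, m_value.name)  # both display as "m", hence the confusion
```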
Looks like you might've accidentally merge-committed; you'll definitely need to rebase instead.
pymc3/backends/arviz.py
Outdated
```python
for obs in self.model.observed_RVs:
    if hasattr(obs.tag, "observations"):
        aux_obs = obs.tag.observations
        observations[obs.name] = aux_obs.data if hasattr(aux_obs, "data") else aux_obs
```
Is data the right field? Or should it be value? Or try both?
That will work for Constant objects, but not for shared variables, which will require aux_obs.get_value().

There's also the case where obs.tag.observations is a partially observed variable (i.e. has missing data). The result will be an AdvancedIncSubtensor1 in that case.

Here's a helper function for all those situations:
```python
import numpy as np
from aesara.graph.basic import Constant
from aesara.tensor.sharedvar import SharedVariable
from aesara.tensor.subtensor import AdvancedIncSubtensor1


def extract_data(x):
    if isinstance(x, Constant):
        return x.data
    if isinstance(x, SharedVariable):
        return x.get_value()
    if x.owner and isinstance(x.owner.op, AdvancedIncSubtensor1):
        # partially observed variable: recover the underlying array and
        # flag the imputed (missing) entries with a mask
        array_data = extract_data(x.owner.inputs[0])
        mask_idx = extract_data(x.owner.inputs[2])
        mask = np.zeros_like(array_data)
        mask[mask_idx] = 1
        return np.ma.MaskedArray(array_data, mask)
    raise TypeError(f"Data cannot be extracted from {x}")
```
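Hypothetical usage, mirroring the loop in the diff above:

```python
observations = {}
for obs in model.observed_RVs:
    if hasattr(obs.tag, "observations"):
        observations[obs.name] = extract_data(obs.tag.observations)
```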
pymc3/backends/arviz.py
Outdated
```python
def log_likelihood_to_xarray(self):
    """Extract log likelihood and log_p data from PyMC3 trace."""
    # TODO: add pointwise log likelihood extraction to the converter
    return None
```
I have disabled the log likelihood conversion for now so we can already start working on tests that depend on this. The ones I am adding from ArviZ do test specifically for log likelihood conversion, but I don't think the tests on pymc3 that use sample will. I hope this will already work.
Sounds good!

I'm about to push a rebased version of this branch. Once that's done, you can fetch and pull this branch. If you don't have any unpushed local changes, you can …
Thanks! I have no unpushed local changes. I also realized that the code I'm using runs only on ArviZ master, so tests won't pass until we make the next release. We don't have a very clear timeline, but it should happen soon.
I'm updating all that now. I'll push those changes shortly.
pymc3/backends/arviz.py
Outdated
```python
library=pymc3,
coords=self.coords,
dims=self.dims,
# default_dims=[],
```
This won't work yet, I think; dict_to_dataset will try to add the chain and draw dimensions and fail. The old version (with old meaning the current release) had all the code in dict_to_dataset repeated here so as not to assume chain and draw were present.
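A rough sketch of the failure mode being described (hypothetical data; it assumes ArviZ's internal dict_to_dataset helper treats the two leading array dimensions as chain and draw by default):

```python
import numpy as np
from arviz.data.base import dict_to_dataset

constant_data = {"x": np.arange(10)}  # no leading (chain, draw) dimensions
# dict_to_dataset(constant_data)  # would mislabel or fail under the default assumption
```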
dims=self.dims won't work?
No; commenting out default_dims=[] won't work, because it reverts to the default behaviour of assuming chain and draw are present, and here they are not.
test_idata_conversion.py passes locally with that commented out, but I had to comment out dims as well.
It looks like I've gotten most/all of the major things converted, so I'll stop making edits for now.

Otherwise, @OriolAbril, I don't know what the deal is with dims, so you'll have to look at that one.
I'll try to make a workaround so we can merge ASAP. To properly fix that we need an ArviZ release to break all the entanglement.
I have started porting the arviz.from_pymc3 converter to the pymc3 codebase. This should first allow the code to be simplified significantly (which has already happened quite a lot, and probably can be taken further) and allow for an ArviZ-independent test suite.

Closes arviz-devs/arviz#1278, closes arviz-devs/arviz#939 and closes arviz-devs/arviz#1470.

I think we can also solve the issue in arviz-devs/arviz#1224 by vectorizing: if the conversion is fast enough, a progress bar no longer makes sense.

There surely is a lot of room for improvement, as I am not yet too familiar with all the v4 changes.