Port InferenceData conversion code to pymc3 codebase #4489
Conversation
I have added some comments about places where I expect more changes to be needed. I have pushed to a branch on pymc3 instead of on my fork to allow anyone with permission to work directly on this.

The gist of each of the converter methods is to generate a dict of str: np.ndarray with the arrays following the (chain, draw, *shape) convention, then call dict_to_dataset.

I don't know if it would make sense to try and use __array_function__ on Aesara (and I don't know enough about either Aesara or __array_function__ to venture a guess). I am only mentioning it because, if it did, the InferenceDatas could have Aesara TensorVariables as the underlying data structure, and when working with xarray objects one could use .values to get a numpy array or .data to get the actual data array used. Right now ArviZ forces conversion to numpy, but we could change that.
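For concreteness, here is a minimal sketch of the converter convention described above; the variable names and shapes are made up for illustration, and it assumes ArviZ's internal dict_to_dataset helper:

```python
import numpy as np
from arviz.data.base import dict_to_dataset  # ArviZ's internal helper

n_chains, n_draws = 4, 1000
samples = {
    # every array follows the (chain, draw, *shape) convention
    "mu": np.random.randn(n_chains, n_draws),        # scalar variable
    "theta": np.random.randn(n_chains, n_draws, 8),  # vector variable of shape (8,)
}
# dict_to_dataset attaches the chain/draw coordinates automatically
posterior = dict_to_dataset(samples)
```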
pymc3/backends/arviz.py
Outdated
```python
import xarray as xr

from aesara.gof.graph import ancestors
from aesara.tensor.var import TensorVariable
```
It used to check for PyMC3Variable; I had some doubts between using Variable or TensorVariable.
Usually, the difference comes down to whether or not the code requires shape information (e.g. the dtype and broadcastable fields that are exposed by TensorVariable via TensorVariable.type).
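As a small illustration of that shape information (a sketch; it only assumes Aesara is importable):

```python
import aesara.tensor as at

x = at.matrix("x")           # a TensorVariable
print(x.type.dtype)          # e.g. 'float64'
print(x.type.broadcastable)  # (False, False) for a matrix
```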
I don't think we explicitly use this info, but it should be taken into account when converting to numpy to make sure we preserve the dtypes of the original parameters and data. Thanks for the explanation, I will test to make sure dtypes are preserved.
pymc3/backends/arviz.py
Outdated
```python
def log_likelihood_vals_point(self, point, var, log_like_fun):
    """Compute log likelihood for each observed point."""
    log_like_val = utils.one_de(log_like_fun(point))
    if var.missing_values:
        mask = var.observations.mask
        if np.ndim(mask) > np.ndim(log_like_val):
            mask = np.any(mask, axis=-1)
        log_like_val = np.where(mask, np.nan, log_like_val)
    return log_like_val

def _extract_log_likelihood(self, trace):
    """Compute log likelihood of each observation."""
    if self.trace is None:
        return None
    if self.model is None:
        return None

    if self.log_likelihood is True:
        cached = [(var, var.logp_elemwise) for var in self.model.observed_RVs]
    else:
        cached = [
            (var, var.logp_elemwise)
            for var in self.model.observed_RVs
            if var.name in self.log_likelihood
        ]
    log_likelihood_dict = _DefaultTrace(len(trace.chains))
    for var, log_like_fun in cached:
        for chain in trace.chains:
            log_like_chain = [
                self.log_likelihood_vals_point(point, var, log_like_fun)
                for point in trace.points([chain])
            ]
            log_likelihood_dict.insert(var.name, np.stack(log_like_chain), chain)
    return log_likelihood_dict.trace_dict
```
I am hoping that all this will go away and pointwise log likelihood can now be evaluated in a vectorized way; looping over variables, then over chains, then over draws should not be the way to go.
Yes, it should be possible to evaluate a likelihood in a vectorized fashion. You'll still have to compile an Aesara function with a fixed set of dimensions for the input, but, assuming that the number/order of the input dimensions are always the same (e.g. chains + samples + var dims), that need only be done once for a given model.
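To make the idea concrete, here is a minimal, hypothetical sketch of that vectorized evaluation; the Normal log-density and the (4, 1000) trace shape are made up for illustration:

```python
import aesara
import aesara.tensor as at
import numpy as np

# inputs carry explicit (chain, draw) dimensions, so the function is
# compiled once and applied to the whole trace in a single call
mu = at.matrix("mu")    # shape (chains, draws)
obs = at.scalar("obs")  # a single observed value

# elementwise Normal(mu, 1) log-density, broadcast over chains and draws
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (obs - mu) ** 2
loglik_fn = aesara.function([mu, obs], log_lik)

trace_mu = np.random.randn(4, 1000)   # 4 chains, 1000 draws
pointwise = loglik_fn(trace_mu, 0.0)  # shape (4, 1000); no Python loops
```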
```python
@requires(["trace", "predictions"])
@requires("model")
def constant_data_to_xarray(self):
```
This may also need a significant refactor, but honestly I have no idea.
Most of those attributes in self.model are still available, but they might correspond to different variables. For example, model.vars returns the log-likelihood-space TensorVariables, and the model.*_RVs methods return the sample-space TensorVariables.

Regardless, you can always get the corresponding log-likelihood-space TensorVariable from the sample-space variable using var.tag.value_var (I should probably rename it to var.tag.measure_space_var or var.tag.loglik_var, though).
All these checks and selections of variables have the sole goal of gathering all named variables that are neither sampled parameters nor observed/observations. Up until now that basically meant pm.Data objects that were not passed as observed: things like the x in a linear regression, or the county and floor indexes in the radon example. Maybe we can think of a better way to identify these variables.
Absolutely; we can very easily obtain variables like that using something as simple as aesara.graph.basic.vars_between, where the ins are the sampled/unobserved variables and the outs are the observed variables.
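A hedged sketch of that suggestion (the model attributes used here are assumptions drawn from the discussion, not tested code):

```python
from aesara.graph.basic import vars_between

free_rvs = model.free_RVs          # the ins: sampled/unobserved variables
observed_rvs = model.observed_RVs  # the outs: observed variables

# keep only named intermediate variables, e.g. pm.Data not passed as observed
constant_data = [
    v
    for v in vars_between(free_rvs, observed_rvs)
    if v.name is not None and v not in free_rvs and v not in observed_rvs
]
```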
Thanks @Spaak! On a more general note, does ArviZ preserve the chain indexes when they don't start at 0? I fear the current code may work when different chain idxs are used, but the resulting xarray coordinates will still be 0, 1, 2... Which also poses a question about how we want to handle this going forward. Do we want to keep …
pymc3/backends/arviz.py
Outdated
```python
(var, var.logp_elemwise)
for var in self.model.observed_RVs
if var.name in self.log_likelihood
```
This will need to be updated, because the var.log* methods are gone now. Instead, you can use something like logpt(var).
I want all of log_likelihood_vals_point and _extract_log_likelihood to go and instead evaluate the likelihood in a vectorized fashion, looping over observed variables at most. I am reading the developer guide and trying to see how that would be done.
I have given this a try now, and I don't understand what is happening, nor where to find the relevant docs between pymc3 docstrings and aesara. Say I have this model:
```python
with pm.Model() as model:
    m = pm.Normal("m", 0.0, 1.0)
    a = pm.Normal("a", mu=m, sigma=1, observed=0.0)
```
I then understood I have to use logpt on a to get its pointwise log likelihood, log_lik = pm.distributions.logpt(a), and that the observed values should be pulled from a. I then try to compile a function to calculate log_lik (for now with a manual scalar input for m, but ideally with a numpy array containing the whole trace). However, I can't seem to do that:

```python
aesara.function([], log_lik)
```

errors out with:
```
...
raise MissingInputError(error_msg, variable=var)
aesara.graph.fg.MissingInputError: Input 0 (m) of the graph (indices start from 0),
used to compute InplaceDimShuffle{x}(m), was not provided and not given a value. Use
the Aesara flag exception_verbosity='high', for more information on this error.
Backtrace when that variable is created:
  File "<stdin>", line 2, in <module>
  File "/home/oriol/miniconda3/envs/arviz/lib/python3.8/site-packages/pymc3/distributions/distribution.py", line 96, in __new__
    rv_out = cls.dist(*args, rng=rng, **kwargs)
  File "/home/oriol/miniconda3/envs/arviz/lib/python3.8/site-packages/pymc3/distributions/continuous.py", line 477, in dist
    return super().dist([mu, sigma], **kwargs)
  File "/home/oriol/miniconda3/envs/arviz/lib/python3.8/site-packages/pymc3/distributions/distribution.py", line 105, in dist
    rv_var = cls.rv_op(*dist_params, **kwargs)
```
and

```python
aesara.function([m], log_lik)
```

errors with:
```
...
raise UnusedInputError(msg % (inputs.index(i), i.variable, err_msg))
aesara.compile.function.types.UnusedInputError: aesara.function was asked to create
a function computing outputs given certain inputs, but the provided input variable at
index 0 is not part of the computational graph needed to compute the outputs: m.
To make this error into a warning, you can pass the parameter on_unused_input='warn'
to aesara.function. To disable it completely, use on_unused_input='ignore'.
```
Looks like you need to use m.tag.value_var.
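Applied to the example above, the fix would look something like this (a sketch continuing the snippet that failed, under the same assumptions):

```python
# compile against the value variable, not the random variable itself
m_value = m.tag.value_var
loglik_fn = aesara.function([m_value], log_lik)
loglik_fn(0.5)  # pointwise log likelihood evaluated at m = 0.5
```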
It's rather confusing, because the display names for these two variables (i.e. m and m.tag.value_var) are exactly the same. I thought about changing the name of the latter to something like "{rv_name}_value" or "{rv_name}_lik", but I couldn't decide.

Anyway, it's very important to understand that there are now two distinct, but complementary "variables" for each random variable in a model: TensorVariables that represent sample-able random variables, and TensorVariables that represent concrete values of those random variables. This idea is directly related to the observed parameter, because the values passed as observed are essentially a concrete instance of the latter type of variable.

More specifically, when one (sloppily) writes P(Y = y) in "math"-like notation, Y is the random variable and y is an "instance" of Y, but they are two distinct things with different properties/constraints. The latter term (i.e. the "value" variable), y, is the input argument used in density functions, e.g. f(y) = exp(-y**2 * ...) * ...; this is why the "value" variable is the only one used in the log-likelihood graphs.

Really, what I'm pointing out here is that random variables are actually (measurable) functions, both mathematically and, now, in v4. Technically, in v4, random variables are Aesara graphs, but, since Aesara graphs can represent (and be converted into) functions, it's the same idea.

N.B.: These "value" variables are conveniently stored within their corresponding random variable's tag (i.e. a Variable's arbitrary "storage" space) under the name value_var.
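A tiny sketch of that distinction (assuming a v4-style pymc3 development branch, so the exact API may differ):

```python
import pymc3 as pm

with pm.Model():
    m = pm.Normal("m", 0.0, 1.0)

m_value = m.tag.value_var
print(m is m_value)          # False: two distinct TensorVariables
print(m.name, m_value.name)  # both display as "m", hence the confusion
```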
Looks like you might've accidentally merge-committed; you'll definitely need to rebase instead.
pymc3/backends/arviz.py
Outdated
```python
for obs in self.model.observed_RVs:
    if hasattr(obs.tag, "observations"):
        aux_obs = obs.tag.observations
        observations[obs.name] = aux_obs.data if hasattr(aux_obs, "data") else aux_obs
```
Is data the right field? Or should it be value? Or try both?
That will work for Constant objects, but not for shared variables, which will require aux_obs.get_value().

There's also the case where obs.tag.observations is a partially observed variable (i.e. has missing data). The result will be an AdvancedIncSubtensor1 in that case.

Here's a helper function for all those situations:
```python
import numpy as np
from aesara.graph.basic import Constant
from aesara.tensor.sharedvar import SharedVariable
from aesara.tensor.subtensor import AdvancedIncSubtensor1


def extract_data(x):
    if isinstance(x, Constant):
        return x.data
    if isinstance(x, SharedVariable):
        return x.get_value()
    if x.owner and isinstance(x.owner.op, AdvancedIncSubtensor1):
        # partially observed variable: recover the underlying array and
        # flag the imputed (missing) entries with a mask
        array_data = extract_data(x.owner.inputs[0])
        mask_idx = extract_data(x.owner.inputs[2])
        mask = np.zeros_like(array_data)
        mask[mask_idx] = 1
        return np.ma.MaskedArray(array_data, mask)
    raise TypeError(f"Data cannot be extracted from {x}")
```
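Hypothetical usage, mirroring the loop in the diff above:

```python
observations = {}
for obs in model.observed_RVs:
    if hasattr(obs.tag, "observations"):
        observations[obs.name] = extract_data(obs.tag.observations)
```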
pymc3/backends/arviz.py
Outdated
```python
def log_likelihood_to_xarray(self):
    """Extract log likelihood and log_p data from PyMC3 trace."""
    # TODO: add pointwise log likelihood extraction to the converter
    return None
```
I have disabled the log likelihood conversion for now so we can already start working on tests that depend on this. The ones I am adding from ArviZ do test specifically for log likelihood conversion, but I don't think the tests on pymc3 that use sample will. I hope this will already work.
Sounds good!

I'm about to push a rebased version of this branch. Once that's done, you can fetch and pull this branch. If you don't have any unpushed local changes, you can …
Thanks! I have no unpushed local changes. I also realized that the code I'm using runs only on ArviZ master, so tests won't pass until we make the next release. We don't have a very clear timeline, but it should happen soon.
I'm updating all that now. I'll push those changes shortly.
pymc3/backends/arviz.py
Outdated
```python
library=pymc3,
coords=self.coords,
dims=self.dims,
# default_dims=[],
```
This won't work yet, I think; dict_to_dataset will try to add the chain and draw dimensions and fail. The old version (with old meaning the current release) had all the code in dict_to_dataset repeated here so as not to assume chain and draw were present.
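A rough sketch of the failure mode being described (hypothetical data; it assumes ArviZ's internal dict_to_dataset helper treats the two leading array dimensions as chain and draw by default):

```python
import numpy as np
from arviz.data.base import dict_to_dataset

constant_data = {"x": np.arange(10)}  # no leading (chain, draw) dimensions
# dict_to_dataset(constant_data)  # would mislabel or fail under the default assumption
```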
dims=self.dims won't work?
No; commenting out default_dims=[] won't work, because it reverts to the default behaviour of assuming chain and draw are present, and here they are not.
test_idata_conversion.py passes locally with that commented out, but I had to comment out dims as well.
It looks like I've gotten most/all of the major things converted, so I'll stop making edits for now.

Otherwise, @OriolAbril, I don't know what the deal is with dims, so you'll have to look at that one.
I'll try to make a workaround so we can merge ASAP. To properly fix that we need an ArviZ release to break all the entanglement.
I have started porting the arviz.from_pymc3 converter to the pymc3 codebase. This should first allow the code to be simplified significantly (which has already happened quite a lot, and probably can be taken further) and allow for an ArviZ-independent test suite.

Closes arviz-devs/arviz#1278, closes arviz-devs/arviz#939 and closes arviz-devs/arviz#1470.

I think we can also solve the issue in arviz-devs/arviz#1224 by vectorizing: if the conversion is fast enough, a progress bar no longer makes sense.

There surely is a lot of room for improvement, as I am not yet too familiar with all the v4 changes.