BUG: automatic data imputation does not work when observed=pm.Data() tensors #6626
Comments
In light of pymc-devs/pytensor#258, it seems that MaskedArrays can't be currently supported via PyMC Data (Mutable or Constant), but arrays simply containing
import pymc as pm
import numpy as np
# Simulate a predictor and an outcome, then blank out the first 10 predictor values
real_X = np.random.default_rng().normal(size=1000)
Y = np.random.default_rng().normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan
with pm.Model() as m:
    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)
    # The nan-containing array is wrapped in ConstantData before being passed as observed
    X = pm.Normal("X", 0, 1, observed=pm.ConstantData("X_with_nans", X))
    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y)
m.compile_logp()(m.initial_point())  # array(nan)
By introspecting the values of
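A minimal NumPy-only sketch of why the log-probability in the snippet above comes out as nan when the nan entries are treated as observed values rather than imputed. The helper `normal_logpdf` and the toy data are illustrative assumptions, not PyMC code; the point is only that a single nan observation propagates through the sum.

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    """Elementwise log-density of Normal(mu, sigma) (hypothetical helper)."""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

# Toy "observed" data with one missing entry encoded as nan
x = np.array([0.3, np.nan, -1.1, 0.8])

# Treating nan as an ordinary observation poisons the total log-probability
total_with_nan = normal_logpdf(x, 0.0, 1.0).sum()

# Restricting the likelihood to the truly observed entries keeps it finite
observed_only = x[~np.isnan(x)]
total_observed = normal_logpdf(observed_only, 0.0, 1.0).sum()

print(total_with_nan, total_observed)
```

Imputation avoids the first case by giving the missing entries their own free variables instead of evaluating the likelihood at nan.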
@ricardoV94 - it looks like your previous message got cut off. Based on what you explained in pymc-devs/pytensor#258, this is a bit more complicated than I thought - appreciate you explaining it. At the same time, automatic imputation of missing values is quite a core concept in Bayesian workflows, and pyMC's current support for it is a bit awkward. On the one hand, it's possible to leverage it by passing masked arrays directly to
I am not familiar enough with the pyMC / pytensor internals to have the big picture, but could an interim solution be:
In that case, automatic imputation would still be backward compatible while anyone who passes nan/masked arrays to
Appreciate it may break some implicit assumptions elsewhere, though.
@kamicollo that seems reasonable. Would you mind opening a PR for that?
Yes, I can have a go at that - may revert if I run into issues, as I see the current code relies a lot on subtensors, and I may need some help to figure out how exactly to leverage them.
Stumbling into this 18 months later, I'd like to be able to support the usual workflow, and allow for data to be missing in specified features:
The auto-impute works great for the in-sample dataset, but when I want to replace data in the
E.g. inside model context I want to replace
EDIT: I can get around that by rebuilding the model using
Any ideas of alternative methods? @ricardoV94
EDIT EDIT: I found a hacky way! In a two step process, sample_ppc the new
@jonsedar you can build the imputed model manually instead of relying on
pymc/tests/distributions/test_distribution.py Lines 424 to 430 in a507ea8
In general, the auto-imputation shouldn't come from some magic
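The referenced test lines are not reproduced in this thread, so the following is only a NumPy sketch of the bookkeeping that building an imputed model by hand seems to involve: split the data into observed and missing positions, then reassemble a full vector from the observed values plus candidate values for the missing ones. All names (`assemble`, `x_missing_draw`) are hypothetical. In a PyMC model, the assumption is that `x_observed` would be passed as `observed=` to one distribution while the missing entries become a free random variable of the same family.

```python
import numpy as np

# Toy data with two missing entries encoded as nan
x = np.array([0.5, np.nan, -1.2, np.nan, 2.0])

# Index bookkeeping: which positions are observed and which need imputing
missing_idx = np.flatnonzero(np.isnan(x))
observed_idx = np.flatnonzero(~np.isnan(x))
x_observed = x[observed_idx]

def assemble(x_observed, x_missing_draw):
    """Rebuild the full vector from observed values and one draw of the missing ones."""
    full = np.empty(len(observed_idx) + len(missing_idx))
    full[observed_idx] = x_observed
    full[missing_idx] = x_missing_draw
    return full

# Stand-in for a single posterior draw of the two missing values
full = assemble(x_observed, np.array([10.0, 20.0]))
print(full)
```

Keeping the split explicit means the missing positions are fixed when the model is built, which is one reason swapping in new data with a different missingness pattern needs extra care.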
Thanks @ricardoV94, that's interesting - nice idea. FWIW I'll have this example notebook ready for PR today or tomorrow and hope to get your input :) I'm hoping it can be a simple one-stop shop to demonstrate a few approaches
Describe the issue:
Automatic imputation fails silently in pyMC if a user passes partially observed data held in pm.ConstantData() or pm.MutableData() to the observed parameter of any distribution. In simple models, the user won't be able to sample (as the log-likelihood will evaluate to nan), but I have also been able to run more complex (GP) models that sampled, likely producing wrong results (see https://discourse.pymc.io/t/issue-imputing-data-for-gaussian-process-model/11626/3 for detail).
Based on my initial review of the source code, it seems the culprit is the Model.make_obs_var() method, where the check for whether the passed data is masked is performed with mask = getattr(data, "mask", None), which always returns None for tensors.
In the case of pm.ConstantData(), the fix appears to be quite simple: retrieve the mask with mask = getattr(data.value, "mask", None) instead. In the case of pm.MutableData(), however, the issue seems to be that pytensor.shared() does not maintain masked values. That is very problematic on its own if masked values are represented by actual numbers and not np.nan. I'll file an issue under the pytensor project about this, too.
I'd be happy to contribute a PR for the pm.ConstantData() fix, plus possibly a NotImplemented error for pm.MutableData() if this indeed cannot be solved in other ways. I'm new to the pyMC code base and may be missing the big picture!
Reproducible code example:
Error message:
No response
PyMC version information:
pymc 5.1.2
pytensor 2.10.1
Context for the issue:
The fact that it fails silently on some models is particularly concerning - it means some users may be using pyMC and getting wrong inference results.
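The mask check described in this issue can be illustrated with NumPy alone. This is a hedged sketch, not PyMC or PyTensor code: `np.asarray` is used here only as a stand-in (an assumption) for any container that stores the raw values and discards the mask. It shows both halves of the concern: `getattr(..., "mask", None)` returns None once the mask is gone, and the value hidden behind the mask reappears as an ordinary number rather than nan.

```python
import numpy as np

raw = np.array([1.0, 2.0, 3.0, 4.0])
masked = np.ma.masked_array(raw, mask=[False, True, False, False])

# A masked array exposes its mask, so this style of check works
mask_from_masked = getattr(masked, "mask", None)

# Stand-in for a container that keeps only the underlying values (assumption)
plain = np.asarray(masked)
mask_from_plain = getattr(plain, "mask", None)

print(mask_from_masked)  # mask is available on the masked array
print(mask_from_plain)   # no mask attribute on a plain ndarray
print(plain[1])          # the masked entry's underlying value, not nan
```

In this toy case the second entry would silently be treated as a real observation, which is the kind of silent failure the issue is about.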