
BUG: automatic data imputation does not work when observed=pm.Data() tensors #6626


Open
kamicollo opened this issue Mar 29, 2023 · 7 comments

@kamicollo

Describe the issue:

Automatic imputation fails silently in PyMC if a user passes partially observed data held in pm.ConstantData() or pm.MutableData() to the observed parameter of any distribution. In simple models, the user won't be able to sample (as the log-likelihood evaluates to nan), but I have also been able to run more complex (GP) models that did sample, likely producing wrong results (see https://discourse.pymc.io/t/issue-imputing-data-for-gaussian-process-model/11626/3 for details).

Based on my initial review of the source code, it seems the culprit is the Model.make_obs_var() method, where the check for whether the passed data is masked is performed with mask = getattr(data, "mask", None), which always returns None for tensors.
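
To illustrate the root cause (a minimal sketch, not the actual Model.make_obs_var() code): a NumPy masked array exposes a mask attribute, but once the values are wrapped in a pytensor tensor that attribute is gone, so the getattr check always comes back None:

import numpy as np
import pytensor.tensor as pt

masked = np.ma.masked_invalid(np.array([1.0, np.nan, 3.0]))
print(getattr(masked, "mask", None))                # [False  True False] -> imputation kicks in

tensor = pt.as_tensor_variable(np.asarray(masked))  # roughly what pm.Data wraps the values in
print(getattr(tensor, "mask", None))                # None -> imputation is silently skipped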

In the case of pm.ConstantData(), the fix appears to be quite simple: the mask needs to be retrieved with mask = getattr(data.value, "mask", None) instead. In the case of pm.MutableData(), however, the issue seems to be that pytensor.shared() does not maintain masked values. That is very problematic on its own if masked values are represented by actual numbers rather than np.nan. I'll file an issue under the pytensor project about this, too.

I'd be happy to contribute a PR for the pm.ConstantData() fix, plus possibly a NotImplementedError for pm.MutableData() if this indeed cannot be solved in other ways. I'm new to the PyMC code base and may be missing the big picture!

Reproducible code example:

import pymc as pm
import numpy as np

real_X = np.random.default_rng().normal(size=1000)
Y = np.random.default_rng().normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan
masked_X = np.ma.masked_where(np.isnan(X), X)

with pm.Model() as m:

    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)

    # This works
    # X = pm.Normal("X", 0, 1, observed = masked_X, dims='test')   
    
    # This even fails to sample
    X = pm.Normal("X", 0, 1, observed = pm.ConstantData("masked_X", masked_X))    
    
    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y) 

    pm.sample()

Error message:

No response

PyMC version information:

pymc 5.1.2
pytensor 2.10.1

Context for the issue:

The fact that it fails silently on some models is particularly concerning: it means some users may be using PyMC and getting wrong inference results.

@ricardoV94
Member

In light of pymc-devs/pytensor#258, it seems that MaskedArrays can't currently be supported via PyMC Data (Mutable or Constant), but arrays simply containing nan should be. In that case, we could at least support the following example:

import pymc as pm
import numpy as np
real_X = np.random.default_rng().normal(size=1000)
Y = np.random.default_rng().normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan

with pm.Model() as m:
    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)
    X = pm.Normal("X", 0, 1, observed=pm.ConstantData("X_with_nans", X))
    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y)

m.compile_logp()(m.initial_point())  # array(nan)

By introspecting the values of X_with_nans

@kamicollo
Author

kamicollo commented Mar 30, 2023

@ricardoV94, it looks like your previous message got cut off. Based on what you explained in pymc-devs/pytensor#258, this is a bit more complicated than I thought; I appreciate you explaining it.

At the same time, automatic imputation of missing values is a core concept in Bayesian workflows, and PyMC's current support for it is a bit awkward. On the one hand, it's possible to leverage it by passing masked arrays directly to observed, because the automatic imputation logic in Model.make_obs_var() relies on the mask attribute to split the RV into observed and missing portions. On the other hand, all workflows that want to get predictions or otherwise "swap" data need to rely on pm.Data, so it's impossible to combine automatic imputation with more modular code.

I am not familiar enough with the PyMC/pytensor internals to see the big picture, but could an interim solution be:

  1. If the data passed to pm.Data() is a masked array, convert it to an array with nan values before passing it to pytensor, ensuring that masked values do not get silently unmasked downstream.
  2. Update the code under Model.make_obs_var() that splits the RV into observed/missing portions to do the split based on nan values (or the mask, if one exists), roughly as sketched below.
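
A very rough sketch of those two steps (free-standing helpers for illustration only; the real change would live inside pm.Data() and Model.make_obs_var()):

import numpy as np

def to_nan_array(values):
    # Step 1: fill masked entries with nan so the missing positions survive
    # conversion to a pytensor constant/shared variable.
    if isinstance(values, np.ma.MaskedArray):
        return values.filled(np.nan)
    return np.asarray(values)

def missing_mask(values):
    # Step 2: recover the observed/missing split from the nan positions
    # (or from the mask, if the original masked array is still available).
    mask = getattr(values, "mask", None)
    if mask is None:
        mask = np.isnan(values)
    return np.asarray(mask)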

That way, automatic imputation would remain backward compatible, while anyone who passes nan/masked arrays to pm.Data() and then on to observed could also benefit from it.

Appreciate it may break some implicit assumptions elsewhere, though.

@ricardoV94
Member

@kamicollo that seems reasonable. Would you mind opening a PR for that?

@kamicollo
Author

Yes, I can have a go at that. I may come back if I run into issues, as I see the current code relies a lot on subtensors, and I may need some help to figure out exactly how to leverage them.

@jonsedar
Contributor

jonsedar commented Nov 7, 2024

Stumbling into this 18 months later, I'd like to be able to support the usual workflow while allowing data to be missing in specified features:

  1. Sample the posterior (and posterior predictive, let's live a little) on the in-sample dataset
  2. Replace the data with an out-of-sample dataset and make forecast predictions (the usual pm.Data / pm.set_data pattern, sketched below)
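
For reference, the usual pattern for step 2 when there is no missing data (a generic sketch with made-up data, not the model from this thread):

import numpy as np
import pymc as pm

x_train = np.random.normal(size=100)
y_train = np.random.normal(loc=2 * x_train)
x_holdout = np.random.normal(size=20)

with pm.Model() as model:
    x = pm.Data("x", x_train)
    beta = pm.Normal("beta", 0, 1)
    # shape=x.shape lets the likelihood resize when the data container is swapped
    y = pm.Normal("y", beta * x, 1.0, observed=y_train, shape=x.shape)
    idata = pm.sample()

with model:
    pm.set_data({"x": x_holdout})
    idata_ppc = pm.sample_posterior_predictive(idata, var_names=["y"], predictions=True)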

The auto-impute works great for the in-sample dataset, but when I want to replace data in the Data containers, there doesn't seem to be a way to replace the masked array.

E.g. inside the model context I want to replace dfx_train with dfx_holdout, but there seems to be no straightforward way to do this:

...
xk_ma = np.ma.masked_array(dfx_train[fts_xk].values, mask=np.isnan(dfx_train[fts_xk].values))
xk_mu = pm.Normal('xk_mu', mu=0.0, sigma=1, dims='xk_nm')
xk = pm.Normal('xk', mu=xk_mu, sigma=1.0, observed=xk_ma, dims=('oid', 'xk_nm'))
...

EDIT: I can get around that by rebuilding the model using dfx_holdout. However, I then encounter a more fundamental problem: the xk_unobserved RV in the posterior is, of course, sized according to the missing data values in dfx_train. So it can have the wrong dimension for use in, e.g.:

ida_ppc = pm.sample_posterior_predictive(trace=ida.posterior, var_names=['xk'], predictions=True)

Any ideas for alternative methods? @ricardoV94

EDIT EDIT: I found a hacky way! It's a two-step process: sample_ppc the new xk_unobserved, update the idata dataset, and then sample_ppc again, this time for the yhat (rough sketch below). I'm writing this up in a pymc-examples notebook as we speak.
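
Something like this, roughly (names such as m_holdout and yhat follow the description above, and the exact xarray bookkeeping may need more care in practice):

with m_holdout:  # model rebuilt on dfx_holdout
    # Step 1: sample the missing-data RV, now sized for the hold-out dataset
    ppc_xk = pm.sample_posterior_predictive(ida, var_names=["xk_unobserved"])

# Step 2: swap the hold-out-sized draws into the posterior group
posterior = ida.posterior.drop_vars("xk_unobserved", errors="ignore")
posterior["xk_unobserved"] = ppc_xk.posterior_predictive["xk_unobserved"]

with m_holdout:
    # Step 3: sample the downstream predictions from the patched posterior
    ida_ppc = pm.sample_posterior_predictive(posterior, var_names=["yhat"], predictions=True)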

@ricardoV94
Member

ricardoV94 commented Nov 7, 2024

@jonsedar you can build the imputed model manually instead of relying on observed= with nan. This allows a pm.Data (shared) boolean variable as the mask, which can later be updated. There's a test for it here:

def test_multivariate_shared_mask_separable(self):
    mask = shared(np.array([True]))
    obs_data = np.array([[0.1, 0.4, 0.1, 0.4]])
    unobs_data = np.array([[0.4, 0.1, 0.4, 0.1]])
    rv = pm.Dirichlet.dist([1, 2, 3, 4], shape=(1, 4))
    (obs_rv, obs_mask), (unobs_rv, unobs_mask), joined_rv = create_partial_observed_rv(rv, mask)
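
For example, a minimal sketch along the lines of that test, with the shared mask later updated in place (the import path for create_partial_observed_rv is an assumption, taken from where the tests exercise it):

import numpy as np
import pymc as pm
from pytensor import shared

# Assumed import location of the helper used in the test above
from pymc.distributions.distribution import create_partial_observed_rv

mask = shared(np.array([True]))  # True marks the missing entries
rv = pm.Dirichlet.dist([1, 2, 3, 4], shape=(1, 4))
(obs_rv, obs_mask), (unobs_rv, unobs_mask), joined_rv = create_partial_observed_rv(rv, mask)

# Later, e.g. when switching to a hold-out dataset, swap the mask in place:
mask.set_value(np.array([False]))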

In general, the auto-imputation shouldn't come from some magic observed but probably from a transform like pm.observe (we could add a mask kwarg), since it creates new variables and deterministics under the hood. It's an eager model transformation if you will.

@jonsedar
Contributor

jonsedar commented Nov 7, 2024

Thanks @ricardoV94, that's interesting, nice idea. FWIW I'll have this example notebook ready for a PR today or tomorrow and hope to get your input :) I'm hoping it can be a simple one-stop shop to demonstrate a few approaches.
