
BUG: automatic data imputation does not work when observed=pm.Data() tensors #6626


Open
kamicollo opened this issue Mar 29, 2023 · 7 comments

@kamicollo

Describe the issue:

Automatic imputation fails silently in PyMC if a user passes partially observed data held in pm.ConstantData() or pm.MutableData() to the observed parameter of any distribution. In simple models, the user won't be able to sample (as the log-likelihood evaluates to nan), but I have also been able to run more complex (GP) models that did sample, likely producing wrong results (see https://discourse.pymc.io/t/issue-imputing-data-for-gaussian-process-model/11626/3 for details).

Based on my initial review of the source code, it seems the culprit is the Model.make_obs_var() method, where the check for whether the passed data is masked is performed with mask = getattr(data, "mask", None), which always returns None for tensors.
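
To illustrate the root cause (a minimal sketch, not the actual Model.make_obs_var() code): a NumPy masked array exposes a mask attribute, but once the values are wrapped in a pytensor tensor that attribute is gone, so the getattr check always comes back None:

import numpy as np
import pytensor.tensor as pt

masked = np.ma.masked_invalid(np.array([1.0, np.nan, 3.0]))
print(getattr(masked, "mask", None))                # [False  True False] -> imputation kicks in

tensor = pt.as_tensor_variable(np.asarray(masked))  # roughly what pm.Data wraps the values in
print(getattr(tensor, "mask", None))                # None -> imputation is silently skipped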

In the case of pm.ConstantData(), the fix appears to be quite simple: the mask needs to be retrieved with mask = getattr(data.value, "mask", None) instead. In the case of pm.MutableData(), however, the issue seems to be that pytensor.shared() does not maintain masked values. That is very problematic on its own if masked values are represented by actual numbers rather than np.nan. I'll file an issue under the pytensor project about this, too.

I'd be happy to contribute a PR for the pm.ConstantData() fix, plus possibly a NotImplementedError for pm.MutableData() if this indeed cannot be solved in other ways. I'm new to the PyMC code base and may be missing the big picture!

Reproducible code example:

import pymc as pm
import numpy as np

real_X = np.random.default_rng().normal(size=1000)
Y = np.random.default_rng().normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan
masked_X = np.ma.masked_where(np.isnan(X), X)

with pm.Model() as m:

    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)

    # This works
    # X = pm.Normal("X", 0, 1, observed = masked_X, dims='test')   
    
    # This even fails to sample
    X = pm.Normal("X", 0, 1, observed = pm.ConstantData("masked_X", masked_X))    
    
    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y) 

    pm.sample()

Error message:

No response

PyMC version information:

pymc 5.1.2
pytensor 2.10.1

Context for the issue:

The fact that it fails silently on some models is particularly concerning: it means some users may be using PyMC and getting wrong inference results.

@ricardoV94
Member

In light of pymc-devs/pytensor#258, it seems that MaskedArrays can't currently be supported via PyMC Data (Mutable or Constant), but arrays simply containing nan should be. In that case, we could at least support the following example:

import pymc as pm
import numpy as np
real_X = np.random.default_rng().normal(size=1000)
Y = np.random.default_rng().normal(loc=3 * real_X, scale=0.1)
X = real_X.copy()
X[0:10] = np.nan

with pm.Model() as m:
    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)
    X = pm.Normal("X", 0, 1, observed=pm.ConstantData("X_with_nans", X))
    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y)

m.compile_logp()(m.initial_point())  # array(nan)

By introspecting the values of X_with_nans

@kamicollo
Author

kamicollo commented Mar 30, 2023

@ricardoV94, it looks like your previous message got cut off. Based on what you explained in pymc-devs/pytensor#258, this is a bit more complicated than I thought; I appreciate you explaining it.

At the same time, automatic imputation of missing values is a core concept in Bayesian workflows, and PyMC's current support for it is a bit awkward. On the one hand, it's possible to leverage it by passing masked arrays directly to observed, because the automatic imputation logic in Model.make_obs_var() relies on the mask attribute to split the RV into observed and missing portions. On the other hand, all workflows that want to get predictions or otherwise "swap" data need to rely on pm.Data, so it's impossible to combine automatic imputation with more modular code.

I am not familiar enough with the PyMC/pytensor internals to see the big picture, but could an interim solution be:

  1. If the data passed to pm.Data() is a masked array, convert it to an array with nan values before passing it to pytensor, ensuring that masked values do not get silently unmasked downstream.
  2. Update the code under Model.make_obs_var() that splits the RV into observed/missing portions to do the split based on nan values (or the mask, if one exists), roughly as sketched below.
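
A very rough sketch of those two steps (free-standing helpers for illustration only; the real change would live inside pm.Data() and Model.make_obs_var()):

import numpy as np

def to_nan_array(values):
    # Step 1: fill masked entries with nan so the missing positions survive
    # conversion to a pytensor constant/shared variable.
    if isinstance(values, np.ma.MaskedArray):
        return values.filled(np.nan)
    return np.asarray(values)

def missing_mask(values):
    # Step 2: recover the observed/missing split from the nan positions
    # (or from the mask, if the original masked array is still available).
    mask = getattr(values, "mask", None)
    if mask is None:
        mask = np.isnan(values)
    return np.asarray(mask)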

That way, automatic imputation would remain backward compatible, while anyone who passes nan/masked arrays to pm.Data() and then on to observed could also benefit from it.

Appreciate it may break some implicit assumptions elsewhere, though.

@ricardoV94
Member

@kamicollo that seems reasonable. Would you mind opening a PR for that?

@kamicollo
Author

Yes, I can have a go at that. I may come back if I run into issues, as I see the current code relies a lot on subtensors, and I may need some help to figure out exactly how to leverage them.

@jonsedar
Contributor

jonsedar commented Nov 7, 2024

Stumbling into this 18 months later, I'd like to be able to support the usual workflow while allowing data to be missing in specified features:

  1. Sample the posterior (and posterior predictive, let's live a little) on the in-sample dataset
  2. Replace the data with an out-of-sample dataset and make forecast predictions (the usual pm.Data / pm.set_data pattern, sketched below)
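
For reference, the usual pattern for step 2 when there is no missing data (a generic sketch with made-up data, not the model from this thread):

import numpy as np
import pymc as pm

x_train = np.random.normal(size=100)
y_train = np.random.normal(loc=2 * x_train)
x_holdout = np.random.normal(size=20)

with pm.Model() as model:
    x = pm.Data("x", x_train)
    beta = pm.Normal("beta", 0, 1)
    # shape=x.shape lets the likelihood resize when the data container is swapped
    y = pm.Normal("y", beta * x, 1.0, observed=y_train, shape=x.shape)
    idata = pm.sample()

with model:
    pm.set_data({"x": x_holdout})
    idata_ppc = pm.sample_posterior_predictive(idata, var_names=["y"], predictions=True)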

The auto-impute works great for the in-sample dataset, but when I want to replace data in the Data containers, there doesn't seem to be a way to replace the masked array.

E.g. inside the model context I want to replace dfx_train with dfx_holdout, but there seems to be no straightforward way to do this:

...
xk_ma = np.ma.masked_array(dfx_train[fts_xk].values, mask=np.isnan(dfx_train[fts_xk].values))
xk_mu = pm.Normal('xk_mu', mu=0.0, sigma=1, dims='xk_nm')
xk = pm.Normal('xk', mu=xk_mu, sigma=1.0, observed=xk_ma, dims=('oid', 'xk_nm'))
...

EDIT: I can get around that by rebuilding the model using dfx_holdout. However, I then encounter a more fundamental problem: the xk_unobserved RV in the posterior is, of course, sized according to the missing data values in dfx_train. So it can have the wrong dimension for use in, e.g.:

ida_ppc = pm.sample_posterior_predictive(trace=ida.posterior, var_names=['xk'], predictions=True)

Any ideas for alternative methods? @ricardoV94

EDIT EDIT: I found a hacky way! It's a two-step process: sample_ppc the new xk_unobserved, update the idata dataset, and then sample_ppc again, this time for the yhat (rough sketch below). I'm writing this up in a pymc-examples notebook as we speak.
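
Something like this, roughly (names such as m_holdout and yhat follow the description above, and the exact xarray bookkeeping may need more care in practice):

with m_holdout:  # model rebuilt on dfx_holdout
    # Step 1: sample the missing-data RV, now sized for the hold-out dataset
    ppc_xk = pm.sample_posterior_predictive(ida, var_names=["xk_unobserved"])

# Step 2: swap the hold-out-sized draws into the posterior group
posterior = ida.posterior.drop_vars("xk_unobserved", errors="ignore")
posterior["xk_unobserved"] = ppc_xk.posterior_predictive["xk_unobserved"]

with m_holdout:
    # Step 3: sample the downstream predictions from the patched posterior
    ida_ppc = pm.sample_posterior_predictive(posterior, var_names=["yhat"], predictions=True)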

@ricardoV94
Member

ricardoV94 commented Nov 7, 2024

@jonsedar you can build the imputed model manually instead of relying on observed= with nan. This allows a pm.Data (shared) boolean variable as the mask, which can later be updated. There's a test for it here:

def test_multivariate_shared_mask_separable(self):
    mask = shared(np.array([True]))
    obs_data = np.array([[0.1, 0.4, 0.1, 0.4]])
    unobs_data = np.array([[0.4, 0.1, 0.4, 0.1]])
    rv = pm.Dirichlet.dist([1, 2, 3, 4], shape=(1, 4))
    (obs_rv, obs_mask), (unobs_rv, unobs_mask), joined_rv = create_partial_observed_rv(rv, mask)
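
For example, a minimal sketch along the lines of that test, with the shared mask later updated in place (the import path for create_partial_observed_rv is an assumption, taken from where the tests exercise it):

import numpy as np
import pymc as pm
from pytensor import shared

# Assumed import location of the helper used in the test above
from pymc.distributions.distribution import create_partial_observed_rv

mask = shared(np.array([True]))  # True marks the missing entries
rv = pm.Dirichlet.dist([1, 2, 3, 4], shape=(1, 4))
(obs_rv, obs_mask), (unobs_rv, unobs_mask), joined_rv = create_partial_observed_rv(rv, mask)

# Later, e.g. when switching to a hold-out dataset, swap the mask in place:
mask.set_value(np.array([False]))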

In general, the auto-imputation shouldn't come from some magic observed but probably from a transform like pm.observe (we could add a mask kwarg), since it creates new variables and deterministics under the hood. It's an eager model transformation if you will.

@jonsedar
Contributor

jonsedar commented Nov 7, 2024

Thanks @ricardoV94, that's interesting, nice idea. FWIW I'll have this example notebook ready for a PR today or tomorrow and hope to get your input :) I'm hoping it can be a simple one-stop shop to demonstrate a few approaches.
