-
-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add test for from_cmdstanpy to distinguish between vectors of length 1 and scalars #1646
Comments
FWIW, this example works fine if these lines are removed: https://github.com/arviz-devs/arviz/blob/main/arviz/data/io_cmdstanpy.py#L594-L595 I don't know if it will break anything else, but I can make a quick PR .. |
There was probably some reason for the logic. code should check if the variable is 'scalar' or nD object. |
I think we added that to distinguish between multidimensional arrays and vectors to fix the previous issue where everything was flattened. We need some tests to check all possible cases, scalar, 1d array of lenght 1, 1d array of length n and nd array to make sure removing this doesn't break anything. |
And we should probably test on length 0 arrays for sampling wrappers and reloo |
This is a problem I'm running into with reloo since my |
@bmorris3 This hasn't been integrated into the main branch due to lack of testing, but I made this change which resolved the problem for me: https://github.com/arviz-devs/arviz/compare/main...jburos:fix-cmdstanpy-length1?expand=1 I'd be curious if this fixes your issue with |
Thanks @jburos, I've made some tweaks that get the |
I don't think a scalar khat makes any sense, loo and waic require the pointwise log likelihood values so if you have the aggregated log likelihood term you can't really use them (even if technically they do take an array of log probabilities so ArviZ's function will run without problem one way or another). Moreover, running reloo on a single observation model instead of looping over the khats and refitting for the bad ones only should be exactly the same as refitting the model by itself, or to refitting the model with no data at all? 🤔 IIRC, https://arviz-devs.github.io/arviz/user_guide/pystan_refitting.html uses what I mean by these length 0 arrays (when fitting the whole model, the
I wanted to give a quick look at the changes and I seem to have opened a PR from your fork @jburos 😅 . If you want and have time we'll be happy to review and help with testing but I think you should be the one to open the PR. Otherwise I think it's fine to keep the comment here in case someone else wants to take over, add the test and submit the PR themelves. |
Noting here, in response to the current PR, that this bug persists with the latest version of cmdstanpy ( Error output: In [6]: idata1 = az.from_cmdstanpy(fit1, coords = coords1, dims = dims,
...: posterior_predictive = ["y_hat"],
...: prior_predictive = ['y_hat'],
...: log_likelihood = ["log_lik"])
/Users/jburos/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/base.py:130: UserWarning: In variable theta_tilde, there are more dims (1) given than exist (0). Passed array should have shape (chain,draw, *shape)
warnings.warn(
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 idata1 = az.from_cmdstanpy(fit1, coords = coords1, dims = dims,
2 posterior_predictive = ["y_hat"],
3 prior_predictive = ['y_hat'],
4 log_likelihood = ["log_lik"])
File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/io_cmdstanpy.py:834, in from_cmdstanpy(posterior, posterior_predictive, predictions, prior, prior_predictive, observed_data, constant_data, predictions_constant_data, log_likelihood, index_origin, coords, dims, save_warmup, dtypes)
768 def from_cmdstanpy(
769 posterior=None,
770 *,
(...)
783 dtypes=None,
784 ):
785 """Convert CmdStanPy data into an InferenceData object.
786
787 For a usage example read the
(...)
832 InferenceData object
833 """
--> 834 return CmdStanPyConverter(
835 posterior=posterior,
836 posterior_predictive=posterior_predictive,
837 predictions=predictions,
838 prior=prior,
839 prior_predictive=prior_predictive,
840 observed_data=observed_data,
841 constant_data=constant_data,
842 predictions_constant_data=predictions_constant_data,
843 log_likelihood=log_likelihood,
844 index_origin=index_origin,
845 coords=coords,
846 dims=dims,
847 save_warmup=save_warmup,
848 dtypes=dtypes,
849 ).to_inference_data()
File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/io_cmdstanpy.py:456, in CmdStanPyConverter.to_inference_data(self)
446 def to_inference_data(self):
447 """Convert all available data to an InferenceData object.
448
449 Note that if groups can not be created (i.e., there is no `output`, so
450 the `posterior` and `sample_stats` can not be extracted), then the InferenceData
451 will not have those groups.
452 """
453 return InferenceData(
454 save_warmup=self.save_warmup,
455 **{
--> 456 "posterior": self.posterior_to_xarray(),
457 "sample_stats": self.sample_stats_to_xarray(),
458 "posterior_predictive": self.posterior_predictive_to_xarray(),
459 "predictions": self.predictions_to_xarray(),
460 "prior": self.prior_to_xarray(),
461 "sample_stats_prior": self.sample_stats_prior_to_xarray(),
462 "prior_predictive": self.prior_predictive_to_xarray(),
463 "observed_data": self.observed_data_to_xarray(),
464 "constant_data": self.constant_data_to_xarray(),
465 "predictions_constant_data": self.predictions_constant_data_to_xarray(),
466 "log_likelihood": self.log_likelihood_to_xarray(),
467 },
468 )
File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/base.py:65, in requires.__call__.<locals>.wrapped(cls)
63 if all((getattr(cls, prop_i) is None for prop_i in prop)):
64 return None
---> 65 return func(cls)
File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/io_cmdstanpy.py:139, in CmdStanPyConverter.posterior_to_xarray(self)
135 dims = deepcopy(self.dims) if self.dims is not None else {}
136 coords = deepcopy(self.coords) if self.coords is not None else {}
138 return (
--> 139 dict_to_dataset(
140 data,
141 library=self.cmdstanpy,
142 coords=coords,
143 dims=dims,
144 index_origin=self.index_origin,
145 ),
146 dict_to_dataset(
147 data_warmup,
148 library=self.cmdstanpy,
149 coords=coords,
150 dims=dims,
151 index_origin=self.index_origin,
152 ),
153 )
File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/base.py:307, in dict_to_dataset(data, attrs, library, coords, dims, default_dims, index_origin, skip_event_dims)
305 data_vars = {}
306 for key, values in data.items():
--> 307 data_vars[key] = numpy_to_data_array(
308 values,
309 var_name=key,
310 coords=coords,
311 dims=dims.get(key),
312 default_dims=default_dims,
313 index_origin=index_origin,
314 skip_event_dims=skip_event_dims,
315 )
316 return xr.Dataset(data_vars=data_vars, attrs=make_attrs(attrs=attrs, library=library))
File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/base.py:254, in numpy_to_data_array(ary, var_name, coords, dims, default_dims, index_origin, skip_event_dims)
252 # filter coords based on the dims
253 coords = {key: xr.IndexVariable((key,), data=np.asarray(coords[key])) for key in dims}
--> 254 return xr.DataArray(ary, coords=coords, dims=dims)
File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/xarray/core/dataarray.py:402, in DataArray.__init__(self, data, coords, dims, name, attrs, indexes, fastpath)
400 data = _check_data_shape(data, coords, dims)
401 data = as_compatible_data(data)
--> 402 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
403 variable = Variable(dims, data, attrs, fastpath=True)
404 indexes = dict(
405 _extract_indexes_from_coords(coords)
406 ) # needed for to_dataset
File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/xarray/core/dataarray.py:121, in _infer_coords_and_dims(shape, coords, dims)
119 dims = tuple(dims)
120 elif len(dims) != len(shape):
--> 121 raise ValueError(
122 "different number of dimensions on data "
123 f"and dims: {len(shape)} vs {len(dims)}"
124 )
125 else:
126 for d in dims:
ValueError: different number of dimensions on data and dims: 2 vs 3
In [7]: cmdstanpy.__version__
Out[7]: '1.0.1' |
The issue should be now fixed (thanks to #2023) but we should still add a test for this situation. |
EDIT (@OriolAbril): See #1646 (comment) for an up to date description of the issue.
Describe the bug
When fitting a model using cmdstanpy and reading it into an arviz.InferenceData object, I sometimes end up with a vector parameter with length 1.
When this happens, the io_cmdstanpy code incorrectly ignores the last dimension, causing an error when I provide coord/dims for that parameter.
To Reproduce
I have created a reproducible example here:
The error looks like this:
I think this is happening in this line: https://github.com/arviz-devs/arviz/blob/main/arviz/data/io_cmdstanpy.py#L594-L595
Expected behavior
The from_cmdstanpy method should ideally know (from the dims) that this is a 1d vector or array, and label it as such.
Additional context
arviz version:
0.11.2
(github hash: 23e14fb)cmdstanpy version:
0.9.68
python version: 3.8.7 (default, Feb 3 2021, 06:31:03) \n[Clang 12.0.0 (clang-1200.0.32.29)]
The text was updated successfully, but these errors were encountered: