Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pymc3.Data converts input data to float64 type - so int data cannot later be used as an index #3813

Closed
hottwaj opened this issue Feb 19, 2020 · 12 comments · Fixed by #3925
Closed

Comments

@hottwaj
Copy link
Contributor

hottwaj commented Feb 19, 2020

Hi there guys

I'd like to create a model that I want to fit many times to different datasets for cross validation purposes.

One of my columns of input data is categorical, so I use it to index a vector of RVs depending on which category is presented in each sample of data. Something like this:

cat_mvs = [pymc3.Normal(c, mu = 0, sd = 0.05) for c in unique_categories]

cat_mv_vector = pymc3.math.stack(cat_mvs)
cat_data = pymc3.Data('Categorical Input Data', category_codes)
sample_cat_mv = cat_mv_vector[cat_data]

Note that my category_codes data is a numpy array of integers

That last line of code above triggers an error, here's the traceback within pymc3:

~/.pyenv/versions/3.7.2/envs/jupyterlab-3.7.2/lib/python3.7/site-packages/theano/tensor/var.py in __getitem__(self, args)
    568                             TensorVariable, TensorConstant,
    569                             theano.tensor.sharedvar.TensorSharedVariable))):
--> 570                 return self.take(args[axis], axis)
    571             else:
    572                 return theano.tensor.subtensor.advanced_subtensor(self, *args)

~/.pyenv/versions/3.7.2/envs/jupyterlab-3.7.2/lib/python3.7/site-packages/theano/tensor/var.py in take(self, indices, axis, mode)
    612 
    613     def take(self, indices, axis=None, mode='raise'):
--> 614         return theano.tensor.subtensor.take(self, indices, axis, mode)
    615 
    616     # COPYING

~/.pyenv/versions/3.7.2/envs/jupyterlab-3.7.2/lib/python3.7/site-packages/theano/tensor/subtensor.py in take(a, indices, axis, mode)
   2448             return advanced_subtensor1(a.flatten(), indices)
   2449         elif axis == 0:
-> 2450             return advanced_subtensor1(a, indices)
   2451         else:
   2452             if axis < 0:

~/.pyenv/versions/3.7.2/envs/jupyterlab-3.7.2/lib/python3.7/site-packages/theano/gof/op.py in __call__(self, *inputs, **kwargs)
    613         """
    614         return_list = kwargs.pop('return_list', False)
--> 615         node = self.make_node(*inputs, **kwargs)
    616 
    617         if config.compute_test_value != 'off':

~/.pyenv/versions/3.7.2/envs/jupyterlab-3.7.2/lib/python3.7/site-packages/theano/tensor/subtensor.py in make_node(self, x, ilist)
   1701         ilist_ = theano.tensor.as_tensor_variable(ilist)
   1702         if ilist_.type.dtype not in theano.tensor.integer_dtypes:
-> 1703             raise TypeError('index must be integers')
   1704         if ilist_.type.ndim != 1:
   1705             raise TypeError('index must be vector')

TypeError: index must be integers

It seems that within pymc3.Data(), my category_codes data is being coerced to float64, which is not a valid indexing type.

Looking at the source for pymc3.Data() I think the problem is ultimately in the called function pymc3.model.pandas_to_array which converts its input data to a float on its last line, see https://github.com/pymc-devs/pymc3/blob/master/pymc3/model.py#L1495

Can pymc3.Data() and/or pymc3.model.pandas_to_array be changed to be preserve the input data type?

Thanks!

@hottwaj
Copy link
Contributor Author

hottwaj commented Feb 19, 2020

A temporary workaround seems to be to use theano to cast my index data back to ints:

from theano import tensor as tt
sample_cat_mv = cat_mv_vector[tt.cast(cat_data, 'int8')]

@rpgoldman
Copy link
Contributor

I have something like this in my code:

medium_id = self.Data("medium", df_value("medium_idx", dtype='int'))

self.Data is so I can store some information about the data before invoking pm.Data, df_value pulls information out of a Pandas DataFrame. Internally it does df[key].to_numpy(dtype=dtype).

So I believe that what happens here is that pm.Data if given something that has a dtype specified, will retain that dtype.

TBQH, I am not sure why this works, because it looks like the code inside pm.Data, which invokes pymc3.util.pandas_to_array() on its data argument, which in turn invokes pm.theanof.floatX(), doesn't seem to do anything to take into account the dtype of the array.

One thing to do would be to simply sidestep pm.Data and use theano.shared()

@AlexAndorra
Copy link
Contributor

AlexAndorra commented Feb 19, 2020

Hi!
I think you can also use something like pm.intX(pm.Data("category_codes", category_codes)). I tested it several times to do what you want to do and it should work.

Robert's solution (using theano.shared() instead of pm.Data) is also quite elegant, depending on how familiar you are with theano!
PyMCheers 🖖

@rpgoldman
Copy link
Contributor

FWIW, this makes me wonder why pm.Data does not accept a dtype argument.

@AlexAndorra
Copy link
Contributor

AlexAndorra commented Feb 19, 2020 via email

@hottwaj
Copy link
Contributor Author

hottwaj commented Feb 23, 2020

Hi guys, thanks for your comments! I would be happy to have a go at a PR, but what form should the fix take? From what you said I think the choices are 1&2 and A&B below?

either:

  1. changing pymc3.Data to accept a dtype kwarg

  2. or, getting pymc3.Data to check the input array's dtype automatically?

and
A. calling theano.shared directly from pymc3.Data() (possibly dropping the call to pymc3.util.pandas_to_array()?)

B. or, changing pymc3.util.pandas_to_array() to take accept/check for different dtypes?

Cheers

@rpgoldman
Copy link
Contributor

My preference would be to do number 1, and ideally to have pymc3.Data check input dtype as well, and raise a UserWarning on data type mismatch (e.g., when an integer is turned into a float).

I don't have a strong opinion on A vs. B, but i haven't looked into this yet.

The argument for this is the principle of minimal surprise (I didn't expect that integer to turn into a float!), but the argument against this is that programmers using numpy are notoriously sloppy about types, because numpy tries to be clever about this. Personally, I'm not a big fan of that approach, but my bet is that there are a ton of people who sloppily use 1 or 0 to initialize arrays, confident that they will turn into 1.0 or 0.0...

@AlexAndorra
Copy link
Contributor

Yeah, I think the best would probably be 1 and 2 😆 But if too big a change, and if we assume that people are sloppy about types, then maybe 2 would be better, as it automates the process.

No strong opinion on A and B either -- maybe a slight preference for A: if theano.shared offers a robust solution, why rely on another utility? But I haven't looked into this yet and don't have the same knowledge of the code base as Robert.

@rpgoldman
Copy link
Contributor

@hottwaj To return to this, I think 1 + B would be the right approach. I note that there is already an intX function in theanof.py that could help.

I'm not sure what dtypes we should permit, though. Anything other than just int and float? Or do we also need to handle categoricals in pm.Data()? @AlexAndorra ?

How about making a WIP pull request to get this started?

@AlexAndorra
Copy link
Contributor

Yeah intX looks quite useful; I'm already using it when I need to use pm.Data for indexing -- pm.intX(pm.Data("category_codes", category_codes)).

I'm not sure we need any other types than int and float, as pm.Data is used either for predictors or for indexers. So, in both cases, these are numbers and I don't see how strings (categoricals, etc.) could be used, as you need them for computation in the model.

@hottwaj
Copy link
Contributor Author

hottwaj commented Feb 26, 2020

Great thanks guys, I will go with 1+B and add support for int alongside existing assumption of float. It will maybe take me a week or so to find a time slot to squeeze the work into. Will submit a PR to indicate WIP. Cheers!

hottwaj added a commit to hottwaj/pymc3 that referenced this issue Feb 26, 2020
…nput data (previously all input data was coerced to float)

WIP for pymc-devs#3813
@hottwaj
Copy link
Contributor Author

hottwaj commented Feb 26, 2020

Actually rather than sitting on this I have done some initial changes and submitted a PR.

pandas_to_array now accepts a dtype kwarg and assumes float by default. Only float and int types (including numpy float/int 32/64 types) are accepted.

(minor issue: should e.g. int8, int16 be accepted? is there a generic way of testing that a dtype is a numpy int/float type?)

Data() also accepts a dtype kwarg. The dtype kwarg is set to None by default, indicating that Data() should try to determine the dtype for itself, in which case it uses the following logic:

  1. use dtype of passed input data, if it the input data has a dtype attr
  2. otherwise test if input data is instance of int
  3. otherwise assume float

So I've deviated a bit from what we agreed and implemented more of a 2+B :) Happy to revert to an implementation in the style of 1+B if that's what you'd prefer though.

Thanks!

lucianopaz pushed a commit that referenced this issue May 18, 2020
* Initial changes to allow pymc3.Data() to support both int and float input data (previously all input data was coerced to float)
WIP for #3813

* added exception for invalid dtype input to pandas_to_array

* Refined implementation

* Finished dtype conversion handling

* Added SharedVariable option to getattr_value

* Added dtype handling to set_data function

* Added tests for pm.Data used for index variables

* Added tests for using pm.data as RV input

* Ran Black on data tests files

* Added release note

* Updated release notes

* Updated code in light of Luciano's comments

* Fixed implementation of integer checking

* Simplified implementation of type checking

* Corrected implementation for other uses of pandas_to_array

Co-authored-by: hottwaj <jonathan.a.clarke@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants