Allow specification of dims instead of shape #3551

aseyboldt · 2019-07-18T11:05:19Z

Right now we always have to specify the shapes of random variables as numbers. If we use arviz to analyse the trace we then need to specify the dimension names and coordinates for those variables. It would be much easier if we could specify the dimension names and coordinates while building the model.
This is a WIP implementation of that idea. There are two ways to specify new dimensions and their coordinates:

Using a pm.Data variable with a pandas dataset. We can extract the index and columns of this and remember it for later
Using the new coords argument to model.

The stochastic volatility model from the getting started notebook could look this this:

# Has a datetime index
returns = pd.read_csv(pm.get_data('SP500.csv'), parse_dates=True, index_col=0)

with pm.Model() as sp500_model:
    # The model remembers the datetime index with the name 'date'
    # Maybe we should add a `store_coords=True` argument, so that
    # it becomes more obvious where the new dimension / coordinate
    # is coming from.
    values = pm.Data('returns', returns['change'], dims='date')
    
    nu = pm.Exponential('nu', 1/10., testval=5.)
    sigma = pm.Exponential('sigma', 1/0.02, testval=.1)

    # We can now figure out the shape of this variable based on the
    # index of the dataset
    s = pm.GaussianRandomWalk('s', sigma=sigma, dims='date')
    volatility_process = pm.Deterministic('volatility_process', pm.math.exp(-2*s)**0.5, dims='date')

    r = pm.StudentT('r', nu=nu, sigma=volatility_process, observed=values, dims='date')

Arviz can then look at model.coords and model.RV_dims so that the coordinates show up in the trace:

>>> sp500_model.coords
{'date': DatetimeIndex(['2008-05-02', '2008-05-05', '2008-05-06', '2008-05-07',
                '2008-05-08', '2008-05-09', '2008-05-12', '2008-05-13',
                '2008-05-14', '2008-05-15',
                ...
                '2009-11-18', '2009-11-19', '2009-11-20', '2009-11-23',
                '2009-11-24', '2009-11-25', '2009-11-27', '2009-11-30',
                '2009-12-01', '2009-12-02'],
               dtype='datetime64[ns]', name='date', length=401, freq=None)}

>>> sp500_model.RV_dims
{'returns': ('date',),
 's': ('date',),
 'volatility_process': ('date',),
 'r': ('date',)}

We can specify completely new coordinates like this:

coords = {
    'a_dim2': pd.Index([6, 7, 8, 9]),
}

with pm.Model(coords=coords) as model:
    pm.Data('foo', pd.Series([1, 2, 3]), dims='a_dim')
    pm.Normal('a', dims=('a_dim', 'a_dim2'))

TODO

More testing and documentation. Do we want to use this in the documentation? (eg getting started).
Do we need to change the dims keyword in pm.Data to make it more obvious that we are also declaring a new dimension?
Add support for model.coords and model.RV_dims to arviz. --> That's WIP Import coords and dims from pymc3 arviz-devs/arviz#757 and will depend on this PR

pymc3/model.py

fonnesbeck · 2019-07-18T14:28:50Z

Perhaps change the getting started notebook to use this? Would be nice to have the example handy.

aseyboldt · 2019-07-19T09:54:42Z

We can also add a pm.TidyData, that deals with translating strings in a dataframe into integer arrays for slicing, and makes sure we still get the right labels in the output.

rpgoldman · 2019-07-19T15:12:26Z

What are the implications of the interaction between this and changing the shape of the model, e.g., by resetting a pm.Data? So, for example, if I change the shape of a set of predictors, and I had a variable whose dims included a set of rows?

aseyboldt · 2019-07-19T16:21:45Z

@rpgoldman It neither fixes nor breaks (as far as I know) current functionality of changing the data. So basically, it doesn't work. :-(
You can always change the values, but if you change the shape, that might also change the shape of unobserved variables. And a lot of our code assumes that those shapes can't change.
I've been experimenting with fixes for this, but for the PR here I think that is out of scope.

rpgoldman · 2019-07-19T18:12:23Z

@rpgoldman It neither fixes nor breaks (as far as I know) current functionality of changing the data. So basically, it doesn't work. :-(
You can always change the values, but if you change the shape, that might also change the shape of unobserved variables. And a lot of our code assumes that those shapes can't change.
I've been experimenting with fixes for this, but for the PR here I think that is out of scope.

As my adviser used to say, "we have reduced this to a previously unsolved problem." 😉

I take your point -- this is really a problem with pm.Data in general, not a problem with coordinates and dimensions. Follow-up suggestion: if the coords are a function of a pm.Data, what about making it possible to extract them from the pm.Data? So if the pm.Data has a number of rows, instead of making us separately specify the row indices, how about providing a mechanism for pulling them out of the pm.Data object? My guess is that there is already a way to do this, but it probably requires knowledge of the underlying theano or at least the implementation of pm.Data.

aseyboldt · 2019-07-19T18:41:15Z

"we have reduced this to a previously unsolved problem."

I like that one 😆

Getting the coords is just df.index and df.columns, but we could definitely add that to pm.Data. I'm not sure what the use-case would be though.
Maybe just pm.Data().coords and pm.Data().dims? The first would return a dict from dim name to df.columns and df.index, the second (df.index.name or kwargs['dims'][0]], df.columns.name or kwargs['dims'][1]).

rpgoldman · 2019-07-19T18:43:45Z

"we have reduced this to a previously unsolved problem."

I like that one 😆

Getting the coords is just df.index and df.columns, but we could definitely add that to pm.Data. I'm not sure what the use-case would be though.
Maybe just pm.Data().coords and pm.Data().dims? The first would return a dict from dim name to df.columns and df.index, the second (df.index.name or kwargs['dims'][0]], df.columns.name or kwargs['dims'][1]).

pm.Data doesn't always come from a pandas dataframe, though, so if a modeler is writing code that will build a model with pm.Data, they might want to use one dimension of the shape of a numpy array, for example.

rpgoldman · 2019-07-20T14:39:32Z

We can also add a pm.TidyData, that deals with translating strings in a dataframe into integer arrays for slicing, and makes sure we still get the right labels in the output.

This looks really useful! I have a model where I have done a lot of indexing (building on the ideas that @junpenglao showed in his schizophrenia study example), and I think this would make that sort of thing much easier.

aseyboldt · 2019-07-20T15:02:09Z

What case study is that? I've been looking for a nice little example that we can use to demonstrate this. rpgoldman <notifications@github.com> schrieb am Sa., 20. Juli 2019, 16:39:

…

We can also add a pm.TidyData, that deals with translating strings in a dataframe into integer arrays for slicing, and makes sure we still get the right labels in the output. This looks really useful! I have a model where I have done a lot of indexing (building on the ideas that @junpenglao <https://github.com/junpenglao> showed in his schizophrenia study example), and I think this would make that sort of thing much easier. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3551>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAOLSHN5QS26ZYGLQX5HI23QAMPSRANCNFSM4IEZRJMQ> .

rpgoldman · 2019-07-21T20:14:39Z

What case study is that? I've been looking for a nice little example that we can use to demonstrate this.

This is the one:

https://colab.research.google.com/github/junpenglao/advance-bayesian-modelling-with-PyMC3/blob/master/Notebooks/Code10%20-%20Schizophrenic_case_study.ipynb

michaelosthege · 2020-06-05T11:48:16Z

I just realize that I have a use case for this feature!
Anybody else interested in re-activating this PR?

Oh and the rt.live model could actually use this too. Currently the coords there are tracked as a separate dict. (cc @twiecki @AlexAndorra @canyon289 )

AlexAndorra · 2020-06-05T12:16:05Z

This feature does sound interesting to have! Le ven. 5 juin 2020 à 13:48, Michael Osthege <notifications@github.com> a écrit :

…

I just realize that I have a use case for this feature! Anybody else interested in re-activating this PR? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3551 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHIJMTALIKG5G3SDPMS44J3RVDLRRANCNFSM4IEZRJMQ> .

michaelosthege · 2020-06-05T15:36:37Z

@aseyboldt @AlexAndorra I fixed the conflicts - let's see what Travis says.
Getting this into 3.9.0 would play nicely with the new return_inferencedata option.

AlexAndorra · 2020-06-06T12:55:30Z

I don't know about you, but on my browser the tests are still pending, which is weird, since your latest commit is from 21 hours ago 🤔

michaelosthege · 2020-06-07T14:36:36Z

@AlexAndorra I triggered a pipeline build (https://travis-ci.org/github/pymc-devs/pymc3/builds/695670690) but the commit IDs don't match. I don't understand it..

But please review and there'll likely be some change that will trigger them again.

AlexAndorra

This is very nice, thanks a lot for reviving it @michaelosthege 👏
I left a bunch of comments, as this is my first review on this PR. Tell me if anything is unclear.

Also, this will definitely need some new tests. I guess updating the Data container example NB will give us ideas about those tests 😉

Travis is weirdly still hanging on GH, but when you go to Travis page, it seems like the tests passed, so this looks a like a bug -- let's see if it disappears with the new commits 🤷‍♂️

RELEASE-NOTES.md

pymc3/data.py

pymc3/model.py

Co-authored-by: Alexandre ANDORRA <andorra.alexandre@gmail.com>

michaelosthege · 2020-06-09T10:53:51Z

I just pushed a few commits, including the removal of isidentifier. I changed my opinion about that restriction: As Alex pointed out, one big advantage of the feature is to remove the need for keeping dim names around separately. Restricting the names, would mean that the user still has to drag them around itself.

Some questions are still unresolved:

the name logic looks like it's closely connected to how the API shall be used
it's unclear to me if we want to keep TidyData around...

As this is a new feature, I'd personally prefer to keep it simple & not introduce 4 different ways how coords/dims can be specified.

AlexAndorra · 2020-06-10T10:59:54Z

Ok, just pushed my changes 🎉

Fixed some stuff in data.py
Updated the "Data container" and "Getting started" NBs with the new functionalities and reran them with latest PyMC and ArviZ (ReviewNB won't trigger because the PR was created before we connected the app to our repo I think)

I think the most important is to check the implementation in the code and the new docstrings I wrote for pm.Data.
I won't have time to write the tests today, so feel free to take the lead on that if my changes look good and you have some time 🙂

michaelosthege · 2020-06-10T17:31:37Z

GitHub doesn't show it, but the latest Travis build is green 🎉

@AlexAndorra please mark your approval and merge :)

AlexAndorra · 2020-06-10T17:51:57Z

@michaelosthege I approved but I can't merge yet because Travis didn't tell GH it sucessfully ran 🤦

StanczakDominik · 2020-06-18T14:48:27Z

A note to the next unfortunate soul who searches: the updated notebooks are in the repo, but not yet on the website!

Also, thanks for this, this looks like it'll simplify my models a fair bit! 🙇‍♂️

OlafHaag · 2020-09-17T01:36:30Z

Is there any comprehensible documentation on how to use it? The Primer on Bayesian Methods for Multilevel Modeling uses the dims argument, but never explains what it actually does. I'm new to pymc3 and would like to understand how to use dims/shape and what happens with the slicing notation on TransformedRV/FreeRV with TensorSharedVariables like in the example:

coords = {"Level": ["Basement", "Floor"], "obs_id": np.arange(floor.size)}
with pm.Model(coords=coords) as pooled_model:
    floor_idx = pm.Data("floor_idx", floor, dims="obs_id")
    a = pm.Normal("a", 0.0, sigma=10.0, dims="Level")

    theta = a[floor_idx]
    sigma = pm.Exponential("sigma", 1.0)

    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")

What happens when dims are a tuple and when to use it?
How would one add a state-level above counties using dims in the primer example?

Or take the temperature example. What if the data was in long format with an unequal number of measurements for each city? And how to add a level, e.g. size (small, medium, big), climate zone, or continents using dims?

arvids · 2021-04-04T14:50:37Z

Or take the temperature example. What if the data was in long format with an unequal number of measurements for each city? And how to add a level, e.g. size (small, medium, big), climate zone, or continents using dims?

Did you find a solution for this type of problem?

michaelosthege · 2021-04-04T20:26:59Z

@arvids I'm sorry our documentation is still sparse on this topic. However, the dims will become more important with the upcoming PyMC3 v4and we will work on the docs for that specifically.

To answer your question: If the number of measurements for each city differs, that's a typical scenario for imputation. . It should work with dims just fine. Essentially you just put None into the array positions where you don't have data.

Just a note: If your data is very big and very sparse at the same time, you may run into performance limitations because of the imputed values take some memory too. In these cases one can model the data in its long-form and use tensor indexing when constructing deterministics.

Allow specification of dims instead of shape

e8199eb

aseyboldt mentioned this pull request Jul 18, 2019

WIP Import coords and dims from pymc3 arviz-devs/arviz#757

Closed

ferrine reviewed Jul 18, 2019

View reviewed changes

pymc3/model.py Outdated Show resolved Hide resolved

pymc3/model.py Outdated Show resolved Hide resolved

Add pm.TidyData

169feef

Create coords for pm.Data(ndarray)

57909b0

Merge branch 'master' into dims

6cbcdb7

michaelosthege added this to the 3.9.0 milestone Jun 5, 2020

empty commit to trigger CI

e603855

AlexAndorra requested changes Jun 7, 2020

View reviewed changes

michaelosthege and others added 3 commits June 8, 2020 10:38

Apply suggestions from code review

a0753ac

Co-authored-by: Alexandre ANDORRA <andorra.alexandre@gmail.com>

apply black formatting

e7cb979

address review comments & formatting

b9acb96

michaelosthege mentioned this pull request Jun 8, 2020

Automatically take coords and dims from PyMC3 model arviz-devs/arviz#1228

Merged

michaelosthege added 2 commits June 8, 2020 18:56

Add demonstration of named coordinates/dims

fcb709c

don't require dim names to be identifiers

26c3fc8

robustify Dtype and ShapeError

b178207

AlexAndorra added 4 commits June 9, 2020 19:04

Removed TidyData and refined dims and coords implementation

9a943ca

Changed name of kwarg export_dims and improved docstrings

10c617b

Add link to ArviZ in docstrings

39b6d92

Removed TidyData from __all__

903ee61

AlexAndorra mentioned this pull request Jun 9, 2020

changed ar1 logp to use ar1 precision instead of innovation precision #3899

Merged

AlexAndorra added 4 commits June 10, 2020 10:35

Polished Data container NB

68a863f

Fixed line break in data.py

2e73535

Fix inference of coords for dataframes

0236ccd

Refined Data container NB

a4c832b

AlexAndorra mentioned this pull request Jun 10, 2020

New Feature: Add a TidyData Class #3953

Closed

AlexAndorra added 3 commits June 10, 2020 12:28

Updated getting started NB with new dims and coords features

80aaa35

Reran getting started NB

aec6d9c

Blackified NBs

add54c4

AlexAndorra added the enhancements label Jun 10, 2020

michaelosthege and others added 4 commits June 10, 2020 15:54

rerun with ArviZ branch

072c6a4

use np.shape to be compatible with tuples/lists

abfeba9

add tests for named coordinate handling

7a5c327

Extended tests for data container

cc25d47

michaelosthege approved these changes Jun 10, 2020

View reviewed changes

AlexAndorra approved these changes Jun 10, 2020

View reviewed changes

AlexAndorra merged commit 76f3a7b into pymc-devs:master Jun 10, 2020

Uh oh!

Allow specification of dims instead of shape #3551

Allow specification of dims instead of shape #3551

Uh oh!

Conversation

aseyboldt commented Jul 18, 2019 • edited by AlexAndorra Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fonnesbeck commented Jul 18, 2019

Uh oh!

aseyboldt commented Jul 19, 2019

Uh oh!

rpgoldman commented Jul 19, 2019

Uh oh!

aseyboldt commented Jul 19, 2019

Uh oh!

rpgoldman commented Jul 19, 2019

Uh oh!

aseyboldt commented Jul 19, 2019

Uh oh!

rpgoldman commented Jul 19, 2019

Uh oh!

rpgoldman commented Jul 20, 2019

Uh oh!

aseyboldt commented Jul 20, 2019 via email

Uh oh!

rpgoldman commented Jul 21, 2019

Uh oh!

michaelosthege commented Jun 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AlexAndorra commented Jun 5, 2020 via email

Uh oh!

michaelosthege commented Jun 5, 2020

Uh oh!

AlexAndorra commented Jun 6, 2020

Uh oh!

michaelosthege commented Jun 7, 2020

Uh oh!

AlexAndorra left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

michaelosthege commented Jun 9, 2020

Uh oh!

AlexAndorra commented Jun 10, 2020

Uh oh!

michaelosthege commented Jun 10, 2020

Uh oh!

AlexAndorra commented Jun 10, 2020

Uh oh!

StanczakDominik commented Jun 18, 2020

Uh oh!

OlafHaag commented Sep 17, 2020

Uh oh!

arvids commented Apr 4, 2021

Uh oh!

michaelosthege commented Apr 4, 2021

Uh oh!

Uh oh!

aseyboldt commented Jul 18, 2019 •

edited by AlexAndorra

Loading

michaelosthege commented Jun 5, 2020 •

edited

Loading