Consistent API for choosing model RVs / value vars #4846

ricardoV94 · 2021-07-08T07:43:43Z

We have these three types of variables: RandomVariables --- Value vars --- Transformed value vars

with pm.Model() as m:
    x = pm.Normal('x')
    y = pm.HalfNormal('y')

model["x"] returns the x RV
model["y"] returns the y RV
model["y_log__"] returns the y transformed value var.

model.rvs_to_values[model["x"]] returns the x value var
model.rvs_to_values[model["y"]] returns the y transformed value var

The automatic transforms were always a source of confusion, and I am afraid the new changes will make things a bit more confusing still. We should try to be consistent in what is expected as input in PyMC sampling methods. e.g., when:

assigning variables to step methods (see Bring back SMC and allow prior_predictive_sampling to return transformed values #4769)
passing a starting point dictionary (see pymc3.sample() not starting from the specified start value #4689)
choosing which variables to include prior/posterior predictive sampling (see Handle RVs assigned to steps #4701)
Probably more...

Proposal

I suggest that calling model["varname"] should always return the corresponding RV. All PyMC3 functions that need a list of variables as inputs should expect an RV and obtain the respective (transformed) value vars if needed. In line with this:

Calling model["y_log__"] should fail.
When defining a starting dict, the untransformed value of y should be expected and automatically transformed to y_log__ if needed
Step methods should expect a list of RVs as input and retrieve the respective value vars, so that step = pm.Metropolis([model["x"], model["y"]]) would work.
Any methods that accept variable names instead of the actual RVs / value vars should expect these to refer to the name of the RVs.

Defining everything in terms of value vars makes less sense to me. For example they are not used in prior/predictive sampling. Users are also less likely to understand what they mean/ do.

The current approach makes even less sense, and results in a lot of duck typing / guessing on the backend to figure out what the user is trying to select.

The text was updated successfully, but these errors were encountered:

brandonwillard · 2021-07-08T18:38:16Z

We have these three types of variables: RandomVariables --- Value vars --- Transformed value vars

Transformed variables are not a type of variable in the same sense that random variables and their value variables are; instead, they're value variables without explicit random variables. This is due to the old design of PyMC3, which exposes transformed variables at the user-level and makes them an explicit part of the API.

I suggest that calling model["varname"] should always return the corresponding RV.

In general, it would probably be better to have the user only ever deal with random variables, and transforms should only ever be used internally.

Defining everything in terms of value vars makes less sense to me.

What do you mean by "Defining everything in terms of value vars"?

Users are also less likely to understand what they mean/ do.

To which user-level interactions with value variables are you referring? We would need to start enumerating those in order to determine which could/should be improved. One difficulty with that is determining what we would consider a user-level function, method, member, etc. Not everything exposed by classes like Model is actually user-level functionality; much of the machinery in those classes is only ever used internally. In other words, we don't have well defined user/developer boundaries.

ricardoV94 · 2021-07-09T12:55:08Z

To which user-level interactions with value variables are you referring? We would need to start enumerating those in order to determine which could/should be improved. One difficulty with that is determining what we would consider a user-level function, method, member, etc. Not everything exposed by classes like Model is actually user-level functionality; much of the machinery in those classes is only ever used internally. In other words, we don't have well defined user/developer boundaries.

I agree.

I gave 3 examples that I had in mind mostly because they were mentioned in recent user issues or I had to work with the internals (so it's pretty much ad-hoc):

Defining the start point dictionary for sampling (to be used instead of the initval)
Assigning variables to different step methods manually
Specifying which variables a user wants to sample from prior / posterior predictive

I am sure there are others, perhaps in the more specialized modules

brandonwillard · 2021-07-09T16:28:54Z

I gave 3 examples that I had in mind mostly because they were mentioned in recent user issues or I had to work with the internals (so it's pretty much ad-hoc):

The examples you originally gave were very specific to Model.__getitem__, and that's an interface that's used by both internal library code and—assumedly—users, so the question is more about that particular interface, no?

Regardless, yes, I think we should make most/all user-level interactions work in terms of the random variables—and preferably the actual RandomVariable outputs themselves and not their names, because the names impose unnecessary constraints (e.g. unique naming).

ricardoV94 · 2021-07-10T06:33:38Z

The API ambiguity is manifest in all / most user facing methods that accept a list of variables (sometimes just the names), and whose docstrings usually reads something like ”input: list of model variables”

This is separate from the model.__getittem__ thing (although the two can interact —e.g., when variables are be retrieved name).

michaelosthege · 2021-07-17T00:08:59Z

Regardless, yes, I think we should make most/all user-level interactions work in terms of the random variables—and preferably the actual RandomVariable outputs themselves and not their names, because the names impose unnecessary constraints (e.g. unique naming).

Not sure if we're on the same page here: The existing API has this signature: Model.__getitem__(name: str) -> RV with RV being different things, right?
I think we should continue to ingest a name: str there, because why would anybody even use a __getitem__ interface if they already have a handle on the object?

ricardoV94 added the request discussion label Jul 8, 2021

michaelosthege mentioned this issue Oct 1, 2021

Implement and switch to lazy initval evaluation framework #4983

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistent API for choosing model RVs / value vars #4846

Consistent API for choosing model RVs / value vars #4846

ricardoV94 commented Jul 8, 2021 •

edited

Loading

brandonwillard commented Jul 8, 2021 •

edited

Loading

ricardoV94 commented Jul 9, 2021

brandonwillard commented Jul 9, 2021 •

edited

Loading

ricardoV94 commented Jul 10, 2021 •

edited

Loading

michaelosthege commented Jul 17, 2021

Consistent API for choosing model RVs / value vars #4846

Consistent API for choosing model RVs / value vars #4846

Comments

ricardoV94 commented Jul 8, 2021 • edited Loading

Proposal

brandonwillard commented Jul 8, 2021 • edited Loading

ricardoV94 commented Jul 9, 2021

brandonwillard commented Jul 9, 2021 • edited Loading

ricardoV94 commented Jul 10, 2021 • edited Loading

michaelosthege commented Jul 17, 2021

ricardoV94 commented Jul 8, 2021 •

edited

Loading

brandonwillard commented Jul 8, 2021 •

edited

Loading

brandonwillard commented Jul 9, 2021 •

edited

Loading

ricardoV94 commented Jul 10, 2021 •

edited

Loading