Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent API for choosing model RVs / value vars #4846

Open
ricardoV94 opened this issue Jul 8, 2021 · 5 comments
Open

Consistent API for choosing model RVs / value vars #4846

ricardoV94 opened this issue Jul 8, 2021 · 5 comments

Comments

@ricardoV94
Copy link
Member

ricardoV94 commented Jul 8, 2021

We have these three types of variables: RandomVariables --- Value vars --- Transformed value vars

with pm.Model() as m:
    x = pm.Normal('x')
    y = pm.HalfNormal('y')

model["x"] returns the x RV
model["y"] returns the y RV
model["y_log__"] returns the y transformed value var.

model.rvs_to_values[model["x"]] returns the x value var
model.rvs_to_values[model["y"]] returns the y transformed value var

The automatic transforms were always a source of confusion, and I am afraid the new changes will make things a bit more confusing still. We should try to be consistent in what is expected as input in PyMC sampling methods. e.g., when:

  1. assigning variables to step methods (see Bring back SMC and allow prior_predictive_sampling to return transformed values #4769)
  2. passing a starting point dictionary (see pymc3.sample() not starting from the specified start value #4689)
  3. choosing which variables to include prior/posterior predictive sampling (see Handle RVs assigned to steps #4701)
  4. Probably more...

Proposal

I suggest that calling model["varname"] should always return the corresponding RV. All PyMC3 functions that need a list of variables as inputs should expect an RV and obtain the respective (transformed) value vars if needed. In line with this:

  • Calling model["y_log__"] should fail.
  • When defining a starting dict, the untransformed value of y should be expected and automatically transformed to y_log__ if needed
  • Step methods should expect a list of RVs as input and retrieve the respective value vars, so that step = pm.Metropolis([model["x"], model["y"]]) would work.
  • Any methods that accept variable names instead of the actual RVs / value vars should expect these to refer to the name of the RVs.

Defining everything in terms of value vars makes less sense to me. For example they are not used in prior/predictive sampling. Users are also less likely to understand what they mean/ do.

The current approach makes even less sense, and results in a lot of duck typing / guessing on the backend to figure out what the user is trying to select.

@brandonwillard
Copy link
Contributor

brandonwillard commented Jul 8, 2021

We have these three types of variables: RandomVariables --- Value vars --- Transformed value vars

Transformed variables are not a type of variable in the same sense that random variables and their value variables are; instead, they're value variables without explicit random variables. This is due to the old design of PyMC3, which exposes transformed variables at the user-level and makes them an explicit part of the API.

I suggest that calling model["varname"] should always return the corresponding RV.

In general, it would probably be better to have the user only ever deal with random variables, and transforms should only ever be used internally.

Defining everything in terms of value vars makes less sense to me.

What do you mean by "Defining everything in terms of value vars"?

Users are also less likely to understand what they mean/ do.

To which user-level interactions with value variables are you referring? We would need to start enumerating those in order to determine which could/should be improved. One difficulty with that is determining what we would consider a user-level function, method, member, etc. Not everything exposed by classes like Model is actually user-level functionality; much of the machinery in those classes is only ever used internally. In other words, we don't have well defined user/developer boundaries.

@ricardoV94
Copy link
Member Author

To which user-level interactions with value variables are you referring? We would need to start enumerating those in order to determine which could/should be improved. One difficulty with that is determining what we would consider a user-level function, method, member, etc. Not everything exposed by classes like Model is actually user-level functionality; much of the machinery in those classes is only ever used internally. In other words, we don't have well defined user/developer boundaries.

I agree.

I gave 3 examples that I had in mind mostly because they were mentioned in recent user issues or I had to work with the internals (so it's pretty much ad-hoc):

  1. Defining the start point dictionary for sampling (to be used instead of the initval)
  2. Assigning variables to different step methods manually
  3. Specifying which variables a user wants to sample from prior / posterior predictive

I am sure there are others, perhaps in the more specialized modules

@brandonwillard
Copy link
Contributor

brandonwillard commented Jul 9, 2021

I gave 3 examples that I had in mind mostly because they were mentioned in recent user issues or I had to work with the internals (so it's pretty much ad-hoc):

The examples you originally gave were very specific to Model.__getitem__, and that's an interface that's used by both internal library code and—assumedly—users, so the question is more about that particular interface, no?

Regardless, yes, I think we should make most/all user-level interactions work in terms of the random variables—and preferably the actual RandomVariable outputs themselves and not their names, because the names impose unnecessary constraints (e.g. unique naming).

@ricardoV94
Copy link
Member Author

ricardoV94 commented Jul 10, 2021

The API ambiguity is manifest in all / most user facing methods that accept a list of variables (sometimes just the names), and whose docstrings usually reads something like ”input: list of model variables”

This is separate from the model.__getittem__ thing (although the two can interact —e.g., when variables are be retrieved name).

@michaelosthege
Copy link
Member

Regardless, yes, I think we should make most/all user-level interactions work in terms of the random variables—and preferably the actual RandomVariable outputs themselves and not their names, because the names impose unnecessary constraints (e.g. unique naming).

Not sure if we're on the same page here: The existing API has this signature: Model.__getitem__(name: str) -> RV with RV being different things, right?
I think we should continue to ingest a name: str there, because why would anybody even use a __getitem__ interface if they already have a handle on the object?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants