Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean up the API for renaming and changing dimensions / coordinates #4825

Open
keewis opened this issue Jan 19, 2021 · 5 comments
Open

clean up the API for renaming and changing dimensions / coordinates #4825

keewis opened this issue Jan 19, 2021 · 5 comments

Comments

@keewis
Copy link
Collaborator

keewis commented Jan 19, 2021

From #4108:

I wonder if it would be better to first "reorganize" all of the existing functions: we currently have rename (and Dataset.rename_dims / Dataset.rename_vars), set_coords, reset_coords, set_index, reset_index and swap_dims, which overlap partially. For example, the code sample from #4417 works if instead of

ds = ds.rename(b='x')
ds = ds.set_coords('x')

we use

ds = ds.set_index(x="b")

and something similar for the code sample in #4107.

I believe we currently have these use cases (not sure if that list is complete, though):

  • rename a DataArrayrename
  • rename a existing variable to a name that is not yet in the object → rename / Dataset.rename_vars / Dataset.rename_dims
  • convert a data variable to a coordinate (not a dimension coordinate) → set_coords
  • convert a coordinate (not a dimension coordinate) to a data variable → reset_coords
  • swap a existing dimension coordinate with a coordinate (which may not exist) and rename the dimension → swap_dims
  • use a existing coordinate / data variable as a dimension coordinate (do not rename the dimension) → set_index
  • stop using a coordinate as dimension coordinate and append _ to its name (do not rename the dimension) → reset_index
  • use two existing coordinates / data variables as a MultiIndex → set_index
  • stop using a MultiIndex as a dimension coordinate and use its levels as coordinates → reset_index

Sometimes, some of these can be emulated by combinations of others, for example:

# x is a dimension without coordinates
assert_identical(ds.set_index({"x": "b"}), ds.swap_dims({"x": "b"}).rename({"b": "x"}))
assert_identical(ds.swap_dims({"x": "b"}), ds.set_index({"x": "b"}).rename({"x": "b"}))

and, with this PR:

assert_identical(ds.set_index({"x": "b"}), ds.set_coords("b").rename({"b": "x"}))
assert_identical(ds.swap_dims({"x": "b"}), ds.rename({"b": "x"}))

which means that it would increase the overlap of rename, set_index, and swap_dims.

In any case I think we should add a guide which explains which method to pick in which situation (or extend howdoi).

Originally posted by @keewis in #4108 (comment)

@max-sixty
Copy link
Collaborator

I agree it's confusing, and I find myself looking up the correct way to do something fairly frequently.

I'm not sure exactly how to rationalize them though.

Without thinking this through in enough detail, Does set_ set the variable to the relevant type and reset_ unsets it to a normal variable?
In that framework, I think the only aberrant is swap_dims, which could instead be set_dims?

@max-sixty
Copy link
Collaborator

There's also rename_dims, but only on a Dataset! There's a TODO to deprecate swap_dims in favor of that. Though I'm not sure that's right — swap_dims changes the dim values for another variable's, but rename_dims just changes the name?

@dcherian
Copy link
Contributor

I'm not sure exactly how to rationalize them though.

How about

  1. rename_* should only allow renaming to unused names,
  2. swap_dims should only work with existing names.
  3. set_index should raise an error for the "renaming" and "swapping" cases.

@keewis
Copy link
Collaborator Author

keewis commented Jan 25, 2021

if we do 2 I think we should add a new_dims kwarg to reset_index, otherwise swapping to a dimension without a coordinate is a lot more difficult.

@benbovy
Copy link
Member

benbovy commented Sep 10, 2021

I think that the explicit index refactoring would be an opportunity to clarify things a bit. With no concept of a dimension coordinate/ implicit index, ideally we could have the following behavior with a clear distinction between dimensions/coordinates and indexes, following more closely the "explicit is better than implicit" principle:

  • set_index / reset_index would only affect the indexes without implicitly renaming any coordinate or dimension.

    • actually, setting a new index for a given dimension doesn't make much sense anymore with the new data model. Setting a new index for a given set of variables and/or coordinates makes more sense
    • trying to set an index from non-existing data variables or coordinates would raise a KeyError
    • .reset_index("x") would not rename the x coordinate to x_
    • .set_index(x="b") (where x is a dimension without index/coordinate and b is a variable with dimension x) would be replaced by a more explicit set_index("b").rename({"b": "x"}). Renaming b to x wouldn't even be necessary if we want to use the b coordinate as an index.
  • rename_* would only affect the name of coordinates or dimensions (or the name of the DataArray) without implicitly adding or removing any index.

    • .set_coords("b").rename({"b": "x"}) in the example above would not implictly create an index for the "x" coordinate
  • swap_dims would only affect the existing dimension names without implicitly adding or removing any index and without renaming any coordinate.

This would be in an ideal world, though. More practically, I don't know how we could make a smooth transition.

There's also rename_dims, but only on a Dataset! There's a TODO to deprecate swap_dims in favor of that. Though I'm not sure that's right — swap_dims changes the dim values for another variable's, but rename_dims just changes the name?

swap_dims used as a workaround for switching from one to another coordinate as index on a dimension will no longer be needed after the explicit indexes refactoring. Should we eventually depreciate it, then?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants