-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-index levels as coordinates #947
Conversation
if name != var.dims[0]: | ||
continue | ||
level_coords.update(var.to_coord().get_level_coords()) | ||
return level_coords |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I missing something here? Wouldn't this also work without the continue
statement?
for name in self._coord_names:
var = self.variables[name]
if name == var.dims[0]:
level_coords.update(var.to_coord().get_level_coords())
return level_coords
In the example above, I wasn't happy with the initial implementation of using levels in |
This is very exciting to see! A few thoughts on implementation: Instead of always creating a dictionary of level coordinates, I would add an attribute It's much cheaper to call
It's even more extreme for larger indexes. If possible, we should use something closer to this approach when formatting coordinates.
I would actually be happy to disallow both, which might be even easy. It seems like a fine rule to say that you cannot call |
I would suggest putting the logic to create new variables for levels in the private This could get us most of the way there, but there are still a few things to watch out for:
|
I'm conflicted about how to handle the repr. On the one hand, I like how Let me try to sketch out some concrete proposals to encourage the peanut gallery to speak up: Option 1: no special indicator for the MultiIndex:
Option 2: both MultiIndex and levels in repr:
Option 3: both MultiIndex and levels in repr, different symbol for levels:
Option 4: both MultiIndex and levels in repr, different symbol for levels, with indentation:
A separate question (if we pick one of options 2-4) is how to represent the Option A: The tradeoffs here are whether or not we include the exact dtype information ( I'm currently leaning toward Option 3A, but I don't have a strong opinion. |
|
||
Coordinates must always be 1-dimensional. In addition to Variable methods | ||
and properties (attributes, encoding, broadcasting), they support some | ||
pandas.Index methods directly (e.g., get_indexer), even though pandas does | ||
not (yet) support duck-typing for indexes. | ||
""" | ||
|
||
def __init__(self, name, data, attrs=None, encoding=None, fastpath=False): | ||
super(Coordinate, self).__init__(name, data, attrs, encoding, fastpath) | ||
def __init__(self, name, data, attrs=None, encoding=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would slightly rather change the signature from name
-> dims
in the first argument (to match Variable
), and add name=None
as an optional parameter at the end (before fastpath
). This could possibly break use of Coordinate(name=..., data=...)
but I expect that usage is relatively rare -- this part of the API is usually not touched by users.
Possibly it would be a good idea to rename Coordinate
-> IndexVariable
, and change it's intended usage to cover anytime someone wants to represent a pandas.Index
in xarray. This could be useful for anyone who needs support for custom pandas types like Period or Categorical, even if they won't actually use the array for indexing.
Agreed. I have always the tendency to want to make things as flexible as possible, but this is definitely not needed here.
I'm also +1 for Option 3A. Maybe one little argument in favor of Options 3E or 3F is that they still show a consistent repr when a scalar is returned from the multi-index (see below), even though I don't like how they would display duplicate information.
As a recent xarray user, I indeed remember that I initially found confusing to have Dataset or DataArray "coordinates" that can be either |
Sounds good, I will do this in a separate PR. |
I just made some updates.
Done.
Done, although it currently slices an arbitrary number (30) of first elements rather than calculating the number of elements needed for display.
Done.
I tried but it broke some existing tests. It actually triggered data loading for Coordinate objects (via calls to
Right, but in the current implementation this is still used internally.
Done. It should also work with multi-index levels although not tested yet.
I've chosen option 3A for the repr, but I can change it depending on others' opinions.
Done.
They don't appear in there. If we keep Option 3A for the repr, I also think that we can avoid changes here. |
|
||
|
||
def _summarize_coord_levels(coord, col_width, marker): | ||
# TODO: maybe slicing based on calculated number of displayed values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's almost certainly overkill, but you could write:
max_width = OPTIONS['display_width']
max_possibly_relevant = max(int(np.ceil(max_width / 2.0)), 1)
relevant_coord = coord[:max_possibly_relevant]
Either way, you probably do want to slice the coordinate outside the look.
@@ -210,6 +224,7 @@ def __init__(self, data_vars=None, coords=None, attrs=None, | |||
coords = {} | |||
if data_vars is not None or coords is not None: | |||
self._set_init_vars_and_dims(data_vars, coords, compat) | |||
self._check_multiindex_level_names() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than putting this check in the DataArray
and Dataset
constructors, let's calling this from the Variable
constructor instead. Probably the cleanest place to put this check in PandasIndexAdapter
, which is already used to wrap the data from all pandas Indexes:
https://github.com/pydata/xarray/blob/master/xarray/core/indexing.py#L434
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmh not sure to understand as this checks for unique multi-index level names among all the (multi-index) coordinates of a new DataArray
or a Dataset
object. I guess we need to iterate over those coordinates... I can rename it if it's not clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhh... that makes perfect sense. I was confused.
In that case, the logic should go into merge.py
, to ensure these checks will be performed every time variables are modified, too (which doesn't necessarily use the constructor). We should be doing this check on the result of merge_variables()
in merge_coords_without_align
, merge_coords
and merge_core
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I see, Dataset._set_init_vars_and_dims
indeed depends on merge.py
but I actually didn't go more into that module. I'll take a look.
4672448
to
9dc2c16
Compare
Not sure how to write the tests for this PR, as there are quite many small changes spread in the API (e.g., repr, data object and coordinate properties, etc.). Should I write new tests specific to multi-index or should I modify existing tests (e.g., |
I would usually lean towards dedicated tests (e.g., |
@@ -151,6 +151,12 @@ def to_dataset(self): | |||
def _update_coords(self, coords): | |||
from .dataset import calculate_dimensions | |||
|
|||
for key in coords: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not already checked by all the callers, e.g., in merge_coords_without_align
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, currently it is not, though this should be indeed checked there!
3d9dc31
to
1d6a96f
Compare
@shoyer this is ready for another round of review. I don't see any remaining issue, I added some tests and I updated the docs. |
@@ -208,13 +215,41 @@ def _to_dataset(self, shallow_copy=True): | |||
def to_dataset(self): | |||
return self._to_dataset() | |||
|
|||
def __setitem__(self, key, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just define this method once on AbstractCoordinates
instead of repeating it twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think these should also be caught by the checks in merge.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! So I can remove this.
Rather than adding an independent |
if var.ndim == 1: | ||
level_names = var.to_index_variable().level_names | ||
if level_names is not None: | ||
dim = var.dims[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use tuple unpacking instead: dim, = var.dims
I agree that |
I think what BTW, I will be away over the holiday weekend (in the US), but I expect we will probably be able to merge this shortly after I get back. |
I've made changes according to your comments.
I just thought that returning a dict could be a little more convenient (i.e., using a single call) if we need to get either one particular or several or all level(s) as
Happy holidays! (I'll be on holiday next week too). |
I think the main (only?) thing left to do here is to remove the |
Just removed it. |
OK, in it goes. Big thanks to @benbovy ! |
Implements 2, 4 and 5 in #719.
Demo:
Some notes about the implementation:
Coordinate
so that it allows setting different values for the names of the coordinate and its dimension. There is no breaking change.Coordinate.get_level_coords
method to get independent, single-index coordinates objects from a MultiIndex coordinate.Remaining issues:
Coordinate.get_level_coords
callspandas.MultiIndex.get_level_values
for each level and is itself called each time when indexing and for repr. This can be very costly!! It would be nice to return some kind of lazy index object instead of computing the actual level values.*
for level coordinates.DataArray.level_1
doesn't return anotherDataArray
object:DataArray
orDataset
creation.Of course still needs proper tests and docs...