Multi-index indexing #802
This looks very nice! I would probably opt for making |
I followed your suggestions. Two more comments (not critical issues, I think):
In summary, |
Indeed. This would require another data structure somewhere keeping track of level names -- and ideally also ensuring that they are always unique (like dimensions). This seems fine to me for now.
I agree -- better to require the user to be explicit. I also don't see many use cases for specifying the coordinate value and level name but not the dimension name. What happens if you type |
After refactoring a bit,
Dictionaries are not hashable, so we might be able to detect this case by
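The remark about hashability can be checked with plain Python. The comment above is cut off, so the detection approach it had in mind is unknown; the sketch below only illustrates the underlying fact. Both helper names (`is_hashable`, `is_dict_label`) are hypothetical and are not part of xarray.

```python
def is_hashable(label):
    """Return True if `label` can be hashed (hypothetical helper, not xarray API)."""
    try:
        hash(label)
    except TypeError:
        return False
    return True


def is_dict_label(label):
    """Return True for dictionary labels (hypothetical helper, not xarray API)."""
    # Hashability alone is ambiguous: lists are unhashable too,
    # so an explicit isinstance check is the less surprising test.
    return isinstance(label, dict)


labels = [("a", 0), "a", 0, {"one": "a", "two": 0}, ["a", "b"]]
hashable = [is_hashable(label) for label in labels]
```

Tuples of scalars and plain scalars hash fine, while both the dictionary and the list do not, which is why a hash-based test cannot tell a dictionary label from a list of labels on its own.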
Unless you see any other issues, I think that this feature doesn't need more development for now. I'll be back next week to finish this PR (write some tests and doc).
This will be a great feature. I for one am really looking forward to using it. Will this work also allow saving to/reading from hdf5 and netcdf files with a MultiIndex? If not, can you give a sketch outline of the approach you (Stephan or Benoit) would take? I assume it would involve saving the information about the MultiIndex structure in some transformed way that fits into an hdf5 file, then reconstructing it on the read. I might need to hack together something for that before MultiIndex serialization makes it into xarray, but I'd like to make sure I don't veer too far off from the real solution that will ultimately come out.
I hope to have some time next week to work again on this PR. @tippetts You can see in #719 a few comments about saving/reading xarray data objects with multi-index to/from netCDF. I am also looking forward to seeing this feature implemented - actually I need it for another project that uses xarray - so maybe I'll find some time in the next couple of weeks to start a new PR on this.
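The "transform on write, reconstruct on read" idea raised in the question can be sketched with pandas alone. This is not the approach discussed in #719 and not xarray functionality; it only assumes that each level can be stored as a plain array under its level name, which in turn assumes the levels are named. The example index is made up for illustration.

```python
import pandas as pd

# Made-up example index with named levels.
midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["one", "two"])

# "Save": flatten the index into one plain array per level, keyed by level name.
flat = {name: midx.get_level_values(name).to_numpy() for name in midx.names}

# "Load": rebuild the MultiIndex from the per-level arrays.
rebuilt = pd.MultiIndex.from_arrays(list(flat.values()), names=list(flat.keys()))
```

Each entry of `flat` is an ordinary one-dimensional array, so it could in principle be written as a regular variable in a file format that has no notion of a hierarchical index.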
Force-pushed from ae12850 to 93ef7d2.
I finally managed to add some tests and docs. Two more comments:
@shoyer I think that it is ready for review. I successfully ran the tests on my local branch. Currently, CI tests seem broken for a reason I don't know.
da_midx.sel(x=(list('ab'), [0]))

Indexing with dictionaries uses the ``MultiIndex.get_loc_level`` pandas method
This is an implementation detail -- probably best to leave it out of the public docs.
Force-pushed from e0569dc to 5f7d670.
for k, v in iteritems(self._variables):
    if k in indexes.keys():
        idx = indexes[k]
        variables[k] = Coordinate(idx.name, idx)
If idx.name != k above, then this could be constructing an invalid dataset. I think we should create Coordinate(k, idx) and then remap back to the original names below, if necessary.
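The naming concern in this review comment can be shown with a small stand-alone sketch. `Index` and `Coordinate` below are hypothetical named tuples used only for illustration, not the xarray classes from the diff, and `rename_later` is an assumed name for the remapping step the reviewer mentions.

```python
from collections import namedtuple

# Hypothetical stand-ins for illustration only; these are not xarray classes.
Index = namedtuple("Index", ["name", "values"])
Coordinate = namedtuple("Coordinate", ["name", "values"])

# An index stored under the key "x" whose own name differs from that key.
indexes = {"x": Index(name="level_one", values=[0, 1])}

# Pattern from the diff: the coordinate is named after the index,
# so its name can disagree with the key it is stored under.
by_index_name = {k: Coordinate(idx.name, idx.values) for k, idx in indexes.items()}

# Reviewer's suggestion: name the coordinate after the key first and
# record the index name so it can be remapped afterwards if necessary.
by_key = {k: Coordinate(k, idx.values) for k, idx in indexes.items()}
rename_later = {k: idx.name for k, idx in indexes.items() if idx.name != k}
```

With the first pattern the entry under "x" carries the name "level_one", which is the mismatch the comment warns about; with the second, the key and the name agree and the rename is kept as an explicit, separate step.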
Just went through and gave another full review -- this is looking quite nice, just a few more things to clean up!
Thanks for your second review, @shoyer! It actually helped me to better understand the indexing logic of pandas (never too late)! I made some updates according to your comments. I think we're getting closer to a working feature!
However, the alternate ``from_series`` constructor will automatically unpack
any hierarchical indexes it encounters by expanding the series into a
multi-dimensional array, as described in :doc:`pandas`.
Xarray supports labeling coordinate values with a :py:class:`pandas.MultiIndex`.
It might make sense simply to drop this paragraph instead -- do we really need to explicitly call out MultiIndex if it's supported?
No, I don't think we need it. However, it might be good to put a sentence somewhere in the docs recommending that users set names for multi-index levels before creating data arrays or datasets. What do you think?
I did a little bit of testing. Here is one case I found where things don't work as I expected:
I would expect an index drop in the last case, too. I guess we need to check for scalars.
        indexer, new_index = index.get_loc_level(
            label, level=list(range(len(label)))
        )

    else:
        label = _asarray_tuplesafe(label)
        if label.ndim == 0:
this is where scalars end up -- probably need to add a clause here to handle MultiIndex
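To see why scalars reach the `ndim == 0` branch, here is a minimal sketch of a tuple-safe array conversion. The function `as_array_tuplesafe` is a hypothetical re-creation written for this illustration; the helper named in the diff may behave differently.

```python
import numpy as np


def as_array_tuplesafe(label):
    """Convert a label to an array, keeping a tuple as a single 0-d element.

    Hypothetical re-creation for illustration; not the helper from the diff.
    """
    if isinstance(label, tuple):
        # Store the whole tuple as one object instead of expanding it.
        result = np.empty((), dtype=object)
        result[()] = label
        return result
    return np.asarray(label)


scalar = as_array_tuplesafe("a")
tuple_label = as_array_tuplesafe(("a", 0))
list_label = as_array_tuplesafe(["a", "b"])
```

Under these assumptions both a plain scalar and a tuple come out zero-dimensional, while a list of labels is one-dimensional, so any special handling for a scalar label on a multi-index would have to sit inside the zero-dimensional branch.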
Mind if I ask if this will get merged into master? It looks like a lot of work went into the pull request, and the discussion + passed checks lead me to believe it could be close to going in. Is there anything a third party can do to push it across the finish line?
Follows #767.
This is incomplete (it still needs some tests and documentation updates), but it is working for both Dataset and DataArray objects. I also don't know if it is fully compatible with lazy indexing (Dask).
Using the example from #767:
As shown in this example, similarly to pandas, it automatically renames the dimension and assigns a new coordinate when the selection doesn't return a pd.MultiIndex (here it returns a pd.Float64Index).
In some cases this behavior may be unwanted (??), so I added a drop_level keyword argument (if False it keeps the multi-index and doesn't change the dimension/coordinate names):
Note that it also works with DataArray.loc, but (for now) in that case it always returns the multi-index:
This is however inconsistent with Dataset.sel and Dataset.loc that both apply drop_level=True by default, due to their different implementation. Two solutions: (1) make DataArray.loc apply drop_level by default, or (2) use drop_level=False by default everywhere.
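The level-dropping behavior described here mirrors what pandas itself does in `MultiIndex.get_loc_level`, the method named in the docs diff in this thread. The sketch below shows only the pandas side; the data is made up for illustration and is not the example from #767, and how the PR passes the option through `sel` is not shown.

```python
import pandas as pd

# Made-up example data; this is not the example from #767.
midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["one", "two"])

# Default drop_level=True: the selected level is removed from the result,
# leaving a plain index over the remaining level.
loc, dropped = midx.get_loc_level("a", level="one")

# drop_level=False: the result is still a two-level MultiIndex.
loc_kept, kept = midx.get_loc_level("a", level="one", drop_level=False)
```

In the first call the returned index is no longer a `pd.MultiIndex` and carries the name of the remaining level, which is the situation in which the description says the dimension is renamed and a new coordinate assigned.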