
Multi-index indexing #802

Merged (24 commits) on Jul 19, 2016

Conversation

@benbovy (Member) commented Mar 24, 2016

Follows #767.

This is incomplete (it still needs some tests and documentation updates), but it is working for both Dataset and DataArray objects. I also don't know if it is fully compatible with lazy indexing (Dask).

Using the example from #767:

In [4]: da.sel(band_wavenumber={'band': 'foo'})
Out[4]:
<xarray.DataArray (wavenumber: 2)>
array([ 0.00017,  0.00014])
Coordinates:
  * wavenumber  (wavenumber) float64 4.05e+03 4.05e+03

As this example shows, similarly to pandas, it automatically renames the dimension and assigns a new coordinate when the selection doesn't return a pd.MultiIndex (here it returns a pd.Float64Index).

In some cases this behavior may be unwanted, so I added a drop_level keyword argument (if False, it keeps the multi-index and doesn't change the dimension/coordinate names):

In [5]: da.sel(band_wavenumber={'band': 'foo'}, drop_level=False)
Out[5]:
<xarray.DataArray (band_wavenumber: 2)>
array([ 0.00017,  0.00014])
Coordinates:
  * band_wavenumber  (band_wavenumber) object ('foo', 4050.2) ('foo', 4050.3)

Note that it also works with DataArray.loc, but (for now) in that case it always returns the multi-index:

In [6]: da.loc[{'band_wavenumber': {'band': 'foo'}}]
Out[6]:
<xarray.DataArray (band_wavenumber: 2)>
array([ 0.00017,  0.00014])
Coordinates:
  * band_wavenumber  (band_wavenumber) object ('foo', 4050.2) ('foo', 4050.3)

This is, however, inconsistent with Dataset.sel and Dataset.loc, which both apply drop_level=True by default due to their different implementation. Two solutions: (1) make DataArray.loc apply drop_level=True by default, or (2) use drop_level=False by default everywhere.
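
For reference, a DataArray with a comparable band/wavenumber multi-index can be built roughly as follows (the values here are illustrative, not the actual data from #767):

import pandas as pd
import xarray as xr

# Hypothetical example data: two bands, each with a few wavenumbers.
idx = pd.MultiIndex.from_arrays(
    [['foo', 'foo', 'bar'], [4050.2, 4050.3, 5100.0]],
    names=['band', 'wavenumber'])
da = xr.DataArray([0.00017, 0.00014, 0.00021], [('band_wavenumber', idx)])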

@shoyer (Member) commented Mar 25, 2016

This looks very nice! I would probably opt for making drop_level=True the default for .loc -- it's hard to see when that wouldn't be desirable. In fact, I'm not even sure that it's worth having the option to set drop_level=False.

@benbovy (Member, Author) commented Mar 25, 2016

I followed your suggestions.

Two more comments (not critical issues, I think):

  • Multi-index level drop only works when a dict is provided for a dimension, i.e., da.sel(band_wavenumber='foo') still returns the full pandas.MultiIndex. This would require indexing.convert_label_indexer to also return an updated index when label is not dict-like, which I think is not that straightforward.
  • There is a potential conflict when providing a dictionary to DataArray.loc: it may be interpreted either as a mapping of dimensions to labels or as a mapping of index levels to labels for the first dimension. For now, the second option is not handled, e.g., da.loc[{'band': 'foo'}] raises a KeyError.

In summary, da.loc[{'band_wavenumber': {'band': 'foo'}}] returns a DataArray with an updated index and coordinate / dimension name, da.loc['foo'] returns a DataArray with the full multi-index, and da.loc[{'band': 'foo'}] raises a KeyError. Three different results for what is conceptually the same selection, although I think the first one is the most explicit and the best.
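
Spelling out the three cases with the illustrative da built earlier (the comments just restate the behavior described above):

# Explicit {dim: {level: label}} mapping: the used level is dropped and the
# dimension/coordinate is renamed to 'wavenumber'.
da.loc[{'band_wavenumber': {'band': 'foo'}}]

# Plain label on the first level: the selection works, but the full
# MultiIndex is kept.
da.loc['foo']

# Ambiguous: 'band' could be a dimension name or a level name -- raises KeyError.
da.loc[{'band': 'foo'}]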

@shoyer (Member) commented Mar 25, 2016

> Multi-index level drop only works when a dict is provided for a dimension, i.e., da.sel(band_wavenumber='foo') still returns the full pandas.MultiIndex. This would require indexing.convert_label_indexer to also return an updated index when label is not dict-like, which I think is not that straightforward.

Indeed. This would require another data structure somewhere keeping track of level names -- and ideally also ensuring that they are always unique (like dimensions). This seems fine to me for now.

> There is a potential conflict when providing a dictionary to DataArray.loc: it may be interpreted either as a mapping of dimensions to labels or as a mapping of index levels to labels for the first dimension. For now, the second option is not handled, e.g., da.loc[{'band': 'foo'}] raises a KeyError.

I agree -- better to require the user to be explicit. I also don't see many use cases for specifying the coordinate value and level name but not the dimension name. What happens if you type da.loc[{'band': 'foo'}, :] or da.loc[{'band': 'foo'}, ...]? In principle these are not ambiguous, but I could only guess at what happens without looking at the code.

@benbovy (Member, Author) commented Mar 25, 2016

After refactoring a bit, da.loc[{'band': 'foo'}, :] and da.loc[{'band': 'foo'}, ...] now work as expected.

da.loc[{'band': 'foo'}] still raises a KeyError. Although it is possible to catch this exception, the case remains ambiguous, so it is probably better to do nothing.

@shoyer (Member) commented Mar 25, 2016

Dictionaries are not hashable, so we might be able to detect this case by calling hash() and raising a KeyError with alternative suggestions. Just an idea -- not sure if there's a clean way to implement that.
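
A minimal sketch of that idea (the helper name and error message are made up, not an actual xarray implementation):

def _check_loc_key(key):
    # Dicts (and other unhashable objects) cannot be plain index labels;
    # catch that early and point the user to the explicit form instead.
    try:
        hash(key)
    except TypeError:
        raise KeyError(
            "unhashable key %r; to select on a multi-index level, use "
            "da.loc[{dim: {level: label}}] or da.sel(dim={level: label})"
            % (key,)
        )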

@benbovy (Member, Author) commented Mar 25, 2016

Unless you see any other issues, I think that this feature doesn't need more development for now.

I'll be back next week to finish this PR (write some tests and doc).

@tippetts commented Apr 6, 2016

This will be a great feature. I for one am really looking forward to using it.

Will this work also allow saving to / reading from HDF5 and netCDF files with a MultiIndex? If not, can you give a rough outline of the approach you (Stephan or Benoit) would take? I assume it would involve saving the information about the MultiIndex structure in some transformed way that fits into an HDF5 file, then reconstructing it on read. I might need to hack together something for that before MultiIndex serialization makes it into xarray, but I'd like to make sure I don't veer too far off from the real solution that will ultimately come out.

@benbovy (Member, Author) commented Apr 6, 2016

I hope to have some time next week to work again on this PR.

@tippetts You can see in #719 a few comments about saving/reading xarray data objects with a multi-index to/from netCDF. I'm also looking forward to seeing this feature implemented -- actually, I need it for another project that uses xarray -- so maybe I'll find some time in the next couple of weeks to start a new PR on this.
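
In the meantime, a manual workaround could look roughly like the sketch below -- the helper names are illustrative and untested, and details may vary across xarray versions: store each level of the multi-index as a plain coordinate before writing, then rebuild the MultiIndex after reading.

import pandas as pd

def flatten_multiindex(ds, dim):
    # Store each MultiIndex level as an ordinary coordinate and drop the
    # tuple-valued index coordinate, so the dataset can be written with the
    # usual to_netcdf machinery.
    idx = ds.indexes[dim]
    out = ds.copy()
    for name in idx.names:
        out.coords[name] = (dim, idx.get_level_values(name))
    return out.drop(dim)

def restore_multiindex(ds, dim, level_names):
    # Rebuild the MultiIndex from the stored level coordinates after reading.
    idx = pd.MultiIndex.from_arrays(
        [ds[name].values for name in level_names], names=level_names)
    out = ds.drop(level_names)
    out[dim] = idx
    return out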

@benbovy (Member, Author) commented May 30, 2016

I finally managed to add some tests and docs.

Two more comments:

  • Index replacement: if the multi-index has unnamed levels, the chosen name for the replaced index (if any) is <old_dim>_unnamed_level (old_dim is the name of the replaced multi-index dimension).
    I'm not sure whether it is really necessary to track the index level number; single indexes returned by get_loc_level don't keep that information. Anyway, it is always better to set explicit names for the index levels.
  • Indexing on a multi-index now also works with nested tuples (multi-index panel to xarray DataArray index selection, #850).

@shoyer I think this is ready for review. I successfully ran the tests on my local branch. Currently, the CI tests seem broken for a reason I don't know.


Review comment (Member) on these docs lines from the diff:

da_midx.sel(x=(list('ab'), [0]))

Indexing with dictionaries uses the ``MultiIndex.get_loc_level`` pandas method

This is an implementation detail -- probably best to leave it out of the public docs.

Review comment (Member) on this code from the diff:

for k, v in iteritems(self._variables):
    if k in indexes.keys():
        idx = indexes[k]
        variables[k] = Coordinate(idx.name, idx)

If idx.name != k above, then this could be constructing an invalid dataset.

I think we should create Coordinate(k, idx) and then remap back to the original names below, if necessary.
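
A rough sketch of that suggestion against the loop quoted above (the renamed bookkeeping dict is illustrative, not the merged code):

renamed = {}
for k, v in iteritems(self._variables):
    if k in indexes:
        idx = indexes[k]
        # Build the coordinate under the original key so the dataset stays
        # internally consistent, even if the new index carries a level name.
        variables[k] = Coordinate(k, idx)
        if idx.name is not None and idx.name != k:
            # Remember to rename this dimension/coordinate afterwards.
            renamed[k] = idx.name
    else:
        variables[k] = v
# ... later, remap dimensions/coordinates according to `renamed`, if needed.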

@shoyer (Member) commented Jun 7, 2016

Just went through and gave another full review -- this is looking quite nice, just a few more things to clean up!

@benbovy (Member, Author) commented Jun 10, 2016

Thanks for your second review, @shoyer! It actually helped me better understand the indexing logic of pandas (never too late)!

I made some updates according to your comments.

I think we're getting closer to a working feature!

Review comment (Member) on these docs lines from the diff:

However, the alternate ``from_series`` constructor will automatically unpack
any hierarchical indexes it encounters by expanding the series into a
multi-dimensional array, as described in :doc:`pandas`.
Xarray supports labeling coordinate values with a :py:class:`pandas.MultiIndex`.

It might make sense simply to drop this paragraph instead -- do we really need to explicitly call out MultiIndex if it's supported?

Reply from @benbovy (Member, Author):
No, I don't think we need it. However, it might be good to put a sentence somewhere in the docs recommending that users set names for multi-index levels before creating data arrays or datasets. What do you think?

@shoyer (Member) commented Jun 11, 2016

I did a little bit of testing. Here is one case I found where things don't work as I expected:

In [32]: idx = pd.MultiIndex.from_product([['a', 'b'], [0, 1, 2]], names=['lev0', 'lev1'])

In [33]: data = xr.DataArray(range(6), [('x', idx)])

In [34]: data.sel(x=('a',))
Out[34]:
<xarray.DataArray (lev1: 3)>
array([0, 1, 2])
Coordinates:
  * lev1     (lev1) int64 0 1 2

In [35]: data.sel(x='a')
Out[35]:
<xarray.DataArray (x: 3)>
array([0, 1, 2])
Coordinates:
  * x        (x) object ('a', 0) ('a', 1) ('a', 2)

I would expect an index drop in the last case, too. I guess we need to check for scalars.

Review comment (Member) on this code from the diff:

    indexer, new_index = index.get_loc_level(
        label, level=list(range(len(label)))
    )
else:
    label = _asarray_tuplesafe(label)
    if label.ndim == 0:

this is where scalars end up -- probably need to add a clause here to handle MultiIndex
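
A minimal sketch of the kind of clause that could go there (illustrative only, assuming pd is pandas; not the code that was ultimately merged):

if label.ndim == 0:
    if isinstance(index, pd.MultiIndex):
        # A scalar label on a MultiIndex: select on the first level and let
        # pandas drop that level from the returned index, mirroring the
        # dict-based path above.
        indexer, new_index = index.get_loc_level(label.item(), level=0)
    else:
        indexer = index.get_loc(label.item())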

@tippetts commented:
Mind if I ask if this will get merged into master? It looks like a lot of work went into the pull request, and the discussion + passed checks lead me to believe it could be close to going in. Is there anything a third party can do to push it across the finish line?

@shoyer (Member) commented Jul 19, 2016

@tippetts Thanks for checking in. I actually hadn't noticed the last round of changes here, but I think this is good to go in now! Big thanks to @benbovy!!!

@shoyer merged commit 7a9e84b into pydata:master on Jul 19, 2016
@benbovy deleted the multi-index_select branch on July 19, 2016 10:46
@benbovy (Member, Author) commented Jul 19, 2016

Thanks again @shoyer for your reviews and comments!

Merging this PR broke the tests added in #881. I fixed that in #903.
