MultiIndex and data selection #767
Comments
Mmm, now I'm wondering if the problem I explained above isn't just related to the 3rd TODO item in #719 (make levels accessible as coordinate variables). Sorry for the post if that is the case.
This is a really good point that honestly I had not thought carefully about before. I agree that it would be very nice to have this behavior, though. This will require a bit of internal refactoring to pass on the level information to the MultiIndex during indexing.
To remove unused levels after unstacking, you need to add an explicit
I raised another issue for the bug related to copying MultiIndex that you had in the earlier version of this PR (#769).
More broadly, if you care about MultiIndex support, it would be great to get some help pushing it. I'm happy to answer questions, but I'm at a new job and don't have a lot of time to work on new development.
Thanks for the tip. So I finally obtain the desired result when selecting the band 'bar' by doing this:
But it's still a lot of code to write for such a common operation. I'd be happy to think more deeply about this and contribute to the development of this great package (within the limits of my skills)!
Thinking about this issue, I'd like to know what you think of the suggestions below before considering any pull request.
The following line of code gives the same result as in my previous comment, but it is more explicit and shorter:
da.unstack('band_wavenumber').sel(band='bar').dropna('wavenumber', how='any')
A nice shortcut to this would be adding a new
da.xs('bar', dim='band_wavenumber', level='band', drop_level=True)
Like pandas, the default value of
I think that this solution is better than, e.g., directly providing index level names as arguments of the
Another, though less elegant, solution would be to provide dictionaries to the
da.sel(band_wavenumber={'band': 'bar'})
Besides this, it would be nice if the
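The unstack / sel / dropna chain quoted above can be tried on a small made-up array. This is only a sketch: the data values and the helper name select_level are hypothetical (select_level is not an xarray API, it just wraps the one-liner to show what the proposed shortcut would do), and the array is built through a pandas Series because the original snippet is not preserved here.

```python
import pandas as pd
import xarray as xr

# Made-up data: band 'foo' has 2 points, band 'bar' has 3.
index = pd.MultiIndex.from_tuples(
    [("foo", 1.0), ("foo", 2.0), ("bar", 10.0), ("bar", 20.0), ("bar", 30.0)],
    names=["band", "wavenumber"],
)
series = pd.Series([0.1, 0.2, 1.0, 2.0, 3.0], index=index)

# 1-D array along a stacked 'band_wavenumber' dimension; dropna removes the
# NaN padding that stacking the full (band, wavenumber) grid introduces.
da = (
    xr.DataArray.from_series(series)
    .stack(band_wavenumber=("band", "wavenumber"))
    .dropna("band_wavenumber")
)


def select_level(arr, key, dim, level, drop_dim):
    """Hypothetical helper (not part of xarray): unstack, select, drop NaN."""
    return arr.unstack(dim).sel({level: key}).dropna(drop_dim, how="any")


da_bar = select_level(
    da, "bar", dim="band_wavenumber", level="band", drop_dim="wavenumber"
)
print(da_bar)
```

The result has a single 'wavenumber' dimension holding only the points of band 'bar'; the price is an intermediate unstacked array padded with NaN, which is what a dedicated selection method could avoid.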
OK, I've read the discussion you referred to more carefully, and now I understand why it is preferable to call
The
da.xs('bar', dim='band_wavenumber', level='band', dropna=True)
The good news about writing our own custom way to select levels is that, because we can avoid the stack/unstack, we can simply omit unused levels without worrying about doing
I would be OK with
Last year at the SciPy conference sprints, @jonathanrocher was working on adding similar dictionary support into
This is a fair point, but such scenarios are unlikely to appear in practice. We might be able to, for example, update our handling of MultiIndexes to guarantee that level names cannot conflict with other variables. This might be done by inserting dummy-variables of some sort into the
Yes, agreed. Unfortunately the pandas code that handles this is a complete mess of spaghetti code (see pandas/core/indexing.py). So you are welcome to try decoding it, but in my opinion you might be better off starting from scratch. In xarray, the function convert_label_indexer would need an updated interface that allows it to possibly return a new
From this point of view I agree that
Unless I'm missing a better solution, we can use
will a priori return a stacked
If you try doing that indexing with a pandas.Series, you actually get an error message:
I guess it's also worth investigating
[Edited for more clarity]
First of all, I find the MultiIndex very useful and I'm looking forward to seeing the TODOs in #719 implemented in the next releases, especially the first three in the list!
Apart from these issues, I think that some other aspects may be improved, notably regarding data selection. Or maybe I've not correctly understood how to deal with multi-index and data selection...
To illustrate this, I use some fake spectral data with two discontinuous bands of different length / resolution:
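The original snippet is not preserved here. As a stand-in, here is a minimal sketch of what such data could look like, assuming two bands named 'foo' and 'bar' with invented wavenumber ranges and values (all names and numbers below are hypothetical):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical bands on disjoint wavenumber ranges, with different
# lengths and resolutions: 'foo' has 4 points, 'bar' has 3.
wn_foo = np.linspace(4000.0, 4030.0, 4)  # step of 10
wn_bar = np.linspace(4100.0, 4140.0, 3)  # step of 20

index = pd.MultiIndex.from_arrays(
    [
        ["foo"] * wn_foo.size + ["bar"] * wn_bar.size,
        np.concatenate([wn_foo, wn_bar]),
    ],
    names=["band", "wavenumber"],
)
series = pd.Series(np.arange(1.0, len(index) + 1.0), index=index, name="spectrum")

# 2-D (band, wavenumber) view: every band/wavenumber pair that is absent
# from the data is padded with NaN.
da_unstacked = xr.DataArray.from_series(series)

# 1-D view along a single 'band_wavenumber' dimension backed by a MultiIndex;
# dropping the NaN padding leaves only the 7 real samples.
da = (
    da_unstacked
    .stack(band_wavenumber=("band", "wavenumber"))
    .dropna("band_wavenumber")
)
print(da)
```

Going through a pandas Series is a choice made for this sketch; the array could equally be built directly from a MultiIndex coordinate.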
I extract the band 'bar' using sel:
It selects the data the way I want, although using the dimension name is confusing in this case. It would be nice if we could also use the MultiIndex level names as arguments of the sel method, even though I don't know if it is easy to implement.
Furthermore, da_bar still has the 'band_wavenumber' dimension and the 'band' index level, but they are not very useful anymore. Ideally, I would like to obtain a DataArray object with a 'wavenumber' dimension / coordinate and the 'bar' band name dropped from the multi-index, i.e., something that would require automatic index-level removal and/or automatic unstacking when selecting data.
Extracting the band 'bar' from the pandas Series object gives something closer to what I need (see below), but using pandas is not an option as my spectral data involves other dimensions (e.g., time, scans, iterations...) not shown here for simplicity.
The problem is also that the unstacked DataArray object resulting from the selection has the same dimensions and size as the original, unstacked DataArray object. The only difference is that unselected values are replaced by nan.
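For comparison, here is a minimal sketch of the pandas behaviour referred to above, on made-up values. It uses Series.xs, which is one way to extract a band in pandas and may not be the call used in the original post: by default (drop_level=True) the selected level is dropped from the result.

```python
import pandas as pd

# Made-up data: band 'foo' has 2 points, band 'bar' has 3.
index = pd.MultiIndex.from_tuples(
    [("foo", 1.0), ("foo", 2.0), ("bar", 10.0), ("bar", 20.0), ("bar", 30.0)],
    names=["band", "wavenumber"],
)
series = pd.Series([0.1, 0.2, 1.0, 2.0, 3.0], index=index)

# Selecting one band drops the 'band' level: the result is indexed by
# 'wavenumber' only and contains just the points of that band.
s_bar = series.xs("bar", level="band")

# With drop_level=False the two-level MultiIndex is kept.
s_bar_full = series.xs("bar", level="band", drop_level=False)

print(s_bar)
print(s_bar_full)
```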