groupby_bins: exclude bin or assign bin with nan when bin has no values #1019

Closed
byersiiasa opened this issue Sep 29, 2016 · 10 comments · Fixed by #1027

Comments

@byersiiasa

When using groupby_bins, there are cases where no values are found for some of the specified bins. Currently, it appears that in these cases the bin is skipped: neither a value nor a bin entry is added to the output DataArray.

Is there a way to identify which bins have been skipped? Or, preferably, is it possible to have an option to include those bins, but with NaN values? This would make comparing two DataArrays easier in cases where, despite the same bin intervals as inputs, the outputs are DataArrays with different variable and coordinate lengths.

import xarray as xr
var = xr.open_dataset('c:\\users\\saveMWE.nc')
pop = xr.open_dataset('c:\\users\\savePOP.nc')
# binns includes a very small bin to test this
binns = [-100, -50, 0, 50, 50.00001, 100]
binned = pop.p2010T.groupby_bins(var.EnsembleMean, binns).sum()
print(binned)
print(binned.EnsembleMean_bins)

In this case, no data falls in the 4th bin between 50 and 50.00001.

<xarray.DataArray 'p2010T' (EnsembleMean_bins: 4)>
array([  2.64352214e+09,   3.46869168e+09,   3.08998110e+08,
         1.48247440e+07])
Coordinates:
  * EnsembleMean_bins  (EnsembleMean_bins) object '(0, 50]' '(-50, 0]' ...
<xarray.DataArray 'EnsembleMean_bins' (EnsembleMean_bins: 4)>
array(['(0, 50]', '(-50, 0]', '(51, 100]', '(-100, -50]'], dtype=object)

Obviously one can count the lengths but this doesn't indicate which bin was skipped. An option to include the empty bin with a nan value would be useful! Thanks
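
To make the "which bin was skipped" question concrete, here is a minimal pure-Python sketch that lists the bins that receive no values. It is not xarray's implementation. It assumes right-closed intervals such as `(0, 50]`, which matches the labels in the output above, and the `sample` values and the helper names `bin_index` and `empty_bins` are hypothetical.

```python
from bisect import bisect_left


def bin_index(value, edges):
    """Return i such that edges[i] < value <= edges[i + 1], or None if out of range."""
    j = bisect_left(edges, value)  # first position with edges[j] >= value
    if 1 <= j <= len(edges) - 1:
        return j - 1
    return None


def empty_bins(values, edges):
    """Return the (left, right) edges of every bin that received no values."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bin_index(v, edges)
        if i is not None:
            counts[i] += 1
    return [(edges[i], edges[i + 1]) for i, c in enumerate(counts) if c == 0]


binns = [-100, -50, 0, 50, 50.00001, 100]
sample = [-75.0, -10.0, 25.0, 50.0, 80.0]  # hypothetical data, not from the issue's files
skipped = empty_bins(sample, binns)
print(skipped)
```

With this sample, only the tiny bin between 50 and 50.00001 is reported as empty.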

bins_example.zip

@rabernat
Contributor

Just to understand better, what is the advantage to having this empty bin? How would you use that feature?

As is, the resulting Dataset can still be aligned with other bin objects that have different coordinates (i.e. a non-empty final bin).

@byersiiasa
Author

So if I plot the current output as a bar chart/histogram, that bin interval will be skipped. For example, if I did:
plt.plot(binns[:-1], binned)  # using left edges of the bins
I would get an error if a bin present in binns has been skipped in binned.

I guess that perhaps there is a cleverer way of plotting the output data than this.

This leads to more important questions:

  1. Do you know the logic behind the ordering of the binned data and the bin objects? In this example, the bins input is monotonically increasing, but the order of the output bin object does not match it, e.g.
binns = [-100, -50, 0, 50, 50.00001, 100]
array(['(0, 50]', '(-50, 0]', '(51, 100]', '(-100, -50]'], dtype=object)
  2. Does the order of output values in the summed array (binned) correspond to the input bins or to the output bin object? If the latter, how do I reorder the data to match the monotonically increasing input bins array?
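
For the reordering question, one possible workaround is sketched below in plain Python. It assumes that each value is paired with the label at the same position, and that labels are strings of the form `'(left, right]'` as printed above. The helper name `left_edge` is hypothetical. The labels and values are copied from the output earlier in this issue.

```python
def left_edge(label):
    """Parse the left edge from a label such as '(-50, 0]'."""
    return float(label.strip("()[]").split(",")[0])


labels = ['(0, 50]', '(-50, 0]', '(51, 100]', '(-100, -50]']
values = [2.64352214e+09, 3.46869168e+09, 3.08998110e+08, 1.48247440e+07]

# Sort positions by left edge, then apply the same order to labels and values.
order = sorted(range(len(labels)), key=lambda i: left_edge(labels[i]))
sorted_labels = [labels[i] for i in order]
sorted_values = [values[i] for i in order]

for label, value in zip(sorted_labels, sorted_values):
    print(label, value)
```

Sorting the positions rather than the labels themselves keeps each value attached to its own bin.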

Thanks

@rabernat
Contributor

The sorting of bins should have been fixed in #952.

@rabernat
Contributor

As for the empty bins, I can see how this would be useful. I suppose it is a bug. Curious what @shoyer thinks about this case...

@byersiiasa
Author

@rabernat I don't have much capability to help, but if any changes are made I am happy to help test this particular case.

@rabernat
Contributor

rabernat commented Sep 29, 2016

For now, can you just confirm what version of xarray you are using (xarray.__version__)?

I'm not sure if #952 has been released yet, but if you are using the latest master, that should at least fix the sorting issue.

@byersiiasa
Author

0.8.2 updated from conda a few days ago. I'll try the master. Thanks

@shoyer
Member

shoyer commented Sep 29, 2016

We actually already have some similar logic for ensuring that all resampled bins appear (see GroupBy._maybe_restore_empty_groups). If we set full_index = binned.categories in GroupBy.__init__, I think that should take care of it.
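
The general idea of restoring empty groups against a full index can be sketched in plain Python as follows. This is only an analogy and not the code in xarray: the function name `restore_empty_groups`, the labels, and the values are all hypothetical.

```python
import math


def restore_empty_groups(result, full_index):
    """Reindex a {label: value} result onto full_index, filling missing labels with NaN."""
    return [result.get(label, math.nan) for label in full_index]


# Hypothetical full set of bin labels, in increasing order.
full_index = ['(-100, -50]', '(-50, 0]', '(0, 50]', '(50, 50.00001]', '(50.00001, 100]']

# Hypothetical aggregated result in which the tiny fourth bin was dropped.
partial = {
    '(0, 50]': 3.0,
    '(-50, 0]': 2.0,
    '(50.00001, 100]': 4.0,
    '(-100, -50]': 1.0,
}

restored = restore_empty_groups(partial, full_index)
print(restored)
```

The restored list always has one entry per bin in `full_index`, so it lines up with the input bins regardless of which bins were empty.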

@byersiiasa
Author

Thanks @shoyer and @rabernat. @gidden and I may have a go next week.
Otherwise, if someone wants to jump in, I made a notebook to test/demonstrate the issue.
groupby_bins_test_nb.zip

rabernat added a commit to rabernat/xarray that referenced this issue Oct 2, 2016
shoyer pushed a commit that referenced this issue Oct 3, 2016
* Fix a typo

* fixes #1019

* fixed tiny bug

* changed default

* got rid of keyword arg
@byersiiasa
Author

byersiiasa commented Oct 3, 2016

@rabernat @shoyer thank you very much - (at least for my purposes) this appears to be working well.
