Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Multidimensional bins #59

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Jun 7, 2021

Work-in-progress attempt to implement #28 . Builds atop #49.

  • Allows bins to be xr.DataArrays
  • Aligns and reshapes the bins the same way as the other arrays
  • Promotes bins from a single keyword argument to a list of arrays similar to data or weights
  • Passes them down into _bincount_2d_vectorized
  • Applies np.digitize in a loop over the broadcast dimension using different bins (can't apply digitize directly to the whole array because it only accepts 1D bins input)
  • Reconstructs the bin coordinates on the output in an ND manner

However my test for checking the results of a 2D bins arrays doesn't actually pass and I haven't yet worked out why :/ It generates the right shape but the wrong counts.

(There are also a couple of other failing tests but I think they are just to do with input sanitization edge cases)

@aaronspring

@TomNicholas
Copy link
Member Author

I've realised that this feature is kind of inconvenient to try and implement in xhistogram in its current form.

xhistogram's numpy API means that all ND bins have to be passed as pure numpy arrays through xhistogram.core.histogram, even when the bins are originally given as xr.DataArrays. To use those bins (in blockwise or plain _bincount) you have to be sure which axis is which, but that requires either passing already-broadcast bins arrays (breaking the API), providing extra information as to which axes are which on the bins (again breaking the API), or automatically broadcasting the numpy arrays inside xhistogram.core.histogram (possible but error-prone and frustrating when I would prefer to simply call xr.broadcast_like). The only reason to do even do this is to satisfy xhistogram's pure-numpy API (which I'm not sure anyone even uses).

Instead, IMO it would be much easier (/a better use of time) to implement this feature in xarray directly (either in pydata/xarray#5400 or in a later PR). That's because in a pure xarray implementation there would be no need to have an internal function accepting only unlabelled numpy arrays: as you always know which axis on your given bins array is which you can grab the indices from the dim labels as needed and pass them straight to dask.array.blockwise.

I like to think that this PR will be useful as a template for the xarray implementation though.

@aaronspring FYI

@TomNicholas
Copy link
Member Author

TomNicholas commented Jun 15, 2021

For future reference this SO discussion is about whether there are clever vectorized ways to do the bincounting when the weights are >1D.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants