
applying multiple reductions to the same groups #194

Closed
keewis opened this issue Nov 28, 2022 · 2 comments
Labels
wontfix This will not be worked on

Comments

@keewis
Contributor

keewis commented Nov 28, 2022

I'm frequently trying to apply multiple reductions to the same groups (e.g. mean, count, min, max), and I wonder if there's any way the API could support that use case? I don't understand the algorithm well enough to tell whether this could be supported natively, or whether we'd basically add a wrapper that runs a new groupby per reduction.

I guess what I'm hoping for is for GroupBy objects to have something similar to pandas' .agg method.
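For reference, the pandas behaviour being alluded to looks roughly like this (a minimal illustration with made-up data; the column and group names are not from flox):

```python
import pandas as pd

df = pd.DataFrame(
    {"group": ["a", "a", "b", "b", "b"], "value": [1.0, 2.0, 3.0, 4.0, 5.0]}
)

# .agg applies several reductions to the same groups in a single call,
# so the grouping only has to be set up once
result = df.groupby("group")["value"].agg(["mean", "count", "min", "max"])
```
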

Edit: not sure whether it would be better to raise this on the xarray issue tracker instead?

@dcherian
Collaborator

dcherian commented Nov 28, 2022

It would be a decent bit of complexity to add, and I'm not inclined to add it.

There would be two advantages:

  1. The data are only factorized once, and the integer codes are reused.
  2. We could drastically reduce the number of tasks in the dask graph, at the cost of more complicated code. The number of tasks is reduced because a single task can calculate all the necessary intermediates for all reductions.

I'm not sure (1) is worth it, at least for xarray, because after pydata/xarray#7206 we will get this for free by calling each individual reduction method on a saved GroupBy object.
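A minimal sketch of that pattern (assuming the factorization caching from pydata/xarray#7206; the data and label names here are made up):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6.0),
    dims="x",
    coords={"label": ("x", ["a", "a", "b", "b", "b", "a"])},
)

# save the GroupBy object once; each reduction reuses the same groups
grouped = da.groupby("label")
means = grouped.mean()
maxes = grouped.max()
```
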

I'm not sure (2) is worth it, for a few reasons:

  1. It would mean that even to calculate only max, you would calculate every other reduction and then discard it.
  2. If you're writing the output to zarr, for example, you lose parallelism again.
  3. It could be an advantage to compute count only once and reuse it for both count and mean, but I'm not sure it's worth it. We could get this advantage instead by breaking up the current algorithm to compute count and sum separately for mean; the dask optimizer would then handle the shared count computation for us.
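To illustrate advantage (1) and item (3) above, here is a rough numpy-only sketch of factorizing once and reusing the integer codes and the count intermediate across reductions (not flox's actual implementation):

```python
import numpy as np

labels = np.array(["a", "a", "b", "b", "b"])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# factorize once: the integer codes can be reused by every reduction
uniques, codes = np.unique(labels, return_inverse=True)

counts = np.bincount(codes)                # the "count" intermediate
sums = np.bincount(codes, weights=values)  # the "sum" intermediate
means = sums / counts                      # mean reuses the shared count
```
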

@keewis
Contributor Author

keewis commented Nov 28, 2022

(1) sounds good to me, actually, so it's fine not to do this here (and I can just create a PR for GroupBy.agg or something similar).

Edit: basically, let's make use of pydata/xarray#7206

@dcherian dcherian added the wontfix This will not be worked on label Nov 28, 2022
@keewis keewis closed this as not planned (wontfix) Nov 29, 2022