
Commit 105c22d

more edits
1 parent 44a8f02 commit 105c22d

1 file changed

src/posts/flox-smart/index.md

Lines changed: 41 additions & 33 deletions
@@ -14,57 +14,61 @@ summary: 'flox adds heuristics for automatically choosing an appropriate strateg
## What is flox?

[`flox` implements](https://flox.readthedocs.io/) grouped reductions for chunked array types like [cubed](https://cubed-dev.github.io/cubed/) and [dask](https://docs.dask.org/en/stable/array.html) using tree reductions.
Tree reductions ([example](https://people.csail.mit.edu/xchen/gpu-programming/Lecture11-reduction.pdf)) are a parallel-friendly way of computing common reduction operations like `sum`, `mean`, etc.
Without flox, Xarray effectively shuffles (sorts the data to extract all values in a single group) and then runs the reduction group-by-group.
Depending on data layout, or "chunking", this shuffle can be quite expensive.
![shuffle](https://flox.readthedocs.io/en/latest/_images/new-split-apply-combine-annotated.svg)
With flox installed, Xarray instead uses its parallel-friendly tree reduction.
In many cases, this is a massive improvement.
Notice how much cleaner the graph is in this image:
![map-reduce](https://flox.readthedocs.io/en/latest/_images/new-map-reduce-reindex-True-annotated.svg)
See our [previous blog post](https://xarray.dev/blog/flox) for more.

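For example, with flox installed the speedup applies to ordinary Xarray code with no API changes. Here is a minimal sketch on a synthetic monthly dataset (the names and sizes are made up for illustration, and dask is assumed to be available for chunking):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic example: 10 years of monthly means on a small grid,
# chunked to 4 time steps per chunk.
time = pd.date_range("2000-01-01", periods=120, freq="MS")
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(120, 45, 90))},
    coords={"time": time},
).chunk({"time": 4})

# With flox installed, Xarray computes this grouped reduction with a
# tree reduction instead of a shuffle; the user-facing code is unchanged.
monthly_climatology = ds.groupby("time.month").mean()
```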
Two key realizations influenced the development of flox:

1. Array workloads frequently group by a relatively small in-memory array. Quite often those arrays have patterns to their values, e.g. `"time.month"` is exactly periodic, `"time.dayofyear"` is approximately periodic (depending on calendar), and `"time.year"` is commonly a monotonically increasing array.
2. Chunk sizes (or "partition sizes") for arrays can be quite small along the core dimension of an operation. This is an important difference between arrays and dataframes!

These two properties are particularly relevant for "climatology" calculations (e.g. `groupby("time.month").mean()`), a common Xarray workload in the Earth Sciences.

## Tree reductions can be catastrophically bad

Consider `ds.groupby("time.year").mean()`, or the equivalent `ds.resample(time="Y").mean()`, for a 100-year-long dataset of monthly averages with a chunk size of **1** (or **4**) along the time dimension.
This is a fairly common format for climate model output.
The small chunk size along time is offset by much larger chunk sizes along the other dimensions, commonly horizontal space (`x, y` or `latitude, longitude`).

A naive tree reduction would accumulate all averaged values into a single output chunk of size 100, one value per year for 100 years.
Depending on the chunking of the input dataset, this may overload the final worker's memory and fail catastrophically.
More importantly, there is a lot of wasteful communication: computing on the last year of data is completely independent of computing on the first year of the data, and there is no reason the results for the two years need to reside in the same output chunk.
This issue does not arise for regular reductions, where the final result depends on the values in all chunks and all data along the reduced axes are reduced down to one final value.

## Avoiding catastrophe

Thus `flox` quickly grew two new modes of computing the groupby reduction.

First, `method="blockwise"`, which applies the grouped reduction in a blockwise fashion.
This is great for `resample(time="Y").mean()` where we group by `"time.year"`, which is a monotonically increasing array.
With an appropriate (and usually quite cheap) rechunking, the problem is embarrassingly parallel.
![blockwise](https://flox.readthedocs.io/en/latest/_images/new-blockwise-annotated.svg)

Second, `method="cohorts"`, which is a bit more subtle.
Consider `groupby("time.month")` for the monthly mean dataset, i.e. grouping by an exactly periodic array.
When the chunk size along the core dimension "time" is a divisor of the period (so either 1, 2, 3, 4, or 6 in this case), groups tend to occur in cohorts ("groups of groups").
For example, with a chunk size of 4, monthly mean input data for the "cohort" Jan/Feb/Mar/Apr are _always_ in the same chunk, and totally separate from any of the other months.
Here is a schematic illustration where each month is represented by a different shade of red:
![monthly cohorts](https://flox.readthedocs.io/en/latest/_images/cohorts-month-chunk4.png)
This means that we can run the tree reduction for each cohort (three cohorts in total: `JFMA | MJJA | SOND`) independently and expose more parallelism.
Doing so can significantly reduce compute times and, in particular, the memory required for the computation.

Importantly, if there isn't much separation of groups into cohorts (for example, if the groups are randomly distributed), then it's hard to do better than the standard `method="map-reduce"`.

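If you want to experiment with a specific strategy rather than rely on the automatic choice, flox's `xarray_reduce` accepts a `method` argument. Below is a hedged sketch reusing the synthetic `ds` from the earlier snippet; the keywords follow flox's documented API, but check the docs for the version you have installed.

```python
import flox.xarray

# Rechunk so that each year's 12 monthly values sit in one chunk; the
# yearly reduction then becomes embarrassingly parallel ("blockwise").
yearly = flox.xarray.xarray_reduce(
    ds.chunk({"time": 12}),
    ds.time.dt.year.rename("year"),
    func="mean",
    method="blockwise",
)

# With 4 time steps per chunk, the months fall into the cohorts
# JFMA | MJJA | SOND, so "cohorts" exposes extra parallelism.
monthly = flox.xarray.xarray_reduce(
    ds,
    ds.time.dt.month.rename("month"),
    func="mean",
    method="cohorts",
)
```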
## Choosing a strategy is hard, and harder to teach.

These strategies are great, but the downside is that some sophistication is required to apply them.
Worse, they are hard to explain conceptually! I've tried! ([example 1](https://discourse.pangeo.io/t/optimizing-climatology-calculation-with-xarray-and-dask/2453/20?u=dcherian), [example 2](https://discourse.pangeo.io/t/understanding-optimal-zarr-chunking-scheme-for-a-climatology/2335)).

What we need is a way to choose the appropriate strategy automatically.

## Problem statement

@@ -103,7 +107,7 @@ I use set _containment_, or a "normalized intersection", to determine the simila
```
C = |Q ∩ X| / |Q| ≤ 1; (∩ is set intersection)
```

Unlike [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index), _containment_ [isn't skewed](http://ekzhu.com/datasketch/lshensemble.html) when one of the sets is much larger than the other.

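As a toy illustration (not flox code), here is how the two measures differ when a small set is fully contained in a much larger one:

```python
# q: chunks touched by a small group; x: chunks touched by a much larger group.
q = {1, 2, 3}
x = set(range(100))

jaccard = len(q & x) / len(q | x)   # 3 / 100 = 0.03, skewed by the size of x
containment = len(q & x) / len(q)   # 3 / 3 = 1.0, q lies entirely within x
```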
The steps are as follows:

@@ -114,33 +118,36 @@ The steps are as follows:
1. Use `"blockwise"` when every group is contained in a single block.
1. Use `"cohorts"` when every chunk only has a single group, but that group might extend across multiple chunks.
1. [and more](https://github.com/xarray-contrib/flox/blob/e6159a657c55fa4aeb31bcbcecb341a4849da9fe/flox/core.py#L408-L426)
1. Now invert `S` to compute an initial set of cohorts whose groups are in exactly the same chunks (this is another groupby!).
   Later we will want to merge the detected cohorts when they occupy _approximately_ the same chunks, using the containment metric.
1. For that we first quickly compute containment for all groups `i` against all other groups `j` as `C = S.T @ S / number_chunks_per_group` (see the sketch after this list).
1. To choose between `"map-reduce"` and `"cohorts"`, we need a summary measure of the degree to which the labels overlap with each other.
   We use _sparsity_: the number of non-zero elements in `C` divided by the total number of elements, `C.nnz / C.size`.
   When sparsity is relatively high, we use `"map-reduce"`; otherwise we use `"cohorts"`.
1. If the sparsity is high enough, we merge similar cohorts using a for-loop.
1. Finally we execute one tree reduction per cohort and concatenate the results.

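To make these steps concrete, here is a dense NumPy re-creation of the key quantities for the monthly example with a chunk size of 4. This is illustrative only: flox itself uses sparse matrices and handles far more general label patterns; the variable names simply mirror the description above.

```python
import numpy as np

# Monthly labels (1..12) for 10 years of data, 4 time steps per chunk.
labels = np.tile(np.arange(1, 13), 10)       # group label of each time step
chunks = np.arange(labels.size) // 4         # chunk index of each time step

ngroups, nchunks = 12, chunks.max() + 1
S = np.zeros((nchunks, ngroups), dtype=np.int64)
S[chunks, labels - 1] = 1                    # S[chunk, group] = 1 if group occurs in chunk

number_chunks_per_group = S.sum(axis=0)      # |Q| for each group
C = (S.T @ S) / number_chunks_per_group      # pairwise containment between groups
sparsity = np.count_nonzero(C) / C.size      # the summary measure flox thresholds on

print(round(sparsity, 2))  # 0.33 here: three 4x4 blocks of ones, i.e. JFMA | MJJA | SOND
```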
For more detail [see the docs](https://flox.readthedocs.io/en/latest/implementation.html#heuristics) or [the code](https://github.com/xarray-contrib/flox/blob/e6159a657c55fa4aeb31bcbcecb341a4849da9fe/flox/core.py#L336).
Suggestions and improvements are very welcome!

Here is `C` for a range of chunk sizes from 1 to 12, for computing `groupby("time.month")` of a monthly mean dataset; the title of each panel is (chunk size, sparsity).
![flox sparsity image](https://flox.readthedocs.io/en/latest/_images/containment.png)

Given the above `C`, flox will choose:

1. `"blockwise"` for chunk size 1,
2. `"cohorts"` for (2, 3, 4, 6, 12),
3. and `"map-reduce"` for the rest.

Cool, isn't it?!

Importantly, this inference is fast: [400ms for the US county](https://flox.readthedocs.io/en/latest/implementation.html#example-spatial-grouping) GroupBy problem in our [previous post](https://xarray.dev/blog/flox)!
But we have not tried it with bigger problems (for example, grouping by 100,000 watersheds in the US).

## What's next?

flox's ability to do such inferences relies entirely on the input chunking, a major user-tunable knob.
A recent Xarray feature makes such rechunking a lot easier for time grouping:

```python
from xarray.groupers import TimeResampler

rechunked = ds.chunk(time=TimeResampler("YE"))
```

will rechunk so that a year of data is in a single chunk.

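A quick sanity check, assuming monthly data like the synthetic example above (so each yearly chunk should hold 12 time steps):

```python
print(rechunked.chunksizes["time"][:3])  # e.g. (12, 12, 12)
```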
Even so, it would be nice to automatically rechunk to minimize the number of cohorts detected, or to a perfectly blockwise application when that's cheap.
A challenge here is that we have lost _context_ when moving from Xarray to flox.
The string `"time.month"` tells Xarray that I am grouping by a perfectly periodic array with period 12; similarly
the _string_ `"time.dayofyear"` tells Xarray that I am grouping by a (quasi-)periodic array with period 365, and that group `366` may occur occasionally (depending on calendar).
But Xarray passes flox an array of integer group labels `[1, 2, 3, 4, 5, ..., 1, 2, 3, 4, 5]`.
It's hard to infer the context from that!
_[Get in touch](https://github.com/xarray-contrib/flox/issues) if you have ideas for how to do this inference._

One way to preserve context may be to have Xarray's new Grouper objects report ["preferred chunks"](https://github.com/pydata/xarray/blob/main/design_notes/grouper_objects.md#the-preferred_chunks-method-) for a particular grouping.
This would allow a downstream system like `flox` or `cubed` or `dask-expr` to take this into account later (or even earlier!) in the pipeline.
That is an experiment for another day.
