Algorithm for multi-stage rechunking #89
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master      #89      +/-   ##
==========================================
+ Coverage   96.02%   96.35%   +0.32%
==========================================
  Files          11       11
  Lines         503      548      +45
  Branches      112      105       -7
==========================================
+ Hits          483      528      +45
  Misses         13       13
  Partials        7        7
rechunker/algorithm.py
Outdated
if max_mem < min_mem:  # basic sanity test
    raise ValueError(
        f"max_mem ({max_mem}) cannot be smaller than min_mem ({min_mem})"
    )
We might consider making this more strict, e.g., requiring max_mem < 2 * min_mem. If min_mem is only slightly smaller than max_mem, the algorithm will exit before achieving the min_mem objective due to the increasing IO op count. For example, I cannot exceed min_mem=350MB if max_mem=500MB in my ERA5 example.
I think you mean requiring min_mem < 2 * max_mem. I would support adding that constraint.
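For illustration, here is a minimal sketch of what a stricter check along these lines could look like, assuming the intent is that max_mem should leave at least a 2x margin over min_mem; the exact inequality is what is under discussion here, so treat this as hypothetical rather than the PR's code:

# Hypothetical stricter sanity check (sketch only, not the PR's actual code):
# require max_mem to be at least twice min_mem, so the algorithm has room to
# reduce the IO op count before running into the min_mem floor.
if max_mem < 2 * min_mem:
    raise ValueError(
        f"max_mem ({max_mem}) must be at least 2x min_mem ({min_mem})"
    )

Under such a check, the ERA5 case above (min_mem=350MB, max_mem=500MB) would be rejected up front rather than silently falling short of the min_mem objective.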
Thinking about this a little bit more, this may be a harder problem for the rechunker data model than I thought. The problem is that my algorithm does not guarantee that intermediate chunks can be written to Zarr. For example, consider stage 1 in my proposed rechunking plan above.

In Xarray-Beam and Dask, this is handled by producing temporary irregular chunks from the "split" step. This is fine in principle, but won't work for the current version of Rechunker, which stores intermediates in Zarr. I see a few possible ways to resolve this:
Thanks a lot for this Stephan! It will take me a couple of days to digest this.
This aligns the notion of rechunk "stage" more closely with a stage in multi-stage rechunking: pangeo-data/rechunker#89 PiperOrigin-RevId: 375538957
This aligns the notion of rechunk "stage" more closely with a stage in multi-stage rechunking: pangeo-data/rechunker#89 PiperOrigin-RevId: 375553383
I tested this in Xarray-Beam today, rechunking a single 1.3 TB float32 array from
Edit: you can find the example code for this over in
I am happy to rebase, but would anyone be up for reviewing this PR? The alternative would be to fork this code into Xarray-Beam.
@rabernat any thoughts here?
Hi Stephan! Sorry for letting this lie dormant for so long. Will definitely take a look this week. Thanks for your patience.
Thanks! I just merged in "master" so this should be ready for review.
Thanks so much for this PR Stephan! It looks incredibly clever, useful, and well tested.
I spent about an hour reviewing this PR. Maybe because it's late in the day, I was not able to fully understand everything that is going on at the algorithm level. In particular, I am not quite comprehending what gives rise to the issue of irregular intermediate chunks, which I believe is linked to the least-common-multiple calculation. My understanding is that the lcm calculation is only needed in calculate_single_stage_io_ops in order to figure out the number of i/o ops required in each stage. (The reason we didn't need this logic before was that we simply didn't care how many i/o ops were used to achieve the rechunking.) Source / target chunk pairs with small least common multiples can be copied with fewer i/o ops. Is that a correct interpretation?
So my main suggestion at this stage would be to inject a few more choice comments into the code which will help other developers recreate the mental model you're using to reason about this stuff. This would make the code more maintainable by others.
Other than that, it all looks great, and all tests pass. 🎉
Going forward, we should explore the irregular chunks approach. You should check out https://zarr.dev/zeps/draft/ZEP0003.html, which proposes supporting this at the Zarr level.
rechunker/algorithm.py
Outdated
# Add a small floating-point epsilon so we don't inadvertently
# round-down even chunk-sizes.
chunks = tuple(
    floor(rc ** (1 - power) * wc**power + epsilon)
Would love a comment explaining the math. That would help make the code more maintainable by others.
Actually, np.geomspace implements exactly what I was trying to do here. So I'll just use that instead!
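For readers following along, the equivalence is easy to check: the expression in the diff above is geometric interpolation between a read chunk size and a write chunk size, which is what np.geomspace computes directly. A small sketch with made-up sizes (illustrative only, not the PR's code):

# The hand-rolled rc ** (1 - power) * wc ** power interpolation matches
# np.geomspace; chunk sizes here are illustrative.
import numpy as np

rc, wc = 31, 350640          # e.g. read/write chunk sizes along one axis
num_stages = 4
powers = np.linspace(0, 1, num_stages + 1)

by_hand = rc ** (1 - powers) * wc ** powers
by_geomspace = np.geomspace(rc, wc, num=num_stages + 1)

assert np.allclose(by_hand, by_geomspace)
print(np.floor(by_geomspace).astype(int))  # candidate per-stage chunk sizes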
def _count_num_splits(source_chunk: int, target_chunk: int, size: int) -> int:
    multiple = lcm(source_chunk, target_chunk)
Can you help explain why the least-common multiple calculation is needed here? We managed to avoid these types of constraints (and the edge cases that come with them) in the original algorithm, and I'm not quite grokking why it's needed here. Clearly this is related to the irregular chunk problem.
This is just about counting how many operations would be required for a rechunk operation. I wrote a docstring with an example to help clarify.
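For intuition, here is a reconstruction of the lcm-based counting idea along a single axis (a sketch for illustration, not necessarily the PR's exact code); it reproduces the 1-D test cases quoted further down in this review:

# Count the pieces produced by cutting an axis of length `size` at every
# boundary of both the source-chunk grid and the target-chunk grid.
from math import ceil, gcd

def count_num_splits(source_chunk: int, target_chunk: int, size: int) -> int:
    multiple = source_chunk * target_chunk // gcd(source_chunk, target_chunk)  # lcm
    # Within one lcm-long period the two grids share no interior boundaries,
    # so the piece count is the number of chunks from each grid, minus one.
    splits_per_lcm = multiple // source_chunk + multiple // target_chunk - 1
    lcm_count, remainder = divmod(size, multiple)
    splits_in_remainder = (
        ceil(remainder / source_chunk) + ceil(remainder / target_chunk) - 1
        if remainder
        else 0
    )
    return lcm_count * splits_per_lcm + splits_in_remainder

# Matches the 1-D test cases below, e.g. (shape=100, source=43, target=51) -> 4
assert count_num_splits(43, 51, 100) == 4
assert count_num_splits(43, 1, 100) == 100

Source/target chunk pairs with a small lcm share many boundaries, so they produce fewer pieces and hence fewer IO ops, which matches the interpretation in the review comment above.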
raise AssertionError(
    "Failed to find a feasible multi-staging rechunking scheme satisfying "
    f"min_mem ({min_mem}) and max_mem ({max_mem}) constraints. "
    "Please file a bug report on GitHub: "
    "https://github.com/pangeo-data/rechunker/issues\n\n"
    "Include the following debugging info:\n"
    f"shape={shape}, source_chunks={source_chunks}, "
    f"target_chunks={target_chunks}, itemsize={itemsize}, "
    f"min_mem={min_mem}, max_mem={max_mem}, "
    f"consolidate_reads={consolidate_reads}, "
    f"consolidate_writes={consolidate_writes}"
)
Great error message!
    new_chunks : tuple
        The new chunks, size guaranteed to be <= max_mem
    """
    (stage,) = multistage_rechunking_plan(
Really appreciate how this was implemented with clean backwards compatibility.
((100,), (43,), (51,), 4),
((100,), (43,), (40,), 5),
((100,), (43,), (10,), 12),
((100,), (43,), (1,), 100),
I have the impression that there is a lot of intuition and understanding encoded into the above test cases. It would be nice to get more of that into comments to help other devs understand the logic.
added some brief comments :)
rechunker/algorithm.py
Outdated
def calculate_single_stage_io_ops(
    shape: Sequence[int], in_chunks: Sequence[int], out_chunks: Sequence[int]
) -> int:
    """Estimate the number of irregular chunks required for rechunking."""
I cannot reconcile the docstring with the function name. Where are "irregular chunks" being calculated?
removed the note about irregular chunks -- that's really unrelated.
This is a good point! The code as I wrote it was rather misleading -- the LCM logic is indeed just about counting IO ops. Irregular chunks only become necessary because of the very simple algorithm (geometric spacing) I use for picking intermediate chunk sizes in multi-stage plans. With some cleverness, this could likely be avoided in many cases.

The current rechunker algorithm does actually support "irregular chunks" in the form of overlapping reads of source arrays. So it's certainly possible to use this limited form of irregular chunking to support arbitrary rechunks, and I suspect with effort we could make it work for multi-stage rechunks, too.

The most obvious way to do this would be to simply pick some intermediate chunk size, and then run the existing rechunker algorithm recursively twice, for three total temporary Zarr arrays (or 7 or 15). This would be quite a bit less flexible than what I implemented here, though.
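To make the recursive alternative above concrete, here is a hypothetical sketch; the geometric-mean midpoint heuristic and the depth parameter are assumptions for illustration, not the PR's code:

# Hypothetical sketch of the recursive alternative described above: pick a
# midpoint chunk size per axis and recurse, so each half can be handled by
# the existing single-stage rechunker algorithm with a regular Zarr intermediate.
from math import floor, sqrt

def recursive_stages(source_chunks, target_chunks, depth):
    """Return a list of (in_chunks, out_chunks) single-stage rechunks."""
    if depth == 0:
        return [(source_chunks, target_chunks)]
    midpoint = tuple(
        max(1, floor(sqrt(s * t))) for s, t in zip(source_chunks, target_chunks)
    )
    return (
        recursive_stages(source_chunks, midpoint, depth - 1)
        + recursive_stages(midpoint, target_chunks, depth - 1)
    )

# depth=2 gives 4 stages and hence 3 temporary Zarr arrays; depth=3 gives 7.
print(recursive_stages((31, 721, 1440), (350640, 10, 10), depth=2))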
Fantastic. These comments are a big help. Thanks for this contribution.
As described in pangeo-data/rechunker#89, this can yield significant performance benefits for rechunking large arrays. PiperOrigin-RevId: 375836954
As described in pangeo-data/rechunker#89, this can yield significant performance benefits for rechunking large arrays. PiperOrigin-RevId: 518325665
Background
Rechunker's current "Push / Pull Consolidated" algorithm can be thought of as a mix of "split" and "combine" steps:

1. source_chunks -> read_chunks (to fill up max_mem per chunk)
2. If read_chunks != write_chunks: read_chunks -> int_chunks, then int_chunks -> write_chunks
3. write_chunks -> target_chunks
This is pretty clever, but can run into scalability issues. In particular, sometimes int_chunks must be very small, which results in a significant overhead for reading/writing small files (#80).

Proposal

I think the right way to fix this is to extend Rechunker's algorithm to allow for multiple split/combine stages -- as many as necessary to avoid creating tiny intermediate chunk sizes. This PR implements the math for such an algorithm, in a fully backwards compatible fashion. Users can control the number of stages via the new min_mem parameter, which specifies a minimum chunk size in bytes.

Multi-stage rechunking is not yet hooked up to any of Rechunker's executors. I'm proposing adding it to Rechunker because it reuses/modifies the existing code, and I need the math for rechunking inside Xarray-Beam (this was an easier way to explore writing the Beam executor part). Unfortunately, it isn't easy to add into Rechunker's existing model, which stores intermediate results in Zarr, because multi-stage rechunking (at least this version) requires irregular chunks.
Dask also does multi-stage rechunking, which brought significant efficiency gains (dask/dask#417). I considered copying Dask's rechunk planning algorithm here, but it involves a lot of complex logic, so I decided to try replacing it with a simple heuristic instead. See algorithm.py for details.

Example
My specific motivation for this PR is experimenting with rechunking on Pangeo's ERA5 single-level dataset, which contains a number of 1.5 TB variables (at least once decoded into float32). Rechunking these arrays with shape=(350640, 721, 1440) from "whole image" chunks (31, 721, 1440) to "whole time-series" chunks (350640, 10, 10) with Rechunker's current algorithm produces a very large number of small chunks. It works, but seems much slower than it should be.

I wrote a little Python script to compare Rechunker's current method (min_mem=0) with my proposed multi-stage method (min_mem=10MB) and Dask's rechunking method:

Comparing my new multi-stage algorithm (min_mem=10MB) to Rechunker's existing algorithm (min_mem=0), the multi-stage pipeline does two extra dataset copies, but reduces the number of IO operations by ~50x.

Comparing my new algorithm to Dask's algorithm, the plans actually look remarkably similar. My estimates suggest that my algorithm should involve about half the number of IO operations, but Dask's plan uses slightly "nicer" chunk sizes. I have no idea which is better in practice, and note that I'm using Dask's algorithm without adjusting any of the control knobs.

I have not yet benchmarked any of these algorithms on real rechunking tasks. Edit: see below for benchmarking results from Xarray-Beam, for which it significantly improves real-world performance.
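For reference, a hypothetical sketch of how such a comparison script might call the new planner. It assumes multistage_rechunking_plan takes the same arguments as the existing planner plus min_mem, and yields one (read_chunks, int_chunks, write_chunks) triple per stage; both assumptions are inferred from the diff excerpts above rather than confirmed here.

# Sketch only: assumed signature and return structure for
# multistage_rechunking_plan, based on the diff excerpts in this PR.
from rechunker.algorithm import multistage_rechunking_plan

stages = multistage_rechunking_plan(
    shape=(350640, 721, 1440),
    source_chunks=(31, 721, 1440),
    target_chunks=(350640, 10, 10),
    itemsize=4,                      # float32
    min_mem=10 * 2**20,              # 10 MB, enables multi-stage plans
    max_mem=500 * 2**20,             # 500 MB
)
for i, (read_chunks, int_chunks, write_chunks) in enumerate(stages):
    print(f"stage {i}: read={read_chunks} int={int_chunks} write={write_chunks}")

With min_mem=0 this should reduce to the existing single-stage plan, which is how the comparison above was set up.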