Rechunk derived #6516
Conversation
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #6516 +/- ##
==========================================
+ Coverage 89.80% 89.87% +0.07%
==========================================
Files 90 90
Lines 23752 23927 +175
Branches 4418 4463 +45
==========================================
+ Hits 21331 21505 +174
+ Misses 1672 1670 -2
- Partials 749 752 +3
Update 2025-06-18: Thanks @trexfeathers @stephenworsley for offline discussions on this. I'm happy that the tests now give full code coverage, so I'm marking this ready for review.
⏱️ Performance Benchmark Report: c617e86
Performance shifts
Full benchmark results
Generated by GHA run
Closes #6404
Automatic rechunking of derived coordinates
Investigation of the #6404 problem reveals that the points/bounds arrays of our derived (aka factory) coords are mostly single chunks, which can therefore be very large.
This is because they tend to be a broadcast product of several simple coords (dim or aux), each spanning a different dim or two, which are themselves quite small and so tend to be single chunks.
When these are broadcast together, the result tends to be one massive chunk, which can blow memory.
For example:
a result formed like A * B * C, where A, B and C each span only one or two of the dims (T, Z, Y, X), so all are relatively small and can be single chunks.
Say NT, NZ, NY, NX = 100, 70, 1000, 500.
Then the result has 100 * 70 * 1000 * 500 = 3.5 G points.
If the element size is a typical 4 bytes and the dask chunksize is a typical 200 MiB, then we expect chunks of ~50 M array elements.
An array of this size, loaded from an input netCDF file, might get chunked as (1, 70, 1000, 500), i.e. ~35 M elements or ~140 MB per chunk.
But our derived coord holds the whole array, 3,500 M elements, i.e. ~14 GB, in a single chunk.
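To make this concrete, here is a minimal illustration using plain dask (not Iris internals; the factor shapes are hypothetical): broadcasting three small single-chunk factors reproduces the single ~14 GB chunk described above.

```python
import dask.array as da
import numpy as np

nt, nz, ny, nx = 100, 70, 1000, 500

# Hypothetical factor arrays, each spanning only some of (T, Z, Y, X):
# each is small, so dask happily holds it as a single chunk.
a = da.ones((nt, 1, 1, 1), dtype=np.float32)   # time-dependent term
b = da.ones((1, nz, 1, 1), dtype=np.float32)   # vertical term
c = da.ones((1, 1, ny, nx), dtype=np.float32)  # horizontal term

derived = a * b * c
print(derived.shape)        # (100, 70, 1000, 500) -> 3.5e9 points
print(derived.npartitions)  # 1 : the whole ~14 GB array in a single chunk
```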
It seems likely that this problem has been noticed more recently because, since #5369, we now have derived coordinates which are time-dependent, which multiplies up the total size where previously it did not.
However, even before this, we were potentially multiplying e.g. the size of a field by the number of model levels, which already led to single-chunk arrays larger than ideal. Typical numbers: 70 * 1024 * 768 * 4 bytes ≈ 220 MB, already reaching the standard dask chunksize of ~200 MiB (so hi-res or double-resolution fields will clearly exceed it).
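For context, here is a minimal sketch of the general kind of fix, using plain dask rather than this PR's actual code: asking dask to re-derive "auto" chunks splits the oversized broadcast result so that each chunk respects the configured chunk-size limit.

```python
import dask
import dask.array as da
import numpy as np

with dask.config.set({"array.chunk-size": "200MiB"}):
    # A single-chunk broadcast result like the ~14 GB example above.
    big = da.ones((100, 70, 1000, 500), chunks=(100, 70, 1000, 500),
                  dtype=np.float32)
    # "auto" re-splits the array to respect the configured chunk-size limit.
    fixed = big.rechunk("auto")
    print(big.npartitions, fixed.npartitions)  # 1 -> many smaller chunks
    print(fixed.chunksize)                     # each chunk now ~200 MiB or less
```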
Todo: