Method to make uniformly chunked Dask Arrays #3302
Comments
The current state would be to use rechunk and specify a uniform chunk size manually:
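(A minimal illustrative sketch of that manual approach; the shapes and chunk sizes below are made up.)

```python
import dask.array as da

# An array whose chunks become uneven after slicing.
x = da.ones((10000, 10000), chunks=(1000, 1000))
y = x[1:, 1:]  # chunks along each axis are now (999, 1000, ..., 1000)

# Manually pick a uniform chunk size and rechunk to it.
z = y.rechunk((1000, 1000))
```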
What I'm hearing here is to determine what the uniform chunk size should be automatically?
Lately I've been mindlessly doing something along those lines. More generally, past experiences with chunking inform me that there are wrong answers, but the differences between "right" answers aren't too noticeable. Though this could be specific to what I have been doing. Of course there are lots of things that the approaches above are not taking into account. Mainly I am looking for a baked-in heuristic that makes things a little easier most of the time.
OK, taking hints from existing chunking seems reasonable to me. Presumably we might look at something like the 75th percentile of chunk size in bytes, some sort of average aspect ratio, and then look for nice round numbers?
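(A rough sketch of the kind of heuristic being discussed; the helper below is hypothetical, not dask API, and for simplicity it works on chunk lengths per axis rather than chunk sizes in bytes.)

```python
import numpy as np
import dask.array as da

def suggest_uniform_chunks(x, percentile=75, multiple=100):
    """Hypothetical heuristic: per axis, take roughly the 75th percentile of
    the existing chunk lengths and round it to a 'nice' multiple (here, of 100)."""
    suggestion = []
    for size, axis_chunks in zip(x.shape, x.chunks):
        target = np.percentile(axis_chunks, percentile)
        nice = max(multiple, int(round(target / multiple)) * multiple)
        suggestion.append(min(nice, size))  # never exceed the axis length
    return tuple(suggestion)

x = da.ones((10000, 10000), chunks=(1000, 1000))[1:, 1:]  # uneven chunks
y = x.rechunk(suggest_uniform_chunks(x))                  # back to ~1000 x 1000 chunks
```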
I would propose that rechunk takes a variety of possible inputs, including some typical, pluggable, named rechunking scenarios. I don't think there is any "best" heuristic that we can hope to achieve, since that would depend heavily on the particulars in each case. In particular, "least work" (smallest task graph) would be a hard metric to optimise for.

Additionally, I would make a …

The rechunking methods I would start with: …
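(As a rough illustration of the "pluggable, named rechunking scenarios" idea, a hypothetical sketch; the registry, the strategy names, and `rechunk_by_name` are made up for illustration and are not dask API.)

```python
import dask.array as da

# Hypothetical registry of named rechunking strategies.
RECHUNK_STRATEGIES = {}

def register_strategy(name):
    def decorator(func):
        RECHUNK_STRATEGIES[name] = func
        return func
    return decorator

@register_strategy("uniform-max")
def _uniform_max(x):
    # Use the largest existing chunk length on each axis everywhere.
    return x.rechunk(tuple(max(c) for c in x.chunks))

@register_strategy("single-chunk")
def _single_chunk(x):
    # One chunk per axis (only sensible for small arrays).
    return x.rechunk(x.shape)

def rechunk_by_name(x, strategy="uniform-max"):
    return RECHUNK_STRATEGIES[strategy](x)

y = rechunk_by_name(da.ones((10000,), chunks=(1000,))[1:], "uniform-max")
```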
Very thorough! Thanks for giving this some thought, @martindurant. Need to mull over this a bit. Generally agree though.
Ah, good point in the linked issues, that sometimes you would want the existing chunking scheme, if there is one. I was thinking of to_zarr as always overwriting/creating. |
In the process of applying some operations to Dask Arrays, chunk sizes can become somewhat heterogeneous. However, storage applications tend to prefer more homogeneous chunk sizes. While it is possible to store a Dask Array to a storage format with different chunk sizes, this typically comes at the penalty of needing locking. Certainly one can rechunk the Dask Array manually. However, it would be nice to have a method on Dask Arrays that has some better ideas as to what sort of rechunking will come with a low penalty. This method could then determine some reasonable homogeneous chunk size and rechunk the Dask Array to it.
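(To make the locking penalty concrete, a sketch using `da.store` and zarr; it assumes zarr is installed with the classic `zarr.open` signature, and the file name and sizes are illustrative.)

```python
import dask.array as da
import zarr

x = da.ones((10000, 10000), chunks=(1000, 1000))[1:, 1:]  # uneven chunks

# A zarr array with fixed, homogeneous chunks as the storage target.
z = zarr.open("example.zarr", mode="w", shape=x.shape,
              chunks=(1000, 1000), dtype=x.dtype)

# With mismatched chunk boundaries, concurrent tasks may write into the same
# zarr chunk, so a lock is needed.
da.store(x, z, lock=True)

# Rechunking to the target's chunking first aligns the boundaries, so each
# task writes whole zarr chunks and no lock is required.
da.store(x.rechunk(z.chunks), z, lock=False)
```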