
Method to make uniformly chunked Dask Arrays #3302

Open
jakirkham opened this issue Mar 19, 2018 · 6 comments
@jakirkham (Member)

In the process of applying operations to Dask Arrays, chunk sizes can become quite heterogeneous. However, storage applications tend to prefer homogeneous chunk sizes. While it is possible to store a Dask Array to a storage format with different chunk sizes, this typically comes at the penalty of needing locking. Certainly one can rechunk the Dask Array manually. However, it would be nice to have a method on Dask Arrays with some better idea of what sort of rechunking comes at low cost. This method could then determine a reasonable homogeneous chunk size and rechunk the Dask Array to it.
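
For concreteness, a minimal sketch of how chunks drift away from uniformity; the shapes and chunk sizes here are just illustrative:

    import dask.array as da

    # Start from a uniformly chunked array...
    x = da.ones((100, 100), chunks=(30, 30))
    print(x.chunks)  # ((30, 30, 30, 10), (30, 30, 30, 10))

    # ...then an everyday operation like slicing makes the chunks irregular.
    y = x[5:95, 5:95]
    print(y.chunks)  # ((25, 30, 30, 5), (25, 30, 30, 5))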

@mrocklin (Member)

The current state would be to use rechunk and specify a uniform chunk size manually:

    x = x.rechunk((1000, 1000, 10))

What I'm hearing here is a request to determine the (1000, 1000, 10) argument automatically. Do you have thoughts on how to do this? I'm somewhat pessimistic that a nice solution can be found that does not raise more questions than it answers :)

@jakirkham (Member Author)

Lately I've been mindlessly doing something like tuple(max(c) for c in x.chunks). Finding the gcd is another option, but it would likely bias towards much smaller chunks. I have also had the storage format provide a best guess at chunking and then rechunked Dask Arrays to match it. Other ideas welcome.

More generally, past experience with chunking tells me that there are wrong answers, but the differences between "right" answers aren't too noticeable. Though this could be specific to what I have been doing.

Of course there are lots of things that the approaches above do not take into account. Mainly I am looking for a baked-in heuristic that makes things a little easier most of the time.
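
As a sketch of the first two options above, assuming y is a Dask Array with irregular chunks:

    import math
    from functools import reduce

    # Option 1: take the largest chunk along each axis.
    max_chunks = tuple(max(c) for c in y.chunks)

    # Option 2: the gcd of the chunk sizes along each axis,
    # which tends to bias towards much smaller chunks.
    gcd_chunks = tuple(reduce(math.gcd, c) for c in y.chunks)

    z = y.rechunk(max_chunks)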

@mrocklin (Member)

OK, taking hints from the existing chunking seems reasonable to me. Presumably we might look at something like the 75th percentile of chunk size in bytes, some sort of average aspect ratio, and then look for nice round numbers?
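
Setting aside the aspect-ratio part and working per axis on chunk lengths rather than total bytes, a hedged sketch of such a heuristic might look like the following. percentile_chunks is a hypothetical helper, and rounding to multiples of 10 is just a placeholder for "nice round numbers":

    import numpy as np

    def percentile_chunks(x, q=75):
        # Hypothetical helper: target the q-th percentile chunk length
        # per axis, rounded to a multiple of 10 as a stand-in for
        # "nice round numbers".
        target = []
        for dim_chunks in x.chunks:
            size = np.percentile(dim_chunks, q)
            nice = int(round(size, -1)) or 1  # never round down to zero
            target.append(min(nice, sum(dim_chunks)))
        return tuple(target)

    z = y.rechunk(percentile_chunks(y))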

@martindurant (Member)

I would propose that rechunk take a variety of possible inputs, including some typical, pluggable, named rechunking scenarios. I don't think there is any "best" heuristic that we can hope to achieve, since that would depend heavily on the particulars of each case. In particular, "least work" (smallest task graph) would be a hard metric to optimise for.

Additionally, I would add a to_zarr() and other storage methods, which would take a chunking= keyword to pass to rechunk, if required (perhaps defaulting to False, which would simply error if the chunks are not regular, as now). Bag, DataFrame, and xarray all have to_* storage methods. In the case of zarr, we would have to interpret a URL, accept storage_options, and make a mapper instance, as in intake.
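
A hedged sketch of what that keyword might look like, as a thin hypothetical wrapper over the existing dask.array.to_zarr (the chunking= semantics are just a guess at the proposal):

    import dask.array as da

    def to_zarr(x, url, chunking=False, **kwargs):
        # Hypothetical: chunking=False keeps the current behaviour, where
        # dask.array.to_zarr raises on irregular chunks; anything accepted
        # by rechunk is applied before storing.
        if chunking is not False:
            x = x.rechunk(chunking)
        return da.to_zarr(x, url, **kwargs)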

The rechunking methods I would start with (a rough sketch of the first appears after this list):

  • "preserve_number": split each axis into the same number of chunks as currently exist; don't process regular axes at all, even if the last chunk is small. This is simple to understand, and I would recommend it as the default.
  • "loose_memory": given a chunk size in bytes, make chunks with an aspect ratio given by the current number of chunks in each dimension, aiming for no small last-chunks in any dimension, and again skipping dimensions that are already regular.
  • "tight_memory": given a chunk size in bytes, get as close to it as possible by rechunking all dimensions, without worrying about regular dimensions or last-chunks.
  • "regularize": ensure that last-chunks have the same shape as normal chunks, which might be useful for later concatenation. This might mean one-chunk dimensions if the length happens to be prime.
  • It would also be nice to be able to specify some dimension chunks explicitly, or to say which dimensions are not to be considered.
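
A rough sketch of "preserve_number", assuming the regularity test (regular apart from a possibly smaller last chunk) and the rounding match the intent above:

    import math

    def preserve_number(x):
        # Sketch of "preserve_number": keep the same number of chunks
        # along each axis, but make them (nearly) equal in size, skipping
        # axes that are already regular apart from a small last chunk.
        target = []
        for size, dim_chunks in zip(x.shape, x.chunks):
            regular = (len(set(dim_chunks[:-1])) <= 1
                       and dim_chunks[-1] <= dim_chunks[0])
            if regular:
                target.append(dim_chunks[0])  # leave regular axes untouched
            else:
                target.append(math.ceil(size / len(dim_chunks)))
        return x.rechunk(tuple(target))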

@jakirkham (Member Author)

Very thorough! Thanks for giving this some thought, @martindurant. Need to mull over this a bit. Generally agree though.

@martindurant (Member)

Ah, good point in the linked issues: sometimes you would want the existing chunking scheme, if there is one. I was thinking of to_zarr as always overwriting/creating.
