
Method to make uniformly chunked Dask Arrays #3302

Open
jakirkham opened this issue Mar 19, 2018 · 6 comments
@jakirkham (Member)

In the process of applying operations to Dask Arrays, chunk sizes can become quite heterogeneous. However, storage applications tend to prefer homogeneous chunk sizes. While it is possible to store a Dask Array to a storage format with different chunk sizes, this typically comes at the penalty of needing locking. Certainly one can rechunk the Dask Array manually. However, it would be nice to have a method on Dask Arrays with some better idea of what sort of rechunking comes at low cost. This method could then determine a reasonable homogeneous chunk size and rechunk the Dask Array to it.
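
For concreteness, a minimal sketch of how chunks drift away from uniformity; the shapes and chunk sizes here are just illustrative:

    import dask.array as da

    # Start from a uniformly chunked array...
    x = da.ones((100, 100), chunks=(30, 30))
    print(x.chunks)  # ((30, 30, 30, 10), (30, 30, 30, 10))

    # ...then an everyday operation like slicing makes the chunks irregular.
    y = x[5:95, 5:95]
    print(y.chunks)  # ((25, 30, 30, 5), (25, 30, 30, 5))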

@mrocklin (Member)

The current state would be to use rechunk and specify a uniform chunk size manually:

    x = x.rechunk((1000, 1000, 10))

What I'm hearing here is a request to determine the (1000, 1000, 10) argument automatically. Do you have thoughts on how to do this? I'm somewhat pessimistic that a nice solution can be found that does not raise more questions than it answers :)

@jakirkham (Member Author)

Lately I've been mindlessly doing something like tuple(max(c) for c in x.chunks). Finding the gcd is another option, but it would likely bias towards much smaller chunks. I have also had the storage format provide a best guess at chunking and then rechunked Dask Arrays to match it. Other ideas welcome.

More generally, past experience with chunking tells me that there are wrong answers, but the differences between "right" answers aren't too noticeable. Though this could be specific to what I have been doing.

Of course there are lots of things that the approaches above do not take into account. Mainly I am looking for a baked-in heuristic that makes things a little easier most of the time.
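
As a sketch of the first two options above, assuming y is a Dask Array with irregular chunks:

    import math
    from functools import reduce

    # Option 1: take the largest chunk along each axis.
    max_chunks = tuple(max(c) for c in y.chunks)

    # Option 2: the gcd of the chunk sizes along each axis,
    # which tends to bias towards much smaller chunks.
    gcd_chunks = tuple(reduce(math.gcd, c) for c in y.chunks)

    z = y.rechunk(max_chunks)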

@mrocklin (Member)

OK, taking hints from the existing chunking seems reasonable to me. Presumably we might look at something like the 75th percentile of chunk size in bytes, some sort of average aspect ratio, and then look for nice round numbers?
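
Setting aside the aspect-ratio part and working per axis on chunk lengths rather than total bytes, a hedged sketch of such a heuristic might look like the following. percentile_chunks is a hypothetical helper, and rounding to multiples of 10 is just a placeholder for "nice round numbers":

    import numpy as np

    def percentile_chunks(x, q=75):
        # Hypothetical helper: target the q-th percentile chunk length
        # per axis, rounded to a multiple of 10 as a stand-in for
        # "nice round numbers".
        target = []
        for dim_chunks in x.chunks:
            size = np.percentile(dim_chunks, q)
            nice = int(round(size, -1)) or 1  # never round down to zero
            target.append(min(nice, sum(dim_chunks)))
        return tuple(target)

    z = y.rechunk(percentile_chunks(y))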

@martindurant (Member)

I would propose that rechunk take a variety of possible inputs, including some typical, pluggable, named rechunking scenarios. I don't think there is any "best" heuristic that we can hope to achieve, since that would depend heavily on the particulars of each case. In particular, "least work" (smallest task graph) would be a hard metric to optimise for.

Additionally, I would add a to_zarr() and other storage methods, which would take a chunking= keyword to pass to rechunk, if required (perhaps defaulting to False, which would simply error if the chunks are not regular, as now). Bag, DataFrame, and xarray all have to_* storage methods. In the case of zarr, we would have to interpret a URL, accept storage_options, and make a mapper instance, as in intake.
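
A hedged sketch of what that keyword might look like, as a thin hypothetical wrapper over the existing dask.array.to_zarr (the chunking= semantics are just a guess at the proposal):

    import dask.array as da

    def to_zarr(x, url, chunking=False, **kwargs):
        # Hypothetical: chunking=False keeps the current behaviour, where
        # dask.array.to_zarr raises on irregular chunks; anything accepted
        # by rechunk is applied before storing.
        if chunking is not False:
            x = x.rechunk(chunking)
        return da.to_zarr(x, url, **kwargs)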

The rechunking methods I would start with (a rough sketch of the first appears after this list):

  • "preserve_number": split each axis into the same number of chunks as currently exist; don't process regular axes at all, even if the last chunk is small. This is simple to understand, and I would recommend it as the default.
  • "loose_memory": given a chunk size in bytes, make chunks with an aspect ratio given by the current number of chunks in each dimension, aiming for no small last-chunks in any dimension, and again skipping dimensions that are already regular.
  • "tight_memory": given a chunk size in bytes, get as close to it as possible by rechunking all dimensions, without worrying about regular dimensions or last-chunks.
  • "regularize": ensure that last-chunks have the same shape as normal chunks, which might be useful for later concatenation. This might mean one-chunk dimensions if the length happens to be prime.
  • It would also be nice to be able to specify some dimension chunks explicitly, or to say which dimensions are not to be considered.
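
A rough sketch of "preserve_number", assuming the regularity test (regular apart from a possibly smaller last chunk) and the rounding match the intent above:

    import math

    def preserve_number(x):
        # Sketch of "preserve_number": keep the same number of chunks
        # along each axis, but make them (nearly) equal in size, skipping
        # axes that are already regular apart from a small last chunk.
        target = []
        for size, dim_chunks in zip(x.shape, x.chunks):
            regular = (len(set(dim_chunks[:-1])) <= 1
                       and dim_chunks[-1] <= dim_chunks[0])
            if regular:
                target.append(dim_chunks[0])  # leave regular axes untouched
            else:
                target.append(math.ceil(size / len(dim_chunks)))
        return x.rechunk(tuple(target))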

@jakirkham (Member Author)

Very thorough! Thanks for giving this some thought, @martindurant. Need to mull over this a bit. Generally agree though.

@martindurant (Member)

Ah, good point in the linked issues: sometimes you would want the existing chunking scheme, if there is one. I was thinking of to_zarr as always overwriting/creating.
