No chunk warning if empty #6402
If an array has zero size (due to an empty dimension), it is saved as a single chunk regardless of Dask chunking on other dimensions (pydata#5742). If the `chunks` parameter is provided for other dimensions when loading the Zarr file, xarray gives a warning about potentially degraded performance from splitting the single chunk.

When the array has zero size, this warning seems inappropriate because:

- performance degradation on an empty array should be negligible.
- we don't always know if one of the dimensions is empty until loading.

I would use the `chunks` parameter for dimensions that have known chunksize (to specify some multiple of that chunksize), but this only works without warning when the array is nonempty.
This looks very reasonable, @jaicher, thanks a lot. I don't know this area well at all, so I'll leave open so someone who knows it better can check, or for a couple of days. Hope that's OK.

@stanwest Can you suggest how to fix this merge conflict please?
Sure. I recommend the following before the

```python
# Warn where requested chunks break preferred chunks, provided that the variable
# contains data.
if var.size:
    for dim, size, chunk_sizes in zip(dims, shape, chunk_shape):
        try:
            preferred_chunk_sizes = preferred_chunks[dim]
        except KeyError:
            continue
        # Determine the stop indices of the preferred chunks, but omit the last stop
        # (equal to the dim size). In particular, assume that when a sequence
        # expresses the preferred chunks, the sequence sums to the size.
        preferred_stops = (
            range(preferred_chunk_sizes, size, preferred_chunk_sizes)
            if isinstance(preferred_chunk_sizes, Number)
            else itertools.accumulate(preferred_chunk_sizes[:-1])
        )
        # Gather any stop indices of the specified chunks that are not a stop index
        # of a preferred chunk. Again, omit the last stop, assuming that it equals
        # the dim size.
        breaks = set(itertools.accumulate(chunk_sizes[:-1])).difference(
            preferred_stops
        )
        if breaks:
            warnings.warn(
                "The specified Dask chunks separate the stored chunks along "
                f'dimension "{dim}" starting at index {min(breaks)}. This could '
                "degrade performance. Instead, consider rechunking after loading."
            )
```
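For anyone who wants to try the break-detection logic outside xarray, here is a minimal standalone sketch for a single dimension. The function name `chunk_breaks` and its parameters (`dim_size`, `preferred`, `requested`, `array_size`) are made up for illustration and are not part of xarray. The `array_size` argument stands in for `var.size` in the snippet above.

```python
import itertools
from numbers import Number


def chunk_breaks(dim_size, preferred, requested, array_size):
    """Return stop indices of requested chunks that split a preferred chunk.

    Hypothetical helper, not an xarray API. `preferred` is either a single
    chunk length or a tuple of chunk lengths that sums to `dim_size`;
    `requested` is a tuple of chunk lengths that sums to `dim_size`.
    """
    # An array with no elements cannot suffer from split chunks, so report
    # no breaks (this mirrors the `if var.size:` guard).
    if not array_size:
        return set()
    # Stop indices of the preferred chunks, omitting the final stop.
    preferred_stops = (
        range(preferred, dim_size, preferred)
        if isinstance(preferred, Number)
        else itertools.accumulate(preferred[:-1])
    )
    # Requested stops (again without the final one) that fall inside a
    # preferred chunk rather than on its boundary.
    return set(itertools.accumulate(requested[:-1])).difference(preferred_stops)


# Requested chunks that are a multiple of the stored chunk length: no breaks.
aligned = chunk_breaks(12, 3, (6, 6), array_size=12)
# Requested chunks that cut through stored chunks: breaks are reported.
misaligned = chunk_breaks(10, 5, (3, 3, 4), array_size=10)
# Same misaligned request, but the array is empty: nothing to warn about.
empty = chunk_breaks(10, 5, (3, 3, 4), array_size=0)
```

With these inputs, `aligned` and `empty` are empty sets, while `misaligned` contains the offending stop indices 3 and 6, the smallest of which is what the warning message would report.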
@stanwest how's that? (feel free to correct if I missed something, forgive the delay on merging) Thanks!

That looks great to me.

Thanks all!
chunks
opening dataset with empty dimension #6401
whats-new.rst