jaicher
Contributor

@jaicher jaicher commented Mar 23, 2022

jaicher added 3 commits March 23, 2022 01:16
If an array has zero size (due to an empty dimension), it is saved as a
single chunk regardless of Dask chunking on other dimensions (pydata#5742).
If the `chunks` parameter is provided for other dimensions when loading
the Zarr file, xarray gives a warning about potentially degraded
performance from splitting the single chunk.

When the array has zero size, this warning seems inappropriate because:

- Performance degradation on an empty array should be negligible.
- We don't always know whether one of the dimensions is empty until loading.
  I would use the `chunks` parameter for dimensions that have a known
  chunk size (to specify some multiple of that chunk size), but this only
  works without a warning when the array is nonempty.
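
The reasoning above can be sketched without xarray, Zarr, or Dask. This is a simplified stand-in, not the change made in this pull request: `maybe_warn_about_split` and its parameters are hypothetical names, it handles only integer chunk sizes, and the stored chunk sizes used for the empty array in the usage lines are an assumption made for illustration.

```python
import math
import warnings


def maybe_warn_about_split(shape, stored_chunks, requested_chunks):
    """Warn if requested chunks split stored chunks, unless the array is empty.

    Hypothetical helper for illustration only. Each argument is a tuple of
    ints with one entry per dimension. Returns True if a warning was issued.
    """
    # Any zero-length dimension makes the total size zero, so there is no
    # data whose access pattern could be degraded.
    if math.prod(shape) == 0:
        return False
    for stored, requested in zip(stored_chunks, requested_chunks):
        # Simplification: a requested size that is not a multiple of the
        # stored chunk size is treated as cutting through stored chunks.
        if requested % stored:
            warnings.warn("requested chunks split stored chunks")
            return True
    return False


# Empty first dimension: no warning, whatever is requested elsewhere.
maybe_warn_about_split((0, 100), (1, 100), (1, 10))
# Nonempty array, requested size is a multiple of the stored size: no warning.
maybe_warn_about_split((5, 100), (5, 50), (5, 100))
```

Guarding on the total size rather than on individual dimensions keeps the check independent of which dimension happens to be empty.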
@max-sixty
Collaborator

This looks very reasonable, @jaicher, thanks a lot.

I don't know this area well at all, so I'll leave this open for a couple of days in case someone who knows it better wants to check. Hope that's OK.

@max-sixty max-sixty added the plan to merge Final call for comments label Mar 23, 2022
@dcherian
Contributor

dcherian commented Apr 8, 2022

@stanwest Can you suggest how to fix this merge conflict please?

@stanwest
Contributor

stanwest commented Apr 8, 2022

> @stanwest Can you suggest how to fix this merge conflict please?

Sure. I recommend the following before the return statement in xarray.core.dataset._get_chunk:

# Warn where requested chunks break preferred chunks, provided that the variable
# contains data.
if var.size:
    for dim, size, chunk_sizes in zip(dims, shape, chunk_shape):
        try:
            preferred_chunk_sizes = preferred_chunks[dim]
        except KeyError:
            continue
        # Determine the stop indices of the preferred chunks, but omit the last stop
        # (equal to the dim size).  In particular, assume that when a sequence
        # expresses the preferred chunks, the sequence sums to the size.
        preferred_stops = (
            range(preferred_chunk_sizes, size, preferred_chunk_sizes)
            if isinstance(preferred_chunk_sizes, Number)
            else itertools.accumulate(preferred_chunk_sizes[:-1])
        )
        # Gather any stop indices of the specified chunks that are not a stop index
        # of a preferred chunk.  Again, omit the last stop, assuming that it equals
        # the dim size.
        breaks = set(itertools.accumulate(chunk_sizes[:-1])).difference(
            preferred_stops
        )
        if breaks:
            warnings.warn(
                "The specified Dask chunks separate the stored chunks along "
                f'dimension "{dim}" starting at index {min(breaks)}. This could '
                "degrade performance. Instead, consider rechunking after loading."
            )
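
To see what the stop-index comparison reports, the core of the snippet above can be pulled into a standalone function. This is only a sketch for experimentation: `find_breaks` is a hypothetical name and is not part of xarray, and it takes the sizes for a single dimension instead of looping over a variable's dimensions.

```python
import itertools
from numbers import Number


def find_breaks(size, preferred_chunk_sizes, chunk_sizes):
    """Return sorted stop indices of requested chunks that split preferred chunks.

    Hypothetical helper mirroring the logic above for one dimension.
    `preferred_chunk_sizes` is an int or a sequence summing to `size`;
    `chunk_sizes` is a sequence summing to `size`.
    """
    # Stop indices of the preferred chunks, omitting the last (the dim size).
    preferred_stops = (
        range(preferred_chunk_sizes, size, preferred_chunk_sizes)
        if isinstance(preferred_chunk_sizes, Number)
        else itertools.accumulate(preferred_chunk_sizes[:-1])
    )
    # Requested stops that do not coincide with any preferred stop.
    breaks = set(itertools.accumulate(chunk_sizes[:-1])).difference(preferred_stops)
    return sorted(breaks)


print(find_breaks(10, 4, (4, 4, 2)))          # [] (same chunking)
print(find_breaks(10, 4, (8, 2)))             # [] (merges whole preferred chunks)
print(find_breaks(10, 4, (5, 5)))             # [5] (cuts the chunk covering 4..8)
print(find_breaks(10, (3, 3, 4), (2, 4, 4)))  # [2] (cuts the first chunk)
```

Requested chunks that are unions of whole preferred chunks produce no breaks, which is why specifying a multiple of the stored chunk size passes silently.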

@max-sixty
Collaborator

@stanwest how's that? (feel free to correct if I missed something, forgive the delay on merging)

Thanks!

@stanwest
Contributor

stanwest commented Apr 9, 2022

> @stanwest how's that?

That looks great to me.

@dcherian dcherian merged commit 851dade into pydata:main Apr 9, 2022
@dcherian
Contributor

dcherian commented Apr 9, 2022

Thanks all!

Successfully merging this pull request may close these issues.

Unnecessary warning when specifying chunks opening dataset with empty dimension
