Zarr chunking fixes #5065
Confused about the test error. It seems unrelated. In
Related to #5059, and it appears that @keewis came up with a fix for it in #5059 (comment)
0a0b29d
to
bbd683d
Compare
Thanks Anderson. Fixed by rebasing. Now the RTD build is failing, but there is no obvious error in the logs...
Thanks @rabernat I only have some docstring suggestions
xarray/backends/zarr.py
Outdated
)
if safe_chunks:
    raise ValueError(
Really minor comment: shouldn't this still be NotImplementedError, since we could technically support this by implementing locks?
It used to be, but I changed it! Do we ever plan to implement locks?
xarray/core/variable.py
Outdated
@@ -1091,6 +1092,10 @@ def chunk(self, chunks={}, name=None, lock=False):

data = da.from_array(data, chunks, name=name, lock=lock, **kwargs)

# rechunking erases encoding
if self._encoding and "chunks" in self._encoding:
    del self._encoding["chunks"]
We should mention this in the docstrings for DataArray.chunk, Dataset.chunk, and Variable.chunk.
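To make the behaviour under review concrete, here is a minimal sketch of "rechunking erases chunk encoding". `ToyVariable` is a hypothetical stand-in, not xarray's `Variable`, and it copies the encoding rather than mutating it in place as the diff does; that is an assumption made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVariable:
    """Hypothetical stand-in for xarray's Variable; not the real class."""

    data: list
    encoding: dict = field(default_factory=dict)

    def chunk(self, chunks):
        # A real implementation would wrap ``data`` in a dask array here;
        # ``chunks`` is accepted only to mirror the method's shape.
        # Rechunking invalidates any stored chunk encoding, so copy the
        # encoding without the "chunks" key instead of mutating ``self``.
        new_encoding = {k: v for k, v in self.encoding.items() if k != "chunks"}
        return ToyVariable(self.data, new_encoding)


var = ToyVariable([1, 2, 3, 4], {"chunks": (2,), "dtype": "int16"})
rechunked = var.chunk({"x": 4})
print(rechunked.encoding)  # the "chunks" key is gone, "dtype" is kept
```

Only the "chunks" key is dropped in this sketch; the original object keeps its encoding untouched.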
I'm a little conflicted about dealing with
Maybe this isn't such a big deal in this particular case, especially if we don't think we would need to add such encoding specific logic to any other methods. But are we really sure about that -- what about cases like indexing? I guess the other alternative to make
I see your point. I guess I don't fully understand where else in the code path encoding gets dropped. Consider this example:
import xarray as xr
ds = xr.Dataset({'foo': ('time', [1, 1], {'dtype': 'int16'})})
ds = xr.decode_cf(ds).compute()
assert "dtype" in ds.foo.encoding
assert "dtype" not in (0.5 * ds.foo).encoding
Xarray knows to drop the
To be honest, the existing convention is quite ad hoc, just based on what seemed most appropriate at the time. #1614 is the most comprehensive description of the current state of things. We were considering saying that
There's a subtle difference. It drops all of @shoyer's point about indexing changing chunking is a good one too. Perhaps a kwarg in
I would argue that this is unnecessary. If you want to explicitly drop encoding, just The problem here is with the default behavior of propagating chunk encoding through computations when it no longer makes sense. My example with the FWIW, I would also favor dropping
We already drop all of
Perhaps we could also remove
I appreciate the discussion on this PR. Does anyone have a concrete suggestion of what to do? If we are not in agreement about the encoding stuff, perhaps I should remove that and just move forward with the
In today's dev call, we proposed to handle encoding in The problem is, I can't figure out where this happens. Can someone point me to the place in the code where indexing operations delete encoding? A related question: I discovered this encoding option Line 396 in 57a4479
Should the Zarr backend be setting this?
Yes, they are already defined in zarr: preferred_chunks=chunks. We decided to separate the
They are not necessarily the same.
Replace xarray/xarray/core/variable.py Line 1084 in ddc352f
Thanks! Yeah, that's what I had in mind. But I was wondering if there was an example of doing that elsewhere that I could copy. In any case, I'll give it a try now.
I just pushed a new commit which deletes all encoding inside
Why is
Yes
So any ideas how to proceed? 🧐
Hmm. I would also be happy with explicitly deleting In the long term, the whole handling of encoding should be revisited, e.g., see #5082
xarray/core/variable.py
Outdated
new_encoding = None  # rechunking removes all encoding
return type(self)(self.dims, data, self._attrs, new_encoding, fastpath=True)
A simpler way to achieve the same thing is just to omit the argument:
- new_encoding = None  # rechunking removes all encoding
- return type(self)(self.dims, data, self._attrs, new_encoding, fastpath=True)
+ return type(self)(self.dims, data, self._attrs, fastpath=True)
This happens specifically on this line: Line 438 in ddc352f
So perhaps it would make sense to copy
new_var = var.chunk(chunks, name=name2, lock=lock)
new_var.encoding = var.encoding
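The two ideas in this part of the thread (a `chunk` that returns a variable with no encoding, and a caller that copies the old encoding back when it wants to keep it) can be sketched together. Everything here is hypothetical: `ToyVariable` and `chunk_keeping_encoding` are illustration-only names, not xarray's classes or functions.

```python
class ToyVariable:
    """Hypothetical stand-in for xarray's Variable; not the real class."""

    def __init__(self, dims, data, attrs=None, encoding=None):
        self.dims = dims
        self.data = data
        self.attrs = dict(attrs or {})
        self.encoding = dict(encoding or {})

    def chunk(self, chunks):
        # A real implementation would build a dask array from ``chunks``.
        # Omitting the encoding argument gives the result an empty encoding.
        return type(self)(self.dims, self.data, self.attrs)


def chunk_keeping_encoding(var, chunks):
    """Hypothetical caller that wants the old encoding to survive."""
    new_var = var.chunk(chunks)
    new_var.encoding = dict(var.encoding)  # copy, so the two stay independent
    return new_var


var = ToyVariable(("time",), [1, 1], encoding={"dtype": "int16"})
print(var.chunk({"time": 1}).encoding)  # empty dict
print(chunk_keeping_encoding(var, {"time": 1}).encoding)  # dtype preserved
```

The design point is that dropping encoding becomes the default, while preserving it is an explicit decision made by the caller.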
I have removed the controversial If there are no further comments on this, I think this is good to go.
Any further feedback on this now reduced-scope PR? Merging this would be helpful for moving Pangeo Forge forward.
A few documentation issues, but otherwise looks good to me. I don't know a lot about chunking and zarr, though.
Thanks @rabernat
Co-authored-by: keewis <keewis@users.noreply.github.com>
2acab90
to
626fa06
Compare
The pre-commit workflow is raising a blackdoc error I am not seeing in my local env:
diff --git a/doc/internals/duck-arrays-integration.rst b/doc/internals/duck-arrays-integration.rst
index eb5c4d8..2bc3c1f 100644
--- a/doc/internals/duck-arrays-integration.rst
+++ b/doc/internals/duck-arrays-integration.rst
@@ -25,7 +25,7 @@ argument:
...
def _repr_inline_(self, max_width):
- """ format to a single line with at most max_width characters """
+ """format to a single line with at most max_width characters"""
...
the reason is that
I think this PR has received a very thorough review. I would be pleased if someone from @pydata/xarray would merge it soon.
Thanks @rabernat
Closes zarr and xarray chunking compatibility and to_zarr performance #2300, closes Allow "unsafe" mode for zarr writing #5056
Passes pre-commit run --all-files
User visible changes are documented in whats-new.rst
This PR contains two small, related updates to how Zarr chunks are handled.
1. Drop the encoding attribute at the Variable level whenever chunk is called. The persistence of chunk encoding has been the source of lots of confusion (see zarr and xarray chunking compatibility and to_zarr performance #2300, automatic chunking of zarr archive #4046, Error when rechunking from Zarr store #4380, Writing to zarr fails with message "specified zarr chunks would overlap multiple dask chunks" xcube-dev/xcube#347).
2. Add a safe_chunks option in to_zarr which allows for bypassing the requirement of the many-to-one relationship between Zarr chunks and Dask chunks (see Allow "unsafe" mode for zarr writing #5056).
Both of these touch the internal logic for how chunks are handled, so I thought it was easiest to tackle them with a single PR.
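The many-to-one requirement described above can be illustrated along a single axis: every Zarr chunk should be written by exactly one Dask chunk, which holds when each interior Dask chunk boundary falls on a multiple of the Zarr chunk size. The helper below, `dask_chunks_are_safe`, is a hypothetical sketch of that rule and is not xarray's actual check.

```python
from itertools import accumulate


def dask_chunks_are_safe(dask_chunks, zarr_chunk):
    """Return True if no Zarr chunk is split across two Dask chunks.

    ``dask_chunks`` is a tuple of Dask chunk lengths along one axis and
    ``zarr_chunk`` is the Zarr chunk length along the same axis.
    """
    # Interior boundaries between Dask chunks (the end of the array excluded).
    boundaries = list(accumulate(dask_chunks))[:-1]
    # Each boundary must coincide with a Zarr chunk edge.
    return all(b % zarr_chunk == 0 for b in boundaries)


print(dask_chunks_are_safe((4, 4, 2), 2))  # True: boundaries at 4 and 8
print(dask_chunks_are_safe((3, 3, 4), 2))  # False: boundary at 3 splits a Zarr chunk
```

When the check fails, two Dask tasks could write to the same Zarr chunk concurrently; passing safe_chunks=False to to_zarr skips this kind of validation and leaves avoiding such overlapping writes to the user.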