fix zarr datetime64 chunks #8253
Conversation
If this is the right way to go, we also need to implement this check when adding chunks to encoding:

xarray/xarray/backends/zarr.py, lines 129 to 143 in d6c3767
Ready for review!
I agree, so sorry @malmans2 that we didn't review this important PR sooner.
This seems to fix the problem. But I wonder... would it not be better fixed in `encode_cf_variable` or `CFDatetimeCoder`?
I would rather make the coders handle chunks consistently across all dtypes instead of "fixing" this problem at the Zarr layer.
xarray/backends/zarr.py (Outdated)

@@ -106,6 +106,24 @@ def __getitem__(self, key):
    # could possibly have a work-around for 0d data here


def _squeeze_var_chunks(var_chunks, name=None):
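For context, here is a pure-Python sketch of what a chunk-squeezing helper like this presumably needs to do (the logic below is my reconstruction, not the PR's actual code): dask reports chunks as a tuple of per-axis chunk-size tuples, while Zarr wants a single uniform chunk size per axis, with only a smaller trailing chunk allowed.

```python
def squeeze_var_chunks(var_chunks, name=None):
    """Collapse dask-style chunks, e.g. ((5, 5, 2),), into the single
    uniform chunk size per axis that Zarr requires, e.g. (5,).

    Hypothetical sketch: raises if the chunking cannot be represented
    as one regular Zarr chunk grid.
    """
    zarr_chunks = []
    for axis_chunks in var_chunks:
        first, last = axis_chunks[0], axis_chunks[-1]
        # All chunks but the last must share one size; the final chunk
        # may be smaller (a dask "ragged edge"), which Zarr tolerates.
        if any(size != first for size in axis_chunks[:-1]) or last > first:
            raise ValueError(
                f"variable {name!r} has chunks {var_chunks}, which are "
                "incompatible with Zarr's uniform chunk grid; rechunk first"
            )
        zarr_chunks.append(first)
    return tuple(zarr_chunks)
```

Under that contract, `((5, 5, 2),)` squeezes to `(5,)`, while irregular chunks such as `((5, 3, 5),)` are rejected.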
Can we get a quick comment to explain what this function does?
@@ -317,6 +323,8 @@ def encode_zarr_variable(var, needs_copy=True, name=None):
        var = coder.encode(var, name=name)
        var = coding.strings.ensure_fixed_length_bytes(var)

        if original_chunks and not var.chunks and "chunks" not in var.encoding:
            var.encoding["chunks"] = _squeeze_var_chunks(original_chunks, name=name)
I worry that fixing the issue this way reveals that our internal interfaces are leaky. It seems like a bandaid for a deeper problem.
Why does `encode_cf_variable` work for some dask-based variables but not for certain datetimes? Why does `CFDatetimeCoder` behave this way? Is it possible that the encoder is eagerly computing the dask array by mistake?
Indeed. This PR is a workaround. I went for the easy solution as I don't know the answers to these questions. But I'm happy to invest some time in a better fix if we think that's the right direction.
Though this also affects normal non-CF datetimes. Is the CF encoder mangling those? Or is this not an issue with the CF encoder at all? (Forgive me for adding more questions than answers...)
@@ -307,6 +318,7 @@ def encode_zarr_variable(var, needs_copy=True, name=None):
    out : Variable
        A variable which has been encoded as described above.
    """
    original_chunks = var.chunks
This is kind of the crux. I cannot actually understand how / where the `.chunks` attribute is defined for Variables. We want the decoding pipeline to preserve chunks unmodified. `.chunks` is not defined in the `Variable` class itself, so it must somehow be inherited from the other classes:

xarray/xarray/core/variable.py, line 315 in 15328b6:

class Variable(NamedArray, AbstractArray, VariableArithmetic):
The word `chunks` does not even appear in https://github.com/pydata/xarray/blob/main/xarray/coding/times.py.

I'm tempted to loop @TomNicholas into this conversation; he recently refactored everything about how we handle chunked arrays, and could help us sort through this.
`chunks` is inherited from `NamedArray`:

xarray/xarray/namedarray/core.py, line 660 in e5d163a:

def chunks(self) -> _Chunks | None:
Yes, the problem is that the encoder does not distinguish between dask and numpy arrays and always returns numpy arrays. I originally thought that was a mistake, but I was less sure after I tried to change it.
DateTimes are always cast to numpy using `np.asarray`, AFAICT. And this has been there since that part of the coding was first implemented.
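To illustrate the suspicion with a toy sketch (these are not xarray's actual classes): an unconditional `np.asarray`-style coercion forces a lazy array to compute and hands back plain in-memory data, so any chunk metadata is gone before the Zarr layer ever sees it.

```python
class LazyArray:
    """Toy stand-in for a chunked dask array: a thunk plus chunk metadata."""

    def __init__(self, compute, chunks):
        self._compute = compute
        self.chunks = chunks  # dask-style: tuple of per-axis chunk tuples
        self.computed = False

    def materialize(self):
        # Roughly what np.asarray() would trigger via the __array__ protocol.
        self.computed = True
        return self._compute()


def naive_encode(arr):
    # Mimics a coder that coerces to in-memory data unconditionally,
    # which is what the datetime path is suspected of doing.
    return arr.materialize() if isinstance(arr, LazyArray) else arr


lazy = LazyArray(lambda: [0, 1, 2], chunks=((2, 1),))
encoded = naive_encode(lazy)
# lazy.computed is now True, and `encoded` no longer carries .chunks
```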
So are you saying that all datetime arrays are eagerly computed by the coding pipelines, even if they are Dask arrays?
Here is a PR just to show the naive test I've done: https://github.com/pydata/xarray/actions/runs/6823091519/job/18556346220?pr=8439

For example, see this test: https://github.com/pydata/xarray/actions/runs/6823091519/job/18556346220?pr=8439#step:9:1385

Even if we force the encoder to retain the original chunks (i.e., cast to dask array), various tests break. That makes me think that either the encoder is doing the right thing (casting to numpy arrays), or we need a pretty involved refactor to fix this. It's just a guess though!
One thing I don't understand here (and maybe no one does yet 😄) is why a bug in our CF encoding seems to leak into how standard datetime chunks work. Because CF times are our own more specialized implementation, bugs there are arguably less concerning. But when standard datatypes don't work, that has a broader impact. If we need CF expertise to fix them, we can fall into a space where most of us lack sufficient context to fix a broad-based bug.

I would love us not to drop this PR! (Easy to say, harder to push through!) If there's a way to cleave off CF datetimes from the standard datetime format, I would be quite keen on that. For my own work, I'm coercing to ints for now, which is a bit of a shame.
If this is indeed the root cause, #7132 (comment), then we could fix it for numpy easily by just looking at dtype.
@max-sixty I have been following the conversation here and may post some more thoughts eventually. @rabernat has hit on something important, but for now I'll point out that CF encoding pertains to both standard datetimes and cftime datetimes (it refers to the broader topic of how you convert time-like values to numerical values, since most file formats do not support serializing datetime values directly).

As @kmuehlbauer notes, the basic issue is that neither code path is dask-compatible. In principle, both encoding code paths (via pandas for standard `np.datetime64` times, and via cftime for cftime times) could be made dask-compatible.
Great, that makes sense. Thanks @spencerkclark!
To give a little bit of background here: the reason why we don't write datetime64 in a dask-compatible way is that Xarray inspects the data to figure out optimal CF time units, e.g., so we can output nice time units like "days" or "hours". This is sometimes convenient, but I don't think it's necessary. It would probably be fine to always save datetime64 as "seconds since 1900-01-01T00:00:00" for dask arrays.
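A stdlib-only sketch of why prescribed units sidestep the inspection (xarray's real encoder works on numpy/pandas arrays; this only shows the shape of the computation): with the reference date fixed up front, each timestamp encodes independently, so nothing needs to scan, and therefore compute, the whole array.

```python
from datetime import datetime

EPOCH = datetime(1900, 1, 1)

def encode_seconds_since_1900(times):
    # With fixed units ("seconds since 1900-01-01T00:00:00"), each
    # offset depends only on its own value, so this map could run
    # lazily, chunk by chunk, over a dask array.
    return [int((t - EPOCH).total_seconds()) for t in times]

encode_seconds_since_1900([datetime(1900, 1, 1), datetime(1900, 1, 2)])
# → [0, 86400]
```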
+1. There are a few other bits we may want to be careful about when the encoding units are prescribed, but in the case that they are not, I might lobby for a default of
Another option would be to only perform this check if the variable encoding is not already set. If the user has already specified |
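That guard might look something like the following (a hypothetical sketch, with a plain dict standing in for `var.encoding`):

```python
def maybe_set_chunks(encoding, original_chunks):
    # Fall back to the pre-encoding dask chunks only when the user has
    # not already put an explicit "chunks" entry in the encoding.
    if original_chunks is not None and "chunks" not in encoding:
        encoding = dict(encoding, chunks=original_chunks)
    return encoding

maybe_set_chunks({}, ((5, 5),))                  # → {'chunks': ((5, 5),)}
maybe_set_chunks({"chunks": (10,)}, ((5, 5),))   # user's setting wins
```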
I finally had some time to think about this again. See #8575 for what I think should be a solid start at addressing this on the