fix zarr datetime64 chunks #8253

Closed
wants to merge 18 commits

Conversation

@malmans2 (Contributor) commented Sep 28, 2023

@github-actions bot added the topic-backends, topic-zarr (Related to zarr storage library), and io labels Sep 28, 2023
@malmans2 (Contributor, Author) commented Sep 28, 2023

If this is the right way to go, we also need to implement this check when adding chunks to encoding:

```python
if any(len(set(chunks[:-1])) > 1 for chunks in var_chunks):
    raise ValueError(
        "Zarr requires uniform chunk sizes except for final chunk. "
        f"Variable named {name!r} has incompatible dask chunks: {var_chunks!r}. "
        "Consider rechunking using `chunk()`."
    )
if any((chunks[0] < chunks[-1]) for chunks in var_chunks):
    raise ValueError(
        "Final chunk of Zarr array must be the same size or smaller "
        f"than the first. Variable named {name!r} has incompatible Dask chunks {var_chunks!r}. "
        "Consider either rechunking using `chunk()` or instead deleting "
        "or modifying `encoding['chunks']`."
    )
# return the first chunk for each dimension
return tuple(chunk[0] for chunk in var_chunks)
```
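
For illustration, here is how that check behaves on a couple of chunk layouts (a sketch; `_squeeze_var_chunks` is the helper name this PR introduces):

```python
# Sketch of the check above (function name from this PR).
var_chunks = ((5, 5, 3), (10,))  # uniform per dim; a smaller final chunk is allowed
_squeeze_var_chunks(var_chunks, name="time")  # -> (5, 10)

var_chunks = ((5, 3, 5),)  # non-uniform interior chunks
_squeeze_var_chunks(var_chunks, name="time")  # raises ValueError
```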

@malmans2 (Contributor, Author)

Ready for review!

@max-sixty (Collaborator)

Sorry no one got to this; that's poor form on our part. Thanks a lot for the PR, @malmans2.

I don't know this code that well; can anyone else take a look, @pydata/xarray?

It likely fixes #8432!

@rabernat (Contributor) left a review comment

I agree, so sorry @malmans2 that we didn't review this important PR sooner.

This seems to fix the problem. But I wonder... would it not be better fixed in `encode_cf_variable` or `CFDatetimeCoder`?

I would rather make the coders handle chunks consistently across all dtypes instead of "fixing" this problem at the Zarr layer.

Review comment on `xarray/backends/zarr.py`:

```diff
@@ -106,6 +106,24 @@ def __getitem__(self, key):
         # could possibly have a work-around for 0d data here


+def _squeeze_var_chunks(var_chunks, name=None):
```
@rabernat (Contributor):

Can we get a quick comment to explain what this function does?

Review comment on `xarray/backends/zarr.py`:

```diff
@@ -317,6 +323,8 @@ def encode_zarr_variable(var, needs_copy=True, name=None):
         var = coder.encode(var, name=name)
     var = coding.strings.ensure_fixed_length_bytes(var)

+    if original_chunks and not var.chunks and "chunks" not in var.encoding:
+        var.encoding["chunks"] = _squeeze_var_chunks(original_chunks, name=name)
```
@rabernat (Contributor):

I worry that fixing the issue this way reveals that our internal interfaces are leaky. It seems like a bandaid for a deeper problem.

Why does `encode_cf_variable` work for some dask-backed variables but not for certain datetimes? Why does `CFDatetimeCoder` behave this way? Is it possible that the encoder is eagerly computing the dask array by mistake?

@malmans2 (Contributor, Author)

> This seems to fix the problem. But I wonder... would it not be better fixed in `encode_cf_variable` or `CFDatetimeCoder`?
>
> I would rather make the coders handle chunks consistently across all dtypes instead of "fixing" this problem at the Zarr layer.

Indeed. This PR is a workaround.
As mentioned here, the very first thing I tried was to restore the dask chunks right after `encode_cf_variable` is called by the zarr backend (i.e., using `.chunk` rather than editing `.encoding`). However, that breaks some tests.

I went for the easy solution because I don't know the answers to these questions:

> Why does `encode_cf_variable` work for some dask-backed variables but not for certain datetimes? Why does `CFDatetimeCoder` behave this way? Is it possible that the encoder is eagerly computing the dask array by mistake?

But I'm happy to invest some time in a better fix if we think `encode_cf_variable` or `CFDatetimeCoder` are not behaving properly.

@max-sixty (Collaborator)

Though this also affects normal non-CF datetimes: is the CF encoder mangling those? Or is this not an issue with the CF encoder?

(Forgive me adding more questions than answers...)

Review comment on `xarray/backends/zarr.py`:

```diff
@@ -307,6 +318,7 @@ def encode_zarr_variable(var, needs_copy=True, name=None):
     out : Variable
         A variable which has been encoded as described above.
     """
+    original_chunks = var.chunks
```
@rabernat (Contributor):

This is kind of the crux. I cannot actually understand how or where the `.chunks` attribute is defined for Variables. We want the decoding pipeline to preserve chunks unmodified.

`.chunks` is not defined in the `Variable` class itself, so it must somehow be inherited from the other base classes:

```python
class Variable(NamedArray, AbstractArray, VariableArithmetic):
```

The word "chunks" does not even appear in https://github.com/pydata/xarray/blob/main/xarray/coding/times.py.

I'm tempted to loop @TomNicholas into this conversation; he recently refactored everything about how we handle chunked arrays, and he could help us sort through this.

@malmans2 (Contributor, Author):

`chunks` is inherited from `NamedArray`:

```python
def chunks(self) -> _Chunks | None:
```

Yes, the problem is that the encoder does not distinguish between dask and numpy arrays and always returns numpy arrays. I originally thought that was a mistake, but I wasn't so sure after I tried to change it.
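
To make the symptom concrete, a minimal sketch of the behavior this PR addresses (using xarray's internal `encode_zarr_variable`; at the time of this PR the datetime coder returned numpy-backed output):

```python
import numpy as np
import xarray as xr
from xarray.backends.zarr import encode_zarr_variable

times = np.arange("2000-01-01", "2000-01-11", dtype="datetime64[D]").astype("datetime64[ns]")
var = xr.Variable("time", times).chunk(5)
print(var.chunks)  # ((5, 5),)

encoded = encode_zarr_variable(var, name="time")
print(encoded.chunks)  # None: the datetime coder computed the dask array eagerly
```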

A contributor:

Datetimes are always cast to numpy using `np.asarray`, AFAICT, and this has been there since that part of the coding was first implemented.
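
A quick sketch of why that cast matters for dask inputs:

```python
import numpy as np
import dask.array as da

lazy = da.arange(10, chunks=5)   # a chunked, lazy array
eager = np.asarray(lazy)         # triggers compute; chunk information is lost
print(type(eager))               # <class 'numpy.ndarray'>
```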

A contributor:

So are you saying that all datetime arrays are eagerly computed by the coding pipelines, even if they are dask arrays?

@malmans2 (Contributor, Author) commented Nov 10, 2023

Here is a PR just to show the naive test I've done: https://github.com/pydata/xarray/actions/runs/6823091519/job/18556346220?pr=8439

For example, see this test: https://github.com/pydata/xarray/actions/runs/6823091519/job/18556346220?pr=8439#step:9:1385

Even if we force the encoder to retain the original chunks (i.e., cast back to a dask array), various tests break. That makes me think that either the encoder is doing the right thing (casting to numpy arrays) or we need a pretty involved refactor to fix this. It's just a guess, though!

@max-sixty (Collaborator)

One thing I don't understand here (and maybe no one does yet 😄) is why a bug in our CF encoding seems to leak into how standard datetime chunks work.

Because CF times are our own, more specialized implementation, bugs there are arguably less concerning. But when standard data types don't work, that has a broader impact. And if fixing them requires CF expertise, we can end up in a spot where most of us lack the context to fix a broad-based bug.

I would love us not to drop this PR! (Easy to say, harder to push through!) If there's a way to cleave CF datetimes off from the standard datetime path, I would be quite keen on that. For my own work, I'm coercing to ints for now, which is a bit of a shame.
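
(For the record, that workaround looks roughly like this; `ds` here is a hypothetical dataset with a datetime64[ns] variable:)

```python
# Sketch of the integer workaround: store nanoseconds since the epoch.
ds["time_int"] = ds["time"].astype("int64")  # ns since 1970-01-01 for datetime64[ns]
```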

@dcherian (Contributor) commented Nov 13, 2023

If #7132 (comment) is indeed the root cause, then we could fix it for numpy easily just by looking at the dtype.

@spencerkclark (Member)

@max-sixty I have been following the conversation here and may post some more thoughts eventually (@rabernat has hit on something important), but for now I'll point out that CF encoding pertains to both standard datetimes and cftime datetimes: it refers to the broader topic of how you convert time-like values to numerical values, since most file formats do not support directly serializing datetime64[ns] values¹, let alone cftime.datetime objects.

As @kmuehlbauer notes, the basic issue is that neither code path is dask-compatible. In principle, both encoding code paths (via pandas for datetime64[ns] values, or via cftime for cftime.datetime objects) could be made dask-compatible; it is long overdue.

Footnotes

1. zarr is unusual in that it does, but to date we have not taken advantage of this in xarray.
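
For concreteness, the datetime64 encoding step boils down to something like this (a sketch using xarray's internal helper, whose exact output dtype can vary by version):

```python
import numpy as np
from xarray.coding.times import encode_cf_datetime  # internal API; may change

times = np.array(["2000-01-01", "2000-01-02"], dtype="datetime64[ns]")
num, units, calendar = encode_cf_datetime(times, units="days since 2000-01-01")
print(num, units, calendar)  # [0 1] days since 2000-01-01 proleptic_gregorian
```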

@max-sixty (Collaborator)

> since most file formats do not support directly serializing datetime64[ns] values
>
> 1. zarr is unusual in that it does, but to date we have not taken advantage of this in xarray.

Great, that makes sense, thanks @spencerkclark!

@shoyer (Member) commented Nov 14, 2023

To give a little bit of background here: the reason we don't write datetime64 in a dask-compatible way is that Xarray inspects the data to figure out optimal CF time units, e.g., so we can output nice time units like "days" or "hours".

This is sometimes convenient, but I don't think it's necessary. It would probably be fine to always save datetime64 as "seconds since 1900-01-01T00:00:00" for dask arrays.
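
With fixed units, encoding becomes a pure elementwise operation, so it can map over dask blocks without computing them; a minimal sketch of the idea:

```python
import numpy as np
import dask.array as da

times = da.from_array(
    np.arange("2000-01-01", "2000-01-11", dtype="datetime64[D]").astype("datetime64[ns]"),
    chunks=5,
)
epoch = np.datetime64("1900-01-01T00:00:00", "ns")
seconds = (times - epoch) / np.timedelta64(1, "s")  # still lazy
print(seconds.chunks)  # ((5, 5),)
```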

@spencerkclark (Member)

+1. There are a few other bits we may want to be careful about when the encoding units are prescribed, but when they are not, I might lobby for a default of "nanoseconds since 1970-01-01" for datetime64[ns]. See also the discussion in #3942, which points out another drawback of the existing units-selection logic.

@rabernat (Contributor)

Another option would be to perform this check only if the variable encoding is not already set. If the user has already specified `var.encoding["units"] = "days since 1970-01-01"` or any other valid encoding, there would be no need to peek at the data.
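
On the user side, that would look roughly like this (store path and variable name are illustrative):

```python
# With units prescribed, the backend never needs to inspect the data:
ds["time"].encoding["units"] = "days since 1970-01-01"
ds.to_zarr("store.zarr")
```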

@spencerkclark (Member)

I finally had some time to think about this again; see #8575 for what I think should be a solid start at addressing this on the `encode_cf_datetime` side of things.

@malmans2 deleted the fix-zarr-datetime-chunks branch on January 27, 2025.
Labels: io, topic-backends, topic-zarr (Related to zarr storage library)

Projects: None yet

Development: Successfully merging this pull request may close the issue "chunks management with datetime64 and timedelta64 datatype".

7 participants