fix zarr datetime64 chunks #8253

Closed
wants to merge 18 commits

Conversation

@malmans2 (Contributor) commented Sep 28, 2023

@github-actions bot added the topic-backends, topic-zarr (Related to zarr storage library), and io labels Sep 28, 2023
@malmans2 (Contributor, Author) commented Sep 28, 2023

If this is the right way to go, we also need to implement this check when adding chunks to encoding:

```python
if any(len(set(chunks[:-1])) > 1 for chunks in var_chunks):
    raise ValueError(
        "Zarr requires uniform chunk sizes except for final chunk. "
        f"Variable named {name!r} has incompatible dask chunks: {var_chunks!r}. "
        "Consider rechunking using `chunk()`."
    )
if any((chunks[0] < chunks[-1]) for chunks in var_chunks):
    raise ValueError(
        "Final chunk of Zarr array must be the same size or smaller "
        f"than the first. Variable named {name!r} has incompatible Dask chunks {var_chunks!r}. "
        "Consider either rechunking using `chunk()` or instead deleting "
        "or modifying `encoding['chunks']`."
    )
# return the first chunk for each dimension
return tuple(chunk[0] for chunk in var_chunks)
```
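
For illustration, here is how that check behaves on a couple of chunk layouts (a sketch; `_squeeze_var_chunks` is the helper name this PR introduces):

```python
# Sketch of the check above (function name from this PR).
var_chunks = ((5, 5, 3), (10,))  # uniform per dim; a smaller final chunk is allowed
_squeeze_var_chunks(var_chunks, name="time")  # -> (5, 10)

var_chunks = ((5, 3, 5),)  # non-uniform interior chunks
_squeeze_var_chunks(var_chunks, name="time")  # raises ValueError
```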

@malmans2 (Contributor, Author)

Ready for review!

@max-sixty (Collaborator)

Sorry no one got to this; that's poor form on our part. Thanks a lot for the PR, @malmans2.

I don't know this code that well; can anyone else take a look, @pydata/xarray?

It likely fixes #8432!

@rabernat (Contributor) left a review comment

I agree, so sorry @malmans2 that we didn't review this important PR sooner.

This seems to fix the problem. But I wonder... would it not be better fixed in `encode_cf_variable` or `CFDatetimeCoder`?

I would rather make the coders handle chunks consistently across all dtypes instead of "fixing" this problem at the Zarr layer.

Review comment on `xarray/backends/zarr.py`:

```diff
@@ -106,6 +106,24 @@ def __getitem__(self, key):
         # could possibly have a work-around for 0d data here


+def _squeeze_var_chunks(var_chunks, name=None):
```
@rabernat (Contributor):

Can we get a quick comment to explain what this function does?

Review comment on `xarray/backends/zarr.py`:

```diff
@@ -317,6 +323,8 @@ def encode_zarr_variable(var, needs_copy=True, name=None):
         var = coder.encode(var, name=name)
     var = coding.strings.ensure_fixed_length_bytes(var)

+    if original_chunks and not var.chunks and "chunks" not in var.encoding:
+        var.encoding["chunks"] = _squeeze_var_chunks(original_chunks, name=name)
```
@rabernat (Contributor):

I worry that fixing the issue this way reveals that our internal interfaces are leaky. It seems like a bandaid for a deeper problem.

Why does `encode_cf_variable` work for some dask-backed variables but not for certain datetimes? Why does `CFDatetimeCoder` behave this way? Is it possible that the encoder is eagerly computing the dask array by mistake?

@malmans2 (Contributor, Author)

> This seems to fix the problem. But I wonder... would it not be better fixed in `encode_cf_variable` or `CFDatetimeCoder`?
>
> I would rather make the coders handle chunks consistently across all dtypes instead of "fixing" this problem at the Zarr layer.

Indeed. This PR is a workaround.
As mentioned here, the very first thing I tried was to restore the dask chunks right after `encode_cf_variable` is called by the zarr backend (i.e., using `.chunk` rather than editing `.encoding`). However, that breaks some tests.

I went for the easy solution because I don't know the answers to these questions:

> Why does `encode_cf_variable` work for some dask-backed variables but not for certain datetimes? Why does `CFDatetimeCoder` behave this way? Is it possible that the encoder is eagerly computing the dask array by mistake?

But I'm happy to invest some time in a better fix if we think `encode_cf_variable` or `CFDatetimeCoder` are not behaving properly.

@max-sixty (Collaborator)

Though this also affects normal non-CF datetimes: is the CF encoder mangling those? Or is this not an issue with the CF encoder?

(Forgive me adding more questions than answers...)

Review comment on `xarray/backends/zarr.py`:

```diff
@@ -307,6 +318,7 @@ def encode_zarr_variable(var, needs_copy=True, name=None):
     out : Variable
         A variable which has been encoded as described above.
     """
+    original_chunks = var.chunks
```
@rabernat (Contributor):

This is kind of the crux. I cannot actually understand how or where the `.chunks` attribute is defined for Variables. We want the decoding pipeline to preserve chunks unmodified.

`.chunks` is not defined in the `Variable` class itself, so it must somehow be inherited from the other base classes:

```python
class Variable(NamedArray, AbstractArray, VariableArithmetic):
```

The word "chunks" does not even appear in https://github.com/pydata/xarray/blob/main/xarray/coding/times.py.

I'm tempted to loop @TomNicholas into this conversation; he recently refactored everything about how we handle chunked arrays, and he could help us sort through this.

@malmans2 (Contributor, Author):

`chunks` is inherited from `NamedArray`:

```python
def chunks(self) -> _Chunks | None:
```

Yes, the problem is that the encoder does not distinguish between dask and numpy arrays and always returns numpy arrays. I originally thought that was a mistake, but I wasn't so sure after I tried to change it.
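
To make the symptom concrete, a minimal sketch of the behavior this PR addresses (using xarray's internal `encode_zarr_variable`; at the time of this PR the datetime coder returned numpy-backed output):

```python
import numpy as np
import xarray as xr
from xarray.backends.zarr import encode_zarr_variable

times = np.arange("2000-01-01", "2000-01-11", dtype="datetime64[D]").astype("datetime64[ns]")
var = xr.Variable("time", times).chunk(5)
print(var.chunks)  # ((5, 5),)

encoded = encode_zarr_variable(var, name="time")
print(encoded.chunks)  # None: the datetime coder computed the dask array eagerly
```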

A contributor:

Datetimes are always cast to numpy using `np.asarray`, AFAICT, and this has been there since that part of the coding was first implemented.
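
A quick sketch of why that cast matters for dask inputs:

```python
import numpy as np
import dask.array as da

lazy = da.arange(10, chunks=5)   # a chunked, lazy array
eager = np.asarray(lazy)         # triggers compute; chunk information is lost
print(type(eager))               # <class 'numpy.ndarray'>
```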

A contributor:

So are you saying that all datetime arrays are eagerly computed by the coding pipelines, even if they are dask arrays?

@malmans2 (Contributor, Author) commented Nov 10, 2023

Here is a PR just to show the naive test I've done: https://github.com/pydata/xarray/actions/runs/6823091519/job/18556346220?pr=8439

For example, see this test: https://github.com/pydata/xarray/actions/runs/6823091519/job/18556346220?pr=8439#step:9:1385

Even if we force the encoder to retain the original chunks (i.e., cast back to a dask array), various tests break. That makes me think that either the encoder is doing the right thing (casting to numpy arrays) or we need a pretty involved refactor to fix this. It's just a guess, though!

@max-sixty (Collaborator)

One thing I don't understand here (and maybe no one does yet 😄) is why a bug in our CF encoding seems to leak into how standard datetime chunks work.

Because CF times are our own, more specialized implementation, bugs there are arguably less concerning. But when standard data types don't work, that has a broader impact. And if fixing them requires CF expertise, we can end up in a spot where most of us lack the context to fix a broad-based bug.

I would love us not to drop this PR! (Easy to say, harder to push through!) If there's a way to cleave CF datetimes off from the standard datetime path, I would be quite keen on that. For my own work, I'm coercing to ints for now, which is a bit of a shame.
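
(For the record, that workaround looks roughly like this; `ds` here is a hypothetical dataset with a datetime64[ns] variable:)

```python
# Sketch of the integer workaround: store nanoseconds since the epoch.
ds["time_int"] = ds["time"].astype("int64")  # ns since 1970-01-01 for datetime64[ns]
```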

@dcherian (Contributor) commented Nov 13, 2023

If #7132 (comment) is indeed the root cause, then we could fix it for numpy easily just by looking at the dtype.

@spencerkclark (Member)

@max-sixty I have been following the conversation here and may post some more thoughts eventually (@rabernat has hit on something important), but for now I'll point out that CF encoding pertains to both standard datetimes and cftime datetimes: it refers to the broader topic of how you convert time-like values to numerical values, since most file formats do not support directly serializing datetime64[ns] values¹, let alone cftime.datetime objects.

As @kmuehlbauer notes, the basic issue is that neither code path is dask-compatible. In principle, both encoding code paths (via pandas for datetime64[ns] values, or via cftime for cftime.datetime objects) could be made dask-compatible; it is long overdue.

Footnotes

1. zarr is unusual in that it does, but to date we have not taken advantage of this in xarray.
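
For concreteness, the datetime64 encoding step boils down to something like this (a sketch using xarray's internal helper, whose exact output dtype can vary by version):

```python
import numpy as np
from xarray.coding.times import encode_cf_datetime  # internal API; may change

times = np.array(["2000-01-01", "2000-01-02"], dtype="datetime64[ns]")
num, units, calendar = encode_cf_datetime(times, units="days since 2000-01-01")
print(num, units, calendar)  # [0 1] days since 2000-01-01 proleptic_gregorian
```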

@max-sixty (Collaborator)

> since most file formats do not support directly serializing datetime64[ns] values
>
> 1. zarr is unusual in that it does, but to date we have not taken advantage of this in xarray.

Great, that makes sense, thanks @spencerkclark!

@shoyer (Member) commented Nov 14, 2023

To give a little bit of background here: the reason we don't write datetime64 in a dask-compatible way is that Xarray inspects the data to figure out optimal CF time units, e.g., so we can output nice time units like "days" or "hours".

This is sometimes convenient, but I don't think it's necessary. It would probably be fine to always save datetime64 as "seconds since 1900-01-01T00:00:00" for dask arrays.
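
With fixed units, encoding becomes a pure elementwise operation, so it can map over dask blocks without computing them; a minimal sketch of the idea:

```python
import numpy as np
import dask.array as da

times = da.from_array(
    np.arange("2000-01-01", "2000-01-11", dtype="datetime64[D]").astype("datetime64[ns]"),
    chunks=5,
)
epoch = np.datetime64("1900-01-01T00:00:00", "ns")
seconds = (times - epoch) / np.timedelta64(1, "s")  # still lazy
print(seconds.chunks)  # ((5, 5),)
```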

@spencerkclark (Member)

+1. There are a few other bits we may want to be careful about when the encoding units are prescribed, but when they are not, I might lobby for a default of "nanoseconds since 1970-01-01" for datetime64[ns]. See also the discussion in #3942, which points out another drawback of the existing units-selection logic.

@rabernat (Contributor)

Another option would be to perform this check only if the variable encoding is not already set. If the user has already specified `var.encoding["units"] = "days since 1970-01-01"` or any other valid encoding, there would be no need to peek at the data.
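
On the user side, that would look roughly like this (store path and variable name are illustrative):

```python
# With units prescribed, the backend never needs to inspect the data:
ds["time"].encoding["units"] = "days since 1970-01-01"
ds.to_zarr("store.zarr")
```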

@spencerkclark (Member)

I finally had some time to think about this again; see #8575 for what I think should be a solid start at addressing this on the `encode_cf_datetime` side of things.

@malmans2 deleted the fix-zarr-datetime-chunks branch on January 27, 2025.
Labels: io, topic-backends, topic-zarr (Related to zarr storage library)

Projects: None yet

Development: Successfully merging this pull request may close the issue "chunks management with datetime64 and timedelta64 datatype".

7 participants