Fill values in time arrays (numpy.datetime64) are lost in zarr #7790

christine-e-smit · 2023-04-26T22:03:06Z

What happened?

I have a time array of type numpy.datetime64 with fill values. When I save the dateset I created with xarray to zarr and then read that zarr store back out again with xarray, the fill values are lost.

What did you expect to happen?

I expected my fill values to still be in my time array when I read it back out of the zarr store.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr
import zarr


# Create a numpy array of type np.datetime64 with one fill value and one date
time_fill_value = np.datetime64("NaT")
time = np.array([time_fill_value,'2023-01-02'],dtype='M8[ns]')

# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))

print("******************")
print("Created with fill value (NaT)")
print(xr_ds["time"])

# Save the dataset to zarr
location = "from_xarray.zarr"
encoding = {
    "time":{"_FillValue":time_fill_value}
}
xr_ds.to_zarr(location,encoding=encoding)

xr_read = xr.open_zarr(location)
print("******************")
print("No fill value")
print(xr_read["time"])

MVCE confirmation

Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
Complete example — the example is self-contained, including all data and the text of any traceback.
Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

When I look in the .zmetadata file generated for this zarr store, I see that the time array as been converted to float and there is a units attribute:

{
    "metadata": {
        ".zattrs": {},
        ".zgroup": {
            "zarr_format": 2
        },
        "time/.zarray": {
            "chunks": [
                2
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": -9.223372036854776e+18,
            "filters": null,
            "order": "C",
            "shape": [
                2
            ],
            "zarr_format": 2
        },
        "time/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "time"
            ],
            "calendar": "proleptic_gregorian",
            "units": "days since 2023-01-02 00:00:00"
        }
    },
    "zarr_consolidated_format": 1
}

If I open the data using zarr, the actual values in the time array look fine, given the transformation:

z = zarr.open(location,'r')

print(f'Fill value: {z["time"].fill_value}')
print(f'Data values: {z["time"][:]}')

Output:

Fill value: -9.223372036854776e+18
Data values: [-9.22337204e+18  0.00000000e+00]

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.3 | packaged by conda-forge | (main, Apr 6 2023, 08:58:31) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.4.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2023.4.2
pandas: 2.0.1
numpy: 1.24.3
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.14.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.7.2
pip: 23.1.2
conda: None
pytest: None
mypy: None
IPython: 8.12.0
sphinx: None

The text was updated successfully, but these errors were encountered:

welcome · 2023-04-26T22:03:08Z

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

kmuehlbauer · 2023-04-27T11:19:24Z

@christine-e-smit

So, I'm no expert for zarr, but it turns out that your NaT was converted to -9.223372036854776e+18 in the encoding step. I'm assuming that zarr is converting NaT as the format doesn't allow to use NaT directly, so it chooses a (default) value.

The _FillValue is not lost, but it will be preserved in the .encoding-dict of the underlying Variable:

xr_read = xr.open_zarr(location)
print("******************")
print("No fill value")
print(xr_read["time"])
print(xr_read["time"].encoding)

******************
No fill value
<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9.223372036854776e+18, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('float64')}

You might also check this without decoding (decode_cd=False):

with xr.open_zarr(location, decode_cf=False) as xr_read:
    print("******************")
    print("No fill value")
    print(xr_read["time"])
    print(xr_read["time"].encoding)

******************
No fill value
<xarray.DataArray 'time' (time: 2)>
array([-9.223372e+18,  0.000000e+00])
Coordinates:
  * time     (time) float64 -9.223e+18 0.0
Attributes:
    calendar:    proleptic_gregorian
    units:       days since 2023-01-02 00:00:00
    _FillValue:  -9.223372036854776e+18
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('float64')}

Maybe a zarr-expert can chime in here, what's the best practice for time-fill_values.

kmuehlbauer · 2023-04-27T11:26:15Z

Xref: discussion #7776, which got no attention up to now.

christine-e-smit · 2023-04-27T14:08:37Z

Ah! Okay. I did not know about the .encoding option, which does indeed have the fill value. Thank you.

Interestingly, -9.223372036854776e+18 is just the float equivalent of numpy.datetime64('NaT'):

float(np.datetime64('NaT').view('i8'))

-9.223372036854776e+18

And I know this isn't an issue with zarr and NaT because I can create the zarr store directly with the zarr library and it's perfectly happy:

# Create a zarr store directly with numpy.datetime64 type
location_zarr_direct = "from_zarr.zarr"
root = zarr.open(location_zarr_direct,mode='w')
z_time_array = root.create_dataset(
    "time",data=time,shape=time.shape,chunks=time.shape,dtype=time.dtype,
    fill_value=time_fill_value
)
zarr.convenience.consolidate_metadata(location_zarr_direct)
# Read it back out again
read_zarr = zarr.open(location_zarr_direct,mode='r')
print(read_zarr["time"][:])

[                          'NaT' '2023-01-02T00:00:00.000000000']

christine-e-smit · 2023-04-27T14:13:58Z

Interestingly, xarray is also perfectly happy to read a numpy.datetime64 array out of a zarr store as long as the xarray metadata is present. xarray even helpfully creates an '_FillValue" attribute for the array so there is no confusion:

# Create a zarr store directly with numpy.datetime64 type
location_zarr_direct = "from_zarr.zarr"
root = zarr.open(location_zarr_direct,mode='w')
z_time_array = root.create_dataset(
    "time",data=time,shape=time.shape,chunks=time.shape,dtype=time.dtype,
    fill_value=time_fill_value
)
# Add xarray metadata
z_time_array.attrs["_ARRAY_DIMENSIONS"] = ["time"]
zarr.convenience.consolidate_metadata(location_zarr_direct)
# Use xarray to read this data out
xr_read_from_zarr = xr.open_zarr(location_zarr_direct)
print(xr_read_from_zarr["time"])

<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
Attributes:
    _FillValue:  NaT

So I am extremely confused as to why xarray encodes time arrays so strangely when it creates the zarr store itself! (Hence #7776)

kmuehlbauer · 2023-04-27T14:23:16Z

@christine-e-smit I see, thanks for the details. AFAICT from the code it looks like zarr is special-cased in some ways compared to other backends. I'd really rely on some zarr-expert shedding light here and over at #7776.

dcherian · 2023-04-27T19:18:12Z

I think the issue is that we're always running "CF encoding" which is more appropriate for netCDF4 than Zarr, since Zarr supports datetime64 natively. And currently there's no way to control whether the datetime encoder is applied or not, we just look at the dtype:

xarray/xarray/coding/times.py

Lines 697 to 704 in 0f4e99d

    
           class CFDatetimeCoder(VariableCoder): 
        
               def __init__(self, use_cftime: bool | None = None) -> None: 
        
                   self.use_cftime = use_cftime 
        
               def encode(self, variable: Variable, name: T_Name = None) -> Variable: 
        
                   if np.issubdtype( 
        
                       variable.data.dtype, np.datetime64 
        
                   ) or contains_cftime_datetimes(variable):

I think the right way to fix this is to allow the user to run the encode and write steps separately, with the encoding steps being controllable: #4412

kmuehlbauer · 2023-04-28T06:21:38Z

Thanks @dcherian for filling in the details.

I've digged up some more related issues: #2265, #3942, #4045

IIUC, #4684 did a great job to iron out much of these issues, but as it looks like only in the case when no NaT is within the time array (cc @spencerkclark). @christine-e-smit If you have no NaT in your time array then you can just omit encoding completely and Xarray will use int64 per default and your data should be fine on disk.

In the presence of NaT it looks like one workaround to circumvent that issue for the time being is to add the dtype in addition to _FillValue when writing out to zarr :

encoding = {
    "time":{"_FillValue": time_fill_value, "dtype": np.int64}
xr_ds.to_zarr(location, encoding=encoding)
}

One note to this: Xarray is deducing the units from the current time data. So for the above example it will result in 'days since 2023-01-02 00:00:00' where days would now be the resolution in the file. If you want the resolution to be nanoseconds on disk units would need to be added to the encoding.

encoding = {
    "time":{"_FillValue": time_fill_value, "dtype": np.int64, 'units': 'nanoseconds since 2023-01-02'}
}
xr_ds.to_zarr(location, encoding=encoding)

@christine-e-smit It would be great if you could confirm that from your side (some sanity check needed on my side).

christine-e-smit · 2023-04-28T18:07:40Z

@kmuehlbauer - I think I'm not understanding what you are suggesting because the zarr store is still not being read correctly when I switch the fill value to a different date:

# Create a numpy array of type np.datetime64 with one fill value and one date
time_fill_value = np.datetime64("1900-01-01")
time = np.array([time_fill_value,'2023-01-02'],dtype='M8[ns]')

# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))

print("******************")
print("Created with fill value 1900-01-01")
print(xr_ds["time"])

# Save the dataset to zarr
location_new_fill = "from_xarray_new_fill.zarr"
encoding = {
    "time":{"_FillValue":time_fill_value,"dtype":np.int64}
}
xr_ds.to_zarr(location_new_fill,encoding=encoding)

xr_read = xr.open_zarr(location)
print("******************")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].encoding)

******************
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array(['1900-01-01T00:00:00.000000000', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1900-01-01 2023-01-02
******************
<xarray.DataArray 'time' (time: 2)>
array(['2023-01-02T00:00:00.000000000', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2023-01-02 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9.223372036854776e+18, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('float64')}

christine-e-smit · 2023-04-28T18:08:16Z

The zarr store does indeed use an integer in this case according to the .zmetadata file:

{
    "metadata": {
        ".zattrs": {},
        ".zgroup": {
            "zarr_format": 2
        },
        "time/.zarray": {
            "chunks": [
                2
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<i8",
            "fill_value": -25567,
            "filters": null,
            "order": "C",
            "shape": [
                2
            ],
            "zarr_format": 2
        },
        "time/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "time"
            ],
            "calendar": "proleptic_gregorian",
            "units": "days since 1900-01-01 00:00:00"
        }
    },
    "zarr_consolidated_format": 1
}

Once again the values in the zarr store are correct given the units, but xarray misreads the fill value for some reason:

z_new_fill = zarr.open('from_xarray_new_fill.zarr','r')
z_new_fill["time"][:]

array([    0, 44926])

christine-e-smit · 2023-04-28T18:39:01Z

Where in the code is the time array being decoded? That seems to be where a lot of the issue is?

dcherian · 2023-04-28T20:43:44Z

xarray/xarray/coding/times.py

Lines 717 to 735 in 25d9a28

    
           def decode(self, variable: Variable, name: T_Name = None) -> Variable: 
        
               units = variable.attrs.get("units", None) 
        
               if isinstance(units, str) and "since" in units: 
        
                   dims, data, attrs, encoding = unpack_for_decoding(variable) 
        
                   units = pop_to(attrs, encoding, "units") 
        
                   calendar = pop_to(attrs, encoding, "calendar") 
        
                   dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime) 
        
                   transform = partial( 
        
                       decode_cf_datetime, 
        
                       units=units, 
        
                       calendar=calendar, 
        
                       use_cftime=self.use_cftime, 
        
                   ) 
        
                   data = lazy_elemwise_func(data, transform, dtype) 
        
                   return Variable(dims, data, attrs, encoding, fastpath=True) 
        
               else: 
        
                   return variable

kmuehlbauer · 2023-04-30T16:52:25Z

xr_ds.to_zarr(location_new_fill,encoding=encoding)

xr_read = xr.open_zarr(location)
print("******************")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].encoding)

@christine-e-smit Is this just a remnant of copy&paste? The above code writes to location_new_fill, but reads from location.

Here is my code and output for comparison (using latest zarr/xarray):

# Create a numpy array of type np.datetime64 with one fill value and one date
time_fill_value = np.datetime64("1900-01-01")
time = np.array([np.datetime64("NaT"), '2023-01-02'], dtype='M8[ns]')

# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))

print("******************")
print("Created with fill value 1900-01-01")
print(xr_ds["time"])

# Save the dataset to zarr
location_new_fill = "from_xarray_new_fill.zarr"
encoding = {
    "time":{"_FillValue":time_fill_value,"dtype":np.int64}
}
xr_ds.to_zarr(location_new_fill, encoding=encoding)

xr_read = xr.open_zarr(location_new_fill)
print("******************")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].encoding)

******************
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
******************
Read back out of the zarr store with xarray
<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -25567, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}

This doesn't look correct either. At least the decoded _FillValue or the units are wrong. So -25567 is 1900-01-01 when referenced to of unix-epoch (Question: Is zarr time based on unix epoch?). When read back via zarr only this would decode into:

<xarray.DataArray 'time' (time: 2)>
array(['1953-01-02T00:00:00.000000000', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')

I totally agree with @christine-e-smit, this is all very confusing. As said at the beginning, I have little knowledge of zarr. I'm currently digging into cf encoding/decoding which made me jump on here.

AFAICT, it looks like already the encoding has a problem, at least the data on disk is already not what we expect. It seems that somehow the xarray cf_encoding/decoding is not well aligned with the zarr writing/reading of datetimes.

kmuehlbauer · 2023-05-01T16:05:19Z

So, after some debugging I think I've found two issues here with the current code.

First, we need to give the fillvalue with a fitting resolution. Second, we have an issue with inferring the units from the data (if not given).

Here is some workaround code which (finally, 🤞) should at least write and read correct data (added comments below):

# Create a numpy array of type np.datetime64 with one fill value and one date
# FIRST ISSUE WITH _FillValue
# we need to provide ns resolution here too, otherwise we get wrong fillvalues (day-reference)
time_fill_value = np.datetime64("1900-01-01 00:00:00.00000000", "ns")
time = np.array([np.datetime64("NaT", "ns"), '2023-01-02 00:00:00.00000000'], dtype='M8[ns]')

# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))

print("******************")
print("Created with fill value 1900-01-01")
print(xr_ds["time"])

# Save the dataset to zarr
location_new_fill = "from_xarray_new_fill.zarr"
# SECOND ISSUE with inferring units from data
# We need to specify "dtype" and "units" which fit our data
# Note: as we provide a _FillValue with a reference to unix-epoch 
# we need to provide a fitting units too 
encoding = {
    "time":{"_FillValue":time_fill_value, "dtype":np.int64, "units":"nanoseconds since 1970-01-01"}
}
xr_ds.to_zarr(location_new_fill, mode="w", encoding=encoding)

xr_read = xr.open_zarr(location_new_fill)
print("******************")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].attrs)
print(xr_read["time"].encoding)

z_new_fill = zarr.open('from_xarray_new_fill.zarr','r', )
print("******************")
print("Read back out of the zarr store with zarr")

print(z_new_fill["time"])
print(z_new_fill["time"].attrs)
print(z_new_fill["time"][:])

******************
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
******************
Read back out of the zarr store with xarray
<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -2208988800000000000, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
******************
Read back out of the zarr store with zarr
<zarr.core.Array '/time' (2,) int64 read-only>
<zarr.attrs.Attributes object at 0x7f086ab8e710>
[-2208988800000000000  1672617600000000000]

@christine-e-smit Please let me know, if the above workaround gives you correct results in your workflow. If so, then we can think about how to automatically align fillvalue-resolution with data-resolution and what needs to be done to correctly deduce the units.

christine-e-smit · 2023-05-01T18:37:47Z

Oops! Yes. You are right. I had some cross-wording on the variable names. So I started a new notebook. Unfortunately, I think you may have also gotten some wires crossed? You set the time fill value to 1900-01-01, but then use NaT in the actual array?

Here is a fresh notebook with a stand-alone cell with everything that I think you were doing, but I'm not 100%. The fill value is still wrong when it gets read out, but it is at least different? The fill value is now set to the units for some reason. This seems like progress?

import numpy as np
import xarray as xr
import zarr

# Create a time array with one fill value, NaT
time = np.array([np.datetime64("NaT", "ns"), '2023-01-02 00:00:00.00000000'], dtype='M8[ns]')

# Create xarray with this fill value
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))
print("**********************")
print("xarray created with NaT fill value")
print("----------------------")
print(xr_ds["time"])

# Save as zarr
location_with_units = "xarray_and_units.zarr"
encoding = {
    "time":{"_FillValue":np.datetime64("NaT","ns"),"dtype":np.int64,"units":"nanoseconds since 1970-01-01"}
}
xr_ds.to_zarr(location_with_units,mode="w",encoding=encoding)


# Read it back out again
xr_read = xr.open_zarr(location_with_units)
print("**********************")
print("xarray created read with NaT fill value")
print("----------------------")
print(xr_read["time"])
print(xr_read["time"].attrs)
print(xr_read["time"].encoding)

**********************
xarray created with NaT fill value
----------------------
<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
**********************
xarray created read with NaT fill value
----------------------
<xarray.DataArray 'time' (time: 2)>
array(['1970-01-01T00:00:00.000000000', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1970-01-01 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9223372036854775808, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}

kmuehlbauer · 2023-05-01T19:30:22Z

Unfortunately, I think you may have also gotten some wires crossed? You set the time fill value to 1900-01-01, but then use NaT in the actual array?

Yes, I use NaT because I want to check if the encoder does correctly translate NaT to the provided _FillValue on write.

So from your last example I'm assuming you would like to have the int64 representation of NaT as _FillValue, right?
I'll try to adapt this, and see what I get

kmuehlbauer · 2023-05-01T19:53:53Z

@christine-e-smit I've plugged your code into a fresh notebook, here is my output:

**********************
xarray created with NaT fill value
----------------------
<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
**********************
xarray created read with NaT fill value
----------------------
<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9223372036854775808, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}

The output seems OK on my side. I've no idea why the data isn't correctly decoded as NaT on your side. I've checked that my environment is comparable to yours. The only difference remaining is you are on Darwin arm64 whereas I'm on Linux.

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.2 | packaged by conda-forge | (main, Mar 31 2023, 17:51:05) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-144-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.14.0
libnetcdf: None

xarray: 2023.4.2
pandas: 2.0.1
numpy: 1.24.3
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.14.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.3.2
distributed: 2023.3.2
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.6.1
pip: 23.0.1
conda: None
pytest: 7.2.2
mypy: 0.982
IPython: 8.12.0
sphinx: None

kmuehlbauer · 2023-05-01T20:01:50Z

@christine-e-smit One more idea, you might delete the zarr folder before re-creating (if you are not doing that already). I've removed the complete folder before any new write (by putting eg. !rm -rf xarray_and_units.zarr at the beginning of the notebook-cell).

It would also be great if you could run the code from #7790 (comment) and post the output here, just for the sake of comparison (please delete the zarr-folder before if it exists). Thanks!

christine-e-smit · 2023-05-01T20:25:34Z

@kmuehlbauer - I ran #7790 (comment) and I get an incorrect fill value:

******************
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array([                          'NaT', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] NaT 2023-01-02
******************
Read back out of the zarr store with xarray
<xarray.DataArray 'time' (time: 2)>
array(['1970-01-01T00:00:00.000000000', '2023-01-02T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1970-01-01 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -2208988800000000000, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
******************
Read back out of the zarr store with zarr
<zarr.core.Array '/time' (2,) int64 read-only>
<zarr.attrs.Attributes object at 0x132802a50>
[-2208988800000000000  1672617600000000000]

and here is my show_versions, since it may have changed because I've added some new libraries. It looks like my ipython version is slightly different, but I can't see how that would affect things.

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.3 | packaged by conda-forge | (main, Apr  6 2023, 08:58:31) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.4.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2023.4.2
pandas: 2.0.1
numpy: 1.24.3
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.14.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.7.2
pip: 23.1.2
conda: None
pytest: None
mypy: None
IPython: 8.13.1
sphinx: None

christine-e-smit · 2023-05-01T21:43:08Z

Ah hah! Well, I don't know why this is working for you @kmuehlbauer, but I can see why it is not working for me. I've been debugging through the code and it looks like the problem is the _decode_datetime_with_pandas function. For me, it's converting a float NaN into an integer, which results in a zero value.

It all starts in the open_zarr function, which by default sets the use_cftime parameter to None by default:

xarray/xarray/backends/zarr.py

Lines 701 to 817 in 25d9a28

    
           def open_zarr( 
        
               store, 
        
               group=None, 
        
               synchronizer=None, 
        
               chunks="auto", 
        
               decode_cf=True, 
        
               mask_and_scale=True, 
        
               decode_times=True, 
        
               concat_characters=True, 
        
               decode_coords=True, 
        
               drop_variables=None, 
        
               consolidated=None, 
        
               overwrite_encoded_chunks=False, 
        
               chunk_store=None, 
        
               storage_options=None, 
        
               decode_timedelta=None, 
        
               use_cftime=None, 
        
               zarr_version=None, 
        
               **kwargs, 
        
           ): 
        
               """Load and decode a dataset from a Zarr store. 
        
               The `store` object should be a valid store for a Zarr group. `store` 
        
               variables must contain dimension metadata encoded in the 
        
               `_ARRAY_DIMENSIONS` attribute or must have NCZarr format. 
        
               Parameters 
        
               ---------- 
        
               store : MutableMapping or str 
        
                   A MutableMapping where a Zarr Group has been stored or a path to a 
        
                   directory in file system where a Zarr DirectoryStore has been stored. 
        
               synchronizer : object, optional 
        
                   Array synchronizer provided to zarr 
        
               group : str, optional 
        
                   Group path. (a.k.a. `path` in zarr terminology.) 
        
               chunks : int or dict or tuple or {None, 'auto'}, optional 
        
                   Chunk sizes along each dimension, e.g., ``5`` or 
        
                   ``{'x': 5, 'y': 5}``. If `chunks='auto'`, dask chunks are created 
        
                   based on the variable's zarr chunks. If `chunks=None`, zarr array 
        
                   data will lazily convert to numpy arrays upon access. This accepts 
        
                   all the chunk specifications as Dask does. 
        
               overwrite_encoded_chunks : bool, optional 
        
                   Whether to drop the zarr chunks encoded for each variable when a 
        
                   dataset is loaded with specified chunk sizes (default: False) 
        
               decode_cf : bool, optional 
        
                   Whether to decode these variables, assuming they were saved according 
        
                   to CF conventions. 
        
               mask_and_scale : bool, optional 
        
                   If True, replace array values equal to `_FillValue` with NA and scale 
        
                   values according to the formula `original_values * scale_factor + 
        
                   add_offset`, where `_FillValue`, `scale_factor` and `add_offset` are 
        
                   taken from variable attributes (if they exist).  If the `_FillValue` or 
        
                   `missing_value` attribute contains multiple values a warning will be 
        
                   issued and all array values matching one of the multiple values will 
        
                   be replaced by NA. 
        
               decode_times : bool, optional 
        
                   If True, decode times encoded in the standard NetCDF datetime format 
        
                   into datetime objects. Otherwise, leave them encoded as numbers. 
        
               concat_characters : bool, optional 
        
                   If True, concatenate along the last dimension of character arrays to 
        
                   form string arrays. Dimensions will only be concatenated over (and 
        
                   removed) if they have no corresponding variable and if they are only 
        
                   used as the last dimension of character arrays. 
        
               decode_coords : bool, optional 
        
                   If True, decode the 'coordinates' attribute to identify coordinates in 
        
                   the resulting dataset. 
        
               drop_variables : str or iterable, optional 
        
                   A variable or list of variables to exclude from being parsed from the 
        
                   dataset. This may be useful to drop variables with problems or 
        
                   inconsistent values. 
        
               consolidated : bool, optional 
        
                   Whether to open the store using zarr's consolidated metadata 
        
                   capability. Only works for stores that have already been consolidated. 
        
                   By default (`consolidate=None`), attempts to read consolidated metadata, 
        
                   falling back to read non-consolidated metadata if that fails. 
        
                   When the experimental ``zarr_version=3``, ``consolidated`` must be 
        
                   either be ``None`` or ``False``. 
        
               chunk_store : MutableMapping, optional 
        
                   A separate Zarr store only for chunk data. 
        
               storage_options : dict, optional 
        
                   Any additional parameters for the storage backend (ignored for local 
        
                   paths). 
        
               decode_timedelta : bool, optional 
        
                   If True, decode variables and coordinates with time units in 
        
                   {'days', 'hours', 'minutes', 'seconds', 'milliseconds', 'microseconds'} 
        
                   into timedelta objects. If False, leave them encoded as numbers. 
        
                   If None (default), assume the same value of decode_time. 
        
               use_cftime : bool, optional 
        
                   Only relevant if encoded dates come from a standard calendar 
        
                   (e.g. "gregorian", "proleptic_gregorian", "standard", or not 
        
                   specified).  If None (default), attempt to decode times to 
        
                   ``np.datetime64[ns]`` objects; if this is not possible, decode times to 
        
                   ``cftime.datetime`` objects. If True, always decode times to 
        
                   ``cftime.datetime`` objects, regardless of whether or not they can be 
        
                   represented using ``np.datetime64[ns]`` objects.  If False, always 
        
                   decode times to ``np.datetime64[ns]`` objects; if this is not possible 
        
                   raise an error. 
        
               zarr_version : int or None, optional 
        
                   The desired zarr spec version to target (currently 2 or 3). The default 
        
                   of None will attempt to determine the zarr version from ``store`` when 
        
                   possible, otherwise defaulting to 2. 
        
               Returns 
        
               ------- 
        
               dataset : Dataset 
        
                   The newly created dataset. 
        
               See Also 
        
               -------- 
        
               open_dataset 
        
               open_mfdataset 
        
               References 
        
               ---------- 
        
               http://zarr.readthedocs.io/ 
        
               """

There's a bunch of stuff that gets called, but eventually we get to the function decode_cf_datetime, which ironically (given the name) also takes this use_cftime parameter, which is still None. Because use_cftime is None, the function calls _decode_datetime_with_pandas:

xarray/xarray/coding/times.py

Lines 265 to 289 in 25d9a28

    
           def decode_cf_datetime( 
        
               num_dates, units: str, calendar: str | None = None, use_cftime: bool | None = None 
        
           ) -> np.ndarray: 
        
               """Given an array of numeric dates in netCDF format, convert it into a 
        
               numpy array of date time objects. 
        
               For standard (Gregorian) calendars, this function uses vectorized 
        
               operations, which makes it much faster than cftime.num2date. In such a 
        
               case, the returned array will be of type np.datetime64. 
        
               Note that time unit in `units` must not be smaller than microseconds and 
        
               not larger than days. 
        
               See Also 
        
               -------- 
        
               cftime.num2date 
        
               """ 
        
               num_dates = np.asarray(num_dates) 
        
               flat_num_dates = num_dates.ravel() 
        
               if calendar is None: 
        
                   calendar = "standard" 
        
               if use_cftime is None: 
        
                   try: 
        
                       dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)

and then, in _decode_datetime_with_pandas, the code casts a float NaN value to zero:

xarray/xarray/coding/times.py

Lines 216 to 262 in 979b998

    
           def _decode_datetime_with_pandas( 
        
               flat_num_dates: np.ndarray, units: str, calendar: str 
        
           ) -> np.ndarray: 
        
               if not _is_standard_calendar(calendar): 
        
                   raise OutOfBoundsDatetime( 
        
                       "Cannot decode times from a non-standard calendar, {!r}, using " 
        
                       "pandas.".format(calendar) 
        
                   ) 
        
               delta, ref_date = _unpack_netcdf_time_units(units) 
        
               delta = _netcdf_to_numpy_timeunit(delta) 
        
               try: 
        
                   # TODO: the strict enforcement of nanosecond precision Timestamps can be 
        
                   # relaxed when addressing GitHub issue #7493. 
        
                   ref_date = nanosecond_precision_timestamp(ref_date) 
        
               except ValueError: 
        
                   # ValueError is raised by pd.Timestamp for non-ISO timestamp 
        
                   # strings, in which case we fall back to using cftime 
        
                   raise OutOfBoundsDatetime 
        
               with warnings.catch_warnings(): 
        
                   warnings.filterwarnings("ignore", "invalid value encountered", RuntimeWarning) 
        
                   if flat_num_dates.size > 0: 
        
                       # avoid size 0 datetimes GH1329 
        
                       pd.to_timedelta(flat_num_dates.min(), delta) + ref_date 
        
                       pd.to_timedelta(flat_num_dates.max(), delta) + ref_date 
        
               # To avoid integer overflow when converting to nanosecond units for integer 
        
               # dtypes smaller than np.int64 cast all integer and unsigned integer dtype 
        
               # arrays to np.int64 (GH 2002, GH 6589).  Note this is safe even in the case 
        
               # of np.uint64 values, because any np.uint64 value that would lead to 
        
               # overflow when converting to np.int64 would not be representable with a 
        
               # timedelta64 value, and therefore would raise an error in the lines above. 
        
               if flat_num_dates.dtype.kind in "iu": 
        
                   flat_num_dates = flat_num_dates.astype(np.int64) 
        
               # Cast input ordinals to integers of nanoseconds because pd.to_timedelta 
        
               # works much faster when dealing with integers (GH 1399). 
        
               flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype( 
        
                   np.int64 
        
               ) 
        
               # Use pd.to_timedelta to safely cast integer values to timedeltas, 
        
               # and add those to a Timestamp to safely produce a DatetimeIndex.  This 
        
               # ensures that we do not encounter integer overflow at any point in the 
        
               # process without raising OutOfBoundsDatetime. 
        
               return (pd.to_timedelta(flat_num_dates_ns_int, "ns") + ref_date).values

In line 254, flat_num_dates is array([ nan, 1.6726176e+18]). After line 254, flat_nuM-dates_ns_int is array([ 0, 1672617600000000000]).

kmuehlbauer · 2023-05-02T07:09:38Z

@christine-e-smit I've created an fresh environment with only xarray and zarr and it still works on my machine. I've then followed the Darwin idea and digged up #6191 (I've got those casting warnings from exactly the line you were referring to). Comment #6191 (comment) should explain what happens here.

tl;dr citing @DocOtak

The short explanation is that the time conversion functions do an astype(np.int64) or equivalent cast on arrays that contain nans. This is undefined behavior and very soon, doing this will start to emit RuntimeWarnings.

There is also an open PR #7098.

Thanks @christine-e-smit for sticking with me to find the root-cause here by providing detailed information and code examples. 👍

kmuehlbauer · 2023-05-02T08:04:45Z

As in #7098, citing @dcherian:

I think the real solution here is to explicitly handle NaNs during the decoding step. We do want these to be NaT in the output.

There are three more issues revealed here when using datetime64:

if _FillValue is set in encoding, it has to be of same type/resolution as the times in the array
If _FillValue is provided, we need to provide dtype and units to which fit our data,
eg. if the _FillValue is referenced to unix-epoch the unit's should be equivalent
when encoding in the presence of NaT the data array is converted to floating point with NaN, which is problematic for the subsequent conversion to int64

christine-e-smit · 2023-05-02T21:07:27Z

@kmuehlbauer - genius! Yes. That pull request should fix this issue exactly! And it explains why I see this issue and you don't - with undefined behavior anything can happen. Since we are on different OSes, our systems behave differently.

I just double checked with pandas and this fix will do the right thing:

import pandas as pd
print(pd.to_timedelta([np.nan, 0],"ns") + np.datetime64('1970-01-01'))

DatetimeIndex(['NaT', '1970-01-01'], dtype='datetime64[ns]', freq=None)

I see that the pull request with the fix has been sitting since December of last year. Is there some way to somehow get someone to look at that pull request who can merge it?

spencerkclark · 2023-05-02T22:28:52Z

Thanks for the ping @dcherian -- I just gave #7098 a review. I think it's close to ready to merge.

kmuehlbauer · 2023-05-03T04:25:50Z

@christine-e-smit Great this works on you side with the proposed patch in #7098.

Nevertheless, we've identified three more issues here in the debugging process which can now be handled one by one. So again, thanks for your contribution here.

christine-e-smit added bug needs triage Issue that has not been reviewed by xarray team member labels Apr 26, 2023

kmuehlbauer added topic-CF conventions topic-zarr Related to zarr storage library and removed bug needs triage Issue that has not been reviewed by xarray team member labels Apr 27, 2023

kmuehlbauer added bug and removed topic-zarr Related to zarr storage library labels May 2, 2023

kmuehlbauer mentioned this issue May 4, 2023

nanosecond precision lost when reading time data #7817

Closed

4 tasks

kmuehlbauer mentioned this issue Sep 12, 2023

Preserve nanosecond resolution when encoding/decoding times #7827

Merged

9 tasks

kmuehlbauer closed this as completed in #7827 Sep 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fill values in time arrays (numpy.datetime64) are lost in zarr #7790

Fill values in time arrays (numpy.datetime64) are lost in zarr #7790

christine-e-smit commented Apr 26, 2023

INSTALLED VERSIONS

welcome bot commented Apr 26, 2023

kmuehlbauer commented Apr 27, 2023

kmuehlbauer commented Apr 27, 2023

christine-e-smit commented Apr 27, 2023

christine-e-smit commented Apr 27, 2023

kmuehlbauer commented Apr 27, 2023

dcherian commented Apr 27, 2023

kmuehlbauer commented Apr 28, 2023

christine-e-smit commented Apr 28, 2023

christine-e-smit commented Apr 28, 2023

christine-e-smit commented Apr 28, 2023

dcherian commented Apr 28, 2023

kmuehlbauer commented Apr 30, 2023

kmuehlbauer commented May 1, 2023

christine-e-smit commented May 1, 2023 •

edited

Loading

kmuehlbauer commented May 1, 2023

kmuehlbauer commented May 1, 2023

kmuehlbauer commented May 1, 2023

christine-e-smit commented May 1, 2023

christine-e-smit commented May 1, 2023 •

edited

Loading

kmuehlbauer commented May 2, 2023 •

edited

Loading

kmuehlbauer commented May 2, 2023 •

edited

Loading

christine-e-smit commented May 2, 2023 •

edited

Loading

spencerkclark commented May 2, 2023

kmuehlbauer commented May 3, 2023

Fill values in time arrays (numpy.datetime64) are lost in zarr #7790

Fill values in time arrays (numpy.datetime64) are lost in zarr #7790

Comments

christine-e-smit commented Apr 26, 2023

What happened?

What did you expect to happen?

Minimal Complete Verifiable Example

MVCE confirmation

Relevant log output

Anything else we need to know?

Environment

INSTALLED VERSIONS

welcome bot commented Apr 26, 2023

kmuehlbauer commented Apr 27, 2023

kmuehlbauer commented Apr 27, 2023

christine-e-smit commented Apr 27, 2023

christine-e-smit commented Apr 27, 2023

kmuehlbauer commented Apr 27, 2023

dcherian commented Apr 27, 2023

kmuehlbauer commented Apr 28, 2023

christine-e-smit commented Apr 28, 2023

christine-e-smit commented Apr 28, 2023

christine-e-smit commented Apr 28, 2023

dcherian commented Apr 28, 2023

kmuehlbauer commented Apr 30, 2023

kmuehlbauer commented May 1, 2023

christine-e-smit commented May 1, 2023 • edited Loading

kmuehlbauer commented May 1, 2023

kmuehlbauer commented May 1, 2023

kmuehlbauer commented May 1, 2023

christine-e-smit commented May 1, 2023

christine-e-smit commented May 1, 2023 • edited Loading

kmuehlbauer commented May 2, 2023 • edited Loading

kmuehlbauer commented May 2, 2023 • edited Loading

christine-e-smit commented May 2, 2023 • edited Loading

spencerkclark commented May 2, 2023

kmuehlbauer commented May 3, 2023

christine-e-smit commented May 1, 2023 •

edited

Loading

christine-e-smit commented May 1, 2023 •

edited

Loading

kmuehlbauer commented May 2, 2023 •

edited

Loading

kmuehlbauer commented May 2, 2023 •

edited

Loading

christine-e-smit commented May 2, 2023 •

edited

Loading