-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fill values in time arrays (numpy.datetime64) are lost in zarr #7790
Comments
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! |
So, I'm no expert for The xr_read = xr.open_zarr(location)
print("******************")
print("No fill value")
print(xr_read["time"])
print(xr_read["time"].encoding) ******************
No fill value
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9.223372036854776e+18, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('float64')} You might also check this without decoding ( with xr.open_zarr(location, decode_cf=False) as xr_read:
print("******************")
print("No fill value")
print(xr_read["time"])
print(xr_read["time"].encoding) ******************
No fill value
<xarray.DataArray 'time' (time: 2)>
array([-9.223372e+18, 0.000000e+00])
Coordinates:
* time (time) float64 -9.223e+18 0.0
Attributes:
calendar: proleptic_gregorian
units: days since 2023-01-02 00:00:00
_FillValue: -9.223372036854776e+18
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('float64')} Maybe a zarr-expert can chime in here, what's the best practice for time-fill_values. |
Xref: discussion #7776, which got no attention up to now. |
Ah! Okay. I did not know about the Interestingly, -9.223372036854776e+18 is just the float equivalent of numpy.datetime64('NaT'): float(np.datetime64('NaT').view('i8'))
And I know this isn't an issue with zarr and NaT because I can create the zarr store directly with the zarr library and it's perfectly happy: # Create a zarr store directly with numpy.datetime64 type
location_zarr_direct = "from_zarr.zarr"
root = zarr.open(location_zarr_direct,mode='w')
z_time_array = root.create_dataset(
"time",data=time,shape=time.shape,chunks=time.shape,dtype=time.dtype,
fill_value=time_fill_value
)
zarr.convenience.consolidate_metadata(location_zarr_direct)
# Read it back out again
read_zarr = zarr.open(location_zarr_direct,mode='r')
print(read_zarr["time"][:])
|
Interestingly, xarray is also perfectly happy to read a numpy.datetime64 array out of a zarr store as long as the xarray metadata is present. xarray even helpfully creates an '_FillValue" attribute for the array so there is no confusion:
So I am extremely confused as to why xarray encodes time arrays so strangely when it creates the zarr store itself! (Hence #7776) |
@christine-e-smit I see, thanks for the details. AFAICT from the code it looks like |
I think the issue is that we're always running "CF encoding" which is more appropriate for netCDF4 than Zarr, since Zarr supports datetime64 natively. And currently there's no way to control whether the datetime encoder is applied or not, we just look at the dtype: Lines 697 to 704 in 0f4e99d
I think the right way to fix this is to allow the user to run the |
Thanks @dcherian for filling in the details. I've digged up some more related issues: #2265, #3942, #4045 IIUC, #4684 did a great job to iron out much of these issues, but as it looks like only in the case when no In the presence of encoding = {
"time":{"_FillValue": time_fill_value, "dtype": np.int64}
xr_ds.to_zarr(location, encoding=encoding)
} One note to this: Xarray is deducing the encoding = {
"time":{"_FillValue": time_fill_value, "dtype": np.int64, 'units': 'nanoseconds since 2023-01-02'}
}
xr_ds.to_zarr(location, encoding=encoding) @christine-e-smit It would be great if you could confirm that from your side (some sanity check needed on my side). |
@kmuehlbauer - I think I'm not understanding what you are suggesting because the zarr store is still not being read correctly when I switch the fill value to a different date: # Create a numpy array of type np.datetime64 with one fill value and one date
time_fill_value = np.datetime64("1900-01-01")
time = np.array([time_fill_value,'2023-01-02'],dtype='M8[ns]')
# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))
print("******************")
print("Created with fill value 1900-01-01")
print(xr_ds["time"])
# Save the dataset to zarr
location_new_fill = "from_xarray_new_fill.zarr"
encoding = {
"time":{"_FillValue":time_fill_value,"dtype":np.int64}
}
xr_ds.to_zarr(location_new_fill,encoding=encoding)
xr_read = xr.open_zarr(location)
print("******************")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].encoding)
|
The zarr store does indeed use an integer in this case according to the .zmetadata file:
Once again the values in the zarr store are correct given the units, but xarray misreads the fill value for some reason: z_new_fill = zarr.open('from_xarray_new_fill.zarr','r')
z_new_fill["time"][:]
|
Where in the code is the time array being decoded? That seems to be where a lot of the issue is? |
Lines 717 to 735 in 25d9a28
|
@christine-e-smit Is this just a remnant of copy&paste? The above code writes to Here is my code and output for comparison (using latest zarr/xarray): # Create a numpy array of type np.datetime64 with one fill value and one date
time_fill_value = np.datetime64("1900-01-01")
time = np.array([np.datetime64("NaT"), '2023-01-02'], dtype='M8[ns]')
# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))
print("******************")
print("Created with fill value 1900-01-01")
print(xr_ds["time"])
# Save the dataset to zarr
location_new_fill = "from_xarray_new_fill.zarr"
encoding = {
"time":{"_FillValue":time_fill_value,"dtype":np.int64}
}
xr_ds.to_zarr(location_new_fill, encoding=encoding)
xr_read = xr.open_zarr(location_new_fill)
print("******************")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].encoding) ******************
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
******************
Read back out of the zarr store with xarray
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -25567, 'units': 'days since 2023-01-02 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')} This doesn't look correct either. At least the decoded <xarray.DataArray 'time' (time: 2)>
array(['1953-01-02T00:00:00.000000000', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]') I totally agree with @christine-e-smit, this is all very confusing. As said at the beginning, I have little knowledge of zarr. I'm currently digging into cf encoding/decoding which made me jump on here. AFAICT, it looks like already the encoding has a problem, at least the data on disk is already not what we expect. It seems that somehow the xarray cf_encoding/decoding is not well aligned with the zarr writing/reading of datetimes. |
So, after some debugging I think I've found two issues here with the current code. First, we need to give the fillvalue with a fitting resolution. Second, we have an issue with inferring the units from the data (if not given). Here is some workaround code which (finally, 🤞) should at least write and read correct data (added comments below): # Create a numpy array of type np.datetime64 with one fill value and one date
# FIRST ISSUE WITH _FillValue
# we need to provide ns resolution here too, otherwise we get wrong fillvalues (day-reference)
time_fill_value = np.datetime64("1900-01-01 00:00:00.00000000", "ns")
time = np.array([np.datetime64("NaT", "ns"), '2023-01-02 00:00:00.00000000'], dtype='M8[ns]')
# Create a dataset with this one array
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))
print("******************")
print("Created with fill value 1900-01-01")
print(xr_ds["time"])
# Save the dataset to zarr
location_new_fill = "from_xarray_new_fill.zarr"
# SECOND ISSUE with inferring units from data
# We need to specify "dtype" and "units" which fit our data
# Note: as we provide a _FillValue with a reference to unix-epoch
# we need to provide a fitting units too
encoding = {
"time":{"_FillValue":time_fill_value, "dtype":np.int64, "units":"nanoseconds since 1970-01-01"}
}
xr_ds.to_zarr(location_new_fill, mode="w", encoding=encoding)
xr_read = xr.open_zarr(location_new_fill)
print("******************")
print("Read back out of the zarr store with xarray")
print(xr_read["time"])
print(xr_read["time"].attrs)
print(xr_read["time"].encoding)
z_new_fill = zarr.open('from_xarray_new_fill.zarr','r', )
print("******************")
print("Read back out of the zarr store with zarr")
print(z_new_fill["time"])
print(z_new_fill["time"].attrs)
print(z_new_fill["time"][:]) ******************
Created with fill value 1900-01-01
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
******************
Read back out of the zarr store with xarray
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -2208988800000000000, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
******************
Read back out of the zarr store with zarr
<zarr.core.Array '/time' (2,) int64 read-only>
<zarr.attrs.Attributes object at 0x7f086ab8e710>
[-2208988800000000000 1672617600000000000] @christine-e-smit Please let me know, if the above workaround gives you correct results in your workflow. If so, then we can think about how to automatically align fillvalue-resolution with data-resolution and what needs to be done to correctly deduce the units. |
Oops! Yes. You are right. I had some cross-wording on the variable names. So I started a new notebook. Unfortunately, I think you may have also gotten some wires crossed? You set the time fill value to 1900-01-01, but then use NaT in the actual array? Here is a fresh notebook with a stand-alone cell with everything that I think you were doing, but I'm not 100%. The fill value is still wrong when it gets read out, but it is at least different? The fill value is now set to the units for some reason. This seems like progress? import numpy as np
import xarray as xr
import zarr
# Create a time array with one fill value, NaT
time = np.array([np.datetime64("NaT", "ns"), '2023-01-02 00:00:00.00000000'], dtype='M8[ns]')
# Create xarray with this fill value
xr_time_array = xr.DataArray(data=time,dims=['time'],name='time')
xr_ds = xr.Dataset(dict(time=xr_time_array))
print("**********************")
print("xarray created with NaT fill value")
print("----------------------")
print(xr_ds["time"])
# Save as zarr
location_with_units = "xarray_and_units.zarr"
encoding = {
"time":{"_FillValue":np.datetime64("NaT","ns"),"dtype":np.int64,"units":"nanoseconds since 1970-01-01"}
}
xr_ds.to_zarr(location_with_units,mode="w",encoding=encoding)
# Read it back out again
xr_read = xr.open_zarr(location_with_units)
print("**********************")
print("xarray created read with NaT fill value")
print("----------------------")
print(xr_read["time"])
print(xr_read["time"].attrs)
print(xr_read["time"].encoding)
|
Yes, I use NaT because I want to check if the encoder does correctly translate NaT to the provided _FillValue on write. So from your last example I'm assuming you would like to have the int64 representation of NaT as _FillValue, right? |
@christine-e-smit I've plugged your code into a fresh notebook, here is my output: **********************
xarray created with NaT fill value
----------------------
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
**********************
xarray created read with NaT fill value
----------------------
<xarray.DataArray 'time' (time: 2)>
array([ 'NaT', '2023-01-02T00:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] NaT 2023-01-02
{}
{'chunks': (2,), 'preferred_chunks': {'time': 2}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': -9223372036854775808, 'units': 'nanoseconds since 1970-01-01', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')} The output seems OK on my side. I've no idea why the data isn't correctly decoded as NaT on your side. I've checked that my environment is comparable to yours. The only difference remaining is you are on Darwin arm64 whereas I'm on Linux.
|
@christine-e-smit One more idea, you might delete the zarr folder before re-creating (if you are not doing that already). I've removed the complete folder before any new write (by putting eg. It would also be great if you could run the code from #7790 (comment) and post the output here, just for the sake of comparison (please delete the zarr-folder before if it exists). Thanks! |
@kmuehlbauer - I ran #7790 (comment) and I get an incorrect fill value:
and here is my show_versions, since it may have changed because I've added some new libraries. It looks like my ipython version is slightly different, but I can't see how that would affect things.
|
Ah hah! Well, I don't know why this is working for you @kmuehlbauer, but I can see why it is not working for me. I've been debugging through the code and it looks like the problem is the It all starts in the xarray/xarray/backends/zarr.py Lines 701 to 817 in 25d9a28
There's a bunch of stuff that gets called, but eventually we get to the function Lines 265 to 289 in 25d9a28
and then, in Lines 216 to 262 in 979b998
In line 254, |
@christine-e-smit I've created an fresh environment with only xarray and zarr and it still works on my machine. I've then followed the Darwin idea and digged up #6191 (I've got those casting warnings from exactly the line you were referring to). Comment #6191 (comment) should explain what happens here. tl;dr citing @DocOtak
There is also an open PR #7098. Thanks @christine-e-smit for sticking with me to find the root-cause here by providing detailed information and code examples. 👍 |
As in #7098, citing @dcherian:
There are three more issues revealed here when using datetime64:
|
@kmuehlbauer - genius! Yes. That pull request should fix this issue exactly! And it explains why I see this issue and you don't - with undefined behavior anything can happen. Since we are on different OSes, our systems behave differently. I just double checked with pandas and this fix will do the right thing: import pandas as pd
print(pd.to_timedelta([np.nan, 0],"ns") + np.datetime64('1970-01-01'))
I see that the pull request with the fix has been sitting since December of last year. Is there some way to somehow get someone to look at that pull request who can merge it? |
@christine-e-smit Great this works on you side with the proposed patch in #7098. Nevertheless, we've identified three more issues here in the debugging process which can now be handled one by one. So again, thanks for your contribution here. |
What happened?
I have a time array of type numpy.datetime64 with fill values. When I save the dateset I created with xarray to zarr and then read that zarr store back out again with xarray, the fill values are lost.
What did you expect to happen?
I expected my fill values to still be in my time array when I read it back out of the zarr store.
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
When I look in the .zmetadata file generated for this zarr store, I see that the time array as been converted to float and there is a units attribute:
If I open the data using zarr, the actual values in the time array look fine, given the transformation:
Output:
Environment
INSTALLED VERSIONS
commit: None
python: 3.11.3 | packaged by conda-forge | (main, Apr 6 2023, 08:58:31) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.4.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2023.4.2
pandas: 2.0.1
numpy: 1.24.3
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.14.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.7.2
pip: 23.1.2
conda: None
pytest: None
mypy: None
IPython: 8.12.0
sphinx: None
The text was updated successfully, but these errors were encountered: