
'numpy.datetime64' object has no attribute 'year' writing from grib2 source #130

rsignell-usgs opened this issue Feb 28, 2022 · 27 comments


@rsignell-usgs (Collaborator)

I have a consolidated JSON with a "best time series" dataset that reads from a collection of HRRR grib2 files on S3.

Data access and visualization in the notebook look fine, but when I attempt to save a subset (to either zarr or netcdf) I get an error from xarray:

AttributeError: 'numpy.datetime64' object has no attribute 'year'

I'm raising this issue here because I run a nearly identical workflow with NetCDF files, where we also have datetime64 objects for time in xarray, and don't get this problem. Hence I'm thinking it must have something to do with reading from grib2 files?

This notebook should be reproducible (and doesn't require AWS credentials):
https://nbviewer.org/gist/rsignell-usgs/7d0a76a5f9fbf5781e9e97b66bd72882

@martindurant (Member)

I'll have a look when I have time, but my initial guess is that some attributes left over from the original cftime encoding are leading xarray to attempt a conversion it shouldn't. Can you look at the attributes of the time variable? You might be able to solve this by performing a trivial operation (+ timedelta(0)?) before saving.
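For example, a minimal sketch of what I mean (untested; ds stands for the dataset opened from the references, and the exact attribute names are guesses):

import datetime

# look for stale CF metadata on the time coordinate
print(ds.time.attrs)
print(ds.time.encoding)   # xarray stashes the on-disk encoding here

# a trivial arithmetic operation yields a fresh variable without the old encoding
ds = ds.assign(time=ds.time + datetime.timedelta(0))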

@martindurant (Member)

The time variable should definitely have been inlined in the references:

Bytes | 1.38 kiB | 8 B

Maybe time is the problematic one rather than the valid_time coordinate, so you could remove it?

@rsignell-usgs (Collaborator, Author) commented Feb 28, 2022

I tried removing time and other potential problem variables with:

ds = ds.drop(['time', 'step', 'heightAboveGround']).rename({'valid_time':'time'})

but unfortunately I still get the exact same error (you should be able to run this notebook locally without AWS credentials):

https://nbviewer.org/gist/5d10137059a1793f2606e39a5baf1ba2

@martindurant (Member)

Can you also try multiplying by 1.0?

@martindurant (Member)

(or subtracting timedelta(0))

@rsignell-usgs (Collaborator, Author) commented Feb 28, 2022

@martindurant , I'm not sure I understand.
Multiply what by 1.0?
Before the json is created?

The datetime64 values are understood perfectly well in xarray -- the error only appears when persisted.
I don't understand why. Do you?
(should be easy for you to run yourself and check it out)

@martindurant (Member)

I don't know exactly why, but I suspect that if you perform an arithmetic operation (like adding a zero timedelta), any leftover encoding attributes might be usefully forgotten.

@rsignell-usgs (Collaborator, Author) commented Mar 1, 2022

@martindurant I'm still not sure what I'm supposed to multiply by 1.0 and at what point I'm supposed to do it.
Did you get a chance to try running the notebook that displays the error?

And here is the HRRR best time series notebook that created the json

@martindurant (Member)

"forbidden" reading the JSON file. Do I need to run this in qhub?

@rsignell-usgs (Collaborator, Author) commented Mar 1, 2022

Grrr.... no, you don't need qhub. My bad. Just use this s_opts instead:

s_opts = {'anon':True, 'skip_instance_cache':True}
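For reference, those options plug into the reference filesystem roughly like this (a sketch only; the JSON path and open arguments are illustrative, not copied from the notebook):

import fsspec
import xarray as xr

s_opts = {'anon': True, 'skip_instance_cache': True}
fs = fsspec.filesystem(
    'reference',
    fo='hrrr_best.json',       # consolidated JSON (illustrative path)
    remote_protocol='s3',      # the grib2 chunks live on S3
    remote_options=s_opts,
)
ds = xr.open_dataset(fs.get_mapper(''), engine='zarr',
                     backend_kwargs={'consolidated': False})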

@martindurant (Member)

import pandas as pd

ds2 = ds.assign(time=ds.time + pd.Timedelta(0))
ds2.isel(time=slice(-3, -1)).to_zarr('foo.zarr', mode='w')

This allows you to write it, but xarray still wants to use some weird cftime encoding (even though the original data did not have that), causing an overflow when you try to reload. Do you know how to avoid cftime on write and just use actual datetime values? I.e., this is not a kerchunk problem, it's an xarray/zarr encoding thing.
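One untested possibility would be to clear the leftover time encoding before writing, e.g.:

# sketch only: discard whatever encoding survived from the grib source
ds2['time'].encoding.clear()
ds2.isel(time=slice(-3, -1)).to_zarr('foo.zarr', mode='w')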

@rsignell-usgs (Collaborator, Author) commented Mar 1, 2022

@martindurant the CF conventions don't allow datetime64 values or even date strings: you have to supply float or int time values plus a units attribute like "hours since 1990-01-01".

The xr.open_dataset() function has a use_cftime argument that we could set to False.
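For illustration, a toy round trip of that encoding (a made-up example, not the HRRR data):

import numpy as np
import xarray as xr

# CF-style time: numeric offsets plus a units attribute
raw = xr.Dataset(
    coords={'time': ('time', np.arange(3), {'units': 'hours since 1990-01-01'})}
)
# xarray decodes the numbers to datetime64 when told not to use cftime
decoded = xr.decode_cf(raw, use_cftime=False)
print(decoded.time.dtype)   # datetime64[ns]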

So you don't think this has anything to do with the grib2 codec or reading grib2 files?

@keewis (Contributor) commented Mar 2, 2022

this completes successfully:

import xarray as xr

# cast time to exact datetime64[ns] so xarray doesn't reach for cftime
subset = ds.assign_coords(time=ds.time.astype("datetime64[ns]")).isel(time=slice(-3, -1))
# drop the grib filter from each variable's on-disk encoding
for var in subset.variables.values():
    var.encoding.pop("filters", None)
subset.to_zarr('foo.zarr', mode='w')

ds = xr.open_dataset("foo.zarr", engine="zarr").load()

cftime is only involved for dtypes that are not exactly datetime64[ns], so casting time avoids that. The filters entry raised errors mentioning grib when I tried to load the resulting zarr.

@rsignell-usgs (Collaborator, Author)

@keewis, whoa, thank you! I never would have figured this out! Is this something users should handle with this approach, or should we handle it in xarray?

@martindurant (Member)

xarray ought to be aware that zarr can handle any of the numpy datetime ("M8") types, so no conversion ought to be necessary with the zarr backend. I wonder what those filters contained? Can we make an example of this with a pure xarray dataset?

So from kerchunk/grib's point of view, we should always ensure that we convert to ns resolution where we can. Where we can't (e.g., ice-age data?), we'll have to explicitly call cftime's encoding routines to get the right values and attributes.
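A rough sketch of what I mean (the dates and calendar here are invented):

import cftime

units = 'days since 0001-01-01'
calendar = 'proleptic_gregorian'
dates = [cftime.DatetimeProlepticGregorian(1, 1, 1 + i) for i in range(3)]

# encode to plain numbers; these, plus the units/calendar attributes,
# are what would go into the references
nums = cftime.date2num(dates, units=units, calendar=calendar)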

@keewis (Contributor) commented Mar 2, 2022

See pydata/xarray#6318 (comment) for the datetime64 handling

> I wonder what those filters contained? Can we make an example of this with a pure xarray dataset?

the repr of those is: [GRIBCodec(var='t2m')]. If I don't remove it, the variable's .zattrs contains a bunch of GRIB_* attributes, and the load fails with this:

Can't create file '/tmp/tmpesmcutpjgrib2.923a8.idx'
Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 261, in itervalues
    yield self.filestream.message_from_file(file, errors=errors)
  File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 328, in message_from_file
    return Message.from_file(file, offset, **kwargs)
  File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 102, in from_file
    raise EOFError("End of file: %r" % file)
EOFError: End of file: <_io.BufferedReader name='/tmp/tmpesmcutpjgrib2'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 523, in from_indexpath_or_filestream
    self = cls.from_fieldset(filestream, index_keys, computed_keys)
  File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 366, in from_fieldset
    return cls.from_fieldset_and_iteritems(fieldset, iteritems, index_keys, computed_keys)
  File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 379, in from_fieldset_and_iteritems
    for field_id, raw_field in iteritems:
  File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 279, in __iter__
    for message in self.itervalues():
  File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 265, in itervalues
    raise EOFError("No valid message found: %r" % self.filestream.path)
EOFError: No valid message found: '/tmp/tmpesmcutpjgrib2'
Can't read index file '/tmp/tmpesmcutpjgrib2.923a8.idx'
Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 532, in from_indexpath_or_filestream
    index_mtime = os.path.getmtime(indexpath)
  File ".../lib/python3.9/genericpath.py", line 55, in getmtime
    return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpesmcutpjgrib2.923a8.idx'

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
File .../lib/python3.9/site-packages/cfgrib/messages.py:261, in FileStreamItems.itervalues(self)
    260 try:
--> 261     yield self.filestream.message_from_file(file, errors=errors)
    262     valid_message_found = True

File .../lib/python3.9/site-packages/cfgrib/messages.py:328, in FileStream.message_from_file(self, file, offset, **kwargs)
    326 def message_from_file(self, file, offset=None, **kwargs):
    327     # type: (T.IO[bytes], T.Optional[OffsetType], T.Any) -> Message
--> 328     return Message.from_file(file, offset, **kwargs)

File .../lib/python3.9/site-packages/cfgrib/messages.py:102, in Message.from_file(cls, file, offset, **kwargs)
    101 if codes_id is None:
--> 102     raise EOFError("End of file: %r" % file)
    103 return cls(codes_id=codes_id, **kwargs)

EOFError: End of file: <_io.BufferedReader name='/tmp/tmpesmcutpjgrib2'>

During handling of the above exception, another exception occurred:

EOFError                                  Traceback (most recent call last)
Input In [49], in <module>
----> 1 xr.open_dataset("foo.zarr", engine="zarr").load()

File .../xarray/core/dataset.py:869, in Dataset.load(self, **kwargs)
    867 for k, v in self.variables.items():
    868     if k not in lazy_data:
--> 869         v.load()
    871 return self

File .../xarray/core/variable.py:446, in Variable.load(self, **kwargs)
    444     self._data = as_compatible_data(self._data.compute(**kwargs))
    445 elif not is_duck_array(self._data):
--> 446     self._data = np.asarray(self._data)
    447 return self

File .../xarray/core/indexing.py:552, in MemoryCachedArray.__array__(self, dtype)
    551 def __array__(self, dtype=None):
--> 552     self._ensure_cached()
    553     return np.asarray(self.array, dtype=dtype)

File .../xarray/core/indexing.py:549, in MemoryCachedArray._ensure_cached(self)
    547 def _ensure_cached(self):
    548     if not isinstance(self.array, NumpyIndexingAdapter):
--> 549         self.array = NumpyIndexingAdapter(np.asarray(self.array))

File .../xarray/core/indexing.py:522, in CopyOnWriteArray.__array__(self, dtype)
    521 def __array__(self, dtype=None):
--> 522     return np.asarray(self.array, dtype=dtype)

File .../xarray/core/indexing.py:423, in LazilyIndexedArray.__array__(self, dtype)
    421 def __array__(self, dtype=None):
    422     array = as_indexable(self.array)
--> 423     return np.asarray(array[self.key], dtype=None)

File .../xarray/coding/variables.py:70, in _ElementwiseFunctionArray.__array__(self, dtype)
     69 def __array__(self, dtype=None):
---> 70     return self.func(self.array)

File .../xarray/coding/variables.py:137, in _apply_mask(data, encoded_fill_values, decoded_fill_value, dtype)
    133 def _apply_mask(
    134     data: np.ndarray, encoded_fill_values: list, decoded_fill_value: Any, dtype: Any
    135 ) -> np.ndarray:
    136     """Mask all matching values in a NumPy arrays."""
--> 137     data = np.asarray(data, dtype=dtype)
    138     condition = False
    139     for fv in encoded_fill_values:

File .../xarray/core/indexing.py:423, in LazilyIndexedArray.__array__(self, dtype)
    421 def __array__(self, dtype=None):
    422     array = as_indexable(self.array)
--> 423     return np.asarray(array[self.key], dtype=None)

File .../xarray/backends/zarr.py:73, in ZarrArrayWrapper.__getitem__(self, key)
     71 array = self.get_array()
     72 if isinstance(key, indexing.BasicIndexer):
---> 73     return array[key.tuple]
     74 elif isinstance(key, indexing.VectorizedIndexer):
     75     return array.vindex[
     76         indexing._arrayize_vectorized_indexer(key, self.shape).tuple
     77     ]

File .../lib/python3.9/site-packages/zarr/core.py:720, in Array.__getitem__(self, selection)
    718     result = self.vindex[selection]
    719 else:
--> 720     result = self.get_basic_selection(pure_selection, fields=fields)
    721 return result

File .../lib/python3.9/site-packages/zarr/core.py:846, in Array.get_basic_selection(self, selection, out, fields)
    843     return self._get_basic_selection_zd(selection=selection, out=out,
    844                                         fields=fields)
    845 else:
--> 846     return self._get_basic_selection_nd(selection=selection, out=out,
    847                                         fields=fields)

File .../lib/python3.9/site-packages/zarr/core.py:889, in Array._get_basic_selection_nd(self, selection, out, fields)
    883 def _get_basic_selection_nd(self, selection, out=None, fields=None):
    884     # implementation of basic selection for array with at least one dimension
    885 
    886     # setup indexer
    887     indexer = BasicIndexer(selection, self)
--> 889     return self._get_selection(indexer=indexer, out=out, fields=fields)

File .../lib/python3.9/site-packages/zarr/core.py:1179, in Array._get_selection(self, indexer, out, fields)
   1173 if not hasattr(self.chunk_store, "getitems") or \
   1174    any(map(lambda x: x == 0, self.shape)):
   1175     # sequentially get one key at a time from storage
   1176     for chunk_coords, chunk_selection, out_selection in indexer:
   1177 
   1178         # load chunk selection into output array
-> 1179         self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection,
   1180                             drop_axes=indexer.drop_axes, fields=fields)
   1181 else:
   1182     # allow storage to get multiple items at once
   1183     lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)

File .../lib/python3.9/site-packages/zarr/core.py:1883, in Array._chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection, drop_axes, fields)
   1880         out[out_selection] = fill_value
   1882 else:
-> 1883     self._process_chunk(out, cdata, chunk_selection, drop_axes,
   1884                         out_is_ndarray, fields, out_selection)

File .../lib/python3.9/site-packages/zarr/core.py:1826, in Array._process_chunk(self, out, cdata, chunk_selection, drop_axes, out_is_ndarray, fields, out_selection, partial_read_decode)
   1824 except ArrayIndexError:
   1825     cdata = cdata.read_full()
-> 1826 chunk = self._decode_chunk(cdata)
   1828 # select data from chunk
   1829 if fields:

File .../lib/python3.9/site-packages/zarr/core.py:2083, in Array._decode_chunk(self, cdata, start, nitems, expected_shape)
   2081 if self._filters:
   2082     for f in reversed(self._filters):
-> 2083         chunk = f.decode(chunk)
   2085 # view as numpy array with correct dtype
   2086 chunk = ensure_ndarray(chunk)

File .../lib/python3.9/site-packages/kerchunk/grib2.py:182, in GRIBCodec.decode(self, buf, out)
    179 buf.tofile(fn)
    181 # do decode
--> 182 ds = cfgrib.open_file(fn)
    183 data = ds.variables[self.var].data
    184 if hasattr(data, "build_array"):

File .../lib/python3.9/site-packages/cfgrib/dataset.py:762, in open_file(path, grib_errors, indexpath, filter_by_keys, read_keys, time_dims, extra_coords, **kwargs)
    759 stream = messages.FileStream(path, errors=grib_errors)
    761 index_keys = compute_index_keys(time_dims, extra_coords)
--> 762 index = open_fileindex(stream, indexpath, index_keys, filter_by_keys=filter_by_keys)
    764 return open_from_index(index, read_keys, time_dims, extra_coords, **kwargs)

File .../lib/python3.9/site-packages/cfgrib/dataset.py:741, in open_fileindex(stream, indexpath, index_keys, filter_by_keys, computed_keys)
    733 def open_fileindex(
    734     stream: messages.FileStream,
    735     indexpath: str = messages.DEFAULT_INDEXPATH,
   (...)
    738     computed_keys: messages.ComputedKeysType = cfmessage.COMPUTED_KEYS,
    739 ) -> messages.FileIndex:
    740     index_keys = sorted(set(index_keys) | set(filter_by_keys))
--> 741     index = messages.FileIndex.from_indexpath_or_filestream(
    742         stream, index_keys, indexpath=indexpath, computed_keys=computed_keys
    743     )
    744     return index.subindex(filter_by_keys)

File .../lib/python3.9/site-packages/cfgrib/messages.py:549, in FileIndex.from_indexpath_or_filestream(cls, filestream, index_keys, indexpath, computed_keys, log)
    546 except Exception:
    547     log.exception("Can't read index file %r", indexpath)
--> 549 return cls.from_fieldset(filestream, index_keys, computed_keys)

File .../lib/python3.9/site-packages/cfgrib/messages.py:366, in FieldsetIndex.from_fieldset(cls, fieldset, index_keys, computed_keys)
    364 else:
    365     iteritems = enumerate(fieldset)
--> 366 return cls.from_fieldset_and_iteritems(fieldset, iteritems, index_keys, computed_keys)

File .../lib/python3.9/site-packages/cfgrib/messages.py:379, in FieldsetIndex.from_fieldset_and_iteritems(cls, fieldset, iteritems, index_keys, computed_keys)
    377 index_keys = list(index_keys)
    378 header_values_cache = {}  # type: T.Dict[T.Tuple[T.Any, type], T.Any]
--> 379 for field_id, raw_field in iteritems:
    380     field = ComputedKeysAdapter(raw_field, computed_keys)
    381     header_values = []

File .../lib/python3.9/site-packages/cfgrib/messages.py:279, in FileStreamItems.__iter__(self)
    277 old_offset = -1
    278 count = 0
--> 279 for message in self.itervalues():
    280     offset = message.message_get("offset", int)
    281     if offset == old_offset:

File .../lib/python3.9/site-packages/cfgrib/messages.py:265, in FileStreamItems.itervalues(self)
    263 except EOFError:
    264     if not valid_message_found:
--> 265         raise EOFError("No valid message found: %r" % self.filestream.path)
    266     break
    267 except Exception:

EOFError: No valid message found: '/tmp/tmpesmcutpjgrib2'

That might be an issue with my cfgrib installation, though

@martindurant (Member)

Oh right, our filter to read GRIB. You would definitely need it to get any useful data out of the variables when reading from the original data, but indeed it makes no sense at all for the newly created local zarr dataset. The GRIB_* attributes probably do carry useful information, though. I can't really tell whether popping them out as in the example affects the load or the save.

@keewis (Contributor) commented Mar 2, 2022

sorry, my bad, the attrs have nothing to do with this, and they are not affected by the pop, either.

Edit: for reference, it's .zarray that contains the filters:

    "filters": [
        {
            "id": "grib",
            "var": "t2m"
        }
    ],

@rsignell-usgs (Collaborator, Author)

@martindurant , so you intend to remove these filters as part of kerchunk, right?

@martindurant (Member)

No, the filter is essential for reading the GRIB files, but it makes no sense when writing out to zarr. xarray should have a way to write without using the original filters; as it stands, it's as if we were saying "just because these data came with some special encoding, I want to use the same encoding when I write them again". We do not, however, have a way to re-encode into GRIB-in-zarr, so xarray should allow us to skip this. Maybe there is such an option, the equivalent of the pop for the filters above?
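One candidate (untested here): xarray's to_zarr accepts per-variable encoding overrides, which might serve as exactly that:

# sketch: override the stored filter at write time instead of
# popping it from each variable's encoding beforehand
subset.to_zarr('foo.zarr', mode='w',
               encoding={'t2m': {'filters': None}})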

@keewis (Contributor) commented Mar 3, 2022

I brought this up in the meeting yesterday. In short, removing or overriding the filters explicitly, or clearing encoding entirely are the only options we have at the moment. See also pydata/xarray#6323.

@rsignell-usgs (Collaborator, Author)

@martindurant , in light of pydata/xarray#6318 (comment), do you think it would be best practice to not use datetime64 in our kerchunked zarr representation, even though it's allowed?

@martindurant (Member)

cftime encoding of timestamps cannot be the thing we rely on: the datasets we are considering go well beyond climate/met, and the zarr virtual datasets are not necessarily going to be read by xarray.

@martindurant (Member)

So in the current version of combine2: if times were read with cftime (and you need to explicitly request this), then they will be saved with cftime (the default) or datetime64 (optional).

@rsignell-usgs (Collaborator, Author) commented Mar 7, 2022

nice.

@martindurant (Member)

Should we merge combine2?

It needs a lot of docs explaining how to use it, beyond the long docstring, and the existing examples need updating to use it.

@rsignell-usgs (Collaborator, Author)

Okay, maybe I can get someone to work on that.
