'numpy.datetime64' object has no attribute 'year' writing from grib2 source #130
I have a consolidated JSON with a "best time series" dataset that reads from a collection of HRRR grib2 files on S3. The data access and visualization in the notebook look fine, but when I attempt to save a subset (to either zarr or netcdf) I get an error from xarray: 'numpy.datetime64' object has no attribute 'year'.

I'm raising this issue here because I run a nearly identical workflow with NetCDF files, where we also have datetime64 objects for time in xarray, and don't get this problem. Hence I'm thinking it must have something to do with reading from grib2 files?

This notebook should be reproducible (and doesn't require AWS credentials):
https://nbviewer.org/gist/rsignell-usgs/7d0a76a5f9fbf5781e9e97b66bd72882

Comments
I'll have a look when I have time, but my initial guess is that some attributes are left over from the original cftime encoding, leaving xarray trying to make a conversion it shouldn't. Can you look at the attributes of the time variable? You might be able to solve it by performing a trivial operation.
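For instance, a trivial inspection sketch (ds is assumed to be the dataset opened from the kerchunk JSON):

# Inspect leftover CF/cftime metadata on the time variable.
print(ds["time"].attrs)      # CF attributes such as units/calendar, if any
print(ds["time"].encoding)   # encoding carried over from the original source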
Maybe time is the problematic one rather than the coordinate valid_time, so you can remove it?
I tried removing it with ds = ds.drop(['time', 'step', 'heightAboveGround']).rename({'valid_time':'time'}), but unfortunately I still get the exact same error (you should be able to run this notebook locally without AWS credentials).
Can you also try multiplying by 1.0?
(or subtracting timedelta(0))
@martindurant, I'm not sure I understand. The datetime64 values are understood perfectly well in xarray; the error only appears when persisting.
I don't know why exactly, but I suspect that if you perform an arithmetic operation (like adding a timedelta zero), then any leftover encoding attributes might be usefully forgotten.
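A minimal sketch of that suggestion, assuming a datetime64 time coordinate (xarray generally does not propagate encoding through arithmetic):

import numpy as np

# Adding a zero timedelta is a no-op on the values, but the resulting
# variable carries a fresh (empty) encoding, so stale cftime metadata
# attached by the original decode should be dropped.
ds = ds.assign_coords(time=ds["time"] + np.timedelta64(0, "ns"))
print(ds["time"].encoding)  # expected to be {} after the operation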
@martindurant, I'm still not sure what I'm supposed to multiply by 1.0, or at what point I'm supposed to do it. And here is the HRRR best time series notebook that created the JSON.
"forbidden" reading the JSON file. Do I need to run this in qhub? |
Grrr... no, you don't need qhub. My bad. Just use this.
Allows you to write it, but xarray still wants to use some weird cftime encoding (even though the original data did not have that), causing overflow when you try to reload. Do you know how to not use cftime on write, and just use actual datetime values? I.e., this is not a kerchunk problem, it's an xarray/zarr encoding thing.
@martindurant, the CF Conventions don't allow datetime64 or even date strings: you need to supply float or int time values and a units attribute like "hours since 1990-01-01". The xr.open_dataset() function has an argument we could set to control this. So you don't think this has anything to do with the grib2 codec or reading grib2 files?
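For concreteness on the CF point, a sketch of supplying that encoding explicitly at write time (store path and units are made-up examples):

# Ask xarray's CF encoder to store integer offsets plus a units attribute,
# instead of reusing whatever encoding came along from the grib2 source.
ds.to_zarr(
    "foo.zarr",
    mode="w",
    encoding={"time": {"units": "hours since 1990-01-01", "dtype": "int64"}},
)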
this completes successfully:

subset = ds.assign_coords(time=ds.time.astype("datetime64[ns]")).isel(time=slice(-3, -1))
for var in subset.variables.values():
    var.encoding.pop("filters", None)
subset.to_zarr('foo.zarr', mode='w')
ds = xr.open_dataset("foo.zarr", engine="zarr").load()
@keewis, whoa, thank you! I never would have figured this out! Is this something that users should handle using this approach, or should we handle it in xarray?
xarray ought to be aware that zarr can handle any of the numpy datetime ("M8") types, so no conversion ought to be necessary with the zarr backend. I wonder what those filters contained? Can we make an example of this with a pure xarray dataset?

So from kerchunk/grib's point of view, we should always ensure that we convert to ns resolution where we can. Where we can't (e.g., ice-age data?), we'll have to explicitly call cftime's encode to get the right values and attributes.
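For the out-of-range case, that explicit cftime encode could look like this sketch (the dates variable, units, and calendar here are example assumptions):

import cftime

# Convert dates that don't fit ns-resolution datetime64 into numeric
# offsets plus CF units/calendar, storable as a plain int/float array.
values = cftime.date2num(
    dates,  # assumed: a sequence of datetime-like objects
    units="days since 0001-01-01",
    calendar="proleptic_gregorian",
)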
See pydata/xarray#6318 (comment) for the details.

Loading the written zarr store then fails with:

Can't create file '/tmp/tmpesmcutpjgrib2.923a8.idx'
Traceback (most recent call last):
File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 261, in itervalues
yield self.filestream.message_from_file(file, errors=errors)
File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 328, in message_from_file
return Message.from_file(file, offset, **kwargs)
File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 102, in from_file
raise EOFError("End of file: %r" % file)
EOFError: End of file: <_io.BufferedReader name='/tmp/tmpesmcutpjgrib2'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 523, in from_indexpath_or_filestream
self = cls.from_fieldset(filestream, index_keys, computed_keys)
File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 366, in from_fieldset
return cls.from_fieldset_and_iteritems(fieldset, iteritems, index_keys, computed_keys)
File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 379, in from_fieldset_and_iteritems
for field_id, raw_field in iteritems:
File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 279, in __iter__
for message in self.itervalues():
File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 265, in itervalues
raise EOFError("No valid message found: %r" % self.filestream.path)
EOFError: No valid message found: '/tmp/tmpesmcutpjgrib2'
Can't read index file '/tmp/tmpesmcutpjgrib2.923a8.idx'
Traceback (most recent call last):
File ".../lib/python3.9/site-packages/cfgrib/messages.py", line 532, in from_indexpath_or_filestream
index_mtime = os.path.getmtime(indexpath)
File ".../lib/python3.9/genericpath.py", line 55, in getmtime
return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpesmcutpjgrib2.923a8.idx'
---------------------------------------------------------------------------
EOFError Traceback (most recent call last)
File .../lib/python3.9/site-packages/cfgrib/messages.py:261, in FileStreamItems.itervalues(self)
260 try:
--> 261 yield self.filestream.message_from_file(file, errors=errors)
262 valid_message_found = True
File .../lib/python3.9/site-packages/cfgrib/messages.py:328, in FileStream.message_from_file(self, file, offset, **kwargs)
326 def message_from_file(self, file, offset=None, **kwargs):
327 # type: (T.IO[bytes], T.Optional[OffsetType], T.Any) -> Message
--> 328 return Message.from_file(file, offset, **kwargs)
File .../lib/python3.9/site-packages/cfgrib/messages.py:102, in Message.from_file(cls, file, offset, **kwargs)
101 if codes_id is None:
--> 102 raise EOFError("End of file: %r" % file)
103 return cls(codes_id=codes_id, **kwargs)
EOFError: End of file: <_io.BufferedReader name='/tmp/tmpesmcutpjgrib2'>
During handling of the above exception, another exception occurred:
EOFError Traceback (most recent call last)
Input In [49], in <module>
----> 1 xr.open_dataset("foo.zarr", engine="zarr").load()
File .../xarray/core/dataset.py:869, in Dataset.load(self, **kwargs)
867 for k, v in self.variables.items():
868 if k not in lazy_data:
--> 869 v.load()
871 return self
File .../xarray/core/variable.py:446, in Variable.load(self, **kwargs)
444 self._data = as_compatible_data(self._data.compute(**kwargs))
445 elif not is_duck_array(self._data):
--> 446 self._data = np.asarray(self._data)
447 return self
File .../xarray/core/indexing.py:552, in MemoryCachedArray.__array__(self, dtype)
551 def __array__(self, dtype=None):
--> 552 self._ensure_cached()
553 return np.asarray(self.array, dtype=dtype)
File .../xarray/core/indexing.py:549, in MemoryCachedArray._ensure_cached(self)
547 def _ensure_cached(self):
548 if not isinstance(self.array, NumpyIndexingAdapter):
--> 549 self.array = NumpyIndexingAdapter(np.asarray(self.array))
File .../xarray/core/indexing.py:522, in CopyOnWriteArray.__array__(self, dtype)
521 def __array__(self, dtype=None):
--> 522 return np.asarray(self.array, dtype=dtype)
File .../xarray/core/indexing.py:423, in LazilyIndexedArray.__array__(self, dtype)
421 def __array__(self, dtype=None):
422 array = as_indexable(self.array)
--> 423 return np.asarray(array[self.key], dtype=None)
File .../xarray/coding/variables.py:70, in _ElementwiseFunctionArray.__array__(self, dtype)
69 def __array__(self, dtype=None):
---> 70 return self.func(self.array)
File .../xarray/coding/variables.py:137, in _apply_mask(data, encoded_fill_values, decoded_fill_value, dtype)
133 def _apply_mask(
134 data: np.ndarray, encoded_fill_values: list, decoded_fill_value: Any, dtype: Any
135 ) -> np.ndarray:
136 """Mask all matching values in a NumPy arrays."""
--> 137 data = np.asarray(data, dtype=dtype)
138 condition = False
139 for fv in encoded_fill_values:
File .../xarray/core/indexing.py:423, in LazilyIndexedArray.__array__(self, dtype)
421 def __array__(self, dtype=None):
422 array = as_indexable(self.array)
--> 423 return np.asarray(array[self.key], dtype=None)
File .../xarray/backends/zarr.py:73, in ZarrArrayWrapper.__getitem__(self, key)
71 array = self.get_array()
72 if isinstance(key, indexing.BasicIndexer):
---> 73 return array[key.tuple]
74 elif isinstance(key, indexing.VectorizedIndexer):
75 return array.vindex[
76 indexing._arrayize_vectorized_indexer(key, self.shape).tuple
77 ]
File .../lib/python3.9/site-packages/zarr/core.py:720, in Array.__getitem__(self, selection)
718 result = self.vindex[selection]
719 else:
--> 720 result = self.get_basic_selection(pure_selection, fields=fields)
721 return result
File .../lib/python3.9/site-packages/zarr/core.py:846, in Array.get_basic_selection(self, selection, out, fields)
843 return self._get_basic_selection_zd(selection=selection, out=out,
844 fields=fields)
845 else:
--> 846 return self._get_basic_selection_nd(selection=selection, out=out,
847 fields=fields)
File .../lib/python3.9/site-packages/zarr/core.py:889, in Array._get_basic_selection_nd(self, selection, out, fields)
883 def _get_basic_selection_nd(self, selection, out=None, fields=None):
884 # implementation of basic selection for array with at least one dimension
885
886 # setup indexer
887 indexer = BasicIndexer(selection, self)
--> 889 return self._get_selection(indexer=indexer, out=out, fields=fields)
File .../lib/python3.9/site-packages/zarr/core.py:1179, in Array._get_selection(self, indexer, out, fields)
1173 if not hasattr(self.chunk_store, "getitems") or \
1174 any(map(lambda x: x == 0, self.shape)):
1175 # sequentially get one key at a time from storage
1176 for chunk_coords, chunk_selection, out_selection in indexer:
1177
1178 # load chunk selection into output array
-> 1179 self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection,
1180 drop_axes=indexer.drop_axes, fields=fields)
1181 else:
1182 # allow storage to get multiple items at once
1183 lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)
File .../lib/python3.9/site-packages/zarr/core.py:1883, in Array._chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection, drop_axes, fields)
1880 out[out_selection] = fill_value
1882 else:
-> 1883 self._process_chunk(out, cdata, chunk_selection, drop_axes,
1884 out_is_ndarray, fields, out_selection)
File .../lib/python3.9/site-packages/zarr/core.py:1826, in Array._process_chunk(self, out, cdata, chunk_selection, drop_axes, out_is_ndarray, fields, out_selection, partial_read_decode)
1824 except ArrayIndexError:
1825 cdata = cdata.read_full()
-> 1826 chunk = self._decode_chunk(cdata)
1828 # select data from chunk
1829 if fields:
File .../lib/python3.9/site-packages/zarr/core.py:2083, in Array._decode_chunk(self, cdata, start, nitems, expected_shape)
2081 if self._filters:
2082 for f in reversed(self._filters):
-> 2083 chunk = f.decode(chunk)
2085 # view as numpy array with correct dtype
2086 chunk = ensure_ndarray(chunk)
File .../lib/python3.9/site-packages/kerchunk/grib2.py:182, in GRIBCodec.decode(self, buf, out)
179 buf.tofile(fn)
181 # do decode
--> 182 ds = cfgrib.open_file(fn)
183 data = ds.variables[self.var].data
184 if hasattr(data, "build_array"):
File .../lib/python3.9/site-packages/cfgrib/dataset.py:762, in open_file(path, grib_errors, indexpath, filter_by_keys, read_keys, time_dims, extra_coords, **kwargs)
759 stream = messages.FileStream(path, errors=grib_errors)
761 index_keys = compute_index_keys(time_dims, extra_coords)
--> 762 index = open_fileindex(stream, indexpath, index_keys, filter_by_keys=filter_by_keys)
764 return open_from_index(index, read_keys, time_dims, extra_coords, **kwargs)
File .../lib/python3.9/site-packages/cfgrib/dataset.py:741, in open_fileindex(stream, indexpath, index_keys, filter_by_keys, computed_keys)
733 def open_fileindex(
734 stream: messages.FileStream,
735 indexpath: str = messages.DEFAULT_INDEXPATH,
(...)
738 computed_keys: messages.ComputedKeysType = cfmessage.COMPUTED_KEYS,
739 ) -> messages.FileIndex:
740 index_keys = sorted(set(index_keys) | set(filter_by_keys))
--> 741 index = messages.FileIndex.from_indexpath_or_filestream(
742 stream, index_keys, indexpath=indexpath, computed_keys=computed_keys
743 )
744 return index.subindex(filter_by_keys)
File .../lib/python3.9/site-packages/cfgrib/messages.py:549, in FileIndex.from_indexpath_or_filestream(cls, filestream, index_keys, indexpath, computed_keys, log)
546 except Exception:
547 log.exception("Can't read index file %r", indexpath)
--> 549 return cls.from_fieldset(filestream, index_keys, computed_keys)
File .../lib/python3.9/site-packages/cfgrib/messages.py:366, in FieldsetIndex.from_fieldset(cls, fieldset, index_keys, computed_keys)
364 else:
365 iteritems = enumerate(fieldset)
--> 366 return cls.from_fieldset_and_iteritems(fieldset, iteritems, index_keys, computed_keys)
File .../lib/python3.9/site-packages/cfgrib/messages.py:379, in FieldsetIndex.from_fieldset_and_iteritems(cls, fieldset, iteritems, index_keys, computed_keys)
377 index_keys = list(index_keys)
378 header_values_cache = {} # type: T.Dict[T.Tuple[T.Any, type], T.Any]
--> 379 for field_id, raw_field in iteritems:
380 field = ComputedKeysAdapter(raw_field, computed_keys)
381 header_values = []
File .../lib/python3.9/site-packages/cfgrib/messages.py:279, in FileStreamItems.__iter__(self)
277 old_offset = -1
278 count = 0
--> 279 for message in self.itervalues():
280 offset = message.message_get("offset", int)
281 if offset == old_offset:
File .../lib/python3.9/site-packages/cfgrib/messages.py:265, in FileStreamItems.itervalues(self)
263 except EOFError:
264 if not valid_message_found:
--> 265 raise EOFError("No valid message found: %r" % self.filestream.path)
266 break
267 except Exception:
EOFError: No valid message found: '/tmp/tmpesmcutpjgrib2'

That might be an issue with my …
Oh right, our filter to read GRIB. You would definitely need that to get any useful data out of the variables when reading from the original data, but indeed it makes no sense at all for the newly created local zarr dataset. The GRIB_* attributes probably do carry useful information, though. I can't really tell whether popping them out as in the example affects the load or the save.
sorry, my bad, the attrs have nothing to do with this, and they are not affected by the pop.

Edit: for reference, it's

"filters": [
    {
        "id": "grib",
        "var": "t2m"
    }
],
@martindurant, so you intend to remove these filters as part of kerchunk, right?
No, the filter is essential to read the GRIB files, but it makes no sense when writing out to zarr. xarray should have a way to write out without using the original filters; it's like saying "just because these data came from some special encoding, when I write them again, I want to use the same encoding". We do not, however, have a way to re-encode into GRIB-in-zarr, so xarray should allow us not to do this. Maybe there is such an option, the equivalent of the pop used for the filters above?
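A small helper sketch of that idea (this is not an existing xarray or kerchunk API; the function below is hypothetical): drop only the GRIB codec from each variable's encoding before writing, since it cannot re-encode data into zarr.

# Hypothetical helper: remove the kerchunk "grib" filter from the encoding
# of every variable, leaving any other filters (e.g. compression) in place.
def strip_grib_filters(ds):
    for var in ds.variables.values():
        filters = var.encoding.get("filters") or []
        kept = [f for f in filters if getattr(f, "codec_id", None) != "grib"]
        var.encoding["filters"] = kept or None
    return ds

strip_grib_filters(subset).to_zarr("foo.zarr", mode="w")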
I brought this up in the meeting yesterday. In short, removing or overriding the encoding is the way to go.
@martindurant, in light of pydata/xarray#6318 (comment), do you think it would be best practice to not use datetime64 in our kerchunked zarr representation, even though it's allowed?
cftime encoding of timestamps cannot be the thing we rely on, since the datasets we are considering go well beyond climate/met, and indeed the zarr virtual datasets are not necessarily to be read by xarray.
So in the current version of combine2: IF times were read with cftime (and you need to explicitly specify this), then they will be saved with cftime (default) or datetime64 (optional).
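As a sketch of what that opt-in might look like (the parameter names here, including coo_dtypes, are assumptions about the combine2/MultiZarrToZarr interface, not confirmed API):

import numpy as np
from kerchunk.combine import MultiZarrToZarr

# Sketch: ask the combiner to store the time coordinate as datetime64
# rather than as cftime-encoded numbers.
mzz = MultiZarrToZarr(
    json_paths,                               # assumed: list of per-file reference JSONs
    concat_dims=["time"],
    coo_dtypes={"time": np.dtype("M8[s]")},   # assumed option for output dtype
)
out = mzz.translate()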
nice. |
It needs a lot of docs to explain how to use it, beyond the long docstring, and the existing examples need updating to use it.
okay, maybe I can get someone to work on that. |