[Draft] Non-kerchunk backend for HDF5/netcdf4 files. #87
base: main
Conversation
This is looking great so far @sharkinsspatial !
kerchunk backend's specialized encoding translation logic
This part I would really like to either factor out, or at least really understand what it's doing. See #68
virtualizarr/readers/hdf.py
Outdated
@@ -0,0 +1,206 @@
from typing import List, Mapping, Optional

import fsspec
Does one need fsspec if reading a local file? Is there any other way to read from S3 without fsspec at all?
Not with a filesystem-like API. You would have to use boto3 or aiobotocore directly.
This is one of the great virtues of fsspec and is not to be under-valued.
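To make the tradeoff concrete, here is a small sketch (bucket and key are made up) of reading the first bytes of a remote HDF5 file with fsspec's filesystem-like API versus going through boto3 directly:

import fsspec

# fsspec: the same file-like API works for local paths and s3:// URLs (via s3fs).
with fsspec.open("s3://example-bucket/example.nc", mode="rb", anon=True) as f:
    signature = f.read(8)  # HDF5 files start with an 8-byte signature

# Without fsspec: drop down to boto3 and manage byte ranges yourself.
import boto3

s3 = boto3.client("s3")
resp = s3.get_object(Bucket="example-bucket", Key="example.nc", Range="bytes=0-7")
signature = resp["Body"].read()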
virtualizarr/readers/hdf.py
Outdated
def virtual_vars_from_hdf(
    path: str,
    drop_variables: Optional[List[str]] = None,
) -> Mapping[str, xr.Variable]:
I like this as a way to interface with the code in open_virtual_dataset.
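As a rough sketch of the kind of wiring being suggested (the import path matches the PR's virtualizarr/readers/hdf.py, but the keyword arguments and the wrapping into an xr.Dataset here are assumptions, not the PR's actual code):

from typing import List, Optional

import xarray as xr

from virtualizarr.readers.hdf import virtual_vars_from_hdf


def open_virtual_dataset(
    path: str,
    drop_variables: Optional[List[str]] = None,
) -> xr.Dataset:
    # Build virtual (manifest-backed) variables from the HDF5 file and wrap
    # them in an xarray Dataset without loading any chunk data.
    virtual_vars = virtual_vars_from_hdf(path=path, drop_variables=drop_variables)
    return xr.Dataset(virtual_vars)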
This looks cool @sharkinsspatial! My opinion is that it doesn't make sense to just forklift the kerchunk code into virtualizarr. What I would love to see is an extremely tight, strictly typed, unit-tested total refactor of the parsing logic. I think you're headed down the right path, but I encourage you to push as far as you can in that direction.
@rabernat Fully agree with your take above 👆 👍. I'm trying to work through this incrementally whenever I can find some spare time. In the spirit of thorough test coverage 🎊, looking through your issue pydata/xarray#7388 and the corresponding PR, I'm not sure what the proper incantation of variable encoding configuration is to use.
if mapping["scale_factor"] != 1 or mapping["add_offset"] != 0:
    float_dtype = _choose_float_dtype(dtype=dataset.dtype, mapping=mapping)
    target_dtype = np.dtype(float_dtype)
    codec = FixedScaleOffset(
Are you able to make this test parametrization pass with this PR? It's currently xfailed because open_virtual_dataset doesn't know how to handle scale factor encoding.
I might be misunderstanding, but none of the hdf reader code will be called for loadable_variables, and this block would only be entered for a loaded variable. Is that correct?
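For context on the encoding under discussion, here is a small standalone illustration (values are made up) of CF-style scale_factor/add_offset packing expressed with numcodecs' FixedScaleOffset, the codec constructed in the snippet above:

import numpy as np
from numcodecs import FixedScaleOffset

# CF convention: unpacked = packed * scale_factor + add_offset
scale_factor = 0.01
add_offset = 273.15

codec = FixedScaleOffset(
    offset=add_offset,
    scale=1 / scale_factor,  # numcodecs multiplies by `scale` when encoding
    dtype="f8",
    astype="i2",
)

temps = np.array([273.15, 274.0, 280.55])
packed = codec.encode(temps)      # int16, roughly [0, 85, 740]
unpacked = codec.decode(packed)   # float64 again, within quantization error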
virtualizarr/readers/hdf.py
Outdated
shape = tuple(math.ceil(a / b) for a, b in zip(dataset.shape, dataset.chunks))
paths = np.empty(shape, dtype=np.dtypes.StringDType)  # type: ignore
offsets = np.empty(shape, dtype=np.int32)
After #177, these arrays will need to be uint64 instead of int32.
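A sketch of what that allocation might look like after the change (the shape and the lengths array are illustrative; np.dtypes.StringDType requires numpy >= 2.0):

import numpy as np

shape = (3, 4)  # e.g. ceil(dim / chunk) per axis, as in the snippet above
paths = np.empty(shape, dtype=np.dtypes.StringDType())
offsets = np.empty(shape, dtype=np.uint64)
lengths = np.empty(shape, dtype=np.uint64)  # chunk byte lengths presumably get the same dtype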
manifest = _dataset_chunk_manifest(path, dataset)
if manifest:
    chunks = dataset.chunks if dataset.chunks else dataset.shape
    codecs = codecs_from_dataset(dataset)
- Given the ZarrV3 spec on codecs being non-empty: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#id12
- This comment ("use compressor, filters, post_compressor for Array v3 create", zarr-python#1944 (comment)) mapping filters and compressor to v3 concepts
- And empirically, we observed that on GOES data this builds a list of zlib and FixedScaleOffset

Leaving compressor=None causes ambiguity for roundtripping v3 metadata (ZArray -> disk -> ZArray), because we can't determine whether it's a list of two filters or a list of one filter and one compressor. zlib is a compression codec and FixedScaleOffset is not, but should they both be treated as filters?
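A minimal illustration of that ambiguity (codec parameters are made up): the same two-stage pipeline can be written two ways in v2-style metadata, and after a round trip there is nothing to tell them apart.

from numcodecs import FixedScaleOffset, Zlib

# Everything expressed as "filters", compressor left empty:
option_a = {
    "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="i2"), Zlib(level=4)],
    "compressor": None,
}

# Last codec promoted to "compressor":
option_b = {
    "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="i2")],
    "compressor": Zlib(level=4),
}

# Round-tripping option_a through disk cannot recover whether the trailing Zlib
# was meant as a second filter or as the compressor.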
@ghidalgo3 My rationale for describing the full codec chain in the filters property was that internally HDF5 does not distinguish compressors from filters; the entire encoding chain is represented as filters. Since we don't need to worry about v2 interoperability, I think we can just focus on aligning with v3's API (which still seems to be in a state of flux). I think I prefer the approach proposed in zarr-developers/zarr-python#1944 (comment), but I don't know where that leaves me in the interim until a final decision gets made on the v3 API path 🤔. For v3 compatibility we'll also need to track zarr-developers/numcodecs#524 so we use numcodecs which are compatible with the new v3 codec specification. TL;DR: I think we might be in flux for some time while upstream v3 decisions get made.
@ghidalgo3 I also want to address the question from your PR #193. IIUC, different v3 implementations will support a codec registry (zarr-developers/zarr-python#1588) to make codec support fully extensible. Codec discovery and registration have always been a thorny problem (this is a big issue in the HDF space), but I'm hopeful that this approach will be flexible.
@TomAugspurger I'm trying to merge …
@sharkinsspatial this is a behavior change in …

Something like this seems to fix the failing tests:

diff --git a/virtualizarr/zarr.py b/virtualizarr/zarr.py
index 824892c..87bb453 100644
--- a/virtualizarr/zarr.py
+++ b/virtualizarr/zarr.py
@@ -106,8 +106,15 @@ class ZArray:
     def to_kerchunk_json(self) -> str:
         zarray_dict = self.dict()
-        if zarray_dict["fill_value"] is np.nan:
+
+        fill_value = zarray_dict["fill_value"]
+
+        if fill_value is np.nan:
             zarray_dict["fill_value"] = None
+
+        elif isinstance(fill_value, (np.number, np.ndarray)):
+            zarray_dict["fill_value"] = fill_value.item()
+
         return ujson.dumps(zarray_dict)

# ZArray.dict seems to shadow "dict", so we need the type ignore

I'm not sure what behavior we want
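For context, the .item() call in the diff converts a numpy scalar into a native Python value so it can be serialized to JSON; a tiny illustration (the fill value is made up):

import numpy as np

fill = np.float32(-9999.0)
type(fill)         # numpy.float32 -- a numpy scalar
type(fill.item())  # float -- a plain Python scalar that JSON encoders handle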
The ZArray class was supposed to be a way to standardize the metadata, allowing the rest of the package to not worry about any differences that kerchunk throws at us. The .dict() method we just got for free via inheriting from the pydantic base model. I think we should migrate the interface of ZArray towards providing a unified representation of the metadata, and also try to move its API closer to that of zarr-python's ZMetaData class, because really we want to be using that instead. (Not sure if that actually answers your question)
This is a rudimentary initial implementation for #78. The core code is ported directly from kerchunk's hdf backend. I have not ported the bulk of the kerchunk backend's specialized encoding translation logic but I'll try to do so incrementally so that we can build complete test coverage for the many edge cases it currently covers.