[Draft] Non-kerchunk backend for HDF5/netcdf4 files. #87

sharkinsspatial · 2024-04-22T18:37:25Z

This is a rudimentary initial implementation for #78. The core code is ported directly from kerchunk's hdf backend. I have not ported the bulk of the kerchunk backend's specialized encoding translation logic but I'll try to do so incrementally so that we can build complete test coverage for the many edge cases it currently covers.

TomNicholas

This is looking great so far @sharkinsspatial !

kerchunk backend's specialized encoding translation logic

This part I would really like to either factor out, or at least really understand what it's doing. See #68

TomNicholas · 2024-04-22T20:53:03Z

virtualizarr/readers/hdf.py

@@ -0,0 +1,206 @@
+from typing import List, Mapping, Optional
+
+import fsspec


Does one need fsspec if reading a local file? Is there any other way to read from S3 without fsspec at all?

Not with a filesystem-like API. You would have to use boto3 or aiobotocore directly.

This is one of the great virtues of fsspec and is not to be under-valued.

TomNicholas · 2024-04-22T20:56:06Z

virtualizarr/readers/hdf.py

+def virtual_vars_from_hdf(
+    path: str,
+    drop_variables: Optional[List[str]] = None,
+) -> Mapping[str, xr.Variable]:


I like this as a way to interface with the code in open_virtual_dataset

rabernat · 2024-04-22T21:42:46Z

This looks cool @sharkinsspatial!

My opinion is that it doesn't make sense to just forklift the kerchunk code into virtualizarr. What I would love to see is an extremely tight, strictly typed, unit-tested total refactor of the parsing logic. I think you're headed down the right path, but I encourage you to push as far as you can in that direction.

sharkinsspatial · 2024-05-13T19:26:21Z

@rabernat Fully agree with your take above 👆 👍 . I'm trying to work through this incrementally whenever I can find some spare time. In the spirit of thorough test coverage 🎊 looking through your issue pydata/xarray#7388 and the corresponding PR I'm not sure what the proper incantation of variable encoding configuration is to use blosc with the netcdf4 engine? Do you have an example of this that you can provide?

sharkinsspatial · 2024-10-24T17:12:49Z

@TomNicholas We're still investigating some failing tests associated with inconsistencies around coordinate variable round-tripping and a compatibility issue with nodata and CF time encoding. But given that this backend is still considered experimental and requires an explicit opt-in, I think this PR is OK to review and merge. These are the tests that are currently xfailed and need further investigation:

https://github.com/zarr-developers/VirtualiZarr/blob/hdf5_reader/virtualizarr/tests/test_integration.py#L199
https://github.com/zarr-developers/VirtualiZarr/blob/hdf5_reader/virtualizarr/tests/test_readers/test_hdf_integration.py#L14

This test fails with both the kerchunk hdf reader and the experimental reader.
https://github.com/zarr-developers/VirtualiZarr/blob/hdf5_reader/virtualizarr/tests/test_readers/test_hdf_integration.py#L30

TomNicholas

Amazing! I'll take a more detailed look later / tomorrow / next week 😅

TomNicholas · 2024-10-24T18:27:04Z

pyproject.toml

+    "h5py",
+    "hdf5plugin",
    "numcodecs",
+    "imagecodecs",
+    "imagecodecs-numcodecs==2024.6.1",


I think these should be optional, no? (numcodecs is still required.) That also means they can be removed from the min-deps.yml in CI.
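For context, a minimal sketch of how a reader could guard optional imports once these packages are no longer hard requirements. The helper names `has_module` and `require_optional` are hypothetical, and treating `hdf_reader` as the name of a pip extra is an assumption, not something this PR confirms.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def require_optional(name: str, extra: str) -> None:
    """Raise an informative ImportError when an optional dependency is missing.

    `extra` is the (assumed) name of the optional-dependency group to install.
    """
    if not has_module(name):
        raise ImportError(
            f"'{name}' is required for this reader. "
            f'Install it with: pip install "virtualizarr[{extra}]"'
        )
```

A reader could call `require_optional("h5py", "hdf_reader")` at the top of its entry point, so users without the extra only see the error when they actually opt in to that backend.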

TomNicholas · 2024-10-24T18:30:21Z

virtualizarr/backend.py

    "hdf5": HDF5VirtualBackend,
    "netcdf4": HDF5VirtualBackend,  # note this is the same as for hdf5
+    "netcdf3": NetCDF3VirtualBackend,


Shouldn't your reader be added here too? Sounds like it could go under hdf rather than hdf5? Or maybe that's too subtle...

Since we discussed treating this new reader as experimental while we do some further battle testing, I'm forcing users to explicitly pass the new backend class HDFVirtualBackend to the optional backend argument for open_virtual_dataset to distinguish it from using the kerchunk reader with filetype auto-detection.

TomNicholas

This looks great @sharkinsspatial ! Thank you for your patience. My comments are mostly just nits.

EDIT: The typing errors should be resolved by merging main.

TomNicholas · 2024-11-08T21:33:11Z

pyproject.toml

+    "hdf5plugin",
+    "imagecodecs",
+    "imagecodecs-numcodecs==2024.6.1",


Couldn't you just define the test dependencies as this list + the hdf_reader list? Instead of manually repeating entries.

In fact, because your tests are behind @requires_X decorators, couldn't we just consider these hdf5 dependencies as not required for the basic test suite (in the same way that icechunk is not required, for example)? But we still make sure to install all of them and run those tests in at least one CI job.

TomNicholas · 2024-11-08T21:40:42Z

virtualizarr/backend.py

+    if backend:
+        backend_cls = backend
+    else:
+        backend_cls = VIRTUAL_BACKENDS.get(filetype.name.lower())  # type: ignore


So if both backend and filetype are explicitly specified, this will silently ignore the filetype argument. I think it should maybe warn or raise in that case, otherwise someone could do

open_virtual_dataset("file.grib", filetype="dmr++", backend=HDFVirtualBackend)

and it could become very unclear what just happened.
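A minimal sketch of the "raise" option, for illustration only. The registry, the two placeholder classes and the `resolve_backend` helper are hypothetical stand-ins rather than VirtualiZarr's actual code, and the sketch skips filetype auto-detection entirely.

```python
from typing import Optional


class KerchunkHDF5Backend:
    """Placeholder for the default kerchunk-based HDF5 backend."""


class ExperimentalHDFBackend:
    """Placeholder for the experimental non-kerchunk backend."""


# Hypothetical registry keyed by lower-cased filetype name.
BACKENDS: dict[str, type] = {
    "hdf5": KerchunkHDF5Backend,
    "netcdf4": KerchunkHDF5Backend,
}


def resolve_backend(
    filetype: Optional[str] = None, backend: Optional[type] = None
) -> type:
    """Pick a backend class, refusing ambiguous combinations of arguments."""
    if backend is not None and filetype is not None:
        # Both were given explicitly: fail loudly instead of ignoring one.
        raise ValueError(
            "Pass either `filetype` or `backend`, not both: "
            f"got filetype={filetype!r} and backend={backend.__name__}."
        )
    if backend is not None:
        return backend
    if filetype is None:
        # Simplification: the real function would auto-detect the filetype here.
        raise ValueError("Either `filetype` or `backend` must be given.")
    try:
        return BACKENDS[filetype.lower()]
    except KeyError:
        raise NotImplementedError(f"Unsupported filetype: {filetype!r}") from None
```

Swapping the first `raise` for `warnings.warn(...)` would give the softer variant; either way the conflicting call no longer passes silently.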

TomNicholas · 2024-11-08T21:45:07Z

virtualizarr/readers/hdf.py

+            path=filepath, reader_options=reader_options, group=group
+        )
+
+        coord_names = attrs.pop("coordinates", [])


Isn't this going to return a string like "lat lon" rather than a list like you want it to?
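To illustrate the concern: the CF `coordinates` attribute is a single blank-separated string, so it needs splitting before it can be used as a list of names. A minimal sketch, where `pop_coord_names` is a hypothetical helper and the exact type the attribute arrives as in this code path (str, bytes, or a sequence) is an assumption:

```python
from typing import Any


def pop_coord_names(attrs: dict[str, Any]) -> list[str]:
    """Remove the CF ``coordinates`` attribute and return it as a list of names."""
    raw = attrs.pop("coordinates", "")
    if isinstance(raw, bytes):
        # HDF5 string attributes can surface as bytes; decode before splitting.
        raw = raw.decode("utf-8")
    if isinstance(raw, str):
        # "lat lon" -> ["lat", "lon"]; an empty string gives an empty list.
        return raw.split()
    # Already a sequence of names: normalise each entry to str.
    return [str(name) for name in raw]


attrs = {"coordinates": "lat lon", "units": "K"}
coord_names = pop_coord_names(attrs)
```

Without the split, iterating over `"lat lon"` would yield individual characters rather than variable names.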

TomNicholas · 2024-11-08T21:45:46Z

virtualizarr/readers/hdf.py

+            loadable_variables,
+        )
+
+        virtual_vars = HDFVirtualBackend._virtual_vars_from_hdf(


The fact that my @staticmethod prevents you from just doing

Suggested change

virtual_vars = HDFVirtualBackend._virtual_vars_from_hdf(

virtual_vars = self._virtual_vars_from_hdf(

is maybe a code smell...
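A small sketch of why this can matter, assuming the calling method is itself a `@staticmethod` (so there is no `self` or `cls` to go through). The class and method names below are hypothetical; `@classmethod` is shown as one possible alternative, not as what this PR should necessarily do.

```python
class Backend:
    @staticmethod
    def _virtual_vars(path: str) -> dict[str, str]:
        return {"path": path}

    @staticmethod
    def open_static(path: str) -> dict[str, str]:
        # No self/cls available here, so the class name must be hard-coded.
        return Backend._virtual_vars(path)

    @classmethod
    def open_cls(cls, path: str) -> dict[str, str]:
        # cls is whichever class the method was called on, so overrides apply.
        return cls._virtual_vars(path)


class SubBackend(Backend):
    @staticmethod
    def _virtual_vars(path: str) -> dict[str, str]:
        return {"path": path, "source": "subclass"}


static_result = SubBackend.open_static("a.nc")
cls_result = SubBackend.open_cls("a.nc")
```

Here `static_result` still comes from `Backend._virtual_vars`, while `cls_result` picks up the subclass override, which is the practical cost of hard-coding the class name.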

TomNicholas · 2024-11-08T21:48:42Z

virtualizarr/readers/hdf.py

+        return variables
+
+    @staticmethod
+    def _attrs_from_root_group(


I think this merely defaults to the root group?

Suggested change

def _attrs_from_root_group(

def _attrs_from_group(

Or maybe even just ._get_group_attrs

TomNicholas · 2024-11-08T22:37:32Z

virtualizarr/readers/hdf_filters.py

@@ -0,0 +1,146 @@
+import dataclasses


Should we put hdf_filters.py behind a module, i.e. virtualizarr.readers.hdf.filters.py?

TomNicholas · 2024-11-08T22:39:22Z

virtualizarr/readers/hdf_filters.py

+    return codec
+
+
+def cfcodec_from_dataset(dataset: h5py.Dataset) -> Codec | None:


I think this deserves a docstring for context as to what it does

TomNicholas · 2024-11-08T22:40:37Z

virtualizarr/tests/test_backend.py

@@ -82,14 +83,15 @@ def test_FileType():


 @requires_kerchunk
+@pytest.mark.parametrize("hdf_backend", [None, HDFVirtualBackend])


In order to not rely on default behaviour?

Suggested change

@pytest.mark.parametrize("hdf_backend", [None, HDFVirtualBackend])

@pytest.mark.parametrize("hdf_backend", [HDF5VirtualBackend, HDFVirtualBackend])

TomNicholas · 2024-11-08T22:42:06Z

virtualizarr/tests/test_backend.py

-            open_virtual_dataset(hdf5_groups_file, group="doesnt_exist")
+    @pytest.mark.parametrize("hdf_backend", [None, HDFVirtualBackend])
+    def test_group_kwarg(self, hdf5_groups_file, hdf_backend):
+        if hdf_backend:


Yeah I find this "None means kerchunk" confusing

TomNicholas · 2024-11-08T22:45:37Z

virtualizarr/tests/test_readers/test_hdf_integration.py

+        roundtrip = xr.open_dataset(kerchunk_file, engine="kerchunk", decode_times=True)
+        xrt.assert_allclose(ds, roundtrip)
+
+    @pytest.mark.xfail(reason="Coordinate issue affecting kerchunk and HDF reader.")


Can you point to the relevant issue number here?

sharkinsspatial added 4 commits April 19, 2024 13:31

Generate chunk manifest backed variable from HDF5 dataset.

6b7abe2

Transfer dataset attrs to variable.

bca0aab

Get virtual variables dict from HDF5 file.

384ff6b

Update virtual_vars_from_hdf to use fsspec and drop_variables arg.

4c5f9bd

sharkinsspatial marked this pull request as draft April 22, 2024 18:37

sharkinsspatial added 3 commits April 22, 2024 13:02

mypy fix to use ChunkKey and empty dimensions list.

1dd3370

Extract attributes from hdf5 root group.

d92c75c

Use hdf reader for netcdf4 files.

0ed8362

TomNicholas reviewed Apr 22, 2024

TomNicholas added the "enhancement" (New feature or request) and "references generation" (Reading byte ranges from archival files) labels Apr 22, 2024

[pre-commit.ci] auto fixes from pre-commit.com hooks

f4485fa

for more information, see https://pre-commit.ci

sharkinsspatial mentioned this pull request Apr 23, 2024

How to handle encoding #68

Open

sharkinsspatial added 4 commits May 8, 2024 17:53

Merge branch 'main' into hdf5_reader

3cc1254

Fix ruff complaints.

0123df7

First steps for handling HDF5 filters.

332bcaa

Initial step for hdf5plugin supported codecs.

c51e615

TomNicholas mentioned this pull request May 14, 2024

open_virtual_dataset with dmr++ #113

Merged

6 tasks

sharkinsspatial and others added 10 commits May 16, 2024 16:24

Small commit to check compression support in CI environment.

0083f77

Merge branch 'main' into hdf5_reader

3c00071

[pre-commit.ci] auto fixes from pre-commit.com hooks

207c4b5

for more information, see https://pre-commit.ci

Fix mypy complaints for hdf_filters.

c573800

Merge branch 'hdf5_reader' of https://github.com/TomNicholas/Virtuali…

ef0d7a8

…Zarr into hdf5_reader

Local pre-commit fix for hdf_filters.

588e06b

Use fsspec reader_options introduced in #37.

725333e

Fix incorrect zarr_v3 if block position from merge commit ef0d7a8.

72df108

Fix early return from hdf _extract_attrs.

d1e85cb

Test that _extract_attrs correctly handles multiple attributes.

1e2b343

sharkinsspatial had a problem deploying to test-release October 24, 2024 01:54 — with GitHub Actions Failure

Add test_hdf_integration tests to be skipped for non-kerchunk env.

150d06d

sharkinsspatial had a problem deploying to test-release October 24, 2024 01:58 — with GitHub Actions Failure

Include imagecodecs in dependencies.

8ccba34

sharkinsspatial had a problem deploying to test-release October 24, 2024 02:02 — with GitHub Actions Failure

sharkinsspatial had a problem deploying to test-release October 24, 2024 02:13 — with GitHub Actions Failure

Diagnose imagecodecs-numcodecs installation failures in CI.

81874e0

sharkinsspatial force-pushed the hdf5_reader branch from 127b3d6 to 81874e0 Compare October 24, 2024 02:15

sharkinsspatial temporarily deployed to test-release October 24, 2024 02:15 — with GitHub Actions Inactive

sharkinsspatial added 3 commits October 24, 2024 10:59

Ignore mypy complaints for VirtualBackend.

f87abe2

Remove checksum assert which varies across different zstd versions.

70e7e29

Temporarily xfail integration tests with coordinate inconsistency.

43bc0e4

sharkinsspatial temporarily deployed to test-release October 24, 2024 15:22 — with GitHub Actions Inactive

Remove backend arg for non-hdf network file tests.

82a6321

sharkinsspatial temporarily deployed to test-release October 24, 2024 16:36 — with GitHub Actions Inactive

Fix mypy comment moved by ruff formatting.

b34f260

sharkinsspatial temporarily deployed to test-release October 24, 2024 16:41 — with GitHub Actions Inactive

sharkinsspatial marked this pull request as ready for review October 24, 2024 17:12

TomNicholas reviewed Oct 24, 2024

sharkinsspatial had a problem deploying to test-release October 25, 2024 14:49 — with GitHub Actions Failure

Make HDR reader dependencies optional.

f9ead06

sharkinsspatial force-pushed the hdf5_reader branch from 30a83f1 to f9ead06 Compare October 25, 2024 14:52

sharkinsspatial had a problem deploying to test-release October 25, 2024 14:53 — with GitHub Actions Failure

sharkinsspatial had a problem deploying to test-release October 25, 2024 18:33 — with GitHub Actions Failure

Handle optional imagecodecs and hdf5plugin dependency imports for tests.

5608292

sharkinsspatial force-pushed the hdf5_reader branch from 9dfc3db to 5608292 Compare October 25, 2024 18:38

sharkinsspatial had a problem deploying to test-release October 25, 2024 18:39 — with GitHub Actions Failure

sharkinsspatial mentioned this pull request Oct 29, 2024

Inlined CF time variables fail round tripping when compressed. #280

Open

TomNicholas reviewed Nov 8, 2024
