
Zarr reader #271

Open · wants to merge 70 commits into base: main

Conversation

norlandrhagen (Collaborator) commented Oct 24, 2024

WIP PR to add a Zarr reader. Thanks to @TomNicholas for the guide on how to write a reader.

  • Closes Add Zarr Reader(s) #262
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functionality has documentation

Scope of this PR:

  • Read v2 and v3 Zarr

Future PR(s):

  • Add in existing Zarr v3 Chunk Manifest
  • Sharded v3 data
  • Optimizations (e.g. using the async interface to list the lengths of chunks for each variable concurrently)

To Do (a rough sketch of how these steps might fit together in code follows this list):

  • Open the store using zarr-python v3 (behind a protected import). This should handle both v2 and v3 stores for us.
  • Use zarr-python to list the variables in the store, and check that all loadable_variables are present
    For each virtual variable:
  • Use zarr-python to get the attributes, dimension names, and coordinate names (which come from the .zmetadata or zarr.json)
  • Use zarr-python to also get the dtype and chunk grid info + everything else needed to create the virtualizarr.zarr.ZArray object (eventually we can skip this step and use a zarr-python array metadata class directly instead of virtualizarr.zarr.ZArray, see
    Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175)
  • Use the knowledge of the store location, variable name, and the zarr format to deduce which directory / S3 prefix the chunks must live in.
  • List all the chunks in that directory using fsspec.ls(detail=True), as that should also return the nbytes of each chunk. Remember that chunks are allowed to be missing.
  • The offset of each chunk is just 0 (ignoring sharding for now), and the length is the file size fsspec returned. The paths are just all the paths fsspec listed.
  • Parse the path and length information returned by fsspec into the structure that we can pass to ChunkManifest.__init__
  • Create a ManifestArray from our ChunkManifest and ZArray
  • Wrap that ManifestArray in an xarray.Variable, using the dims and attrs we read before
  • Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).
  • Use separate_coords to set the correct variables as coordinate variables (and avoid building indexes whilst doing it)
  • Merge all the variables into one xr.Dataset and return it.
  • All the above should be wrapped in a virtualizarr.readers.zarr.open_virtual_dataset function, which then should be called as a method from a ZarrVirtualBackend(VirtualBackend) subclass.
  • Finally, add that ZarrVirtualBackend to the list of readers in virtualizarr/backend.py
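For illustration only, here is a rough sketch of how those steps might fit together. This is not the PR's implementation: the fsspec-based listing, the key filtering, and the exact ZArray / ChunkManifest / ManifestArray call signatures are assumptions, and coordinate handling via separate_coords is elided.

```python
# Hypothetical sketch of the reader flow described above, not the code in this PR.
import fsspec
import xarray as xr
import zarr  # zarr-python v3 (behind a protected import in the real reader)

from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.zarr import ZArray


def sketch_open_virtual_dataset(store_url: str) -> xr.Dataset:
    zg = zarr.open_group(store_url, mode="r")
    fs = fsspec.filesystem("s3")  # assuming an S3-backed store for this sketch

    virtual_vars = {}
    for name in zg.array_keys():
        arr = zg[name]
        # v2 keeps dims in the _ARRAY_DIMENSIONS attribute; v3 keeps them in the array metadata
        dims = arr.attrs.get("_ARRAY_DIMENSIONS") or arr.metadata.dimension_names
        zarray = ZArray(
            shape=arr.shape,
            chunks=arr.chunks,
            dtype=arr.dtype,
            fill_value=arr.fill_value,
            zarr_format=2,  # would be read from the store metadata in practice
        )

        # list everything under the array prefix; fsspec reports the size of each object
        entries = {}
        for info in fs.ls(f"{store_url}/{name}", detail=True):
            key = info["name"].rsplit(f"{name}/", maxsplit=1)[-1]
            if key.startswith((".z", "zarr.json")):
                continue  # metadata objects, not chunks; missing chunks simply never appear
            # v3 keys like "c/0/0" would need converting to manifest keys like "0.0"
            # offset is always 0 (no sharding), length is the object size fsspec reported
            entries[key] = {"path": info["name"], "offset": 0, "length": info["size"]}

        manifest = ChunkManifest(entries=entries)
        marr = ManifestArray(zarray=zarray, chunkmanifest=manifest)
        virtual_vars[name] = xr.Variable(dims=dims, data=marr, attrs=arr.attrs.asdict())

    # loadable variables come straight from xr.open_zarr, dropping the virtual ones
    loadable = xr.open_zarr(store_url, drop_variables=list(virtual_vars))

    # (the real reader would also run separate_coords here to mark coordinate variables)
    return xr.Dataset({**loadable.variables, **virtual_vars}, attrs=zg.attrs.asdict())
```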

norlandrhagen (Collaborator, Author)

#273

norlandrhagen (Collaborator, Author)

Bit of an update. With help from @sharkinsspatial, @abarciauskas-bgse, and @maxrjones, I got a Zarr loaded as a virtual dataset.

<xarray.Dataset> Size: 3kB
Dimensions:  (time: 10, lat: 9, lon: 18)
Coordinates:
    lat      (lat) float32 36B ManifestArray<shape=(9,), dtype=float32, chunk...
    lon      (lon) float32 72B ManifestArray<shape=(18,), dtype=float32, chun...
  * time     (time) datetime64[ns] 80B 2013-01-01 ... 2013-01-03T06:00:00
Data variables:
    air      (time, lat, lon) int16 3kB ManifestArray<shape=(10, 9, 18), dtyp...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Next up is how to deal with fill_values.

When I try to write it to Kerchunk JSON, I’m running into some fill_value dtype issues in the ZArray.

ZArray(shape=(10,), chunks=(10,), dtype='<f4', fill_value=np.float32(nan), order='C', compressor=None, filters=None, zarr_format=2)

Where fill_value=np.float32(nan). When I try to write these to JSON via ds.virtualize.to_kerchunk(format="dict"), I get TypeError: np.float32(nan) is not JSON serializable.

Wondering how fill_values like np.float32(nan) should be handled.

There seems to be some conversion logic in @sharkinsspatial's HDF reader for converting fill_values. It also looks like there is some fill_value handling in zarr.py.
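For what it's worth, one possible shape of a fix (purely a sketch, not code from this PR) is to coerce numpy scalars to JSON-safe Python values before serializing; the zarr v2 spec encodes a NaN fill_value as the string "NaN":

```python
import numpy as np


def json_safe_fill_value(fill_value):
    """Hypothetical helper: make a numpy scalar fill_value acceptable to json.dumps."""
    if isinstance(fill_value, np.floating) and np.isnan(fill_value):
        return "NaN"  # how zarr v2 metadata represents a NaN fill value
    if isinstance(fill_value, np.generic):
        return fill_value.item()  # e.g. np.float32(0.5) -> 0.5, np.int16(3) -> 3
    return fill_value


json_safe_fill_value(np.float32("nan"))  # "NaN"
json_safe_fill_value(np.float32(0.5))    # 0.5
```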

TomNicholas (Member)

I got a Zarr loaded as a virtual dataset.

Amazing!

When I try to write it to Kerchunk JSON, I’m running into some fill_value dtype issues in the ZArray.

ZArray(shape=(10,), chunks=(10,), dtype='<f4', fill_value=np.float32(nan), order='C', compressor=None, filters=None, zarr_format=2)

Where fill_value=np.float32(nan). When I try to write these to JSON via ds.virtualize.to_kerchunk(format="dict"), I get TypeError: np.float32(nan) is not JSON serializable.

Wondering how fill_values like np.float32(nan) should be handled.

This seems like an issue that should actually be orthogonal to this PR (if it weren't for the ever-present difficulty of testing). Either the problem is in the ZArray class and what types it allows, or it's in the Kerchunk writer not knowing how to serialize a valid ZArray. Either way if np.float32(nan) is a valid fill_value for a zarr array then it's not the fault of the new zarr reader.

TomNicholas (Member) left a comment

This is a great start! I think the main thing here is that we don't actually need kerchunk in order to test this reader.

Comment on lines 93 to 99
# we should parameterize for:
# - group
# - drop_variables
# - loadable_variables
# - store readable over cloud storage?
# - zarr store version v2, v3
# testing out pytest parameterization with dataclasses :shrug: -- we can revert to a more normal style

TomNicholas (Member):

I think it's great to test all these cases, but they don't need to be simultaneously parametrized over, because we don't need to test the matrix of all possible combinations of these things.

TomNicholas (Member):

I think parametrizing over v2 vs v3 would be good though, as every feature should be tested for both versions.
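Something along these lines, perhaps (the store-writing helper is invented for illustration; the other cases can get their own focused tests instead of a combinatorial matrix):

```python
import pytest

from virtualizarr import open_virtual_dataset


@pytest.mark.parametrize("zarr_format", [2, 3])
def test_open_virtual_dataset(tmp_path, zarr_format):
    # 'write_test_store' is a hypothetical helper that writes a small
    # store in the requested zarr format under tmp_path.
    store = write_test_store(tmp_path / "test.zarr", zarr_format=zarr_format)
    vds = open_virtual_dataset(str(store), indexes={})
    assert "air" in vds.data_vars
```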

Comment on lines 148 to 150
# Do we have a good way in XRT to compare virtual datasets to xarray datasets? assert_duckarray_allclose? or just roundtrip it.
# from xarray.testing import assert_duckarray_allclose
# xrt.assert_allclose(ds, vds)

TomNicholas (Member):

Before adding to test_integration.py I would first create a tests/test_readers.py/test_zarr.py and put tests in there. Those tests should open a virtual dataset from a zarr store and assert things about the contents of the ManifestArrays, checking they match what you would expect based on the contents of the store. That's important because it's a kerchunk-free way to do useful testing.
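Roughly this shape of test, for example (the fixture name, variable name, dtype, and manifest accessor are placeholders/assumptions, not the PR's actual tests):

```python
from virtualizarr import open_virtual_dataset
from virtualizarr.manifests import ManifestArray


def test_chunk_manifest_matches_store(example_zarr_store):
    # example_zarr_store is a placeholder fixture pointing at a known test store
    vds = open_virtual_dataset(str(example_zarr_store), indexes={})
    marr = vds["air"].data
    assert isinstance(marr, ManifestArray)
    assert marr.zarray.dtype == "int16"
    # every chunk should point into the store with offset 0 and a nonzero length
    entries = marr.manifest.dict()  # assuming the manifest exposes its entries as a dict
    assert all(e["offset"] == 0 and e["length"] > 0 for e in entries.values())
```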

coord_names = list(set(all_array_dims))

# 4 Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).
# We want to drop 'drop_variables' but also virtual variables since we already **manifested** them.

TomNicholas (Member):

groan

loadable_vars=loadable_vars,
indexes=indexes,
coord_names=coord_names,
attrs=zg.attrs.asdict(),

TomNicholas (Member):

you remembered the group-level attributes, nice

TomNicholas (Member)

get chunk size with zarr-python (zarr-developers/zarr-python#2426) instead of fsspec

I think we should just do this in this PR. We can point to Tom's PR for now in the CI, but I expect that will get merged before this does anyway. If you look at Tom's implementation it's basically what we're doing here.

TomNicholas (Member)

Store.getsize was just merged, so we can possibly just use the same upstream zarr-python env we are already using: zarr-developers/zarr-python#2426
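An untested sketch of what that could look like (assuming the async Store.list_prefix / Store.getsize API from zarr-developers/zarr-python#2426; key handling and the store_url path joining are assumptions):

```python
async def chunk_entries_for_array(store, store_url: str, array_name: str) -> dict:
    """Sketch: size chunks with zarr-python's Store.getsize instead of fsspec.ls."""
    entries = {}
    async for key in store.list_prefix(f"{array_name}/"):
        if key.endswith((".zarray", ".zattrs", "zarr.json")):
            continue  # metadata objects, not chunks
        nbytes = await store.getsize(key)
        entries[key.removeprefix(f"{array_name}/")] = {
            "path": f"{store_url}/{key}",
            "offset": 0,
            "length": nbytes,
        }
    return entries
```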

if not item.endswith(
    (".zarray", ".zattrs", ".zgroup", ".zmetadata")
) and item.startswith(array_name):
    # dict key is created by splitting the value from store.list() by the array_name and trailing /... yuck..

TomNicholas (Member) commented Nov 15, 2024

If we had a way to ask zarr if a key was backed by a chunk (as opposed to defaulting to the fill_value) then we wouldn't need to do this

norlandrhagen (Collaborator, Author):

That would be waaay better!

norlandrhagen (Collaborator, Author) commented Dec 19, 2024

This V3 zarr store works:

from virtualizarr import open_virtual_dataset

filepath = 's3://carbonplan-share/air_temp.zarr'
vds = open_virtual_dataset(filepath, indexes={})
vds

ToDo:

  • Build ManifestArrays with the from_arrays method instead of building up a dictionary first.
  • Debug Zarr v2, which broke in the mini-async refactor.
  • Test slightly more complex Zarr stores.

Note: v2 is broken because of this hardcoded prefix "c"; a possible fix is sketched below.
chunk_map = await get_chunk_mapping_prefix(zarr_array, prefix=f"{array_name}/c")
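One possible fix, sketched against the get_chunk_mapping_prefix helper from the line above (assuming zarr-python exposes the format via the array metadata): v3 chunk keys live under "<array>/c/...", while v2 keys sit directly under the array name.

```python
# Sketch only: choose the chunk-key prefix from the store format instead of hardcoding "c"
if zarr_array.metadata.zarr_format == 3:
    prefix = f"{array_name}/c"  # v3: <array>/c/0/0/...
else:
    prefix = array_name         # v2: <array>/0.0...
chunk_map = await get_chunk_mapping_prefix(zarr_array, prefix=prefix)
```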

TomNicholas (Member) commented Dec 19, 2024

Nice @norlandrhagen!

Build manifests from arrays

What does this mean? EDIT: Is it referring to using zarr-developers/zarr-python#2426 instead of using fs.ls()?

norlandrhagen (Collaborator, Author)

What does this mean? EDIT: Is it referring to using zarr-developers/zarr-python#2426 instead of using fs.ls()?

Ah, I was referring to building the ManifestArrays with the from_arrays method instead of building up a dictionary first. Also, no more fs.ls! zarr-python is giving us all the chunk info!
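For reference, a rough illustration of that pattern (assuming from_arrays lives on ChunkManifest and takes parallel numpy arrays shaped like the chunk grid; the paths, lengths, and dtypes here are made up):

```python
import numpy as np

from virtualizarr.manifests import ChunkManifest

# one entry per chunk in the grid, e.g. a 1-D variable split into two chunks
paths = np.array(["s3://bucket/store.zarr/air/c/0", "s3://bucket/store.zarr/air/c/1"])
offsets = np.zeros(2, dtype=np.uint64)
lengths = np.array([812, 812], dtype=np.uint64)

manifest = ChunkManifest.from_arrays(paths=paths, offsets=offsets, lengths=lengths)
```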

Labels: readers, zarr-python (Relevant to zarr-python upstream)