
Zarr reader #271

Open · wants to merge 70 commits into base: main

Conversation

norlandrhagen (Collaborator) commented Oct 24, 2024

WIP PR to add a Zarr reader. Thanks to @TomNicholas for the guide on how to write a reader.

  • Closes Add Zarr Reader(s) #262
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functionality has documentation

Scope of this PR:

  • Read v2 and v3 Zarr

Future PR(s):

  • Add in existing Zarr v3 Chunk Manifest
  • Sharded v3 data
  • Optimizations (e.g. using the async interface to list the lengths of chunks for each variable concurrently)

To Do (a rough sketch of how these steps might fit together in code follows this list):

  • Open the store using zarr-python v3 (behind a protected import). This should handle both v2 and v3 stores for us.
  • Use zarr-python to list the variables in the store, and check that all loadable_variables are present
    For each virtual variable:
  • Use zarr-python to get the attributes, dimension names, and coordinate names (which come from the .zmetadata or zarr.json)
  • Use zarr-python to also get the dtype and chunk grid info + everything else needed to create the virtualizarr.zarr.ZArray object (eventually we can skip this step and use a zarr-python array metadata class directly instead of virtualizarr.zarr.ZArray, see
    Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175)
  • Use the knowledge of the store location, variable name, and the zarr format to deduce which directory / S3 prefix the chunks must live in.
  • List all the chunks in that directory using fsspec.ls(detail=True), as that should also return the nbytes of each chunk. Remember that chunks are allowed to be missing.
  • The offset of each chunk is just 0 (ignoring sharding for now), and the length is the file size fsspec returned. The paths are just all the paths fsspec listed.
  • Parse the path and length information returned by fsspec into the structure that we can pass to ChunkManifest.__init__
  • Create a ManifestArray from our ChunkManifest and ZArray
  • Wrap that ManifestArray in an xarray.Variable, using the dims and attrs we read before
  • Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).
  • Use separate_coords to set the correct variables as coordinate variables (and avoid building indexes whilst doing it)
  • Merge all the variables into one xr.Dataset and return it.
  • All the above should be wrapped in a virtualizarr.readers.zarr.open_virtual_dataset function, which then should be called as a method from a ZarrVirtualBackend(VirtualBackend) subclass.
  • Finally, add that ZarrVirtualBackend to the list of readers in virtualizarr/backend.py
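For illustration only, here is a rough sketch of how those steps might fit together. This is not the PR's implementation: the fsspec-based listing, the key filtering, and the exact ZArray / ChunkManifest / ManifestArray call signatures are assumptions, and coordinate handling via separate_coords is elided.

```python
# Hypothetical sketch of the reader flow described above, not the code in this PR.
import fsspec
import xarray as xr
import zarr  # zarr-python v3 (behind a protected import in the real reader)

from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.zarr import ZArray


def sketch_open_virtual_dataset(store_url: str) -> xr.Dataset:
    zg = zarr.open_group(store_url, mode="r")
    fs = fsspec.filesystem("s3")  # assuming an S3-backed store for this sketch

    virtual_vars = {}
    for name in zg.array_keys():
        arr = zg[name]
        # v2 keeps dims in the _ARRAY_DIMENSIONS attribute; v3 keeps them in the array metadata
        dims = arr.attrs.get("_ARRAY_DIMENSIONS") or arr.metadata.dimension_names
        zarray = ZArray(
            shape=arr.shape,
            chunks=arr.chunks,
            dtype=arr.dtype,
            fill_value=arr.fill_value,
            zarr_format=2,  # would be read from the store metadata in practice
        )

        # list everything under the array prefix; fsspec reports the size of each object
        entries = {}
        for info in fs.ls(f"{store_url}/{name}", detail=True):
            key = info["name"].rsplit(f"{name}/", maxsplit=1)[-1]
            if key.startswith((".z", "zarr.json")):
                continue  # metadata objects, not chunks; missing chunks simply never appear
            # v3 keys like "c/0/0" would need converting to manifest keys like "0.0"
            # offset is always 0 (no sharding), length is the object size fsspec reported
            entries[key] = {"path": info["name"], "offset": 0, "length": info["size"]}

        manifest = ChunkManifest(entries=entries)
        marr = ManifestArray(zarray=zarray, chunkmanifest=manifest)
        virtual_vars[name] = xr.Variable(dims=dims, data=marr, attrs=arr.attrs.asdict())

    # loadable variables come straight from xr.open_zarr, dropping the virtual ones
    loadable = xr.open_zarr(store_url, drop_variables=list(virtual_vars))

    # (the real reader would also run separate_coords here to mark coordinate variables)
    return xr.Dataset({**loadable.variables, **virtual_vars}, attrs=zg.attrs.asdict())
```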

norlandrhagen (Collaborator, Author)

#273

norlandrhagen (Collaborator, Author)

Bit of an update. With help from @sharkinsspatial, @abarciauskas-bgse, and @maxrjones, I got a Zarr loaded as a virtual dataset.

<xarray.Dataset> Size: 3kB
Dimensions:  (time: 10, lat: 9, lon: 18)
Coordinates:
    lat      (lat) float32 36B ManifestArray<shape=(9,), dtype=float32, chunk...
    lon      (lon) float32 72B ManifestArray<shape=(18,), dtype=float32, chun...
  * time     (time) datetime64[ns] 80B 2013-01-01 ... 2013-01-03T06:00:00
Data variables:
    air      (time, lat, lon) int16 3kB ManifestArray<shape=(10, 9, 18), dtyp...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Next up is how to deal with fill_values.

When I try to write it to Kerchunk JSON, I’m running into some fill_value dtype issues in the ZArray.

ZArray(shape=(10,), chunks=(10,), dtype='<f4', fill_value=np.float32(nan), order='C', compressor=None, filters=None, zarr_format=2)

Where fill_value=np.float32(nan). When I try to write these to JSON via ds.virtualize.to_kerchunk(format="dict"), I get TypeError: np.float32(nan) is not JSON serializable.

Wondering how fill_values like np.float32(nan) should be handled.

There seems to be some conversion logic in @sharkinsspatial's HDF reader for converting fill_values. It also looks like there is some fill_value handling in zarr.py.
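For what it's worth, one possible shape of a fix (purely a sketch, not code from this PR) is to coerce numpy scalars to JSON-safe Python values before serializing; the zarr v2 spec encodes a NaN fill_value as the string "NaN":

```python
import numpy as np


def json_safe_fill_value(fill_value):
    """Hypothetical helper: make a numpy scalar fill_value acceptable to json.dumps."""
    if isinstance(fill_value, np.floating) and np.isnan(fill_value):
        return "NaN"  # how zarr v2 metadata represents a NaN fill value
    if isinstance(fill_value, np.generic):
        return fill_value.item()  # e.g. np.float32(0.5) -> 0.5, np.int16(3) -> 3
    return fill_value


json_safe_fill_value(np.float32("nan"))  # "NaN"
json_safe_fill_value(np.float32(0.5))    # 0.5
```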

TomNicholas (Member)

I got a Zarr loaded as a virtual dataset.

Amazing!

When I try to write it to Kerchunk JSON, I’m running into some fill_value dtype issues in the ZArray.

ZArray(shape=(10,), chunks=(10,), dtype='<f4', fill_value=np.float32(nan), order='C', compressor=None, filters=None, zarr_format=2)

Where fill_value=np.float32(nan). When I try to write these to JSON via ds.virtualize.to_kerchunk(format="dict"), I get TypeError: np.float32(nan) is not JSON serializable.

Wondering how fill_values like np.float32(nan) should be handled.

This seems like an issue that should actually be orthogonal to this PR (if it weren't for the ever-present difficulty of testing). Either the problem is in the ZArray class and what types it allows, or it's in the Kerchunk writer not knowing how to serialize a valid ZArray. Either way if np.float32(nan) is a valid fill_value for a zarr array then it's not the fault of the new zarr reader.

TomNicholas (Member) left a comment

This is a great start! I think the main thing here is that we don't actually need kerchunk in order to test this reader.

Comment on lines 93 to 99
# we should parameterize for:
# - group
# - drop_variables
# - loadable_variables
# - store readable over cloud storage?
# - zarr store version v2, v3
# testing out pytest parameterization with dataclasses :shrug: -- we can revert to a more normal style

TomNicholas (Member):

I think it's great to test all these cases, but they don't need to be simultaneously parametrized over, because we don't need to test the matrix of all possible combinations of these things.

TomNicholas (Member):

I think parametrizing over v2 vs v3 would be good though, as every feature should be tested for both versions.
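Something along these lines, perhaps (the store-writing helper is invented for illustration; the other cases can get their own focused tests instead of a combinatorial matrix):

```python
import pytest

from virtualizarr import open_virtual_dataset


@pytest.mark.parametrize("zarr_format", [2, 3])
def test_open_virtual_dataset(tmp_path, zarr_format):
    # 'write_test_store' is a hypothetical helper that writes a small
    # store in the requested zarr format under tmp_path.
    store = write_test_store(tmp_path / "test.zarr", zarr_format=zarr_format)
    vds = open_virtual_dataset(str(store), indexes={})
    assert "air" in vds.data_vars
```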

Comment on lines 148 to 150
# Do we have a good way in XRT to compare virtual datasets to xarray datasets? assert_duckarray_allclose? or just roundtrip it.
# from xarray.testing import assert_duckarray_allclose
# xrt.assert_allclose(ds, vds)

TomNicholas (Member):

Before adding to test_integration.py I would first create a tests/test_readers.py/test_zarr.py and put tests in there. Those tests should open a virtual dataset from a zarr store and assert things about the contents of the ManifestArrays, checking they match what you would expect based on the contents of the store. That's important because it's a kerchunk-free way to do useful testing.
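Roughly this shape of test, for example (the fixture name, variable name, dtype, and manifest accessor are placeholders/assumptions, not the PR's actual tests):

```python
from virtualizarr import open_virtual_dataset
from virtualizarr.manifests import ManifestArray


def test_chunk_manifest_matches_store(example_zarr_store):
    # example_zarr_store is a placeholder fixture pointing at a known test store
    vds = open_virtual_dataset(str(example_zarr_store), indexes={})
    marr = vds["air"].data
    assert isinstance(marr, ManifestArray)
    assert marr.zarray.dtype == "int16"
    # every chunk should point into the store with offset 0 and a nonzero length
    entries = marr.manifest.dict()  # assuming the manifest exposes its entries as a dict
    assert all(e["offset"] == 0 and e["length"] > 0 for e in entries.values())
```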

coord_names = list(set(all_array_dims))

# 4 Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).
# We want to drop 'drop_variables' but also virtual variables since we already **manifested** them.

TomNicholas (Member):

groan

loadable_vars=loadable_vars,
indexes=indexes,
coord_names=coord_names,
attrs=zg.attrs.asdict(),

TomNicholas (Member):

you remembered the group-level attributes, nice

TomNicholas (Member)

get chunk size with zarr-python (zarr-developers/zarr-python#2426) instead of fsspec

I think we should just do this in this PR. We can point to Tom's PR for now in the CI, but I expect that will get merged before this does anyway. If you look at Tom's implementation it's basically what we're doing here.

TomNicholas (Member)

Store.getsize was just merged, so we can possibly just use the same upstream zarr-python env we are already using: zarr-developers/zarr-python#2426
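An untested sketch of what that could look like (assuming the async Store.list_prefix / Store.getsize API from zarr-developers/zarr-python#2426; key handling and the store_url path joining are assumptions):

```python
async def chunk_entries_for_array(store, store_url: str, array_name: str) -> dict:
    """Sketch: size chunks with zarr-python's Store.getsize instead of fsspec.ls."""
    entries = {}
    async for key in store.list_prefix(f"{array_name}/"):
        if key.endswith((".zarray", ".zattrs", "zarr.json")):
            continue  # metadata objects, not chunks
        nbytes = await store.getsize(key)
        entries[key.removeprefix(f"{array_name}/")] = {
            "path": f"{store_url}/{key}",
            "offset": 0,
            "length": nbytes,
        }
    return entries
```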

if not item.endswith(
    (".zarray", ".zattrs", ".zgroup", ".zmetadata")
) and item.startswith(array_name):
    # dict key is created by splitting the value from store.list() by the array_name and trailing /... yuck..

TomNicholas (Member) commented Nov 15, 2024

If we had a way to ask zarr if a key was backed by a chunk (as opposed to defaulting to the fill_value) then we wouldn't need to do this

norlandrhagen (Collaborator, Author):

That would be waaay better!

norlandrhagen (Collaborator, Author) commented Dec 19, 2024

This V3 zarr store works:

from virtualizarr import open_virtual_dataset

filepath = 's3://carbonplan-share/air_temp.zarr'
vds = open_virtual_dataset(filepath, indexes={})
vds

ToDo:

  • Build ManifestArrays with the from_arrays method instead of building up a dictionary first.
  • Debug Zarr v2, which broke in the mini-async refactor.
  • Test slightly more complex Zarr stores.

Note: v2 is broken because of this hardcoded prefix "c"; a possible fix is sketched below.
chunk_map = await get_chunk_mapping_prefix(zarr_array, prefix=f"{array_name}/c")
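One possible fix, sketched against the get_chunk_mapping_prefix helper from the line above (assuming zarr-python exposes the format via the array metadata): v3 chunk keys live under "<array>/c/...", while v2 keys sit directly under the array name.

```python
# Sketch only: choose the chunk-key prefix from the store format instead of hardcoding "c"
if zarr_array.metadata.zarr_format == 3:
    prefix = f"{array_name}/c"  # v3: <array>/c/0/0/...
else:
    prefix = array_name         # v2: <array>/0.0...
chunk_map = await get_chunk_mapping_prefix(zarr_array, prefix=prefix)
```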

TomNicholas (Member) commented Dec 19, 2024

Nice @norlandrhagen!

Build manifests from arrays

What does this mean? EDIT: Is it referring to using zarr-developers/zarr-python#2426 instead of using fs.ls()?

norlandrhagen (Collaborator, Author)

What does this mean? EDIT: Is it referring to using zarr-developers/zarr-python#2426 instead of using fs.ls()?

Ah, I was referring to building the ManifestArrays with the from_arrays method instead of building up a dictionary first. Also, no more fs.ls! zarr-python is giving us all the chunk info!
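For reference, a rough illustration of that pattern (assuming from_arrays lives on ChunkManifest and takes parallel numpy arrays shaped like the chunk grid; the paths, lengths, and dtypes here are made up):

```python
import numpy as np

from virtualizarr.manifests import ChunkManifest

# one entry per chunk in the grid, e.g. a 1-D variable split into two chunks
paths = np.array(["s3://bucket/store.zarr/air/c/0", "s3://bucket/store.zarr/air/c/1"])
offsets = np.zeros(2, dtype=np.uint64)
lengths = np.array([812, 812], dtype=np.uint64)

manifest = ChunkManifest.from_arrays(paths=paths, offsets=offsets, lengths=lengths)
```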

Labels: readers, zarr-python (Relevant to zarr-python upstream)