Dependency Issue for Kerchunk -> Icechunk via Virtualizarr #321

Open
dwest77a opened this issue Nov 29, 2024 · 3 comments
Labels
Kerchunk Relating to the kerchunk library / specification itself upstream issue

Comments

@dwest77a

Hi all, I'm relatively new to using virtualizarr but have been developing tools using Kerchunk for some time, specifically a package around large-scale conversions in parallel for thousands of datasets in our data archives.

I'm attempting to use the VirtualiZarr library to concatenate some NetCDF4 data into a virtual dataset, then write it out to disk as an Icechunk store. My issue is that Icechunk seems to require the new Zarr v3 pre-release, but Kerchunk (used to create the virtual dataset) needs Zarr < 3. I've so far been unable to resolve this dependency conflict. Any suggestions for how to go about solving this would be appreciated, thanks!

My example code:

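# `files` is a list of paths to the source NetCDF4 files, defined elsewhere.
import xarray as xr
from icechunk import IcechunkStore, StorageConfig, StoreConfig, VirtualRefConfig
from virtualizarr import open_virtual_dataset

# Open each file as a virtual dataset without building in-memory indexes.
vds = [open_virtual_dataset(f, indexes={}) for f in files]

# Concatenate the virtual datasets along the time dimension.
combined_vds = xr.concat(vds, dim='time', coords='minimal', compat='override')

# Create a local Icechunk store to hold the virtual references.
storage = StorageConfig.filesystem('combined')
store = IcechunkStore.create(storage=storage, mode="w", config=StoreConfig(
    virtual_ref_config=VirtualRefConfig.s3_anonymous(region='us-east-1'),
))

# Write the combined virtual dataset into the store.
combined_vds.virtualize.to_icechunk(store)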

I either get an issue importing kerchunk (if I uninstall that to reinstall the zarr v3 pre-release) or an issue with zarr when trying to create the Icechunk store.
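
For anyone debugging the same conflict, a quick way to see which versions actually ended up in an environment is to query the installed package metadata. This is only a diagnostic sketch using the standard library; the helper name `installed_versions` is hypothetical, and the package names listed are the PyPI distribution names as I understand them.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    names = ["zarr", "kerchunk", "icechunk", "virtualizarr"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

Reading the metadata rather than importing each package means the check still works when the import itself is what fails.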

@maxrjones
Member

Matthew Iannucci has been helpfully tracking the compatibility between Icechunk, VirtualiZarr, and Kerchunk in earth-mover/icechunk#197. The tl;dr is that icechunk requires Zarr-Python v3, so until fsspec/kerchunk#516 is completed/merged to add Zarr-Python v3 compatibility in Kerchunk, one would need to work off Matt's fork and branch to use these libraries in the same environment.

@TomNicholas TomNicholas added Kerchunk Relating to the kerchunk library / specification itself upstream issue labels Nov 29, 2024
@TomNicholas
Member

@maxrjones is correct that we're waiting on kerchunk to support zarr-python v3. As icechunk currently requires zarr-python v3, the released versions of kerchunk and icechunk are incompatible. But as you're working with netCDF4 data, you actually have a few different options:

  1. As Max mentioned, use the fork of kerchunk that Matt is maintaining for now (but some parts of kerchunk might not work on that branch yet). This is recommended for now if you want to try putting virtual datasets in Icechunk.
  2. Try using @sharkinsspatial's non-kerchunk alternative HDF reader (Non-kerchunk backend for HDF5/netcdf4 files. #87), which should be able to read netCDF4 files. This doesn't require kerchunk to be installed, so I think it should work with zarr-python v3 (?). It is very new and experimental though. You have to explicitly opt in to using it via
    from virtualizarr.readers.hdf import HDFVirtualBackend
    
    vds = open_virtual_dataset('file.nc', backend=HDFVirtualBackend)
  3. Have 2 environments, one with kerchunk and zarr-python v2, and one with icechunk and zarr-python v3. You use the first to open and concatenate the files, then save them to disk using vds.virtualize.to_kerchunk('refs.json'). Then in the second environment you use open_virtual_dataset('refs.json', filetype='kerchunk'), then save that into icechunk. It's janky but it should work, because open_virtual_dataset('refs.json', filetype='kerchunk') doesn't actually require the kerchunk package as a dependency.
  4. Just wait a few weeks and hopefully kerchunk will be updated, then you should be able to just use the released versions of everything together.
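
Option 3 could be sketched roughly as follows, with one function per environment. The function names `write_kerchunk_refs` and `refs_to_icechunk` are hypothetical, `format="json"` is an assumption about what `to_kerchunk` needs in order to write a file to the given path, and `store` is assumed to be an Icechunk store created as in the original post. The imports sit inside the functions so that each environment only needs its own dependencies.

```python
def write_kerchunk_refs(files, refs_path="refs.json"):
    """Environment 1 (kerchunk + zarr-python v2): combine files, save references."""
    import xarray as xr
    from virtualizarr import open_virtual_dataset

    vds = [open_virtual_dataset(f, indexes={}) for f in files]
    combined = xr.concat(vds, dim="time", coords="minimal", compat="override")
    # Assumption: format="json" makes to_kerchunk write the references to refs_path.
    combined.virtualize.to_kerchunk(refs_path, format="json")
    return refs_path


def refs_to_icechunk(refs_path, store):
    """Environment 2 (icechunk + zarr-python v3): load references, write to the store."""
    from virtualizarr import open_virtual_dataset

    # Reading kerchunk-format references should not need the kerchunk package.
    vds = open_virtual_dataset(refs_path, filetype="kerchunk", indexes={})
    vds.virtualize.to_icechunk(store)
```

The only thing passed between the two environments is the reference file on disk, which is why the conflicting zarr versions never have to coexist.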

@dwest77a
Author

dwest77a commented Dec 2, 2024

Hi both, thanks very much for the suggestions! I will attempt both the first and second solutions in the next few days; I'm just trying to get a picture of how it all works for now. Solution 3 suggested above has an additional issue: it can't be used where some variables/dimensions are written inline in the kerchunk file (i.e. base64-encoded), which is listed as a ToDo in the error message. Most of the kerchunk files we've produced have inline components for some dimensions, as this is more performant than having to make many very small requests to compose a dimension.
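
To check in advance whether a given kerchunk file would hit this limitation, something like the sketch below could list the chunk keys that are stored inline. It assumes the usual kerchunk JSON layout, where a chunk reference is a list such as `[url, offset, length]` and inlined data is a plain string (optionally prefixed with `base64:`); the helper name `find_inlined_chunks` is hypothetical.

```python
def find_inlined_chunks(data):
    """Return the sorted chunk keys whose data is stored inline in kerchunk refs."""
    # Version 1 files nest references under "refs"; older files are a flat mapping.
    refs = data.get("refs", data)
    inlined = []
    for key, value in refs.items():
        # Zarr metadata entries (.zgroup, .zarray, .zattrs) are always strings, so skip them.
        if key.rsplit("/", 1)[-1].startswith(".z"):
            continue
        # A string value for a chunk key means the bytes are inlined in the file.
        if isinstance(value, str):
            inlined.append(key)
    return sorted(inlined)
```

The function takes the already-parsed JSON, so a file can be checked with `find_inlined_chunks(json.load(open("refs.json")))`; an empty list suggests the file contains only external references.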

Thanks again for the suggestions!
