Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading virtual references back out into VirtualiZarr Manifests #104

Open
TomNicholas opened this issue Sep 25, 2024 · 23 comments
Open

Reading virtual references back out into VirtualiZarr Manifests #104

TomNicholas opened this issue Sep 25, 2024 · 23 comments
Labels
virtual references 👻 Involves virtual kerchunk/virtualizarr chunk references

Comments

@TomNicholas
Copy link
Contributor

I would like to be able to read virtual references back out from an icechunk store into VirtualiZarr ManifestArray objects.

Note this issue is the icechunk equivalent of zarr-developers/VirtualiZarr#118, which is about reading kerchunk references into ManifestArray objects.

The main use case is appending new data to an existing store (see zarr-developers/VirtualiZarr#21 (comment)), so that when some new data arrives (e.g. a new grib file with today's weather data), I can add an updated snapshot just with something like:

import virtualizarr as vz

# avoids re-extracting all the metadata from all the past grib files, so should be quick
existing_vds = vz.open_virtual_dataset(icechunkstore, reader='icechunk')

new_vds = vz.open_virtual_dataset('todays_weather.grib', reader='grib')

updated_vds = xr.concat([existing_vds, new_vds], dim='time')

# commit new snapshot that includes today's data
# requires https://github.com/earth-mover/icechunk/issues/103
updated_vds.virtualize.to_icechunk(icechunkstore)
icechunkstore.commit('<todays-date>')

In order to implement that Icechunk reader for virtualizarr I would need some API for getting all virtual (and non-virtual) references for a snapshot back out of the Icechunk store, ideally as a vz.ManifestArray or something I can cheaply coerce to one (see ChunkManifest.from_arrays()).

Writing the updated references as a new snapshot also requires #103.

(I guess the .virtualize.to_icechunk method might also need to know to do array.resize in this example... (see the Append example in this notebook.)

Running that above snippet as a cron job / event-driven serverless function should go a long way towards making ingestion of regularly-updated data archives easier. (cc @mpiannucci)

This feature might also be useful to allow using icechunk as a serialization format during large serverless reductions (xref zarr-developers/VirtualiZarr#123).

cc @paraseba

@rabernat
Copy link
Contributor

rabernat commented Sep 25, 2024

In order to implement that Icechunk reader for virtualizarr I would need some API for getting all virtual (and non-virtual) references for a snapshot back out of the Icechunk store

I'm wondering if this is truly needed. Shouldn't it be possible to append new references without rewriting the manifest for the whole array?

Edit: as a first step towards virtualizarr interoperability, I agree that just doing a bulk rewrite is fine. But longer-term, I think more efficient strategies are possible. In general, I'd like libraries using Icechunk (like Virtualizarr) to only have to update / add new or changed references, rather than just rewriting the entire array. It's Icechunk's job to decide how to split this into one or more manifests. I'm curious what sort of API we would need to expose to enable this.

@TomNicholas
Copy link
Contributor Author

I'm wondering if this is truly needed. Shouldn't it be possible to append new references without rewriting the manifest for the whole array?

I was also wondering about this. Kerchunk does have a concept of appending to kerchunk parquet references. I think it comes down to whether or not the "appending" (which in virtualizarr space is a concatenation operation in the example above) happens in user/virtualizarr API or inside the icechunk store.

My example above assumes you want to do this in virtualizarr API. But if we instead try to copy the existing xarray ds.to_zarr API then it might look more like

new_vds.virtualize.to_icechunk(icechunkstore, append_dim='time')

If we had pluggable writer backends for xarray we could maybe even make this work

new_vds.to_zarr(icechunkstore, append_dim='time', engine='virtualizarr')

which better reflects the fact that in virtualizarr an xr.Dataset can actually contain both virtual ManifestArrays and non-virtual numpy arrays (which represent data to be "inlined" in kerchunk references).

At some point the concatenate/append operation should check that the dimensions are all consistent, but that could reasonably be done in either library.

Regardless I do still think that being able to extract all the references for one snapshot might be useful.

@abarciauskas-bgse
Copy link
Contributor

Should we add a note that it is not currently possible to append virtual refs to icechunk stores in https://icechunk.io/icechunk-python/virtual/ and point to this issue? cc @mpiannucci

@TomNicholas
Copy link
Contributor Author

As far as I can tell (and this is the first time I'm reading this code so please correct me if this is wrong), the append_dim in xarray's ds.to_zarr is basically syntactic sugar for resizing an array and setting new chunks. So to support appending virtual chunks then Icechunk doesn't have to do anything beyond what set_virtual_ref already supports - the work to be done is in VirtualiZarr to recreate the append_dim logic in vds.virtualize.to_icechunk.

https://github.com/pydata/xarray/blob/main/xarray%2Fbackends%2Fzarr.py#L857

@mpiannucci
Copy link
Contributor

@TomNicholas is correct. Appending in zarr is really just sugar around resizing an existing array and pointing it at existing chunks. An exmaple of appending to an icechunk store is at the bottom of this notebook.

@abarciauskas-bgse
Copy link
Contributor

Ok thanks for explaining how the append method is working @TomNicholas - but if I understand correctly there is no way to append virtual refs at the moment, so I would still see a benefit to others to make this clear in the documentation.

Is there any plans to work on this feature in VirtualiZarr @TomNicholas @mpiannucci ?

@TomNicholas
Copy link
Contributor Author

I would still see a benefit to others to make this clear in the documentation.

👍

Is there any plans to work on this feature in VirtualiZarr

I would very much like to see it, but don't personally need it (yet) so will be prioritizing other things. It's a good thing for somebody else to add, especially someone who is (or wants to become) familiar with xarray's existing to_zarr implementation.

@TomNicholas
Copy link
Contributor Author

I'm going to use zarr-developers/VirtualiZarr#21 to track the append_dim idea in VirtualiZarr.

Separately I still think that being able to pull virtual references back out would be useful. Imagine for example someone makes a public icechunk store containing virtual references, then someone from a different organisation wants to refer to those references - they would need to pull them out. (Whether or not people creating catalogs like this is a good idea is another question, but I bet people will ask for that.)

@TomNicholas
Copy link
Contributor Author

@rabernat just said in the Pangeo showcase that icechunk is not yet at a state where it guarantees that references written with the current spec will be readable even with future versions of this library. To me that makes the feature of reading virtual references back out into VirtualiZarr actually quite important to have sooner rather than later. If that existed, then if the icechunk store changed on disk, I could always roll back to a version of the icechunk library that could read the references in order to suck them out then re-save them in the new format.

@mpiannucci
Copy link
Contributor

mpiannucci commented Oct 23, 2024

Can you propose what the ideal API for this would be? It gets confusing for me thinking about how there can be mixed virtual and native chunks

@TomNicholas
Copy link
Contributor Author

It gets confusing for me thinking about how there can be mixed virtual and native chunks

Yeah good point. All variables in an icechunk store are potentially available as actual loaded data, or as references, so I would say that there should be a new virtualizarr reader, such that

vds = open_virtual_dataset(icechunkstore, loadable_variables=...)

returns an xr.Dataset containing (lazy) numpy arrays for the loadable_variables, and ManifestArrays for all the other variables.

I thought maybe we could imagine this function returning virtual vs native variables based on how they were stored in icechunk, but IIUC that's actually not possible right now because a single zarr array could be backed by some virtual and some native chunks, which VirtualiZarr has no way to represent in one array.

The API needed for icechunk-python at its most basic is just .get_virtual_ref, the inverse of .set_virtual_ref. Again it would be more efficient to get back a whole ManifestArray of chunk references in one call but that's just a performance optimization.

@mpiannucci
Copy link
Contributor

Cool. Would it return None if the chunk is native? Or throw an error?

@paraseba
Copy link
Contributor

I wanted to mention that having mixed virtual and non virtual chunks in a single array, is exactly what we want to enable: people starting from a virtual dataset and, with time, update the dataset in Icechunk (which would generate non-vrtual chunks). So we expect this case to be very common.

@paraseba
Copy link
Contributor

I support adding get_virtual_ref, but it could be more useful to have get_chunk_ref that gives you the whole information, including the fact that it could be Virtual, Inline or "Normal". Then the client code can filter as needed. For reference, this is what get_chunk_ref returns in Rust:

pub enum ChunkPayload {
    Inline(Bytes),
    Virtual(VirtualChunkRef),
    Ref(ChunkRef),
}

@TomNicholas
Copy link
Contributor Author

Would it return None if the chunk is native?

No, it would still return the virtual reference to that chunk, the reference would just refer to deep within this icechunk store. IIUC icechunk chunks are immutable once written, so referring to a chunk in an icechunk store from elsewhere is exactly as reliable as referring to a chunk in a traditional zarr store / legacy-formatted file.

I support adding get_virtual_ref but it could be more useful to have get_chunk_ref that gives you the whole information, including the fact that it could be Virtual, Inline or "Normal".

That would be fine for VirtualiZarr too (and maybe more useful for reasons I can't think of yet).

@paraseba
Copy link
Contributor

Inline chunks I think are the biggest issue here. As @TomNicholas says, "normal" chunks can also be seen as virtual chunks, pointing to the icechunk store. But inline ones live only in the manifest. We could provide a way to reference them via an offset/length, but it would be pretty tricky and ugly.

@TomNicholas
Copy link
Contributor Author

Right. Should arrays containing inlined chunks always be directly loaded then? I suppose VirtualiZarr might have to raise a warning if that happened for variables that were supposed to be virtual... Also to do that VirtualiZarr would need a way to easily ask Icechunk if an array contained any inlines chunks. Tbh this makes inlined chunks seem like a leaky abstraction.

@paraseba
Copy link
Contributor

paraseba commented Oct 23, 2024

Tbh this makes inlined chunks seem like a leaky abstraction.

They are, I don't love them, they are leaky and probably a premature optimization.

We could make IC not use inline chunks unless the user overrides a config, I suppose.

@mpiannucci
Copy link
Contributor

Inline chunks don't matter to virtualizarr at all. To virtualizarr it's the same as a native chunk

@TomNicholas
Copy link
Contributor Author

TomNicholas commented Oct 23, 2024

Inline chunks don't matter to virtualizarr at all. To virtualizarr it's the same as a native chunk

Except we're talking about here having native chunks also be extracted as virtual references, which @paraseba is saying is harder to provide for inclined chunks than for native chunks.

@TomNicholas
Copy link
Contributor Author

Except we're talking about here having native chunks also be extracted as virtual references, which @paraseba is saying is harder to provide for inclined chunks than for native chunks.

They are, I don't love them, they are leaky and probably a premature optimization.

It's unclear to me that they are necessary at all. Why can't you just "inline" a variable by storing it as a single native chunk? That requires an extra get request I guess? Or because then you can't cache it in the rust client?

@paraseba
Copy link
Contributor

That requires an extra get request I guess?

This. They are useful for things like coordinate arrays.

@mpiannucci
Copy link
Contributor

Except we're talking about here having native chunks also be extracted as virtual references, which @paraseba is saying is harder to provide for inclined chunks than for native chunks.

Got itttt sorry I misunderstood :)

@TomNicholas TomNicholas added the virtual references 👻 Involves virtual kerchunk/virtualizarr chunk references label Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
virtual references 👻 Involves virtual kerchunk/virtualizarr chunk references
Projects
None yet
Development

No branches or pull requests

5 participants